Ashish

Denton, TX

Summary

Dynamic and motivated IT professional with around 12 years of experience as a Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, data warehouse / data mart, data visualization, reporting, and data quality solutions.

Overview

12 years of professional experience
1 Certification

Work History

Senior Data Engineer

HCL Tech
08.2023 - Current
  • Developed real-time data processing applications in Scala and Python, implementing Apache Spark Streaming over sources such as Kafka and JMS (an illustrative sketch follows this role's bullet list)
  • Used PySpark and Hive to analyze sensor data and cluster users based on their behavior during events
  • Designed, developed, and maintained end-to-end data solutions on Azure, utilizing Azure Data Factory, Azure Databricks, and Azure Synapse Analytics
  • Implemented robust ETL processes to move and transform data from various sources into Azure, ensuring data accuracy and integrity
  • Conducted data quality testing, implementing automated checks to identify and rectify anomalies, resulting in a reduction in data-related issues
  • Leveraged unit testing and integration testing frameworks to validate the functionality and reliability of data pipelines, ensuring accurate data movement and transformation
  • Developed and executed comprehensive unit and integration tests for ETL processes, ensuring accurate data movement and transformation and minimizing data-related issues
  • Conducted performance testing on data pipelines, optimizing data processing time through timely performance enhancements
  • Implemented complex data transformations and business logic within Matillion, ensuring data quality and consistency
  • Designed and implemented Matillion jobs to process and load high-volume data streams from external APIs in real-time
  • Maintained and monitored Matillion instances, ensuring system availability and optimal performance
  • Enhanced data security by implementing encryption and access control policies in Matillion
  • Leveraged Python scripting to create automated testing scripts, streamlining data validation procedures and increasing testing efficiency
  • Tested and validated scalability and resilience of data solutions under varying workloads, contributing to the development of reliable and high-performing data architectures
  • Created BigQuery authorized views for row-level security or exposing the data to other teams
  • Wrote live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system
  • Experienced in writing Spark Applications in Scala and Python
  • Developed Spark applications in a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables
  • Built and architected multiple end-to-end ETL data pipelines for data ingestion and transformation in GCP, coordinating tasks among the team
  • Orchestrated complex data workflows in Matillion by integrating with orchestration tools such as Apache Airflow
  • Implemented data security measures and compliance protocols in alignment with organizational policies and regulatory requirements
  • Developed interactive and visually appealing dashboards using Power Apps to provide real-time insights and data visualization for business stakeholders
  • Monitored and optimized system performance post-migration, leveraging Google Cloud Monitoring and logging tools to ensure ongoing operational efficiency
  • Documented migration processes and lessons learned to support future data migration initiatives and facilitate knowledge sharing within the organization
  • Coordinated with the team to develop a framework for generating daily ad hoc reports and extracts from enterprise data in BigQuery
  • Developed custom aggregate functions using Spark SQL and performed interactive querying
  • Applied BigQuery, Cloud Functions, and GCP Dataproc alongside HDFS, MapReduce, Kafka, Spark, HBase, and Hive UDFs to analyze large, business-critical datasets, and used the Scala Kafka Consumer API to read data from Kafka topics
  • Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines
  • Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation with PySpark
  • Used AWS Athena to query inbound files directly on the S3 bucket, supporting business data analysis
  • Ran data formatting scripts in Java and created terabyte-scale CSV files to be consumed by Hadoop MapReduce jobs
  • Implemented a Kafka model that pulls the latest records into Hive external tables
  • Loaded all datasets from source CSV files into Hive and Cassandra using Spark/PySpark
  • Developed Cloud Functions in Python to process source JSON files and load them into BigQuery
  • Performed ETL, data integration, and migration by writing Pig scripts
  • Proficient in designing and implementing automated testing strategies tailored for cloud-based applications and services, ensuring comprehensive test coverage and scalability
  • Skilled in integrating automated tests into CI/CD pipelines, enabling rapid and reliable delivery of software updates to cloud environments
  • Proficient in testing applications deployed within containers (Docker) and orchestration platforms (Kubernetes) to ensure consistent behavior across environments
  • Collaborated with cross-functional teams to design interactive dashboards using Power BI, visualizing Synapse Analytics data for key stakeholders
  • Exported the analyzed data to Teradata using Sqoop for visualization and to generate reports for the BI team
  • Migrated the computational code in HQL to PySpark
  • Capable of establishing key performance indicators (KPIs) and metrics to track and evaluate progress toward achieving key results, facilitating data-driven decision-making
  • Imported data into HDFS from various SQL databases and files using Sqoop & from streaming systems using Storm into Big Data Lake
  • Worked on downloading BigQuery data into pandas or Spark data frames for advanced ETL capabilities
  • Completed data extraction, aggregation, and analysis in HDFS using PySpark, storing the required data in Hive
  • Used Apache Kafka to develop a data pipeline of logs as a stream of messages with producers and consumers
  • Sound knowledge in programming Spark using Scala
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka
  • Created and troubleshot connections between Cognos BI and Amazon Athena
  • Wrote Pig Scripts for sorting, joining, filtering, and grouping the data
  • Built PIG and Hive UDFs in Java for enhanced use of PIG and Hive
  • Scheduled automatic nightly builds with Jenkins and automated builds with Maven
  • Developed a Jenkins workflow to deploy every microservice build to the Docker registry
  • Involved in writing test cases and implementing test classes using MRUnit and mocking frameworks
  • Experienced in working with various kinds of data sources such as Teradata and Oracle
  • Loaded files from Teradata into HDFS and from HDFS into Hive and Impala
  • Experienced in analyzing source data to identify patterns, anomalies, and potential issues, ensuring high-quality test scenarios and accurate testing outcomes
  • Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between the raw source files and BigQuery tables
  • Involved in loading data from Linux file systems, servers, and web services using Kafka producers and partitions
  • Used Spark SQL with Scala for creating data frames and performed transformations on data frames
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala
  • Experienced in optimizing Hive queries and joins to handle different data sets
  • Developed Shell, Perl, and Python scripts to automate and provide Control flow to Pig scripts
  • Developed Spark applications using Scala for easy Hadoop transitions and optimized the code using PySpark for better performance
  • Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources
  • Designed and implemented MongoDB and associated RESTful web service
  • Developed a plan and moved databases to the AWS cloud from on-premises locations effectively
  • Designed and implemented complex data analytics solutions using Google BigQuery, harnessing its capabilities to process and analyze large datasets
  • Developed optimized SQL queries and data transformations within BigQuery, enabling efficient extraction of insights and facilitating data-driven decision-making
  • Built a Splunk dashboard to capture the logs for the entire data import process
  • Created unit test plans for the Spark code used in the CI/CD process
  • Loaded JSON data into SparkSQL, created a Schema RDD, and loaded it into Hive tables while handling structured data with SparkSQL
  • Created ETL jobs to load server data and data from other sources into buckets and to move S3 data into the data warehouse
  • Used Talend tool to create workflows for processing data from multiple source systems
  • Produced reports and dashboards using both structured and unstructured data
  • Developed controllers that perform data transformation tasks in accordance with business needs
  • Configured Spark jobs for continuous integration and deployment to EMR clusters
  • Scheduled Spark tasks and applications in the AWS EMR cluster
  • Created an ETL pipeline to use in extracting historical logs from various sources and storing them in an S3 data lake
  • Environment: HDFS, Hive, Sqoop, Pig, Oozie, Cassandra, MySQL, AlloyDB, Kafka, Spark, Redshift, Amazon S3, Snowflake, Scala, Cloudera Manager (CDH4), GCP, SNS, Glue, Python, Git, data modeling and analysis, ETL, Azure Databricks, Azure Data Factory, Selenium, Bash, Postman, Docker, SVN, Kubernetes, Matillion, Impala, Teradata, Kibana, PowerShell scripting, Azure SQL Data Warehouse, Data Lake, Power BI, Hadoop, HBase, SSIS, T-SQL, Jenkins.
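
An illustrative, non-authoritative sketch of the Spark Streaming work referenced above: a minimal PySpark Structured Streaming job that reads sensor events from a Kafka topic, parses the JSON payload, and appends the records to storage. The broker address, topic, schema fields, and paths are hypothetical placeholders, not details taken from these projects.

    # Minimal sketch: Kafka -> PySpark Structured Streaming -> file sink.
    # All broker, topic, schema, and path names below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("sensor-stream").getOrCreate()

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("reading", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "sensor-events")                # placeholder topic
           .load())

    # Kafka delivers key/value bytes; parse the JSON value into typed columns
    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), schema).alias("e"))
              .select("e.*"))

    (parsed.writeStream
     .format("parquet")                                         # ORC/Hive or Delta in practice
     .option("path", "/data/sensor_events")                     # placeholder output path
     .option("checkpointLocation", "/chk/sensor_events")
     .outputMode("append")
     .start()
     .awaitTermination())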

Data Engineer

Republic Services
03.2022 - 07.2023
  • Created and managed various types of Snowflake tables, including transient, temporary, and persistent tables, to cater to specific data storage and processing needs
  • Implemented advanced partitioning techniques in Snowflake to significantly enhance query performance and expedite data retrieval
  • Defined robust roles and access privileges within Snowflake to enforce strict data security and governance protocols
  • Designed solutions for high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem tools: Hive, Pig, Sqoop, Kafka, Python, Spark, Scala, NoSQL, NiFi, and Druid
  • Implemented regular expressions in Snowflake for seamless pattern matching and data extraction tasks
  • Developed and implemented Snowflake scripting solutions to automate critical data pipelines, ETL processes, and data transformations
  • Developed and optimized ETL workflows using AWS Glue to extract, transform, and load data from diverse sources into Redshift for efficient data processing (a minimal sketch follows this role's environment line)
  • Configured and fine-tuned Redshift clusters to achieve high-performance data processing and streamlined querying
  • Integrated AWS SNS and SQS to enable real-time event processing and efficient messaging
  • Implemented AWS Athena for ad-hoc data analysis and querying on data stored in AWS S3
  • Designed and implemented data streaming solutions using AWS Kinesis, enabling real-time data processing and analysis
  • Designed and developed Flink pipelines to consume streaming data from Kafka, applying business logic to massage, transform, and serialize the raw data
  • Effectively managed DNS configurations and routing using AWS Route53, ensuring efficient deployment of applications and services
  • Implemented robust IAM policies and roles to ensure secure user access and permissions for AWS resources
  • Developed and optimized data processing pipelines using Hadoop ecosystem technologies such as HDFS, Sqoop, Hive, MapReduce, and Spark
  • Implemented Spark Streaming for real-time data processing and advanced analytics
  • Demonstrated expertise in scheduling and job automation using IBM Tivoli, Control-M, Oozie, and Airflow, for execution of data processing and ETL pipelines
  • Designed and developed database solutions using Teradata, Oracle, and SQL Server, including schema design and optimization, stored procedures, triggers, and cursors
  • Proficient in utilizing version control systems such as Git, GitLab, and VSS for efficient code repository management and collaborative development processes
  • Environment: AWS, AWS S3, Redshift, EMR, SNS, SQS, Athena, Glue, CloudWatch, Kinesis, Route 53, IAM, Sqoop, MySQL, HDFS, Apache Spark, Hive, Cloudera, Kafka, Zookeeper, Oozie, PySpark, Ambari, JIRA, IBM Tivoli, Control-M, Flink, Druid, Airflow, Teradata, Oracle, SQL
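
An illustrative sketch of the AWS Glue-to-Redshift ETL described above (not the production job): a minimal Glue PySpark script that reads a cataloged S3 source, applies a column mapping, and writes to Redshift through a Glue connection. The catalog database, table, connection, and bucket names are hypothetical placeholders.

    # Minimal AWS Glue job sketch: Data Catalog source -> mapping -> Redshift.
    # Catalog, connection, and bucket names below are placeholders.
    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the raw dataset registered in the Glue Data Catalog
    source = glueContext.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="orders")

    # Keep and type only the columns the warehouse needs
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("amount", "double", "amount", "double")])

    # Load into Redshift via a pre-defined Glue connection
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "analytics.orders", "database": "dev"},
        redshift_tmp_dir="s3://example-temp-bucket/redshift/")

    job.commit()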

Data Engineer

Fidelity Investments
04.2020 - 02.2022
  • Developed upgrade and downgrade scripts in SQL that filter corrupted records with missing values along with identifying unique records based on different criteria
  • Installed, configured, and managed Microsoft SQL Server instances, including SQL Server Express, Standard, and Enterprise editions
  • Configured and managed SSIS server settings for optimal performance
  • Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL Database
  • Migrated data from traditional database systems to Azure databases
  • Designed and implemented migration strategies for traditional systems moving to Azure (lift and shift, Azure Migrate, and other third-party tools)
  • Implemented DWH/BI projects using Azure Data Factory
  • Interacted with business analysts, users, and SMEs to elaborate requirements
  • Automated data ingestion into the Lakehouse, transformed the data using Apache Spark, and stored it in Delta Lake
  • Ensured data quality and integrity of the data using Azure SQL Database and automated ETL deployment and operationalization
  • Used Databricks, Scala, and Spark for creating the data workflows and capturing the data from Delta tables in Delta Lakes
  • Performed Streaming of pipelines using Azure Event Hubs and Stream Analytics to analyze the data from the data-driven workflows
  • Wrote Databricks code and fully parameterized ADF pipelines for efficient code management
  • Generated ad hoc and scheduled reports using SQL Server Reporting Services (SSRS)
  • Developed and implemented ETL pipelines according to the DWH design and architecture (Azure Synapse, ADLS Gen2, Databricks, Azure DevOps)
  • Worked with Delta Lake for consistent unification of streaming data, processed the data, and handled ACID transactions using Apache Spark
  • Set up and maintained Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory
  • Cleansed and transformed source data within SSIS packages to meet business requirements
  • Implemented Copy activities and custom Azure Data Factory pipeline activities
  • Primarily involved in data migration using SQL, Azure SQL, Azure Storage, Azure Data Factory, SSIS, and PowerShell
  • Created C# applications to load data from Azure Blob Storage and web APIs into Azure SQL, and scheduled WebJobs for daily loads
  • Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment
  • Delivered DWH/BI project implementations using Azure Data Factory and Databricks
  • Architected, designed, and validated the Azure Infrastructure-as-a-Service (IaaS) environment
  • Developed dashboards and visualizations to help business users analyze data and provide insight to upper management, with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI
  • Developed Python scripts to run file validations in Databricks and automated the process using ADF (an illustrative sketch follows this role's bullet list)
  • Developed an automated process in the Azure cloud that ingests data daily from a web service and loads it into Azure SQL DB
  • Developed streaming pipelines using Azure Event Hubs and Stream Analytics to analyze dealer efficiency and open-table counts from data coming in from IoT-enabled poker and other pit tables
  • Analyzed data in place by mounting Azure Data Lake and Blob Storage to Databricks
  • Implemented complex business logic through T-SQL stored procedures, Functions, Views, and advanced query concepts
  • Worked with enterprise Data Modeling team on creation of Logical models
  • Managed budgets and resources for SQL Server projects.
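
An illustrative sketch of the Databricks file-validation and Delta Lake steps described above (assumed names, not the actual pipeline): read a CSV landed in a mounted ADLS path, drop rows missing a key column, and append the result to a Delta table. The mount point, column, and table names are hypothetical placeholders.

    # Minimal Databricks-style validation sketch: mounted ADLS CSV -> Delta table.
    # Paths, columns, and table names below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    raw = (spark.read
           .option("header", "true")
           .csv("/mnt/raw/trades/"))                 # placeholder mounted ADLS path

    # Simple completeness check: reject rows without a trade identifier
    validated = raw.filter(col("trade_id").isNotNull())

    (validated.write
     .format("delta")
     .mode("append")
     .saveAsTable("curated.trades"))                 # placeholder Delta table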

Data Analyst

DKRIN Pvt Ltd
09.2012 - 06.2015
  • Ran SQL operations on JSON, converted the data into a tabular structure with data frames, and stored and wrote the data to Hive and HDFS
  • Developed shell scripts for data ingestion and validation with different parameters, as well as custom shell scripts to invoke Spark jobs
  • Tuned performance of Informatica mappings and sessions to improve the process and make it efficient after eliminating bottlenecks
  • Worked on complex SQL Queries, PL/SQL procedures and converted them to ETL tasks
  • Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks
  • Created a risk-based machine learning model (logistic regression, random forest, SVM, etc.) to predict which customers are more likely to be delinquent based on historical performance data, and rank-ordered them (an illustrative sketch follows this role's bullets)
  • Evaluated model output using the confusion matrix (precision, recall) as well as Teradata resources and utilities (BTEQ, FastLoad, MultiLoad, FastExport, and TPump)
  • Developed a monthly report using Python to code customers' payment results and make suggestions to the manager.
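
An illustrative sketch of the delinquency-risk modeling described above (the input file and feature columns are hypothetical placeholders): train a logistic regression on historical performance data, check precision and recall, and rank-order customers by predicted risk.

    # Minimal delinquency-scoring sketch; file and column names are placeholders.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("historical_performance.csv")              # placeholder input
    X = df[["utilization", "days_past_due", "tenure_months"]]   # placeholder features
    y = df["delinquent"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    pred = model.predict(X_test)
    print("precision:", precision_score(y_test, pred))
    print("recall:", recall_score(y_test, pred))

    # Rank-order customers by predicted probability of delinquency
    df["risk_score"] = model.predict_proba(X)[:, 1]
    ranked = df.sort_values("risk_score", ascending=False)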

Skills

  • Hadoop
  • MapReduce
  • HDFS
  • Sqoop
  • PIG
  • Hive
  • HBase
  • Oozie
  • Flume
  • NiFi
  • Kafka
  • Zookeeper
  • Yarn
  • Apache Flink
  • Apache Spark
  • Mahout
  • Spark MLlib
  • Apache Druid
  • Oracle
  • MySQL
  • SQL Server
  • MongoDB
  • Cassandra
  • DynamoDB
  • PostgreSQL
  • Teradata
  • Java
  • Python
  • PySpark
  • Scala
  • Shell script
  • Perl script
  • SQL
  • GCP
  • AWS
  • Microsoft Azure
  • PyCharm
  • Eclipse
  • Visual Studio
  • SQL*Plus
  • SQL Developer
  • TOAD
  • SQL Navigator
  • Query Analyzer
  • SQL Server Management Studio
  • SQL Assistant
  • Postman
  • SVN
  • Git
  • GitHub
  • Windows 7/8/XP/2008/2012
  • Ubuntu Linux
  • MacOS
  • Kerberos
  • Dimension Modeling
  • ER Modeling
  • Star Schema Modeling
  • Snowflake Modeling
  • Control-M
  • Grafana

Certification

  • Microsoft Certified Azure Solutions Architect
  • Databricks Certified Data Engineer Professional
  • AWS Certified Data Engineer
  • IBM Certified Application Developer

Timeline

Senior Data Engineer

HCL Tech
08.2023 - Current

Data Engineer

Republic Services
03.2022 - 07.2023

Data Engineer

Fidelity Investments
04.2020 - 02.2022

Data Analyst

DKRIN Pvt Ltd
09.2012 - 06.2015