Ashish

Denton, TX

Summary

Dynamic and motivated IT professional with around 12 years of experience as a Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, data warehouse / data mart, data visualization, reporting, and data quality solutions.

Overview

12 years of professional experience
1 Certification

Work History

Senior Data Engineer

HCL Tech
08.2023 - Current
  • Developed real-time data processing applications in Scala and Python, implementing Apache Spark Streaming over sources such as Kafka and JMS (an illustrative sketch follows this role's bullet list)
  • Used PySpark and Hive to analyze sensor data and cluster users based on their behavior during events
  • Designed, developed, and maintained end-to-end data solutions on Azure, utilizing Azure Data Factory, Azure Databricks, and Azure Synapse Analytics
  • Implemented robust ETL processes to move and transform data from various sources into Azure, ensuring data accuracy and integrity
  • Conducted data quality testing, implementing automated checks to identify and rectify anomalies, resulting in a reduction in data-related issues
  • Leveraged unit testing and integration testing frameworks to validate the functionality and reliability of data pipelines, ensuring accurate data movement and transformation
  • Developed and executed comprehensive unit and integration tests for ETL processes, ensuring accurate data movement and transformation and minimizing data-related issues
  • Conducted performance testing on data pipelines, optimizing data processing time through timely performance enhancements
  • Implemented complex data transformations and business logic within Matillion, ensuring data quality and consistency
  • Designed and implemented Matillion jobs to process and load high-volume data streams from external APIs in real-time
  • Maintained and monitored Matillion instances, ensuring system availability and optimal performance
  • Enhanced data security by implementing encryption and access control policies in Matillion
  • Leveraged Python scripting to create automated testing scripts, streamlining data validation procedures and increasing testing efficiency
  • Tested and validated scalability and resilience of data solutions under varying workloads, contributing to the development of reliable and high-performing data architectures
  • Created BigQuery authorized views for row-level security or exposing the data to other teams
  • Wrote live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system
  • Experienced in writing Spark Applications in Scala and Python
  • Developed Spark applications in a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables
  • Built and architected multiple end-to-end ETL data pipelines for data ingestion and transformation in GCP, coordinating tasks among the team
  • Orchestrated complex data workflows in Matillion by integrating with orchestration tools such as Apache Airflow
  • Implemented data security measures and compliance protocols in alignment with organizational policies and regulatory requirements
  • Developed interactive and visually appealing dashboards using Power Apps to provide real-time insights and data visualization for business stakeholders
  • Monitored and optimized system performance post-migration, leveraging Google Cloud Monitoring and logging tools to ensure ongoing operational efficiency
  • Documented migration processes and lessons learned to support future data migration initiatives and facilitate knowledge sharing within the organization
  • Coordinated with the team to develop a framework for generating daily ad hoc reports and extracts from enterprise data in BigQuery
  • Developed custom aggregate functions using Spark SQL and performed interactive querying
  • Applied BigQuery, Cloud Functions, and GCP Dataproc alongside HDFS, MapReduce, Kafka, Spark, HBase, and Hive UDFs to analyze large, business-critical datasets, and used the Scala Kafka Consumer API to read data from Kafka topics
  • Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines
  • Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation with PySpark
  • Used AWS Athena to query inbound files directly on the S3 bucket, supporting business data analysis
  • Ran data formatting scripts in Java and created terabyte-scale CSV files to be consumed by Hadoop MapReduce jobs
  • Implemented a Kafka model that pulls the latest records into Hive external tables
  • Loaded all datasets from source CSV files into Hive and Cassandra using Spark/PySpark
  • Developed Cloud Functions in Python to process source JSON files and load them into BigQuery
  • Performed ETL, data integration, and migration by writing Pig scripts
  • Proficient in designing and implementing automated testing strategies tailored for cloud-based applications and services, ensuring comprehensive test coverage and scalability
  • Skilled in integrating automated tests into CI/CD pipelines, enabling rapid and reliable delivery of software updates to cloud environments
  • Proficient in testing applications deployed within containers (Docker) and orchestration platforms (Kubernetes) to ensure consistent behavior across environments
  • Collaborated with cross-functional teams to design interactive dashboards using Power BI, visualizing Synapse Analytics data for key stakeholders
  • Exported the analyzed data to Teradata using Sqoop for visualization and to generate reports for the BI team
  • Migrated the computational code in HQL to PySpark
  • Capable of establishing key performance indicators (KPIs) and metrics to track and evaluate progress toward achieving key results, facilitating data-driven decision-making
  • Imported data into HDFS from various SQL databases and files using Sqoop & from streaming systems using Storm into Big Data Lake
  • Worked on downloading BigQuery data into pandas or Spark data frames for advanced ETL capabilities
  • Completed data extraction, aggregation, and analysis in HDFS using PySpark, storing the required data in Hive
  • Used Apache Kafka to develop a data pipeline of logs as a stream of messages with producers and consumers
  • Sound knowledge in programming Spark using Scala
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka
  • Created and troubleshot connections between Cognos BI and Amazon Athena
  • Wrote Pig Scripts for sorting, joining, filtering, and grouping the data
  • Built PIG and Hive UDFs in Java for enhanced use of PIG and Hive
  • Scheduled automatic nightly builds with Jenkins and automated builds with Maven
  • Developed a Jenkins workflow to deploy every microservice build to the Docker registry
  • Involved in writing test cases and implementing test classes using MRUnit and mocking frameworks
  • Experienced in working with various kinds of data sources such as Teradata and Oracle
  • Loaded files from Teradata into HDFS and from HDFS into Hive and Impala
  • Experienced in analyzing source data to identify patterns, anomalies, and potential issues, ensuring high-quality test scenarios and accurate testing outcomes
  • Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between the raw source files and BigQuery tables
  • Involved in loading data from Linux file systems, servers, and web services using Kafka producers and partitions
  • Used Spark SQL with Scala for creating data frames and performed transformations on data frames
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala
  • Experienced in optimizing Hive queries and joins to handle different data sets
  • Developed Shell, Perl, and Python scripts to automate and provide Control flow to Pig scripts
  • Developed Spark applications using Scala for easy Hadoop transitions and optimized the code using PySpark for better performance
  • Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources
  • Designed and implemented MongoDB and associated RESTful web service
  • Developed a plan and moved databases to the AWS cloud from on-premises locations effectively
  • Designed and implemented complex data analytics solutions using Google BigQuery, harnessing its capabilities to process and analyze large datasets
  • Developed optimized SQL queries and data transformations within BigQuery, enabling efficient extraction of insights and facilitating data-driven decision-making
  • Built a Splunk dashboard to capture the logs for the entire data import process
  • Created unit test plans for the Spark code used in the CI/CD process
  • Loaded JSON data into SparkSQL, created a Schema RDD, and loaded it into Hive tables while handling structured data with SparkSQL
  • Created ETL jobs to load server data and data from other sources into buckets and to move S3 data into the data warehouse
  • Used Talend tool to create workflows for processing data from multiple source systems
  • Produced reports and dashboards using both structured and unstructured data
  • Developed controllers that perform data transformation tasks in accordance with business needs
  • Configured Spark jobs for continuous integration and deployment to EMR clusters
  • Scheduled Spark tasks and applications in the AWS EMR cluster
  • Created an ETL pipeline to use in extracting historical logs from various sources and storing them in an S3 data lake
  • Environment: HDFS, Hive, Sqoop, Pig, Oozie, Cassandra, MySQL, AlloyDB, Kafka, Spark, Redshift, Amazon S3, Snowflake, Scala, Cloudera Manager (CDH4), GCP, SNS, Glue, Python, Git, data modeling and analysis, ETL, Azure Databricks, Azure Data Factory, Selenium, Bash, Postman, Docker, SVN, Kubernetes, Matillion, Impala, Teradata, Kibana, PowerShell scripting, Azure SQL Data Warehouse, Data Lake, Power BI, Hadoop, HBase, SSIS, T-SQL, Jenkins.
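
An illustrative, non-authoritative sketch of the Spark Streaming work referenced above: a minimal PySpark Structured Streaming job that reads sensor events from a Kafka topic, parses the JSON payload, and appends the records to storage. The broker address, topic, schema fields, and paths are hypothetical placeholders, not details taken from these projects.

    # Minimal sketch: Kafka -> PySpark Structured Streaming -> file sink.
    # All broker, topic, schema, and path names below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("sensor-stream").getOrCreate()

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("reading", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "sensor-events")                # placeholder topic
           .load())

    # Kafka delivers key/value bytes; parse the JSON value into typed columns
    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), schema).alias("e"))
              .select("e.*"))

    (parsed.writeStream
     .format("parquet")                                         # ORC/Hive or Delta in practice
     .option("path", "/data/sensor_events")                     # placeholder output path
     .option("checkpointLocation", "/chk/sensor_events")
     .outputMode("append")
     .start()
     .awaitTermination())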

Data Engineer

Republic Services
03.2022 - 07.2023
  • Created and managed various types of Snowflake tables, including transient, temporary, and persistent tables, to cater to specific data storage and processing needs
  • Implemented advanced partitioning techniques in Snowflake to significantly enhance query performance and expedite data retrieval
  • Defined robust roles and access privileges within Snowflake to enforce strict data security and governance protocols
  • Designed solutions for high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem tools: Hive, Pig, Sqoop, Kafka, Python, Spark, Scala, NoSQL, NiFi, and Druid
  • Implemented regular expressions in Snowflake for seamless pattern matching and data extraction tasks
  • Developed and implemented Snowflake scripting solutions to automate critical data pipelines, ETL processes, and data transformations
  • Developed and optimized ETL workflows using AWS Glue to extract, transform, and load data from diverse sources into Redshift for efficient data processing (a minimal sketch follows this role's environment line)
  • Configured and fine-tuned Redshift clusters to achieve high-performance data processing and streamlined querying
  • Integrated AWS SNS and SQS to enable real-time event processing and efficient messaging
  • Implemented AWS Athena for ad-hoc data analysis and querying on data stored in AWS S3
  • Designed and implemented data streaming solutions using AWS Kinesis, enabling real-time data processing and analysis
  • Designed and developed Flink pipelines to consume streaming data from Kafka, applying business logic to massage, transform, and serialize the raw data
  • Effectively managed DNS configurations and routing using AWS Route53, ensuring efficient deployment of applications and services
  • Implemented robust IAM policies and roles to ensure secure user access and permissions for AWS resources
  • Developed and optimized data processing pipelines using Hadoop ecosystem technologies such as HDFS, Sqoop, Hive, MapReduce, and Spark
  • Implemented Spark Streaming for real-time data processing and advanced analytics
  • Demonstrated expertise in scheduling and job automation using IBM Tivoli, Control-M, Oozie, and Airflow, for execution of data processing and ETL pipelines
  • Designed and developed database solutions using Teradata, Oracle, and SQL Server, including schema design and optimization, stored procedures, triggers, and cursors
  • Proficient in utilizing version control systems such as Git, GitLab, and VSS for efficient code repository management and collaborative development processes
  • Environment: AWS, AWS S3, Redshift, EMR, SNS, SQS, Athena, Glue, CloudWatch, Kinesis, Route 53, IAM, Sqoop, MySQL, HDFS, Apache Spark, Hive, Cloudera, Kafka, Zookeeper, Oozie, PySpark, Ambari, JIRA, IBM Tivoli, Control-M, Flink, Druid, Airflow, Teradata, Oracle, SQL
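
An illustrative sketch of the AWS Glue-to-Redshift ETL described above (not the production job): a minimal Glue PySpark script that reads a cataloged S3 source, applies a column mapping, and writes to Redshift through a Glue connection. The catalog database, table, connection, and bucket names are hypothetical placeholders.

    # Minimal AWS Glue job sketch: Data Catalog source -> mapping -> Redshift.
    # Catalog, connection, and bucket names below are placeholders.
    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the raw dataset registered in the Glue Data Catalog
    source = glueContext.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="orders")

    # Keep and type only the columns the warehouse needs
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("amount", "double", "amount", "double")])

    # Load into Redshift via a pre-defined Glue connection
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "analytics.orders", "database": "dev"},
        redshift_tmp_dir="s3://example-temp-bucket/redshift/")

    job.commit()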

Data Engineer

Fidelity Investments
04.2020 - 02.2022
  • Developed upgrade and downgrade scripts in SQL that filter corrupted records with missing values along with identifying unique records based on different criteria
  • Installed, configured, and managed Microsoft SQL Server instances, including SQL Server Express, Standard, and Enterprise editions
  • Configured and managed SSIS server settings for optimal performance
  • Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL Database
  • Migrated data from traditional database systems to Azure databases
  • Designed and implemented migration strategies for traditional systems moving to Azure (lift and shift, Azure Migrate, and other third-party tools)
  • Implemented DWH/BI projects using Azure Data Factory
  • Interacted with business analysts, users, and SMEs to elaborate requirements
  • Automated data ingestion into the Lakehouse, transformed the data using Apache Spark, and stored it in Delta Lake
  • Ensured data quality and integrity of the data using Azure SQL Database and automated ETL deployment and operationalization
  • Used Databricks, Scala, and Spark for creating the data workflows and capturing the data from Delta tables in Delta Lakes
  • Performed Streaming of pipelines using Azure Event Hubs and Stream Analytics to analyze the data from the data-driven workflows
  • Wrote Databricks code and fully parameterized ADF pipelines for efficient code management
  • Generated ad hoc and scheduled reports using SQL Server Reporting Services (SSRS)
  • Developed and implemented ETL pipelines according to the DWH design and architecture (Azure Synapse, ADLS Gen2, Databricks, Azure DevOps)
  • Worked with Delta Lake for consistent unification of streaming data, processed the data, and handled ACID transactions using Apache Spark
  • Set up and maintained Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory
  • Cleansed and transformed source data within SSIS packages to meet business requirements
  • Implemented Copy activities and custom Azure Data Factory pipeline activities
  • Primarily involved in data migration using SQL, Azure SQL, Azure Storage, Azure Data Factory, SSIS, and PowerShell
  • Created C# applications to load data from Azure Blob Storage and web APIs into Azure SQL, and scheduled WebJobs for daily loads
  • Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment
  • Delivered DWH/BI project implementations using Azure Data Factory and Databricks
  • Architected, designed, and validated the Azure Infrastructure-as-a-Service (IaaS) environment
  • Developed dashboards and visualizations to help business users analyze data and provide insight to upper management, with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI
  • Developed Python scripts to run file validations in Databricks and automated the process using ADF (an illustrative sketch follows this role's bullet list)
  • Developed an automated process in the Azure cloud that ingests data daily from a web service and loads it into Azure SQL DB
  • Developed streaming pipelines using Azure Event Hubs and Stream Analytics to analyze dealer efficiency and open-table counts from data coming in from IoT-enabled poker and other pit tables
  • Analyzed data in place by mounting Azure Data Lake and Blob Storage to Databricks
  • Implemented complex business logic through T-SQL stored procedures, Functions, Views, and advanced query concepts
  • Worked with enterprise Data Modeling team on creation of Logical models
  • Managed budgets and resources for SQL Server projects.
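
An illustrative sketch of the Databricks file-validation and Delta Lake steps described above (assumed names, not the actual pipeline): read a CSV landed in a mounted ADLS path, drop rows missing a key column, and append the result to a Delta table. The mount point, column, and table names are hypothetical placeholders.

    # Minimal Databricks-style validation sketch: mounted ADLS CSV -> Delta table.
    # Paths, columns, and table names below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    raw = (spark.read
           .option("header", "true")
           .csv("/mnt/raw/trades/"))                 # placeholder mounted ADLS path

    # Simple completeness check: reject rows without a trade identifier
    validated = raw.filter(col("trade_id").isNotNull())

    (validated.write
     .format("delta")
     .mode("append")
     .saveAsTable("curated.trades"))                 # placeholder Delta table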

Data Analyst

DKRIN Pvt Ltd
09.2012 - 06.2015
  • Ran SQL operations on JSON, converted the data into a tabular structure with data frames, and stored and wrote the data to Hive and HDFS
  • Developed shell scripts for data ingestion and validation with different parameters, as well as custom shell scripts to invoke Spark jobs
  • Tuned performance of Informatica mappings and sessions to improve the process and make it efficient after eliminating bottlenecks
  • Worked on complex SQL Queries, PL/SQL procedures and converted them to ETL tasks
  • Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks
  • Created a risk-based machine learning model (logistic regression, random forest, SVM, etc.) to predict which customers are more likely to be delinquent based on historical performance data, and rank-ordered them (an illustrative sketch follows this role's bullets)
  • Evaluated model output using the confusion matrix (precision, recall) as well as Teradata resources and utilities (BTEQ, FastLoad, MultiLoad, FastExport, and TPump)
  • Developed a monthly report using Python to code customers' payment results and make suggestions to the manager.
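
An illustrative sketch of the delinquency-risk modeling described above (the input file and feature columns are hypothetical placeholders): train a logistic regression on historical performance data, check precision and recall, and rank-order customers by predicted risk.

    # Minimal delinquency-scoring sketch; file and column names are placeholders.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("historical_performance.csv")              # placeholder input
    X = df[["utilization", "days_past_due", "tenure_months"]]   # placeholder features
    y = df["delinquent"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    pred = model.predict(X_test)
    print("precision:", precision_score(y_test, pred))
    print("recall:", recall_score(y_test, pred))

    # Rank-order customers by predicted probability of delinquency
    df["risk_score"] = model.predict_proba(X)[:, 1]
    ranked = df.sort_values("risk_score", ascending=False)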

Skills

  • Hadoop
  • MapReduce
  • HDFS
  • Sqoop
  • PIG
  • Hive
  • HBase
  • Oozie
  • Flume
  • NiFi
  • Kafka
  • Zookeeper
  • Yarn
  • Apache Flink
  • Apache Spark
  • Mahout
  • Spark MLlib
  • Apache Druid
  • Oracle
  • MySQL
  • SQL Server
  • MongoDB
  • Cassandra
  • DynamoDB
  • PostgreSQL
  • Teradata
  • Java
  • Python
  • PySpark
  • Scala
  • Shell script
  • Perl script
  • SQL
  • GCP
  • AWS
  • Microsoft Azure
  • PyCharm
  • Eclipse
  • Visual Studio
  • SQL*Plus
  • SQL Developer
  • TOAD
  • SQL Navigator
  • Query Analyzer
  • SQL Server Management Studio
  • SQL Assistant
  • Postman
  • SVN
  • Git
  • GitHub
  • Windows 7/8/XP/2008/2012
  • Ubuntu Linux
  • MacOS
  • Kerberos
  • Dimension Modeling
  • ER Modeling
  • Star Schema Modeling
  • Snowflake Modeling
  • Control-M
  • Grafana

Certification

  • Microsoft Certified Azure Solutions Architect
  • Databricks Certified Data Engineer Professional
  • AWS Certified Data Engineer
  • IBM Certified Application Developer

Timeline

Senior Data Engineer

HCL Tech
08.2023 - Current

Data Engineer

Republic Services
03.2022 - 07.2023

Data Engineer

Fidelity Investments
04.2020 - 02.2022

Data Analyst

DKRIN Pvt Ltd
09.2012 - 06.2015