Data Engineer with 5+ years of experience designing and developing enterprise-level, low-latency, fault-tolerant data platforms. Skilled in building distributed streaming data pipelines and analytics data stores using Spark, Flink, and Kafka. Proficient in Python, SQL, Scala, and cloud platforms (AWS, Azure), with a strong focus on CI/CD practices, security, and collaboration in agile environments. Experienced in supporting scalable cloud-based solutions and optimizing data processing for high-quality insights. Strong background in developing data pipelines, optimizing data storage and processing, and analyzing large-scale distributed systems. Applies SDLC methodologies and agile practices to deliver robust solutions for complex data engineering challenges. Adept with visualization tools and experienced in cloud-based data architectures, ensuring efficient data management and analysis across diverse technological environments.
Overview
5 years of professional experience
Work History
Data Engineer
Elevance
12.2023 - Current
Led migration from on-premises infrastructure to AWS, implementing ETL processes with AWS Glue, managing S3 buckets, and optimizing data workflows using Apache Airflow and EMR clusters
Developed and maintained distributed streaming data pipelines using Spark Streaming, integrating with Kafka to handle real-time data in a fault-tolerant environment (see the streaming sketch at the end of this role)
Collaborated with security and infrastructure teams to adhere to application resiliency standards and ensure compliance
Built analytics data stores and contributed to CI/CD practices, enhancing scalability and operational performance on AWS
Designed and developed robust data pipelines using Spark, Scala, and PySpark, handling diverse data formats (Avro, Parquet, JSON) and integrating with Kafka for real-time streaming
Implemented advanced analytics solutions, including AI algorithms and Machine Learning models, to optimize data processing and improve data accuracy in large-scale enterprise environments
Established a comprehensive metadata management framework using Collibra, enhancing data lineage tracking and compliance monitoring across the organization
Orchestrated end-to-end data solutions, from ingestion to visualization, utilizing technologies such as Hadoop, Snowflake, AWS Redshift, and Tableau, while ensuring data quality through custom validation scripts
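A minimal sketch of the kind of Kafka-to-S3 streaming pipeline described above, written with Spark Structured Streaming in Scala; the broker address, topic, bucket paths, and JSON schema fields are illustrative assumptions rather than details from this role:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object ClaimsStreamJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-claims-stream")
          .getOrCreate()

        // Hypothetical schema for the incoming JSON events
        val schema = new StructType()
          .add("event_id", StringType)
          .add("member_id", StringType)
          .add("amount", DoubleType)
          .add("event_ts", TimestampType)

        // Read from Kafka (requires the spark-sql-kafka connector); broker and topic are placeholders
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "claims-events")
          .load()

        // Parse the binary Kafka value column into typed fields
        val parsed = raw
          .select(from_json(col("value").cast("string"), schema).as("e"))
          .select("e.*")

        // Checkpointing makes the query restartable and fault-tolerant
        val query = parsed.writeStream
          .format("parquet")
          .option("path", "s3://example-bucket/claims/")
          .option("checkpointLocation", "s3://example-bucket/checkpoints/claims/")
          .start()

        query.awaitTermination()
      }
    }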
Data Engineer
US Bank
11.2022 - 12.2023
Developed and implemented CI/CD pipelines using AWS CodePipeline, AWS Glue, and Databricks on AWS for big data solutions
Developed real-time data processing applications using Scala, Python, and Apache Spark Streaming, integrating with various sources like Kafka and JMS for efficient data handling
Implemented and optimized ETL processes using PySpark, Hive, and Spark SQL, creating data frames and performing complex transformations to meet business requirements (see the ETL sketch at the end of this role)
Built distributed data computing systems with Spark and Kafka, developing real-time streaming applications and ensuring application resiliency
Partnered with agile teams to implement CI/CD and application scaling solutions across testing and production environments on AWS
Engaged in code reviews, automated testing, and performance tuning for robust, low-latency data solutions
Utilized AWS services including S3, EC2, and CloudFormation for scalable infrastructure, while implementing infrastructure as code (IaC) for automated environment deployment and testing
Created and optimized Spark clusters using Azure Databricks to accelerate high-quality data preparation, developing Spark applications in Scala for seamless Hadoop transitions
Designed and implemented comprehensive testing frameworks to validate data integrity, application performance, and security compliance across cloud migrations and data processing pipelines
Migrated data analytics platforms from Azure to AWS, designing migration strategies for large-scale data warehouses and establishing secure data pipelines while optimizing costs and performance
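A minimal sketch of the batch ETL pattern described above (data frames plus Spark SQL transformations), shown in Scala rather than PySpark for consistency with the other sketches; the source paths, table names, and columns are illustrative assumptions, and reading Avro requires the spark-avro package:

    import org.apache.spark.sql.SparkSession

    object TransactionsEtl {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("transactions-etl").getOrCreate()

        // Source paths, schemas, and column names are placeholders
        val txns = spark.read.parquet("s3://example-bucket/raw/transactions/")
        val accounts = spark.read.format("avro").load("s3://example-bucket/raw/accounts/")

        // Register temporary views so the transformation can be written in Spark SQL
        txns.createOrReplaceTempView("txns")
        accounts.createOrReplaceTempView("accounts")

        val enriched = spark.sql(
          """SELECT t.txn_id, t.account_id, a.segment, t.amount,
            |       to_date(t.txn_ts) AS txn_date
            |FROM txns t
            |JOIN accounts a ON t.account_id = a.account_id
            |WHERE t.amount IS NOT NULL""".stripMargin)

        // Partitioned Parquet output for downstream analytics consumers
        enriched.write
          .mode("overwrite")
          .partitionBy("txn_date")
          .parquet("s3://example-bucket/curated/transactions/")

        spark.stop()
      }
    }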
Data Engineer
Allstate Solutions Private Limited
05.2020 - 06.2022
Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala (see the RDD sketch at the end of this role)
Developed multiple POCs using Scala and deployed them on the YARN cluster; compared the performance of Spark with Hive and SQL/Teradata
Wrote Hive queries for data analysis to meet business requirements
Constructed and optimized ETL workflows for distributed data systems, utilizing Spark and NoSQL (DynamoDB) for large-scale data processing
Designed and implemented data warehousing solutions using Redshift and Snowflake for scalable data insights
Collaborated on agile development cycles, focusing on application performance, scalability, and security compliance
Automated jobs that pull data from an FTP server and load it into Hive tables using Oozie workflows
Created Hive tables, worked with them using HiveQL, and performed data analysis using Hive and Pig
Created data pipelines with copy, move, and transform activities, using custom Azure Data Factory pipeline activities for in-cloud ETL processing
Used Azure Data Lake Analytics, Azure Data Lake Storage, Azure Data Factory, Azure SQL Database, and Azure SQL Data Warehouse to deliver analytics and reports that improved marketing strategies
Contributed to all phases of the project's reference-data approach to MDM, creating a data dictionary and mapping sources to targets in the MDM data model
Defined UDFs using Pig and Hive to capture customer behavior
Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig
Created Hive external tables on the MapReduce output before applying partitioning and bucketing
Worked with ETL tools including Talend Data Integration, Talend Big Data, Pentaho Data Integration, and Informatica
Wrote Databricks code and fully parameterized ADF pipelines for efficient code management
Used the Oozie scheduler to automate pipeline workflows and orchestrate MapReduce extraction jobs
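A minimal sketch of converting a Hive aggregate query into an equivalent Spark RDD transformation in Scala, as described above; the input path, column position, and output path are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    object ClaimCountsRdd {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("claim-counts-rdd").getOrCreate()
        val sc = spark.sparkContext

        // Equivalent Hive query:
        //   SELECT policy_type, COUNT(*) FROM claims GROUP BY policy_type
        val claims = sc.textFile("hdfs:///data/claims/*.csv")   // placeholder input path

        val counts = claims
          .map(_.split(","))
          .filter(_.length > 3)                // drop malformed rows
          .map(fields => (fields(3), 1L))      // assumes policy_type is the 4th column
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///output/claim_counts")    // placeholder output path
        spark.stop()
      }
    }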