Highly skilled Data Engineer with 5+ years of experience designing, implementing, and optimizing data pipelines and workflows with modern Big Data technologies. Proficient in Hadoop ecosystem tools, cloud-based data solutions (AWS, Azure), and real-time data processing. Strong programming skills in Python, Scala, Java, and SQL, backed by a solid background in data warehousing, ETL, and analytics. Adept at applying Agile practices to deliver scalable, efficient data solutions while safeguarding data integrity and compliance across diverse technology environments.
Overview
5 years of professional experience
Work History
Data Engineer
Elevance
Indianapolis
12.2023 - Current
Migrated on-premises data solutions to AWS, implementing comprehensive ETL processes using AWS Glue, optimizing workflows with Apache Airflow, and managing large-scale datasets in S3.
Designed and developed robust and scalable data pipelines using Apache Spark, PySpark, and Scala, processing diverse data formats such as Avro, Parquet, and JSON for real-time and batch workloads.
Architected real-time data streaming applications with Kafka and integrated them with machine learning models to improve enterprise-wide analytics and operational decision-making.
Implemented advanced analytics solutions, including AI-driven validation frameworks, to improve data accuracy and streamline complex workflows in high-volume processing environments.
Enhanced metadata management using Collibra, enabling detailed data lineage tracking, compliance auditing, and operational monitoring for enterprise data governance.
Developed data warehousing and visualization solutions with Snowflake, AWS Redshift, and Tableau, ensuring high performance, scalability, and actionable insights across multiple business units.
Established automated data validation scripts, integrated directly into pipelines, to monitor data integrity and ensure compliance with quality standards.
Designed and developed CI/CD pipelines leveraging AWS CodePipeline, Glue, and Databricks, automating deployment processes for large-scale data engineering workflows across cloud environments.
Created and optimized real-time data processing systems using Apache Spark Streaming and Scala, integrating with Kafka and JMS to handle high-throughput, low-latency data streams efficiently.
Migrated analytics platforms and data warehouses from Azure to AWS, optimizing cost, scalability, and performance for enterprise data solutions.
Built scalable infrastructure using AWS services such as EC2, S3, and CloudFormation, implementing Infrastructure as Code (IaC) for consistent, automated deployments and environment configuration.
Orchestrated end-to-end cloud data solutions using Azure Databricks, Spark, and Python to process large datasets seamlessly and enable efficient data engineering pipelines.
Designed robust testing frameworks for validating data integrity, application performance, and compliance with security protocols across pipeline and infrastructure layers.
Automated complex ETL workflows using AWS Glue and PySpark, optimizing transformations and accelerating the time to insights for business-critical operations.
Migrated legacy SQL and Hive workflows into Spark-based transformations using Spark RDDs and Scala, significantly improving processing speed and query efficiency across distributed environments.
Designed and deployed proof-of-concept (PoC) projects on YARN clusters to compare performance and validate Spark's efficiency relative to Hive and traditional SQL/Teradata operations.
Automated data ingestion workflows from FTP servers to Hive tables using Oozie, ensuring smooth and reliable integration with downstream ETL processes.
Developed Azure Data Factory pipelines to orchestrate data movement and transformation tasks, integrating with Azure Data Lake Analytics and Storage for high-performance ETL processing.
Created advanced user-defined functions (UDFs) in Pig and Hive to analyze customer behavior, supporting enhanced decision-making and personalized marketing strategies.
Built MapReduce jobs to support distributed processing of large datasets, integrating with Hive external tables and implementing partitioning and bucketing strategies for query optimization.
Utilized ETL tools including Talend and Informatica to standardize data migration, cleansing, and integration workflows, ensuring consistency and accuracy across complex data environments.
Tools & Environment: Hadoop, Spark, Hive, Pig, MapReduce, Azure Data Factory, Data Lake, SQL, Oozie, FTP integration, Talend, Pentaho, Informatica, Python, Scala, Java, Teradata.