Senior Data Engineer with expertise in designing and optimizing big data solutions across GCP, AWS, and Azure, specializing in Apache Spark, PySpark, Kafka, Flink, and SQL for large-scale data processing. Skilled in building real-time and batch data pipelines on BigQuery, Dataproc, Snowflake, and Redshift to drive high-performance analytics and reporting. Proficient in ETL/ELT workflows, machine learning data engineering, and cloud-native architectures, with hands-on experience in Terraform, Airflow, Kubernetes, and CI/CD automation for efficient deployment and orchestration of data workflows. Adept at implementing data governance, security, and quality frameworks using Great Expectations, dbt, and SQL optimization to ensure data integrity and compliance. Passionate about building scalable, cloud-based data ecosystems that enable real-time insights and data-driven decision-making, and about collaborating in Agile environments to advance business intelligence and operational efficiency.
Overview
6 years of professional experience
Work History
Sr. Data Engineer
United Airlines
Houston
07.2024 - Current
End-to-End Cloud Data Pipeline Design: Led the design, implementation, and optimization of end-to-end data pipelines on Google Cloud Platform (GCP), utilizing BigQuery, Dataproc, Pub/Sub, and Google Cloud Storage (GCS) to process large-scale structured and unstructured datasets, enabling seamless data flow across the organization.
Real-Time Data Streaming Architecture: Engineered a robust real-time streaming architecture using Apache Kafka, Apache Flink, and Google Pub/Sub, handling millions of records per second and ensuring low-latency data ingestion for real-time analytics, fraud detection, and monitoring.
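For illustration, a minimal sketch of the kind of streaming consumer involved (the topic name, brokers, and fields are hypothetical placeholders, not the production configuration):

```python
# Minimal streaming-ingestion sketch using the kafka-python client.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "booking-events",                          # hypothetical topic name
    bootstrap_servers=["localhost:9092"],      # placeholder brokers
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
    enable_auto_commit=True,
)

for message in consumer:
    event = message.value
    # Downstream: forward to Pub/Sub or a Flink job for enrichment/alerting.
    print(event.get("event_type"), event.get("timestamp"))
```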
Developed and optimized PySpark applications on Dataproc to handle data transformations, aggregations, and feature engineering for both batch and streaming workloads, leading to a 30% reduction in processing times and significant performance improvements.
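A simplified example of this kind of PySpark batch aggregation on Dataproc (bucket paths, table schema, and column names are illustrative, not the actual pipeline):

```python
# Illustrative PySpark aggregation job; paths and columns are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-agg-sketch").getOrCreate()

flights = spark.read.parquet("gs://example-bucket/flights/")  # placeholder source

daily_stats = (
    flights
    .withColumn("dep_date", F.to_date("departure_ts"))
    .groupBy("dep_date", "origin")
    .agg(
        F.count("*").alias("departures"),
        F.avg("delay_minutes").alias("avg_delay"),
    )
)

daily_stats.write.mode("overwrite").parquet("gs://example-bucket/daily_stats/")
```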
Led the automation of ETL workflows using Apache Airflow, integrating services such as Dataproc, BigQuery, and Pub/Sub, resulting in a 40% improvement in pipeline reliability and a 50% reduction in manual intervention.
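The orchestration pattern resembles the following simplified Airflow DAG (task bodies, IDs, and schedule are placeholders for illustration only):

```python
# Simplified Airflow DAG shape for an extract -> transform -> load workflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # e.g. pull files from GCS or read from a Pub/Sub subscription

def transform():
    pass  # e.g. submit a Dataproc PySpark job

def load():
    pass  # e.g. load results into a BigQuery table

with DAG(
    dag_id="etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```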
Successfully migrated legacy Hadoop-based workloads (HDFS, Hive, MapReduce) to GCP (Dataproc and BigQuery), reducing processing time by over 60% and achieving a 25% reduction in operational costs through cloud storage optimization and improved compute efficiency.
Established and implemented data governance and quality assurance frameworks using Great Expectations and dbt, ensuring data integrity, accuracy, and compliance. This led to a 40% reduction in data inconsistencies and improved stakeholder trust in data-driven decision-making.
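A minimal data-quality check in this spirit, using the classic pandas-style Great Expectations API (the exact API varies by version; file and column names are hypothetical):

```python
# Sketch of pre-load data-quality gating with Great Expectations.
import great_expectations as ge

batch = ge.read_csv("bookings_sample.csv")  # placeholder extract

results = [
    batch.expect_column_values_to_not_be_null("booking_id"),
    batch.expect_column_values_to_be_unique("booking_id"),
    batch.expect_column_values_to_be_between("fare_amount", min_value=0),
]

if not all(r.success for r in results):
    raise ValueError("Data quality checks failed; blocking downstream load.")
```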
Integrated machine learning-based anomaly detection models into the data pipeline using Vertex AI, enabling real-time fraud detection and early identification of operational anomalies, resulting in a 20% increase in fraud detection accuracy.
Spearheaded the creation of CI/CD pipelines using Terraform and Google Cloud Build, automating the deployment of data workflows and infrastructure. This reduced deployment time by 35% and minimized system downtime, ensuring continuous delivery of high-quality data pipelines.
Collaborated closely with data scientists, business analysts, and product managers to design data pipelines that aligned with business objectives, enabling real-time reporting and actionable insights. Enhanced decision-making for the business by streamlining data availability.
Mentored a team of junior engineers, conducting regular knowledge-sharing sessions, code reviews, and training in big data technologies such as PySpark, GCP, and Apache Airflow. This resulted in a noticeable improvement in team efficiency and technical expertise.
Developed real-time data dashboards using Tableau and Python-based tools like Dash and Matplotlib to monitor key pipeline metrics, providing stakeholders with actionable insights into pipeline performance, data health, and business KPIs.
Engineered and deployed optimized data storage and retrieval strategies in GCS and BigQuery, utilizing file formats like Parquet and Avro, dynamic partitioning, and cost-efficient query techniques. This led to a 30% reduction in storage costs and faster query execution.
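As a sketch of the partitioned columnar layout described above (bucket paths and the partition column are hypothetical):

```python
# Writing Parquet partitioned by date so downstream queries prune partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write-sketch").getOrCreate()

events = spark.read.json("gs://example-bucket/raw/events/")  # placeholder source

(
    events
    .write
    .mode("overwrite")
    .partitionBy("event_date")        # enables partition pruning at query time
    .parquet("gs://example-bucket/curated/events/")
)
```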
Ensured compliance with data privacy and security regulations by implementing encryption protocols and role-based access controls (RBAC) in GCP. Leveraged IAM roles and policies to enforce secure access to sensitive data across cloud environments.
Performance Tuning & Query Optimization: Continuously optimized SQL queries and data models in BigQuery, leveraging partitioning, clustering, and indexing strategies to improve query performance, reducing query execution times by 25%, and enhancing overall system responsiveness.
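For illustration, partitioning and clustering can be applied through the google-cloud-bigquery client roughly as follows (project, dataset, table, and column names are placeholders):

```python
# Hypothetical BigQuery DDL applying date partitioning and clustering.
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE IF NOT EXISTS `example-project.analytics.flight_events`
PARTITION BY DATE(event_ts)
CLUSTER BY origin, carrier AS
SELECT * FROM `example-project.staging.flight_events_raw`
"""

client.query(ddl).result()  # runs the DDL and waits for completion
```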
Worked on cross-cloud integration projects, connecting GCP with AWS and Azure services to create hybrid architectures for data processing and analytics, improving data accessibility and scalability across multiple cloud platforms.
Data Engineer
Capital One
Delaware
09.2023 - 06.2024
Designed and implemented scalable ETL pipelines using Python and Apache Spark (PySpark) to process terabytes of structured and unstructured data from multiple sources, improving data flow efficiency.
Developed real-time data streaming solutions using Apache Kafka and Azure Event Hubs, integrating high-volume transaction data for real-time fraud detection and risk assessment.
Built batch and streaming data pipelines in Azure Synapse, Azure Data Lake (ADLS), and Databricks, optimizing data storage, transformation, and retrieval for machine learning models.
Automated data ingestion and transformation workflows using Apache Airflow and Azure Data Factory (ADF), reducing manual intervention and improving data consistency.
Migrated legacy Hadoop and on-prem databases to cloud-based architectures (Azure Synapse, Snowflake), optimizing query performance and cost efficiency for data analytics.
Designed and optimized complex SQL queries (window functions, CTEs, indexing, partitioning) across Azure SQL Database, Synapse, Snowflake, and PostgreSQL, enhancing data accessibility for analytics teams.
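A hedged example of the CTE/window-function pattern, executed through SQLAlchemy (the connection string, table, and columns are hypothetical placeholders):

```python
# Latest transaction per account via a CTE and ROW_NUMBER() window function.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@host:5432/analytics")

query = text("""
WITH ranked_txns AS (
    SELECT
        account_id,
        txn_ts,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY account_id ORDER BY txn_ts DESC
        ) AS rn
    FROM transactions
)
SELECT account_id, txn_ts, amount
FROM ranked_txns
WHERE rn = 1          -- keep only the most recent row per account
""")

latest_txns = pd.read_sql(query, engine)
```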
Developed feature engineering pipelines in PySpark and Pandas, improving machine learning model performance and accelerating data preprocessing workflows.
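A minimal pandas feature-engineering sketch in this vein (input file and column names are hypothetical):

```python
# Deriving simple behavioral features for downstream model training.
import numpy as np
import pandas as pd

txns = pd.read_parquet("transactions_sample.parquet")  # placeholder input

txns["txn_ts"] = pd.to_datetime(txns["txn_ts"])
txns["txn_hour"] = txns["txn_ts"].dt.hour
txns["is_weekend"] = txns["txn_ts"].dt.dayofweek >= 5
txns["amount_log"] = np.log1p(txns["amount"])

# Compare each transaction to the account's average spend.
txns["avg_amount_by_account"] = txns.groupby("account_id")["amount"].transform("mean")
txns["amount_vs_avg"] = txns["amount"] / txns["avg_amount_by_account"]
```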
Implemented CI/CD pipelines using Terraform and Azure DevOps, automating infrastructure provisioning and ensuring reliable deployment of data pipelines.
Built role-based access controls (RBAC) and encryption protocols using Azure Key Vault and ADLS, ensuring compliance with security standards.
Designed scalable and high-performance data warehouse solutions using Star and Snowflake schemas in Snowflake, Synapse, and BigQuery, supporting advanced analytics workloads.
Integrated Power BI and Tableau dashboards with cloud data sources, enabling real-time business intelligence and data-driven decision-making.
Worked in Agile development sprints, collaborating with cross-functional teams to optimize data pipeline performance and implement new features.
Environment: Python (Pandas, PySpark, SQLAlchemy, FastAPI), SQL (Azure SQL, PostgreSQL, MySQL, Snowflake, BigQuery), Azure (Synapse, Data Factory, Event Hubs, Data Lake, Key Vault, Azure DevOps), Apache Spark, Airflow, Kafka, Terraform, Databricks, Hadoop (HDFS, Hive, YARN), Docker, Kubernetes (AKS), Power BI, Tableau.
Data Engineer
Tata Consultancy Services
Hyderabad
08.2019 - 04.2023
Designed and deployed scalable data pipelines on AWS, leveraging Amazon S3, Glue, and EMR to process and store massive datasets efficiently.
Developed Java-based ETL workflows using Apache Spark on AWS EMR, optimizing transformations, aggregations, and joins for improved performance and fault tolerance.
Integrated real-time data streaming using Apache Kafka and Amazon Kinesis, enabling event-driven architectures and reducing data ingestion latency.
Architected and implemented a data lake solution on Amazon S3, organizing raw, processed, and curated datasets using efficient storage formats like Parquet and ORC.
Built and optimized data warehousing solutions on Redshift and Snowflake, improving analytical query performance through columnar storage, partitioning, and indexing techniques.
Developed CI/CD pipelines for data workflows using AWS CodePipeline, Terraform, and CloudFormation, automating deployment and infrastructure provisioning.
Implemented Airflow DAGs to orchestrate data pipelines, scheduling complex dependencies across AWS Glue, Redshift, and EMR for batch and streaming workloads.
Designed fault-tolerant data ingestion pipelines integrating APIs, PostgreSQL, MySQL, and MongoDB, ensuring high availability and consistency of ingested data.
Optimized Java-based Spark applications by tuning memory management, parallelism, and shuffle operations, reducing execution time and improving resource efficiency.
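The tuning approach looks roughly like the following (shown in PySpark for brevity; the configuration values, paths, and join are illustrative, not the production settings):

```python
# Sketch of Spark tuning: executor sizing, shuffle width, and a broadcast join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("tuned-etl-sketch")
    .config("spark.executor.memory", "8g")           # size executors to the working set
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "400")   # match shuffle width to data volume
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

orders = spark.read.parquet("s3://example-bucket/orders/")          # placeholder paths
dim_customer = spark.read.parquet("s3://example-bucket/dim_customer/")

# Broadcasting the small dimension table avoids a full shuffle of the large side.
joined = orders.join(broadcast(dim_customer), on="customer_id", how="left")
```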
Containerized Spark jobs using Docker and Kubernetes (EKS), enabling auto-scaling and efficient resource utilization for distributed processing workloads.
Implemented real-time log analytics solutions using Elasticsearch, Logstash, and Kibana (ELK stack), improving system monitoring and anomaly detection.
Developed custom monitoring scripts in Python and Java, integrating with AWS CloudWatch and Prometheus to track pipeline performance, failures, and latency issues.
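A minimal monitoring sketch in that spirit, publishing a pipeline-latency metric to CloudWatch via boto3 (the namespace, metric, and dimension values are hypothetical):

```python
# Publishing a custom pipeline-latency metric to CloudWatch.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def report_latency(pipeline_name: str, latency_seconds: float) -> None:
    cloudwatch.put_metric_data(
        Namespace="DataPipelines",                    # hypothetical namespace
        MetricData=[
            {
                "MetricName": "EndToEndLatencySeconds",
                "Dimensions": [{"Name": "Pipeline", "Value": pipeline_name}],
                "Value": latency_seconds,
                "Unit": "Seconds",
            }
        ],
    )

report_latency("orders-ingest", 42.5)
```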
Automated data quality checks using Great Expectations, ensuring data integrity before ingestion into data warehouses and analytical platforms.
Designed and implemented access control policies using AWS IAM and Lake Formation, enforcing data security, governance, and compliance standards.
Mentored junior engineers on Big Data frameworks, Java-based Spark development, and best practices in data engineering, fostering a culture of continuous learning.