
Harika Vaddi

Pittsburgh, PA

Summary

I am a Data Engineer with over 11 years of experience specializing in AWS- and GCP-based big data solutions, with a strong track record of designing and implementing scalable ETL/ELT pipelines, real-time data streaming architectures, and cloud-native data platforms across finance, insurance, telecommunications, and healthcare.

My technical expertise spans a wide range of cloud services and tools, including AWS Glue, Redshift, Lambda, EMR, Kinesis, Terraform, Snowflake, Google Cloud Dataflow, BigQuery, and Cloud Composer (Apache Airflow). I have worked extensively with dbt (Data Build Tool) to build modular, testable data transformations and ensure data quality in modern data stack environments. I bring deep experience in developing high-performance data lakes, orchestrating pipelines, managing data governance, and deploying infrastructure as code, with an emphasis on cost-efficiency, performance optimization, and compliance with regulatory standards (HIPAA, GDPR, SOC 2, PCI-DSS).

Proficient in Python, SQL, Apache Spark, and Apache Beam, I thrive in fast-paced, collaborative environments and enjoy solving complex data challenges. I am passionate about enabling data-driven decision-making through robust, scalable, and secure data solutions built on modern cloud and big data technologies.

Overview

  • 14 years of professional experience
  • 3 Certifications

Work History

AWS Data Engineer

PNC (Banking)
09.2022 - Current
  • Designed and implemented scalable ETL/ELT data pipelines using AWS Glue, Lambda, and Step Functions to ingest, transform, and validate data from diverse sources.
  • Utilized dbt (Data Build Tool) to create modular and testable transformations in the cloud data environment, enabling version-controlled, reusable SQL models.
  • Integrated and optimized Snowflake for enterprise-wide analytics, leveraging its performance features (automatic clustering, virtual warehouses) for high-volume reporting workloads.
  • Built real-time streaming pipelines using Kinesis and Apache Kafka to support time-sensitive analytics and operational dashboards.
  • Developed orchestration workflows with Apache Airflow (Cloud Composer) to manage complex dependencies and ensure reliable data delivery.
  • Implemented robust data validation and quality frameworks using PyDeequ and unit testing within ETL processes to ensure data reliability and compliance.
  • Designed and optimized S3-based data lakes, partitioned and cataloged via the Glue Data Catalog and exposed via Athena and Redshift Spectrum (ingestion pattern sketched below).
  • Automated infrastructure provisioning and pipeline deployments using Terraform, CloudFormation, and CI/CD pipelines (GitLab CI).
  • Partnered with DevSecOps to embed security into CI/CD workflows and ensure compliance with HIPAA, SOC 2, and GDPR standards.
  • Tuned and optimized Redshift workloads and federated queries for performance and cost-efficiency.
  • Created reusable metadata-driven frameworks to standardize ingestion and transformation logic across multiple lines of business.
  • Worked closely with cross-functional teams to gather requirements, deliver production-grade solutions, and resolve complex data issues independently.
  • Environment: AWS Glue, AWS Lambda, Amazon S3, Amazon Redshift, Redshift Spectrum, AWS Kinesis, Snowflake, dbt, Athena, Apache Airflow (Cloud Composer), Apache Kafka, Python, SQL, PySpark, Terraform, CloudFormation, GitLab CI/CD, CloudWatch, PyDeequ, IAM, Step Functions
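To illustrate the S3 data-lake ingestion pattern referenced above, here is a minimal PySpark sketch of a batch step that reads raw JSON from S3, applies basic validation and deduplication, and writes date-partitioned Parquet that Athena or Redshift Spectrum can query. The bucket paths, column names (txn_id, event_ts), and app name are hypothetical placeholders, not actual PNC resources.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical bucket paths -- placeholders for illustration only.
RAW_PATH = "s3://example-raw-zone/transactions/"
CURATED_PATH = "s3://example-curated-zone/transactions/"

spark = SparkSession.builder.appName("transactions-batch-ingest").getOrCreate()

# Read raw JSON files landed by upstream feeds.
raw = spark.read.json(RAW_PATH)

# Basic validation: drop records missing the key or timestamp,
# derive a partition column, and deduplicate on the business key.
clean = (
    raw.filter(F.col("txn_id").isNotNull() & F.col("event_ts").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
       .dropDuplicates(["txn_id"])
)

# Write date-partitioned Parquet so Athena / Redshift Spectrum can prune partitions.
(clean.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet(CURATED_PATH))
```

Partitioning on event_date lets downstream query engines scan only the relevant dates instead of the full dataset.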

Azure Data Engineer

Allstate (Insurance)
01.2020 - 09.2022
  • Designed and developed ETL/ELT pipelines using Azure Data Factory (ADF) and Azure Databricks to process structured and semi-structured data across multiple source systems.
  • Leveraged dbt (Data Build Tool) for transformation logic and data modeling within Azure Synapse and Snowflake, promoting modular, version-controlled, and testable SQL workflows.
  • Implemented and maintained Snowflake as a cloud data warehouse, managing role-based access control, cost optimization, and federated queries for cross-domain analytics.
  • Built and managed real-time streaming data pipelines using Azure Event Hubs, Kafka, and Stream Analytics for claims and policy analytics use cases.
  • Developed reusable ingestion frameworks in PySpark and Scala on Azure Databricks, supporting large-scale data onboarding into Data Lake Storage Gen2.
  • Applied data quality checks using Great Expectations and embedded validation into ADF and Databricks pipelines to ensure data reliability and business rule compliance.
  • Automated infrastructure provisioning with Terraform and integrated CI/CD deployments using Azure DevOps and Git-based version control.
  • Collaborated with security and governance teams to implement compliance-driven data policies for PII, HIPAA, and GDPR within enterprise data platforms.
  • Designed and optimized Delta Lake solutions and enabled time-travel queries, data compaction, and schema evolution for better manageability and performance (upsert pattern sketched below).
  • Partnered with data scientists to deliver ML-ready feature pipelines and integrated results back into reporting ecosystems via Power BI.
  • Enabled operational metadata tracking and logging via Azure Monitor, Log Analytics, and custom logging frameworks in ADF and Databricks.
  • Actively participated in sprint planning and agile ceremonies, working cross-functionally with architects, analysts, and QA to deliver end-to-end production data solutions.
  • Environment: Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure Event Hubs, Azure Stream Analytics, Snowflake, dbt, Azure Data Lake Storage Gen2, Apache Kafka, Delta Lake, Python, SQL, Scala, PySpark, Great Expectations, Azure DevOps, Terraform, Power BI, Azure Monitor, Log Analytics, Key Vault
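As a sketch of the Delta Lake work described above, the following Databricks-style PySpark snippet upserts an incremental batch into a Delta table and reads an earlier version via time travel. It assumes a Delta-enabled Spark session; the storage paths, the claims table, and the claim_id key are illustrative assumptions only.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes a Databricks / Delta-enabled Spark session.
spark = SparkSession.builder.appName("claims-delta-upsert").getOrCreate()

# Hypothetical ADLS Gen2 paths -- placeholders for illustration only.
TARGET_PATH = "abfss://lake@exampleaccount.dfs.core.windows.net/curated/claims"
STAGING_PATH = "abfss://lake@exampleaccount.dfs.core.windows.net/staging/claims_delta"

# Incremental batch of claim updates produced by an upstream ADF pipeline.
updates = spark.read.parquet(STAGING_PATH)

target = DeltaTable.forPath(spark, TARGET_PATH)

# Upsert: update existing claims, insert new ones.
(target.alias("t")
       .merge(updates.alias("s"), "t.claim_id = s.claim_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Time travel: read the table as of an earlier version for audit or debugging.
previous = spark.read.format("delta").option("versionAsOf", 3).load(TARGET_PATH)
```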

Big Data Engineer

Sprint (Telecommunications)
05.2017 - 01.2020
  • Designed and implemented big data pipelines for high-frequency data feeds using AWS EMR and AWS Kinesis to support real-time data processing.
  • Developed scalable data lakes on Amazon S3 with optimized data partitioning and lifecycle policies for cost-efficient storage and management.
  • Integrated AWS Kinesis Data Streams for high-volume, low-latency event ingestion, enabling real-time analytics for mobile app and network telemetry data (producer pattern sketched below).
  • Created ETL workflows using AWS Glue to ingest, transform, and load data into Amazon Redshift for business intelligence and reporting.
  • Implemented real-time fraud detection models using AWS SageMaker, integrating machine learning inference into live data pipelines.
  • Built and maintained API-driven data pipelines using AWS API Gateway and AWS Lambda, enabling dynamic connectivity with external systems.
  • Enforced data security using AWS IAM, KMS, and encryption techniques to secure data at rest and in transit.
  • Automated and orchestrated multi-step data workflows using AWS Data Pipeline, ensuring reliability and consistency across systems.
  • Developed CI/CD pipelines using AWS CodePipeline to automate deployment of data workflows and infrastructure components.
  • Utilized Redshift Spectrum to query large datasets in Amazon S3 without loading them into the warehouse, enabling fast and flexible analytics.
  • Collaborated with data scientists to deploy real-time recommendation systems using AWS Lambda and SageMaker.
  • Managed data quality through validation and transformation checks built into AWS Glue workflows.
  • Configured monitoring and alerting with AWS CloudWatch for real-time visibility into data job health and system performance.
  • Led migration of legacy data systems to a modern AWS-based architecture, enhancing scalability and reducing operational costs.
  • Optimized AWS Glue jobs by implementing parallel processing and partitioning, resulting in improved runtime and performance.
  • Environment: AWS Kinesis, AWS EMR, Amazon S3, Amazon Redshift, AWS Glue, AWS Lambda, AWS SageMaker, AWS Data Pipeline, AWS CodePipeline, AWS CloudWatch, AWS IAM, Python, SQL, Apache Spark, JIRA, Git
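A minimal boto3 sketch of the kind of telemetry producer that feeds the Kinesis streaming pipelines above; the stream name, region, and record fields are hypothetical assumptions, not actual Sprint resources.

```python
import json
import time
import uuid

import boto3

# Hypothetical stream and region -- placeholders for illustration only.
STREAM_NAME = "network-telemetry"
kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish_events(events):
    """Batch-publish telemetry events to Kinesis (up to 500 records per call)."""
    records = [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Partition by device so per-device ordering is preserved within a shard.
            "PartitionKey": event["device_id"],
        }
        for event in events
    ]
    response = kinesis.put_records(StreamName=STREAM_NAME, Records=records)
    # Any throttled/failed records should be retried by the caller.
    return response["FailedRecordCount"]

if __name__ == "__main__":
    sample = [
        {"device_id": str(uuid.uuid4()), "signal_dbm": -71, "ts": time.time()}
        for _ in range(10)
    ]
    failed = publish_events(sample)
    print(f"published {len(sample)} events, {failed} failed")
```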

Sr. Data Engineer

WellCare (Health)
10.2014 - 05.2017
  • Designed and implemented end-to-end ETL pipelines using Google Cloud Dataflow and BigQuery to process large-scale healthcare data.
  • Managed and optimized data lakes on Google Cloud Storage (GCS), applying partitioning and lifecycle policies for efficient storage and retrieval.
  • Developed real-time streaming data ingestion pipelines using Pub/Sub and Dataflow to process patient records and insurance claims (pipeline sketched below).
  • Utilized Cloud Functions to enable serverless and event-driven data processing, reducing infrastructure management overhead.
  • Orchestrated complex workflows using Cloud Composer (Apache Airflow) to automate healthcare data processing tasks.
  • Designed and optimized BigQuery schema, ensuring efficient query performance and cost optimization.
  • Implemented Infrastructure as Code (IaC) using Terraform to automate GCP resource provisioning and management.
  • Managed data security and governance by implementing Identity and Access Management (IAM) roles and policies to meet regulatory compliance requirements.
  • Developed machine learning models on Vertex AI, enabling predictive analytics for healthcare trends and decision-making.
  • Integrated data from multiple sources, including on-prem databases, APIs, and data warehouses, automating pipelines using Cloud Data Fusion.
  • Applied data quality validation using Great Expectations and custom Dataflow pipelines to ensure data accuracy.
  • Created custom monitoring dashboards using Cloud Logging and Cloud Monitoring to track ETL job performance and failures.
  • Automated data ingestion and transformation using Cloud Dataprep and Cloud Functions, reducing manual effort.
  • Led data migration projects from on-prem databases to Cloud SQL and BigQuery, ensuring data integrity and minimal downtime.
  • Environment: Google Cloud Dataflow, BigQuery, Cloud Storage (GCS), Pub/Sub, Cloud Functions, Cloud Composer (Apache Airflow), Terraform, Vertex AI, Cloud IAM, Python, SQL, Apache Beam, Looker, JIRA, Git
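The streaming ingestion described above can be sketched as a small Apache Beam pipeline that reads claim events from Pub/Sub and streams them into BigQuery; the project, subscription, table, and schema below are placeholders, not actual WellCare resources.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical resource names -- placeholders for illustration only.
SUBSCRIPTION = "projects/example-project/subscriptions/claims-events"
TABLE = "example-project:healthcare.claims_events"

def parse_event(message: bytes) -> dict:
    """Decode a Pub/Sub message into a row matching the BigQuery schema."""
    event = json.loads(message.decode("utf-8"))
    return {
        "claim_id": event["claim_id"],
        "member_id": event["member_id"],
        "amount": float(event.get("amount", 0)),
        "event_ts": event["event_ts"],
    }

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | "ParseJson" >> beam.Map(parse_event)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                schema="claim_id:STRING,member_id:STRING,amount:FLOAT,event_ts:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```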

ETL Developer

TIBCO
11.2011 - 06.2012
  • Developed and maintained ETL pipelines using Google Cloud Platform (GCP) services such as Cloud Dataflow, Cloud Pub/Sub, and BigQuery for enterprise data integration.
  • Automated batch and streaming ETL processes to extract, transform, and load data from relational databases into analytics-ready formats for reporting and insights.
  • Designed and optimized BigQuery datasets and tables, implementing partitioning, clustering, and query tuning to enhance performance and reduce costs (load pattern sketched below).
  • Utilized Cloud Storage for staging, archiving, and managing large datasets in formats such as JSON, CSV, Parquet, and Avro for structured and semi-structured data processing.
  • Implemented Apache Beam pipelines on Dataflow to perform scalable, parallelized data transformations and aggregations in a distributed environment.
  • Developed and scheduled Cloud Composer (Airflow) DAGs to orchestrate ETL workflows, manage dependencies, and ensure data consistency across multiple GCP services.
  • Configured IAM roles, policies, and VPC Service Controls to enforce secure access control and maintain compliance with data security standards.
  • Assisted in API integrations and data connectors, enabling seamless data exchange between on-premises relational databases, SaaS platforms, and cloud-based storage.
  • Performed data cleansing, validation, and enrichment to improve data quality before ingestion into BigQuery and downstream analytical tools.
  • Supported business intelligence and reporting teams, enabling ad-hoc querying, dashboarding, and advanced analytics through Looker Studio and BigQuery ML.
  • Optimized ETL jobs by implementing incremental data loading, deduplication strategies, and indexing techniques to enhance processing speed.
  • Environment: Google Cloud Platform (GCP), Cloud Dataflow, BigQuery, Cloud Storage, Cloud Pub/Sub, Cloud Composer (Airflow), IAM, VPC Service Controls, Apache Beam, SQL, Python, Cloud Logging, Cloud Monitoring, Looker Studio
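As a sketch of the partitioned, incremental BigQuery loading mentioned above, the snippet below uses the google-cloud-bigquery client to append staged Parquet files from Cloud Storage into a day-partitioned table; the project, dataset, table, partition field, and GCS URI are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers -- placeholders for illustration only.
TABLE_ID = "example-project.analytics.orders"
SOURCE_URI = "gs://example-staging/orders/dt=2012-05-01/*.parquet"

# Append staged Parquet into a day-partitioned table; WRITE_APPEND keeps the
# load incremental rather than rewriting the whole table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="order_date",
    ),
)

load_job = client.load_table_from_uri(SOURCE_URI, TABLE_ID, job_config=job_config)
load_job.result()  # Wait for the load job to complete.

table = client.get_table(TABLE_ID)
print(f"{table.num_rows} rows now in {TABLE_ID}")
```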

Education

Master’s - Information Technology

Gannon University
01.2013

Bachelor's degree

St. Ann's College of Engineering and Technology
01.2011

Skills

  • Big Data & Data Warehousing: Hadoop (HDFS, Hive, HBase, MapReduce, Sqoop, Oozie), Apache Spark, Apache Kafka, AWS Redshift, Snowflake, Amazon S3, Trino
  • Scripting & Programming Languages: Python, Bash/Shell scripting, Java, SQL, PL/SQL; Data Formats: JSON, CSV, Parquet
  • Data Processing Frameworks: Apache Spark (Spark SQL, Spark Streaming), AWS, Apache Storm, Apache Flink, Apache Kafka
  • Cloud Platforms & Services: Amazon Web Services (AWS): S3, Redshift, Glue, Lambda, EMR, Athena, Kinesis, Step Functions, Data Pipeline, RDS, DynamoDB, CloudWatch, IAM, SageMaker; Microsoft Azure: Azure Data Factory, Databricks, Synapse, Azure DevOps; Google Cloud Platform (GCP): BigQuery, Cloud Storage, Dataflow
  • Data Storage & Management: AWS S3, DynamoDB, HBase, Hive, RDS, AWS Secrets Manager, SQL Server, Teradata, Oracle, MS Access
  • Data Integration & ETL: AWS Glue, Apache Airflow, dbt, AWS Lambda, AWS EMR, Sqoop, SSIS, AWS DMS
  • Data Transformation & Modeling: Spark, PySpark, SQL (Views, Functions, Triggers, Stored Procedures, Indexes), HiveQL, Redshift Schema Design
  • Databases: Amazon RDS, Amazon Redshift, Oracle, Teradata, MS SQL Server, Snowflake, SQL Server, PostgreSQL, HBase, DynamoDB
  • ETL, Data Warehousing & BI Tools: Snowflake, Redshift, BigQuery, dbt, Apache Airflow, Power BI, Tableau

Certification

  • Azure Data Engineer Associate (DP-203)
  • Azure Fundamentals (AZ-900)
  • AWS Certified Developer – Associate
