Nikhila Reddy

Summary

IT professional with over 10 years of experience as a Data Engineer specializing in ETL processes, database management, and cloud platforms. Proven expertise in designing and optimizing data pipelines using SQL and NoSQL databases, ensuring efficient data processing and integration. Extensive experience with Hadoop, Spark, and Azure services, successfully migrating data to cloud environments and developing scalable ETL processes. Strong background in data governance, security, and compliance, with proficiency in managing complex workflows and utilizing modern data engineering tools.

Overview

11 years of professional experience

Work History

Sr Data Engineer

Capgemini
NYC
12.2023 - Current
  • Designed and optimized complex data pipelines using PostgreSQL and MongoDB to support hybrid relational and NoSQL medical data models.
  • Developed scalable data ingestion and transformation frameworks using Java for ingestion logic and Python with Spark for real-time and batch data processing.
  • Leveraged Hadoop HDFS for distributed storage and built high-performance jobs with MapReduce and YARN for processing clinical claims and EHR data.
  • Utilized Hive, Pig, HBase, Sqoop, and Oozie for querying, scripting, NoSQL access, RDBMS-Hadoop integration, and workflow management in big data healthcare ecosystems.
  • Engineered distributed processing jobs with PySpark, Spark Core, and Spark SQL, enabling rapid data transformation and aggregation for reporting and analytics.
  • Built real-time data pipelines using Spark Streaming integrated with healthcare systems for near real-time alerts and operational dashboards.
  • Designed and managed cloud data warehouses using Snowflake and Azure Synapse Analytics, enabling scalable analytics and historical patient data storage.
  • Modeled data marts with Star Schema, Snowflake Schema, Fact and Dimension tables, applying OLAP techniques for complex slicing and dicing of clinical KPIs.
  • Processed and exchanged diagnostic data using multiple data serialization formats including Avro, Parquet, ORC, JSON, and XML for compatibility and performance.
  • Built secure and governed ETL pipelines on Azure Data Factory, Azure Databricks, and Azure Data Lake, ensuring compliance with GDPR, CCPA, and patient care-specific data standards.
  • Integrated Azure Stream Analytics with Power BI for real-time visualization of patient flow and operational metrics; utilized Azure Functions for serverless orchestration.
  • Ensured data security, data encryption, and data governance across ingestion and transformation layers using Azure DevOps for automation and environment consistency.
  • Migrated legacy data using traditional ETL tools like Informatica and SSIS, enabling modernized processing on Azure infrastructure with full data validation and profiling.
  • Containerized Spark-based workloads using Docker and deployed on Kubernetes, improving performance, scalability, and deployment consistency in production pipelines.
  • Built streaming applications using Apache Kafka for ingestion and Apache Flink for low-latency processing of real-time clinical and operational events.
  • Orchestrated infrastructure and version control with Git, Terraform, and Azure DevOps, supporting CI/CD pipelines and Agile delivery through Scrum ceremonies and sprint cadences.
  • Collaborated with cross-functional teams to implement data quality, data validation, and data deduplication processes across all healthcare data workflows, safeguarding the integrity and reliability of insights derived from clinical and operational datasets.
  • Environment: PostgreSQL, Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Synapse Analytics, Azure Cosmos DB, Azure Stream Analytics, Azure Functions, Azure DevOps, Hadoop (HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Oozie), PySpark, Spark (Core, SQL, Streaming), Snowflake, Power BI, Apache Kafka, Apache Flink, Docker, Kubernetes, Git, Terraform, Informatica, SSIS, Avro, Parquet, ORC, JSON, XML, Agile, Jira, GDPR, CCPA Compliance

Sr Big Data Engineer

Deloitte
NYC
07.2021 - 04.2023
  • Designed and optimized ETL pipelines using MySQL and Cassandra to integrate and process large-scale medical datasets, ensuring regulatory compliance with GDPR and CCPA and supporting accurate healthcare analytics.
  • Automated data transformation workflows using Java and Python (PySpark, Spark Core, Spark SQL), facilitating robust data cleansing, validation, and profiling of sensitive clinical information.
  • Leveraged Hadoop HDFS, MapReduce, and Hive for scalable storage and efficient processing of structured and unstructured healthcare-related data to support advanced analytics and reporting.
  • Integrated and deduplicated multi-source patient data using Sqoop, Oozie, and HBase, enabling seamless extraction and reducing errors during loading into clinical data environments.
  • Built real-time data pipelines with Spark Streaming and PySpark to monitor and analyze live patient and operational datasets, enhancing the responsiveness of healthcare systems.
  • Developed cloud-based data warehouses using Snowflake and Azure Synapse to centralize and scale medical data storage, facilitating efficient query performance for clinical insights.
  • Designed and implemented Star Schema and Snowflake Schema models to support accurate ETL, consistent validation, and reliable healthcare data reporting.
  • Utilized data serialization formats like Avro, Parquet, and ORC for efficient medical data storage and retrieval, while applying encryption to maintain compliance with patient data protection standards.
  • Built BI dashboards in Power BI and Tableau for real-time clinical insights and healthcare decision-making.
  • Orchestrated data pipelines using Azure Data Lake, Data Factory, and Databricks for smooth healthcare data integration.
  • Used Azure Stream Analytics and Azure Functions to support real-time ingestion and automation of data validation and cleansing during critical medical data flows.
  • Managed ETL with Informatica and SSIS to migrate legacy healthcare systems securely to cloud platforms.
  • Administered containerized data workflows using Docker and Kubernetes, ensuring scalable, consistent processing of healthcare data across multiple cloud environments.
  • Built and deployed streaming applications using Apache Kafka and Apache Flink to process real-time patient sensor data, clinical logs, and appointment workflows, improving responsiveness and operational efficiency.
  • Implemented infrastructure as code using Git, Terraform, and Azure DevOps to automate CI/CD pipelines, accelerating the deployment of secure, compliant healthcare data solutions.
  • Collaborated within Agile Scrum teams, contributing to sprint planning, retrospectives, and project milestones to ensure timely delivery of integrated and compliant data solutions for healthcare platforms.
  • Environment: Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Synapse Analytics, Cassandra, Azure Stream Analytics, Azure Functions, Azure DevOps, MySQL, Hadoop (HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Oozie), PySpark, Spark (Core, SQL, Streaming), Snowflake, Power BI, Apache Kafka, Apache Flink, Docker, Kubernetes, Git, Terraform, Informatica, SSIS, Avro, Parquet, ORC, JSON, XML, Agile, Jira, GDPR, CCPA Compliance

Big Data Engineer

Blue Cross Blue Shield
Chicago
07.2019 - 06.2021
  • Constructed and maintained ETL processes for extracting, transforming, and loading patient care data from legacy systems to cloud data environments, ensuring data quality and compliance with GDPR and CCPA regulations.
  • Used SQL and NoSQL databases (PostgreSQL, Cassandra) to manage medical data such as lab results and patient records, ensuring HIPAA compliance.
  • Developed real-time data pipelines using Spark Streaming and PySpark, enabling timely processing of patient vitals and medical notes for alert systems, reducing latency and enhancing decision-making.
  • Integrated medical data formats (JSON, Avro, Parquet, ORC) using Azure Blob Storage and Azure Data Lake Analytics for scalable storage and real-time medical analytics.
  • Built data workflows in Azure Data Factory and Spark on Azure Databricks to process electronic health records (EHR), lab test results, and patient observations.
  • Designed a Snowflake cloud data warehouse for centralized querying of EHR systems and historical patient data.
  • Applied data cleansing, validation, and profiling to ensure accurate and high-quality health-related datasets used across patient monitoring platforms.
  • Used Microsoft SQL Server and MongoDB for ETL across structured and unstructured electronic medical records, supporting diverse care delivery models.
  • Processed large-scale patient datasets with HDFS, MapReduce, and Hive, ensuring compliance with data governance and medical industry regulations.
  • Supported data migration projects from on-premises EHR systems to cloud-based medical platforms using Sqoop, Oozie, and HBase.
  • Collaborated with medical analytics teams to implement Star Schema and Snowflake Schema for optimized querying of health outcomes and treatment pathways.
  • Ensured data encryption and data security best practices during the handling of sensitive patient documentation and medical identifiers in compliance with HIPAA guidelines.
  • Containerized data processing workflows using Docker and deployed them on Kubernetes, enhancing scalability and fault tolerance for health analytics platforms.
  • Designed and implemented streaming data solutions with Apache Kafka and Apache Flink to process real-time health sensor data, doctor-patient interactions, and appointment tracking.
  • Automated deployment pipelines and infrastructure management using Git, Terraform, Azure DevOps, and Azure Functions, accelerating the rollout of secure and scalable medical solutions.
  • Optimized data pipelines for faster processing of patient monitoring feeds and diagnostic reports, improving response time and care delivery effectiveness.
  • Built interactive BI dashboards with Power BI and Tableau for medical staff and administrators, supporting clinical decisions with real-time data visualization.
  • Executed Agile Scrum practices, contributing to sprint planning and retrospectives and ensuring timely delivery of compliant medical data solutions.
  • Environment: Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Synapse Analytics, MongoDB, Azure Functions, Azure DevOps, Snowflake, PostgreSQL, Microsoft SQL Server, Hadoop (HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Oozie), PySpark, Spark (Core, SQL, Streaming), Apache Kafka, Apache Flink, Docker, Kubernetes, Git, Terraform, Avro, Parquet, ORC, JSON, XML, Power BI, Jira, Agile Scrum, HIPAA Compliance, CCPA, GDPR

Sr Big Data Engineer

Evoke Technologies
Hyderabad
01.2017 - 02.2019
  • Worked with Hadoop components (HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Oozie) for large-scale data processing and migration to cloud platforms using Informatica, Talend, and Sqoop.
  • Utilized SQL (PostgreSQL) and NoSQL (Cassandra) for structured/unstructured data integration, ensuring high data security and compliance (GDPR/CCPA) in ETL processes.
  • Implemented Java and Python utilities to automate file handling and data validation tasks.
  • Optimized Spark-based workflows with Spark SQL and Spark Streaming for batch and real-time data processing.
  • Designed Star and Snowflake schemas, Fact and Dimension Tables for OLAP queries, ensuring data quality through cleansing, validation, deduplication, and profiling.
  • Directed data serialization and schema integration with Avro, Parquet, ORC, JSON, and XML for efficient cross-system compatibility.
  • Developed real-time streaming solutions using Apache Kafka and Flink for low-latency data pipelines.
  • Built interactive BI dashboards with Power BI and Tableau for data-driven decision-making.
  • Automated ETL workflows with Apache Airflow for scheduling and monitoring complex data pipelines.
  • Implemented Infrastructure as Code with Git, Bitbucket, and Terraform, enhancing scalability and resource management.
  • Integrated CI/CD pipelines with Jenkins and GitLab CI for faster deployment cycles and high-quality software delivery.
  • Participated in Agile Scrum, ensuring timely delivery through Sprint Planning and Backlog Grooming.
  • Managed seamless data migration from legacy systems to cloud, ensuring data integrity and security.
  • Applied data governance practices (GDPR) for data security, encryption, and auditing compliance.
  • Optimized data processing architecture for seamless integration of structured and unstructured data.
  • Environment: Hadoop, HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Oozie, PostgreSQL, MongoDB, Cassandra, Informatica, Talend, Spark, Spark SQL, Spark Streaming, Avro, Parquet, ORC, JSON, XML, Apache Kafka, Apache Flink, Power BI, Tableau, Apache Airflow, Git, Bitbucket, Terraform, Jenkins, GitLab CI, Agile Scrum, Azure.

Hadoop Developer

Infosys
Bengaluru
01.2014 - 12.2016
  • Configured Hadoop ecosystem components, including HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, and Oozie, for distributed data processing, data migration, and workflow automation, leveraging these technologies for large-scale batch and real-time processing.
  • Engineered data flows with SQL (IBM DB2) and NoSQL (MongoDB) for seamless integration of structured and unstructured datasets.
  • Constructed scalable ETL pipelines via Informatica and IBM DataStage, integrated with Hadoop for distributed processing.
  • Coded integration workflows in Java and Scala for backend automation and orchestration.
  • Modeled OLAP systems using Star and Snowflake schemas, optimizing Fact and Dimension tables for fast querying.
  • Developed pipelines with cleansing, validation, profiling, and deduplication to uphold data integrity and compliance.
  • Employed formats like Avro, Parquet, ORC, JSON, and XML for efficient serialization and cross-platform data exchange.
  • Deployed RabbitMQ to facilitate scalable microservice communication in distributed environments.
  • Participated in Agile ceremonies, enhancing delivery through sprint planning, retrospectives, and feedback cycles.
  • Automated ETL jobs in Talend and Informatica, boosting performance across large-scale data systems.
  • Delivered insights via interactive dashboards in Power BI and Tableau, supporting data-driven decisions.
  • Implemented data migration strategies to move legacy data systems to modern cloud-based platforms, ensuring smooth transition and high data integrity during the migration process.
  • Utilized version control with Git and Bitbucket to manage project codebase and infrastructure as code using Terraform to automate deployment and ensure scalability of the data processing architecture.
  • Developed data pipelines for both batch and real-time data processing, utilizing Apache Kafka for streaming data integration, ensuring low-latency data processing across distributed environments.
  • Ensured data security and compliance with CCPA & GDPR regulations, implementing robust data governance frameworks to secure sensitive data during extraction, transformation, and loading processes.
  • Environment: Hadoop, HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, Oozie, PostgreSQL, MongoDB, Cassandra, Informatica, IBM DataStage, Talend, Java, Avro, Parquet, ORC, JSON, XML, Apache Kafka, Apache Flink, Power BI, Tableau, Git, Jenkins, GitLab CI, RabbitMQ, Terraform, Scrum.

Education

Master’s - Management Information Systems

University of Illinois, Springfield
01.2024

Bachelor’s - Computer Science Engineering

CVR College of Engineering
01.2013

Skills

  • Python
  • Scala
  • Java
  • R
  • T-SQL
  • U-SQL
  • PL/SQL
  • Microsoft Azure
  • Azure Data Lake
  • Azure Databricks
  • Azure Data Factory
  • Azure Stream Analytics
  • Azure Cosmos DB
  • Azure Functions
  • Azure DevOps
  • Power BI
  • Tableau
  • Kafka
  • XML
  • RabbitMQ
  • RESTful services
  • SOAP UI
  • WSDL
  • Git
  • GitHub
  • Bitbucket
  • GitLab
  • JIRA
  • Confluence
  • HDFS
  • YARN
  • MapReduce
  • Sqoop
  • Impala
  • HBase
  • Flume
  • Spark
  • Apache Airflow
  • Oozie
  • Hive
  • Pig
  • Hadoop
  • MySQL
  • SQL Server
  • MongoDB
  • Teradata
  • Cassandra
  • PostgreSQL
  • Jenkins
  • Kubernetes
  • Terraform
  • ETL
  • Snowflake
  • Informatica
  • SSIS
  • Windows
  • UNIX
  • LINUX

Timeline

Sr Data Engineer

Capgemini
12.2023 - Current

Sr Big Data Engineer

Deloitte
07.2021 - 04.2023

Big Data Engineer

Blue Cross Blue Shield
07.2019 - 06.2021

Sr Big Data Engineer

Evoke Technologies
01.2017 - 02.2019

Hadoop Developer

Infosys
01.2014 - 12.2016

Master’s - Management Information Systems

University of Illinois, Springfield

Bachelor’s - Computer Science Engineering

CVR College of Engineering
Nikhila Reddy