Bharat Yadav Basaboyina

Jersey City, NJ

Summary

  • Highly experienced Data Engineer with over 10 years of expertise in building, optimizing, and managing scalable data infrastructure, enabling data-driven decision-making across diverse domains through advanced data engineering best practices and cloud-native solutions.
  • Proficient in SQL and NoSQL technologies including PostgreSQL, MySQL, Oracle, Cassandra, and MongoDB, delivering efficient querying and storage for both structured and unstructured datasets.
  • Strong programming expertise in Python, Scala, and Java, applied extensively to design robust ETL pipelines, real-time stream processors, and automated data transformation scripts for high-performance data solutions.
  • Deep understanding of the Hadoop ecosystem, including HDFS, MapReduce, YARN, Hive, Pig, HBase, Sqoop, and Oozie, delivering distributed processing and big data analytics solutions at scale.
  • Comprehensive hands-on experience with the Apache Spark ecosystem, including PySpark, Spark Core, Spark SQL, and Spark Streaming, for processing batch and real-time data, performing in-memory analytics, and building machine learning pipelines.
  • Designed and implemented cloud-native ETL workflows using AWS Glue, Lambda, and Athena, enabling automated data extraction, transformation, and querying across distributed systems with high scalability and low operational overhead.
  • Built and maintained data lake architectures on AWS S3 and EMR, integrating real-time and batch data pipelines to support large-scale analytics, reduce processing time, and ensure high availability in production environments.
  • Proven expertise in designing and managing end-to-end data pipelines, encompassing data ingestion, extraction, transformation, and loading (ETL) to support enterprise-wide analytics, machine learning, and business intelligence needs.
  • Specialized in data quality and security practices, including data cleansing, validation, profiling, deduplication, and encryption, ensuring accuracy, integrity, and protection of sensitive data across all processing stages.
  • Strong knowledge of data governance frameworks and compliance standards such as GDPR and CCPA, implementing controls for secure data handling, auditing, and enterprise-level data integration, migration, and processing workflows.
  • Developed and orchestrated complex data pipelines using Apache Airflow, Luigi, and Dagster, ensuring seamless scheduling, monitoring, and dependency management across data workflows.
  • Implemented streaming data applications using Apache Kafka and Apache Flink, enabling real-time data ingestion, event processing, and analytics for time-sensitive business use cases.
  • Applied Infrastructure as Code (IaC) practices using Terraform, enabling consistent, automated, and scalable deployment of cloud infrastructure for data engineering platforms.
  • Designed and optimized large-scale data warehouses using Snowflake, AWS Redshift, Azure Synapse Analytics, and Databricks Delta Lake, driving data availability for advanced analytics and reporting.
  • Specialized in dimensional modeling, including the design and implementation of Star and Snowflake schemas, OLAP cubes, and Fact and Dimension tables for efficient analytical querying.
  • Well-versed in data serialization and file formats including Avro, Parquet, ORC, JSON, and XML, optimizing storage and exchange of large datasets across distributed systems.
  • Developed traditional ETL pipelines using Informatica, Talend, Apache NiFi, and SSIS, ensuring seamless data extraction, transformation, and loading from diverse sources to centralized platforms.
  • Created interactive and insightful data visualizations and dashboards using Tableau, Power BI, and QlikView, empowering business users with self-service analytics and data storytelling.
  • Containerized data engineering applications and deployed them using Docker and Kubernetes, ensuring scalability, portability, and efficient orchestration across environments.
  • Used Git, SVN, and JIRA/Confluence for version control, team collaboration, and project tracking, contributing to streamlined DevOps and Agile software delivery cycles.
  • Integrated CI/CD pipelines using AWS CodePipeline, AWS CodeDeploy, and Jenkins, promoting automation, testing, and continuous integration in the deployment lifecycle of data engineering applications.
  • Focused on real-time data processing, pipeline performance tuning, and data modeling strategies to ensure high availability, reliability, and low-latency insights from complex datasets.
  • Experienced in Agile Scrum and Waterfall SDLC methodologies, collaborating cross-functionally with stakeholders to deliver high-quality data products in iterative and structured development environments.

Overview

11 years of professional experience

Work History

AWS Engineer

Mastercard
01.2023 - Current
  • Developed secure, scalable data solutions integrating PostgreSQL on AWS RDS and MongoDB on EC2, ensuring high availability and low latency for both transactional and analytical workloads.
  • Built data processing modules using Java and Python, handling ingestion, transformation, and integration of multi-source data for real-time analytics and customer behavior modeling.
  • Engineered large-scale batch and real-time pipelines using Hadoop MapReduce on AWS EMR, stored data on HDFS, and orchestrated workflows with Apache Oozie. Leveraged YARN for resource allocation, utilized Hive and Pig for query processing and ETL scripting, and integrated RDBMS sources via Sqoop.
  • Designed NoSQL data models using HBase for high-speed reads/writes of real-time transaction data, optimizing fraud detection use cases with Hadoop-native storage and querying capabilities.
  • Processed real-time and batch data using PySpark, implementing Spark Core, Spark SQL, and Spark Streaming on AWS EMR, enabling complex transformations, joins, and streaming aggregations over high-velocity data streams (see the sketch following this list).
  • Developed ETL pipelines with Spark SQL and PySpark to enrich customer data across channels and store insights for downstream BI dashboards and ML models.
  • Modeled unified customer data using Star and Snowflake schemas, with Fact and Dimension Tables powering OLAP use cases in Snowflake and AWS Redshift, significantly improving reporting accuracy and query performance.
  • Utilized Avro, Parquet, ORC, JSON, and XML formats for ingestion, serialization, and transformation pipelines, ensuring compatibility and performance across storage and processing layers.
  • Leveraged AWS S3 for centralized data lake storage, orchestrated data processing with AWS Glue, and performed analytics using AWS Athena, applying the same pattern across both streaming and batch pipelines.
  • Deployed real-time applications and ETL jobs on AWS EC2 and AWS Lambda, with periodic and event-driven processing handled via Lambda and scheduled Glue workflows.
  • Leveraged AWS EMR to orchestrate distributed data processing using Spark and Hadoop, while integrating MySQL for structured application data and MongoDB for high-volume customer activity logs, enabling efficient multi-stage transformation of raw and semi-structured datasets.
  • Designed and orchestrated scalable customer data pipelines using Apache NiFi and Airflow, standardizing diverse sources like CRMs, mobile apps, and card transactions into unified schemas with reliable, dependency-aware workflows for downstream analytics.
  • Built dashboards and KPI visualizations using Tableau, consuming curated data from AWS Redshift and Athena, helping stakeholders gain insights into customer behavior and product usage.
  • Containerized microservices using Docker and deployed them on Kubernetes for running real-time streaming components, ensuring consistent deployments and horizontal scaling.
  • Engineered a high-performance streaming architecture using Apache Kafka and Apache Flink to enable real-time fraud detection across financial transactions and customer activity streams, and ensured scalable, collaborative, automated deployments through robust infrastructure management with Git, SVN, and Terraform.
  • Led end-to-end delivery of high-impact data features by deploying CI/CD pipelines using Jenkins and AWS CodePipeline for seamless, zero-downtime releases, while actively engaging in Agile Scrum ceremonies and leveraging JIRA and Confluence for efficient task tracking, collaboration, and documentation across cross-functional teams.
  • Performed complex analytical calculations to evaluate and optimize data engineering processes and pipeline performance.
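
To illustrate the streaming aggregation pattern referenced above, the following is a minimal PySpark Structured Streaming sketch: it reads card transactions from Kafka, computes windowed spend per card, and lands curated Parquet on S3. The broker, topic, schema, and bucket names are placeholders, the spark-sql-kafka connector is assumed to be on the cluster classpath, and the production pipelines in this role were more involved than this outline.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("txn-stream-agg").getOrCreate()

    # Read a high-velocity transaction stream from Kafka (broker and topic are placeholders;
    # requires the spark-sql-kafka connector package on the classpath).
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "card-transactions")
           .load())

    # Parse the JSON payload into typed columns (schema is illustrative).
    schema = "card_id STRING, amount DOUBLE, merchant STRING, event_time TIMESTAMP"
    txns = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
               .select("t.*"))

    # Five-minute windowed spend per card, with a watermark to bound late-arriving data.
    agg = (txns.withWatermark("event_time", "10 minutes")
               .groupBy(F.window("event_time", "5 minutes"), "card_id")
               .agg(F.sum("amount").alias("total_spend"),
                    F.count(F.lit(1)).alias("txn_count")))

    # Append curated aggregates to S3 as Parquet for downstream BI and ML consumers.
    query = (agg.writeStream.outputMode("append")
                .format("parquet")
                .option("path", "s3://example-bucket/curated/txn_aggregates/")
                .option("checkpointLocation", "s3://example-bucket/checkpoints/txn_aggregates/")
                .start())
    query.awaitTermination()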

AWS Engineer

Cigna
05.2021 - 12.2022
  • Integrated PostgreSQL hosted on AWS RDS and Cassandra on AWS EC2 to support real-time storage for patient vitals and event-driven clinical data, enabling reliable access to structured and semi-structured datasets.
  • Developed backend microservices in Java for HL7/FHIR parsing and data ingestion, and implemented streaming and batch processing logic using Python and PySpark, supporting low-latency medical data pipelines and model inputs.
  • Leveraged Hadoop MapReduce and HDFS on AWS EMR to archive historical patient monitoring data; used Hive for medical data querying, Pig for transformation scripts, and orchestrated ingestion with Oozie.
  • Extracted electronic medical record (EMR) logs using Sqoop, utilized YARN for resource management, and integrated HBase for device-level time-series reads.
  • Developed real-time analytics pipelines using PySpark (Spark Streaming, Spark SQL, Spark Core) on AWS EMR, transforming patient telemetry and ML-inferred device signals into actionable insights with sub-second latency for alerting and visualization.
  • Modeled historical and analytical data using Star and Snowflake schemas, building Fact Tables and Dimension Tables to support diagnostic reports in AWS Redshift and clinical dashboards in Snowflake.
  • Handled diverse data formats including Avro (device logs), Parquet (structured metrics), ORC (intermediate analytical data), JSON (HL7/FHIR structured events), and XML (legacy lab reports) across ingestion and storage layers.
  • Architected AWS-native solutions using S3 for storing wearable and device data, Glue for transformation and cataloging, Athena for querying time-series archives, and EMR for processing real-time and batch clinical data streams.
  • Implemented event-driven workflows via AWS Lambda integrated with Glue jobs, alongside batch scheduling through Oozie, achieving seamless automation across ingestion, processing, and alerting systems (see the sketch following this list).
  • Designed, deployed, and monitored critical infrastructure and model pipelines across AWS EC2 instances, ensuring fault-tolerant, scalable systems for patient alerting and CDSS workflows.
  • Engineered a hybrid data platform using MySQL and Cassandra for patient engagement, while orchestrating ETL workflows with Informatica and Apache Airflow to automate HL7 parsing, normalization of unstructured health records, and end-to-end model pipelines across cloud-native databases.
  • Developed and deployed real-time monitoring solutions by building Grafana and Kibana dashboards on top of AWS Athena and Elasticsearch, while containerizing stream processing services with Docker and orchestrating them on Kubernetes for high availability and scalability.
  • Designed low-latency streaming architecture using Apache Kafka and Apache Flink integrated with AWS Kinesis for ingesting telemetry from edge devices, enabling immediate alerting and clinical response.
  • Managed infrastructure as code and repositories using Git, Terraform, and AWS CloudFormation, standardizing deployment and enabling multi-environment setup for dev/test/prod stages.
  • Deployed production-ready CI/CD pipelines with Jenkins and AWS CodePipeline, automating testing, image builds, and ECS deployments of patient monitoring and CDSS APIs.
  • Participated in Agile Scrum ceremonies, including sprint planning, backlog grooming, and sprint retrospectives, collaborating with healthcare providers and ML engineers for end-to-end delivery of high-impact healthcare solutions.
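
As one example of the event-driven Lambda-to-Glue automation referenced above, the sketch below shows a Lambda handler that starts a Glue job when a new clinical file lands in S3. The Glue job name and argument keys are hypothetical placeholders, not the actual production configuration.

    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        # Triggered by an S3 object-created event; kicks off a Glue job for the new file.
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Job name and argument keys are illustrative placeholders.
        response = glue.start_job_run(
            JobName="clinical-ingest-job",
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        return {"glue_job_run_id": response["JobRunId"]}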

AWS Engineer

Wells Fargo
06.2019 - 04.2021
  • Utilized Amazon RDS for structured transactional data and Cassandra on AWS EC2 for high-throughput NoSQL storage, enabling efficient retrieval of both streaming and batch credit datasets.
  • Developed real-time credit risk and fraud detection pipelines using Java and Scala, embedding logic into Spark Streaming and MLlib for low-latency machine learning inference.
  • Leveraged the Hadoop ecosystem on AWS EMR, including MapReduce, HDFS, and YARN, for distributed processing. Used Hive for querying, Pig for ETL scripts, HBase for scalable storage, Sqoop for RDBMS integration, and Oozie for workflow orchestration.
  • Designed and optimized real-time and batch data pipelines using Spark Core, Spark SQL, and Spark Streaming to aggregate, transform, and analyze transaction and credit data.
  • Processed continuous streams of financial transactions and loan events using Spark Streaming, performing real-time joins and transformations for downstream ML scoring and alerts.
  • Implemented Snowflake and AWS Redshift for cloud-based analytics, enabling fast, multi-dimensional reporting and financial intelligence across large-scale datasets.
  • Applied dimensional modeling using Star and Snowflake schemas, designing Fact and Dimension tables to support OLAP-based customer and transaction analysis.
  • Handled data serialization and interchange using formats like Avro, Parquet, ORC, JSON, and XML to meet diverse ingestion, transformation, and storage requirements.
  • Built scalable ETL pipelines with AWS Glue, automated orchestration with AWS Lambda, performed ad hoc analysis using Amazon Athena, and managed compute/storage via Amazon S3, EC2, and EMR.
  • Integrated MySQL for structured data staging and Cassandra for unstructured and semi-structured storage to enable seamless hybrid data workflows.
  • Developed ETL workflows using Informatica and Talend to extract, cleanse, and enrich credit and transactional data for analytics and ML model training.
  • Orchestrated data pipelines and model scoring workflows using Apache Airflow and Dagster, managing dependencies across streaming and batch operations (a minimal DAG sketch follows this list).
  • Created business dashboards and monitoring tools using Power BI and Grafana, connected to Athena and Redshift to visualize loan default risks and anomalies in real time.
  • Packaged Spark-based services and ML scoring APIs into Docker containers and deployed them using Kubernetes, ensuring high availability and scalability.
  • Established CI/CD pipelines using Jenkins and AWS CodePipeline; managed source control and infrastructure with Git, Terraform, and CloudFormation; and collaborated in Agile Scrum ceremonies for timely project delivery.
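
A minimal Airflow sketch of the dependency-managed scoring workflow referenced above, assuming Airflow 2.x; the DAG id, task names, and callables are placeholders, and the extract/score bodies are stubs rather than the production logic.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_transactions(**_):
        # Placeholder: pull the day's credit and transaction batch from the staging layer.
        pass

    def score_batch(**_):
        # Placeholder: run ML scoring against the extracted batch and persist results.
        pass

    with DAG(
        dag_id="credit_risk_scoring",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_transactions",
                                 python_callable=extract_transactions)
        score = PythonOperator(task_id="score_batch",
                               python_callable=score_batch)

        # Scoring depends on extraction completing first.
        extract >> score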

AWS Engineer

Fiserv
03.2017 - 05.2019
  • Designed and implemented robust data pipelines using HDFS, Hive, MapReduce, Pig, Sqoop, HBase, and Oozie for distributed storage, transformation, RDBMS imports, and workflow orchestration within the Hadoop ecosystem.
  • Optimized batch and real-time document workflows by leveraging Hadoop YARN for resource management and Hive for querying structured metadata from financial statements.
  • Managed hybrid storage using AWS RDS (PostgreSQL) for relational data and MongoDB for schema-flexible NoSQL workloads, ensuring compliance and data integrity.
  • Developed secure, scalable services in Java and Python to tokenize and encrypt PII/PCI data, integrated with AWS KMS for centralized key management.
  • Utilized PySpark and Spark Core to create high-performance batch pipelines for large-scale e-statement generation, ensuring scalability across millions of records (see the sketch following this list).
  • Implemented real-time processing using Spark Streaming and Spark SQL to ingest, transform, and analyze encrypted transactional data for compliant insights.
  • Modeled data marts using Star and Snowflake schemas, designing Fact and Dimension tables to support OLAP-style reporting and analytics.
  • Engineered end-to-end secure pipelines with Apache NiFi and Apache Spark, handling ingestion, transformation, and loading across distributed environments.
  • Applied data quality, deduplication, validation, and encryption workflows aligned with GDPR and CCPA compliance standards throughout the ETL lifecycle.
  • Worked with serialization and storage formats like Avro, Parquet, ORC, and interchange formats JSON and XML to optimize performance and schema integrity.
  • Architected integrated solutions using MySQL for structured data and Apache Cassandra for high-throughput NoSQL operations in e-statement processing.
  • Built ETL workflows with Apache NiFi and Informatica, consolidating data from CRM systems, mobile apps, and banking platforms for unified analytics.
  • Orchestrated and automated workflows with Apache Airflow, implemented real-time fraud detection using Apache Kafka and Apache Flink, and created real-time dashboards in Power BI and QlikView; managed CI/CD with Jenkins, infrastructure with Terraform, and team collaboration via Git, JIRA, and Confluence in Agile environments.
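
The batch e-statement pipeline referenced above could be sketched as the PySpark job below: it aggregates monthly activity per account, joins account metadata, and writes Parquet partitioned by statement month. The paths, column names, and aggregation logic are illustrative assumptions, not the actual production job.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("estatement-batch").getOrCreate()

    # Input and output locations are placeholders.
    txns = spark.read.parquet("s3://example-bucket/raw/transactions/")
    accounts = spark.read.parquet("s3://example-bucket/raw/accounts/")

    # Aggregate monthly activity per account for statement rendering downstream.
    monthly = (txns.withColumn("statement_month", F.date_format("txn_date", "yyyy-MM"))
                   .groupBy("account_id", "statement_month")
                   .agg(F.sum("amount").alias("total_amount"),
                        F.count(F.lit(1)).alias("txn_count")))

    # Attach the account metadata needed to generate each statement.
    statements = monthly.join(accounts, on="account_id", how="left")

    # Partition by statement month so each generation cycle reads only its own slice.
    (statements.write.mode("overwrite")
               .partitionBy("statement_month")
               .parquet("s3://example-bucket/curated/estatement_inputs/"))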

AWS Engineer

TCS
01.2014 - 10.2016
  • Designed and implemented scalable data pipelines using Hadoop MapReduce, YARN, and HDFS, integrating Siebel CRM data for centralized processing and archival.
  • Utilized Apache Hive for querying loyalty transaction data, Pig for transforming semi-structured campaign inputs, and Sqoop for batch ingestion from Oracle CRM and SQL Server.
  • Implemented HBase for storing high-frequency customer activity logs and orchestrated workflows with Apache Oozie to streamline ingestion and processing.
  • Engineered hybrid storage combining SQL Server for structured loyalty data and Cassandra for real-time customer interactions, enabling low-latency insights.
  • Developed Java-based microservices for customer point redemption integrated with the rewards engine via REST APIs, supporting seamless transactional workflows.
  • Modeled analytical data marts using Star schema with Fact and Dimension tables to enable OLAP-based reporting across customer demographics, transactions, and time-series dimensions.
  • Led end-to-end data workflow development, ensuring high-quality ingestion, transformation, profiling, and validation in compliance with GDPR and CCPA regulations.
  • Developed batch pipelines using Informatica and real-time integrations via SSIS, supporting ingestion from CRMs, mobile platforms, and external sources.
  • Handled diverse data formats including Avro and Parquet for HDFS optimization, as well as JSON and XML from external APIs and legacy systems.
  • Built dashboards and customer insights reports using OBIEE and SAP BusinessObjects, enabling marketing teams to monitor reward trends and segmentation.
  • Integrated RabbitMQ for asynchronous messaging between CRM systems and loyalty services; maintained Git repositories, used JIRA for Agile management, and followed infrastructure-as-code principles.
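
For the RabbitMQ integration in the final bullet above, a minimal publisher sketch using the pika client is shown below; the connection settings, exchange, routing key, and event payload are hypothetical placeholders rather than the actual CRM-to-loyalty contract.

    import json

    import pika

    # Connection, exchange, and routing-key names are illustrative placeholders.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.exchange_declare(exchange="loyalty.events", exchange_type="topic", durable=True)

    event = {"customer_id": "C123", "points_redeemed": 500, "source": "siebel_crm"}

    # Publish a persistent redemption event for the loyalty service to consume asynchronously.
    channel.basic_publish(
        exchange="loyalty.events",
        routing_key="loyalty.redemption.created",
        body=json.dumps(event),
        properties=pika.BasicProperties(delivery_mode=2),  # mark the message persistent
    )
    connection.close()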

Education

Bachelor of Science - Computer Science

Acharya Nagarjuna University
Guntur
08.2013

Skills

  • Python

  • Java
  • Scala
  • Shell Scripting
  • SQL
  • Hadoop
  • MapReduce
  • YARN
  • Hive
  • Pig
  • HBase
  • Sqoop
  • Oozie
  • Spark
  • Flink
  • Kafka Streams
  • Apache Kafka
  • RabbitMQ
  • Amazon MSK
  • PostgreSQL
  • MySQL
  • SQL Server
  • Oracle
  • MongoDB
  • Cassandra
  • Neo4j
  • AWS S3
  • AWS EMR
  • AWS Glue
  • AWS Lambda
  • AWS Athena
  • Informatica
  • Apache NiFi
  • Talend
  • SSIS
  • Apache Airflow
  • Dagster
  • Star Schema
  • Snowflake Schema
  • OLAP
  • Fact Tables
  • Dimension Tables
  • Avro
  • Parquet
  • ORC
  • JSON
  • XML
  • Power BI
  • Tableau
  • QlikView
  • Grafana
  • Kibana
  • OBIEE
  • BusinessObjects
  • Docker
  • Kubernetes
  • Jenkins
  • Git
  • SVN
  • Terraform
  • GitHub Actions
  • JIRA
  • Confluence
  • Agile Scrum
  • Sprint Planning
  • Retrospectives
