Bhavya Mallela

Summary

Experienced Senior Data Engineer with over 7 years of expertise in architecting, developing, and optimizing high-performance ETL pipelines for real-time and batch processing using Apache Spark (Scala and PySpark), SQL, and Python.

Proven success in migrating complex enterprise-scale data workflows from Scala Spark to PySpark, improving modularity, reducing latency, and enhancing maintainability.

Adept in handling diverse data sources and formats, including structured, semi-structured (JSON, Parquet), and unstructured data, using tools like Hive, Delta Lake, and Apache Hudi for efficient data management.

Strong command of relational and NoSQL databases including Oracle, MySQL, PostgreSQL, SQL Server, MongoDB, CouchDB, Amazon Redshift, and Snowflake, leveraging advanced SQL techniques, stored procedures, indexing strategies, and query optimization for large-scale data processing.

Extensive experience across cloud platforms such as AWS (S3, Glue, Lambda, Athena, Redshift, Secrets Manager) and Microsoft Azure (Data Factory, Blob Storage, Synapse Analytics, Azure SQL), with secure integration using IAM, Vault, and Secrets Manager.

Skilled in real-time streaming and event-driven architecture using Apache Kafka and Spark Structured Streaming, building low-latency data flows for business-critical applications.

Proficient in orchestrating workflows using Airflow, Rundeck, and Databricks Workflows, and integrating CI/CD pipelines through Jenkins, GitHub Actions, and Concourse, ensuring automation and seamless deployment across data engineering processes.

Highly collaborative in engaging with cross-functional stakeholders to gather requirements, analyze source data from various upstream systems, and translate complex business rules into scalable Spark/SQL logic.

Experienced in version control and code lifecycle management using Git, GitHub, and Bitbucket.

Passionate about data quality, security, and cost-efficient data architecture, consistently delivering reliable and performant data products that power enterprise reporting, analytics, and ML pipelines.

Overview

7 years of professional experience

Work History

Senior Data Engineer

Comcast
12.2023 - Current
  • Led the complete migration of over 50 Databricks-based ETL jobs from Scala Spark to modular PySpark, covering ingestion, compute, and advanced compute layers, while ensuring functional consistency, improved maintainability, and optimized execution.
  • Developed PySpark pipelines within Databricks, orchestrated through Databricks Workflows, with outputs stored in AWS S3 using Parquet, Delta Lake, and JSON formats to support analytical and operational downstream systems.
  • Re-engineered an existing ETL process that previously wrote to CouchDB, replacing it with a MongoDB-based implementation for better scalability, flexible schema handling, and enhanced integration with microservices and analytics layers.
  • Resolved critical Spark serialization issues that occurred during PySpark migration by identifying scope leakage in closures and restructuring code logic to ensure compliance with distributed processing constraints (a simplified sketch of this fix appears after this list).
  • Built advanced SQL-based transformations by querying and joining data across multiple platforms such as Oracle, PostgreSQL, Teradata, MySQL, Trino, and MongoDB using Spark SQL, JDBC connectors, and tools like DBeaver.
  • Constructed eligibility pipelines that apply complex business rules using PySpark transformations, window functions, and broadcast joins, sourcing data from diverse teams and standardizing it into unified, queryable formats (see the second sketch after this list).
  • Worked directly with business teams to gather eligibility program requirements, interpret rule logic, and implement them as scalable ETL pipelines by sourcing from upstream data providers and validating through SQL.
  • Integrated AWS components including Glue Catalog, Secrets Manager, Athena, and CloudWatch into the pipeline framework for secure credential management, data governance, audit logging, and operational visibility.
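
A minimal sketch of the closure-scope fix referenced above. The class and field names are hypothetical, not taken from the actual codebase; the pattern shown is the standard PySpark remedy of copying needed attributes into local variables so a lambda does not capture a non-serializable object such as the SparkSession.

    from pyspark.sql import SparkSession

    class EligibilityJob:
        def __init__(self, spark: SparkSession, threshold: int):
            self.spark = spark          # SparkSession is not picklable
            self.threshold = threshold

        def run_broken(self, df):
            # BROKEN: the lambda captures `self`, so Spark must pickle the
            # whole object (including the SparkSession) and fails.
            return df.rdd.filter(lambda row: row.score >= self.threshold)

        def run_fixed(self, df):
            # FIXED: copy the primitive into a local variable; the closure
            # now captures only a serializable int.
            threshold = self.threshold
            return df.rdd.filter(lambda row: row.score >= threshold)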
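
And a hedged sketch of the eligibility-pipeline pattern: a small rules table is broadcast-joined to a large fact table, a window function keeps the latest record per subscriber, and the result lands in S3 as Delta. Bucket paths, table names, and column names are illustrative only.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.appName("eligibility").getOrCreate()

    facts = spark.read.parquet("s3://example-bucket/usage/")          # large input
    rules = spark.read.parquet("s3://example-bucket/program_rules/")  # small lookup

    latest = Window.partitionBy("subscriber_id").orderBy(F.col("event_ts").desc())

    eligible = (
        facts
        .join(F.broadcast(rules), "program_id")         # broadcast avoids a shuffle
        .withColumn("rn", F.row_number().over(latest))  # latest event per subscriber
        .filter((F.col("rn") == 1) & (F.col("usage") >= F.col("min_usage")))
        .drop("rn")
    )

    # Requires a Delta-enabled runtime (e.g., Databricks or delta-spark).
    eligible.write.format("delta").mode("overwrite").save("s3://example-bucket/eligible/")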

Data Engineer

Tata Consultancy Services
05.2019 - 07.2022
  • Developed and maintained ETL pipelines using Scala Spark and PySpark, performing both batch and micro-batch processing for transforming structured and semi-structured datasets across domains like telecom and healthcare.
  • Utilized Azure Data Factory (ADF) for data movement and transformation workflows, sourcing from Azure Blob Storage, on-prem SQL Server, and external APIs, and loading processed data into Azure SQL Database and Synapse Analytics.
  • Built and managed real-time data ingestion pipelines using Apache Kafka, handling high-throughput streaming data and integrating it with Spark-based processing layers and NoSQL storage like MongoDB (a minimal sketch appears after this list).
  • Designed and executed ETL processes that staged data into Snowflake, applying transformation logic using SQL and supporting concurrent analytical queries through external BI tools like Power BI.
  • Worked extensively with databases including Oracle, Teradata, PostgreSQL, and MySQL, crafting optimized SQL queries, stored procedures, and views for downstream analytics and operational reporting.
  • Followed Agile methodologies for sprint planning, backlog grooming, and delivering incremental improvements, using tools like Jira for story tracking and Confluence for documentation.
  • Used Git and GitHub for source code versioning, managed pull requests, resolved merge conflicts, and followed branch management best practices for CI/CD integration and collaboration with cross-functional teams.
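
A minimal Structured Streaming sketch of the Kafka ingestion pattern described above. The broker address, topic name, schema, and sink paths are placeholders; a MongoDB sink would typically be wired in through foreachBatch instead of the file sink shown here.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    # Assumes the spark-sql-kafka connector is on the cluster classpath.
    spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("event_ts", LongType()),
        StructField("status", StringType()),
    ])

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "telemetry")
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    (events.writeStream
        .format("parquet")                                  # micro-batch file sink
        .option("path", "/landing/telemetry/")              # placeholder path
        .option("checkpointLocation", "/checkpoints/telemetry/")
        .start())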

Software Developer

Grepthor Solutions
07.2018 - 05.2019
  • Wrote and optimized SQL queries to extract, join, and filter data from relational databases such as MySQL, PostgreSQL, and SQL Server, supporting business reports and ad hoc data requests.
  • Developed interactive dashboards and visualizations using Tableau, including bar charts, filters, KPIs, and calculated fields, helping stakeholders derive insights from structured data.
  • Built simple web-based dashboards using HTML, CSS, and JavaScript, integrating backend data through APIs or static datasets for internal reporting interfaces.
  • Gained hands-on exposure to AWS services like S3 for data storage, RDS for managing relational databases, and Athena for querying S3 data using SQL (illustrated in the sketch after this list).
  • Supported basic ETL workflows by assisting with data validation, cleaning, and transformation tasks under guidance, learning industry best practices for scalable and reusable data pipelines.
  • Collaborated with cross-functional teams to define project requirements and deliver software solutions.
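
A hedged illustration of the Athena usage mentioned above: running a SQL query over S3-backed tables with boto3. The database, table, and bucket names are invented for the example.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    resp = athena.start_query_execution(
        QueryString="SELECT region, COUNT(*) AS orders FROM sales GROUP BY region",
        QueryExecutionContext={"Database": "reporting"},
        ResultConfiguration={"OutputLocation": "s3://example-results/athena/"},
    )
    # Poll athena.get_query_execution(QueryExecutionId=...) until SUCCEEDED,
    # then fetch rows with athena.get_query_results(...).
    print(resp["QueryExecutionId"])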

Education

Bachelor of Technology
Vasireddy Venkatadri International Technological University, 01-2018

Computer Science
University of Central Missouri, 01-2023

Skills

  • Big Data & Frameworks: Apache Spark (Scala & PySpark), Hive, Delta Lake, Apache Hudi, Hadoop, NumPy, pandas
  • Databases & Warehousing: Oracle, MySQL, Teradata, Trino, SQL Server, PostgreSQL, MongoDB, Snowflake
  • Languages: Python, Scala, SQL (PL/SQL, T-SQL), Shell Scripting
  • Cloud Platforms: AWS (S3, Glue, EC2, Lambda, Athena, DynamoDB, Secrets Manager, Redshift, Kinesis, CloudWatch), Azure (Data Factory, Blob Storage, Synapse Analytics, Azure SQL, Azure Monitor, Azure Functions)
  • CI/CD & Orchestration: Jenkins, Concourse, Rundeck, Airflow
  • Web Technologies: HTML, JavaScript, CSS
  • Tools & Utilities: Git, GitHub, Bitbucket, Databricks
