Praveena V

Senior Data Engineer

Summary

Senior Data Engineer with 11 years of experience in developing and optimizing scalable data pipelines and ETL workflows. Expertise in big data technologies such as Apache Spark and Kafka, along with proficiency in AWS services including Glue and Redshift. Proven ability to automate ETL processes, optimize SQL queries, and program in Python and Java, resulting in enhanced data availability and accuracy. Successfully led teams to create innovative data solutions that improve system efficiency and support business decision-making.

Overview

11 years of professional experience

Work History

Senior Data Engineer – Databricks

JPMC
05.2022 - Current
  • Led the end-to-end design and development of large-scale, cloud-native data pipelines on Databricks, integrating real-time and batch processing capabilities across enterprise environments.
  • Architected and optimized ETL workflows using AWS Glue, enhancing scalability and performance for high-volume data pipelines and ensuring seamless integration with Databricks Lakehouse.
  • Developed streaming data pipelines using Apache Kafka, Spark Structured Streaming, and Spark Streaming on Databricks for real-time ingestion and transformation of business-critical data (an illustrative sketch follows this role).
  • Engineered complex Spark applications using PySpark and Scala to perform high-performance data transformations, aggregations, and analytics across structured and semi-structured data sources.
  • Implemented advanced Spark optimization strategies including data partitioning, in-memory caching, broadcast variables, and shuffle reduction to enhance job efficiency and minimize compute costs.
  • Fine-tuned Spark clusters on Databricks by configuring batch interval timing, executor memory, parallelism, and auto-scaling for optimal throughput and stability.
  • Translated legacy SQL/Hive queries into performant Spark jobs leveraging Spark SQL, RDDs, and DataFrame APIs for large-scale analytics on Databricks.
  • Utilized AWS Glue for schema inference and catalog integration with Parquet and Avro formats, ensuring interoperability across AWS services and Hive-compatible tools.
  • Orchestrated data ingestion from MySQL and other RDBMS sources into S3/HDFS using Sqoop and Glue, supporting both structured and semi-structured data formats.
  • Designed and maintained automated data workflows using Apache Airflow to coordinate Databricks notebooks, AWS Glue jobs, and ingestion tasks.
  • Integrated Databricks with S3, Hive Metastore, and Lake Formation for secure, governed, and scalable data lake operations.
  • Built reusable and parameterized Databricks notebooks and workflows, incorporating delta lake features such as time travel, ACID transactions, and schema enforcement.
  • Managed scalable data solutions in multi-node Databricks clusters, with performance tuning for workloads running on a 105-node distributed compute environment.
  • Created and maintained CI/CD pipelines for Databricks jobs using Git and DevOps tools to ensure version control, testing, and continuous deployment.
  • Developed ML pipelines on Databricks using MLlib, scikit-learn, and TensorFlow, integrating model training and batch scoring with data pipelines.
  • Collaborated with data analysts and business stakeholders to deliver reliable and performant data products, ensuring high availability and data quality across all environments.
  • Environment: Databricks, AWS, AWS Glue, Apache Spark, Spark Structured Streaming, PySpark, Scala, Delta Lake, Apache Kafka, Apache Airflow, Amazon S3, Hive, MySQL, Git, CI/CD Pipelines
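Illustrative sketch (not taken from the JPMC codebase): a minimal Kafka-to-Delta Structured Streaming job of the kind described above, in PySpark. The broker address, topic name, payload schema, and S3 paths are placeholder assumptions.

# Minimal sketch of a Kafka -> Delta streaming ingest on Databricks (PySpark).
# Broker address, topic name, schema, and storage paths are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-delta-sketch").getOrCreate()

# Expected shape of the JSON payload carried in the Kafka message value (assumed).
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")   # placeholder broker
    .option("subscribe", "business-events")               # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; cast to string and parse the JSON body.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
       .select("e.*")
       .filter(col("event_id").isNotNull())
)

# Append to a Delta table with checkpointing so the stream restarts safely.
query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/business_events")  # placeholder path
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start("s3://example-bucket/delta/business_events")   # placeholder path
)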

Senior Data Engineer – Databricks

Capital One
09.2019 - 03.2022
  • Led the end-to-end design and development of enterprise-scale ETL pipelines in AWS using Apache Spark and Python, supporting both batch and streaming workloads across Unix/Linux environments.
  • Architected and implemented data warehouse migration strategies, ensuring adherence to Teradata security models and optimizing SQL-based transformations.
  • Developed robust Spark applications using PySpark and Spark SQL, with advanced optimization techniques including data partitioning, in-memory caching, and broadcast variables to improve performance and reduce costs (see the tuning sketch after this role).
  • Automated data ingestion and transformation processes using Shell scripting, Common Batch Framework, and Common Scheduler, ensuring timely and reliable data delivery for critical business functions.
  • Maintained data warehouse standards and processes, aligning with State of Michigan (SOM) compliance and security protocols.
  • Engineered ETL solutions to extract, cleanse, and load data from multiple source systems into Teradata and other RDBMS, using SQL, shell scripts, and batch automation tools.
  • Authored and promoted build automation scripts and change management workflows, managing DDL/code migrations in adherence to structured promotion processes.
  • Supported audit compliance by implementing access controls, conducting elevated access/user audits, and addressing findings in coordination with SSP documentation.
  • Provided production support and troubleshooting for escalated issues, collaborating with cross-functional teams to resolve complex data anomalies and ensure continuity.
  • Trained and enabled development teams in Unix/Linux, SQL, Spark, and ETL troubleshooting, driving knowledge transfer and enhancing team autonomy.
  • Participated in disaster recovery planning, backup strategies, and continuity testing for data warehouse systems.
  • Environment: Unix, Linux, AWS, Apache Spark, Spark SQL, PySpark, Python, Shell Scripting, Teradata, SQL, RDBMS, Common Batch Framework, Common Scheduler, Build Automation Scripts, Structured Promotion Processes, Teradata Security Models, Access Controls, SSP, Audit & Compliance Tools, ETL Pipelines, Data Warehouse Migration, DDL Management
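Illustrative sketch of the partitioning / caching / broadcast-join pattern referenced above (PySpark). Table names, columns, and paths are hypothetical, not an actual production schema.

# Minimal sketch: broadcast join, caching, and partitioned output (PySpark).
# Dataset names, columns, and S3 paths are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("spark-tuning-sketch").getOrCreate()

transactions = spark.read.parquet("s3://example-bucket/transactions/")  # large fact data (placeholder)
accounts = spark.read.parquet("s3://example-bucket/accounts/")          # small dimension data (placeholder)

# Broadcast the small dimension so the join avoids shuffling the large fact table.
enriched = transactions.join(broadcast(accounts), on="account_id", how="left")

# Cache the filtered intermediate result because it feeds several downstream aggregations.
enriched = enriched.filter(col("status") == "POSTED").cache()

daily_totals = enriched.groupBy("posting_date").sum("amount")
by_region = enriched.groupBy("region").count()
by_region.show(5)

# Repartition on the write key so output files align with how the data is queried.
(daily_totals
 .repartition("posting_date")
 .write.mode("overwrite")
 .partitionBy("posting_date")
 .parquet("s3://example-bucket/curated/daily_totals/"))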

Data Engineer / Application Developer

Samsung
11.2017 - 08.2019
  • Designed, developed, and maintained scalable, enterprise-grade data pipelines and ETL workflows for batch and real-time processing using Apache Spark.
  • Built streaming data ingestion solutions leveraging Apache Kafka and Hadoop ecosystem tools.
  • Architected and implemented cloud-native data solutions on AWS, utilizing services such as S3 and Glue.
  • Managed data warehousing and analytics workloads using AWS Redshift and EMR.
  • Developed data integration and transformation processes using Python and PySpark to ensure data quality and performance.
  • Wrote and optimized SQL queries and automated workflows with Shell scripting.
  • Built, tested (unit and integration), and deployed backend services and REST APIs using Node.js and Java.
  • Containerized applications using Docker and orchestrated deployments with Kubernetes for scalability and resilience.
  • Collaborated with data scientists, analysts, product owners, QA, and DevOps teams to deliver end-to-end data solutions.
  • Optimized Spark jobs and data storage formats such as Parquet and Delta Lake through partitioning and caching techniques.
  • Automated data pipeline orchestration using Apache Airflow, Common Batch Framework, and custom schedulers (an illustrative Airflow sketch follows this role).
  • Implemented data security and access controls in line with organizational policies and compliance standards.
  • Monitored production pipeline health, troubleshot failures, and resolved data anomalies to ensure system reliability.
  • Participated in disaster recovery planning, backup strategies, and resiliency testing for data platforms.
  • Engaged in code reviews, CI/CD pipeline creation, and continuous integration using Jenkins and Maven.
  • Supported frontend development using React.js, Redux, and Vue.js for building UI components related to data applications.
  • Environment: Apache Spark, Apache Kafka, Hadoop ecosystem tools, AWS (S3, Glue, Redshift, EMR), Python, PySpark, SQL, Shell scripting, Node.js, Java, Docker, Kubernetes, Parquet, Delta Lake, Apache Airflow, Common Batch Framework, Jenkins, Maven, React.js, Redux, Vue.js
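Illustrative sketch of the pipeline orchestration described above: a minimal Airflow DAG that lands raw files and then runs a Spark transform. The DAG id, schedule, commands, and paths are placeholder assumptions, not the actual production workflow.

# Minimal sketch of an Airflow DAG coordinating an ingest step and a Spark transform.
# DAG id, schedule, commands, and paths are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_daily_etl",          # placeholder DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 2 * * *",       # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:

    # Land the day's raw files into the staging prefix (placeholder command).
    ingest_raw = BashOperator(
        task_id="ingest_raw",
        bash_command="aws s3 sync s3://example-source/raw/ s3://example-bucket/staging/",
    )

    # Run the Spark transformation over the staged data (placeholder script path).
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit --deploy-mode cluster s3://example-bucket/jobs/transform.py",
    )

    ingest_raw >> transform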

Data Engineer / Application Developer

Tesla
04.2016 - 09.2017
  • Developed scalable ETL pipelines and data workflows using Python and Apache Airflow to support Tesla OS modules.
  • Designed and optimized SQL queries to transform operational data for reporting and analytics using SQL and AWS Redshift.
  • Built and maintained backend services and REST APIs using Node.js and Python, ensuring efficient data access and system communication.
  • Automated data ingestion processes and batch jobs using Shell scripting and Apache Kafka for near real-time updates.
  • Integrated and exposed backend data services to frontend modules built in Angular and React.js.
  • Managed CI/CD pipelines and deployment workflows using Jenkins and GitHub Actions.
  • Stored, queried, and archived application data using AWS S3 and Redshift.
  • Supported frontend and backend integration testing with unit test frameworks in Node.js and Python.
  • Collaborated across product, design, and engineering teams, following Agile and SDLC methodologies.
  • Ensured system reliability and performance through live monitoring and metrics analysis using internal Tesla tools.
  • Environment: Python, Apache Airflow, SQL, AWS Redshift, Node.js, Shell Scripting, Apache Kafka, Angular, React.js, Jenkins, GitHub Actions, AWS S3, Unit Test Frameworks (Node.js & Python), Agile, SDLC

ETL Developer

UHG
12.2014 - 03.2016
  • Developed and maintained robust ETL pipelines using SSIS and T-SQL for processing healthcare claims and patient records from multiple relational databases including SQL Server and Oracle.
  • Designed and deployed interactive dashboards and KPI scorecards in Power BI, delivering actionable insights to business stakeholders and improving reporting efficiency by 40%.
  • Applied data quality checks and validation logic to ensure consistency and accuracy across clinical, financial, and operational datasets.
  • Engineered data preparation workflows to support machine learning models for patient risk stratification and readmission prediction using Python (an illustrative sketch follows this role).
  • Implemented natural language processing (NLP) pipelines to extract clinical insights from unstructured medical notes, improving classification accuracy.
  • Created and tuned stored procedures and views to support large-scale reporting and ad-hoc data requests while ensuring performance optimization.
  • Developed documentation and data dictionaries, and collaborated closely with global stakeholders to translate healthcare domain requirements into effective DWBI and AI/ML solutions.
  • Built scalable data models to support descriptive and predictive analytics, contributing to the organization’s shift from retrospective to real-time decision making.
  • Participated in requirements analysis, performance tuning, and database management activities aligned with software development best practices.
  • Applied AI/ML and deep learning techniques to pilot predictive analytics use cases using frameworks such as TensorFlow, as part of an R&D phase.
  • Collaborated with cross-functional teams to monitor, maintain, and continuously improve data governance and analytics architecture.
  • Ensured compliance with healthcare data standards and participated in audits and quality reviews.
  • Environment: SSIS, SQL Server, Oracle, SQL (T-SQL), Python, Power BI, TensorFlow, pandas, scikit-learn, Stored Procedures, Views, Data Validation, Performance Tuning, Data Modeling, Requirements Gathering, Documentation
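Illustrative sketch of the data preparation and model training referenced above (pandas / scikit-learn). The file path, feature columns, label, and model choice are placeholder assumptions, not the actual UHG pipeline.

# Minimal sketch of a readmission-risk data-prep and training flow (pandas / scikit-learn).
# Input path, column names, and model are illustrative assumptions.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

claims = pd.read_csv("claims_extract.csv")   # placeholder extract from the ETL layer

numeric_cols = ["age", "prior_admissions", "length_of_stay"]     # assumed features
categorical_cols = ["admission_type", "discharge_disposition"]   # assumed features
target_col = "readmitted_30d"                                    # assumed label

X = claims[numeric_cols + categorical_cols]
y = claims[target_col]

# Scale numeric features and one-hot encode categoricals in one reusable pipeline.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))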

Timeline

Senior Data Engineer – Databricks

JPMC
05.2022 - Current

Senior Data Engineer – Databricks

Capital One
09.2019 - 03.2022

Data Engineer / Application Developer

Samsung
11.2017 - 08.2019

Data Engineer / Application Developer

Tesla
04.2016 - 09.2017

ETL Developer

UHG
12.2014 - 03.2016