
Ravi Kiran Alluri

New York

Summary

Senior Data Engineer with 8+ years of experience building scalable data pipelines and distributed systems for fraud detection, healthcare analytics, and enterprise data platforms. Proven track record of architecting multi-terabyte ETL workflows, optimizing Spark performance by 40%+, reducing compute costs by 30-40%, and migrating legacy systems to modern frameworks, reducing code complexity by 60%. Certified Databricks Developer with expertise in AWS EMR Serverless, Apache Spark, Kafka streaming, Azure Databricks, and ML feature engineering pipelines.

Overview

6 years of professional experience
2 Certifications

Work History

Senior Data Engineer

Intuit
New York
02.2025 - Current
    • Architected and implemented Batch Processing Pipelines (BPP) using AWS EMR Serverless and QuickETL framework, processing multi-terabyte fraud detection datasets across fraud_360_dm, fraud_360_stable, and fraud_360_rpt schemas with nightly scheduling and dependency orchestration

    • Engineered real-time streaming pipelines leveraging Apache Kafka and EventBus integration, reducing fraud alert latency by 60% and enabling low-latency content moderation workflows for Mailchimp fraud detection

    • Optimized Spark job performance by 30%+ through executor tuning, broadcast joins, partition pruning, and adaptive skew join strategies, configuring Spark adaptive execution with skewed partition detection (256MB threshold) and dynamic coalescing for unevenly distributed data

    • Optimized AWS EMR Serverless memory allocation by fine-tuning executor memory, memory overhead, and dynamic allocation parameters, implementing right-sized configurations that reduced compute costs by 30-40% while maintaining pipeline performance and reliability

    • Migrated legacy PySpark codebases to QuickETL framework, converting imperative Python ETL scripts to declarative HOCON configuration files, reducing code complexity by 60% and standardizing pipeline patterns across fraud detection workflows

    • Built ML feature engineering pipelines with time-windowed aggregations, user behavioral analytics, and transaction-level features stored in Delta format, supporting downstream SageMaker model training workflows

    • Designed dimensional data models using star schema patterns with SCD Type-2 logic for historical fraud label tracking, implementing CDC tracking and partition management strategies to ensure data freshness and enable automated reprocessing

    • Developed scalable ETL pipelines integrating data from Redshift, S3, and DynamoDB into unified fraud data marts, supporting fraud scoring and risk analytics with multi-environment configurations (APRD, PRD, E2E, TAXAPRD, TAXPRD)

    • Implemented CI/CD workflows using Jenkins and Git sync processes, automating testing, validation, and deployment of QuickETL pipeline configurations to S3 with version control and rollback capabilities

    • Collaborated with data scientists and analysts to design reusable pipeline components for fraud feature engineering, improving experimentation velocity and enabling faster model iteration cycles

Senior Data Engineer

Brady Plus
California
03.2023 - 12.2024
    • Developed and orchestrated Azure Data Factory (ADF) pipelines for ETL processing, using Copy Activity, Data Flow, ForEach, and Lookup activities to transform and load data into a SQL Server staging area, with incremental loads driven by watermark tables

    • Implemented real-time data ingestion pipelines using Apache Kafka as streaming source and Databricks for data processing, transforming, and loading into Delta Lake with ACID transaction support

    • Optimized PySpark jobs by implementing partitioning, caching, and broadcast joins, reducing execution time and improving overall job performance for large-scale healthcare datasets

    • Connected Azure Synapse to Delta Lake using Spark pools, enabling seamless access to Delta tables and leveraging ACID properties for data transactions

    • Secured sensitive connection details by integrating Azure Key Vault with ADF and Synapse, ensuring credentials such as database passwords, API keys, and storage account keys are safely managed

    • Built scalable ETL pipelines by integrating Kafka and Databricks, enabling real-time analytics and data transformation on streaming data for high-performance data engineering solutions

    • Designed and implemented complex data models in SQL Server and Synapse following best practices, leveraging star schema design with SCD-1 and SCD-2 implementations

    • Containerized PySpark code in Docker, reducing deployment times and enhancing portability, allowing easy migration and testing across ADF and Databricks environments

    • Automated unit testing and integration testing within CI/CD pipelines to validate functionality and performance of data pipelines, reducing risk of defects and ensuring high-quality code delivery

Senior Data Engineer

Optum (UHG) Healthcare
Minnesota City
05.2021 - 02.2023
    • Designed and implemented Azure Data Factory pipelines to orchestrate data migration and ETL processes across multiple platforms, ensuring seamless data flow between on-premises and Azure cloud storage

    • Deployed real-time stream processing jobs on Databricks using Kafka as data source, leveraging Structured Streaming APIs for handling high-velocity healthcare data streams

    • Migrated on-premises Spark jobs to Databricks, rewriting Scala-based jobs (previously built into JAR files with Maven) as Python notebooks for seamless execution

    • Optimized PySpark code by fine-tuning RDD transformations, actions, and Spark SQL queries, improving job execution time by 25% while handling large-scale healthcare datasets

    • Designed and implemented Medallion Architecture (Bronze, Silver, Gold layers) in Databricks to organize data workflows, enabling scalable, modular data pipeline structure for batch and streaming data

    • Utilized Databricks Delta Lake for efficient handling of SCD Type 2 changes, leveraging merge operations for managing inserts, updates, and deletes seamlessly

    • Designed and orchestrated complex ETL workflows using Apache Airflow, automating data pipelines with DAGs to schedule, monitor, and manage tasks across distributed environments

    • Processed real-time data streams using Kafka Streams API for data enrichment, aggregation, and windowed computations, delivering low-latency insights for business-critical healthcare applications

    • Integrated Databricks and Snowflake workflows with CI/CD pipelines, automating deployment and versioning of notebooks, jobs, and SQL scripts, ensuring consistency across environments

    • Managed and secured Delta Tables using Unity Catalog for fine-grained access controls, simplifying data governance and enabling compliance with HIPAA and organizational data security policies

Senior ETL Developer

Tabula Rasa HealthCare
Moorestown
01.2020 - 04.2021
    • Designed and implemented distributed data processing pipelines using Apache Spark, Hive, and Python for healthcare data analytics

    • Utilized Terraform to define and manage AWS infrastructure as code, automating provisioning and configuration of EC2 instances, S3 buckets, VPCs, RDS databases, and IAM roles

    • Built and deployed Python-based AWS Lambda functions to integrate with external APIs, handling complex data transformations and securely transmitting data to downstream services

    • Extracted data from multiple source systems (S3, Redshift, RDS) and used Glue Crawlers to create tables and databases in the AWS Glue Data Catalog

    • Used AWS EMR to transform and move large amounts of data into and out of AWS S3, optimizing data processing workflows for healthcare analytics

Education

Master of Science

Southern Illinois University, Edwardsville
Edwardsville, IL
12.2018

Skills

TECHNICAL SKILLS

Cloud Platforms: AWS (EMR Serverless, S3, Lambda, Redshift, DynamoDB, Glue, Athena, EC2, CloudWatch), Azure (Databricks, Data Factory, Synapse, Storage, Key Vault, Logic Apps)

Big Data Technologies: Apache Spark, PySpark, Spark SQL, QuickETL, Delta Lake, Databricks, Hadoop, Hive, HDFS, Kafka, EventBus, Airflow, Snowflake

ETL Tools: Azure Data Factory, AWS Glue, SSIS, Talend, QuickETL, Meghdoot, BPP (Batch Processing Pipeline)

Databases: PostgreSQL, SQL Server, Amazon Redshift, DynamoDB, HBase, Cassandra

Languages: Python, SQL, Scala, PL/SQL, T-SQL, HiveQL, Shell Scripting, HOCON

Data Modeling: Star Schema, Snowflake Schema, Dimensional Modeling, SCD Type-1/Type-2, CDC, Medallion Architecture

DevOps & Tools: Jenkins, CI/CD, Git, Docker, Terraform, CloudWatch, Unity Catalog

Methodologies: Agile, Scrum, Waterfall, Test-Driven Development (TDD)

Certification

Databricks certificate link: https://credentials.databricks.com/5b82f140-75fc-4180-8782-91ee9dad1adb
