Hi, I’m

Balaji Digala

Irving, TX
Overview

6 years of professional experience
1 certification

Summary

Data Engineer with 6 years of experience developing, optimizing, and automating complex ETL/ELT pipelines on AWS, Azure, Spark, Hadoop, and Snowflake. Skilled in Python and Scala for advanced data transformation and analysis. AWS Certified Solutions Architect - Associate with hands-on experience in S3, DynamoDB, Glue, EMR, ECS, IAM, EC2, and Lambda. Proficient in data warehousing with Snowflake and Redshift, and in relational (MySQL, PostgreSQL) and NoSQL (MongoDB, DynamoDB, HBase) databases. Expert in ETL tools such as Talend and Informatica, as well as in real-time data streaming and processing with Apache Kafka and Spark Streaming. Experienced in deploying containerized applications with Docker and Kubernetes and in managing infrastructure as code with Terraform. Strong background in project management and SDLC methodologies, using JIRA, Git, and Jenkins for CI/CD alongside Agile practices and innovation strategies.

Skills

  • Python, SQL, Scala, Linux
  • AWS, Azure
  • Apache Spark, Airflow
  • Spark SQL, PySpark
  • Apache Kafka, Hadoop
  • Snowflake, Databricks
  • MySQL, SQL Server
  • PostgreSQL, Teradata
  • MongoDB, HBase
  • Cassandra, DynamoDB
  • Informatica, Talend
  • CSV, JSON, Parquet, XML

Work History

American Airlines

Data Engineer
01.2024 - Current

Job overview

  • Developed a real-time enterprise data processing system using Apache Kafka and Apache Spark in Scala, streamlining analysis of streaming data from diverse external sources.
  • Implemented Spark SQL scripts in Databricks and Scala for batch processing jobs to extract, transform, and aggregate data from multiple file formats, reducing processing time by 30%.
  • Engineered data pipeline integrations, ETL processes, and comprehensive ingestion from source systems to Azure Blob Storage, Azure Data Lake, SQL, and Azure Synapse Analytics using Azure Data Factory, T-SQL, Spark SQL, and U-SQL.
  • Performed DAG (directed acyclic graph) lineage tracking using Airflow for transparency and traceability in data transformations across Azure, Databricks, and Snowflake (a minimal DAG sketch follows this role's environment line).
  • Designed a relational database system and performed logical modeling using dimensional modeling techniques such as Star and Snowflake schemas.
  • Configured Docker containers to streamline development workflows, used Git for version control, managed tasks and bugs with Jira, and employed Agile (Scrum) methodologies, improving project delivery and governance efficiency by 30%.

Environment: Apache Spark, Spark SQL, Databricks, Scala, MapReduce, Azure, Tableau, Power BI, Python, Apache Airflow, Apache Kafka, Docker, Hive, Git, Jira, SQL, MongoDB, Agile.
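
For illustration only, a minimal Airflow DAG of the kind referenced above, assuming Airflow 2.x; the DAG ID, task names, and step bodies are hypothetical stand-ins, not the production pipeline.

    # Minimal Airflow DAG sketch: a Databricks-style transform followed by a
    # Snowflake-style load. All names here are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_transform(**context):
        # Stand-in for the Spark SQL / Databricks transformation step.
        print("transforming data for", context["ds"])

    def load_to_snowflake(**context):
        # Stand-in for the Snowflake load step.
        print("loading snapshot for", context["ds"])

    with DAG(
        dag_id="daily_ingest",              # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        transform = PythonOperator(task_id="transform", python_callable=run_transform)
        load = PythonOperator(task_id="load", python_callable=load_to_snowflake)
        transform >> load                   # lineage: transform runs before load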

Paychex

Data Engineer
02.2022 - 12.2022

Job overview

  • Devised Scala scripts and UDFs using DataFrames and RDDs in Spark for data cleansing and aggregation and for writing results back into an S3 bucket, resulting in a 50% reduction in processing time (a PySpark equivalent is sketched below).
  • Accelerated data processing from Hive tables using PySpark, Spark SQL, and MapReduce on HDFS to cleanse heterogeneous data and enhance data retrieval speed for analytical insights.
  • Orchestrated robust ETL pipelines using Azure Data Factory, streamlining data flow from Data Lake to multiple databases using stored procedures, data flows, and Azure Functions.
  • Configured Azure PolyBase for efficient data extraction from Azure Data Lake, streamlining data workflows and improving data integration speed by 20%.
  • Generated interactive reports and dashboards using Power BI and Tableau, improving business decision-making by providing real-time insights into POS and operational data.

Environment: Scala, Apache Spark, Python, S3, Hive, PySpark, Spark SQL, RDD, MapReduce, HDFS, Azure Data Factory, Azure Data Lake, Azure Functions, Hadoop, Kafka, Apache Airflow.
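
For illustration only, a minimal PySpark sketch of the cleanse-aggregate-write pattern described above (the original Scala scripts are not reproduced here); the bucket paths and column names are hypothetical.

    # PySpark sketch: cleanse raw events, aggregate, and write back to S3.
    # Paths and columns are hypothetical stand-ins.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cleanse_aggregate").getOrCreate()

    raw = spark.read.parquet("s3a://example-raw-bucket/events/")  # hypothetical path

    cleaned = (
        raw.dropDuplicates(["event_id"])             # remove duplicate events
           .filter(F.col("amount").isNotNull())      # drop incomplete records
           .withColumn("event_date", F.to_date("event_ts"))
    )

    daily_totals = cleaned.groupBy("event_date").agg(
        F.sum("amount").alias("total_amount")
    )

    # Write the aggregate back to S3, partitioned by date for faster reads.
    daily_totals.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3a://example-curated-bucket/daily_totals/"
    )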

Western Union

Data Engineer
03.2020 - 01.2022

Job overview

  • Developed a Python-based RESTful web service to track revenue and perform analysis (a minimal sketch follows this role's environment line).
  • Leveraged AWS to build scalable, cloud-based data solutions, utilizing services like EC2, S3, Redshift, and EMR (Elastic Map Reduce) to manage and process data efficiently.
  • Integrated AWS Glue for data cataloging and Informatica for ETL jobs, ensuring data quality.
  • Modeled data warehouses and marts for effective data management using Kimball methodology, Facts, Dimensions, SCDs, Surrogate Keys, Star schema, and Snowflake schema.
  • Orchestrated end-to-end data pipelines on Snowflake, integrating batch and streaming data for real-time analytics with Snowpipe and Streams, reducing data processing time by 40%.
  • Implemented Amazon Elastic Kubernetes Service (EKS) scheduling to automate application deployment in the cloud using Docker automation techniques.
  • Executed unit testing, validations, and debugging to ensure reliable data solutions.

Environment: Python, Amazon Elastic Kubernetes Service, Informatica, ETL, Power BI, Tableau, AWS, Snowflake, RESTful, Docker, AWS Glue Data Catalog, MongoDB, SQL, AWS ECS.
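
For illustration only, a minimal sketch of a Python RESTful revenue-tracking endpoint of the kind described above, with Flask as an assumed framework; the routes, payload fields, and in-memory store are hypothetical.

    # Flask sketch of a small revenue-tracking REST service.
    # Routes, fields, and storage are hypothetical stand-ins.
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    revenue_records = []  # stand-in for a real database

    @app.route("/revenue", methods=["POST"])
    def add_revenue():
        # Store one revenue record posted as JSON.
        record = request.get_json()
        revenue_records.append(record)
        return jsonify({"stored": len(revenue_records)}), 201

    @app.route("/revenue/total", methods=["GET"])
    def total_revenue():
        # Aggregate the stored records for a simple analysis endpoint.
        total = sum(r.get("amount", 0) for r in revenue_records)
        return jsonify({"total": total})

    if __name__ == "__main__":
        app.run(debug=True)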

Web Affinity Technologies Pvt Ltd

Data Engineer
08.2017 - 02.2020

Job overview

  • Deployed data pipelines using AWS Kinesis for real-time streaming, S3 for raw data storage, and Lambda for serverless processing (a handler sketch follows this role's environment line). Leveraged Redshift for structured data warehousing and fast queries, and DynamoDB for scalable NoSQL storage and retrieval.
  • Designed and reviewed processes to optimize the ETL pipeline architecture and codebase using Spark and Hive, including daily runs, error handling, and logging of useful metadata.
  • Employed Hadoop for distributed storage and processing of large data sets, improving efficiency and scalability of data ingestion and transformation workflows.
  • Pioneered Pig scripts for analysis of semi-structured data; used Pig as an ETL tool for transformations, event joins, filters, and pre-aggregations before ingesting data into HDFS.
  • Optimized SQL queries through indexing and tuning; created stored procedures, triggers, functions, and views for real-time analytics in Oracle, SQL Server, and MySQL.
  • Created Tableau dashboards and reports for data visualization, reporting, and analysis; employed Power Query in Power BI to pivot and unpivot data models for data cleansing.
  • Utilized Terraform to automate provisioning and management of cloud infrastructure, ensuring consistency, repeatability, and scalability across AWS and Azure environments.

Environment: AWS Kinesis, AWS S3, AWS EMR, AWS Lambda, Redshift, DynamoDB, Spark, Windows, Hive, Hadoop, Pig, Oracle, Microsoft SQL Server, MySQL, Tableau, Power BI, Terraform, Agile.
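
For illustration only, a minimal sketch of an AWS Lambda handler consuming a Kinesis stream and writing to DynamoDB, as in the pipeline described above; the table name and payload shape are hypothetical.

    # Lambda handler sketch: decode Kinesis records and persist them to DynamoDB.
    # The table name and record fields are hypothetical stand-ins.
    import base64
    import json

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("events")  # hypothetical table name

    def handler(event, context):
        # Kinesis delivers record payloads base64-encoded inside the Lambda event.
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            table.put_item(Item=payload)
        return {"processed": len(event["Records"])}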

Education

Bradley University
Peoria, IL

Master of Science in Computer Science
05.2024