Bindu Bala

Hillsboro, USA

Summary

Senior Data Engineer with 6+ years of experience designing and developing scalable ETL pipelines and delivering end-to-end data solutions across cloud and big data platforms. Proficient in Spark, Databricks, Snowflake, Hive, and AWS for processing large-scale datasets to drive actionable business insights. Experienced in data visualization with Tableau and Power BI, with strong analytical skills. Quick to adapt to new technologies, with growing interest and hands-on learning in emerging fields such as Gen AI and ML.

Overview

7 years of professional experience
1 Certification

Work History

DATA ENGINEER

DIRECTV
02.2024 - Current
  • Designed, developed, and maintained ETL pipelines to support Base Management and Retention KPIs, ensuring seamless analytics continuity and daily operational efficiency.
  • Designed and implemented robust, scalable ETL frameworks to migrate terabytes of data from on-premise Teradata systems into AWS S3 and Snowflake using Azure Databricks, improving data accessibility and enabling modern cloud-based analytics.
  • Implemented end-to-end Lakehouse platform using Azure Databricks and Delta Lake to build scalable batch-oriented data pipelines with incremental data loading, orchestrated through Databricks Workflows, improving pipeline reliability.
  • Built and optimized big data solutions using PySpark, Python, Spark SQL, and Scala, supporting diverse analytics domains such as spend analysis, reconnects, base management, viewership trends, clickstream data, churn analysis, base package migrations, equipment migrations, and call center analytics.
  • Optimized Spark performance and tuned pipelines to enhance data quality, processing efficiency, and system reliability, supporting consistent SLA adherence.
  • Designed interactive reports and dashboards in Tableau to visualize key metrics and performance indicators for Reconnects, App Engagement, Base Package Migrations, Movers, Equipment Migrations, and Protection Plans.
  • Collaborated closely with business stakeholders in agile environments to deliver aggregated datasets and accurate, timely insights that supported data-driven decisions and strategic retention initiatives across satellite and streaming TV products.
  • Environment: Spark, Databricks, PySpark, Spark SQL, Python, AWS S3, Hive, Snowflake, Teradata, Azure Data Factory, Lakehouse Platform, Tableau, Power BI.

DATA ENGINEER

COLUMBIA
02.2023 - 01.2024
  • Developed and maintained scalable data ingestion and processing pipelines using Apache PySpark, Spark SQL, Hive, Snowflake, and AWS services such as EMR and EC2.
  • Improved system performance by 40% through PySpark optimization techniques, including data shuffling reduction, broadcast joins, dataframe caching, and partitioning strategies.
  • Orchestrated ETL workflows using Apache Airflow, enabling automated and reliable data movement and transformations.
  • Automated historical data loads (from 2015 onward) for Sales Data Products, significantly reducing load durations and manual intervention.
  • Built Spark-based transformations and aggregations, storing results in Parquet format for efficient downstream consumption.
  • Leveraged Azure Databricks for managing notebooks, workflows, Delta tables, and Lakehouse architecture to enable unified data processing.
  • Designed and managed data pipelines using Databricks and Azure Data Factory, including migration of legacy EMR-based jobs to Databricks, improving maintainability and cost-efficiency.
  • Utilized Snowflake as the data consumption layer, creating views and external tables to deliver datasets to business stakeholders.
  • Proficient in Spark performance tuning techniques such as partitioning, broadcast joins, memory management, and in-memory processing.
  • Created and managed Hive external/managed tables, implemented partitions, buckets, UDFs, and applied compression and performance optimization techniques.
  • Used Sqoop for importing data into Hive, crafting complex queries for analysis and data transformation.
  • Investigated and resolved data issues through root cause analysis, log monitoring, and error tracing to ensure data accuracy and consistency.
  • Participated in all phases of the development lifecycle: requirement analysis, technical design, implementation, testing, and deployment.
  • Experienced in using Git for version control and Jenkins for CI/CD automation.
  • Actively engaged in Agile Scrum practices, including sprint planning, daily stand-ups, and iterative delivery.
  • Environment: Spark, Databricks, AWS S3, PySpark, Spark SQL, Hive, Azure Data Factory, Airflow, EMR, Snowflake, Hadoop, HDFS, Sqoop, GitHub, Oracle, Jenkins, Amazon Redshift, Amazon DynamoDB, Agile, Unix/Linux

DATA ENGINEER

RISHENNYA SERVICES PVT LIMITED
11.2018 - 12.2022
  • Worked with EMR clusters and S3 storage in the AWS cloud.
  • Developed ETL frameworks for data using PySpark.
  • Wrote Hive/SQL queries and performed Spark transformations using Spark RDDs and Python (PySpark).
  • Performed fine-tuning of Spark applications/jobs to improve the efficiency and overall processing time for the pipelines.
  • Optimized Hive queries using various file formats such as Parquet, JSON, and Avro.
  • Extensively used Sqoop to import/export data between RDBMS and Hive tables, including incremental imports, and created Sqoop jobs that resume from the last saved value.
  • Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
  • Implemented several Batch Ingestion jobs for Historical data migration from various relational databases and files using Sqoop.
  • Created user-defined functions (UDFs) in Hive.
  • Involved in loading the structured and semi-structured Data into spark clusters using Spark SQL and Data Frames API.
  • Built a proof of concept for ETL pipelines using Apache NiFi.
  • Used NiFi to import the data and apply the basic transformations using community and custom-built processors.
  • Implemented the workflows using the Apache Oozie framework to automate tasks.
  • Experienced in development activities in a fully Agile model using JIRA and Git.
  • Environment: Spark, Python, Spark SQL, AWS, S3, AWS EMR, Sqoop, Hive, MySQL, Avro, Parquet, ORC, Oracle, PySpark, Oozie, and NiFi.

DATA ANALYST

SMARK TECHNOLOGIES
02.2018 - 10.2018
  • Spearheaded a data-driven initiative to optimize retail sales performance through in-depth analysis of transactional data using SQL.
  • Developed and deployed dynamic reports using Power BI, providing real-time insights into product sales, customer behavior, and inventory turnover, leading to a 15% increase in overall sales.
  • Implemented DAGs to streamline inventory data workflows, reducing errors and improving inventory turnover by 20%.
  • Collaborated with cross-functional teams to deliver on-time projects and initiatives.

Education

Master of Science - Business Analytics

Grand Canyon University
Phoenix, AZ

Skills

  • Python
  • SQL
  • PySpark
  • R
  • Data Modeling
  • ETL
  • Apache Airflow
  • Big Data Technologies
  • Azure Databricks
  • Tableau
  • Snowflake
  • Azure Machine Learning
  • AWS Services (EMR, S3, EC2, Athena)
  • Data Visualization
  • Data Analysis
  • Advanced Excel

Certification

  • Databricks Certified Associate Developer for Apache Spark 3.0 - Credential ID 89102552
  • Data Analysis Using PySpark - Coursera, 2021

