
SAI AKHIL

Denton

Summary

Data Engineer with 8+ years of experience in designing, building, and optimizing large-scale data processing architectures across cloud-based infrastructures. Expertise in ETL development, real-time data streaming, API-driven data integration, and advanced analytics using AWS, Databricks, Spark, and SQL. Proven ability to migrate, manage, and optimize complex enterprise data from diverse sources, ensuring scalability, efficiency, and governance across data lakes, warehouses, and analytical platforms. Passionate about driving digital transformation through AI-powered analytics, real-time data processing, and cloud-native solutions, enabling data-driven decision-making and enterprise intelligence.

Overview

9 years of professional experience

Work History

Senior Data Engineer

Qorvo
Greensboro
09.2024 - Current
  • Project description: Led the migration of SAP SuccessFactors and SAP ECC (Master Data) data into Databricks, implementing Medallion Architecture for structured data transformation and governance
  • Developed scalable ETL pipelines using PySpark, SQL, and Delta Lake, optimizing data processing efficiency by 33% and reducing compute costs by 45%
  • Designed AWS S3-based cloud storage solutions to enhance query performance and cost-effective data retention while ensuring seamless data integration
  • Automated workflows using Redwoods, enabling efficient data ingestion, transformation, and monitoring with proactive alerting mechanisms
  • Integrated Unity Catalog for governance, ensuring data security, compliance, and controlled access to structured datasets
  • Built enterprise-wide reporting solutions by integrating Power BI with Unity Catalog Views, improving real-time business intelligence and decision-making
  • Developed a scalable API ingestion framework, streamlining multi-source data extraction and automation, while collaborating with cross-functional teams to drive data strategy alignment, innovation, and performance optimization within a cloud-based data ecosystem
  • Key Contributions:
  • Migrated SAP ECC and SAP SuccessFactors data into Databricks, leveraging PySpark and SQL for efficient extraction, transformation, and storage in Delta Lake
  • Designed scalable ETL pipelines to ensure data accuracy, consistency, and accessibility, supporting enterprise analytics and reporting
  • Processed General Ledger (GL), Accounts Payable (AP), Accounts Receivable (AR), Asset Accounting (AA), and Cost Center Accounting (CO) master data
  • Designed and implemented Medallion Architecture (Bronze → Silver → Gold) in Databricks, ensuring structured data transformation and optimized workflows
  • Integrated audit columns and watermarking in the Silver layer while consolidating aggregated insights in the Gold layer (a minimal PySpark sketch of this layering follows the environment list below)
  • Enhanced data governance, lineage tracking, and processing efficiency, enabling seamless integration with downstream applications and enterprise analytics
  • Fine-tuned Databricks clusters, PySpark jobs, and SQL transformations, achieving a 33% improvement in SAP ECC data processing efficiency and a 45% boost in PySpark job performance
  • Optimized Spark execution plans, partitioning strategies, and caching mechanisms, significantly reducing computation time and memory consumption
  • Architected a scalable AWS S3-based storage solution, integrating Databricks Delta Lake for transactional data processing and historical tracking
  • Implemented Unity Catalog for metadata management, data lineage tracking, and security compliance, ensuring structured governance and controlled access
  • Established role-based access control (RBAC), audit logging, and encryption policies, strengthening data security and compliance adherence
  • Developed automated ETL workflows using Redwoods and Databricks, reducing manual effort and streamlining data ingestion, transformation, and loading processes
  • Developed a scalable API ingestion framework to dynamically fetch multi-module SAP financial data (GL balances, AP/AR transactions, fixed asset details, profit center reports, and customer credit limits), processing it efficiently in Databricks and AWS S3 for real-time analytics and financial reporting
  • Responsibilities:
  • Designed and managed ETL pipelines in Databricks using PySpark, SQL, and Delta Lake, ensuring efficient data ingestion, transformation, and storage
  • Ensured end-to-end data integrity, from extraction to storage, by implementing data validation and error-handling mechanisms
  • Implemented and optimized Spark SQL queries to perform complex data transformations, aggregations, and joins within Databricks Delta Lake
  • Enhanced query performance and execution efficiency by leveraging adaptive query execution (AQE), partition pruning, and caching strategies, ensuring faster data retrieval and reduced computational overhead
  • Enforced data quality checks, audit columns, and watermarking techniques to ensure consistent and accurate datasets
  • Designed and implemented error-handling mechanisms, ensuring seamless recovery from unexpected data issues
  • Monitored data freshness, completeness, and accuracy, ensuring reliability in downstream analytics and decision-making
  • Tuned Databricks clusters and PySpark jobs, reducing compute costs while maintaining high data processing efficiency
  • Applied parallel processing techniques and Spark optimizations to handle large volumes of data efficiently
  • Implemented data security, lineage tracking, and governance policies using Unity Catalog, ensuring data integrity and regulatory compliance
  • Designed access control mechanisms and role-based permissions, ensuring that sensitive data remains secure while maintaining accessibility
  • Provided ongoing production support, resolving pipeline failures, bottlenecks, and performance issues to maintain high data availability
  • Built real-time monitoring dashboards to track ETL pipeline performance, proactively identifying anomalies and inefficiencies
  • Developed optimized data models, ensuring structured and well-organized data storage for reporting and analytics
  • Ensured scalability and maintainability by designing efficient data schemas for large-scale enterprise use cases
  • Developed dynamic API-based data extraction solutions, ensuring seamless integration with third-party and internal data sources
  • Implemented API request throttling and pagination techniques, ensuring efficient data retrieval and processing
  • Led data migration and modernization efforts, leveraging Databricks, AWS S3, Delta Lake, and PySpark to improve efficiency
  • Implemented data lifecycle management strategies, optimizing data retention, archiving, and access policies
  • Worked closely with data analysts and stakeholders to optimize data pipelines for better business insights
  • Provided technical guidance and data access solutions, enabling teams to leverage high-quality datasets for reporting
  • Ensured seamless communication between technical and business units, aligning data engineering efforts with enterprise goals
  • Environment: AWS, Microsoft SSIS, Databricks, Apache Spark, PySpark, Spark SQL, SQL, Delta Lake, Databricks Workflows, Databricks Dashboards, MS Excel, Redwoods Scheduling, SAP SuccessFactors, SAP ECC, SailPoint, REST APIs, API Gateway, JDBC, Databricks Unity Catalog, Metadata Management, Data Lineage Tracking, Power BI, Python, Shell Scripting, VS Code, Git, GitLab, CI/CD Integration.
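The snippet below is a minimal, illustrative PySpark/Delta Lake sketch of the Bronze → Silver → Gold layering and audit/watermark columns referenced in this role; the table names, S3 path, and columns (raw_gl_postings, document_id, cost_center, amount) are hypothetical placeholders, not the actual SAP ECC schema used on the project.

  # Minimal Bronze -> Silver -> Gold sketch in PySpark with Delta Lake.
  # All table names, paths, and columns below are illustrative placeholders.
  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("medallion_sketch").getOrCreate()

  # Bronze: land the raw extract as-is, adding only load metadata.
  bronze = (
      spark.read.parquet("s3://example-bucket/raw/gl_postings/")  # hypothetical path
      .withColumn("_ingested_at", F.current_timestamp())
  )
  bronze.write.format("delta").mode("append").saveAsTable("bronze.raw_gl_postings")

  # Silver: deduplicate and add audit/watermark columns for incremental loads.
  silver = (
      spark.table("bronze.raw_gl_postings")
      .dropDuplicates(["document_id", "line_item"])
      .withColumn("_processed_at", F.current_timestamp())  # audit column
      .withColumn("_watermark", F.col("_ingested_at"))      # high-water mark for incremental picks
  )
  silver.write.format("delta").mode("overwrite").saveAsTable("silver.gl_postings")

  # Gold: aggregate to the grain consumed by reporting (e.g., balance by cost center).
  gold = (
      spark.table("silver.gl_postings")
      .groupBy("cost_center", "fiscal_period")
      .agg(F.sum("amount").alias("total_amount"))
  )
  gold.write.format("delta").mode("overwrite").saveAsTable("gold.gl_balance_by_cost_center")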

Senior Data Engineer

Freddie Mac
Richmond
10.2022 - 08.2024
  • Project description: Spearheaded the design and development of scalable data processing frameworks that integrate seamlessly with AWS services, while orchestrating the migration of on-prem databases to Amazon S3, AWS Glue, and Amazon RDS
  • Expertise spanned ETL processes, Hadoop infrastructure, and Spark development, optimizing jobs for real-time data processing
  • Orchestrated microservices deployment with Kafka, Docker, and Kubernetes while ensuring robust data governance through Databricks Unity Catalog
  • Built data pipelines to extract, transform, and load data into Amazon EMR, which in turn served as source data for multiple downstream systems and processes
  • Performed data wrangling, massaging, and enrichment in the data lake staging area
  • The role also extended to CI/CD pipeline implementation, cluster management, and enforcement of stringent security measures, reflecting a comprehensive skill set in cloud-native and big data technologies
  • Key Contributions:
  • Designed and developed scalable data frameworks, migrating on-prem databases to AWS services like S3, Glue, and RDS, while integrating Snowflake for cloud warehousing
  • Built and optimized Spark applications using Scala and PySpark on Hadoop clusters for real-time and batch data processing
  • Leveraged Kafka, Docker, Kubernetes, and Spark to process real-time streaming data efficiently, ensuring low-latency event handling
  • Created robust ETL/ELT pipelines using AWS Glue, SQL, Spark SQL, Airflow, and dbt, transforming raw data into structured datasets
  • Implemented validation checks, error handling, and governance standards in Airflow DAGs and dbt models to maintain data integrity (a minimal DAG sketch follows the environment list below)
  • Fine-tuned Spark jobs for performance optimization, improving resource utilization, parallelism, and memory tuning
  • Developed UDFs in PySpark and stored procedures to meet specific business needs
  • Developed CI/CD pipelines for automated testing, deployment, and integration of data processing workflows
  • Designed scalable data models and transformations, ensuring alignment with business logic and analytical requirements
  • Worked closely with Agile teams and business stakeholders to define requirements, validate data integrity, and maintain governance standards
  • Architected and deployed Hadoop clusters (HDFS, YARN), ensuring efficient storage, processing, and scalability of big data applications
  • Responsibilities:
  • Designed and developed scalable data frameworks, migrating on-prem databases to AWS services like S3, Glue, and RDS, while integrating Snowflake for cloud warehousing
  • Built and optimized Spark applications using Scala and PySpark on Hadoop clusters to process large-scale data efficiently
  • Leveraged Kafka, Docker, and Kubernetes to handle real-time streaming data and enable low-latency event-driven processing
  • Created ETL/ELT pipelines using AWS Glue, SQL, and Airflow, automating data ingestion, transformation, and loading across multiple sources
  • Implemented data validation checks, error handling, and governance frameworks using Airflow DAGs and dbt models to ensure data integrity
  • Developed optimized Spark jobs by fine-tuning resource allocation, parallelism, and memory usage to improve data pipeline efficiency
  • Designed and implemented data models, ensuring consistency and scalability to support business intelligence and analytics needs
  • Established CI/CD pipelines for automated testing and deployment of data workflows, improving development efficiency and reliability
  • Migrated legacy data pipelines to AWS and Snowflake, improving performance and reducing operational costs
  • Developed and maintained Kafka producers and consumers for seamless message processing and real-time data movement
  • Collaborated with Agile teams and business stakeholders to gather requirements, validate data integrity, and ensure compliance with governance policies
  • Architected and deployed Hadoop clusters, including HDFS and YARN, to manage distributed storage and large-scale data processing
  • Conducted performance tuning on Spark applications and database queries, optimizing execution time and resource utilization
  • Environment: Amazon RDS, AWS Glue, Amazon S3, AWS Data Pipeline, Amazon QuickSight, AWS Lambda, Amazon EMR, SQL, Spark, Python, NumPy, SciPy, Pytest, Pandas, Matplotlib, BeautifulSoup, TextBlob, Scala, Astro, YARN, dbt, Spark-SQL, PySpark, Pair RDDs, Snowflake, Spark MLlib, IAM, RDS, NiFi, Oozie, Hadoop (HDFS, Spark, Hive, HBase), Kafka, ETL, Real-time Data Processing, Airflow, Data Governance, Metadata Management, Power BI.
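As a hedged illustration of the Airflow-orchestrated ELT with validation gates described in this role, the sketch below shows a small DAG with an extract step, a row-count check, and a dbt run; the DAG id, task commands, and validation logic are assumptions for the example, not the project's actual pipeline.

  # Illustrative Airflow DAG: extract -> validate -> dbt transform.
  # Commands and the validation logic are placeholders, not production code.
  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash import BashOperator
  from airflow.operators.python import PythonOperator


  def validate_staged_data(**context):
      """Fail the run if the staged extract looks empty (placeholder check)."""
      staged_rows = 1_000  # in practice, query the staging table or S3 manifest here
      if staged_rows == 0:
          raise ValueError("Validation failed: staged extract is empty")


  with DAG(
      dag_id="example_elt_with_validation",
      start_date=datetime(2023, 1, 1),
      schedule_interval="@daily",
      catchup=False,
  ) as dag:
      extract = BashOperator(
          task_id="extract_to_s3",
          bash_command="echo 'run the Glue/Spark extract here'",  # placeholder command
      )
      validate = PythonOperator(
          task_id="validate_staged_data",
          python_callable=validate_staged_data,
      )
      transform = BashOperator(
          task_id="dbt_run_staging",
          bash_command="echo 'dbt run --select staging'",  # placeholder command
      )

      extract >> validate >> transform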

Senior Data Engineer

Optum, Hospitals and Health Care
Boston
06.2020 - 10.2022
  • The client maintained data across multiple sources, including SQL Server, React applications, big data ecosystems, IT automation tools, and Kubernetes
  • The Ingestion team, leveraging Apache NiFi, funneled this data into a centralized data lake
  • Comprehensive data analysis, tailored to the specific business requirements of each module, was conducted using Apache Spark
  • Played an instrumental role in designing, implementing, and optimizing large-scale data platforms and pipelines
  • The project involved ensuring the seamless operation of complex backend services, coordinating effective incident response, and utilizing various technologies for data processing and automation
  • Subsequently, the data extracts were stored in AWS S3 buckets
  • Redshift tables, built upon this S3 data, then drove the BI dashboards
  • Key Contributions:
  • Implemented robust monitoring solutions for large-scale data platforms, ensuring optimal performance
  • Conducted troubleshooting of data pipelines and complex backend services
  • Managed large datasets efficiently, optimizing storage and retrieval processes
  • Implemented data governance practices for data quality, integrity, and security
  • Implemented custom business logic and data transformations using Scala's functional programming features, such as higher-order functions and immutable data structures, to achieve scalable and composable data processing pipelines
  • Orchestrated data workflows and dependencies using Apache Airflow, configuring DAGs (Directed Acyclic Graphs) to automate and schedule data pipeline executions
  • Designed and implemented data pipeline monitoring and alerting mechanisms to ensure data integrity and timely execution
  • Demonstrated a deep understanding of Apache Spark, handling complex Spark jobs, and resolving performance issues
  • Implemented optimizations for Spark jobs, enhancing overall data processing efficiency
  • Developed and maintained automation tools using Python, streamlining repetitive tasks, and improving overall workflow
  • Automated data ingestion processes, reducing manual intervention and enhancing overall system reliability
  • Demonstrated expertise in Big Data ecosystems, including HDFS, Kafka, and SQL
  • Implemented data transformation rules, ensuring seamless movement of data between stages in the data lake
  • Utilized monitoring tools such as Splunk, Prometheus, and Grafana for real-time visibility into system performance
  • Implemented proactive measures based on insights gathered from monitoring tools
  • Utilized AWS Glue as the primary ETL tool, incorporating features like AWS Glue Crawler, Data Catalog, and Connections
  • Explored AWS Athena for convenient data analysis in Amazon S3 with standard SQL
  • Initiated AWS Lambda functions in Python to trigger scripts for large dataset transformations and analytics in EMR clusters
  • Responsibilities:
  • Designed, implemented, and managed scalable data pipelines for ingesting, transforming, and storing diverse data sets in a cloud-based infrastructure
  • Collaborated with cross-functional teams to understand data requirements and develop solutions meeting business objectives
  • Implemented robust error handling mechanisms and logging procedures to facilitate debugging and troubleshooting of Python code within the Databricks environment
  • Created Docker images for streamlined deployment processes, contributing to a more efficient CI/CD pipeline
  • Automated data ingestion processes, reducing manual intervention and enhancing overall system reliability
  • Optimized data workflows for efficiency, reliability, cost, and performance
  • Ensured data quality, integrity, and security throughout the data lifecycle
  • Established event-driven ETL pipelines with AWS Glue, triggered when new data lands in AWS S3 (a minimal trigger sketch follows the environment list below)
  • Initiated a data collection pipeline using Kafka and Spark Streaming, storing the resultant data in Cassandra with PySpark-driven transformations
  • Conducted performance tuning and optimization of dbt models and SQL queries in Snowflake, analyzing query execution plans and identifying opportunities for enhancing data processing efficiency
  • Conducted ELT operations utilizing tools like PySpark, SparkSQL, Hive, and Python on extensive datasets
  • Orchestrated data pipelines using Airflow for scheduling PySpark jobs and deployed Flume for weblog server data acquisition.
  • Collaborated with stakeholders to define data models and business logic within dbt, ensuring alignment with analytical requirements and best practices
  • Conducted code reviews and documentation updates to maintain a scalable and maintainable dbt project
  • Delved into data modeling and normalization techniques, enabling the loading of multi-source raw data in various storage formats into Data Lakes
  • Environment: Amazon RDS, AWS Glue, Amazon S3, AWS Lambda, Amazon EMR, ArgoCD, Jenkins, Apache NiFi, Apache Spark, Apache Airflow, Apache Kafka, React, dbt, Django, Scala, PySpark, SparkSQL, Hive, Python, Kubernetes, Docker, CI/CD pipelines, Aurora, Snowflake, Flume, Redshift, SQL Server, Splunk, Prometheus, Grafana, Machine Learning models, and Power BI.
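The sketch below is a minimal version of the event-driven pattern noted in this role: an AWS Lambda handler, fired by an S3 ObjectCreated notification, that starts a Glue job for the newly landed object. The Glue job name (example-ingest-job) and argument keys are hypothetical, not the project's actual resources.

  # Minimal Lambda handler for event-driven ETL: S3 event -> Glue job run.
  # Job name and argument keys are illustrative placeholders.
  import boto3

  glue = boto3.client("glue")


  def lambda_handler(event, context):
      """Start a Glue job for each newly created S3 object in the event."""
      for record in event.get("Records", []):
          bucket = record["s3"]["bucket"]["name"]
          key = record["s3"]["object"]["key"]

          response = glue.start_job_run(
              JobName="example-ingest-job",  # hypothetical Glue job
              Arguments={
                  "--source_bucket": bucket,
                  "--source_key": key,
              },
          )
          print(f"Started Glue run {response['JobRunId']} for s3://{bucket}/{key}")

      return {"statusCode": 200}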

Data Engineer

Publicis Sapient
Plano
01.2018 - 06.2020
  • As the client progressed towards digital automation encompassing Data Analytics, Application Development, Robotic Process Automation, AI, DevOps, and Test Automation Services, our objective was to establish a database for a new program and craft the respective schemas in the data warehouse, both relational and dimensional
  • Primarily, the project involved migrating multifarious data sources to a consolidated Azure SQL Database using Azure Data Factory, while executing the necessary ETL processes to bolster data loads and transformations
  • Key Contributions:
  • Developed multi-cloud data processing solutions with a focus on Azure services such as Azure Functions
  • Formulated PySpark applications on Azure HDInsight to process and analyze data sourced from emails, complaints, forums, and click streams, thereby ensuring a holistic customer care approach
  • Established comprehensive data quality assurance processes using Azure Data Factory and Azure Databricks, including data profiling, cleansing, and validation routines, ensuring the accuracy and reliability of analytical insights
  • Constructed data pipelines to process streamed data in Azure, aggregating over 10 different data sources
  • Automated ETL processes with PySpark & SparkSQL on Azure HDInsight clusters, facilitating both reporting and data transformations
  • Successfully migrated historical data from on-premises systems to Azure HDInsight and Azure Synapse Analytics
  • Developed comprehensive best practices guides for Azure data engineering processes, facilitating knowledge transfer, and ensuring consistency in development methodologies across the project lifecycle
  • Responsibilities:
  • Spearheaded the design and implementation of large-scale data solutions, predominantly using PySpark on Azure HDInsight integrated with Azure Data Factory
  • Delved into the PySpark framework on Azure HDInsight to enhance the performance and optimization of existing algorithms
  • Seamlessly ingested data from relational databases to Azure Blob Storage using Azure Data Factory
  • Employed Azure Functions and Azure Data Factory for data transformation, meeting advanced analytics requirements and channeling data for downstream applications
  • Integrated Azure Event Hubs for real-time data streaming and processing
  • Utilized Spark-Streaming APIs in conjunction with Azure Event Hubs to transform real-time data, subsequently persisting into Azure Cosmos DB and Azure Synapse Analytics
  • Managed large datasets using Partitions, effective Joins, and Transformations within Azure HDInsight
  • Expertly converted data between formats, such as from Avro to Parquet and vice versa, to ensure optimal data storage and processing in Azure (a minimal conversion sketch follows the environment list below)
  • Integrated Azure Synapse Analytics to access and query data in Azure Blob Storage, optimizing performance through efficient partitioning
  • Developed Spark applications to cater to diverse business logics, predominantly using Python
  • Streamlined DDLs for table generation in Azure Synapse Analytics as per the Field Mapping Document
  • Environment: Azure SQL Database, Azure Data Factory, Azure HDInsight, Azure Functions, Azure Event Hubs, Azure Synapse Analytics, Azure Cosmos DB, Azure Blob Storage, PySpark, SparkSQL, Kafka, Cassandra, PostgreSQL, Python, Scala.
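As a rough sketch of the Avro-to-Parquet conversion mentioned in this role (assuming the spark-avro package is available on the HDInsight cluster), the snippet below reads an Avro extract from Blob Storage and rewrites it as partitioned Parquet; the wasbs:// paths and the event_date partition column are placeholders.

  # Illustrative Avro -> Parquet conversion in PySpark; paths and columns are placeholders.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("avro_to_parquet_sketch").getOrCreate()

  # Read the Avro extract from Blob Storage (requires the spark-avro package).
  df = spark.read.format("avro").load(
      "wasbs://raw@exampleaccount.blob.core.windows.net/events/"
  )

  # Rewrite as partitioned Parquet for cheaper storage and faster analytical scans.
  (
      df.write.mode("overwrite")
      .partitionBy("event_date")  # assumes an event_date column exists in the source
      .parquet("wasbs://curated@exampleaccount.blob.core.windows.net/events/")
  )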

Data Engineer

Paylocity
Plano
08.2016 - 01.2018
  • Our focus was centered on the discipline of data quality management
  • We incorporated methods to quantitatively measure, enhance, and ensure the quality and integrity of the organization's data
  • The established framework provided a consistent approach for streamlined data-flow processes, focusing on data quality, evaluating data assets, and crafting standardized data documentation
  • Our overarching aim was continuous improvement in data quality within the Data Fabric
  • Key Contributions:
  • Pioneered the development of a robust data pipeline, integrating technologies like Spark, Hive, Impala, and HBase
  • Authored versatile Spark programs in Java, facilitating the extraction, transformation, and aggregation of data from varied file formats
  • Played a vital role in the ETL process, bridging the gap between OLTP and OLAP databases, enhancing the capabilities of the Decision Support Systems
  • Expertly transitioned tables and applications to the AWS cloud infrastructure, particularly focusing on AWS S3
  • Spearheaded real-time data processing initiatives using Spark Streaming, integrating Kafka for an optimized data pipeline system
  • Responsibilities:
  • Managed various data types, encompassing unstructured (like logs, clickstreams), semi-structured (such as XML, JSON), and structured data from RDBMS platforms
  • Employed Python for importing metadata into Hive and actively engaged in the migration process to AWS
  • Crafted Python scripts tailored for AWS, manipulating resources via API calls using the BOTO3 SDK, and efficiently utilized the AWS CLI (a minimal BOTO3 sketch follows the environment list below)
  • Implemented and monitored CI/CD pipelines, leveraging tools such as Maven and GitHub within the AWS environment
  • Integrated Oozie with the broader Hadoop stack, supporting diverse Hadoop jobs ranging from MapReduce and Hive to Sqoop, while also handling system-specific tasks
  • Environment: Spark, Hive, Impala, HBase, Java, OLTP, OLAP, AWS (particularly AWS S3), Spark Streaming, Kafka, Unstructured Data Formats (logs, clickstreams), Semi-Structured Data Formats (XML, JSON), RDBMS, Python, BOTO3 SDK, AWS CLI, Maven, GitHub, Oozie, MapReduce, Sqoop.
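A minimal sketch of the kind of BOTO3-driven S3 automation described in this role: uploading local extract files and listing what has landed under a prefix. The bucket and prefix names are made-up examples, not the client's actual resources.

  # Illustrative BOTO3 script for S3 automation; bucket and prefix are placeholders.
  import boto3

  s3 = boto3.client("s3")

  BUCKET = "example-migration-bucket"  # hypothetical bucket
  PREFIX = "hive/metadata/"            # hypothetical prefix


  def upload_extract(local_path: str, key: str) -> None:
      """Upload a local file to S3 under the migration prefix."""
      s3.upload_file(local_path, BUCKET, PREFIX + key)


  def list_landed_objects() -> None:
      """Print the objects currently stored under the migration prefix."""
      paginator = s3.get_paginator("list_objects_v2")
      for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
          for obj in page.get("Contents", []):
              print(obj["Key"], obj["Size"])


  if __name__ == "__main__":
      upload_extract("tables_metadata.json", "tables_metadata.json")
      list_landed_objects()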

Education

Bachelor of Technology - Computer Science

University of North Texas
05.2016

Skills

  • Python
  • BOTO3 SDK
  • Pandas
  • Scikit-learn
  • Scala
  • Java
  • SQL
  • PySpark
  • AWS
  • EMR
  • EC2
  • RDS
  • Athena
  • Lambda
  • S3
  • Redshift
  • DynamoDB
  • Glue
  • Kinesis
  • Data Pipeline
  • API Gateway
  • Azure
  • SQL Database
  • Data Factory
  • HDInsight
  • Azure Functions
  • Azure Event Hubs
  • Azure Synapse Analytics
  • Cosmos DB
  • Blob Storage
  • Snowflake
  • Hadoop
  • HDFS
  • MapReduce
  • YARN
  • Spark
  • SparkSQL
  • Kafka
  • Hive
  • Sqoop
  • Flume
  • Oozie
  • Impala
  • HBase
  • AWS Glue
  • DataBricks
  • Apache NiFi
  • AWS Data Pipeline
  • BODS
  • MySQL
  • MS-SQL
  • Teradata
  • DBeaver
  • Amazon RDS
  • Aurora
  • Oracle DB
  • MongoDB
  • Salesforce
  • Amazon QuickSight
  • Tableau
  • Power BI
  • RESTful Services
  • Spring Boot
  • Spark Streaming
  • PySpark MLlib
  • Apache Airflow
  • Jenkins
  • Redwoods
  • Git
  • Bitbucket
  • GitLab
  • JIRA
  • Microsoft Office Suite
  • Unix
  • Red Hat Linux
  • Windows
  • MacOS
  • Maven
  • CI/CD pipelines
  • Data Lakes
  • Snowflake Design
  • Physical Models
  • Logical Models
  • Schema Design
  • Complex SQL Queries
  • Window Functions
  • Performance Tuning
  • Postman
  • Jupyter

Timeline

Senior Data Engineer

Qorvo
09.2024 - Current

Senior Data Engineer

Freddie Mac
10.2022 - 08.2024

Senior Data Engineer

Optum, Hospitals and Health Care
06.2020 - 10.2022

Data Engineer

Publicis Sapient
01.2018 - 06.2020

Data Engineer

Paylocity
08.2016 - 01.2018

Bachelor of Technology - Computer Science

University of North Texas
05.2016