Data Engineer with 9+ years of experience in designing, building, and optimizing large-scale data processing architectures across cloud-based infrastructures. Expertise in ETL development, real-time data streaming, API-driven data integration, and advanced analytics using AWS, Databricks, Spark, and SQL. Proven ability to migrate, manage, and optimize complex enterprise data from diverse sources, ensuring scalability, efficiency, and governance across data lakes, warehouses, and analytical platforms. Passionate about driving digital transformation through AI-powered analytics, real-time data processing, and cloud-native solutions, enabling data-driven decision-making and enterprise intelligence.
Overview
9 years of professional experience
Work History
Senior Data Engineer
Qorvo
Greensboro
09.2024 - Current
Project description: Led the migration of SAP SuccessFactors and SAP ECC (Master Data) data into Databricks, implementing Medallion Architecture for structured data transformation and governance
Developed scalable ETL pipelines using PySpark, SQL, and Delta Lake, optimizing data processing efficiency by 33% and reducing compute costs by 45%
Designed AWS S3-based cloud storage solutions to enhance query performance and cost-effective data retention while ensuring seamless data integration
Automated workflows using Redwoods, enabling efficient data ingestion, transformation, and monitoring with proactive alerting mechanisms
Integrated Unity Catalog for governance, ensuring data security, compliance, and controlled access to structured datasets
Built enterprise-wide reporting solutions by integrating Power BI with Unity Catalog Views, improving real-time business intelligence and decision-making
Developed a scalable API ingestion framework, streamlining multi-source data extraction and automation, while collaborating with cross-functional teams to drive data strategy alignment, innovation, and performance optimization within a cloud-based data ecosystem
Key Contributions:
Migrated SAP ECC and SAP SuccessFactors data into Databricks, leveraging PySpark and SQL for efficient extraction, transformation, and storage in Delta Lake
Designed scalable ETL pipelines to ensure data accuracy, consistency, and accessibility, supporting enterprise analytics and reporting
Processed General Ledger (GL), Accounts Payable (AP), Accounts Receivable (AR), Asset Accounting (AA), and Cost Center Accounting (CO) master data
Designed and implemented Medallion Architecture (Bronze → Silver → Gold) in Databricks, ensuring structured data transformation and optimized workflows
Integrated audit columns and watermarking in the Silver layer while consolidating aggregated insights in the Gold layer (see the PySpark sketch after this list)
Enhanced data governance, lineage tracking, and processing efficiency, enabling seamless integration with downstream applications and enterprise analytics
Fine-tuned Databricks clusters, PySpark jobs, and SQL transformations, achieving a 33% improvement in SAP ECC data processing efficiency and a 45% boost in PySpark job performance
Optimized Spark execution plans, partitioning strategies, and caching mechanisms, significantly reducing computation time and memory consumption
Architected a scalable AWS S3-based storage solution, integrating Databricks Delta Lake for transactional data processing and historical tracking
Implemented Unity Catalog for metadata management, data lineage tracking, and security compliance, ensuring structured governance and controlled access
Established role-based access control (RBAC), audit logging, and encryption policies, strengthening data security and compliance adherence
Developed automated ETL workflows using Redwoods and Databricks, reducing manual effort and streamlining data ingestion, transformation, and loading processes
Developed a scalable API ingestion framework to dynamically fetch multi-module SAP financial data (GL balances, AP/AR transactions, fixed asset details, profit center reports, and customer credit limits), processing it efficiently in Databricks and AWS S3 for real-time analytics and financial reporting
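Below is a minimal PySpark/Delta Lake sketch of the Silver-layer pattern referenced above (audit columns plus an incremental watermark). The table names, business key, and watermark column (bronze.sap_ecc_gl, silver.sap_ecc_gl, document_id, last_modified) are placeholders, not taken from this resume, and it assumes a Spark 3.3+/Databricks runtime.

```python
from pyspark.sql import SparkSession, functions as F

# Assumed table names and watermark column; the real names differ per project.
BRONZE_TABLE = "bronze.sap_ecc_gl"
SILVER_TABLE = "silver.sap_ecc_gl"
WATERMARK_COL = "last_modified"

spark = SparkSession.builder.appName("silver-load").getOrCreate()

# High-water mark from the previous Silver load (incremental-load pattern).
last_watermark = (
    spark.table(SILVER_TABLE).agg(F.max(WATERMARK_COL)).first()[0]
    if spark.catalog.tableExists(SILVER_TABLE)
    else None
)

bronze_df = spark.table(BRONZE_TABLE)
if last_watermark is not None:
    bronze_df = bronze_df.filter(F.col(WATERMARK_COL) > F.lit(last_watermark))

# Add audit columns before writing to the Silver layer.
silver_df = (
    bronze_df.dropDuplicates(["document_id"])          # assumed business key
    .withColumn("etl_load_ts", F.current_timestamp())
    .withColumn("etl_source", F.lit("SAP_ECC"))
)

# Append to the Delta table backing the Silver layer.
silver_df.write.format("delta").mode("append").saveAsTable(SILVER_TABLE)
```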
Responsibilities:
Designed and managed ETL pipelines in Databricks using PySpark, SQL, and Delta Lake, ensuring efficient data ingestion, transformation, and storage
Ensured end-to-end data integrity, from extraction to storage, by implementing data validation and error-handling mechanisms
Implemented and optimized Spark SQL queries to perform complex data transformations, aggregations, and joins within Databricks Delta Lake
Enhanced query performance and execution efficiency by leveraging adaptive query execution (AQE), partition pruning, and caching strategies, ensuring faster data retrieval and reduced computational overhead
Enforced data quality checks, audit columns, and watermarking techniques to ensure consistent and accurate datasets
Designed and implemented error-handling mechanisms, ensuring seamless recovery from unexpected data issues
Monitored data freshness, completeness, and accuracy, ensuring reliability in downstream analytics and decision-making
Tuned Databricks clusters and PySpark jobs, reducing compute costs while maintaining high data processing efficiency
Applied parallel processing techniques and Spark optimizations to handle large volumes of data efficiently
Implemented data security, lineage tracking, and governance policies using Unity Catalog, ensuring data integrity and regulatory compliance
Designed access control mechanisms and role-based permissions, ensuring that sensitive data remains secure while maintaining accessibility
Provided ongoing production support, resolving pipeline failures, bottlenecks, and performance issues to maintain high data availability
Built real-time monitoring dashboards to track ETL pipeline performance, proactively identifying anomalies and inefficiencies
Developed optimized data models, ensuring structured and well-organized data storage for reporting and analytics
Ensured scalability and maintainability by designing efficient data schemas for large-scale enterprise use cases
Developed dynamic API-based data extraction solutions, ensuring seamless integration with third-party and internal data sources
Implemented API request throttling and pagination techniques, ensuring efficient data retrieval and processing (see the sketch after this list)
Led data migration and modernization efforts, leveraging Databricks, AWS S3, Delta Lake, and PySpark to improve efficiency
Implemented data lifecycle management strategies, optimizing data retention, archiving, and access policies
Worked closely with data analysts and stakeholders to optimize data pipelines for better business insights
Provided technical guidance and data access solutions, enabling teams to leverage high-quality datasets for reporting
Ensured seamless communication between technical and business units, aligning data engineering efforts with enterprise goals
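A minimal sketch of the paginated, throttled API extraction pattern mentioned above. The endpoint, auth scheme, page size, and response shape are hypothetical stand-ins; the actual SAP/SailPoint APIs are not documented here.

```python
import time
import requests

# Hypothetical endpoint and parameters for illustration only.
BASE_URL = "https://api.example.com/v1/records"
PAGE_SIZE = 500
REQUEST_INTERVAL_SECONDS = 0.5  # simple client-side throttling


def fetch_all_pages(session: requests.Session, token: str) -> list[dict]:
    """Fetch every page of a paginated REST endpoint with basic throttling."""
    records, offset = [], 0
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        resp = session.get(
            BASE_URL,
            headers=headers,
            params={"limit": PAGE_SIZE, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        page = resp.json().get("items", [])
        records.extend(page)
        if len(page) < PAGE_SIZE:              # last page reached
            break
        offset += PAGE_SIZE
        time.sleep(REQUEST_INTERVAL_SECONDS)   # throttle between requests
    return records


if __name__ == "__main__":
    with requests.Session() as session:
        rows = fetch_all_pages(session, token="<redacted>")
        print(f"Fetched {len(rows)} records")
```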
Environment: AWS, Microsoft SSIS, Databricks, Apache Spark, PySpark, Spark SQL, SQL, Delta Lake, Databricks Workflows, Databricks Dashboards, MS Excel, Redwoods Scheduling, SAP SuccessFactors, SAP ECC, SailPoint, REST APIs, API Gateway, JDBC, Databricks Unity Catalog, Metadata Management, Data Lineage Tracking, Power BI, Python, Shell Scripting, VS Code, Git, GitLab, CI/CD Integration.
Senior Data Engineer
Freddie Mac
Richmond
10.2022 - 08.2024
Project description: Spearheaded the design and development of scalable data processing frameworks, integrating seamlessly with AWS services while orchestrating the migration of on-prem databases to Amazon S3, AWS Glue, and Amazon RDS
Expertise spans ETL processes, Hadoop infrastructure, and Spark development, optimizing jobs for real-time data processing
Leveraging Kafka, Docker, and Kubernetes, orchestrated microservices deployment while ensuring robust data governance through Databricks Unity Catalog
Built data pipelines to extract, transform, and load data into Amazon EMR, which in turn served as source data for multiple downstream systems and processes
Performed data wrangling, massaging, and enrichment in the Data Lake staging area
The role extended to CI/CD pipeline implementation, cluster management, and enforcing stringent security measures, showcasing a comprehensive skill set in cloud-native and big data technologies
Key Contributions:
Designed and developed scalable data frameworks, migrating on-prem databases to AWS services like S3, Glue, and RDS, while integrating Snowflake for cloud warehousing
Built and optimized Spark applications using Scala and PySpark on Hadoop clusters for real-time and batch data processing
Leveraged Kafka, Docker, Kubernetes, and Spark to process real-time streaming data efficiently, ensuring low-latency event handling
Created robust ETL/ELT pipelines using AWS Glue, SQL, Spark SQL, Airflow, and dbt, transforming raw data into structured datasets
Implemented validation checks, error handling, and governance standards in Airflow DAGs and dbt models to maintain data integrity (see the DAG sketch after this list)
Fine-tuned Spark jobs for performance optimization, improving resource utilization, parallelism, and memory tuning
Developed UDFs in PySpark and stored procedures to meet specific business needs
Developed CI/CD pipelines for automated testing, deployment, and integration of data processing workflows
Designed scalable data models and transformations, ensuring alignment with business logic and analytical requirements
Worked closely with Agile teams and business stakeholders to define requirements, validate data integrity, and maintain governance standards
Architected and deployed Hadoop clusters (HDFS, YARN), ensuring efficient storage, processing, and scalability of big data applications
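A minimal Airflow 2-style sketch of the validation and error-handling pattern referenced above. The DAG ID, schedule, and task callables are hypothetical placeholders for the real Glue/dbt/Spark steps.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables standing in for the real extract/validate/load logic.
def extract(**_):
    print("extract from source systems")


def validate(**_):
    # Raising an exception fails the task and triggers retries/alerts.
    row_count = 100  # placeholder metric
    if row_count == 0:
        raise ValueError("Validation failed: extracted zero rows")


def load(**_):
    print("load into Snowflake / S3")


default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,  # assumes SMTP alerting is configured
}

with DAG(
    dag_id="example_etl_with_validation",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_validate = PythonOperator(task_id="validate", python_callable=validate)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_validate >> t_load
```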
Responsibilities:
Designed and developed scalable data frameworks, migrating on-prem databases to AWS services like S3, Glue, and RDS, while integrating Snowflake for cloud warehousing
Built and optimized Spark applications using Scala and PySpark on Hadoop clusters to process large-scale data efficiently
Leveraged Kafka, Docker, and Kubernetes to handle real-time streaming data and enable low-latency event-driven processing
Created ETL/ELT pipelines using AWS Glue, SQL, and Airflow, automating data ingestion, transformation, and loading across multiple sources
Implemented data validation checks, error handling, and governance frameworks using Airflow DAGs and dbt models to ensure data integrity
Developed optimized Spark jobs by fine-tuning resource allocation, parallelism, and memory usage to improve data pipeline efficiency
Designed and implemented data models, ensuring consistency and scalability to support business intelligence and analytics needs
Established CI/CD pipelines for automated testing and deployment of data workflows, improving development efficiency and reliability
Migrated legacy data pipelines to AWS and Snowflake, improving performance and reducing operational costs
Developed and maintained Kafka producers and consumers for seamless message processing and real-time data movement (see the sketch after this list)
Collaborated with Agile teams and business stakeholders to gather requirements, validate data integrity, and ensure compliance with governance policies
Architected and deployed Hadoop clusters, including HDFS and YARN, to manage distributed storage and large-scale data processing
Conducted performance tuning on Spark applications and database queries, optimizing execution time and resource utilization
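A minimal producer/consumer sketch of the Kafka messaging pattern mentioned above, written with the kafka-python client purely for illustration; the broker address and topic name are placeholders, and the production code may equally have used Java/Scala clients.

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # kafka-python, for illustration

BOOTSTRAP = "localhost:9092"   # placeholder broker address
TOPIC = "loan-events"          # hypothetical topic name

# Producer: serialize dicts as JSON and publish them to the topic.
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"loan_id": 123, "status": "FUNDED"})
producer.flush()

# Consumer: read from the beginning of the topic and deserialize JSON.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BOOTSTRAP,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,   # stop iterating once the topic is drained
)
for message in consumer:
    print(message.offset, message.value)
```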
Project description: The client maintained data across multiple sources, including SQL Server, React applications, big data ecosystems, IT automation tools, and Kubernetes
The Ingestion team, leveraging Apache NiFi, funneled this data into a centralized data lake
Tailored to specific business requirements for each module, comprehensive data analysis was conducted using Apache Spark
I have been instrumental in designing, implementing, and optimizing large-scale data platforms and pipelines
The project involves ensuring the seamless operation of complex backend services, effective incident response coordination, and utilizing various technologies for data processing and automation
Subsequently, the data extracts were stored in AWS S3 buckets
Redshift tables, built upon this S3 data, then drove the BI dashboards
Key Contributions:
Implemented robust monitoring solutions for large-scale data platforms, ensuring optimal performance
Conducted troubleshooting of data pipelines and complex backend services
Managed large datasets efficiently, optimizing storage and retrieval processes
Implemented data governance practices for data quality, integrity, and security
Implemented custom business logic and data transformations using Scala's functional programming features, such as higher-order functions and immutable data structures, to achieve scalable and composable data processing pipelines
Orchestrated data workflows and dependencies using Apache Airflow, configuring DAGs (Directed Acyclic Graphs) to automate and schedule data pipeline executions
Designed and implemented data pipeline monitoring and alerting mechanisms to ensure data integrity and timely execution
Demonstrated a deep understanding of Apache Spark, handling complex Spark jobs, and resolving performance issues
Implemented optimizations for Spark jobs, enhancing overall data processing efficiency
Developed and maintained automation tools using Python, streamlining repetitive tasks, and improving overall workflow
Automated data ingestion processes, reducing manual intervention and enhancing overall system reliability
Demonstrated expertise in Big Data ecosystems, including HDFS, Kafka, and SQL
Implemented data transformation rules, ensuring seamless movement of data between stages in the data lake
Utilized monitoring tools such as Splunk, Prometheus, and Grafana for real-time visibility into system performance
Implemented proactive measures based on insights gathered from monitoring tools
Utilized AWS Glue as the primary ETL tool, incorporating features like AWS Glue Crawler, Data Catalog, and Connections
Explored AWS Athena for convenient data analysis in Amazon S3 with standard SQL
Developed AWS Lambda functions in Python to trigger Python scripts for large dataset transformations and analytics on EMR clusters (see the sketch below)
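A minimal boto3 sketch of the Lambda-to-EMR pattern referenced above, submitting a spark-submit step to a running cluster. The cluster ID and script location are placeholders, not values from this resume.

```python
import boto3

emr = boto3.client("emr")

# Placeholder identifiers; the real cluster ID and script path differ.
CLUSTER_ID = "j-XXXXXXXXXXXXX"
SCRIPT_S3_PATH = "s3://example-bucket/jobs/transform.py"


def lambda_handler(event, context):
    """Submit a spark-submit step to a running EMR cluster."""
    response = emr.add_job_flow_steps(
        JobFlowId=CLUSTER_ID,
        Steps=[
            {
                "Name": "large-dataset-transformation",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode",
                        "cluster",
                        SCRIPT_S3_PATH,
                    ],
                },
            }
        ],
    )
    return {"step_ids": response["StepIds"]}
```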
Responsibilities:
Designed, implemented, and managed scalable data pipelines for ingesting, transforming, and storing diverse data sets in a cloud-based infrastructure
Collaborated with cross-functional teams to understand data requirements and develop solutions meeting business objectives
Implemented robust error handling mechanisms and logging procedures to facilitate debugging and troubleshooting of Python code within the Databricks environment
Created Docker images for streamlined deployment processes, contributing to a more efficient CI/CD pipeline
Automated data ingestion processes, reducing manual intervention and enhancing overall system reliability
Optimized data workflows for efficiency, reliability, cost, and performance
Ensured data quality, integrity, and security throughout the data lifecycle
Established event-driven ETL pipelines with AWS Glue, triggered by new data arriving in AWS S3
Initiated a data collection pipeline using Kafka and Spark Streaming, storing the resultant data in Cassandra with PySpark-driven transformations (see the sketch after this list)
Conducted performance tuning and optimization of dbt models and SQL queries in Snowflake, analyzing query execution plans and identifying opportunities for enhancing data processing efficiency
Conducted ELT operations utilizing tools like PySpark, SparkSQL, Hive, and Python on extensive datasets
Orchestrated data pipelines using Airflow for scheduling PySpark jobs and deployed Flume for weblog server data acquisition
Collaborated with stakeholders to define data models and business logic within dbt, ensuring alignment with analytical requirements and best practices
Conducted code reviews and documentation updates to maintain a scalable and maintainable dbt project
Delved into data modeling and normalization techniques, enabling the loading of multi-source raw data in various storage formats into Data Lakes
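A minimal Structured Streaming sketch of the Kafka-to-Cassandra pipeline mentioned above. It assumes the spark-cassandra-connector package is available on the cluster, and the broker address, topic, schema, keyspace, and table names are all placeholders.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

# Placeholder schema for the JSON payload on the topic.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder brokers
    .option("subscribe", "weblogs")                        # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)


def write_to_cassandra(batch_df, batch_id):
    # Requires the spark-cassandra-connector on the classpath.
    (batch_df.write.format("org.apache.spark.sql.cassandra")
        .options(keyspace="analytics", table="weblog_events")  # placeholders
        .mode("append")
        .save())


query = (
    events.writeStream.foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/tmp/checkpoints/weblogs")
    .start()
)
query.awaitTermination()
```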
Project description: As the client progressed toward digital automation encompassing Data Analytics, Application Development, Robotic Process Automation, AI, DevOps, and Test Automation Services, the objective was to establish a database for a new program and craft the corresponding schemas, both relational and dimensional, in the data warehouse
The project primarily involved migrating data from multiple sources to a consolidated Azure SQL Database using Azure Data Factory, while executing the ETL processes needed to support data loads and transformations
Key Contributions:
Developed Multi-Cloud Data Processing with a focus on Azure services such as Azure Functions
Formulated PySpark applications on Azure HDInsight to process and analyze data sourced from emails, complaints, forums, and click streams, thereby ensuring a holistic customer care approach
Established comprehensive data quality assurance processes using Azure Data Factory and Azure Databricks, including data profiling, cleansing, and validation routines, ensuring the accuracy and reliability of analytical insights (see the sketch after this list)
Constructed data pipelines to process streamed data in Azure, aggregating over 10 different data sources
Automated ETL processes with PySpark & SparkSQL on Azure HDInsight clusters, facilitating both reporting and data transformations
Successfully migrated historical data from on-premises systems to Azure HDInsight and Azure Synapse Analytics
Developed comprehensive best practices guides for Azure data engineering processes, facilitating knowledge transfer, and ensuring consistency in development methodologies across the project lifecycle
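A minimal PySpark sketch of the profiling and validation pattern referenced above: null-count profiling on required columns plus a duplicate-key check, failing the run if either check breaks. The storage path, column names, and rules are illustrative assumptions only.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

# Placeholder input path and rules; the real sources were ADF-ingested datasets.
df = spark.read.parquet("abfss://raw@exampleaccount.dfs.core.windows.net/complaints/")

required_columns = ["complaint_id", "customer_id", "created_ts"]

# Profile: null counts per required column.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in required_columns]
).first().asDict()

# Validate: no duplicate business keys.
duplicate_keys = df.groupBy("complaint_id").count().filter("count > 1").count()

errors = []
for col, nulls in null_counts.items():
    if nulls > 0:
        errors.append(f"{nulls} null values in required column '{col}'")
if duplicate_keys > 0:
    errors.append(f"{duplicate_keys} duplicate complaint_id values")

if errors:
    # Fail the pipeline run so downstream loads do not consume bad data.
    raise ValueError("Data quality checks failed: " + "; ".join(errors))

print("All data quality checks passed")
```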
Responsibilities:
Spearheaded the design and implementation of large-scale data solutions, predominantly using PySpark on Azure HDInsight integrated with Azure Data Factory
Leveraged the PySpark framework on Azure HDInsight to improve the performance and optimization of existing algorithms
Seamlessly ingested data from relational databases to Azure Blob Storage using Azure Data Factory
Employed Azure Functions and Azure Data Factory for data transformation, meeting advanced analytics requirements and channeling data for downstream applications
Integrated Azure Event Hubs for real-time data streaming and processing
Utilized Spark-Streaming APIs in conjunction with Azure Event Hubs to transform real-time data, subsequently persisting into Azure Cosmos DB and Azure Synapse Analytics
Managed large datasets using Partitions, effective Joins, and Transformations within Azure HDInsight
Converted data between formats, such as Avro to Parquet and vice versa, to ensure optimal data storage and processing in Azure (see the sketch after this list)
Integrated Azure Synapse Analytics to access and query data in Azure Blob Storage, optimizing performance through efficient partitioning
Developed Spark applications to cater to diverse business logics, predominantly using Python
Streamlined DDLs for table generation in Azure Synapse Analytics as per the Field Mapping Document
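A minimal PySpark sketch of the Avro-to-Parquet conversion mentioned above. It assumes the spark-avro package is available on the cluster, and the blob paths and partition column are placeholders.

```python
from pyspark.sql import SparkSession

# Assumes spark-avro is available (bundled with recent runtimes or added via
# --packages org.apache.spark:spark-avro_2.12:<version>).
spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

SOURCE = "wasbs://raw@exampleaccount.blob.core.windows.net/events_avro/"       # placeholder
TARGET = "wasbs://curated@exampleaccount.blob.core.windows.net/events_parquet/"

df = spark.read.format("avro").load(SOURCE)

# Partition by an assumed date column to keep downstream Synapse queries pruned.
(df.write.mode("overwrite")
   .partitionBy("event_date")
   .parquet(TARGET))
```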
Project description: Our focus centered on the discipline of data quality management
We incorporated methods to quantitatively measure, enhance, and ensure the quality and integrity of the organization's data
The established framework provided a consistent approach for streamlined data-flow processes, focusing on data quality, evaluating data assets, and crafting standardized data documentation
Our overarching aim was continuous improvement in data quality within the Data Fabric
Key Contributions:
Pioneered the development of a robust data pipeline, integrating technologies like Spark, Hive, Impala, and HBase
Authored versatile Spark programs in Java, facilitating the extraction, transformation, and aggregation of data from varied file formats
Played a vital role in the ETL process, bridging the gap between OLTP and OLAP databases, enhancing the capabilities of the Decision Support Systems
Expertly transitioned tables and applications to the AWS cloud infrastructure, particularly focusing on AWS S3
Spearheaded real-time data processing initiatives using Spark Streaming, integrating Kafka for an optimized data pipeline system
Responsibilities:
Managed various data types, encompassing unstructured (like logs, clickstreams), semi-structured (such as XML, JSON), and structured data from RDBMS platforms
Employed Python for importing metadata into Hive and actively engaged in the migration process to AWS
Crafted Python scripts tailored for AWS, manipulating resources via API calls using the Boto3 SDK, and efficiently utilized the AWS CLI (see the sketch after this list)
Implemented and monitored CI/CD pipelines, leveraging tools such as Maven and GitHub within the AWS environment
Integrated Oozie with the broader Hadoop stack, supporting diverse Hadoop jobs ranging from MapReduce and Hive to Sqoop, while also handling system-specific tasks
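A minimal Boto3 sketch of the kind of S3 resource scripting mentioned above: listing objects under a prefix and copying them to another bucket. The bucket names and prefix are placeholders for illustration only.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket names and prefix.
SOURCE_BUCKET = "example-landing-bucket"
TARGET_BUCKET = "example-curated-bucket"
PREFIX = "hive/metadata/"

# List objects under a prefix (paginated, like repeated `aws s3 ls` calls).
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Copy each object into the target bucket, preserving the key.
        s3.copy_object(
            Bucket=TARGET_BUCKET,
            Key=key,
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        )
        print(f"Copied s3://{SOURCE_BUCKET}/{key} -> s3://{TARGET_BUCKET}/{key}")
```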