
Dharma Sai Aakarsh Mangamuri

Connecticut, USA

Summary

Data Engineer with 5 years of experience architecting, building, and optimizing scalable data pipelines and architectures. Proficient in Apache Spark, Apache Kafka, and Hadoop for big data processing. Experienced with cloud platforms including Microsoft Azure and Google Cloud Platform, using services such as Azure Data Factory and Google Cloud Dataproc. Skilled in implementing and optimizing data warehouse solutions such as Snowflake, Redshift, and Azure Synapse Analytics. Developed robust ETL/ELT processes with tools including Apache Airflow, Informatica, and Azure Data Factory, and built visualizations in Tableau, Power BI, and Python libraries. Integrated machine learning models and algorithms using TensorFlow, Scikit-learn, and PySpark. Skilled in data mining techniques and data modeling, and experienced in applying statistical analysis, hypothesis testing, and predictive modeling to derive actionable insights. Developed database architectural strategies through the modeling, design, and implementation stages.

Overview

6 years of professional experience

Work History

Data Engineer

Citigroup
08.2023 - Current
  • Architected and implemented cloud-native data storage and processing solutions on AWS, utilizing S3 for storage, Redshift for data warehousing, EC2 for compute resources, and REST APIs for seamless data access
  • Collaborated closely with product management teams to align data solutions with business goals
  • Boosted data retrieval speed by 30% and reduced latency through seamless integration of diverse data sources, significantly improving system performance
  • Leveraged AWS Lambda for serverless compute tasks and AWS Glue for efficient data cataloging and ETL processes
  • Developed custom PySpark scripts to preprocess and transform over 100 TB of data monthly, optimizing data pipelines for better performance
  • Engineered robust PySpark scripts for parallel processing, significantly improving pipeline speed and reliability
  • Integrated with AWS EMR to scale processing capabilities on-demand, ensuring efficient handling of peak workloads
  • Streamlined deployment processes by 60% and reduced manual intervention through automated pipelines with Jenkins and CircleCI, configuring Jenkins to trigger Terraform (Infrastructure as Code) scripts for swift, consistent, and reproducible environments across development, testing, and production
  • Leveraged Informatica PowerCenter for complex data integration tasks, creating intricate data mappings to ensure seamless ETL processes
  • Integrated Azure Databricks to perform large-scale data transformations, utilizing Apache Spark’s distributed processing capabilities
  • Elevated data processing efficiency by 40%, accelerating data availability for real-time analytics and strategic reporting
  • Designed and developed ETL processes in Azure Data Factory, orchestrating the ingestion, transformation, and loading of 1 TB of data daily from various external sources into Azure Synapse Analytics
  • Implemented dynamic pipelines with custom activities and triggers, ensuring data integrity and consistency across the data lake and warehouse, and providing timely data insights to business stakeholders
  • Utilized Tableau, Power BI, and Python libraries (matplotlib, seaborn) to create interactive visualizations, translating complex datasets into actionable insights
  • Developed advanced dashboards for real-time monitoring and analysis of key business metrics, enhancing data-driven decision-making with clear, concise, and impactful presentations for business leaders.

Data Engineer

Accenture
09.2019 - 06.2022
  • Built and maintained scalable data pipelines using Apache Spark and Kafka, processing over 1TB of health data daily for analysis and reporting
  • Integrated health data from EHR systems, medical devices, and claims databases using REST APIs and data extraction techniques, leveraging AWS Cloud services for data storage and processing, improving data accessibility
  • Designed and optimized data models for efficient storage and retrieval of health data using MySQL and PostgreSQL, reducing query times by 50%
  • Implemented data governance processes to comply with HIPAA regulations and maintain data privacy using Enterprise Content Discovery and Management (ECDP), enhancing data security
  • Managed Kubernetes for containerized applications, ensuring uptime and efficient resource utilization for data engineering workloads
  • Automated deployment of data engineering applications using Kubernetes manifests, reducing deployment times
  • Built real-time data pipelines with Apache Flink and NiFi, enabling immediate insights and alerts for critical health events, improving response times by 25%
  • Extracted, transformed, and loaded data using Azure Data Factory, Databricks, and Data Lake Storage, processing over 500GB of data daily for analytical purposes
  • Developed data visualizations and reports in Tableau, providing actionable health insights and improving strategic planning
  • Deployed machine learning models for predictive analytics in health data using Python and Scikit-learn, reducing patient readmission rates by 15%
  • Enhanced data processing and storage efficiency with Hadoop, Hive, and Parquet, reducing storage costs by 20% and processing times by 40%
  • Executed data quality assurance processes, identifying and correcting data anomalies, improving data accuracy by 25%, and documenting processes using tools like Git and Jira for effective collaboration.

Data Engineer Intern

NEXTGEN Healthcare
07.2018 - 08.2019
  • Implemented Snowflake data warehouse solutions on Google Cloud Platform (GCP), enhancing data storage efficiency
  • Created and managed Snowflake databases, schemas, tables, and views for efficient data storage and retrieval
  • Maintained data pipelines for ingesting, transforming, and loading healthcare data using Apache Spark, Kafka, and Python, processing over 2TB of data weekly
  • Ensured data accuracy and consistency through data cleansing and validation using Pandas and Jupyter notebooks, reducing data errors by 25%
  • Implemented deep learning models for predictive health insights using TensorFlow, improving patient outcomes
  • Utilized GitHub for version control and collaboration
  • Installed Hadoop, MapReduce, HDFS, and the Google Cloud SDK, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing on GCP
  • Created automated regression scripts in Python to validate ETL processes across databases including Google BigQuery, Oracle, MongoDB, and SQL Server (T-SQL), reducing manual validation time by 50%
  • Performed data analysis and visualization using Power BI and QlikView to present insights to business users and stakeholders
  • Worked with NoSQL databases (HBase) and Spark for real-time streaming of data into the cluster
  • Designed complex SSIS packages for extract, transform, and load (ETL) of data from multiple sources
  • Maintained version control and utilized collaboration tools such as Git, Jira, and Confluence to track changes and ensure seamless teamwork.

Education

Master of Science in Business Analytics and Project Management

University of Connecticut
Stamford, CT
05.2024

Skills

  • SDLC
  • Agile
  • Waterfall
  • Python
  • SQL
  • Java
  • R
  • Scala
  • NumPy
  • Pandas
  • Matplotlib
  • SciPy
  • Scikit-learn
  • TensorFlow
  • Seaborn
  • Tableau
  • Power BI
  • Advanced Excel (Pivot Tables, VLOOKUP)
  • Visual Studio Code
  • PyCharm
  • Jupyter
  • IntelliJ
  • AWS Cloud
  • Microsoft Azure
  • Google Cloud Platform
  • MySQL
  • PostgreSQL
  • MongoDB
  • T-SQL
  • Apache Spark
  • Apache Hadoop
  • Apache Kafka
  • Apache Beam
  • Flink
  • NiFi
  • Git
  • GitHub
  • Windows
  • Linux
  • macOS
  • Azure Data Factory
  • AWS Glue
  • Apache Airflow
  • Informatica
  • Talend
  • Snowflake
  • Amazon Redshift
  • Google BigQuery
  • Azure Synapse Analytics
  • Docker
  • Kubernetes
  • Jenkins
  • Terraform
  • CloudFormation
  • Data Quality
  • Compliance (HIPAA)
  • Data Encryption
  • GDPR
  • Clustering
  • Association Rules
  • Decision Trees
  • Neural Networks
  • ER Diagrams
  • Star and Snowflake Schemas
  • Normalization
  • Dimensional Modeling
  • Machine Learning (TensorFlow, Scikit-learn, PySpark)
  • Predictive Analytics
  • Statistical Analysis
  • Deep Learning (TensorFlow)
  • CircleCI
  • REST APIs
  • gRPC
  • API Development
  • ETL Development
  • Data Warehousing
  • Data Modeling
  • Data Pipeline Design
  • SQL Expertise
  • NoSQL Databases
  • Data Analysis

Personal Information

Title: Data Engineer
