
Dharma Sai Aakarsh Mangamuri

Connecticut, USA

Summary

Data Engineer with 5 years of experience architecting, building, and optimizing scalable data pipelines and architectures. Proficient in Apache Spark, Apache Kafka, and Hadoop for big data processing. Experienced with cloud platforms including Microsoft Azure and Google Cloud Platform, using services such as Azure Data Factory and Google Cloud Dataproc. Skilled in implementing and optimizing data warehouse solutions such as Snowflake, Redshift, and Azure Synapse Analytics. Developed robust ETL/ELT processes with tools including Apache Airflow, Informatica, and Azure Data Factory, and built visualizations in Tableau, Power BI, and Python libraries. Integrated machine learning models and algorithms using TensorFlow, Scikit-learn, and PySpark. Skilled in data mining techniques and data modeling, and experienced in applying statistical analysis, hypothesis testing, and predictive modeling to derive actionable insights. Developed database architectural strategies through the modeling, design, and implementation stages.

Overview

6 years of professional experience

Work History

Data Engineer

Citigroup
08.2023 - Current
  • Architected and implemented cloud-native data storage and processing solutions on AWS, utilizing S3 for storage, Redshift for data warehousing, EC2 for compute resources, and REST APIs for seamless data access
  • Collaborated closely with product management teams to align data solutions with business goals
  • Boosted data retrieval speed by 30% and reduced latency through seamless integration of diverse data sources, significantly improving system performance
  • Leveraged AWS Lambda for serverless compute tasks and AWS Glue for efficient data cataloging and ETL processes
  • Developed custom PySpark scripts to preprocess and transform over 100 TB of data monthly, optimizing data pipelines for better performance
  • Engineered robust PySpark scripts for parallel processing, significantly improving pipeline speed and reliability
  • Integrated with AWS EMR to scale processing capabilities on-demand, ensuring efficient handling of peak workloads
  • Streamlined deployment processes by 60% and reduced manual intervention through automated pipelines with Jenkins and CircleCI, configuring Jenkins to trigger Terraform (Infrastructure as Code) scripts for swift, consistent, and reproducible environments across development, testing, and production
  • Leveraged Informatica PowerCenter for complex data integration tasks, creating intricate data mappings to ensure seamless ETL processes
  • Integrated Azure Databricks to perform large-scale data transformations, utilizing Apache Spark’s distributed processing capabilities
  • Elevated data processing efficiency by 40%, accelerating data availability for real-time analytics and strategic reporting
  • Designed and developed ETL processes in Azure Data Factory, orchestrating the ingestion, transformation, and loading of 1 TB of data daily from various external sources into Azure Synapse Analytics
  • Implemented dynamic pipelines with custom activities and triggers, ensuring data integrity and consistency across the data lake and warehouse, and providing timely data insights to business stakeholders
  • Utilized Tableau, Power BI, and Python libraries (matplotlib, seaborn) to create interactive visualizations, translating complex datasets into actionable insights
  • Developed advanced dashboards for real-time monitoring and analysis of key business metrics, enhancing data-driven decision-making with clear, concise, and impactful presentations for business leaders.

Data Engineer

Accenture
09.2019 - 06.2022
  • Built and maintained scalable data pipelines using Apache Spark and Kafka, processing over 1TB of health data daily for analysis and reporting
  • Integrated health data from EHR systems, medical devices, and claims databases using REST APIs and data extraction techniques, leveraging AWS Cloud services for data storage and processing, improving data accessibility
  • Designed and optimized data models for efficient storage and retrieval of health data using MySQL and PostgreSQL, reducing query times by 50%
  • Implemented data governance processes to comply with HIPAA regulations and maintain data privacy using Enterprise Content Discovery and Management (ECDP), enhancing data security
  • Managed Kubernetes for containerized applications, ensuring uptime and efficient resource utilization for data engineering workloads
  • Automated deployment of data engineering applications using Kubernetes manifests, reducing deployment times
  • Built real-time data pipelines with Apache Flink and NiFi, enabling immediate insights and alerts for critical health events, improving response times by 25%
  • Extracted, transformed, and loaded data using Azure Data Factory, Databricks, and Data Lake Storage, processing over 500GB of data daily for analytical purposes
  • Developed data visualizations and reports in Tableau, providing actionable health insights and improving strategic planning
  • Deployed machine learning models for predictive analytics in health data using Python and Scikit-learn, reducing patient readmission rates by 15%
  • Enhanced data processing and storage efficiency with Hadoop, Hive, and Parquet, reducing storage costs by 20% and processing times by 40%
  • Executed data quality assurance processes, identifying and correcting data anomalies, improving data accuracy by 25%, and documenting processes using tools like Git and Jira for effective collaboration.

Data Engineer Intern

NEXTGEN Healthcare
07.2018 - 08.2019
  • Implemented Snowflake data warehouse solutions on Google Cloud Platform (GCP), enhancing data storage efficiency
  • Created and managed Snowflake databases, schemas, tables, and views for efficient data storage and retrieval
  • Maintained data pipelines for ingesting, transforming, and loading healthcare data using Apache Spark, Kafka, and Python, processing over 2TB of data weekly
  • Ensured data accuracy and consistency through data cleansing and validation using Pandas and Jupyter notebooks, reducing data errors by 25%
  • Implemented deep learning models for predictive health insights using TensorFlow, improving patient outcomes
  • Utilized GitHub for version control and collaboration
  • Installed Hadoop, MapReduce, HDFS, and the Google Cloud SDK, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing on GCP
  • Created automated regression scripts in Python to validate ETL processes across databases including Google BigQuery, Oracle, MongoDB, and SQL Server (T-SQL), reducing manual validation time by 50%
  • Performed data analysis and visualization using Power BI and QlikView to present insights to business users and stakeholders
  • Worked with NoSQL databases (HBase) and Spark for real-time streaming of data into the cluster
  • Designed complex SSIS packages for extract, transform, and load (ETL) of data from multiple sources
  • Maintained version control and utilized collaboration tools such as Git, Jira, and Confluence to track changes and ensure seamless teamwork.

Education

Master of Science in Business Analytics and Project Management

University of Connecticut
Stamford, CT
05.2024

Skills

  • SDLC
  • Agile
  • Waterfall
  • Python
  • SQL
  • Java
  • R
  • Scala
  • NumPy
  • Pandas
  • Matplotlib
  • SciPy
  • Scikit-learn
  • TensorFlow
  • Seaborn
  • Tableau
  • Power BI
  • Advanced Excel (Pivot Tables, VLOOKUP)
  • Visual Studio Code
  • PyCharm
  • Jupyter
  • IntelliJ
  • AWS Cloud
  • Microsoft Azure
  • Google Cloud Platform
  • MySQL
  • PostgreSQL
  • MongoDB
  • T-SQL
  • Apache Spark
  • Apache Hadoop
  • Apache Kafka
  • Apache Beam
  • Flink
  • NiFi
  • Git
  • GitHub
  • Windows
  • Linux
  • macOS
  • Azure Data Factory
  • AWS Glue
  • Apache Airflow
  • Informatica
  • Talend
  • Snowflake
  • Amazon Redshift
  • Google BigQuery
  • Azure Synapse Analytics
  • Docker
  • Kubernetes
  • Jenkins
  • Terraform
  • CloudFormation
  • Data Quality
  • Compliance (HIPAA)
  • Data Encryption
  • GDPR
  • Clustering
  • Association Rules
  • Decision Trees
  • Neural Networks
  • ER Diagrams
  • Star and Snowflake Schemas
  • Normalization
  • Dimensional Modeling
  • Machine Learning (TensorFlow, Scikit-learn, PySpark)
  • Predictive Analytics
  • Statistical Analysis
  • Deep Learning (TensorFlow)
  • CircleCI
  • REST APIs
  • gRPC
  • API Development
  • ETL Development
  • Data Warehousing
  • Data Modeling
  • Data Pipeline Design
  • SQL Expertise
  • NoSQL Databases
  • Data Analysis

Personal Information

Title: Data Engineer
