Monika Bompelly

New Haven, CT

Summary

Transitioning from a data-centric environment with a focus on developing efficient data solutions and optimizing workflows. Skilled in data architecture, database management, SQL, and Python, with a track record of enhancing data-driven decision-making processes. Seeking to apply these transferable skills in a new field, bringing a consultative approach to solving complex problems and improving operational efficiency.

Overview

11 years of professional experience

Work History

Sr. Data Engineer/ Cloud Data Engineer

Tapestry Inc
08.2023 - Current
  • Involved in the complete big data flow of the application, from upstream data ingestion into HDFS to processing and analyzing the data in HDFS
  • Involved in developing a roadmap for migration of enterprise data from multiple data sources, such as SQL Server and provider databases, into S3, which serves as a centralized data hub across the organization
  • Loaded and transformed large sets of structured and semi-structured data from various downstream systems
  • Working knowledge of Data Build Tool (dbt) with Snowflake and experience in writing SQL queries against Snowflake
  • Developed ETL pipelines using Spark and Hive for performing various business-specific transformations
  • Responsible for analyzing business requirements, estimating tasks, and preparing design documents for converting existing Teradata code into Hive/Spark SQL
  • Worked with SageMaker to build, train, and deploy machine learning models, incorporating predictive analytics into data workflows
  • Managed and stored large volumes of data in AWS S3, integrated with AWS Glue Catalog for metadata management
  • Migrated on-prem data warehouses to cloud environments and designed workflows to ensure data integrity and consistency during the migration
  • Worked on building real-time pipelines using Kafka and Spark Streaming
  • Worked closely with our data scientist teams and business consumers to shape the datasets as per the requirements
  • Automated the data pipeline to ETL all the Datasets along with full loads and incremental loads of data
  • Involved in migrating tables from RDBMS into Hive tables using SQOOP and later generated data visualizations using Tableau
  • Scheduled Airflow DAGs to run multiple Hive and Pig jobs, which independently run with time and data availability
  • Designed and deployed serverless applications using AWS Lambda, automating backend processes and event-driven workflows to increase operational efficiency
  • Implemented data partitioning and optimized ETL performance using Spark SQL and Hive to reduce processing time for large datasets by 30%
  • Integrated AWS Lambda to automate real-time ETL jobs, reducing manual intervention and enabling event-driven data processing
  • Used the AWS Glue Catalog with crawlers to pull data from S3 and performed SQL query operations using AWS Athena
  • Integrated real-time data streaming using AWS Kinesis and MSK to ingest and process high-volume data streams for real-time analytics and alerting systems
  • Engineered large-scale data analytics solutions with AWS Redshift, performing complex SQL queries on massive datasets to drive business intelligence and decision-making
  • Hands-on experience managing security with AWS IAM, service roles, and KMS for data encryption
  • Used AWS Secrets Manager to securely manage credentials for accessing sensitive data across multiple AWS services
  • Created custom triggers and integrations with other AWS services, such as AWS S3 and Amazon SNS (Simple Notification Service), to build scalable and responsive applications
  • Loaded JSON documents into a NoSQL database such as MongoDB and deployed the data to the cloud service Amazon Redshift
  • Responsible for using a Flume sink to remove data from the Flume channel and deposit it in a NoSQL database such as MongoDB
  • Worked on building input adapters for data dumps from FTP Servers using Apache Spark
  • Wrote spark applications to perform operations like data inspection, cleaning, loading, and transforming large sets of structured and semi-structured data
  • Developed Spark application with Scala and Spark-SQL for testing and processing of data
  • Made Spark job stats reporting, monitoring, and data quality checks available for each dataset
  • Developed Spark SQL logic that mimics the Teradata ETL logic and points the output delta back to the newly created Hive tables as well as the existing Teradata dimension, fact, and aggregate tables
  • Ensuring data quality, reliability, and integrity across the data pipeline and maintaining a robust data governance framework
  • Technical Stack: AWS S3, AWS Redshift, Jenkins, GIT, Hadoop, Hive, Pig, Sqoop, Oozie, Spark, Scala, Airflow, Oracle, DB2, Salesforce, Mainframe, DataStage, Grafana, Rally, ServiceNow, Unix, DoM.

Sr. Data Engineer

Elevance
01.2022 - 07.2023
  • Developed and managed complex ETL pipelines using Apache Nifi, transforming and loading terabytes of data into AWS Redshift with minimal downtime
  • Automated data quality checks and validation processes using Python and SQL, reducing data errors by 40%
  • Developed ETL processes to extract, transform, and load data from various sources, including SQL Server, Oracle, and MongoDB, ensuring data accuracy and integrity
  • Collaborated with data scientists and analysts to develop data models and algorithms, leading to improved predictive analytics and business insights
  • Managed and optimized data storage solutions using AWS services such as S3, Redshift, and Glue, reducing storage costs by 25%
  • Implemented data governance frameworks and policies, ensuring compliance with GDPR and CCPA regulations
  • Built and maintained real-time data streaming applications using Kafka, enabling real-time analytics and decision-making
  • Utilized Docker to containerize data processing applications, ensuring consistency across different environments
  • Employed Kubernetes for orchestrating containerized data processing workloads, improving scalability and resource utilization
  • Participated in Agile Scrum ceremonies, contributing to sprint planning, daily stand-ups, and retrospectives to improve team productivity
  • Developed complex database objects such as stored procedures, functions, packages, and triggers using SQL and PL/SQL
  • Experience in designing and implementing data warehouse applications, mainly using the ETL tool Talend Data Fabric for big data integration and data ingestion
  • Integrated data quality checks into CI/CD pipelines, ensuring data integrity and reliability in production
  • Led the design and implementation of scalable data pipelines using technologies such as Apache Spark, Apache Kafka, and Apache Nifi to efficiently process and analyze large volumes of data
  • Built application and database servers using AWS EC2, created AMIs, and used RDS for PostgreSQL
  • Carried out deployments and builds on various environments using the continuous integration tool Jenkins
  • Designed the project workflows/pipelines using Jenkins as a CI tool
  • Worked with Informatica Intelligent Cloud Services (IICS) and Informatica Data Quality
  • Implemented Apache Hadoop ecosystem components, including HDFS, MapReduce, Hive, and HBase, to effectively manage and process extensive datasets
  • Automated data extraction processes from multiple sources, including RESTful APIs, databases, and flat files, reducing manual intervention by 50%
  • Solid knowledge of data warehousing, data marts, Operational Data Stores (ODS), and dimensional data modeling (star schema modeling)
  • Expertise in data architecture, data modeling, metadata, data migration, data mining, and data science
  • Evaluating Azure, Collibra, Alation, Informatica data catalog tools
  • Setting up self-service analytics process and standards using Power BI to utilize data assets
  • Involved in writing Linux shell scripts for business processes and loading data from different systems into HDFS
  • Implemented ETL processing, which consists of data transformation, data sourcing, mapping, conversion, and loading
  • Utilized Apache Spark and PySpark to process and analyze large datasets, achieving significant reductions in processing time from hours to minutes
  • Created interactive dashboards in Tableau to visualize key business metrics, empowering stakeholders with actionable insights
  • Developing and maintaining technical roadmap for Enterprise Modern Data Platform for different platform capabilities
  • Developed custom data visualizations in Power BI to illustrate complex data patterns and trends
  • Leveraged AWS S3, EC2, and EMR instances extensively for deploying and testing applications across various environments (DEV, QA, PROD)
  • Utilized Terraform to allow infrastructure to be expressed as code in building EC2, Lambda, RDS, and EMR
  • Built analytical warehouses in Snowflake and queried data in staged files by referencing metadata columns
  • Designed a Data Quality Framework to perform schema validation and data profiling on Spark (PySpark)
  • Utilized the Pandas API to put data into time-series and tabular form for timestamp-based manipulation and retrieval
  • Used the Alation API interface to query data and manage tables
  • Implemented Spark Structured streaming to consume real-time data, build feature calculations from various sources like Data Lake and Snowflake, and produce them back to Kafka
  • As the PL/SQL resource on the project, developed an abstraction layer of complex views to support backward compatibility for legacy data warehouse consumers
  • Extensively worked with Avro and Parquet files and converted data between the two formats
  • Parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark
  • Developed a Python Script to load the CSV files into the S3 buckets, created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket
  • Created Hive DDL on Parquet and Avro data files residing in both HDFS and S3 buckets
  • Configured Glue Dev Endpoints to point Glue jobs to a specific EMR cluster or EC2 instance
  • Technical Stack: Python, SQL, ETL, Apache Nifi, AWS Redshift, SQL Server, Oracle, MongoDB, S3, Glue, GDPR, CCPA, Kafka, Docker, Kubernetes, Agile, Scrum, CI/CD, Spark, EC2, AMIs, RDS, PostgreSQL, Jenkins, CI, MapReduce, Hive, HBase, RESTful APIs, Flat files, HDFS, PySpark, Tableau, Power BI, EMR, Lambda, Snowflake, Pandas API, Data Lake, JSON, Parquet, CSV, S3 buckets, Avro

Sr. Data Engineer

Nike
07.2021 - 12.2022
  • Automated ETL processes using PySpark DataFrame APIs, reducing manual intervention and ensuring data consistency and accuracy
  • Integrated Azure Databricks into end-to-end ETL pipelines, facilitating seamless data extraction, transformation, and loading
  • Implemented complex data transformations using Spark RDDs, DataFrames, and Spark SQL to meet specific business requirements
  • Developed real-time data processing applications using Spark Streaming, capable of handling high-velocity data streams
  • Developed and implemented data security and privacy solutions, including encryption and access control, to safeguard sensitive healthcare data stored in Azure
  • Enhanced search performance by implementing and maintaining ElasticSearch clusters, reducing query response time by 30%
  • Ensured high availability and fault tolerance by managing ElasticSearch cluster health and scaling
  • Designed and implemented PostgreSQL database schemas and table structures based on normalized data models and relational database principles
  • Created interactive and insightful dashboards and reports in Power BI, translating complex data sets into visually compelling insights for data-driven decision-making
  • Designed efficient HBase schemas for improved data retrieval and storage, decreasing latency and boosting read/write performance
  • Leveraged expertise in Azure Data Factory for proficient data integration and transformation, optimizing processes for enhanced efficiency
  • Managed Azure Cosmos DB for globally distributed, highly available, and secure NoSQL databases, ensuring optimal performance and data integrity
  • Created end-to-end solutions for ETL transformation jobs involving Informatica workflows and mappings
  • Demonstrated extensive experience in ETL tools, including Teradata Utilities, Informatica, and Oracle, ensuring efficient and reliable data extraction, transformation, and loading processes
  • Integrated, transformed, and loaded data from various sources using Spark ETL pipelines, ensuring data integrity and consistency
  • Seamlessly integrated HBase with data processing pipelines, facilitating real-time analytics and data ingestion
  • Utilized Python, including the pandas and NumPy packages, along with Power BI to create various data visualizations, while also performing data cleaning, feature scaling, and feature engineering tasks
  • Developed machine learning models such as Logistic Regression, KNN, and Gradient Boosting with Pandas, NumPy, Seaborn, Matplotlib, and Scikit-learn in Python
  • Designed and coordinated with the Data Science team in implementing advanced analytical models in Hadoop Cluster over large datasets, contributing to efficient data workflows
  • Automated the provisioning of Azure resources using Terraform scripts, ensuring consistent and repeatable environment setups
  • Managed infrastructure changes using Terraform, enabling version-controlled and auditable infrastructure deployments
  • Implemented CI/CD pipelines with Jenkins for automated testing and deployment of ETL processes, reducing manual errors
  • Integrated CI/CD workflows with GitLab for continuous integration and delivery, enhancing the efficiency of development cycles
  • Leveraged Git for version control to manage code changes and collaborate on ETL development, ensuring code quality
  • Coordinated with teams using GitLab repositories, facilitating collaborative development and code reviews
  • Configured Jenkins pipelines to automate the testing and deployment of data integration jobs, improving release management
  • Automated deployments by integrating Jenkins with Azure and containerized ETL workflows with Docker for consistent environments across all stages
  • Utilized Docker to deploy scalable and reproducible environments for data processing applications
  • Deployed containerized data processing applications on Kubernetes clusters for enhanced scalability and reliability
  • Managed Kubernetes deployments using Helm to simplify the deployment and scaling of ETL pipelines
  • Technical Stack: Azure, Azure Data Factory, Azure CosmosDB, ETL, Informatica, PySpark, Azure HDInsight, Apache Spark, Hadoop, Spark-SQL, Scikit-learn, Pandas, NumPy, PostgreSQL, MySQL, Python, Scala, Power BI, SQL.

Data Engineer

United Health Group
04.2019 - 06.2021
  • Developed and maintained data pipelines in Azure Data Factory, integrating data from manufacturing, sales, and customer service for comprehensive analytics
  • Managed Azure Data Lake storage solutions for scalable and secure data storage, enabling efficient data access and analysis across global teams
  • Utilized Azure Databricks for big data processing and analytics, applying machine learning models to predict vehicle performance and maintenance needs
  • Automated deployment processes using Jenkins and Ansible, improving the efficiency and reliability of data infrastructure provisioning
  • Wrote and maintained shell scripts to automate routine data management tasks, enhancing operational efficiency and reducing manual errors
  • Deployed and maintained SSIS packages across multiple environments, ensuring smooth data flow operations between development, staging, and production systems
  • Configured Azure Service Bus and Event Hub for real-time data ingestion and event streaming, facilitating immediate insights into manufacturing and operational data
  • Administered Azure SQL databases and Cosmos DB, optimizing performance and ensuring high availability for critical automotive data applications
  • Deployed containerized applications using Azure Kubernetes Service (AKS) and Azure Container Registry (ACR), supporting scalable and resilient data services
  • Secured sensitive data using Azure Key Vault, implementing best practices for managing secrets, keys, and certificates
  • Managed Azure VM creation and configuration, ensuring optimized resource utilization for data processing and analysis workloads
  • Maintained infrastructure as code (IaC) using YAML templates, streamlining deployment and management of Azure resources
  • Configured and maintained WebLogic and Azure Web App environments, supporting web-based applications and services for internal and customer-facing portals
  • Wrote efficient, scalable code in Python for data processing and automation tasks, contributing to the development of predictive analytics models
  • Managed source code and version control using Git, ensuring code integrity and facilitating team collaboration
  • Coordinated project tasks and tracked progress using Jira, enhancing project visibility and team productivity
  • Configured SSRS report subscriptions and alerts, ensuring timely delivery of reports via email or shared network drives to end users
  • Implemented data governance and compliance measures, aligning data management practices with automotive industry standards and regulations
  • Conducted thorough testing and validation of data pipelines and analytics models, ensuring accuracy and reliability of insights provided to decision-makers
  • Technical Stack: Azure, Red Hat Linux, Jenkins, Ansible, shell scripting, Azure Data Lake, Azure Data Factory, Azure AD, Azure Service Bus, Azure SQL, Cosmos DB, Log Analytics, AKS, Event Hub, Service Bus, Key Vault, App Insights, Azure VM creation, ACR, Azure Function App, Azure Web App, Azure SQL, Azure SQL MI, SSH, YAML, WebLogic, Python, Azure DevOps, Git, Maven, Jira.

Data Engineer

HP
04.2015 - 03.2019
  • Implemented data pipelines on AWS Glue to effectively extract, transform, and load various datasets for Chevron's analytics, improving operational and decision-making insights
  • Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming applications for processing large-scale data from oil and gas operations
  • Monitored Spark jobs using the UI (Name Node Manager, Resource Manager ETS) in AWS
  • Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouses, and business intelligence solutions
  • Experience in AWS services like EC2, EMR, DynamoDB, Athena, and Redshift
  • Automated data workflows using Python and Apache Airflow, resulting in increased efficiency and reduced manual errors
  • Developed Spark SQL scripts using PySpark to perform transformations and actions on Data Frames and Data Sets in Spark for faster data processing
  • Created data pipelines for extracting, transforming, and loading data from various sources, including internal and external APIs
  • Conducted performance tuning and optimization of SQL queries on AWS Redshift to enhance data processing efficiency
  • Developed Spark scripts using Python on AWS EMR for data aggregation, cleansing, and mining
  • Developed and maintained data orchestration workflows using AWS Step Functions to manage complex ETL tasks and dependencies
  • Worked together with data scientists to enable real-time model inference through SQS-triggered Lambda functions and to run scripts in response to events in DynamoDB and S3
  • Collaborated with cross-functional teams to understand business requirements and translate them into actionable Tableau visualizations
  • Proficient in using Python for DynamoDB interactions, including the Boto3 library for seamless integration
  • Implemented CRUD operations on DynamoDB tables using Python scripts, ensuring data consistency
  • Generated reports using Python per business requirements and created visualizations
  • Participated in the design, build, and deployment of NoSQL implementations such as MongoDB
  • Added support for Amazon AWS S3 and RDS to host static/media files and the database into Amazon Cloud
  • Conducted extensive code reviews using GitHub pull requests, improving code quality, and led team meetings
  • Managed and processed large datasets using Hadoop MapReduce, improving data processing efficiency
  • Developed scripts to migrate data from a proprietary database to PostgreSQL
  • Followed Agile methodologies and the Scrum process
  • Technical Stack: Python, Django, HTML5, CSS, Bootstrap, jQuery, JSON, JavaScript, PostgreSQL, MongoDB, Ansible, MySQL, Google Cloud, Amazon AWS S3, Bugzilla, JIRA, Hadoop, Hive, Apache Airflow

SQL Developer

HP
09.2013 - 03.2015
  • Involved in the Installation and Configuration of SQL Server 2008 and SQL Server 2012 with the latest Service Packs
  • Used DDL and DML for writing Triggers, Stored procedures, and Data manipulation
  • Created Ad Hoc and Parameterized reports using SQL Server Reporting Services (SSRS)
  • Used performance point services, SSRS, Excel as the reporting tools and wrote the expressions in SSRS wherever necessary
  • Created OLAP-based reports, subreports, bar charts, and matrix reports using SSRS
  • Deployed the SSRS reports in Microsoft Office SharePoint portal server (MOSS) 2012
  • Designed and developed SSIS Packages to import and export data from MS Excel, SQL Server, and Flat files
  • Involved in the development of complex mappings using SSIS to transform and load the data from Oracle into the SQL 2008 R2/2012 Server target staging database
  • Created Linked Servers to connect OLE DB data sources and providers; participated in designing a data warehouse to move information from OLTP to staging and from staging to the enterprise data warehouse for better analysis
  • Conducted and automated the ETL operations to Extract data from multiple data sources, transform inconsistent and missing data to consistent and reliable data, and finally load it into the multi-dimensional data warehouse
  • Developed T-SQL programs required to retrieve data from the data repository using cursors and exception handling, and created T-SQL scripts to monitor deadlocks
  • Developed SQL queries and PL/SQL procedures in Oracle database for the Application
  • Modified the existing Universe and created a new Universe against the Oracle database per the reporting requirements to add new features to the reports
  • Documented Design Documents for reports to provide detailed design and explanation of the reports and documented Unit Test documents to evaluate and validate reports.

Skills

  • Programming: Python, Scala, Java, Golang
  • Cloud Platforms: AWS (EMR, S3, Glue, Redshift), Azure (Data Lake, Databricks), Google Cloud Platform
  • Big Data: Hadoop (HDFS, Hive, Pig, Spark), Apache Kafka, PySpark, MapReduce
  • ETL Tools: Talend, Informatica, Microsoft Integration Services, SnowSQL
  • Databases: SQL Server, NoSQL (DynamoDB, MongoDB, Cassandra)
  • Data Visualization: Tableau, Power BI
  • Infrastructure as Code: Terraform
  • API Development: JWT, OAuth2, API keys
  • CI/CD: Jenkins, Docker, Bitbucket, Git
  • Testing Tools: Apache JMeter, QuerySurge
  • Machine Learning & AI: TensorFlow, PyTorch
  • Methodologies: Agile, Scrum, Test-Driven Development (TDD)

Timeline

Sr. Data Engineer/ Cloud Data Engineer

Tapestry Inc
08.2023 - Current

Sr. Data Engineer

Elevance
01.2022 - 07.2023

Sr. Data Engineer

Nike
07.2021 - 12.2022

Data Engineer

United Health Group
04.2019 - 06.2021

Data Engineer

HP
04.2015 - 03.2019

SQL Developer

HP
09.2013 - 03.2015