Anurag Kethireddy

Cumming, GA

Summary

Senior Data Engineer with 8 years of experience in designing and developing scalable data solutions in cloud environments. Proficient in building and optimizing data pipelines, ETL frameworks, and data warehousing using a wide array of tools, including AWS, Azure, and Hadoop ecosystems. Adept at handling large datasets and delivering insights through real-time data processing and advanced analytics.

Key strengths include:

  • Cloud Expertise: Deep experience with AWS (Lambda, Glue, Kinesis, EMR) and Azure (Data Factory, Synapse, SQL Azure), utilizing over 25 AWS services to build and maintain end-to-end data pipelines, improving data ingestion efficiency by up to 40%.
  • Big Data & Analytics: Skilled in leveraging Hadoop, Spark, Kafka, and Flink to process and analyze large-scale data, resulting in a 30% reduction in ETL execution times and improved query performance.
  • Data Modeling & Warehousing: Expert in Dimensional and Relational Data Modeling, with extensive experience in data warehousing solutions like RedShift, Cassandra, and DynamoDB.
  • Programming: Proficient in Python (Pandas, NumPy), PySpark, Scala, and SQL, developing robust data transformations and automating data workflows to reduce manual intervention by 40%.
  • ETL & Stream Processing: Proven success in creating efficient ETL processes, batch processing, and real-time message ingestion, with improvements of 25% in data processing speed.
  • Reporting & Dashboards: Developed and optimized reports using Tableau and Power BI, enabling faster decision-making and improving data-driven insights by 20%.
  • DevOps & CI/CD: Experience with version control tools like Git, Bitbucket, and containerization technologies like Docker and Kubernetes, ensuring seamless deployments and reducing cloud resource management costs by 15%.

Strong collaborator with a demonstrated ability to work closely with data scientists, analysts, and business stakeholders to deliver high-impact data solutions, improving overall system efficiency and reducing operational costs.

Overview

9 years of professional experience

Work History

DATA ENGINEER II

AMAZON INC
06.2022 - 05.2023
  • Designed, implemented, and optimized scalable data pipelines and architectures using Hadoop, Spark, and EMR, processing over 5 TB of data per day, leading to a 30% improvement in data processing speed and reducing operational costs by 15%
  • Developed and implemented a comprehensive end-to-end data pipeline, utilizing AWS S3 and Apache Spark, which improved data ingestion efficiency by 40% and reduced manual intervention in pipeline management by 25%
  • Designed and developed tools for functional integration tests, enhancing test coverage by 20% and reducing critical production issues by 15%, leading to improved system reliability
  • Created Python AWS Lambda functions for EMR clusters, which processed large datasets (up to 2 TB) with a 25% increase in data processing speed, enabling near real-time analytics for critical business decisions
  • Built and maintained scalable data processing and ingestion pipelines, ensuring 99.9% uptime and reducing data ingestion latency by 20%, improving overall system efficiency
  • Developed and optimized distributed data processing workflows using Apache Spark and Apache Flink, resulting in a 25% improvement in query performance and a 30% reduction in ETL execution time
  • Deployed and managed Spark jobs using Airflow in an AWS environment, optimizing resource usage by 15% and reducing job execution time by 20%, leading to cost savings in cloud resource management
  • Integrated REST APIs to enhance connectivity between databases and the data access layers, reducing data retrieval times by 30% and ensuring seamless integration across multiple platforms
  • Collaborated closely with data analysts, data scientists, and stakeholders to deliver robust data solutions that reduced report generation time by 20% and improved the accuracy of analytics by 10%
  • Designed and implemented data frameworks in RedShift to automate data ingestion and transformation processes, decreasing manual intervention by 40% and reducing ETL error rates by 15%

SOFTWARE ENGINEER

LinkedIn Corporation
12.2020 - 05.2022
  • Developed and optimized Spark applications using Scala, resulting in a 25% reduction in processing time and increased scalability for handling 2 TB of data per day
  • Created scalable data pipelines using Azure Data Factory and Databricks, integrating 5+ data sources and reducing data processing time by 30%
  • Managed Azure Databricks data processing, improving data pipeline efficiency by 20% and reducing storage costs by 15% through optimized resource allocation
  • Deployed IAM roles and Azure Data Factory using Terraform, automating role creation and improving deployment time by 35%
  • Developed Spark streaming applications using Scala with optimized configurations, reducing execution time by 25% and improving system reliability
  • Developed Spark code using Scala and Spark-SQL, optimizing algorithms for faster data processing, resulting in 20% faster query execution and reduced resource use
  • Deployed Spark jobs via HDInsight in Azure, improving job execution efficiency by 15% and reducing cloud infrastructure costs
  • Designed and deployed POCs using Spark on YARN, demonstrating 30% performance improvement compared to traditional SQL processing for large datasets
  • Implemented text analytics using Spark's in-memory capabilities, reducing data processing times by 20% and enhancing analysis speed for text-heavy datasets
  • Led data collection and cleaning efforts, improving data accuracy by 15%, and developed predictive models that increased forecasting accuracy by 10%
  • Created Spark applications using DataFrames and Spark SQL APIs, improving query performance by 25% and enabling faster decision-making for business teams

Sr Hadoop Developer

Citigroup
11.2019 - 09.2020
  • Managed Hadoop ecosystems (Hive, HBase, Oozie, Zookeeper, Spark Streaming), improving data processing speeds by 25% and streamlining data storage processes
  • Implemented Spark Streaming to analyze 10M+ user events/day, improving visitor behavior analysis, leading to a 20% increase in user engagement insights
  • Developed Spark Streaming and MapReduce jobs for a large-scale data lake, improving data storage efficiency by 30% and reducing data retrieval time by 15%
  • Developed Spark Streaming jobs with RDDs and SparkSQL, optimizing processing performance by 20% and enabling real-time analytics on streaming data
  • Managed Oozie jobs for capacity planning, improving storage utilization by 10% and reducing processing delays in critical data workflows
  • Developed and optimized REST APIs using Python (Pandas, Django), reducing data retrieval time by 20% and improving data availability across applications
  • Implemented partitioning and bucketing in Hive, improving query performance by 30% and reducing storage costs by 15%
  • Developed external Hive tables and optimized HiveQL queries, improving data analysis speed by 25% and supporting business-critical reporting functions
  • Used Sqoop to transfer 500 GB+ of data daily between HDFS and relational databases, improving data integration processes and reducing latency by 20%
  • Used Apache Spark on YARN for large-scale data processing, improving performance by 30% and reducing the time for large batch processes by 20%

Sr Hadoop Developer

Visa Inc
06.2018 - 11.2019
  • Applied in-depth knowledge of Hadoop architecture (HDFS, MapReduce), leading to improved data processing efficiency and reduced cluster overhead
  • Managed Big Data ecosystems (Spark, Hive, Sqoop, Oozie), improving data processing speeds by 25% and ensuring seamless integration across platforms
  • Collaborated with business users to finalize technical requirements, resulting in more accurate project deliverables and reducing requirement clarification time by 15%
  • Converted ETL processes into optimized MapReduce jobs, reducing data processing times by 30% and increasing system efficiency for wholesale, market risk, and securitization
  • Extracted and transferred 1TB+ of data between Exadata and HDFS using Sqoop, improving data transfer efficiency by 20% and enabling faster reporting cycles
  • Optimized MapReduce jobs with compression mechanisms, reducing storage requirements by 30% and improving job runtime performance by 25%
  • Optimized MapReduce algorithms (Combiners, Partitioners, Distributed Cache), improving data processing speed by 20% and reducing job execution times by 25%
  • Reconciled data from MapReduce and ETL processes using Spark, reducing data discrepancies by 15% and improving reporting accuracy for financial stakeholders
  • Tuned Hive queries to improve query performance by 30%, reducing processing time for large datasets and enabling faster decision-making
  • Converted Hive tables to Avro and ORC formats, reducing storage usage by 40% and freeing up cluster space for additional workloads

Hadoop Developer

SpindleTop Technologies
12.2014 - 11.2015
  • Used Sqoop to load structured data from relational databases into HDFS
  • Loaded transactional data from Teradata using Sqoop and created Hive Tables
  • Worked on automation of delta feeds from Teradata using Sqoop and from FTP Servers to Hive using Flume
  • Performed transformations such as de-normalization, data-set cleansing, date conversions, and parsing of complex columns
  • Worked with different compression codecs like GZIP, SNAPPY and BZIP2 in MapReduce, Pig and Hive for better performance
  • Handled Avro, JSON and Apache log data in Hive using custom Hive SerDes
  • Worked on batch processing and scheduled workflows using Oozie
  • Fine-tuned Hive scripts to improve join performance and reduce skew in aggregate operations
  • Used HiveQL to create partitioned RC and Parquet tables, applying compression techniques to optimize data processing and speed up retrieval
  • Implemented Partitioning, Dynamic Partitioning and Buckets in Hive for efficient data access

Software Engineer

SpindleTop Technologies
06.2014 - 11.2014
  • Contributed to project analysis, estimation, and development
  • Used the Struts MVC framework to manage interaction between JSP view layers
  • Participated in client meetings and calls, both individually and with the team
  • Delivered weekly status reports on project progress
  • Prepared release notes and handled all deployment activities
  • Wrote XML parsing code
  • Managed the build process to deploy applications
  • Created patches, maintained the code base, and performed sanity code checks for client releases
  • Completed development on time and within the scheduled plan
  • Performed QA testing against requirements
  • Conducted unit, system integration, and regression testing
  • Reviewed code and provided technical support to the team
  • Wrote test cases for unit, functional, and integration testing

Education

Master of Science - Computer Informa

New England College
Henniker, NH
04.2018

Bachelor of Science - Electronics And Communications Engineering

JNTUH
Hyderabad, India
04.2014

Skills

  • Spark Development
  • Data Warehousing
  • Hadoop Ecosystem
  • Data Pipeline Design
  • Data Modeling
  • Big Data Processing
  • ETL development
  • Python Programming
  • Real-time Analytics
  • Data integration
  • Database Design
  • Risk Analysis
