Koushik Gaddam

Manassas, VA

Summary

Accomplished Sr. Data Engineer at Metrix IT Solutions Inc, adept at optimizing ETL processes and enhancing data integration using Spark and AWS Glue. Proven ability to streamline workflows, achieving a 40% increase in analytics efficiency. Strong analytical skills combined with a collaborative approach to problem-solving drive impactful results in data-driven environments.

Overview

9 years of professional experience
1 Certification

Work History

Sr. Data Engineer

Metrix IT Solutions Inc
01.2020 - Current
  • Streamlined AWS cloud migration processes and supported application modernization efforts by implementing Datadog for real-time monitoring and performance insights and GitHub for version control.
  • Designed and implemented scalable APIs and query layers (GraphQL, REST) using AWS services (API Gateway, Lambda, DynamoDB, and RDS) to enable data access and integration across distributed systems.
  • Enhanced Spark workload efficiency and cost-effectiveness using Amazon EMR on EKS (Kubernetes) and optimized containerized workflows with Docker.
  • Optimized performance and cost by implementing GCP BigQuery clustering, partitioning, and slot reservations.
  • Automated real-time AWS big data workflows using Jenkins to enhance CI/CD deployment efficiency.
  • Designed and developed real-time ETL pipelines to efficiently ingest and process data in and out of the Snowflake data warehouse, using a combination of Python and SnowSQL.
  • Automated end-to-end data pipelines using AWS Step Functions, orchestrating data ingestion from Kinesis, transformation via AWS Glue, and storage in S3 and Redshift, reducing manual intervention by 50%.
  • Developed and maintained scalable data pipelines and ETL processes using Apache Spark and Scala, facilitating data ingestion, transformation, and integration for large-scale distributed systems.
  • Designed and implemented a modern data platform using Data Build Tool (dbt) on AWS Cloud, streamlining data transformation processes and enhancing analytics efficiency by 40%.
  • Analyzed historical datasets (>10TB), implemented Delta Lake and Data Lake setups, and optimized data ingestion pipelines for seamless Snowflake integration.
  • Configured Snowflake features like RBAC controls, query optimization, and time travel, improving resource efficiency and query performance by 25%.
  • Ingested real-time data from Kafka into Amazon Timestream using Kafka Connect for efficient handling of time-series data.
  • Developed and deployed Azure Databricks Notebooks to design and implement batch data pipelines tailored for diverse data varieties, including structured, semi-structured, and unstructured data.
  • Created a Python boto script to automate the launching of AWS resources for data pipeline management.
  • Proficient in creating Snowflake Stored Procedures and UDFs to automate business logic and enable real-time data validation, transformation, and aggregation.
  • Designed and implemented transactional data models in DynamoDB (NoSQL), ensuring high availability and achieving 99.9% system uptime.
  • Developed real-time data pipelines for seamless data integration using Apache Kafka, AWS Glue, and Amazon Redshift.
  • Applied machine learning methods, including classification, regression, and dimensionality reduction, using Spark, Hadoop, HBase, and Spark Streaming, and used the resulting engine to increase user lifetime by 45% and triple user conversations for target categories.
  • Implemented scalable real-time data processing pipelines leveraging Apache Spark Streaming and Kafka within the AWS ecosystem.
  • Experience handling Teradata, SQL, procedures, views, and Python de-construction scripts.
  • Developed ETL pipelines in and out of data warehouses using a combination of Python and Amazon Redshift SQL, writing complex SQL queries to extract, transform, and load data for analytics and reporting.
  • Developed SQL queries and scripts to validate data, including checks for duplicates, null values, and truncated values, and to ensure correct data aggregations.
  • Built scalable architecture with Amazon DynamoDB and AWS Lambda to process real-time user data and trigger automated actions.
  • Implemented real-time scalable analytics by integrating AWS Glue with Google BigQuery, optimizing data processing and insights generation.
  • Read from and wrote to S3 Iceberg tables using the AWS Glue Iceberg REST Catalog with open-source Apache Spark.
  • Hands-on experience in designing, automating, and building CI/CD data pipelines using Spark, Ansible, and Jenkins.
  • Hands-on experience with Jupyter Notebook, Informatica, and data science platform operations, with a focus on Cloudera Hadoop management and administration.

Software Data Engineer

TOPSYSIT Solutions LLC
Alpharetta, GA
10.2018 - 01.2020
  • Hands-on experience in upgrading Apache, Cloudera and Hortonworks platforms.
  • Hands-on experience in managing big data ecosystems (MapReduce, HDFS, YARN, Hive, Spark, HBase, Pig, PySpark, ZooKeeper, Ranger, Kafka, Solr).
  • Used Spark to improve the performance and optimization of existing Hadoop algorithms, working with SparkContext, Spark SQL, DataFrames, and YARN.
  • Developed SQL queries and scripts to validate data, including checks for duplicates, null values, and truncated values, and to ensure correct data aggregations.
  • Strong knowledge of RDBMS topics; ingested transactional data into RDBMS using AWS DMS (Database Migration Service) for real-time replication.
  • Deployed and managed scalable EMR clusters for large-scale data jobs using MapReduce, Hive, Spark, and Presto, reducing processing time by 40%.
  • Designed and deployed CI/CD pipelines for Amazon MWAA (Apache Airflow) using CodeCommit and CodePipeline, accelerating deployment cycles by 40%.
  • Enhanced ML workflows by integrating PyTorch models into scalable ETL pipelines, improving data processing efficiency by 30% and reducing latency.
  • Deployed real-time data pipelines using Apache Kafka on AWS, leveraging Lambda, S3, and Kinesis for efficient data ingestion, processing, and storage.
  • Proficient in creating Snowflake Stored Procedures and UDFs to automate business logic and enable real-time data validation, transformation, and aggregation.
  • Automated data lineage collection by integrating the Spline Agent as a Spark plugin with Terraform, streamlining ETL job tracking across a centralized API.
  • Designed and implemented Glue ETL jobs using PySpark on large-scale EMR clusters, processing datasets over 10TB daily to enable real-time analytics.
  • Handled high-severity situations and provided recommendations to improve performance when processing high-scale data jobs; troubleshot production job failures on big data services (EMR, Glue, Data Pipeline, Managed Airflow).
  • Deployed TensorFlow models on AWS Lambda using container images for seamless execution.
  • Good understanding of partitioning, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables and tuned and optimized Hive and Presto queries on large datasets.
  • Built and designed architecture models for partitioning and loading daily ingested data on S3 using EMR data processing applications such as Spark, Presto, and Hive, then wrote complex SQL queries to extract values from the partitioned data.
  • Provided best practices for implementing data pipelines using Hive and Sqoop to ingest, transform, and analyze operational data.
  • Created Athena tables over AWS logs (CloudTrail, CloudWatch, WAF, ALB) and provided SQL queries to customers to find API activity calls and error code issues.
  • Provided technical expertise in data storage structures, data mining, and data cleansing to support data integrity and accessibility.
  • Explored, identified, and documented business requirements, focusing on understanding and prioritizing business dependencies.
  • Analyzed client data to improve the efficiency and quality of data systems.
  • Implemented SQL scripts, views, and procedures for loading and transforming data into target and data mart tables.
  • Developed conceptual, logical, and physical data models for Business Intelligence (BI) using Power BI.

Software Data Engineer

Conversant Solutions INC
Charlotte, NC
06.2017 - 09.2018
  • Built a data warehousing solution using dbt on Amazon Redshift, enabling real-time data integration and transformation and streamlining ETL processes for timely analytics and reporting.
  • Hands-on experience in building, managing, and monitoring big data clusters for large-scale data using Hadoop, Hive, MapReduce, HBase, Spark, ZooKeeper, and Presto.
  • Possessed extensive knowledge of Hive, Spark, and MapReduce frameworks in data processing.
  • Expert in writing and debugging reports using Spark SQL queries, stored procedures, query optimization, and performance tuning.
  • Expert in SQL, implementing complex queries on EMR Presto and Athena; hands-on experience with Tableau (dashboard creation, report authoring and troubleshooting, data source management).
  • Designed centralized dashboards in Amazon QuickSight, integrating CloudTrail and system logs to analyze cost and usage, providing actionable insights through Athena, Looker, and Power BI.
  • Provisioned and managed Amazon Redshift with Terraform, automating deployment, scaling, and maintenance of clusters using infrastructure-as-code.
  • Experienced in troubleshooting UNIX/Linux systems, utilizing basic commands and developing shell scripts to automate processes and resolve system issues efficiently.

Education

Master of Science - Information Systems

Stratford University
Washington DC, USA
01.2016

Bachelor’s - Information Technology

GITAM University
Telangana, India
01.2013

Skills

  • HDFS
  • MapReduce
  • Hive
  • Sqoop
  • Spark
  • ETL
  • Presto
  • Flink
  • Airflow
  • HBase
  • Hadoop processing
  • ZooKeeper
  • Oozie
  • Kafka
  • NiFi
  • Ranger
  • YARN
  • EMR
  • Athena
  • Glue
  • Data pipelines
  • Dataflow
  • S3
  • IAM
  • CloudFormation
  • Kinesis
  • EC2
  • Lambda
  • VPC
  • EKS
  • Batch
  • BigQuery
  • Kibana
  • Health data
  • SageMaker
  • Dataproc
  • Azure integration
  • Databricks
  • Snowflake
  • RDS
  • Version Control
  • Tableau
  • Alteryx
  • ELK Stack
  • Power BI
  • GitLab
  • Excel
  • SQL
  • Python
  • R
  • Scala
  • HiveQL
  • PySpark
  • Shell Scripting

Certification

  • AWS Solutions Architect
  • Data Engineer Professional
