HARSHA VARDHAN KOSURU

Data Engineer
DeKalb, IL

Summary

Data engineer with over five years of experience designing, developing, and optimizing data integration solutions across diverse environments. Adept at leveraging Apache Hadoop, Spark, Kafka, AWS, and a range of relational and NoSQL databases to build robust ETL pipelines and real-time data processing systems. Experienced in Agile methodologies, data orchestration with tools such as NiFi and Airflow, and containerization with Docker and Kubernetes. Implemented data governance frameworks that ensure data quality, privacy, and regulatory compliance, and used analytics tools such as Zeppelin and Jupyter Notebooks to derive insights from large datasets. Extensive experience with cloud-based data warehousing, cloud infrastructure management, and CI/CD automation, plus a strong background in scalable data architectures, enterprise data warehouses, machine learning, data visualization, and performance tuning. Collaborative team player with a track record of delivering high-quality data solutions that drive business value.

Overview

6 years of professional experience
2 Certifications

Work History

Data Engineer Intern

Thrive Software Solutions
WA
02.2024 - 05.2024
  • Utilized Apache Zeppelin and Jupyter Notebooks for advanced analytics, deriving insights from large datasets through statistical techniques and machine learning algorithms
  • Managed data orchestration and workflows efficiently with Apache NiFi and Luigi, handling various data formats including JSON, XML, Parquet, CSV, and ORC
  • Used Docker for containerization and Kubernetes for orchestration, facilitating the deployment and management of containerized applications
  • Implemented data governance frameworks with Apache Atlas and Collibra, ensuring data quality, privacy, and regulatory compliance
  • Leveraged Apache Kafka Streams and Amazon Kinesis for real-time data processing, optimizing streaming data pipelines for high-throughput and real-time analytics (see the consumer sketch after this list).
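
To illustrate the streaming work above, here is a minimal, hypothetical sketch of a Kinesis polling consumer in Python; the stream name, shard ID, region, and JSON payload format are illustrative assumptions, not details from this role.

```python
# Hypothetical Kinesis polling consumer; stream and shard names are placeholders.
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-west-2")

def iter_records(stream_name: str, shard_id: str):
    """Yield decoded records from one shard, starting from the latest position."""
    shard_iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="LATEST",
    )["ShardIterator"]

    while shard_iterator:
        resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
        for record in resp["Records"]:
            yield json.loads(record["Data"])      # assumes JSON-encoded payloads
        shard_iterator = resp.get("NextShardIterator")
        time.sleep(1)                             # simple throttle between polls

for event in iter_records("clickstream-events", "shardId-000000000000"):
    print(event)
```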

Graduate Research Assistant

Northern Illinois University
DeKalb, IL
01.2023 - 01.2024
  • Enhanced distributed data processing efficiency by leveraging Apache Hadoop, Spark, and Flink, focusing on in-memory and real-time stream processing
  • Implemented advanced resource management and scalable architectures using containerization and load balancing techniques
  • Developed optimized ETL processes with incremental loading and real-time data processing capabilities using Apache Kafka
  • Integrated automated monitoring and self-healing mechanisms into data pipelines, utilizing Apache Airflow for workflow orchestration (see the DAG sketch after this list)
  • Ensured data quality and optimized performance by incorporating robust validation steps and profiling tools to identify and resolve bottlenecks.
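
As referenced in the monitoring bullet above, a hedged sketch of that pattern as an Airflow DAG: retries absorb transient failures and a failure callback raises an alert. The DAG ID, task bodies, and alert hook are illustrative assumptions.

```python
# Illustrative Airflow DAG with retries and an on-failure alert hook.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_on_failure(context):
    # Placeholder alert; in practice this might page on-call or post to chat.
    print(f"Task {context['task_instance'].task_id} failed")

def extract():
    print("pull raw data from the source system")

def load():
    print("write validated data to the warehouse")

default_args = {
    "retries": 3,                              # retry transient failures automatically
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_on_failure,  # alert once retries are exhausted
}

with DAG(
    dag_id="research_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```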

AWS Data Engineer

Mindtree Ltd
Hyderabad
11.2020 - 07.2022
  • Developed a cloud migration strategy and implemented best practices using AWS services such as AWS Database Migration Service and AWS Server Migration Service
  • Set up and built AWS infrastructure using resources such as VPC, EC2, S3, DynamoDB, IAM, EBS, Route53, SNS, SES, SQS, CloudWatch, CloudTrail, Security Groups, Auto Scaling, and RDS via CloudFormation templates
  • Implemented new tools like Kubernetes with Docker for auto-scaling and continuous integration (CI), deploying Docker images through Kubernetes, and using the Kubernetes dashboard for monitoring
  • Utilized AWS Lambda for serverless computing and trigger-based code execution (see the handler sketch after this list)
  • Worked on implementing data warehouse solutions in AWS Redshift and migrating data from various databases to AWS services
  • Developed scripts in BASH and Python for AWS infrastructure creation and automation tasks
  • Orchestrated and migrated CI/CD processes using CloudFormation, Terraform, and Docker, set up in OpenShift, AWS, and VPCs
  • Developed Python programs for automating tasks like extracting metadata and lineage from tools, saving significant manual effort
  • Utilized Spark for improving performance and optimizing existing algorithms in Hadoop environments
  • Integrated real-time monitoring for data ingestion processes using AWS CloudWatch
  • Configured Airflow connection to AWS EMR cluster and developed bash shell bootstrap scripts for initializing the cluster with necessary configurations
  • Defined, created, and deployed Star Schema, Snowflake Schema, and Dimensional Data Modeling on an Enterprise Data Warehouse (EDW).
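
As referenced in the Lambda bullet above, a minimal sketch of trigger-based execution: an S3 put event invokes the handler, which stages the new object for downstream ETL. The bucket layout and staging prefix are assumptions.

```python
# Hypothetical S3-triggered Lambda handler; bucket and prefix are placeholders.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Copy each newly arrived object into a staging prefix for downstream ETL."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        s3.copy_object(
            Bucket=bucket,
            Key=f"staging/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )
    return {"statusCode": 200, "body": json.dumps("processed")}
```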

Big Data Engineer

Arcesium
Hyderabad
07.2018 - 10.2020
  • Worked in Agile environments using tools like Rally to maintain user stories and tasks
  • Utilized Agile methodology and SCRUM process, providing daily reports and participating in design and development phases
  • Developed Spark/PySpark-based ETL pipelines for migrating credit card transaction, account, and customer data into an enterprise Hadoop Data Lake (see the PySpark sketch after this list)
  • Migrated MapReduce jobs to Spark for better performance and used Spark RDDs, Python, and Scala for data transformations
  • Maintained data integration programs in Hadoop and RDBMS environments from both structured and semi-structured data sources
  • Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase
  • Utilized AWS services such as EMR, S3, Lambda, and SNS for data processing and storage
  • Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL
  • Designed SSIS Packages for ETL from various environments into SQL Server for SSAS cubes
  • Transformed Teradata scripts and stored procedures to SQL and Python for Snowflake's cloud platform
  • Defined, created, and deployed Star Schema, Snowflake Schema, and Dimensional Data Modeling on an EDW
  • Implemented Composite server for data virtualization and created restricted data access views using a REST API
  • Batch processed data from S3 to MongoDB, PostgreSQL, and MySQL
  • Queried and analyzed data from Cassandra using CQL and joined various tables using Spark and Scala
  • Built and published customized interactive Tableau reports and dashboards
  • Created multiple dashboards in Tableau for various business needs and used SQL Server Reporting Services (SSRS) for formatted reports
  • Performed performance tuning on Hive queries and UDFs
  • Supervised data profiling and validation to ensure accuracy between source and target systems
  • Configured Topics in new Kafka clusters across environments and brought data into Hadoop and Cassandra using Kafka
  • Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases for storage.
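
As referenced in the ETL bullet above, a hedged PySpark sketch of that kind of pipeline: read semi-structured source data, apply basic cleansing, and write partitioned Parquet registered in Hive. Paths, column names, and the table name are assumptions, not actual project details.

```python
# Illustrative PySpark ETL; paths, columns, and table names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("card-transactions-etl")
    .enableHiveSupport()
    .getOrCreate()
)

# Read semi-structured source data (JSON in this example).
raw = spark.read.json("hdfs:///landing/card_transactions/")

cleaned = (
    raw.dropDuplicates(["transaction_id"])
       .withColumn("txn_date", F.to_date("txn_timestamp"))
       .filter(F.col("amount") > 0)              # drop obviously invalid rows
)

# Write partitioned Parquet into the curated zone and register it in Hive.
(cleaned.write
        .mode("append")
        .partitionBy("txn_date")
        .format("parquet")
        .saveAsTable("lake.card_transactions"))
```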

Hadoop Developer

GENPACT
Hyderabad
01.2018 - 06.2018
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs
  • Developed simple to complex MapReduce jobs using Hive and Pig
  • Developed MapReduce programs for data analysis and data cleaning
  • Implemented Avro and Parquet data formats for Apache Hive computations to handle custom business requirements
  • Integrated external data sources and APIs into GCP data solutions, ensuring data quality and consistency
  • Built data transformation pipelines using GCP services such as Dataflow and Apache Beam to cleanse, normalize, and enrich data (see the Beam sketch after this list)
  • Built machine learning models to showcase big data capabilities using PySpark
  • Designed, implemented, and deployed a series of custom parallel algorithms for customer-defined metrics and unsupervised learning models within a customer's existing Hadoop/Cassandra cluster
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster
  • Extensively used SSIS transformations such as Lookup, Derived Column, Data Conversion, Aggregate, Conditional Split, SQL Task, Script Task, and Send Mail Task
  • Performed data cleansing, enrichment, mapping tasks and automated data validation processes to ensure meaningful and accurate data was reported efficiently
  • Implemented Apache Pig scripts to load data into and out of Hive.
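
As referenced in the Dataflow bullet above, a small illustrative Apache Beam pipeline showing the cleanse/normalize pattern; the bucket paths and normalization rule are assumptions, and the same code could run on the Dataflow runner with the appropriate pipeline options.

```python
# Illustrative Apache Beam cleanse/normalize pipeline; paths are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def normalize(record: dict) -> dict:
    """Lower-case keys and strip whitespace from string values."""
    return {
        k.lower(): v.strip() if isinstance(v, str) else v
        for k, v in record.items()
    }

options = PipelineOptions()  # add --runner=DataflowRunner etc. when deploying

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/raw/records.jsonl")
        | "Parse" >> beam.Map(json.loads)
        | "Normalize" >> beam.Map(normalize)
        | "Serialize" >> beam.Map(json.dumps)
        | "Write" >> beam.io.WriteToText("gs://example-bucket/clean/records")
    )
```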

Education

Master of Science - Management Information Systems

Northern Illinois University (NIU)
May 2024

Bachelor of Technology

Amity University
May 2018

Skills

  • AWS Services: Amazon EMR, Amazon S3, AWS Lambda, Amazon SNS, Amazon SQS, AWS CloudWatch
  • Big Data Technologies: MapReduce, Hive, Sqoop, Oozie, ZooKeeper, Apache Spark, PySpark, YARN, Hadoop, Apache HBase, Apache Kafka, Databricks Delta Lake
  • Data Storage and Warehousing: Snowflake, Data Warehouse, DB2, Cassandra, HDFS, Hadoop (Hortonworks), Terraform
  • Operating Systems: Windows, Linux
  • ETL Processes and Tools: Apache Airflow
  • Data Formats and Protocols: JSON, XML, Parquet, CSV, ORC
  • Messaging Services: Apache Kafka, Azure Service Bus
  • Databases: MS SQL Server, Azure SQL Database, Azure Cosmos DB, Oracle, Snowflake, RDBMS, MS Excel, MS Access
  • Containerization & Orchestration: Docker, Kubernetes, Apache NiFi, Luigi
  • Reporting Tools: Tableau, Power BI
  • Miscellaneous: REST API, Scrum, Agile methodology, Waterfall methodology, Project Management

Accomplishments

  • AWS Cloud Practitioner
  • Google Agile Project Management

Work Preference

Work Type

Full Time, Contract Work

Work Location

On-Site, Remote, Hybrid

Important To Me

Company Culture, Career Advancement, Team Building / Company Retreats

Certification

  • AWS Cloud Practitioner, AWS - 01/09/2024 - 09/01/2027
  • Certified Project Management, Google

Work Availability

Monday - Sunday; morning, afternoon, evening

Timeline

Data Engineer Intern

Thrive Software Solutions
02.2024 - 05.2024

Graduate Research Assistant

Northern Illinois University
01.2023 - 01.2024

AWS Data Engineer

Mindtree Ltd
11.2020 - 07.2022

Big Data Engineer

Arcesium
07.2018 - 10.2020

Hadoop Developer

GENPACT
01.2018 - 06.2018

Master of Science - Management Information Systems

Northern Illinois University (NIU)

Bachelor of Technology

Amity University