
Siva Varma

Summary

6 years of experience in Data Engineering, specializing in GCP and AWS, with expertise in building scalable data lakes, ETL pipelines, real-time data processing, and cloud-based analytics.
  • Designed and developed centralized data lakes on GCP, leveraging Cloud Storage, Dataproc, BigQuery, and BigTable for efficient data storage and processing
  • Built scalable ETL workflows using Cloud Dataflow, Dataproc with Spark, and Apache Airflow, automating data ingestion and transformation pipelines
  • Implemented real-time data ingestion architectures using Druid and Kafka on GCP, ensuring low-latency processing for analytics and reporting
  • Collaborated with ML engineers to integrate data pipelines with AI/ML models
  • Strengthened data security and compliance by implementing GCP IAM policies, role-based access control, encryption, and data masking techniques
  • Orchestrated multi-source data ingestion pipelines using Cloud Composer and Cloud Dataproc, ensuring seamless integration with diverse data sources
  • Developed scalable, high-performance code in Python and Scala for complex data transformations and workflow automation
  • Designed and maintained AWS-based data solutions, developing data lakes on Amazon S3 and optimizing them with partitioning strategies and lifecycle policies
  • Built and managed ETL pipelines using AWS Glue, Lambda, and Apache Airflow, streamlining automated data transformations and processing
  • Implemented real-time streaming solutions with AWS Kinesis, Spark Streaming, and Apache Kafka, ensuring continuous data availability and processing
  • Migrated multi-terabyte datasets from Oracle to AWS, storing optimized copies in Amazon Redshift for business intelligence and reporting
  • Optimized Redshift clusters, focusing on schema design, query performance tuning, and workload management for enhanced data analytics
  • Secured data on AWS by configuring IAM roles, S3 bucket policies, and AWS KMS encryption, ensuring regulatory compliance
  • Developed real-time dashboards and analytics solutions using Tableau and AWS Athena, improving data accessibility and business insights
  • Automated Tableau dashboard updates using Python and AWS Lambda, reducing manual effort and improving real-time reporting efficiency

Overview

6 years of professional experience

Work History

Data Engineer

Marsh McLennan Agency
Dallas, TX
01.2023 - Current
  • Developed a centralized data lake on Google Cloud Platform (GCP) using key services such as Cloud Storage, Dataproc, BigQuery, and BigTable
  • Created PySpark scripts for data cleansing and enrichment of clickstream data, optimizing real-time analytics performance
  • Implemented real-time data ingestion architectures leveraging Druid on GCP, enhancing transformation and query efficiency
  • Built scalable and fault-tolerant data pipelines using Spark Streaming to handle high-volume data streams, ensuring continuous data availability for business-critical operations
  • Developed and managed ETL and data flow jobs using Apache Airflow on GCP, automating daily incremental loads with various Airflow operators
  • Collaborated with machine learning engineers to integrate data pipelines with ML models
  • Designed and optimized BigQuery schemas, leveraging partitioning and clustering to improve query performance and reduce execution time by 40%
  • Enhanced data security and compliance by implementing GCP IAM policies, role-based access control (RBAC), data masking, and encryption, ensuring secure access and data protection
  • Integrated Google Cloud Search and BigQuery to enable fast, full-text search capabilities across extensive datasets, improving data retrieval efficiency and query performance
  • Worked on large-scale migration of datasets from PostgreSQL to Google BigQuery, optimizing performance, reducing query execution times, and lowering operational costs
  • Designed real-time ingestion pipelines to integrate data from databases, APIs, and streaming services, enhancing analytics and reporting
  • Automated and orchestrated multi-source data ingestion and transformation workflows using Cloud Dataproc
  • Established a comprehensive data governance framework to maintain data integrity and ensure compliance across all data interactions
  • Developed and maintained efficient, high-quality code in Spark, Scala, and Python for complex data transformations, improving system scalability and reliability

Big Data Developer

Digno Solutions
Hyderabad, IN
05.2021 - 07.2022
  • Extracted, transformed, and loaded data from various source systems into AWS storage services using AWS Glue, Amazon EMR, and Amazon S3
  • Built data pipelines in AWS Glue by leveraging connections, jobs, and workflows to extract, transform, and load data from multiple sources, including Amazon RDS, Amazon S3, Amazon Redshift, and PostgreSQL
  • Developed Spark applications in Scala and Spark SQL for data extraction, transformation, and aggregation across various file formats to generate insights into customer behavior
  • Developed end-to-end data ingestion workflows using AWS Glue, integrating them with BMC Control-M scheduling tools to automate and streamline data processing and workflow management
  • Managed and optimized Databricks clusters by handling upgrades, performance monitoring, and workload tuning to improve cost efficiency
  • Wrote complex SQL queries in AWS Redshift to perform advanced transformations, aggregations, and optimizations
  • Worked on migration of data from PostgreSQL to AWS, ensuring data consistency, optimizing performance, and facilitating business analytics and reporting
  • Optimized Spark applications on AWS EMR by fine-tuning batch intervals, parallelism settings, and memory configurations, enhancing processing performance and resource efficiency
  • Worked on orchestrating multi-source data ingestion and transformation workflows using AWS Glue and Amazon CloudWatch, ensuring efficient data processing and monitoring
  • Wrote and maintained high-quality, scalable code in Scala and Python to support complex data transformation processes and enhance system reliability
  • Worked on real-time streaming pipelines utilizing Apache Kafka and Spark Streaming for efficient data processing
  • Developed Tableau reports for real-time claims analysis and financial forecasting, assisting in compliance audits and regulatory reporting
  • Automated Tableau dashboard updates using Python scripts and AWS Lambda, ensuring real-time insights for executive decision-making and reducing manual workload
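The automated daily loads described above typically follow a watermark-based incremental pattern; the sketch below is a minimal, hedged illustration of that pattern in plain Python (the column name updated_at is an assumption, not taken from the actual jobs).

```python
# Minimal sketch of watermark-based incremental loading: keep only rows
# newer than the last stored watermark, then advance the watermark.
def incremental_batch(rows, watermark):
    """Return rows with updated_at > watermark, plus the new watermark."""
    fresh = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=watermark)
    return fresh, new_watermark
```

A scheduler such as Airflow or Control-M would persist the returned watermark between runs so each execution processes only the new delta.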

Data Engineer

Amogus Technologies
Hyderabad, IN
04.2019 - 05.2021
  • Engineered and managed data pipelines leveraging AWS Glue and Lambda, automating ETL processes across multiple data sources
  • Designed and optimized data lakes on Amazon S3 by implementing partitioning strategies and lifecycle policies to enhance performance and reduce costs
  • Developed real-time data ingestion workflows utilizing AWS Kinesis Data Streams ensuring low-latency processing
  • Configured and fine-tuned Redshift clusters, focusing on schema design, query performance optimization, and workload management for efficient analytics
  • Worked on orchestrating multi-step data pipelines using AWS Data Pipeline and AWS Step Functions, ensuring reliable workflow execution and efficient data processing
  • Implemented SQL-based querying solutions using AWS Athena and Glue Catalog, enabling efficient serverless analytics on Amazon S3
  • Assisted in debugging and troubleshooting ETL workflows, identifying and resolving data pipeline failures and bottlenecks
  • Wrote and optimized SQL queries for data extraction and transformation within Redshift and S3-based datasets
  • Collaborated with senior engineers to optimize data pipeline performance and improve data processing efficiency
  • Used version control tools such as Git and GitHub for code versioning and team collaboration
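The S3 partitioning strategy mentioned above usually means a Hive-style key layout that Athena and the Glue Catalog can prune; the helper below is an assumed sketch (bucket and table names are placeholders).

```python
# Hypothetical builder for a Hive-style dt=/hour= partitioned S3 prefix,
# the layout Athena and the Glue Catalog read for partition pruning.
from datetime import datetime

def s3_partition_key(prefix: str, table: str, event_time: datetime) -> str:
    """Build a dt=/hour= partitioned object prefix for a data-lake table."""
    return (
        f"{prefix}/{table}/"
        f"dt={event_time:%Y-%m-%d}/hour={event_time:%H}/"
    )
```

Writing objects under such prefixes lets queries that filter on dt/hour scan only the matching partitions, which is the cost and performance win the bullet points describe.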

Skills

Cloud Computing Platforms:

Google Cloud Platform (GCP)

Amazon Web Services (AWS)

GCP Services:

Cloud Storage

BigQuery

BigTable

Dataproc

Cloud Dataflow

Cloud Composer (Apache Airflow)

Pub/Sub

Cloud IAM

Cloud Monitoring

AWS Services:

S3

Athena

Glue Crawler

Glue Catalog

Redshift

Lambda

RDS

EMR

Kinesis

SNS

IAM

CloudFormation

Terraform

CloudWatch

Cost Explorer

Data Warehouses:

Google BigQuery

Amazon Redshift

Programming Languages:

Python

Scala

SQL

Bash

Big Data & Streaming Frameworks:

Apache Spark (PySpark, Spark SQL, Spark Streaming)

Apache Kafka

Apache Beam

Druid

Elasticsearch

Data Pipeline & Orchestration:

Apache Airflow

Cloud Composer

AWS Glue

AWS Data Pipeline

Databases & Storage:

PostgreSQL

MySQL

Amazon RDS

Google Cloud SQL

Version Control & Development Tools:

Git

GitHub

VS Code

Data Formats:

JSON

CSV

Parquet

Avro

ORC

XML

Visualization & Reporting:

Tableau

Amazon QuickSight

Timeline

Data Engineer

Marsh McLennan Agency
01.2023 - Current

Big Data Developer

Digno Solutions
05.2021 - 07.2022

Data Engineer

Amogus Technologies
04.2019 - 05.2021