Hi, I’m

VENKATA SAI

Sr. GCP Data Engineer
Little Elm, TX

Summary

Data Engineer with 10+ years of experience across a broad range of cloud platforms, including GCP, AWS, and Azure, with specialized expertise in the GCP Cloud Data Platform. Skilled in a variety of Big Data ecosystem technologies such as Hadoop, MapReduce, Pig, Hive, and Spark, with a strong background in data visualization, reporting, and data quality solutions. Extensive experience with all phases of a project: initiation, requirement and specification gathering, system design, coding, testing, and debugging of existing client-server applications. Responsive expert experienced in monitoring database performance, troubleshooting issues, and optimizing database environments. Possesses strong analytical skills, excellent problem-solving abilities, and a deep understanding of database technologies and systems. Equally confident working independently and collaboratively, with excellent communication skills.

Overview

10 years of professional experience
4 years of post-secondary education

Work History

Travelport, Englewood, CO

Sr. GCP Data Engineer
2022.05 - Current (2 years & 4 months)

Job overview

  • Developed ETL pipelines on GCP using Apache Beam and Dataflow to process large-scale data in real-time, resulting in a 20% improvement in data processing time (a minimal pipeline sketch follows this list)
  • Built and deployed data pipelines using Cloud Composer and Cloud Functions, enabling seamless integration with other GCP services such as BigQuery, Pub/Sub, and Cloud Storage
  • Implemented monitoring and alerting mechanisms using Stackdriver, enabling proactive issue identification and resolution in GCP data pipelines
  • Designed and executed end-to-end testing strategies for GCP data pipelines, ensuring the accuracy and completeness of data from ingestion to analysis
  • Utilized DevOps practices and tools such as Jenkins, Terraform, and Ansible to automate GCP infrastructure deployment and configuration, resulting in a 50% reduction in deployment time
  • Worked with Python, SQL, and Bash scripts to develop custom data transformations and data quality rules, resulting in a 25% reduction in data processing errors
  • Developed and maintained CI/CD pipelines on GCP using Cloud Build and Cloud Run, enabling seamless code deployment and testing in a controlled environment
  • Implemented data versioning and lineage tracking using tools such as Data Catalog and Data Studio, enabling auditability and traceability of healthcare data in GCP
  • Conducted capacity planning and scaling of GCP data pipelines using Kubernetes and Cloud Autoscaling, ensuring optimal performance and cost-efficiency
  • Developed multi-cloud strategies that make better use of GCP (for its PaaS offerings) and Azure (for its SaaS offerings)
  • Designed and developed Spark jobs with Scala to implement end-to-end data pipelines for batch processing
  • Developed data pipeline using Flume, Kafka, and Spark Stream to ingest data from their weblog server and apply the transformation
  • Developed data validation scripts in Hive and Spark and performed validation using Jupyter Notebook by spinning up a query cluster in AWS EMR
  • Executed Hadoop and Spark jobs on AWS EMR using data stored in Amazon S3
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries
  • Developed PySpark script to encrypt the raw data by using Hashing algorithms concepts on client-specified columns
  • Developed Stored Procedures, Views, and Triggers, and was responsible for the design, development, and testing of the database
  • Developed a Python-based API (RESTful web service) to track revenue and perform revenue analysis
  • Environment: GCP, GCS buckets, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Cloud SQL, BigQuery, Cloud Dataproc, GCS, Cloud Composer, Talend for Big Data, Airflow, Hadoop, Hive, Teradata, SAS, Spark, Python, SQL Server, AWS, Kubernetes, Docker.
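
A minimal sketch of the kind of Beam/Dataflow streaming pipeline described in the first bullet above, assuming a Pub/Sub source and a BigQuery sink; the project, topic, and table names are hypothetical placeholders rather than actual project values:

# Hedged sketch: a streaming Beam pipeline from Pub/Sub to BigQuery.
# Project, topic, and table names below are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Runner, project, and region are supplied via standard CLI flags when
    # submitting to Dataflow; streaming mode is enabled here.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/events"
            )
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:analytics.events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table assumed to exist
            )
        )


if __name__ == "__main__":
    run()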

Edward Jones, St.Louis, MO

GCP Data Engineer
2019.11 - 2022.04 (2 years & 5 months)

Job overview

  • Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks in Python to automate the ingestion, processing, and delivery of structured and unstructured data in both batch and real-time streaming modes
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (a minimal DAG sketch follows this list)
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark YARN
  • Worked with Spark Streaming to ingest data into the in-house ingestion platform
  • Developed REST APIs in Python using the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files
  • Designed the ETL runs performance tracking sheet in different phases of the project and shared with the production team
  • Involved in identifying and designing most efficient and cost-effective solution through research and evaluation of alternatives
  • Developed a PySpark script to process and transfer files to a third-party vendor on an automated basis
  • Experience in GCP Dataproc, GCS, Cloud functions, BigQuery
  • Experience in moving data between GCP and Azure using Azure Data Factory
  • Experience in building power bi reports on Azure Analysis services for better performance
  • Used cloud shell SDK in GCP to configure the services Data Proc, Storage, BigQuery
  • Loaded continuous data transfers through Snowpipe and wrote SnowSQL queries to analyze the data
  • Experience developing Spark programs in Scala to perform data transformations, create Datasets and DataFrames, and write Spark SQL queries, Spark Streaming jobs, and windowed streaming applications
  • Loaded and transformed sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries
  • Used AVRO, Parquet file formats for serialization of data
  • Good understanding of monitoring and managing the Hadoop cluster through Cloudera
  • Involved in testing and bug fixing
  • Involved in performance tuning the application at various levels, Hive, Spark, etc
  • Worked in scrum/Agile environment, using tools such as JIRA
  • Implemented Spark using Scala and SparkSql for faster testing and processing of data
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs with Scala
  • Used MongoDB to store data in JSON format and developed and tested many dashboard features using Python, Bootstrap, CSS, and JavaScript
  • Worked on NiFi data pipelines that are built for consuming data into Data Lake
  • Implemented PySpark scripts in Python to perform extraction of required data from the data sets and storing it on HDFS
  • Developed Python code to gather data from HBase and designed the solution for implementation using PySpark
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python
  • Used Spark with Python to create DataFrames on client request data for computing analytics and saved the results as text files for analytical report generation
  • Created an analytical report in Tableau Desktop on the number of client requests received per day
  • Environment: GCP, GCP Dataproc, Apache Beam, Airflow, Hadoop, Hive, Teradata, SAS, Spark, EMR, S3, Python, Sqoop, Snowflake, Spark SQL, SQL.
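
A minimal sketch of an Airflow DAG for the kind of GCP ETL jobs described above, assuming the Google provider package is installed; the bucket, project, dataset, and table names are hypothetical placeholders:

# Hedged sketch: load files from GCS into a BigQuery staging table, then run a
# transform query. Bucket, project, dataset, and table names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="gcs_to_bq_daily_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw",
        bucket="example-landing-bucket",
        source_objects=["daily/{{ ds }}/*.csv"],
        destination_project_dataset_table="example-project.staging.orders_raw",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    transform = BigQueryInsertJobOperator(
        task_id="transform",
        configuration={
            "query": {
                "query": (
                    "SELECT * FROM `example-project.staging.orders_raw` "
                    "WHERE order_total > 0"
                ),
                "destinationTable": {
                    "projectId": "example-project",
                    "datasetId": "analytics",
                    "tableId": "orders_clean",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_raw >> transform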

AAL, American Airlines, Fort Worth, TX

AWS Data Engineer
2017.06 - 2019.10 (2 years & 4 months)

Job overview

  • Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB
  • Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, and S3
  • Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, so suggestions can be made automatically, using Kinesis Data Firehose and an S3 data lake
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB
  • Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce EC2 resource costs (a minimal handler sketch follows this list)
  • Designed and developed an ETL data pipeline using a PySpark application to fetch data from legacy systems, third-party APIs, and social media sites
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala
  • Worked on spark applications and launching clusters with spark in EMR console
  • Imported data into Spark from a Kafka consumer group using Spark Streaming APIs
  • Implemented Spark RDD transformations to map business analysis and apply actions on top of transformations
  • Used Hadoop cluster for building the applications and for building, testing, deploying the applications
  • Worked with Sqoop import and export functionality to handle large data set transfers between a MySQL database and HDFS
  • Developed Spark SQL tables and queries to perform ad-hoc data analytics for the analyst team
  • Wrote and built Azkaban workflow jobs to automate the process
  • Monitored Spark clusters
  • Involved in developing and designing POCs using Scala and deployed on the Yarn cluster, compared the performance of Spark, with Hive and SQL/Teradata
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala
  • Involved in migrating Hive queries into Spark transformations using Data frames, Spark SQL, SQL Context, and Scala
  • Implemented test scripts to support test driven development and continuous integration
  • Performed data analytics and loaded data to Snowflake
  • Developed shell scripts to automate the deployment process
  • Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for prediction
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker
  • Worked with file formats TEXT, AVRO, PARQUET and SEQUENCE files
  • Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, Hadoop, Scala, PySpark, Spark SQL, Hive, Git, Spark cluster, Sqoop, UNIX shell scripting, Domo, Snowflake.
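
A minimal sketch of the kind of Boto3 Lambda handler described above for deregistering unused AMIs; region handling, pagination, and retention rules are simplified assumptions:

# Hedged sketch: deregister account-owned AMIs not referenced by any instance.
# Pagination and per-region iteration are omitted for brevity.
import boto3

ec2 = boto3.client("ec2")


def lambda_handler(event, context):
    # AMIs owned by this account
    owned_images = ec2.describe_images(Owners=["self"])["Images"]

    # AMI IDs currently referenced by EC2 instances
    in_use = set()
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            in_use.add(instance["ImageId"])

    deregistered = []
    for image in owned_images:
        if image["ImageId"] not in in_use:
            ec2.deregister_image(ImageId=image["ImageId"])
            deregistered.append(image["ImageId"])

    return {"deregistered": deregistered}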

Amigos Software Solutions

Azure Data Engineer
2015.09 - 2017.03 (1 year & 6 months)

Job overview

  • Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL
  • Migrate data from traditional database systems to Azure databases
  • Coordinated with external team members and other stakeholders to understand the impact of their changes and complete release work smoothly, helping avoid integration issues in the Explore.MS application
  • Analyze, design and build Modern data solutions using Azure PaaS service to support visualization of data
  • Understand current Production state of application and determine the impact of new implementation on existing business processes
  • Coordinated with external team members and other stakeholders to understand the impact of their changes and complete release work smoothly, helping avoid integration issues in the VL-In-Box application
  • Review of VL-In-Box application’s test plan and test cases during System Integration and User Acceptance testing
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics)
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks
  • Design and implement migration strategies for traditional systems on Azure (lift and shift/Azure Migrate and other third-party tools)
  • Used Azure Synapse to manage processing workloads and served data for BI and prediction needs
  • Experience in DWH/BI project implementation using Azure Data Factory
  • Interacts with Business Analysts, Users, and SMEs on elaborating requirements
  • Design and implement end-to-end data solutions (storage, integration, processing, and visualization) in Azure
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including a write-back tool, in both directions
  • Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster
  • Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism, and appropriate memory settings
  • Performed ETL using Azure Databricks (a minimal PySpark sketch follows this list)
  • Migrated on-premises Oracle ETL processes to Azure Synapse Analytics
  • Wrote UDFs in Scala and PySpark to meet specific business requirements
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity
  • Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS)
  • Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure
  • Set up and maintained Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory
  • Develop conceptual solutions & create proofs-of-concept to demonstrate viability of solutions
  • Implemented the Copy activity and custom Azure Data Factory pipeline activities
  • Responsible for creating Requirements Documentation for various projects
  • Environment: Azure SQL, Azure Storage Explorer, Azure Blob Storage, Azure Backup, Azure Files, Azure Data Lake Storage, SQL Server Management Studio, Visual Studio, VSTS, Azure Blob, Power BI, PowerShell, C#, .NET, SSIS, DataGrid, ETL (Extract, Transform, and Load), Business Intelligence (BI)
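
A minimal sketch of the kind of Azure Databricks ETL described above, assuming raw CSV files in Azure Data Lake Storage Gen2; the storage account, container, and column names are hypothetical placeholders:

# Hedged sketch: read raw CSVs from ADLS Gen2, clean them, and write Parquet
# for downstream Synapse / Power BI use. Paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls_orders_etl").getOrCreate()

# Raw files landed in Azure Data Lake Storage Gen2 (abfss path is illustrative)
raw_path = "abfss://landing@exampleaccount.dfs.core.windows.net/orders/"

orders = (
    spark.read.option("header", "true").csv(raw_path)
    .withColumn("order_total", F.col("order_total").cast("double"))
    .filter(F.col("order_total") > 0)
)

# Write the cleaned data as Parquet to the curated zone
orders.write.mode("overwrite").parquet(
    "abfss://curated@exampleaccount.dfs.core.windows.net/orders_clean/"
)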

Ceequence Technologies, Hyderabad

Big Data/Hadoop Engineer
2013.05 - 2015.08 (2 years & 3 months)

Job overview

  • Involved in Agile development methodology as an active member in scrum meetings
  • Worked in Azure environment for development and deployment of Custom Hadoop Applications
  • Designed and implemented scalable Cloud Data and Analytical architecture solutions for various public and private cloud platforms using Azure
  • Involved in the end-to-end process of Hadoop jobs that used various technologies such as Sqoop, Pig, Hive, MapReduce, Spark, and shell scripts
  • Implemented various Azure platforms such as Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HD Insight, Azure Data Lake and Data Factory
  • Extracted and loaded data into Data Lake environment (MS Azure) by using Sqoop which was accessed by business users
  • Installed Hadoop, MapReduce, HDFS, and Azure components to develop multiple MapReduce jobs in Pig and Hive for data cleansing and pre-processing
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data
  • Developed a Spark job in Java which indexes data into Elastic Search from external Hive tables which are in HDFS
  • Performed transformations, cleaning and filtering on imported data using Hive, Map Reduce, and loaded final data into HDFS
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark YARN (a minimal Spark SQL sketch follows this list)
  • Imported data from different sources such as HDFS and HBase into Spark RDDs and developed a data pipeline using Kafka and Storm to store data into HDFS
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala, as well as to NoSQL databases such as HBase and Cassandra
  • Documented the requirements including the available code which should be implemented using Spark, Hive, HDFS, HBase and Elastic Search
  • Performed transformations such as event joins, bot-traffic filtering, and some pre-aggregations using Pig
  • Explored MLlib algorithms in Spark to understand the possible machine learning functionality that could be used for our use case
  • Used Windows Azure SQL Reporting Services to create reports with tables, charts, and maps
  • Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements
  • Configured Oozie workflow to run multiple Hive and Pig jobs which run independently with time and data availability
  • Imported and exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
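
A minimal sketch of the kind of Spark SQL analysis over Parquet data described above, assuming the data already sits in HDFS; the path, view name, and column names are hypothetical placeholders:

# Hedged sketch: query Parquet data in HDFS with Spark SQL.
# HDFS path, view name, and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream_analysis").getOrCreate()

# Parquet files previously written to HDFS by the ingestion pipeline
events = spark.read.parquet("hdfs:///data/warehouse/clickstream/")
events.createOrReplaceTempView("clickstream")

# Daily page-view counts, the kind of aggregation handed to the BI team
daily_views = spark.sql(
    """
    SELECT event_date, COUNT(*) AS page_views
    FROM clickstream
    WHERE event_type = 'page_view'
    GROUP BY event_date
    ORDER BY event_date
    """
)
daily_views.show()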

Education

Bennett University New Delhi, India

Bachelor of Technology in Electrical, Electronics and Communications Engineering
2009.08 - 2013.05 (3 years & 9 months)


Skills

    Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, YARN, Apache Spark, Mahout, Sparklib


Timeline

Sr. GCP Data Engineer
Travelport
2022.05 - Current (2 years & 4 months)
GCP Data Engineer
Edward Jones
2019.11 - 2022.04 (2 years & 5 months)
AWS Data Engineer
AAL, American Airlines
2017.06 - 2019.10 (2 years & 4 months)
Azure Data Engineer
Amigos Software Solutions
2015.09 - 2017.03 (1 year & 6 months)
Big Data/Hadoop Engineer
Ceequence Technologies
2013.05 - 2015.08 (2 years & 3 months)
Bennett University
Bachelor of Technology in Electrical, Electronics and Communications Engineering
2009.08 - 2013.05 (3 years & 9 months)