Hi, I’m

VENKATA SAI

Sr. GCP Data Engineer
Little Elm, TX

Summary

Data Engineer with 10+ years of experience across a broad range of cloud platforms, including GCP, AWS, and Azure, with specialized expertise in the GCP Cloud Data Platform. Skilled in a variety of Big Data ecosystem technologies such as Hadoop, MapReduce, Pig, Hive, and Spark, with a strong background in data visualization, reporting, and data quality solutions. Extensive experience with all phases of a project: initiation, requirement and specification gathering, system design, coding, testing, and debugging of existing client-server applications. Responsive expert experienced in monitoring database performance, troubleshooting issues, and optimizing database environments. Possesses strong analytical skills, excellent problem-solving abilities, and a deep understanding of database technologies and systems. Equally confident working independently and collaboratively, with excellent communication skills.

Overview

10 years of professional experience
4 years of post-secondary education

Work History

Travelport, Englewood, CO

Sr. GCP Data Engineer
2022.05 - Current (2 years & 4 months)

Job overview

  • Developed ETL pipelines on GCP using Apache Beam and Dataflow to process large-scale data in real-time, resulting in a 20% improvement in data processing time (a minimal pipeline sketch follows this list)
  • Built and deployed data pipelines using Cloud Composer and Cloud Functions, enabling seamless integration with other GCP services such as BigQuery, Pub/Sub, and Cloud Storage
  • Implemented monitoring and alerting mechanisms using Stackdriver, enabling proactive issue identification and resolution in GCP data pipelines
  • Designed and executed end-to-end testing strategies for GCP data pipelines, ensuring the accuracy and completeness of data from ingestion to analysis
  • Utilized DevOps practices and tools such as Jenkins, Terraform, and Ansible to automate GCP infrastructure deployment and configuration, resulting in a 50% reduction in deployment time
  • Worked with Python, SQL, and Bash scripts to develop custom data transformations and data quality rules, resulting in a 25% reduction in data processing errors
  • Developed and maintained CI/CD pipelines on GCP using Cloud Build and Cloud Run, enabling seamless code deployment and testing in a controlled environment
  • Implemented data versioning and lineage tracking using tools such as Data Catalog and Data Studio, enabling auditability and traceability of healthcare data in GCP
  • Conducted capacity planning and scaling of GCP data pipelines using Kubernetes and Cloud Autoscaling, ensuring optimal performance and cost-efficiency
  • Developed multi-cloud strategies that make better use of GCP (for its PaaS offerings) and Azure (for its SaaS offerings)
  • Designed and developed Spark jobs with Scala to implement end-to-end data pipelines for batch processing
  • Developed data pipeline using Flume, Kafka, and Spark Stream to ingest data from their weblog server and apply the transformation
  • Developed data validation scripts in Hive and Spark and performed validation using Jupyter Notebook by spinning up a query cluster in AWS EMR
  • Executed Hadoop and Spark jobs on AWS EMR using data stored in Amazon S3
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries
  • Developed PySpark script to encrypt the raw data by using Hashing algorithms concepts on client-specified columns
  • Developed Stored Procedures, Views, and Triggers, and was responsible for the design, development, and testing of the database
  • Developed a Python-based API (RESTful web service) to track revenue and perform revenue analysis
  • Environment: GCP, GCS buckets, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Cloud SQL, BigQuery, Cloud Dataproc, GCS, Cloud Composer, Talend for Big Data, Airflow, Hadoop, Hive, Teradata, SAS, Spark, Python, SQL Server, AWS, Kubernetes, Docker.
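
A minimal sketch of the kind of Beam/Dataflow streaming pipeline described in the first bullet above, assuming a Pub/Sub source and a BigQuery sink; the project, topic, and table names are hypothetical placeholders rather than actual project values:

# Hedged sketch: a streaming Beam pipeline from Pub/Sub to BigQuery.
# Project, topic, and table names below are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Runner, project, and region are supplied via standard CLI flags when
    # submitting to Dataflow; streaming mode is enabled here.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/events"
            )
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:analytics.events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table assumed to exist
            )
        )


if __name__ == "__main__":
    run()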

Edward Jones, St.Louis, MO

GCP Data Engineer
2019.11 - 2022.04 (2 years & 5 months)

Job overview

  • Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks in Python to automate the ingestion, processing, and delivery of structured and unstructured data in both batch and real-time streaming modes
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (a minimal DAG sketch follows this list)
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark YARN
  • Worked with Spark Streaming to ingest data into the in-house ingestion platform
  • Developed REST APIs in Python using the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files
  • Designed the ETL runs performance tracking sheet in different phases of the project and shared with the production team
  • Involved in identifying and designing most efficient and cost-effective solution through research and evaluation of alternatives
  • Developed a PySpark script to process and transfer files to a third-party vendor on an automated basis
  • Experience in GCP Dataproc, GCS, Cloud functions, BigQuery
  • Experience in moving data between GCP and Azure using Azure Data Factory
  • Experience in building power bi reports on Azure Analysis services for better performance
  • Used cloud shell SDK in GCP to configure the services Data Proc, Storage, BigQuery
  • Loaded continuous data transfers through Snowpipe and wrote SnowSQL queries to analyze the data
  • Experience developing Spark programs in Scala to perform data transformations, create Datasets and DataFrames, and write Spark SQL queries, Spark Streaming jobs, and windowed streaming applications
  • Loaded and transformed sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries
  • Used AVRO, Parquet file formats for serialization of data
  • Good understanding of monitoring and managing the Hadoop cluster through Cloudera
  • Involved in testing and bug fixing
  • Involved in performance tuning the application at various levels, Hive, Spark, etc
  • Worked in scrum/Agile environment, using tools such as JIRA
  • Implemented Spark using Scala and SparkSql for faster testing and processing of data
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs with Scala
  • Used MongoDB to store data in JSON format and developed and tested many dashboard features using Python, Bootstrap, CSS, and JavaScript
  • Worked on NiFi data pipelines that are built for consuming data into Data Lake
  • Implemented PySpark scripts in Python to perform extraction of required data from the data sets and storing it on HDFS
  • Developed Python code to gather data from HBase and designed the solution for implementation using PySpark
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python
  • Used Spark with Python to create DataFrames on client request data for computing analytics and saved the results as text files for analytical report generation
  • Created an analytical report in Tableau Desktop on the number of client requests received per day
  • Environment: GCP, GCP Dataproc, Apache Beam, Airflow, Hadoop, Hive, Teradata, SAS, Spark, EMR, S3, Python, Sqoop, Snowflake, Spark SQL, SQL.
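
A minimal sketch of an Airflow DAG for the kind of GCP ETL jobs described above, assuming the Google provider package is installed; the bucket, project, dataset, and table names are hypothetical placeholders:

# Hedged sketch: load files from GCS into a BigQuery staging table, then run a
# transform query. Bucket, project, dataset, and table names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="gcs_to_bq_daily_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw",
        bucket="example-landing-bucket",
        source_objects=["daily/{{ ds }}/*.csv"],
        destination_project_dataset_table="example-project.staging.orders_raw",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    transform = BigQueryInsertJobOperator(
        task_id="transform",
        configuration={
            "query": {
                "query": (
                    "SELECT * FROM `example-project.staging.orders_raw` "
                    "WHERE order_total > 0"
                ),
                "destinationTable": {
                    "projectId": "example-project",
                    "datasetId": "analytics",
                    "tableId": "orders_clean",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )

    load_raw >> transform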

AAL, American Airlines, Fort Worth, TX

AWS Data Engineer
2017.06 - 2019.10 (2 years & 4 months)

Job overview

  • Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB
  • Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, and S3
  • Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, so suggestions can be made automatically, using Kinesis Data Firehose and an S3 data lake
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB
  • Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce EC2 resource costs (a minimal handler sketch follows this list)
  • Designed and developed an ETL data pipeline using a PySpark application to fetch data from legacy systems, third-party APIs, and social media sites
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala
  • Worked on spark applications and launching clusters with spark in EMR console
  • Imported data into Spark from a Kafka consumer group using Spark Streaming APIs
  • Implemented Spark RDD transformations to map business analysis and apply actions on top of transformations
  • Used Hadoop cluster for building the applications and for building, testing, deploying the applications
  • Worked with Sqoop import and export functionality to handle large data set transfers between a MySQL database and HDFS
  • Developed Spark SQL tables and queries to perform ad-hoc data analytics for the analyst team
  • Wrote and built Azkaban workflow jobs to automate the process
  • Monitored Spark clusters
  • Involved in developing and designing POCs using Scala and deployed on the Yarn cluster, compared the performance of Spark, with Hive and SQL/Teradata
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala
  • Involved in migrating Hive queries into Spark transformations using Data frames, Spark SQL, SQL Context, and Scala
  • Implemented test scripts to support test driven development and continuous integration
  • Performed data analytics and loaded data to Snowflake
  • Developed shell scripts to automate the deployment process
  • Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for prediction
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker
  • Worked with file formats TEXT, AVRO, PARQUET and SEQUENCE files
  • Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, Hadoop, Scala, PySpark, Spark SQL, Hive, Git, Spark cluster, Sqoop, UNIX shell scripting, Domo, Snowflake.
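
A minimal sketch of the kind of Boto3 Lambda handler described above for deregistering unused AMIs; region handling, pagination, and retention rules are simplified assumptions:

# Hedged sketch: deregister account-owned AMIs not referenced by any instance.
# Pagination and per-region iteration are omitted for brevity.
import boto3

ec2 = boto3.client("ec2")


def lambda_handler(event, context):
    # AMIs owned by this account
    owned_images = ec2.describe_images(Owners=["self"])["Images"]

    # AMI IDs currently referenced by EC2 instances
    in_use = set()
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            in_use.add(instance["ImageId"])

    deregistered = []
    for image in owned_images:
        if image["ImageId"] not in in_use:
            ec2.deregister_image(ImageId=image["ImageId"])
            deregistered.append(image["ImageId"])

    return {"deregistered": deregistered}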

Amigos Software Solutions

Azure Data Engineer
2015.09 - 2017.03 (1 year & 6 months)

Job overview

  • Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL
  • Migrate data from traditional database systems to Azure databases
  • Coordinated with external team members and other stakeholders to understand the impact of their changes and complete release work smoothly, helping avoid integration issues in the Explore.MS application
  • Analyze, design and build Modern data solutions using Azure PaaS service to support visualization of data
  • Understand current Production state of application and determine the impact of new implementation on existing business processes
  • Coordinated with external team members and other stakeholders to understand the impact of their changes and complete release work smoothly, helping avoid integration issues in the VL-In-Box application
  • Review of VL-In-Box application’s test plan and test cases during System Integration and User Acceptance testing
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics)
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks
  • Design and implement migration strategies for traditional systems on Azure (lift and shift/Azure Migrate and other third-party tools)
  • Used Azure Synapse to manage processing workloads and served data for BI and prediction needs
  • Experience in DWH/BI project implementation using Azure Data Factory
  • Interacts with Business Analysts, Users, and SMEs on elaborating requirements
  • Design and implement end-to-end data solutions (storage, integration, processing, and visualization) in Azure
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including a write-back tool, in both directions
  • Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster
  • Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism, and appropriate memory settings
  • Performed ETL using Azure Databricks (a minimal PySpark sketch follows this list)
  • Migrated on-premises Oracle ETL processes to Azure Synapse Analytics
  • Wrote UDFs in Scala and PySpark to meet specific business requirements
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity
  • Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS)
  • Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure
  • Set up and maintained Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory
  • Develop conceptual solutions & create proofs-of-concept to demonstrate viability of solutions
  • Implemented the Copy activity and custom Azure Data Factory pipeline activities
  • Responsible for creating Requirements Documentation for various projects
  • Environment: Azure SQL, Azure Storage Explorer, Azure Blob Storage, Azure Backup, Azure Files, Azure Data Lake Storage, SQL Server Management Studio, Visual Studio, VSTS, Azure Blob, Power BI, PowerShell, C#, .NET, SSIS, DataGrid, ETL (Extract, Transform, and Load), Business Intelligence (BI)
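
A minimal sketch of the kind of Azure Databricks ETL described above, assuming raw CSV files in Azure Data Lake Storage Gen2; the storage account, container, and column names are hypothetical placeholders:

# Hedged sketch: read raw CSVs from ADLS Gen2, clean them, and write Parquet
# for downstream Synapse / Power BI use. Paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls_orders_etl").getOrCreate()

# Raw files landed in Azure Data Lake Storage Gen2 (abfss path is illustrative)
raw_path = "abfss://landing@exampleaccount.dfs.core.windows.net/orders/"

orders = (
    spark.read.option("header", "true").csv(raw_path)
    .withColumn("order_total", F.col("order_total").cast("double"))
    .filter(F.col("order_total") > 0)
)

# Write the cleaned data as Parquet to the curated zone
orders.write.mode("overwrite").parquet(
    "abfss://curated@exampleaccount.dfs.core.windows.net/orders_clean/"
)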

Ceequence Technologies, Hyderabad

Big Data/Hadoop Engineer
2013.05 - 2015.08 (2 years & 3 months)

Job overview

  • Involved in Agile development methodology as an active member in scrum meetings
  • Worked in Azure environment for development and deployment of Custom Hadoop Applications
  • Designed and implemented scalable Cloud Data and Analytical architecture solutions for various public and private cloud platforms using Azure
  • Involved in the end-to-end process of Hadoop jobs that used various technologies such as Sqoop, Pig, Hive, MapReduce, Spark, and shell scripts
  • Implemented various Azure platforms such as Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HD Insight, Azure Data Lake and Data Factory
  • Extracted and loaded data into Data Lake environment (MS Azure) by using Sqoop which was accessed by business users
  • Installed Hadoop, MapReduce, HDFS, and Azure components to develop multiple MapReduce jobs in Pig and Hive for data cleansing and pre-processing
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data
  • Developed a Spark job in Java which indexes data into Elastic Search from external Hive tables which are in HDFS
  • Performed transformations, cleaning and filtering on imported data using Hive, Map Reduce, and loaded final data into HDFS
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, Pair RDDs, and Spark YARN (a minimal Spark SQL sketch follows this list)
  • Imported data from different sources such as HDFS and HBase into Spark RDDs and developed a data pipeline using Kafka and Storm to store data into HDFS
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala, as well as to NoSQL databases such as HBase and Cassandra
  • Documented the requirements including the available code which should be implemented using Spark, Hive, HDFS, HBase and Elastic Search
  • Performed transformations such as event joins, bot-traffic filtering, and some pre-aggregations using Pig
  • Explored MLlib algorithms in Spark to understand the possible machine learning functionality that could be used for our use case
  • Used Windows Azure SQL Reporting Services to create reports with tables, charts, and maps
  • Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements
  • Configured Oozie workflow to run multiple Hive and Pig jobs which run independently with time and data availability
  • Imported and exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
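
A minimal sketch of the kind of Spark SQL analysis over Parquet data described above, assuming the data already sits in HDFS; the path, view name, and column names are hypothetical placeholders:

# Hedged sketch: query Parquet data in HDFS with Spark SQL.
# HDFS path, view name, and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream_analysis").getOrCreate()

# Parquet files previously written to HDFS by the ingestion pipeline
events = spark.read.parquet("hdfs:///data/warehouse/clickstream/")
events.createOrReplaceTempView("clickstream")

# Daily page-view counts, the kind of aggregation handed to the BI team
daily_views = spark.sql(
    """
    SELECT event_date, COUNT(*) AS page_views
    FROM clickstream
    WHERE event_type = 'page_view'
    GROUP BY event_date
    ORDER BY event_date
    """
)
daily_views.show()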

Education

Bennett University New Delhi, India

Bachelor of Technology in Electrical, Electronics and Communications Engineering
2009.08 - 2013.05 (3 years & 9 months)


Skills

    Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, YARN, Apache Spark, Mahout, Sparklib


Timeline

Sr. GCP Data Engineer
Travelport
2022.05 - Current (2 years & 4 months)
GCP Data Engineer
Edward Jones
2019.11 - 2022.04 (2 years & 5 months)
AWS Data Engineer
AAL, American Airlines
2017.06 - 2019.10 (2 years & 4 months)
Azure Data Engineer
Amigos Software Solutions
2015.09 - 2017.03 (1 year & 6 months)
Big Data/Hadoop Engineer
Ceequence Technologies
2013.05 - 2015.08 (2 years & 3 months)
Bennett University
Bachelor of Technology in Electrical, Electronics and Communications Engineering
2009.08 - 2013.05 (3 years & 9 months)