
Praveen Thoomati

Chicago, IL

Summary

● Over 11 years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer.

● Configured Zookeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.

● Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to support scalable, optimized analytics for business users in Azure.

● Generated JSON scripts and wrote UNIX shell scripts to invoke Sqoop import/export jobs.

● Performed exploratory data analysis and data cleaning with Python.

● Good knowledge of NoSQL databases such as HBase, Cassandra, and MongoDB.

● Extensive experience in IT data analytics projects; hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.

● Implemented a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and Zookeeper-based log collection platform.

● Experience importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables; good knowledge of streaming applications using Apache Kafka.

● Experience in writing complex MapReduce jobs and Hive data modeling.

● Work experience with cloud infrastructure such as Amazon Web Services (AWS) and Azure.

● Expertise with AWS cloud services such as EMR, S3, Redshift, AWS Glue, and CloudWatch for big data development.

● Experience in fine-tuning MapReduce jobs for better scalability and performance and converting them to Spark.

● Experience working with Spark RDDs, DataFrames, and Datasets using file formats such as JSON, Avro, and Parquet, along with compression techniques.

● Worked extensively on enrichment/ETL in real-time streaming jobs using PySpark Streaming and Spark SQL, loading the results into HBase.

● Experienced in working with big data technologies like Spark Core, Spark SQL.

● Built Lambda functions for pre-processing data and post-processing Glue job results.

● Used S3 events to trigger Step Functions workflows when new data arrives in a specific S3 bucket.

● Implemented Step Functions to create new state machines and start the execution of Glue jobs.

● Orchestrated workflows using AWS Step Functions to chain multiple AWS services such as Lambda and Glue in a defined sequence.

● Experience working with Apache Flink and supporting jobs that consume JSON data from Apache Pulsar.

● Experience working with Apache Flink DataStreams and loading the data into PostgreSQL.

● Experience in GCP Dataproc, GCS, Cloud functions, BigQuery.

● Worked with relational SQL and NoSQL data stores, including Oracle, Hive, and HBase, and with Sqoop for data transfer.

● Designed and executed Oozie workflows to schedule Sqoop and Hive job actions for extracting, transforming, and loading data.

● Migrated databases to Azure SQL and performed the related performance tuning.

● Experienced with Hadoop/Hive on AWS, using both EMR and non-EMR Hadoop on EC2.

● Experience developing Kafka producers and consumers that stream millions of events per second.

● Developed ETL/ELT pipelines using Apache Spark on Azure Databricks, including data cleaning, data enrichment, and data aggregation using Spark SQL and Spark DataFrames.

● Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.

● Implemented sentiment analysis and text analytics on Twitter social media feeds and market news using Scala and Python.

● Provided production support for ETL and reporting systems, investigating and resolving issues and maintaining system stability and availability.

● Keen to learn the newer technology stack that Google Cloud Platform (GCP) continues to add.

● Hands on experience in installing, configuring Cloudera Apache Hadoop ecosystem components like Flume, HBase, Zookeeper, Oozie, Hive, Sqoop and Pig.

● Installed Hadoop, MapReduce, HDFS, and AWS components and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.

● Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka (a minimal sketch follows at the end of this summary).

● Pipeline development skills with Apache Airflow, Kafka, and NiFi.

● Extensive use of open-source languages including Python, Scala, and Java.

● Migrated projects from Cloudera Hadoop Hive storage to Azure Data Lake Store to satisfy the Confidential transformation strategy.

● Performed data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.

● Used the Spark DataFrame API in Scala to analyze data.
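
A minimal sketch of the Spark-Streaming-with-Kafka pattern referenced above, written as a PySpark Structured Streaming job; the broker address, topic name, event schema, and HDFS paths are illustrative placeholders rather than details from any specific engagement:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Requires the spark-sql-kafka connector package on the classpath
    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Illustrative schema for incoming JSON events (an assumption)
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Read the stream from Kafka (broker and topic are placeholders)
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "events")
           .load())

    parsed = (raw
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Append the parsed stream to HDFS as Parquet, with checkpointing for fault tolerance
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()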

Overview

11+ years of professional experience

Work History

Senior Data Engineer

AgFirst
01.2024 - Current
  • Architect and implement ETL and data movement solutions using Azure Data Factory, SSIS
  • Understand business requirements, analyze them, and translate them into application and operational requirements
  • Designed a one-time load strategy for moving large databases to Azure SQL DWH
  • Extract, transform, and load data from source systems to Azure data storage services using Azure Data Factory and HDInsight
  • Created a framework for data profiling, cleansing, automatic restartability of batch pipelines, and rollback handling
  • Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL
  • Lead a team of six developers to migrate the application
  • Implemented masking and encryption techniques to protect sensitive data
  • Building ADF pipelines to extract and manipulate data from Azure Blob storage/Azure Data Lake/CosmosDB/SQL Server on cloud
  • Extensively used Azure Databricks for data validations and analysis on Cosmos structured streams
  • Developed mapping documents to map columns from source to target
  • Created Azure Data Factory (ADF) pipelines using Azure Blob storage
  • Performed ETL using Azure Databricks
  • Migrated on-premises Oracle ETL processes to Azure Synapse Analytics
  • Involved in migration of large amounts of data from OLTP to OLAP using ETL packages
  • Worked on Python scripting to automate generation of scripts
  • Performed data curation using Azure Databricks
  • Worked with Azure Databricks, PySpark, HDInsight, Azure SQL DW, and Hive to load and transform data (see the sketch at the end of this role)
  • Implemented and developed Hive bucketing and partitioning
  • Implemented Kafka and Spark Structured Streaming for real-time data ingestion
  • Used Azure Data Lake as a source and pulled data using Azure Blob storage
  • Good experience working on analysis tools like Tableau, Splunk for regression analysis, pie charts and bar graphs
  • Developed reports, dashboards using Tableau for quick reviews to be presented to Business and IT users
  • Used stored procedure, lookup, execute pipeline, data flow, copy data, and Azure Function activities in ADF
  • Worked on creating a star schema for drill-down analysis
  • Created PySpark procedures, functions, and packages to load data
  • Extract, transform, and load data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics)
  • Data ingestion to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing of the data in Azure Databricks
  • Responsible for estimating the cluster size, monitoring, and troubleshooting the Spark Databricks cluster
  • Creating Databricks notebooks using SQL, Python and automated notebooks using jobs
  • Involved in running the Cosmos Scripts in Visual Studio 2017/2015 for checking the diagnostics
  • Creating Spark clusters and configuring high concurrency clusters using Azure Databricks to speed up the preparation of high-quality data
  • Create and maintain optimal data pipeline architecture in cloud Microsoft Azure using Data Factory and Azure Databricks
  • Environment: ADF, Azure Databricks, Azure Data Lake, Spark, Hive, HBase, Sqoop, Flume, Blob Storage, Cosmos DB, MapReduce, HDFS, Cloudera, SQL, Apache Kafka, Azure, Python, Power BI, Unix, SQL Server.
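
A minimal sketch of the Databricks load-and-transform step referenced above, written in PySpark; the ADLS Gen2 storage account, container, and column names are placeholders, and on Databricks the Spark session would already be provided by the runtime:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # On Databricks the session already exists; created here so the sketch is self-contained
    spark = SparkSession.builder.appName("adls-curation").getOrCreate()

    # Placeholder ADLS Gen2 paths
    src_path = "abfss://raw@examplestorage.dfs.core.windows.net/sales/"
    out_path = "abfss://curated@examplestorage.dfs.core.windows.net/sales_daily/"

    raw = spark.read.parquet(src_path)

    # Basic validation and curation: drop malformed rows and aggregate per day and region
    daily = (raw
             .filter(F.col("order_id").isNotNull())
             .withColumn("order_date", F.to_date("order_ts"))
             .groupBy("order_date", "region")
             .agg(F.sum("amount").alias("total_amount"),
                  F.countDistinct("order_id").alias("order_count")))

    # Write the curated output partitioned by date
    daily.write.mode("overwrite").partitionBy("order_date").parquet(out_path)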

Senior Data Engineer

UBS
09.2022 - 12.2023
  • Designed solutions to process high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem tools such as Hive, Sqoop, and Kafka, plus Python, Spark, Scala, NoSQL, and NiFi
  • Optimized ETL workflows by orchestrating data extraction, transformation, and loading processes using AWS Step Functions, leading to improved performance and scalability
  • Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS
  • Used AWS Athena extensively to ingest structured data from S3 into other systems such as Redshift and to produce reports
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done
  • Designed and developed end-to-end ETL processing from Oracle to AWS using Amazon S3, EMR, and Spark
  • Generated scripts in AWS Glue to transfer data and utilized AWS Glue to run ETL jobs and aggregations in PySpark code (see the sketch at the end of this role)
  • Extensive experience with ETL tools such as IBM DataStage and Informatica IICS for efficient data integration, transformation, and loading; skilled in deploying and managing Google Cloud services using Terraform, ensuring seamless scalability and reliability
  • Converted Hive/SQL queries into Spark transformations using Spark RDD and Python and utilized Streams and Lambda expressions available as part of Java 8 to store and process data, improving service performance
  • Develop and maintain data models for the data warehouse, considering star schema and snowflake schema
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS)
  • Developed the PySpark code for AWS Glue jobs and for EMR
  • Worked on publishing interactive data visualization dashboards and reports in Tableau
  • Designed and implemented big data ingestion pipelines to ingest multi-TB data from various data sources using Kafka and Spark Streaming, including data quality checks and transformations, storing the results in efficient storage formats; performed data wrangling on multi-terabyte datasets from various data sources for a variety of downstream purposes such as analytics using PySpark
  • Implemented exception handling in Python to add logging to the application
  • Created comprehensive documentation detailing HubSpot and PostgreSQL integration workflows, ETL processes, and AWS architecture
  • Designed continuous data flows to handle real-time data ingestion from HubSpot into AWS S3 and Redshift
  • Established data governance policies using AWS Glue Data Catalog to manage HubSpot and PostgreSQL metadata
  • Leveraged NumPy and Pandas for statistical analysis of datasets
  • Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
  • Analyze the existing application programs and tune SQL queries using execution plan, query analyzer, SQL Profiler and database engine tuning advisor to enhance performance
  • Created an end-to-end data workflow automation solution using AWS Step Functions, integrating data sources, processing steps, and storage solutions to streamline operations
  • Created various complex SSIS/ETL packages to Extract, Transform and Load data
  • Developed PL/SQL code to manipulate and organize data within Oracle databases, ensuring standardized formatting across attributes
  • Implemented automated validation checks for HubSpot data using AWS Glue and Lambda functions to ensure data integrity
  • Designed a scalable architecture to handle increasing volumes of HubSpot data, leveraging AWS services like EMR and DynamoDB
  • Integrating PostgreSQL with cloud-based services and storage solutions
  • Utilizing MongoDB features like replica sets to scale read operations horizontally across multiple nodes
  • Setting up Change Data Capture processes to track and capture changes in source data for incremental updates to MongoDB
  • Optimized SQL queries within DBT to ensure efficient and performant data transformations
  • Adapt Data Vault structures to accommodate changes in source systems or evolving business requirements
  • Developed Kubernetes deployment configurations (YAML manifests) to define the desired state of applications, services, and infrastructure components
  • Collaborated with cross-functional teams to establish a robust framework, facilitating the automated generation of daily ad-hoc reports for timely decision-making
  • Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
  • Environment: Hadoop, Lambda, Oracle, Kafka, Python, Informatica, Snowflake Schema, SSRS, MySQL, DynamoDB, PostgreSQL, DBT, Data Vault, Tableau, Kubernetes, GitHub
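
A minimal sketch of an AWS Glue PySpark job of the kind described above (aggregation over data landed in S3); the bucket names, paths, and column names are placeholders, and the Glue setup follows the standard job template rather than any specific project's code:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    # Standard Glue job boilerplate
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read raw data from S3 (placeholder path), aggregate, and write results back to S3
    orders = spark.read.parquet("s3://example-raw-bucket/orders/")
    summary = (orders.groupBy("customer_id")
                     .agg(F.sum("amount").alias("total_spend"),
                          F.count("order_id").alias("order_count")))
    summary.write.mode("overwrite").parquet("s3://example-curated-bucket/order_summary/")

    job.commit()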

Data Engineer/PySpark Developer

State Street
11.2021 - 08.2022
  • Worked on pipelines to source data from Kafka topics and push to Postgres DB
  • Implemented PySpark pipelines in Argo CD to load data from Ericsson/Nokia/Samsung Kafka topics
  • Identified root causes and resolutions for aborted jobs on a daily basis and fixed the issues
  • Profiling the Kubernetes resources based on resource utilization
  • Designed, developed, and deployed Data Lakes, Data Marts and Datawarehouse using AWS cloud like AWS S3, AWS RDS and AWS Redshift and terraform
  • Designed and implemented solutions for sourcing healthcare data using FHIR standards into AWS cloud infrastructure for scalable and compliant data storage
  • Automated the extraction, transformation, and loading (ETL) of FHIR data from on-premise systems to AWS services, ensuring seamless data migration and normalization
  • Enabled healthcare analytics by integrating FHIR data into AWS Redshift and Amazon Athena, facilitating advanced querying and reporting on healthcare records
  • Developed and deployed outcomes using Spark and Scala code on a Hadoop cluster running on GCP
  • Designed, developed, and deployed the data warehouse on AWS Redshift, applying best practices
  • Designed, developed, and deployed both batch and streaming pipelines using Confluent Kafka, Kubernetes, Docker, and AWS Cloud
  • Built data pipelines in Airflow on GCP for ETL-related jobs using a range of Airflow operators, both legacy and newer
  • Developed Snowflake connectors to pull data from Snowflake and run ETL
  • Created Airflow DAGs to extract data from Snowflake and load it into our data warehouse (see the sketch at the end of this role)
  • Built self-service data pipelines using AWS services such as SNS, Step Functions, Lambda, Glue, EMR, EC2, Athena, SageMaker, QuickSight, and Redshift
  • Moved large amounts of data from AWS S3 buckets to AWS Redshift using Glue and EMR
  • Analyzed large and critical datasets using EMR, Glue and Spark
  • Worked on a Flink project that consumes JSON data from Apache Pulsar and loads it into Postgres
  • Respond promptly to any job failures or incidents, diagnose root causes, and implement corrective actions to minimize downtime and service disruptions
  • Proactive communication and setting up war-room calls with platform teams to fix K8s cluster issues
  • Implemented a Flink application to consume data from Kafka topics, perform transformations based on business logic, and produce the results into Apache Pulsar for consumption by the data science team
  • Supported SLA-bound jobs and ensured job-completion notifications were sent to leadership
  • Optimized Spark loads based on the data pulled from the base tables
  • Developed a Python API to get KML files from Storm Prediction Center archives
  • Performed data wrangling using Beautiful Soup and converted the results to pandas DataFrames to pull tornado-affected regions
  • Plotted the data points using Matplotlib to display the affected cell towers in a region
  • Clean, transform, and format data to ensure accuracy and consistency for analysis
  • Skills: AWS, GCP, PySpark, Apache Spark, Spring Boot, Apache Flink, Python, Control-M, Kubernetes, Docker, GitLab, CI/CD
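
A minimal sketch of the Airflow-DAG-pulling-from-Snowflake pattern referenced above, using a plain PythonOperator and the Snowflake Python connector; connection parameters, the query, and the downstream load step are placeholders (in practice credentials would come from an Airflow connection or secrets backend):

    from datetime import datetime

    import snowflake.connector
    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract_from_snowflake(**_):
        # Placeholder connection parameters
        conn = snowflake.connector.connect(
            account="example_account",
            user="example_user",
            password="example_password",
            warehouse="COMPUTE_WH",
            database="ANALYTICS",
            schema="PUBLIC",
        )
        try:
            cur = conn.cursor()
            # Placeholder incremental-extract query
            cur.execute(
                "SELECT id, amount, updated_at FROM orders "
                "WHERE updated_at >= CURRENT_DATE"
            )
            rows = cur.fetchall()
            # The load into the downstream warehouse would go here
            print(f"Extracted {len(rows)} rows from Snowflake")
        finally:
            conn.close()


    with DAG(
        dag_id="snowflake_extract_example",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(
            task_id="extract_from_snowflake",
            python_callable=extract_from_snowflake,
        )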

Data Engineer

NTT Data
07.2018 - 10.2021
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and built AWS Lambda functions in Java to perform event-driven processing
  • Used Kafka functionality such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds
  • Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data
  • Involved in Sqoop implementation, which helps load data from various RDBMS sources into Hadoop systems and vice versa
  • Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket (see the sketch at the end of this role)
  • Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference
  • Involved in configuring the Hadoop cluster and load balancing across the nodes
  • Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring, and debugging, and configuration of multiple nodes using the Hortonworks platform
  • Configured Spark Streaming to get ongoing information from Kafka and store the stream information in HDFS
  • Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses
  • Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment
  • Wrote scripts for Location Analytics project deployment on a Linux cluster/farm and AWS Cloud deployment using Python
  • Worked extensively on Informatica partitioning when dealing with huge volumes of data
  • Used Teradata external loaders such as MultiLoad, TPump, and FastLoad in Informatica to load data into the Teradata database
  • Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data
  • Created several types of data visualizations using Python and Tableau
  • Extracted large datasets from AWS using SQL queries to create reports
  • Experience in using Avro, Parquet, RCFile and JSON file formats, developed UDFs in Hive and Pig
  • Involved in loading data from rest endpoints to Kafka Producers and transferring the data to Kafka Brokers
  • Developed Pre-processing job using Spark Data frames to flatten JSON documents to flat file
  • Environment: Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Flume, Spark, Impala, Cassandra, HDFS, Scala, Spark RDD, Spark SQL, Kafka.
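
A minimal sketch of the CSV-to-S3 loader mentioned above, using boto3; the bucket name, key prefix, and local directory are placeholders:

    import os

    import boto3

    BUCKET = "example-landing-bucket"   # placeholder bucket name
    PREFIX = "incoming/csv"             # placeholder key prefix
    LOCAL_DIR = "/data/exports"         # placeholder local directory

    s3 = boto3.client("s3")

    # Upload every CSV file in the local directory to the target S3 prefix
    for name in os.listdir(LOCAL_DIR):
        if not name.endswith(".csv"):
            continue
        local_path = os.path.join(LOCAL_DIR, name)
        key = f"{PREFIX}/{name}"
        s3.upload_file(local_path, BUCKET, key)
        print(f"Uploaded {local_path} to s3://{BUCKET}/{key}")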

Data Engineer/SQL Developer, Manager

Luxoft Pvt Ltd
02.2016 - 06.2018
  • Created, manipulated, and supported SQL Server databases
  • Involved in data modeling and the physical and logical design of the database
  • Helped integrate the front end with the SQL Server backend
  • Created stored procedures, triggers, indexes, user-defined functions, constraints, etc., on various database objects to obtain the required results
  • Imported and exported data from one server to other servers using tools such as Data Transformation Services (DTS)
  • Wrote T-SQL statements for data retrieval and was involved in performance tuning of T-SQL queries (see the sketch at the end of this role)
  • Transferred data from various data sources/business systems, including MS Excel, MS Access, and flat files, to SQL Server using SSIS/DTS, leveraging features such as data conversion
  • Created derived columns from existing columns per the given requirements
  • Supported the team in resolving SQL Reporting Services and T-SQL issues; proficient in creating and formatting report types such as crosstab, conditional, drill-down, top-N, summary, form, OLAP, and sub-reports
  • Provided application support via phone
  • Developed and tested Windows command files and SQL Server queries for Production database monitoring in 24/7 support
  • Created logging for ETL load at package level and task level to log number of records processed by each package and each task in a package using SSIS
  • Developed, monitored and deployed SSIS packages
  • Environment: IBM WebSphere DataStage EE 7.0/6.0 (Designer, Director, Administrator), Ascential ProfileStage 6.0, Ascential QualityStage 6.0, Erwin, TOAD, Autosys, Oracle 9i, PL/SQL, SQL, UNIX Shell Scripts, Sun Solaris, Windows 2000.
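
A minimal sketch of the kind of T-SQL retrieval described above, shown via Python and pyodbc purely for illustration; the ODBC driver, server, database, and table names are placeholders and not taken from the actual environment:

    import pyodbc

    # Placeholder connection string values
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=example-sql-server;"
        "DATABASE=ExampleDB;"
        "Trusted_Connection=yes;"
    )

    try:
        cursor = conn.cursor()
        # Parameterized T-SQL retrieval (table and columns are placeholders)
        cursor.execute(
            "SELECT TOP 100 OrderID, CustomerID, OrderDate "
            "FROM dbo.Orders WHERE OrderDate >= ?",
            "2018-01-01",
        )
        for row in cursor.fetchall():
            print(row.OrderID, row.CustomerID, row.OrderDate)
    finally:
        conn.close()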

Big Data Engineer

Amigos Software Solutions
08.2013 - 01.2016
  • Extracted feeds from social media sites such as Facebook, Twitter using Python scripts
  • Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files
  • Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables (see the sketch at the end of this role)
  • Involved in running all the Hive scripts through Hive on Spark and some through Spark SQL
  • Enhanced the data ingestion framework by creating more robust and secure data pipelines
  • Involved in the complete big data flow of the application, from data ingestion from upstream into HDFS to processing and analyzing the data in HDFS
  • Implemented reporting in PySpark, Zeppelin & querying through Airpal & AWS Athena
  • Wrote JUnit tests and integration test cases for those microservices
  • Work heavily with Python, C++, Spark, SQL, Airflow, and Looker
  • Developed star and snowflake schema-based dimensional models to build the data warehouse
  • Actively participated in data mapping activities for the data warehouse
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib
  • Worked on Amazon AWS concepts like EMR and EC2 web services for fast and efficient processing of Big Data
  • Environment: Hortonworks, Hadoop, HDFS, Pig, Sqoop, Hive, Oozie, Zookeeper, NoSQL, HBase, Shell Scripting, Scala, Spark, Spark SQL.
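
A minimal sketch of the Avro-to-partitioned-Parquet load referenced above, in PySpark with Hive support; the database, table, and partition column names are placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("avro-to-parquet")
             .enableHiveSupport()
             .getOrCreate())

    # Read from an existing Avro-backed Hive table (names are placeholders)
    src = spark.table("staging.events_avro")

    # Write into a partitioned, Snappy-compressed Parquet Hive table
    (src.write
        .mode("overwrite")
        .format("parquet")
        .option("compression", "snappy")
        .partitionBy("event_date")
        .saveAsTable("analytics.events_parquet"))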

Education

Bachelor of Science - Computer Science

NBKR INSTITUTE OF SCIENCE & TECHNOLOGY
Andhra Pradesh, India
05-2012

Skills

  • Git version control
  • ETL development
  • Big data processing
  • Python programming
  • Kafka streaming
  • NoSQL databases
  • Data pipeline design
  • Data modeling
  • API development
  • Hadoop ecosystem
  • Performance tuning
  • Data warehousing
  • Advanced SQL
  • Spark development
  • Machine learning
  • Data security
  • Data quality assurance
  • Metadata management
  • Scala programming
  • Real-time analytics
  • Data curating
  • Linux administration
  • Continuous integration
  • Data integration
  • SQL and databases
  • SQL programming
  • Database design
  • RDBMS
  • Data migration
  • Big data technologies
  • SQL transactional replications
  • Data governance
  • Data acquisitions
  • Amazon redshift
  • Large dataset management
  • Advanced data mining
  • Data operations
  • Data repositories
  • Teamwork and collaboration
  • Problem-solving
  • Time management
  • Organizational skills
