
Praveen Thoomati

Chicago, IL

Summary

● Over 11 years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer.

● Configured Zookeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.

● Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to support scalable, optimized analytics for business users in Azure.

● Generated JSON scripts and wrote UNIX shell scripts to invoke Sqoop import/export jobs.

● Performed exploratory data analysis and data cleaning with Python.

● Good knowledge of NoSQL databases such as HBase, Cassandra, and MongoDB.

● Extensive experience in IT data analytics projects; hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.

● Implemented a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and Zookeeper-based log collection platform.

● Experience importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables; good knowledge of streaming applications using Apache Kafka.

● Experience in writing complex MapReduce jobs and Hive data modeling.

● Work experience with cloud infrastructure such as Amazon Web Services (AWS) and Azure.

● Expertise with AWS cloud services such as EMR, S3, Redshift, AWS Glue, and CloudWatch for big data development.

● Experience in fine-tuning MapReduce jobs for better scalability and performance and converting them to Spark.

● Experience working with Spark RDDs, DataFrames, and Datasets using file formats such as JSON, Avro, and Parquet, along with compression techniques.

● Worked extensively on enrichment/ETL in real-time streaming jobs using PySpark Streaming and Spark SQL, loading the results into HBase.

● Experienced in working with big data technologies like Spark Core, Spark SQL.

● Built Lambda functions for pre-processing data and post-processing Glue job results.

● Used S3 events to trigger Step Functions workflows when new data arrives in a specific S3 bucket.

● Implemented Step Functions to create new state machines and start the execution of Glue jobs.

● Orchestrated workflows using AWS Step Functions to chain multiple AWS services such as Lambda and Glue in a defined sequence.

● Experience working with Apache Flink and supporting jobs that consume JSON data from Apache Pulsar.

● Experience working with Apache Flink DataStreams and loading the data into PostgreSQL.

● Experience in GCP Dataproc, GCS, Cloud functions, BigQuery.

● Worked with relational SQL and NoSQL data stores, including Oracle, Hive, and HBase, and with Sqoop for data transfer.

● Designed and executed Oozie workflows to schedule Sqoop and Hive job actions for extracting, transforming, and loading data.

● Migrated databases to Azure SQL and performed the related performance tuning.

● Experienced with Hadoop/Hive on AWS, using both EMR and non-EMR Hadoop on EC2.

● Experience developing Kafka producers and consumers that stream millions of events per second.

● Developed ETL/ELT pipelines using Apache Spark on Azure Databricks, including data cleaning, data enrichment, and data aggregation using Spark SQL and Spark DataFrames.

● Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.

● Implemented sentiment analysis and text analytics on Twitter social media feeds and market news using Scala and Python.

● Provided production support for ETL and reporting systems, investigating and resolving issues and maintaining system stability and availability.

● Keen to learn the newer technology stack that Google Cloud Platform (GCP) continues to add.

● Hands on experience in installing, configuring Cloudera Apache Hadoop ecosystem components like Flume, HBase, Zookeeper, Oozie, Hive, Sqoop and Pig.

● Installed Hadoop, MapReduce, HDFS, and AWS components and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.

● Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka (a minimal sketch follows at the end of this summary).

● Pipeline development skills with Apache Airflow, Kafka, and NiFi.

● Extensive use of open-source languages including Python, Scala, and Java.

● Migrated projects from Cloudera Hadoop Hive storage to Azure Data Lake Store to satisfy the Confidential transformation strategy.

● Performed data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.

● Used the Spark DataFrame API in Scala to analyze data.
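
A minimal sketch of the Spark-Streaming-with-Kafka pattern referenced above, written as a PySpark Structured Streaming job; the broker address, topic name, event schema, and HDFS paths are illustrative placeholders rather than details from any specific engagement:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Requires the spark-sql-kafka connector package on the classpath
    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Illustrative schema for incoming JSON events (an assumption)
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Read the stream from Kafka (broker and topic are placeholders)
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "events")
           .load())

    parsed = (raw
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    # Append the parsed stream to HDFS as Parquet, with checkpointing for fault tolerance
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()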

Overview

11+ years of professional experience

Work History

Senior Data Engineer

AgFirst
01.2024 - Current
  • Architect and implement ETL and data movement solutions using Azure Data Factory, SSIS
  • Understand business requirements, analyze them, and translate them into application and operational requirements
  • Designed a one-time load strategy for moving large databases to Azure SQL DWH
  • Extract, transform, and load data from source systems to Azure data storage services using Azure Data Factory and HDInsight
  • Created a framework for data profiling, cleansing, automatic restartability of batch pipelines, and rollback handling
  • Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL
  • Lead a team of six developers to migrate the application
  • Implemented masking and encryption techniques to protect sensitive data
  • Building ADF pipelines to extract and manipulate data from Azure Blob storage/Azure Data Lake/CosmosDB/SQL Server on cloud
  • Extensively used Azure Databricks for data validations and analysis on Cosmos structured streams
  • Developed mapping documents to map columns from source to target
  • Created Azure Data Factory (ADF) pipelines using Azure Blob storage
  • Performed ETL using Azure Databricks
  • Migrated on-premises Oracle ETL processes to Azure Synapse Analytics
  • Involved in migration of large amounts of data from OLTP to OLAP using ETL packages
  • Worked on Python scripting to automate generation of scripts
  • Performed data curation using Azure Databricks
  • Worked with Azure Databricks, PySpark, HDInsight, Azure SQL DW, and Hive to load and transform data (see the sketch at the end of this role)
  • Implemented and developed Hive bucketing and partitioning
  • Implemented Kafka and Spark Structured Streaming for real-time data ingestion
  • Used Azure Data Lake as a source and pulled data using Azure Blob storage
  • Good experience working on analysis tools like Tableau, Splunk for regression analysis, pie charts and bar graphs
  • Developed reports, dashboards using Tableau for quick reviews to be presented to Business and IT users
  • Used stored procedure, lookup, execute pipeline, data flow, copy data, and Azure Function activities in ADF
  • Worked on creating a star schema for drill-down analysis
  • Created PySpark procedures, functions, and packages to load data
  • Extract, transform, and load data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics)
  • Data ingestion to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing of the data in Azure Databricks
  • Responsible for estimating the cluster size, monitoring, and troubleshooting the Spark Databricks cluster
  • Creating Databricks notebooks using SQL, Python and automated notebooks using jobs
  • Involved in running the Cosmos Scripts in Visual Studio 2017/2015 for checking the diagnostics
  • Creating Spark clusters and configuring high concurrency clusters using Azure Databricks to speed up the preparation of high-quality data
  • Create and maintain optimal data pipeline architecture in cloud Microsoft Azure using Data Factory and Azure Databricks
  • Environment: ADF, Azure Databricks, Azure Data Lake, Spark, Hive, HBase, Sqoop, Flume, Blob Storage, Cosmos DB, MapReduce, HDFS, Cloudera, SQL, Apache Kafka, Azure, Python, Power BI, Unix, SQL Server.
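
A minimal sketch of the Databricks load-and-transform step referenced above, written in PySpark; the ADLS Gen2 storage account, container, and column names are placeholders, and on Databricks the Spark session would already be provided by the runtime:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # On Databricks the session already exists; created here so the sketch is self-contained
    spark = SparkSession.builder.appName("adls-curation").getOrCreate()

    # Placeholder ADLS Gen2 paths
    src_path = "abfss://raw@examplestorage.dfs.core.windows.net/sales/"
    out_path = "abfss://curated@examplestorage.dfs.core.windows.net/sales_daily/"

    raw = spark.read.parquet(src_path)

    # Basic validation and curation: drop malformed rows and aggregate per day and region
    daily = (raw
             .filter(F.col("order_id").isNotNull())
             .withColumn("order_date", F.to_date("order_ts"))
             .groupBy("order_date", "region")
             .agg(F.sum("amount").alias("total_amount"),
                  F.countDistinct("order_id").alias("order_count")))

    # Write the curated output partitioned by date
    daily.write.mode("overwrite").partitionBy("order_date").parquet(out_path)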

Senior Data Engineer

UBS
09.2022 - 12.2023
  • Designed solutions to process high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem tools such as Hive, Sqoop, and Kafka, plus Python, Spark, Scala, NoSQL, and NiFi
  • Optimized ETL workflows by orchestrating data extraction, transformation, and loading processes using AWS Step Functions, leading to improved performance and scalability
  • Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS
  • Used AWS Athena extensively to ingest structured data from S3 into other systems such as Redshift and to produce reports
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done
  • Designed and developed end-to-end ETL processing from Oracle to AWS using Amazon S3, EMR, and Spark
  • Generated scripts in AWS Glue to transfer data and utilized AWS Glue to run ETL jobs and aggregations in PySpark code (see the sketch at the end of this role)
  • Extensive experience with ETL tools such as IBM DataStage and Informatica IICS for efficient data integration, transformation, and loading; skilled in deploying and managing Google Cloud services using Terraform, ensuring seamless scalability and reliability
  • Converted Hive/SQL queries into Spark transformations using Spark RDD and Python and utilized Streams and Lambda expressions available as part of Java 8 to store and process data, improving service performance
  • Develop and maintain data models for the data warehouse, considering star schema and snowflake schema
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS)
  • Developed the PySpark code for AWS Glue jobs and for EMR
  • Worked on publishing interactive data visualization dashboards and reports in Tableau
  • Designed and implemented big data ingestion pipelines to ingest multi-TB data from various data sources using Kafka and Spark Streaming, including data quality checks and transformations, storing the results in efficient storage formats; performed data wrangling on multi-terabyte datasets from various data sources for a variety of downstream purposes such as analytics using PySpark
  • Implemented exception handling in Python to add logging to the application
  • Created comprehensive documentation detailing HubSpot and PostgreSQL integration workflows, ETL processes, and AWS architecture
  • Designed continuous data flows to handle real-time data ingestion from HubSpot into AWS S3 and Redshift
  • Established data governance policies using AWS Glue Data Catalog to manage HubSpot and PostgreSQL metadata
  • Leveraged NumPy and Pandas for statistical analysis of datasets
  • Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
  • Analyze the existing application programs and tune SQL queries using execution plan, query analyzer, SQL Profiler and database engine tuning advisor to enhance performance
  • Created an end-to-end data workflow automation solution using AWS Step Functions, integrating data sources, processing steps, and storage solutions to streamline operations
  • Created various complex SSIS/ETL packages to Extract, Transform and Load data
  • Developed PL/SQL code to manipulate and organize data within Oracle databases, ensuring standardized formatting across attributes
  • Implemented automated validation checks for HubSpot data using AWS Glue and Lambda functions to ensure data integrity
  • Designed a scalable architecture to handle increasing volumes of HubSpot data, leveraging AWS services like EMR and DynamoDB
  • Integrating PostgreSQL with cloud-based services and storage solutions
  • Utilizing MongoDB features like replica sets to scale read operations horizontally across multiple nodes
  • Setting up Change Data Capture processes to track and capture changes in source data for incremental updates to MongoDB
  • Optimized SQL queries within DBT to ensure efficient and performant data transformations
  • Adapt Data Vault structures to accommodate changes in source systems or evolving business requirements
  • Developed Kubernetes deployment configurations (YAML manifests) to define the desired state of applications, services, and infrastructure components
  • Collaborated with cross-functional teams to establish a robust framework, facilitating the automated generation of daily ad-hoc reports for timely decision-making
  • Participated in the full software development lifecycle with requirements, solution design, development, QA implementation, and product support using Scrum and other Agile methodologies
  • Environment: Hadoop, Lambda, Oracle, Kafka, Python, Informatica, Snowflake Schema, SSRS, MySQL, DynamoDB, PostgreSQL, DBT, Data Vault, Tableau, Kubernetes, GitHub
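
A minimal sketch of an AWS Glue PySpark job of the kind described above (aggregation over data landed in S3); the bucket names, paths, and column names are placeholders, and the Glue setup follows the standard job template rather than any specific project's code:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    # Standard Glue job boilerplate
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read raw data from S3 (placeholder path), aggregate, and write results back to S3
    orders = spark.read.parquet("s3://example-raw-bucket/orders/")
    summary = (orders.groupBy("customer_id")
                     .agg(F.sum("amount").alias("total_spend"),
                          F.count("order_id").alias("order_count")))
    summary.write.mode("overwrite").parquet("s3://example-curated-bucket/order_summary/")

    job.commit()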

Data Engineer/PySpark Developer

State Street
11.2021 - 08.2022
  • Worked on pipelines to source data from Kafka topics and push to Postgres DB
  • Implemented PySpark pipelines in Argo CD to load data from Ericsson/Nokia/Samsung Kafka topics
  • Identified root causes and resolutions for aborted jobs on a daily basis and fixed the issues
  • Profiling the Kubernetes resources based on resource utilization
  • Designed, developed, and deployed Data Lakes, Data Marts and Datawarehouse using AWS cloud like AWS S3, AWS RDS and AWS Redshift and terraform
  • Designed and implemented solutions for sourcing healthcare data using FHIR standards into AWS cloud infrastructure for scalable and compliant data storage
  • Automated the extraction, transformation, and loading (ETL) of FHIR data from on-premise systems to AWS services, ensuring seamless data migration and normalization
  • Enabled healthcare analytics by integrating FHIR data into AWS Redshift and Amazon Athena, facilitating advanced querying and reporting on healthcare records
  • Developed and deployed outcomes using Spark and Scala code on a Hadoop cluster running on GCP
  • Designed, developed, and deployed the data warehouse on AWS Redshift, applying best practices
  • Designed, developed, and deployed both batch and streaming pipelines using Confluent Kafka, Kubernetes, Docker, and AWS Cloud
  • Built data pipelines in Airflow on GCP for ETL-related jobs using a range of Airflow operators, both legacy and newer
  • Developed Snowflake connectors to pull data from Snowflake and run ETL
  • Created Airflow DAGs to extract data from Snowflake and load it into our data warehouse (see the sketch at the end of this role)
  • Built self-service data pipelines using AWS services such as SNS, Step Functions, Lambda, Glue, EMR, EC2, Athena, SageMaker, QuickSight, and Redshift
  • Moved large amounts of data from AWS S3 buckets to AWS Redshift using Glue and EMR
  • Analyzed large and critical datasets using EMR, Glue and Spark
  • Worked on a Flink project that consumes JSON data from Apache Pulsar and loads it into Postgres
  • Respond promptly to any job failures or incidents, diagnose root causes, and implement corrective actions to minimize downtime and service disruptions
  • Proactive communication and setting up war-room calls with platform teams to fix K8s cluster issues
  • Implemented a Flink application to consume data from Kafka topics, perform transformations based on business logic, and produce the results into Apache Pulsar for consumption by the data science team
  • Supported SLA-bound jobs and ensured job-completion notifications were sent to leadership
  • Optimized Spark loads based on the data pulled from the base tables
  • Developed a Python API to get KML files from Storm Prediction Center archives
  • Performed data wrangling using Beautiful Soup and converted the results to pandas DataFrames to pull tornado-affected regions
  • Plotted the data points using Matplotlib to display the affected cell towers in a region
  • Clean, transform, and format data to ensure accuracy and consistency for analysis
  • Skills: AWS, GCP, PySpark, Apache Spark, Spring Boot, Apache Flink, Python, Control-M, Kubernetes, Docker, GitLab, CI/CD
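
A minimal sketch of the Airflow-DAG-pulling-from-Snowflake pattern referenced above, using a plain PythonOperator and the Snowflake Python connector; connection parameters, the query, and the downstream load step are placeholders (in practice credentials would come from an Airflow connection or secrets backend):

    from datetime import datetime

    import snowflake.connector
    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract_from_snowflake(**_):
        # Placeholder connection parameters
        conn = snowflake.connector.connect(
            account="example_account",
            user="example_user",
            password="example_password",
            warehouse="COMPUTE_WH",
            database="ANALYTICS",
            schema="PUBLIC",
        )
        try:
            cur = conn.cursor()
            # Placeholder incremental-extract query
            cur.execute(
                "SELECT id, amount, updated_at FROM orders "
                "WHERE updated_at >= CURRENT_DATE"
            )
            rows = cur.fetchall()
            # The load into the downstream warehouse would go here
            print(f"Extracted {len(rows)} rows from Snowflake")
        finally:
            conn.close()


    with DAG(
        dag_id="snowflake_extract_example",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(
            task_id="extract_from_snowflake",
            python_callable=extract_from_snowflake,
        )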

Data Engineer

NTT Data
07.2018 - 10.2021
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and built AWS Lambda functions in Java to perform event-driven processing
  • Used Kafka functionality such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds
  • Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data
  • Involved in Sqoop implementation, which helps load data from various RDBMS sources into Hadoop systems and vice versa
  • Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket (see the sketch at the end of this role)
  • Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference
  • Involved in configuring the Hadoop cluster and load balancing across the nodes
  • Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring, and debugging, and configuration of multiple nodes using the Hortonworks platform
  • Configured Spark Streaming to get ongoing information from Kafka and store the stream information in HDFS
  • Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses
  • Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment
  • Wrote scripts for Location Analytics project deployment on a Linux cluster/farm and AWS Cloud deployment using Python
  • Worked extensively on Informatica partitioning when dealing with huge volumes of data
  • Used Teradata external loaders such as MultiLoad, TPump, and FastLoad in Informatica to load data into the Teradata database
  • Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster processing of data
  • Created several types of data visualizations using Python and Tableau
  • Extracted large datasets from AWS using SQL queries to create reports
  • Experience in using Avro, Parquet, RCFile and JSON file formats, developed UDFs in Hive and Pig
  • Involved in loading data from rest endpoints to Kafka Producers and transferring the data to Kafka Brokers
  • Developed Pre-processing job using Spark Data frames to flatten JSON documents to flat file
  • Environment: Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Flume, Spark, Impala, Cassandra, HDFS, Scala, Spark RDD, Spark SQL, Kafka.
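
A minimal sketch of the CSV-to-S3 loader mentioned above, using boto3; the bucket name, key prefix, and local directory are placeholders:

    import os

    import boto3

    BUCKET = "example-landing-bucket"   # placeholder bucket name
    PREFIX = "incoming/csv"             # placeholder key prefix
    LOCAL_DIR = "/data/exports"         # placeholder local directory

    s3 = boto3.client("s3")

    # Upload every CSV file in the local directory to the target S3 prefix
    for name in os.listdir(LOCAL_DIR):
        if not name.endswith(".csv"):
            continue
        local_path = os.path.join(LOCAL_DIR, name)
        key = f"{PREFIX}/{name}"
        s3.upload_file(local_path, BUCKET, key)
        print(f"Uploaded {local_path} to s3://{BUCKET}/{key}")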

Data Engineer/SQL Developer, Manager

Luxoft Pvt Ltd
02.2016 - 06.2018
  • Created, manipulated, and supported SQL Server databases
  • Involved in data modeling and the physical and logical design of the database
  • Helped integrate the front end with the SQL Server backend
  • Created stored procedures, triggers, indexes, user-defined functions, constraints, etc., on various database objects to obtain the required results
  • Imported and exported data from one server to other servers using tools such as Data Transformation Services (DTS)
  • Wrote T-SQL statements for data retrieval and was involved in performance tuning of T-SQL queries (see the sketch at the end of this role)
  • Transferred data from various data sources/business systems, including MS Excel, MS Access, and flat files, to SQL Server using SSIS/DTS, leveraging features such as data conversion
  • Created derived columns from existing columns per the given requirements
  • Supported the team in resolving SQL Reporting Services and T-SQL issues; proficient in creating and formatting report types such as crosstab, conditional, drill-down, top-N, summary, form, OLAP, and sub-reports
  • Provided application support via phone
  • Developed and tested Windows command files and SQL Server queries for Production database monitoring in 24/7 support
  • Created logging for ETL load at package level and task level to log number of records processed by each package and each task in a package using SSIS
  • Developed, monitored and deployed SSIS packages
  • Environment: IBM WebSphere DataStage EE 7.0/6.0 (Designer, Director, Administrator), Ascential ProfileStage 6.0, Ascential QualityStage 6.0, Erwin, TOAD, Autosys, Oracle 9i, PL/SQL, SQL, UNIX Shell Scripts, Sun Solaris, Windows 2000.
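
A minimal sketch of the kind of T-SQL retrieval described above, shown via Python and pyodbc purely for illustration; the ODBC driver, server, database, and table names are placeholders and not taken from the actual environment:

    import pyodbc

    # Placeholder connection string values
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=example-sql-server;"
        "DATABASE=ExampleDB;"
        "Trusted_Connection=yes;"
    )

    try:
        cursor = conn.cursor()
        # Parameterized T-SQL retrieval (table and columns are placeholders)
        cursor.execute(
            "SELECT TOP 100 OrderID, CustomerID, OrderDate "
            "FROM dbo.Orders WHERE OrderDate >= ?",
            "2018-01-01",
        )
        for row in cursor.fetchall():
            print(row.OrderID, row.CustomerID, row.OrderDate)
    finally:
        conn.close()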

Big Data Engineer

Amigos Software Solutions
08.2013 - 01.2016
  • Extracted feeds from social media sites such as Facebook, Twitter using Python scripts
  • Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files
  • Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables (see the sketch at the end of this role)
  • Involved in running all the Hive scripts through Hive on Spark and some through Spark SQL
  • Enhanced the data ingestion framework by creating more robust and secure data pipelines
  • Involved in the complete big data flow of the application, from data ingestion from upstream into HDFS to processing and analyzing the data in HDFS
  • Implemented reporting in PySpark, Zeppelin & querying through Airpal & AWS Athena
  • Wrote JUnit tests and integration test cases for those microservices
  • Work heavily with Python, C++, Spark, SQL, Airflow, and Looker
  • Developed star and snowflake schema-based dimensional models to build the data warehouse
  • Actively participated in data mapping activities for the data warehouse
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib
  • Worked on Amazon AWS concepts like EMR and EC2 web services for fast and efficient processing of Big Data
  • Environment: Hortonworks, Hadoop, HDFS, Pig, Sqoop, Hive, Oozie, Zookeeper, NoSQL, HBase, Shell Scripting, Scala, Spark, Spark SQL.
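
A minimal sketch of the Avro-to-partitioned-Parquet load referenced above, in PySpark with Hive support; the database, table, and partition column names are placeholders:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("avro-to-parquet")
             .enableHiveSupport()
             .getOrCreate())

    # Read from an existing Avro-backed Hive table (names are placeholders)
    src = spark.table("staging.events_avro")

    # Write into a partitioned, Snappy-compressed Parquet Hive table
    (src.write
        .mode("overwrite")
        .format("parquet")
        .option("compression", "snappy")
        .partitionBy("event_date")
        .saveAsTable("analytics.events_parquet"))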

Education

Bachelor of Science - Computer Science

NBKR INSTITUTE OF SCIENCE & TECHNOLOGY
Andhra Pradesh, India
05-2012

Skills

  • Git version control
  • ETL development
  • Big data processing
  • Python programming
  • Kafka streaming
  • NoSQL databases
  • Data pipeline design
  • Data modeling
  • API development
  • Hadoop ecosystem
  • Performance tuning
  • Data warehousing
  • Advanced SQL
  • Spark development
  • Machine learning
  • Data security
  • Data quality assurance
  • Metadata management
  • Scala programming
  • Real-time analytics
  • Data curating
  • Linux administration
  • Continuous integration
  • Data integration
  • SQL and databases
  • SQL programming
  • Database design
  • RDBMS
  • Data migration
  • Big data technologies
  • SQL transactional replications
  • Data governance
  • Data acquisitions
  • Amazon redshift
  • Large dataset management
  • Advanced data mining
  • Data operations
  • Data repositories
  • Teamwork and collaboration
  • Problem-solving
  • Time management
  • Organizational skills
