Designed, developed, and maintained software solutions on a Hadoop cluster and its components using Cloudera, HDFS, YARN, PySpark, Airflow, Databricks, Azure, and UNIX shell scripting.
Migrated clickstream data from on-premises systems to Azure Storage and automated the jobs with Azure Data Factory and Databricks to run on a daily schedule.
Extracted, transformed, and loaded data from on-premises sources into Azure data storage services using a combination of Azure Data Factory, Spark, Spark SQL, and Azure Delta Lake.
Built a POC using Delta Live Tables to move workloads from on-premises to Azure, which involved extensive analysis and debugging sessions to apply cloudFiles and other Databricks techniques and features.
Analyzed Spark architecture during the POC, covering Spark Core, DataFrames, Spark Streaming, worker nodes, driver and executor memory, stages, autoscaling, and the execution hierarchy.
Built Delta Live Tables for streaming clickstream data in the Databricks environment, reading files in Delta format and saving them as Parquet in Azure Blob Storage.
Retrieved Azure cost and usage reports with Power BI, providing visualization charts of each job's average cost on a monthly and fiscal-calendar basis.
Project: CCPA
Worked on a California Consumer Privacy Act (CCPA) compliance project to remove Kroger customers' private data from target locations using SHA tokens and encryption techniques.
Developed an orchestration process in Airflow, scheduled with cron expressions for daily and weekly runs.
Retired projects from Google Cloud Platform (GCP) and verified that data and workflows were successfully migrated off GCP and Google Cloud Storage (GCS).
Built pipelines using NiFi and RabbitMQ services to make data available in different target systems.
Data Engineer
Worldpay
02.2018 - 07.2019
Worldpay Group is a payment processing company. The company provides payment services for mail order and Internet retailers, as well as point-of-sale transactions.
Designed a POC for building data marts in the Hadoop environment to retire legacy PL/SQL code.
Ingested transactional data from various source systems (e.g., BPM, FICO) into the Hadoop ecosystem using Hive, Pig, Spark, MapReduce, Impala, and Oozie workflows.
Developed Spark code in Scala with Spark SQL/Streaming to transform raw data and land it in HDFS.
Implemented Sqoop imports to load encrypted card data from RDBMS sources (SQL Server, DB2) onto a UNIX server; used Pig as an ETL tool for transformations, event joins, and pre-aggregations before storing the data in HDFS, then created Hive tables on top of it.
Performed SQL joins among Hive tables to produce input for the Spark batch process; migrated HiveQL queries on structured data to Spark SQL to improve performance.
Pulled data from Salesforce and applied ETL using IBM DataStage and Informatica, enforcing data quality rules while loading into a centralized Hadoop platform.
Developed workflows in Oozie and automated the tasks using the TWS scheduler.
Graduate Assistant
University Of Illinois, Springfield
01.2017 - 12.2017
As a graduate assistant, contributed to research and data analysis within the [Academic department].
Administered coursework, graded assignments and provided constructive feedback.
Assisted faculty members with data collection for potential academic publications.
Gathered, reviewed, and summarized literature from scientific databases such as SciFinder and PubMed, and produced graphs and performed other scientific calculations using MS Excel.