Zimpa Baiji

Irving, TX

Summary

  • 5+ years of experience in data engineering, data pipeline design, development, and implementation as a Data Engineer/Data Developer and Data Modeler.
  • Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
  • Experience developing Spark applications using RDD transformations, Spark Core, Spark SQL, Spark Streaming, and Spark MLlib, and refactoring existing Spark batch processes for different logs written in Scala.
  • Experience with Hadoop ecosystem components: MapReduce, HDFS, YARN/MRv2, Hive, HBase, Spark, Kafka, Sqoop, Flume, Avro, Solr, and ZooKeeper.
  • Experience developing MapReduce applications to analyze big data in different file formats.
  • Experience in data analysis using Hive and Impala.
  • Experience creating and running Docker images with multiple microservices.
  • Experience in data modeling and ETL processes in data warehouse environments, including star and snowflake schemas.
  • Experience performing structural modifications using MapReduce and Hive, and analyzing data with visualization/reporting tools (Tableau).
  • Experience with GitHub/Git source and version control systems.
  • Experience in Azure cloud services: Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight big data technologies (Hadoop and Apache Spark), and Databricks.
  • Experience designing Azure cloud architecture and implementation plans for hosting complex application workloads on Microsoft Azure.
  • Experience with Amazon Web Services (AWS), including EC2, S3, EMR, ElastiCache, DynamoDB, Redshift, and Aurora.
  • Experience developing Python and shell scripts to extract, load, and transform data.
  • Experience developing JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the Cosmos activity.
  • Experience using Sqoop to import and export data between RDBMS and HDFS/Hive.
  • Experience with multiple databases, including MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server.
  • Experience in Agile software development methodology.
  • Effective in cross-functional team environments, with excellent communication, interpersonal, and problem-solving skills; a team player with a can-do attitude, able to communicate with all levels of the organization, from technical staff to management and customers.

Overview

7 years of professional experience

Work History

Data Engineer

Verizon
07.2023 - Current
  • Involved in the analysis, design, and implementation/translation of business user requirements
  • Responsible for building a confidential data cube using the Spark framework, writing Spark SQL queries in Scala to improve data processing efficiency and reporting query response time
  • Developed Spark code using Scala and Spark SQL for faster testing and data processing
  • Developed Scala scripts using DataFrames/Datasets/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop
  • Developed Spark jobs using PySpark and Scala to create a generic framework to process files such as JSON, text, and CSV (see the illustrative sketch after this list)
  • Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations
  • Involved in developing transformations in Python that convert JSON into a structured relational format and apply the desired logic and conditions
  • Designed the ETL process and created the high-level design document including the logical data flows, source data extraction process, database staging, and the extract creation
  • Utilized capabilities of Tableau such as Data extracts, Data blending, Forecasting, Dashboard actions, and Table calculations
  • Responsible for writing a MapReduce job that joins the incoming slices of data and picks only the fields needed for further processing
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for analysis
  • Configured Spark Streaming to consume Kafka streams and store the data in HDFS
  • Created dashboards for analyzing POS data using Power BI
  • Implemented data ingestion and handling clusters in real-time processing using Kafka
  • Involved in developing DAGs using the Airflow orchestration tool and monitored the weekly processes
  • Designed various Azure Data Factory pipelines to pull data from various data sources and load it into an Azure SQL database
  • Used Stored Procedure, Lookup, Execute Pipeline, Data Flow, Copy Data, and Azure Function activities in ADF
  • Worked with Azure Databricks, PySpark, Spark SQL, Azure SQL Data Warehouse (ADW), and Hive to load and transform data
  • Used Azure Data Lake as a source and pulled data using PolyBase
  • Performed data cleaning and preparation of XML files
  • Created Hive tables with dynamic partitions and buckets for sampling, and worked on them using HiveQL
  • Created HBase tables to store variable data formats from different portfolios
  • Used SQL queries and other tools to perform data analysis and profiling
  • Implemented Agile Methodology for building the data applications and framework development
  • Actively participated in weekly iteration review meetings, providing constructive and insightful feedback to track the progress of each iteration and surface issues
  • Environment: Spark, Scala, ETL, Kafka, Tableau, Hadoop, Python, Snowflake, HDFS, Hive, MapReduce, PySpark, Docker, Sqoop, Azure, Teradata, JSON, MongoDB, SQL, Agile, and Windows
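
A minimal PySpark sketch of the kind of generic file-processing framework described above; the function name, paths, and output location are illustrative assumptions, not the production code:

    # Minimal sketch of a generic PySpark ingestion framework for JSON, CSV, and text files.
    # File paths and the curated output location are illustrative placeholders.
    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.appName("generic-file-ingest").getOrCreate()

    def read_any(path: str, fmt: str) -> DataFrame:
        """Read a source file into a DataFrame based on its declared format."""
        if fmt == "json":
            return spark.read.json(path)
        if fmt == "csv":
            return spark.read.option("header", "true").option("inferSchema", "true").csv(path)
        if fmt == "text":
            return spark.read.text(path)
        raise ValueError(f"Unsupported format: {fmt}")

    # Example usage: normalize each source and write it to a common curated zone.
    for path, fmt in [("/data/raw/events.json", "json"), ("/data/raw/pos.csv", "csv")]:
        read_any(path, fmt).write.mode("overwrite").parquet(f"/data/curated/{fmt}/")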

Data Engineer

Delta Airlines
02.2023 - 07.2023
  • Involved in the requirement-gathering phase, working with business users to accommodate continuously changing requirements
  • Developed Spark programs with Scala, applying principles of functional programming for batch processing
  • Used various Spark transformations and actions to cleanse the input data, and used the Spark application master to monitor Spark jobs and capture their logs
  • Used Spark DataFrames and Spark SQL extensively to build multiple ETL pipelines
  • Used PySpark for data ingestion and to perform complex transformations
  • Developed quality code adhering to Scala coding standards and best practices
  • Analyzed large data sets to determine the optimal way to aggregate and report on them using Map Reduce programs
  • Responsible for data services and data movement infrastructures, worked with ETL concepts, building ETL solutions and Data modeling
  • Developed Sqoop scripts to import and export data from relational sources and handled incremental loading of customer and transaction data by date
  • Designed, developed, and implemented ETL pipelines using Python API (PySpark) of Apache Spark on AWS EMR
  • Developed and implemented several types of sub-reports, drill-down reports, summary reports, parameterized reports, and ad-hoc reports using Tableau
  • Developed interactive dashboards and reports using Power BI for day-to-day business decision-making and strategic planning needs
  • Created Airflow scheduling scripts in Python (a minimal DAG sketch follows this list)
  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines
  • Used Apache Kafka to aggregate web log data from multiple servers and make them available for analysis in downstream systems
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics)
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks
  • Implemented Copy activities and custom Azure Data Factory pipeline activities
  • Designed, developed, implemented, and maintained solutions for using Docker, Jenkins, and Git, for microservices and continuous deployment
  • Involved in loading data from REST endpoints to Kafka producers and transferring the data to Kafka brokers
  • Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines from the data lake's AWS S3 bucket using Snowpipe and Matillion
  • Involved in creating Hive tables, loading and analyzing data using Hive queries, and writing complex Hive queries to transform the data
  • Created HBase tables to load large sets of structured data
  • Used SQL queries and other tools to perform data analysis and profiling
  • Involved in Agile methodologies, daily scrum meetings, and sprint planning
  • Actively participated in weekly iteration review meetings, providing constructive and insightful feedback to track the progress of each iteration and surface issues
  • Environment: Spark, Scala, ETL, Hadoop, Python, Snowflake, HDFS, Hive, Tableau, MapReduce, PySpark, Teradata, Docker, JSON, XML, Azure, Apache Kafka, SQL, PL/SQL, Agile, and Windows
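
A minimal sketch of an Airflow DAG of the kind referenced above, assuming Airflow 2.x-style imports; the DAG id, schedule, and task commands are hypothetical placeholders:

    # Minimal Airflow DAG sketch: a daily ingest step followed by a transform step.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_ingest_pipeline",        # hypothetical name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ingest = BashOperator(
            task_id="ingest_raw_files",
            bash_command="spark-submit ingest_job.py",      # placeholder command
        )
        transform = BashOperator(
            task_id="transform_to_curated",
            bash_command="spark-submit transform_job.py",   # placeholder command
        )
        ingest >> transform  # run the transform only after ingest succeeds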

Data Engineer

Edmund Optics
09.2020 - 08.2022
  • Participated in requirement-gathering sessions with business users and sponsors to understand and document the business requirements
  • Involved in developing Spark programs with Python, applying principles of functional programming to process complex structured data sets
  • Developed Spark scripts by using Scala as required to read/write JSON files
  • Developed near real-time data pipelines using Spark
  • Wrote Scala functions as required for column validation and data cleansing logic
  • Worked on storing DataFrames as Hive tables using Python (PySpark)
  • Developed ETL jobs using Python for various consumers, creating Python modules to parse JSON files, extract data from relational tables and flat files, and perform transformations
  • Optimized existing pivot-table reports using Tableau and proposed an expanded set of views as interactive dashboards using line graphs, bar charts, heat maps, tree maps, trend analysis, Pareto charts, and bubble charts to enhance data analysis
  • Developed Hive queries to process the data for visualizing and worked on tuning the performance of Hive Queries
  • Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data
  • Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests via Amazon API Gateway (a minimal handler sketch follows this list)
  • Migrated data from an AWS S3 bucket to Snowflake by writing a custom read/write Snowflake utility function in Scala
  • Created DataStage jobs using stages such as Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, and Row Generator
  • Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python
  • Extracted data from Teradata into HDFS using Sqoop
  • Developed Pig UDFs to manipulate data according to business requirements and developed custom Pig loaders
  • Wrote Pig scripts to process unstructured data and create structured data for use with Hive
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames in Python
  • Performed data cleaning and preparation of XML files
  • Worked with JSON, CSV, Sequential, and Text file formats
  • Involved in creating and modifying SQL queries, prepared statements, and stored procedures used by the application
  • Participated in the status meetings and status updates to the management team
  • Environment: Spark, Scala, Hadoop, Python, PySpark, AWS, MapReduce, Pig, ETL, HDFS, Hive, HBase, SQL, Agile, and Windows
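
An illustrative sketch of an AWS Lambda handler of the kind described above, triggered by S3 object-created events; the DynamoDB table name and bucket are assumptions, not the production setup:

    # Minimal AWS Lambda sketch: record each newly landed S3 object in a DynamoDB
    # table so a downstream job can pick it up. Table and bucket names are placeholders.
    import boto3

    dynamodb = boto3.resource("dynamodb")
    audit_table = dynamodb.Table("ingest_audit")  # hypothetical table name

    def lambda_handler(event, context):
        records = event.get("Records", [])
        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            audit_table.put_item(Item={"object_key": key, "bucket": bucket})
        return {"statusCode": 200, "processed": len(records)}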

Data Engineer

Verisk
01.2018 - 08.2020
  • Worked with the business users to gather, define business requirements, and analyze the possible technical solutions
  • Designed and developed Spark workflows using Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to the data
  • Developed MapReduce programs for pre-processing and cleansing the data in HDFS obtained from heterogeneous data sources to make it suitable for ingestion into Hive schema for analysis
  • Responsible for building the ETL Pipelines (Extract, Transform, and Load) from Data Lake to different databases based on the requirements
  • Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouse, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making
  • Prepared scripts in Python and Scala to automate the ingestion process from various sources such as APIs, AWS S3, Teradata, and Snowflake (see the illustrative sketch after this list)
  • Developed Pig scripts for the analysis of semi-structured data
  • Used Pig as an ETL tool to do transformations, event joins, filters, and some pre-aggregations before storing the data onto HDFS
  • Extensively involved in writing SQL queries (sub-queries and join conditions) for building and testing ETL processes
  • Actively participated in code reviews and meetings and resolved any technical issues
  • Environment: Spark, Scala, Hive, JSON, AWS, MapReduce, Hadoop, Python, XML, NoSQL, HBase, and Windows
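
A minimal PySpark sketch of the kind of S3 ingestion automation described above; the bucket names, key column, and output location are hypothetical assumptions:

    # Minimal PySpark sketch: pull raw files from an S3 prefix, apply simple
    # cleansing transformations, and write the result to a curated S3 location.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-ingest").getOrCreate()

    raw = spark.read.json("s3a://example-raw-bucket/events/")      # hypothetical source
    cleaned = (
        raw.dropDuplicates(["event_id"])                           # assumed key column
           .withColumn("ingest_date", F.current_date())
           .filter(F.col("event_type").isNotNull())                # assumed column
    )
    cleaned.write.mode("append").parquet("s3a://example-curated-bucket/events/")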

Education

Master of Arts - Information Technology Management

Webster University
St Louis, MO
12.2023

Bachelor’s in Management - Travel and Tourism Studies

Kathmandu Academy of Tourism and Hospitality
01.2020

Skills

  • Python
  • SQL
  • Scala
  • MATLAB
  • Red Hat Linux
  • Unix
  • Windows
  • macOS
  • Snowflake
  • Teradata
  • Oracle
  • MySQL
  • Microsoft SQL Server
  • PostgreSQL
  • Azure
  • AWS
  • Docker

Timeline

Data Engineer

Verizon
07.2023 - Current

Data Engineer

Delta Airlines
02.2023 - 07.2023

Data Engineer

Edmund Optics
09.2020 - 08.2022

Data Engineer

Verisk
01.2018 - 08.2020

Bachelor’s in Management - Travel and Tourism Studies

Kathmandu Academy of Tourism and Hospitality

Master of Arts - Information Technology Management

Webster University