9 years of IT experience across diverse domains: extensive background in end-to-end data analytics solutions encompassing Big Data, Hadoop, Informatica, data modeling, and system analysis across the Banking, Finance, Insurance, and Telecom sectors.
Overview
9 years of professional experience
Work History
Senior Data Engineer
Bank of America
08.2022 - Current
Led Agile delivery and integrated SAFe and DevOps frameworks, leveraging DevOps tools for end-to-end planning, building, testing, releasing, and monitoring processes
Architected, developed, and maintained robust CI/CD pipelines for Big Data solutions within Azure DevOps, ensuring seamless deployment of code to production
Constructed reusable YAML pipelines, leveraging Azure Data Factory, Data Lake, and Databricks, while effectively managing code changes with Git flow branching strategy
Employed PowerShell scripting, Bash, YAML, JSON, Git, REST APIs, and Azure Resource Manager (ARM) templates to orchestrate and manage CI/CD pipelines
Set and enforced CI/CD standards and best practices, encompassing version control, code reviews, and scalable data processing pipelines implemented with PySpark for efficient data ingestion, transformation, and analysis
Leveraged PySpark's distributed computing capabilities to optimize large-scale data processing, significantly enhancing processing speed and overall performance
Orchestrated messaging queues using RabbitMQ to facilitate seamless data flow from HDFS for processing, and harnessed Kafka and RabbitMQ to capture data streams, all encapsulated within Docker virtualized test and dev environments
Proficiently designed and deployed SSIS packages, enabling seamless extraction, transformation, and loading of data into Azure SQL Database and Azure Data Lake Storage
Adeptly configured and fine-tuned SSIS Integration Runtime for efficient execution of SSIS packages in Azure, optimizing overall performance
Designed Linux- and Windows-based Docker containers, building from existing container images and AMIs as well as from scratch, while effectively managing container clusters with Docker Swarm, Mesos, and Kubernetes
Collaborated closely with development teams to diagnose issues and debug code within Windows environments, additionally mentoring junior engineers on CI/CD best practices and cloud-native architectures
Developed robust Databricks solutions for data extraction, transformation, and aggregation from diverse sources, creating high-performance data ingestion pipelines via Azure Data Factory and Azure Databricks
Constructed SCD Type 2 dimensions and facts leveraging Delta Lake and Databricks capabilities, ensuring accurate and efficient data management (a simplified sketch of this merge pattern follows this role's highlights)
Engineered custom ETL solutions, encompassing batch processing and real-time data ingestion using PySpark and Shell Scripting, facilitating seamless data movement within Hadoop clusters
Crafted Azure Databricks (Spark) notebooks to efficiently extract and load data between Data Lake storage accounts, Blob storage accounts, and on-premises SQL server databases
Conducted comprehensive statistical analysis utilizing SQL, Python, Scala, R Programming, and Excel, augmenting data-driven insights and generating key conclusions
Employed Python and SAS to extract, transform, and load source data from transaction systems, producing transformative reports and insights, while seamlessly transferring data from Azure storage to Azure SQL on Azure Databricks platform
Automated Azure Databricks jobs and constructed SSIS packages to facilitate smooth data transfer from Azure SQL to on-premises servers
Designed and implemented ETL solutions in Databricks, adhering to bronze, silver, and gold layer architecture, and leveraged Azure Data Factory to orchestrate data preparation and loading into SQL Data Warehouse
Seamlessly integrated on-premises data sources (MySQL, HBase) with cloud platforms (Blob Storage, Azure SQL DB), applying transformations to facilitate loading into Azure Synapse via Azure Data Factory
Created, published, and deployed Docker container images via Azure Container Registry into Azure Kubernetes Service (AKS), ensuring efficient containerized deployments
Transferred metadata into Hive, seamlessly migrating existing tables and applications for Hive and Azure compatibility, while implementing complex transformations and manipulations using ADF, Scala, and Python
Streamlined data ingestion from varied sources, including relational and non-relational databases, through Azure Data Factory configurations, optimizing Apache Airflow performance with tailored settings
Designed and implemented DAGs within Apache Airflow to schedule ETL jobs, enhancing workflow efficiency and incorporating additional components like Pool, Executors, and multi-node functionality
Configured Spark streaming for real-time data reception from Apache Flume, employing Scala to store stream data in Azure Table and Data Lake, ultimately used for processing and analytics
Architected and executed cloud implementation strategies for hosting complex app workloads on MS Azure, ensuring optimal performance and scalability
Performed transformation-layer operations using Apache Drill, Spark RDDs, DataFrame APIs, and Spark SQL, harnessing Spark's capabilities for various aggregations and data manipulations
Derived real-time insights and reports by harnessing Spark Scala functions, optimizing cluster performance and reliability through continuous monitoring and fine-tuning
Enhanced query performance by transitioning log storage from Cassandra to Azure SQL Data Warehouse, resulting in improved overall data processing efficiency
Engineered custom input adapters utilizing Spark, Hive, and Sqoop to seamlessly ingest analytics data from diverse sources (Snowflake, MS SQL, MongoDB) into HDFS
Leveraged Scala for concurrency and parallel processing to optimize large-dataset processing efficiency, while developing MapReduce jobs for streamlined data processing
Accelerated data processing by developing and optimizing Spark jobs using Python and Spark SQL, fine-tuning parameters like batch interval time and parallelism
Implemented indexing for data ingestion using Flume sink, facilitating direct writing to cluster-based indexers
Managed and delivered data for analytics and Business Intelligence needs using Azure Synapse, ensuring seamless and reliable data availability
Bolstered security by integrating Azure DevOps, VSTS (Visual Studio Team Services), Active Directory, and Apache Ranger for robust CI/CD and authentication mechanisms, effectively managing resource allocation and scheduling through Azure Kubernetes Service
Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and verifying pipeline stability
Designed and implemented effective database solutions and models to store and retrieve data.
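Illustrative of the Delta Lake work above, a minimal sketch of an SCD Type 2 upsert in PySpark; the table, column names, and change-detection condition are hypothetical placeholders, not the production schema:

```python
# Minimal SCD Type 2 sketch with Delta Lake MERGE (PySpark).
# Dimension table: silver.dim_customer with is_current / end_date tracking columns.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2-sketch").getOrCreate()

updates = spark.read.parquet("/mnt/bronze/customers")  # staged change records
dim = DeltaTable.forName(spark, "silver.dim_customer")

# Step 1: close out current rows whose tracked attributes changed.
(dim.alias("d")
    .merge(updates.alias("u"),
           "d.customer_id = u.customer_id AND d.is_current = true")
    .whenMatchedUpdate(condition="d.address <> u.address",
                       set={"is_current": "false",
                            "end_date": "current_date()"})
    .execute())

# Step 2: append the incoming records as new current-version rows
# (a production job would first filter to rows that actually changed).
(updates.withColumn("is_current", F.lit(True))
        .withColumn("end_date", F.lit(None).cast("date"))
        .write.format("delta").mode("append")
        .saveAsTable("silver.dim_customer"))
```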
Senior Data Engineer
Fannie Mae
01.2021 - 07.2022
Architected and executed intricate data pipelines via AWS Glue for efficient transformation and loading of extensive data from diverse sources
Automated ETL processes through maintenance and optimization of Glue jobs and crawlers, ensuring seamless data processing and analysis
Designed and implemented robust data processing pipelines using Amazon EMR, PySpark, and AWS Glue, enabling efficient extraction, transformation, and loading of large-scale datasets into Redshift for analysis (a skeleton of such a Glue job appears after this role's highlights)
Optimized PySpark jobs for performance and scalability, achieving a 40% reduction in processing time and enabling real-time insights from data
Leveraged AWS Glue to automate ETL workflows, reducing manual intervention and improving data accuracy by creating automated data quality checks and transformations
Developed serverless data processing solutions using AWS Lambda and Step Functions, enabling event-driven data processing and reducing operational overhead
Architected serverless ETL pipelines with Lambda and Step Functions, achieving cost savings of up to 30% compared to traditional infrastructure setups
Implemented logging, monitoring, and error handling mechanisms within serverless applications, ensuring robustness and reliability of data processing workflow
Orchestrated containerized data processing workloads using Amazon ECS with Fargate, achieving seamless scaling and resource isolation for data-intensive applications
Designed and implemented Docker containers for data processing tasks, allowing for consistent environments and efficient deployment across multiple stages of data pipeline
Utilized ECS with Fargate to achieve cost optimization by matching container resources precisely to workload demands, resulting in a 20% reduction in infrastructure costs
Designed and implemented holistic data pipelines incorporating AWS services like S3, Glue, and Redshift, seamlessly integrated with Snowflake as a cloud data warehouse
Crafted and maintained scalable, performant data models within Snowflake, optimizing Pyspark jobs for Kubernetes Cluster execution to enhance data processing speed
Developed a framework for migrating PowerCenter mappings to PySpark (Python and Spark) jobs, guiding and enforcing quality standards for the development team
Orchestrated PySpark integration with Hadoop, Hive, and other big data technologies, establishing comprehensive end-to-end data processing pipelines
Employed AWS EMR to deploy and manage big data processing applications, utilizing frameworks like Spark and Hadoop for advanced data processing
Engineered Spark, Hive, Pig, Python, Impala, and HBase data pipelines for seamless customer data ingestion and processing
Designed RESTful APIs with Django Rest Framework (DRF) and Flask-RESTful, ensuring seamless integration with external systems
Generated SQL and PL/SQL scripts for managing database objects, encompassing tables, views, primary keys, indexes, and sequences
Orchestrated Amazon EC2 instances creation, troubleshooting, and health monitoring, alongside other AWS services for multi-tier application deployment
Designed and executed high-availability, fault-tolerant, and auto-scaling multi-tier applications utilizing AWS services like EC2, Route53, S3, RDS, DynamoDB, SNS, SQS, and IAM
Employed Apache Spark and Python for Big Data Analytics and Machine Learning applications, with expertise in Spark ML and MLlib
Provided Linux and Windows cloud instances support on AWS, configuring Elastic IP, Security Groups, and Virtual Private Cloud
Configured Amazon EC2, S3, Elastic Load Balancing, and security components in VPC, ensuring robust network security
Automated data backups to S3 buckets, EBS, and AMIs using AWS CLI, ensuring data safety for critical production servers
Created OpenShift namespaces for on-premises applications transitioning to the cloud in an OpenShift PaaS environment
Virtualized servers using Docker for testing and development environments, streamlining configuration through Docker containers
Managed Docker clusters, including Docker Swarm, Mesos, and Kubernetes, integrating them with Amazon AWS/EC2 and Google Kubernetes Engine
Developed Jenkins CI/CD pipeline jobs for end-to-end automation, overseeing artifact management in Nexus repository, and utilizing Jenkins nodes for parallel builds
Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and verifying pipeline stability
Designed and implemented effective database solutions and models to store and retrieve data
Prepared documentation and analytic reports, delivering summarized results, analyses, and conclusions to stakeholders
Applied Good Documentation Practice (GDP) to validation protocols, test cases, and change control documents
Analyzed complex data and identified anomalies, trends, and risks to provide useful insights to improve internal controls
Developed, implemented and maintained data analytics protocols, standards, and documentation
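As a companion to the Glue work described above, a hedged skeleton of a Glue PySpark job that reads a catalogued S3 dataset, applies a mapping, and loads Redshift; the database, table, and connection names are illustrative assumptions:

```python
# Skeleton AWS Glue PySpark job: catalog source -> column mapping -> Redshift sink.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_ctx = GlueContext(SparkContext.getOrCreate())
job = Job(glue_ctx)
job.init(args["JOB_NAME"], args)

# Source: an S3 dataset registered in the Glue Data Catalog by a crawler.
src = glue_ctx.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="loans")

# Transform: project and rename columns (4-tuples: src name, src type, dst name, dst type).
mapped = ApplyMapping.apply(
    frame=src,
    mappings=[("loan_id", "string", "loan_id", "string"),
              ("orig_amt", "double", "origination_amount", "double")])

# Sink: load into Redshift through a catalogued JDBC connection,
# staging through S3 as Glue's Redshift writer requires.
glue_ctx.write_dynamic_frame.from_jdbc_conf(
    frame=mapped, catalog_connection="redshift-conn",
    connection_options={"dbtable": "analytics.loans", "database": "dw"},
    redshift_tmp_dir="s3://tmp-bucket/redshift-staging/")

job.commit()
```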
Senior Data Engineer
AutoZone
10.2019 - 12.2020
Developed RESTful APIs using Python with Flask and Django frameworks, seamlessly integrating diverse data sources such as Java, JDBC, RDBMS, Shell Scripting, spreadsheets, and text files (a minimal Flask sketch follows this role's highlights)
Leveraged Apache Spark with Python to architect and execute sophisticated Big Data Analytics and Machine Learning applications, successfully implementing machine learning use cases within Spark ML and MLlib
Designed and deployed SSIS packages for data loading and transformation within Azure databases and storage environments
Configured and managed SSIS Integration Runtime for seamless execution of SSIS packages in Azure infrastructure
Employed Spark and Python to craft regular expression (regex) projects within the Hadoop/Hive ecosystem, spanning Linux and Windows environments for comprehensive big data processing
Developed Spark streaming modules for efficient data acquisition from RabbitMQ and Kafka sources
Proficiently profiled structured, unstructured, and semi-structured data across diverse sources, adeptly identifying data patterns
Implemented data quality metrics via essential queries and Python scripts tailored to source characteristics
Analyzed, designed, and engineered contemporary, scalable, and distributed data solutions using Hadoop and Azure cloud services.
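A minimal sketch of the kind of Flask REST endpoint described in this role; the route, schema, and SQLite stand-in datastore are hypothetical (the actual sources were RDBMS/JDBC systems):

```python
# Minimal read-only REST endpoint with Flask.
from flask import Flask, jsonify
import sqlite3  # stand-in for the real RDBMS driver in this sketch

app = Flask(__name__)

@app.route("/api/v1/parts/<part_id>", methods=["GET"])
def get_part(part_id):
    # Parameterized query guards against SQL injection.
    conn = sqlite3.connect("inventory.db")
    row = conn.execute(
        "SELECT part_id, description, qty FROM parts WHERE part_id = ?",
        (part_id,)).fetchone()
    conn.close()
    if row is None:
        return jsonify({"error": "not found"}), 404
    return jsonify({"part_id": row[0], "description": row[1], "qty": row[2]})

if __name__ == "__main__":
    app.run(port=5000)
```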
Data Engineer
Hike
08.2017 - 09.2019
Contributed to the analysis, design, and development phases of the Software Development Lifecycle (SDLC)
Proficient in agile methodologies, actively participating in sprint planning, scrum calls, and retrospective meetings
Managed project tracking with JIRA and version control via GitHub
Designed, developed, and maintained transformation processes in both non-production and production environments within Azure
Crafted data pipelines using PySpark Programming, employing technologies like Spark, Hive, Pig, Python, Impala, and HBase for effective customer data ingestion
Utilized Spark Streaming to segment streaming data into batches for seamless input to the Spark engine, facilitating efficient batch processing
Developed Spark applications for tasks such as data validation, cleansing, transformation, and custom aggregation
Employed Spark engine and Spark SQL for comprehensive data analysis, providing valuable insights for data scientists' further investigations
Engineered RESTful APIs using Python with Flask and Django frameworks, seamlessly integrating diverse data sources including Java, JDBC, RDBMS, Shell Scripting, Spreadsheets, and Text files
Leveraged Apache Spark with Python to architect and execute advanced Big Data Analytics and Machine Learning applications, successfully executing machine learning use cases within Spark ML and MLlib
Developed Spark and Python solutions for regular expression (regex) projects within the Hadoop/Hive environment, proficiently operating across Linux and Windows platforms for robust big data processing
Created Spark streaming modules for efficient data acquisition from RabbitMQ and Kafka sources (illustrated by the sketch following this role's highlights)
Proficiently profiled structured, unstructured, and semi-structured data from various sources, identifying key data patterns
Implemented data quality metrics through tailored queries and Python scripts based on source-specific characteristics
Designed, constructed, and managed SSIS packages to facilitate efficient data integration and transformation within Azure
Skillfully configured and optimized SSIS Integration Runtime for seamless package execution on the Azure platform
Analyzed, designed, and constructed modern, scalable distributed data solutions utilizing Hadoop and Azure cloud services.
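A hedged sketch of the streaming ingestion pattern above, shown here with Spark Structured Streaming consuming a Kafka topic in micro-batches; broker addresses, the topic name, and paths are placeholders, and the job assumes the spark-sql-kafka connector is on the classpath:

```python
# Kafka -> Spark Structured Streaming -> parquet bronze layer.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "customer-events")
          .load())

# Kafka delivers key/value as binary; cast the payload and pull out fields.
parsed = (events.selectExpr("CAST(value AS STRING) AS json")
          .select(F.get_json_object("json", "$.user_id").alias("user_id"),
                  F.get_json_object("json", "$.event").alias("event")))

# Checkpointing makes the micro-batch sink restartable.
query = (parsed.writeStream.outputMode("append")
         .format("parquet")
         .option("path", "/data/bronze/events")
         .option("checkpointLocation", "/data/checkpoints/events")
         .start())
query.awaitTermination()
```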
Data Engineer
Myntra
12.2014 - 06.2017
Played a pivotal role in capturing comprehensive business, system, and design requirements
Conducted gap analysis, and illustrated findings through use case diagrams and flow charts
Architected a dynamic, cross-device, cross-browser, and mobile-friendly web dashboard utilizing AngularJS
Enabled management of multiple chatbots across diverse environments
Orchestrated the development of bot conversation flows utilizing Node-RED, Node.js, and the Microsoft Bot Framework
Crafted the user interface for the web dashboard utilizing HTML, CSS, Bootstrap, and AngularJS
Designed, constructed, and managed SSIS packages, enabling seamless data integration and transformation within Azure
Skillfully configured and optimized SSIS Integration Runtime for efficient package execution on the Azure platform
Pioneered the creation of custom nodes on the Node-RED dashboard, facilitating streamlined conversation building through Node.js over the Microsoft Bot Framework
Actively contributed to the implementation of user authentication mechanisms within the application, leveraging Stormpath and Passport for robust security measures
Employed a diverse array of Validation Controls for client-side validation
Crafted custom validation controls using Angular validation controls and Angular Material Design, enhancing data integrity
Engineered Spark applications using PySpark and Spark-SQL for robust data extraction, transformation, and aggregation
Analyzed and transformed data from multiple file formats, unveiling valuable insights into customer usage patterns
Successfully established a robust CI/CD pipeline leveraging Jenkins and Airflow for containerization via Docker and Kubernetes
Orchestrated ETL operations using SSIS, NIFI, Python scripts, and Spark Applications
Constructed data flow pipelines, expertly transforming data from legacy tables to Hive, HBase tables, and S3 buckets, and handed the results off to business stakeholders and data scientists for advanced analytics
Implemented data quality checks using Spark Streaming, seamlessly categorizing records with bad and passable flags to ensure data integrity and reliability (a minimal sketch of this flagging pattern follows).
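A minimal sketch of the record-level quality flagging described above, tagging rows as bad or passable rather than dropping them; the dataset and rules are hypothetical, and the sketch is shown in batch form for brevity (the role applied the same idea in Spark Streaming):

```python
# Flag records instead of silently dropping them.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-flags").getOrCreate()
orders = spark.read.parquet("/data/raw/orders")

# A row is "bad" if it violates any rule; everything else is "passable".
flagged = orders.withColumn(
    "dq_flag",
    F.when(F.col("order_id").isNull() | (F.col("amount") <= 0), F.lit("bad"))
     .otherwise(F.lit("passable")))

# Downstream consumers filter on the flag rather than losing rows silently.
flagged.write.partitionBy("dq_flag").mode("overwrite").parquet("/data/curated/orders")
```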
Skills
Hadoop Proficiency:
Strong support experience across major Hadoop distributions (Cloudera, Amazon EMR, Azure HDInsight, Hortonworks); proficient with Hadoop tools including HDFS, MapReduce, YARN, Spark, Kafka, Hive, Impala, HBase, Sqoop, and Airflow
Azure Cloud and Big Data Tools: Working knowledge of Azure components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB); hands-on experience with Spark using Scala and PySpark
Database Migration: Expertise in migrating SQL databases to Azure Data Lake, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; proficient in access control and migration using Azure Data Factory
Cloud Computing and Big Data Tools:
Proficient in Azure Cloud and Big Data tools - Hadoop, HDFS, MapReduce, Hive, HBase, Spark, Azure Cloud, Amazon EC2, DynamoDB, S3, Kafka, Flume, Avro, Sqoop, PySpark
Real-time Data Solutions: Build real-time data pipelines and analytics using Azure components like Data Factory, HDInsight, Azure ML Studio, Stream Analytics, Azure Blob Storage, and Microsoft SQL DB
Database Expertise: Work with SQL Server and MySQL databases; skilled in working with Parquet files and in parsing and validating JSON formats; hands-on experience setting up workflows with Apache Airflow and Oozie
API Development and Integration: Develop highly scalable and resilient RESTful APIs, ETL solutions, and third-party platform integrations as part of an Enterprise Site platform
IDE and Version Control: Proficient use of IDEs like PyCharm, IntelliJ, and version control systems SVN and Git
Spark Components: RDD, Spark SQL (DataFrames and Datasets), and Spark Streaming
Cloud Infrastructure
Azure, GCP
Databases
Oracle, Teradata, MySQL, SQL Server, NoSQL databases (HBase, MongoDB)
Scripting & Query Languages
Shell scripting, SQL
Version Control
CVS, SVN, ClearCase, Git
Build Tools
Maven, SBT
Containerization Tools
Kubernetes, Docker, Docker Swarm
Development & Reporting Tools
JUnit, Eclipse, Visual Studio, NetBeans, Azure Databricks, CI/CD tooling, Linux/UNIX, Google Cloud Shell, Power BI, SAS, and Tableau
Windows Scripting and Cloud Containerization: Proficient in scripting and debugging within Windows environments
Familiarity with container orchestration using Kubernetes, Docker, and AKS
Efficient Data Integration: Expertise in designing and deploying SSIS packages for data extraction, transformation, and loading into Azure SQL Database and Data Lake Storage
Configure SSIS Integration Runtime for Azure execution and optimize package performance
Data Visualization and Analysis: Create data visualizations using Python, Scala, and Tableau
Develop Spark scripts with custom RDDs in Scala for data transformation and actions
Conduct statistical analysis on healthcare data using Python and various tools
Big Data Ecosystem: Extensive experience with Amazon EC2 for computing, query processing, and storage
Proficiently set up Pipelines in Azure Data Factory using Linked Services, Datasets, and Pipelines for ETL tasks
Azure Data Services: ETL expertise using Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics)
Ingest data to Azure Services and process within Azure Databricks
Developed JSON scripts for streamlined deployment of pipelines within Azure Data Factory (ADF), facilitating efficient data processing through SQL Activity
Demonstrated expertise in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, Delta Lake, and Azure SQL Data Warehouse
Successfully managed database access, control, and migration via Azure Data Factory
Executed Extract, Transform, Load (ETL) operations on Azure Data Storage services using a blend of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics)
Profoundly knowledgeable in Clustering, NLP, and Neural Networks, effectively translating outcomes into interactive dashboards for visualization and presentation
Implemented Spark RDD transformations for comprehensive business analysis and subsequent actionable processes
Strategically engaged in data migration to Hadoop while optimizing Hive queries for performance enhancement
Automated script execution through Apache Airflow and shell scripting, ensuring seamless daily production procedures
Constructed pipelines in Azure Data Factory (ADF) encompassing Linked Services, Datasets, and Pipelines, successfully extracting, transforming, and loading data from diverse sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse
Spearheaded Data Migration initiatives employing SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell technologies
Proficiently profiled structured, unstructured, and semi-structured data from various sources, meticulously implementing data quality metrics via essential queries or Python scripts aligned with source attributes
Skillfully employed PowerShell and UNIX scripts for file management, transfer, emailing, and related tasks
Innovatively conceptualized a novel data model integrating NoSQL submodules within a relational structure through Hybrid data modeling concepts
Leveraged Sqoop to seamlessly transfer data between RDBMS and HDFS, streamlining data integration
Proficiently installed and configured Apache Airflow for the Snowflake data warehouse, establishing robust DAGs (Directed Acyclic Graphs) for automated workflow execution (see the DAG sketch at the end of this section)
Employed MongoDB for data storage in JSON format, adeptly crafting and testing dashboard features utilizing Python, Bootstrap, CSS, and JavaScript
Ensured efficient code deployment to EMR via CI/CD using Jenkins
Possess sound expertise in developing highly scalable and resilient RESTful APIs, ETL solutions, and third-party platform integrations within Enterprise Site platforms
Proficiently navigated various IDEs including PyCharm, IntelliJ, and managed repositories using SVN and Git version control systems
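To illustrate the Airflow orchestration noted above, a hedged sketch of a daily DAG that loads staged files into Snowflake; the DAG id, schedule, and SQL stub are illustrative assumptions, not the production workflow:

```python
# Daily Airflow DAG sketch (Airflow 2.x): one task loading staged data into Snowflake.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_to_snowflake(**context):
    # Placeholder for the real load, e.g. a COPY INTO issued through the
    # Snowflake connector; stubbed so the sketch stays self-contained.
    print("COPY INTO analytics.daily_sales FROM @staged_files")

with DAG(
    dag_id="daily_snowflake_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_to_snowflake",
                          python_callable=load_to_snowflake)
```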