Dedicated Data Engineer with over 6 years of hands-on experience, specializing in the design, implementation, and optimization of end-to-end data pipelines. Proficient in harnessing cutting-edge technologies to transform raw data into actionable insights, driving well-informed decision-making.
Key Skills:
Extensive experience designing, developing, and operating data pipelines and data lakes using the Big Data technology stack, Python, PL/SQL, SQL, REST APIs, and the Azure cloud platform.
Proficiency with key Big Data tools, including HDFS, Kafka, MapReduce, Spark, Pig, Hive, Sqoop, HBase, Flume, and ZooKeeper, for designing and deploying comprehensive big data ecosystems.
Expertise in Spark DataFrame operations for critical data validation and analytics on Hive data within Cloudera infrastructure.
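A minimal sketch of this kind of DataFrame-based validation, assuming a Spark session with Hive support and hypothetical table and key names (sales.orders, order_id):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark session with Hive support (typically enabled on Cloudera clusters)
spark = (
    SparkSession.builder
    .appName("hive-data-validation")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical Hive table; swap in the real database.table name
orders = spark.table("sales.orders")

# Basic validation: count nulls per column
null_counts = orders.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in orders.columns]
)
null_counts.show()

# Basic validation: detect duplicate keys (order_id is an assumed key column)
dupes = orders.groupBy("order_id").count().filter(F.col("count") > 1)
print(f"duplicate keys: {dupes.count()}")
```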
Skilled in developing advanced MapReduce systems to process various file types, including Text, Sequence, XML, and JSON.
Successfully migrated on-premises applications to leverage Azure cloud databases and storage.
Hands-on experience with Azure services, including SQL Database, SQL Data Warehouse, Analysis Services, HDInsight, Data Lake, and Data Factory.
Proficient in building CI/CD pipelines on AWS using CodeCommit, CodeBuild, CodeDeploy, and CodePipeline, as well as utilizing AWS CloudFormation, API Gateway, and AWS Lambda for automation and infrastructure security.
Expertise in Azure data solutions, including storage account provisioning, Azure Data Factory, SQL Server, SQL databases, SQL Data Warehouse, Azure Databricks, and Azure Cosmos DB.
Strong understanding of Spark Architecture with Databricks and Structured Streaming.
Practical experience with Python and Apache Airflow to create, schedule, and monitor workflows.
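A minimal Airflow sketch of such a workflow, with a hypothetical DAG name and placeholder extract/load steps:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder extract step; in practice this would pull from a source system
    print("extracting data")

def load():
    # Placeholder load step; in practice this would write to the warehouse
    print("loading data")

default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="example_daily_pipeline",  # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```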
Knowledge of data analytics services such as Amazon QuickSight, AWS Glue Data Catalog, and Amazon Athena.
Proficiency in working with Apache Kafka and Confluent environments, including KTables, GlobalKTables, and KStreams for Kafka streaming.
Led data migration from on-premises SQL servers to Azure cloud databases, including Azure Synapse Analytics and Azure SQL DB.
Utilized Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics for ETL processes and data ingestion.
Performed data processing in Azure Databricks.
Worked with Kafka streaming for subscriber-side data processing, integrating messages into databases.
Leveraged Apache Spark for real-time data processing.
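The two bullets above combine into a common pattern; below is a minimal Structured Streaming sketch of it, with hypothetical broker, topic, schema, and JDBC connection details:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-subscriber").getOrCreate()

# Assumed message schema; adjust to the real topic payload
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
)

# Kafka values arrive as bytes; parse them into typed columns
events = raw.select(
    F.from_json(F.col("value").cast("string"), schema).alias("e")
).select("e.*")

def write_batch(df, epoch_id):
    # Each micro-batch is appended to the target database over JDBC
    (
        df.write.format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")  # hypothetical connection
        .option("dbtable", "events")
        .mode("append")
        .save()
    )

query = (
    events.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start()
)
query.awaitTermination()
```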
Designed a reusable Python pattern for Synapse integration covering aggregations, change data capture, deduplication, and high-watermark incremental loads (sketched below), accelerating development and promoting standardization across teams.
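A simplified sketch of the high-watermark and deduplication parts of such a pattern, with hypothetical table, path, and column names (a production version would track the watermark in a control table):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Hypothetical source table, target path, key, and watermark column
SOURCE = "staging.orders"
TARGET = "/lake/curated/orders"
KEY = "order_id"
WATERMARK_COL = "updated_at"

def read_last_watermark():
    # The real pattern stores this in a control table; reading the target
    # directly keeps the sketch self-contained
    try:
        return spark.read.parquet(TARGET).agg(F.max(WATERMARK_COL)).first()[0]
    except Exception:
        return None  # first run: no target data yet

last = read_last_watermark()
incoming = spark.table(SOURCE)
if last is not None:
    incoming = incoming.filter(F.col(WATERMARK_COL) > F.lit(last))  # new/changed rows only

# Deduplicate: keep only the latest version of each key
w = Window.partitionBy(KEY).orderBy(F.col(WATERMARK_COL).desc())
deduped = (
    incoming.withColumn("_rn", F.row_number().over(w))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

deduped.write.mode("append").parquet(TARGET)
```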
Integrated Kubernetes with cloud-native services, such as AWS EKS and GCP GKE, to enhance scalability.
Led the migration of data to Snowflake and AWS from legacy data warehouses.
Contributed to the Data and Reporting team, creating actionable insights and visualizations for informed decision-making.
Extracted and analyzed data from various sources, implementing data wrangling and cleanup with Python and pandas.
Demonstrated proficiency with common Python data engineering packages, including pandas, NumPy, PyArrow, pytest, scikit-learn, and Boto3.
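A minimal pandas wrangling sketch in the spirit of the two bullets above, with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical CSV export from one of the source systems
df = pd.read_csv("raw_extract.csv")

# Typical cleanup: normalize column names, fix types, drop duplicates
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")  # assumed column
df = df.drop_duplicates(subset="order_id")                            # assumed key
df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0)

# Quick sanity summary before handing off to analytics
print(df.describe(include="all"))
df.to_parquet("clean_extract.parquet", index=False)
```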
Created and maintained CI/CD pipelines, applying automation to environments and applications.
Utilized Python for data manipulation and wrote data to JSON files for testing Django websites.
Developed and maintained Docker container clusters managed by Kubernetes.
Managed infrastructure as code on AWS using Terraform templates.
Configured Jenkins pipelines to execute various steps, including unit testing, integration testing, and static analysis tools.
Azure Data Engineer
DXC Technology
02.2020 - 11.2021
Implemented data quality checks, validations, and monitoring processes to ensure the accuracy and integrity of data for a pharmaceutical client.
Architected and implemented medium to large-scale Business Intelligence (BI) solutions on Azure using Azure Data Platform services, including Azure Data Lake, Data Factory, Data Lake Analytics, and Stream Analytics.
Utilized Azure Data Lake, Azure Data Factory, and Azure Databricks to efficiently move and transform on-premises data to the cloud, meeting the analytical needs of the organization.
Analyzed data using SQL, Python, and Apache Spark, creating and presenting analytical reports for management and technical teams.
Deployed models as Python packages, APIs for backend integration, and microservices within a Kubernetes orchestration layer for Docker containers.
Created pipelines in Azure Data Factory for data extraction, transformation, and loading (ETL) from diverse sources, including Azure SQL, Blob storage, and Azure SQL Data Warehouse.
Developed and implemented data acquisition jobs using Scala, Sqoop, Hive, and Pig, optimizing MapReduce jobs for efficient Hadoop Distributed File System (HDFS) usage.
Improved data processing efficiency by converting and parsing data formats with PySpark DataFrames, reducing conversion and parsing time (see the sketch below).
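A minimal sketch of this kind of format conversion, reading JSON and writing partitioned Parquet, with hypothetical paths and an assumed partition column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-conversion").getOrCreate()

# Hypothetical paths; the same pattern applies to CSV, XML, or text inputs
src = "/landing/events/*.json"
dst = "/curated/events/"

df = spark.read.json(src)  # schema inferred from the JSON records
df.printSchema()

# Columnar Parquet with partitioning makes downstream reads much cheaper
(
    df.write
    .mode("overwrite")
    .partitionBy("event_date")  # assumed partition column
    .parquet(dst)
)
```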
Established and maintained continuous integration and deployment (CI/CD) pipelines, applying automation to environments and applications.
Proficient in automation tools such as Git, Terraform, and Ansible.
Implemented Python automation for Capital Analysis and Review, utilizing Pandas and NumPy modules for data manipulation and analysis, ensuring accurate reporting and streamlined decision-making.
Led the migration to AWS, utilizing Amazon Redshift for data warehousing and HiveQL for reporting, resulting in a 30% reduction in data retrieval and processing time.
Data Engineer
HDFC Bank
10.2018 - 01.2020
Spearheaded various stages of the Software Development Lifecycle (SDLC), encompassing requirement gathering, design, development, deployment, and application analysis.
Managed data import from diverse sources, executing transformations with Hive and MapReduce and loading data into HDFS; also extracted data from SQL databases into HDFS using Sqoop.
Developed advanced analytical components utilizing Scala, Spark, Apache Mesos, and Spark Streaming.
Installed and configured Hadoop, MapReduce, and HDFS, leading to the creation of multiple MapReduce jobs in Pig and Hive for data cleansing and pre-processing.
Expertly facilitated Big Data integration and analytics, leveraging technologies such as Hadoop, Solr, Spark, Kafka, Storm, and webMethods.
Collaborated with the DevOps Team, utilizing CI/CD tools like Jenkins and Docker to establish end-to-end application processes, encompassing deployment in lower environments and delivery.
Designed and implemented Python code to collect data from HBase (Cornerstone) and devised a PySpark-based solution for data processing.
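A minimal sketch of this kind of HBase collection feeding PySpark, using the happybase client with hypothetical host, table, and column-family names (Cornerstone specifics omitted):

```python
import happybase
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-ingest").getOrCreate()

# Hypothetical HBase Thrift host and table name
conn = happybase.Connection("hbase-host")
table = conn.table("cornerstone_events")

rows = []
for key, data in table.scan(limit=1000):  # small scan for illustration
    rows.append({
        "row_key": key.decode(),
        "value": data.get(b"cf:value", b"").decode(),  # assumed column family:qualifier
    })
conn.close()

# Hand the collected rows to Spark for downstream processing
df = spark.createDataFrame(rows)
df.show()
```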
Engineered a Java API (Commerce API) for seamless connection to Cassandra via Java services.
Application Developer/ Data Engineer
Care Health Insurance
07.2017 - 09.2018
Created and managed workflows using Oozie, orchestrating MapReduce jobs and Hive Queries.
Developed session beans and controller servlets to handle HTTP requests originating from Talend.
Executed data visualization work, including the design of interactive Tableau dashboards, and generated complex reports comprising charts, summaries, and graphs to convey insights to the team and stakeholders.
Provided support for the development of web portals, completed data modeling in PostgreSQL, and contributed to front-end development using HTML/CSS and jQuery.
Engineered Python code to collect and process data from HBase (Cornerstone) and formulated a PySpark-based solution for implementation.
Designed and implemented a Java API (Commerce API) to enable seamless connectivity to Cassandra through Java services.
Education
Master of Science - Data Science
University of North Texas
Denton, TX
05.2023
Bachelor of Technology
Kamala Institute of Technology And Science
2020
Skills
Microsoft Azure Cloud Exposure
Azure Databricks
Azure Data Factory
Azure Synapse Analytics / SQL Data Warehouse
Logic Apps
Azure Data Lake
Azure Analysis Services
Azure Key Vault
Databases
Snowflake, MySQL, Teradata, Oracle, MS SQL Server, PostgreSQL, DB2