Responsible for migrating the on-premises Data Lake to an AWS S3-backed cloud Data Lake
Responsible for building end-to-end data pipelines in cloud infrastructure
Responsible for fine-tuning, troubleshooting, and supporting the enterprise data pipelines at production scale
Wrote Python-based Spark applications to perform various data transformations and other custom event processing
Involved in data cleansing, event enrichment, data aggregation, and data preparation needed for machine learning and reporting
Used Spark SQL to read data from Hive tables and perform data cleansing, validation, transformation, and aggregation per downstream business team requirements (see the Spark SQL sketch after this list)
Deployed applications to Kubernetes, creating and managing Pods
Used build-automation pipelines to publish all microservice builds to the Docker registry in AWS
Automated the resulting scripts and workflows using Airflow orchestration and shell scripting to ensure daily execution in production
Involved in continuous integration of applications using Jenkins
Responsible for loading processed data into data warehouse tables so the business reporting team could build dashboards
Worked with cross-functional data science, software engineering, and analytics teams to design, develop, and execute solutions that derive business insights and solve clients’ operational and strategic problems
Worked on data visualization and analytics with research scientists and business stakeholders
Strong communication, decision-making, and organizational skills, along with analytical and problem-solving abilities for taking on challenging assignments
Optimized data processing by implementing efficient ETL pipelines and streamlining database design
Designed scalable and maintainable data models to support business intelligence initiatives and reporting needs
Fine-tuned query performance and optimized database structures for faster, more accurate data retrieval and reporting
Evaluated various tools, technologies, and best practices for potential adoption in the company's data engineering processes
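
A minimal Spark SQL sketch of the Hive-based cleansing, validation, and aggregation flow described above; the database, table, and column names are illustrative assumptions, not the actual production schema:

    from pyspark.sql import SparkSession, functions as F

    # Hive-enabled Spark session (assumes the cluster is configured with a Hive metastore)
    spark = (SparkSession.builder
             .appName("orders-cleansing")
             .enableHiveSupport()
             .getOrCreate())

    # Read raw data from a Hive table (hypothetical schema)
    orders = spark.sql("SELECT * FROM sales.orders")

    cleansed = (orders
                .dropDuplicates(["order_id"])                       # cleansing
                .filter(F.col("order_amount").isNotNull())          # validation
                .withColumn("order_date", F.to_date("order_ts")))   # transformation

    # Aggregate per downstream reporting requirements
    daily_totals = (cleansed
                    .groupBy("order_date", "region")
                    .agg(F.sum("order_amount").alias("total_amount")))

    daily_totals.write.mode("overwrite").saveAsTable("reporting.daily_order_totals")
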
Data Engineer
Smart and Final
08.2023 - 03.2024
Responsible for building end-to-end data pipelines in Azure cloud infrastructure, ensuring efficient data handling and processing
Developed Python-based Spark applications for data transformations and event processing, contributing to the refinement of data analytics and reporting capabilities
Successfully designed, developed, and maintained complex data pipelines, including a 650 TB migration using Azure Data Factory, enhancing system reliability and integrity
Experienced with the Azure cloud platform, managing virtual networks, VMs, and Databricks, and optimizing cloud infrastructure for data engineering tasks
Skilled in automating cloud infrastructure with ARM templates for Function Apps, Key Vaults, virtual networks, etc.
Embodied the principle of full ownership in build and deployment processes
Led the development and implementation of continuous integration and deployment pipelines, incorporating GitHub Actions workflows for automated deployment of infrastructure and applications
Implemented cost-saving strategies in data storage management, transitioning data between hot, cold, and archive tiers and producing significant savings (approximately $200K); see the tiering sketch after this list
Diagnosed and resolved production issues and resource-utilization problems, improving performance and cost efficiency
Created customer-focused data dashboards for analytics and monitoring, utilizing Python scripting for effective data integration and sharing between ADLS and Snowflake
Composed and maintained comprehensive documentation and deployment guides to streamline and standardize the build and release procedures, ensuring best practices and team alignment
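
A minimal sketch of the kind of storage-tiering automation behind the cost-saving work above, using the azure-storage-blob SDK; the container name, age threshold, and target tier are illustrative assumptions:

    from datetime import datetime, timedelta, timezone
    from azure.storage.blob import BlobServiceClient

    # The connection string would normally come from Key Vault or an environment variable
    service = BlobServiceClient.from_connection_string("<connection-string>")
    container = service.get_container_client("raw-landing")  # hypothetical container

    # Demote blobs untouched for 90+ days out of the hot tier to reduce storage cost
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    for blob in container.list_blobs():
        if blob.last_modified < cutoff:
            container.get_blob_client(blob.name).set_standard_blob_tier("Cool")
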
Utilized AWS to aggregate cleaned files in Amazon S3 and loaded files into buckets via Amazon EC2 clusters
Developed a data pipeline on AWS to extract data from weblogs and store it in HDFS and migrated data from AWS S3 to HDFS using Kafka
Designed a Data Quality Framework for schema validation and data profiling using Spark (PySpark)
Employed PySpark-SQL to load JSON data, create schema RDDs and DataFrames, and integrate it into Hive Tables, managing structured data with Spark-SQL
Created views and templates with Python and Django’s view controller and templating language, employing MVC architecture to deliver a user-friendly interface
Developed ETL/ELT pipelines using data technologies such as PySpark, Hive, Presto, and Databricks
Applied best practices in data architecture, integration, and governance, including data catalogs, governance frameworks, metadata management, and data quality solutions
Successfully implemented ETL solutions between OLTP and OLAP databases to support Decision Support Systems, with expertise across all SDLC phases
Created Python scripts for managing AWS resources via the Boto3 SDK and AWS CLI, and established CI/CD pipelines using Maven, GitHub, and AWS (see the Boto3 sketch after this list)
Specialized in real-time processing and core job development with Kafka and Spark Streaming, and developed UNIX shell scripts for parameterizing Sqoop and Hive jobs
Extensively imported metadata into Hive using Python and migrated existing tables and applications to AWS
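
A minimal Boto3 sketch of the S3 resource-management scripting referenced above; bucket and prefix names are illustrative, and credentials are assumed to come from the standard AWS credential chain:

    import boto3

    s3 = boto3.client("s3")

    # Copy cleaned weblog files from the raw bucket into a curated bucket (hypothetical names)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="raw-data-lake", Prefix="weblogs/clean/"):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket="curated-data-lake",
                Key=obj["Key"],
                CopySource={"Bucket": "raw-data-lake", "Key": obj["Key"]},
            )
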