Hamza Al - US CITIZEN

Staten Island, NY

Summary

Results-driven Senior Data Engineer with 7+ years of experience in big data, distributed systems, and cloud computing. Expertise in designing and implementing scalable ETL pipelines, data migrations, and real-time data processing using AWS, Azure, Hadoop, and Spark.

Proficient in AWS Database Migration Service (DMS), Azure Data Factory, Apache Spark, MapReduce, and SQL-based data transformations. Strong background in RDBMS, NoSQL databases (HBase, MarkLogic), and data integration tools like Sqoop and Flume. Skilled in optimizing performance, query tuning, and data modeling for large-scale datasets.

Experienced in CI/CD pipelines, automation, and workflow orchestration using Maven, Ant, Oozie, Zookeeper, and Terraform. Adept at building Scala- and Python-based applications for structured and unstructured data processing in AWS and Azure environments.

Follows Agile and Scrum methodologies, ensuring efficient collaboration and iterative development.

Overview

7 years of professional experience

Work History

Data Engineer

Delta Air Lines
09.2020 - Current
  • Developed and maintained serverless applications using AWS Lambda and AWS Step Functions, improving application performance and scalability.
  • Managed data infrastructure on AWS, including EC2 instances, RDS databases, and S3 buckets, ensuring high availability and reliability.
  • Designed and implemented data pipelines using AWS services such as S3, Glue, EMR, and Redshift, reducing data processing time by 50%.
  • Designed and implemented complex dashboards and reports to meet specific business requirements.
  • Integrated Flume with Kafka for reliable event streaming and seamless data flow to downstream applications.
  • Conducted performance tuning on Flume agents, optimizing throughput and minimizing latency.
  • Developed Spark applications in Scala using the DataFrame and Spark SQL APIs for faster data processing.
  • Implemented custom Lambda layers for sharing code and dependencies across multiple functions.
  • Created Hive tables, loaded transactional data from Teradata using Sqoop, and worked with highly unstructured and semi-structured data totaling 2 petabytes.
  • Utilized Snowflake's data sharing features for securely sharing data across different Snowflake accounts.
  • Configured Snowflake stages and storage policies for efficient data storage and retrieval.
  • Developed MapReduce jobs for cleaning, accessing, and validating data; created and ran Sqoop jobs with incremental loads to populate Hive external tables.
  • Developed strategies for distributing web log data across the cluster, importing and exporting the stored web log data to and from HDFS and Hive using Sqoop.
  • Designed and optimized PostgreSQL queries for efficient data retrieval and improved performance.
  • Extracted and processed PostgreSQL data using Spark and PySpark, integrating structured datasets into AWS Redshift and Snowflake.
  • Developed ETL pipelines that ingested transactional data from PostgreSQL into Amazon S3 and Databricks for further transformations.
  • Implemented PostgreSQL stored procedures and functions to handle complex data processing logic before integration with analytics platforms.
  • Performed database indexing and partitioning in PostgreSQL to enhance query performance for airline operational data.
  • Automated PostgreSQL data ingestion workflows using AWS Glue and Step Functions, ensuring seamless data movement across AWS services.
  • Built scalable distributed data solutions on Cloudera Hadoop and designed and developed automated test scripts in Python.
  • Analyzed SQL scripts and designed their implementation in Spark; implemented Hive generic UDFs to incorporate business logic into Hive queries.
  • Executed DBT tests for validating data integrity and ensuring the correctness of transformations. Utilized DBT snapshots for capturing historical changes in dimensions and facts.
  • Created Hive tables and queried them using HiveQL; designed and implemented static and dynamic partitioning and bucketing in Hive.
  • Implemented materialized views in data warehousing environments for pre-aggregated data summaries. Utilized data warehousing features like clustering keys to improve query performance.
  • Employed Terraform workspaces for managing multiple environments with varying configurations.
  • Implemented GitLab CI/CD pipeline triggers based on specific branch activities for automated testing and deployment.
  • Developed Bash scripts for log rotation and retention policies in data processing environments. Implemented Ruby scripts for data validation and integrity checks in ETL workflows.
  • Collaborated on ETL (Extract, Transform, Load) tasks, maintaining data integrity and verifying pipeline stability.

Data Engineer

Fannie Mae
09.2019 - 08.2020
  • Developed and maintained serverless applications using AWS Lambda and AWS Step Functions, improving application performance and scalability.
  • Developed ETL workflows using Python and Apache Spark, ensuring data quality and consistency across multiple data sources.
  • Worked on Hive/SQL queries, performing Spark transformations using Spark RDDs and Python (PySpark).
  • Implemented custom interceptors in Flume for preprocessing log data before ingestion.
  • Integrated Snowflake with Snowpipe for real-time data ingestion from external sources.
  • Utilized Snowflake stored procedures for encapsulating complex data manipulation logic.
  • Developed custom visualizations using Databricks notebooks for data exploration and analysis.
  • Utilized Databricks Jobs API for programmatically managing and scheduling ETL workflows.
  • Implemented Azure Data Factory pipelines for orchestrating complex data workflows.
  • Leveraged Azure Logic Apps connectors for integrating with various Azure and external services.
  • Integrated Flume with Elasticsearch for real-time log indexing and search capabilities.
  • Utilized Redshift federated queries to join data across Redshift and external databases.
  • Automated Redshift snapshots for regular backups and point-in-time recovery.
  • Implemented RESTful APIs using Flask for seamless communication between web applications.
  • Developed serverless applications using AWS Lambda for cost-effective and scalable solutions.
  • Utilized Lambda environment variables for dynamic configuration and parameterization.
  • Created a serverless data ingestion pipeline on AWS using Lambda functions.
  • Developed Apache Spark applications in Scala and Python, and implemented an Apache Spark data processing module to handle data from various RDBMS and streaming sources.
  • Developed and scheduled various Spark Streaming and batch jobs using Python (PySpark) and Scala.
  • Developed Spark code in PySpark, applying various transformations and actions for faster data processing.
  • Achieved high-throughput, scalable, fault-tolerant stream processing of live data streams using Apache Spark Streaming.
  • Used Spark Streaming with Scala to load data into memory, created RDDs and DataFrames, and applied transformations and actions.
  • Created Sqoop jobs and Hive queries to ingest data from relational databases for analysis of historical data.
  • Worked with Amazon Elastic MapReduce (EMR) and set up environments on AWS EC2 instances.
  • Handled Hive queries using Spark SQL, which integrates with the Spark environment.
  • Executed Hadoop/Spark jobs on AWS EMR using programs stored in S3 buckets.
  • Loaded structured and semi-structured data into Spark clusters using Spark SQL and the DataFrame API.
  • Used AWS CloudWatch to monitor operational and performance metrics of environment instances during load testing.
  • Scripted Hadoop package installation and configuration to support fully automated deployments.
  • Installed and configured Hive on the Hadoop cluster and helped business users and application teams fine-tune their HiveQL for better performance and efficient use of cluster resources.
  • Implemented an Oozie workflow for the ETL process for critical data feeds across the platform.

Data Engineer

Toyota
01.2018 - 08.2019
  • Implemented event-driven architectures using AWS services such as S3, Kinesis, and Lambda, enabling real-time data processing and analysis.
  • Launched Amazon EC2 instances (Linux/Ubuntu/RHEL) on Amazon Web Services and configured them for specific applications.
  • Imported data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
  • Designed and implemented Snowflake data sharing for secure cross-account data collaboration.
  • Utilized Snowflake stored procedures for encapsulating complex data manipulation logic.
  • Implemented Redshift workload management (WLM) to prioritize and optimize query execution.
  • Utilized Redshift federated queries to join data across Redshift and external databases.
  • Exported analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
  • Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
  • Moved files between HDFS and AWS S3 and worked extensively with S3 buckets.
  • Created custom Python and shell scripts to import data from Oracle databases via Sqoop.
  • Monitored and troubleshot Hadoop jobs using the YARN ResourceManager, and EMR job logs using Genie and Kibana.
  • Converted Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs in Scala.
  • Channeled log data collected from web servers into HDFS using Flume and Spark Streaming.
  • Developed Spark jobs in Scala in a test environment for faster data processing and used Spark SQL for querying.
  • Designed efficient Spark code for loading and transforming data using Python and Spark SQL, which can be forward-engineered by code generation developers.
  • Worked with large sets of structured, semi-structured, and unstructured data.
  • Created big data workflows in Oozie to ingest data from various sources into Hadoop; these workflows comprise heterogeneous jobs such as Hive, Sqoop, and Python scripts.

Education

Bachelor's in Engineering Science

College of Staten Island
Staten Island, NY

Skills

  • Flume
  • Spark (Java/Scala/Python)
  • PySpark
  • PyTorch
  • Fivetran
  • AWS Kinesis
  • ETL Development
  • SSIS (SQL Server Integration Services)
  • Matillion
  • AWS Glue
  • Data Warehousing

Cloud Services:

  • AWS (Redshift, Lambda, EC2, EMR, S3, Athena)
  • Azure (Data Factory)

Data Storage and Databases:

  • PostgreSQL
  • Snowflake
  • Amazon DynamoDB
  • HDFS
  • S3

Version Control and Collaboration:

  • Git
  • GitHub
  • GitLab
  • Bitbucket

Web Frameworks:

  • Flask
  • Django

Automation and Orchestration:

  • Ansible
  • Jenkins
  • Bamboo

Real-Time Streaming:

  • Amazon Kinesis
  • Flume

Big Data Technologies:

  • Hadoop
  • Databricks
  • Hive
  • HBASE

Code Infrastructure:

  • Terraform
  • Bash Scripting
  • Ruby Scripting

Business Intelligence and Visualization:

  • Tableau
  • DBT (Data Build Tool)
  • Pandas

Machine Learning:

  • PyTorch

Other Tools and Services:

  • AWS Services (Redshift, Lambda, EC2, EMR, S3, Athena)
  • Snowflake
  • Databricks
  • SSIS
  • Azure Data Factory
