Dana Alkhatib

Staten Island, NY

Summary

Results-driven and highly skilled Data Engineer with over 9 years of experience in data engineering, data analysis, business analysis, Hadoop development, ETL development, and project management. Adept at designing and developing Java/Scala- and Python-based Spark applications, with strong object-oriented programming (OOP) skills in Python, Java, and C++. Excels at preparing interactive data visualizations in Tableau from diverse data sources.

Key Competencies:

  • Expertise in developing ETL pipelines across the telecommunications, medical, healthcare, insurance, and financial sectors.
  • Proven experience in application development and project management, ensuring successful delivery of complex data engineering projects.
  • Proficient in developing Python APIs to retrieve, analyze, and structure data from NoSQL platforms such as HBase and DynamoDB.
  • Skilled in business intelligence script development for data analysis in Hive, contributing to informed decision-making.
  • Hands-on experience with CI/CD, build, and version-control tools such as Jenkins, Bamboo, Bitbucket, Ansible, Maven, Ant, Git, CVS, MKS, and SVN.
  • Developed data ingestion tools using Spark-Scala, PySpark, and Python scripts, streamlining data processing workflows.
  • Extensive knowledge of SSIS, Matillion, AWS Athena, Glue, Amazon Kinesis for real-time streaming, dbt, Fivetran, Django, PostgreSQL, Terraform, Bash, Ruby, Flume, and Flask; AWS services including Redshift, Lambda, EC2, EMR, and S3; and Snowflake, Databricks, Azure, and PyTorch.
  • Specialized in implementing data-migration plans to transfer data to HDFS, S3, DynamoDB, and BigQuery using PySpark, Python, and Sqoop (a minimal PySpark sketch follows this list).
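
Below is a minimal, hypothetical sketch of the kind of PySpark migration job described above; the connection settings, table name, partition column, and S3 path are placeholders, not production values.

    # Hypothetical PySpark job: migrate an RDBMS table to partitioned Parquet on S3.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdbms-to-s3-migration").getOrCreate()

    # Pull the source table over JDBC (all connection details are placeholders).
    source = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://source-db:3306/source_db")
        .option("dbtable", "claims")
        .option("user", "etl_user")
        .option("password", "****")
        .load()
    )

    # Land the data as partitioned Parquet on S3 for downstream consumers.
    source.write.mode("overwrite").partitionBy("load_year").parquet(
        "s3://example-bucket/raw/claims/"
    )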

Demonstrated success in deploying comprehensive solutions, streamlining data processing workflows, and upholding data integrity across varied domains. A collaborative team member with strong analytical and problem-solving abilities, eager to apply a diverse skill set to build inventive solutions and contribute to the success of data-driven initiatives.

Overview

9 years of professional experience

Work History

Data Engineer

Bank of America, NY
11.2022 - Current
  • Designed and implemented scalable data processing pipelines using Hadoop ecosystem tools such as Apache Hadoop and Apache Spark.
  • Managed and optimized Hadoop clusters for efficient storage and processing of large datasets, ensuring fault tolerance in HDFS.
  • Utilized MapReduce programming model to develop custom data processing applications.
  • Wrote complex SQL queries for data retrieval, transformation, and analysis.
  • Designed and maintained relational databases, ensuring data integrity and optimal performance.
  • Performed database administration tasks, including indexing, partitioning, and query optimization.
  • Developed and optimized Hive queries for efficient data analysis and reporting.
  • Implemented data modeling techniques in Hive for creating structured data representations.
  • Collaborated with data analysts to translate business requirements into Hive query logic.
  • Designed and executed data validation scripts to ensure the accuracy and completeness of incoming data.
  • Implemented data quality checks to identify and rectify inconsistencies in large datasets.
  • Spearheaded the development of a robust data processing pipeline using Hadoop ecosystem tools, reducing data processing time by 30%.
  • Developed Spark applications using Scala and PySpark for large-scale data processing, implementing caching and persistence strategies for optimization.
  • Implemented and optimized Hive queries, resulting in a 20% improvement in query performance.
  • Developed validation frameworks to automate the verification of data integrity and adherence to business rules.
  • Led the implementation of data warehousing solutions using Hive, facilitating complex analytics on historical data.
  • Scheduled and monitored ETL workflows using Autosys, optimizing job dependencies for orderly execution.
  • Utilized Oracle databases for data extraction, transformation, and loading, conducting performance tuning on SQL queries.
  • Implemented Spark RDD transformations and actions, optimizing workflows for fault tolerance.
  • Leveraged Spark SQL for querying structured data and Spark Streaming for real-time analytics; a caching-and-query sketch follows this list.
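
A minimal sketch of the Spark SQL caching-and-aggregation pattern referenced above; the database, table, and column names are illustrative assumptions, not production code.

    # Hypothetical Spark SQL job: cache a Hive table that feeds several aggregations.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("txn-aggregation")
        .enableHiveSupport()  # allows reading/writing managed Hive tables
        .getOrCreate()
    )

    # Persist the source table since several downstream queries reuse it.
    txns = spark.table("finance.transactions")
    txns.persist(StorageLevel.MEMORY_AND_DISK)
    txns.createOrReplaceTempView("txns")

    # Aggregate with Spark SQL and write the result back to Hive.
    daily = spark.sql("""
        SELECT txn_date, account_id, SUM(amount) AS total_amount
        FROM txns
        GROUP BY txn_date, account_id
    """)
    daily.write.mode("overwrite").saveAsTable("finance.daily_totals")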

Sr Data Engineer

Fannie Mae
12.2020 - 10.2022
  • Application design, data analysis, migration, development, testing, and deployment for a real estate feature store.
  • Developed data pipelines from Snowflake/MySQL to Databricks and developed an online feature store.
  • Development of data pipelines in Databricks Delta Lake.
  • Developed a PySpark application to transform data and load it into the feature store, using object-oriented Python and shell scripting.
  • Created Databricks notebooks using PySpark, Python, and SQL, and automated them using Databricks jobs.
  • Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau
  • Automated backup jobs on a monthly/daily basis with AWS CloudWatch and Lambda.
  • Identified, designed, and developed statistical data analysis routines to process and visualize large datasets, turning information into insights across multiple platforms with SQLAlchemy, Pandas, and NumPy.
  • Integrated the feature store with data APIs by home type and category.
  • Developed Python APIs to dump array structures in the processor at the failure point for debugging.
  • Developed Snowpipe pipelines for continuous data loading.
  • ETL development with AWS Glue, Athena, CloudWatch, and Lambda.
  • Deployed the application across Dev, Stage, and Prod environments.
  • Configured and set up Airflow DAGs for each workflow and for multiple environments (see the DAG sketch after this list).
  • Estimated cluster sizes and monitored and troubleshot the Spark Databricks clusters.
  • Developed scripts to create Delta table DDL and analyze tables from PySpark jobs.
  • Automated data validation using Great Expectations and stub functions.
  • Developed a Tableau dashboard with weekly, category-wise data views.
  • Environment: PySpark, Python, Glue, Databricks, Snowflake, Tableau, SnowSQL, AWS Lambda, EMR, EC2, S3, CloudWatch, Airflow, shell scripting, Linux, Jenkins, Bitbucket, Hadoop ecosystem, HDFS, Hive, Sqoop, MapReduce, resource managers, Oozie
  • Spearheaded the implementation of an online feature store within Databricks Delta Lake, optimizing storage and query performance for feature retrieval.
  • Integrated Ruby scripts into the overall insurance data engineering workflow to enhance automation.
  • Leveraged Ruby for handling specific data transformations and manipulations.
  • Collaborated with cross-functional teams to architect end-to-end solutions, incorporating Amazon Kinesis into a broader AWS ecosystem.
  • Provided technical expertise in troubleshooting and optimizing Amazon Kinesis applications, ensuring high-performance data streaming.
  • Enhanced the scalability and maintainability of AWS infrastructure using Terraform.
  • Utilized Django for creating interactive and user-friendly interfaces for data visualization and reporting.
  • Integrated Django applications with backend data processing systems for seamless functionality.
  • Implemented Bash scripting for automating backup jobs and other routine data processing tasks.
  • Used the dbt development cycle for iterative development and testing of SQL-based transformations, ensuring robustness and reliability.
  • Applied PyTorch for machine learning tasks, leveraging its capabilities for model development and deployment.
  • Utilized Pandas for data manipulation, cleaning, and preprocessing tasks, ensuring data quality and consistency.
  • Developed Flask applications for creating RESTful APIs, facilitating seamless integration of feature store data into various applications.
  • Integrated Flume for log collection and aggregation, enhancing real-time data ingestion capabilities.
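
A sketch of the kind of Airflow DAG used to schedule these pipelines; the DAG id, task commands, and schedule are hypothetical stand-ins for the production workflow.

    # Hypothetical Airflow DAG: extract from Snowflake, transform with PySpark,
    # then load the feature store, once a day.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="feature_store_refresh",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="extract_from_snowflake",
            bash_command="python extract_snowflake.py",
        )
        transform = BashOperator(
            task_id="transform_with_pyspark",
            bash_command="spark-submit transform_features.py",
        )
        load = BashOperator(
            task_id="load_feature_store",
            bash_command="python load_feature_store.py",
        )

        extract >> transform >> load  # dependencies define the run order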

Big Data Engineer, Hadoop Developer

Freddie Mac
Scottsdale, AZ
01.2020 - 11.2020
  • Data analysis, application design, development, testing, and deployment for workers' compensation insurance.
  • Built data pipelines for insurance claims data from multiple TPAs (third-party administrators).
  • Developed a PySpark application to extract data from multiple TPAs, using Python and shell scripting.
  • Mentored team members and provided technical oversight and input on various projects.
  • Automated daily backup jobs with AWS CloudWatch and Lambda.
  • Developed Spark jobs to classify data by industry type and category.
  • Designed the ETL data pipeline to transform the data from multiple sources to Master tables
  • Created several types of data visualizations using Python and Tableau
  • Developed Python APIs to dump array structures in the processor at the failure point for debugging.
  • Built deployments on the AWS ecosystem with EMR, S3, CloudWatch, Step Functions, and Lambda.
  • Deployed the application across Dev, Stage, and Prod environments.
  • Created Databricks notebooks using PySpark, Python, and SQL, and automated them using Databricks jobs.
  • Developed scripts to create Hive table DDL and analyze tables from PySpark jobs.
  • Integrated Django applications with backend data processing systems for seamless functionality.
  • Developed a Tableau dashboard with weekly, category-wise data views.
  • Implemented ETL processes using Matillion for seamless extraction, transformation, and loading of data from multiple sources.
  • Utilized Matillion's capabilities to enhance data transformation and integration in the insurance data engineering projects.
  • Developed SSIS packages for data extraction, transformation, and loading within the Microsoft SQL Server ecosystem.
  • Implemented SSIS solutions to handle complex data integration scenarios and improve data consistency.
  • Orchestrated infrastructure as code using Terraform for AWS deployments.
  • Automated the provisioning of cloud resources, including EMR clusters, S3 buckets, and Lambda functions.
  • Leveraged AWS Athena for interactive querying and analysis of data stored in Amazon S3.
  • Integrated Athena into the data engineering pipeline, enabling efficient querying and exploration of large datasets (a boto3 Athena sketch follows this list).
  • Configured Fivetran connectors to automate data integration and synchronization between various data sources.
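
A minimal boto3 sketch of the Athena usage described above; the database, query, and output location are assumed for illustration.

    # Hypothetical Athena query via boto3: results land in the given S3 prefix.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Start an asynchronous query against claims data stored in S3.
    response = athena.start_query_execution(
        QueryString=(
            "SELECT industry_type, COUNT(*) AS claim_count "
            "FROM claims_db.claims GROUP BY industry_type"
        ),
        QueryExecutionContext={"Database": "claims_db"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )

    # Check execution status (a real job would poll until SUCCEEDED/FAILED).
    query_id = response["QueryExecutionId"]
    status = athena.get_query_execution(QueryExecutionId=query_id)
    print(status["QueryExecution"]["Status"]["State"])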

Big Data Engineer, Hadoop Developer

Dutch Bank
Malvern, PA
01.2017 - 12.2019
  • Data analysis, application design, development, testing, and deployment for exchange-traded funds (ETFs).
  • Analyzed ETF data, including fund flows and assets under management (AUM) across different categories.
  • Designed a PySpark application to extract data from multiple vendors through REST APIs (see the ingestion sketch after this list).
  • Developed Spark applications with Python, Java, and shell scripting.
  • Designed the base ETL to transform data from source to master tables.
  • Implemented business rules for currency conversions, fund coverage, and daily pricing.
  • Deployed applications across all development, staging, and production environments, ensuring consistency and reliability.
  • Created Databricks notebooks using PySpark, Python, and SQL, automating notebooks using job scheduling for streamlined workflows.
  • Developed scripts to create Hive table DDL and analyze tables from PySpark jobs, optimizing data storage and retrieval.
  • Designed and developed Tableau dashboards with views for weekly data categorization, providing actionable insights.
  • Designed, developed, and implemented performant ETL pipelines using the Python API (PySpark) of Apache Spark on AWS EMR.
  • Structured data and built pipelines for data ingestion and transformation into Hive tables.
  • Developed and maintained AWS CloudWatch Alarms and Logs for monitoring the health and performance of Amazon Kinesis streams.
  • Enhanced security measures by implementing AWS Identity and Access Management (IAM) roles and policies for Amazon Kinesis resources.
  • Implemented data retention and lifecycle policies to manage the storage of streaming data within Amazon Kinesis.
  • Collaborated with stakeholders to gather requirements and designed scalable architectures for streaming data processing using Amazon Kinesis.
  • Performed data modeling: defining schemas, removing duplicates, and filling missing observations.
  • Built deployments on the AWS ecosystem with EMR, S3, EC2, Service Catalog, and the Glue Data Catalog.
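
A minimal sketch of the vendor REST-API ingestion pattern mentioned above; the endpoint, schema, and Hive table are hypothetical.

    # Hypothetical PySpark ingestion: pull vendor fund data over REST, apply a
    # business rule, and append it to a Hive table.
    import requests
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("etf-ingest")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Fetch fund-flow records from a placeholder vendor endpoint.
    records = requests.get("https://api.example-vendor.com/v1/fund-flows").json()

    # Parallelize the JSON payload and drop funds with no reported flows.
    df = spark.createDataFrame(records)
    df = df.filter(df.net_flow.isNotNull())

    df.write.mode("append").saveAsTable("etf.fund_flows")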

Hadoop Developer

Delta Airline
Plymouth Meeting, PA
07.2015 - 01.2017
  • Application design, development, testing, and deployment of a big data solution for customer experience.
  • The customer transactional medical data includes patient visits, medications, prescriptions, problems, orders, and observations.
  • Designed the Spark application to extract data from XML sources and transform OMOP tables to CDD tables.
  • Implemented business rules to read patient data and inferred new rules to derive specific data.
  • Designed business logic for cleaning and structuring incoming customer data initially stored in HDFS as Parquet files, and for transformation to Impala tables.
  • Created dashboards using Tableau and Plotly in Databricks using PySpark
  • Configured a Python producer to ingest data from the Slack API into Kafka for real-time processing with Spark (a producer sketch follows this list).
  • Performed data modeling: defining schemas, removing duplicates, and filling missing observations.
  • Developed the unit test framework for all CDD tables downstream of the business rules.
  • Developed scripts to create Impala table DDL and analyze tables from Spark jobs.
  • Developed Spark jobs with the Scala and Java APIs and performed transformations and actions on RDDs.
  • Designed the patient EMR object model with Java object-oriented programming.
  • Performed data ingestion from RDBMS to HDFS with Sqoop, OraOop, and Spark JDBC applications.
  • Designed, built, and deployed a set of Python modeling APIs for customer analytics that integrate multiple machine learning techniques for user behavior prediction and support multiple marketing segmentation programs.

Education

Bachelor of Health Science

Long Island University
01.2015

Skills

  • Big Data Technologies: Spark, PySpark, Hive, Impala, Sqoop, Tableau, Flume, Oozie, HDFS, MongoDB, Snowflake, Databricks, MapReduce
  • AWS: EMR, Glue, S3, EC2, Lambda, Athena, Step Functions, API Gateway, SNS, Glue Data Catalog, Redshift, DynamoDB, CloudWatch
  • GCP: Dataproc, BigQuery, Cloud Storage, VMs, images
  • Languages: Python, Scala, Java, C, SQL, MySQL, shell scripting
  • Workflow: Airflow, Step functions, Dataflow, Control-M
  • Web Technologies: HTML, CSS, JavaScript, JSON, XML/SOAP, REST, WSDL
  • Operating Systems: Linux (Ubuntu, Fedora, CentOS), Unix, Windows
  • IDEs: Databricks, Jupyter, IntelliJ, PyCharm, Eclipse, Source Insight
  • Version Control: Git, Subversion, CVS, MKS
  • DB Tools: SQL Developer, Squirrel
  • Tracking tools: Redmine, JIRA
  • CI/CD tools: Jenkins, Chef, Confluence, Bitbucket

Technical Summary

Spark, PySpark, Hive, Impala, Sqoop, Tableau, Flume, Oozie, HDFS, MongoDB, Snowflake, Databricks, MapReduce, AWS, EMR, Glue, S3, EC2, Lambda, Athena, Step Functions, API Gateway, SNS, Glue Data Catalog, Redshift, DynamoDB, CloudWatch, GCP, Dataproc, BigQuery, Cloud Storage, VMs, images, Python, Scala, Java, C++, SQL, MySQL, shell scripting, Airflow, Dataflow, Control-M, HTML, CSS, JavaScript, JSON, XML/SOAP, REST, WSDL, Linux (Ubuntu, Fedora, CentOS), Unix, Windows, Jupyter, IntelliJ, PyCharm, Eclipse, Source Insight, Git, Subversion, CVS, MKS, SQL Developer, Squirrel, Redmine, JIRA, Jenkins, Chef, Confluence, Bitbucket

Timeline

Data Engineer

Bank of America
11.2022 - Current

Sr Data Engineer

Fannie Mae
12.2020 - 10.2022

Big Data Engineer, Hadoop Developer

Freddie Mac
01.2020 - 11.2020

Big Data Engineer, Hadoop Developer

Dutch Bank
01.2017 - 12.2019

Hadoop Developer

Delta Airline
07.2015 - 01.2017

Bachelor of Health Science

Long Island University