Dana Alkhatib

Staten Island, NY

Summary

Results-driven and highly skilled Data Engineer with over 9 years of experience in data engineering, data analysis, business analysis, Hadoop development, ETL development, and project management. Adept at designing and developing Java/Scala- and Python-based Spark applications, with strong object-oriented programming (OOP) skills in Python, Java, and C++. Excels at preparing interactive data visualizations in Tableau from diverse data sources.

Key Competencies:

  • Expertise in developing ETL pipelines across the telecommunications, medical, healthcare, insurance, and financial sectors.
  • Proven experience in application development and project management, ensuring successful delivery of complex data engineering projects.
  • Proficient in developing Python APIs to retrieve, analyze, and structure data from NoSQL platforms such as HBase and DynamoDB.
  • Skilled in business intelligence script development for data analysis in Hive, contributing to informed decision-making.
  • Hands-on experience with CI/CD, build, and version-control tools such as Jenkins, Bamboo, Bitbucket, Ansible, Maven, Ant, Git, CVS, MKS, and SVN.
  • Developed data ingestion tools using Spark-Scala, PySpark, and Python scripts, streamlining data processing workflows.
  • Extensive knowledge of SSIS, Matillion, AWS Athena, Glue, Amazon Kinesis for real-time streaming, dbt, Fivetran, Django, PostgreSQL, Terraform, Bash, Ruby, Flume, and Flask; AWS services including Redshift, Lambda, EC2, EMR, and S3; and Snowflake, Databricks, Azure, and PyTorch.
  • Specialized in implementing data-migration plans to transfer data to HDFS, S3, DynamoDB, and BigQuery using PySpark, Python, and Sqoop (a minimal PySpark sketch follows this list).
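
Below is a minimal, hypothetical sketch of the kind of PySpark migration job described above; the connection settings, table name, partition column, and S3 path are placeholders, not production values.

    # Hypothetical PySpark job: migrate an RDBMS table to partitioned Parquet on S3.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdbms-to-s3-migration").getOrCreate()

    # Pull the source table over JDBC (all connection details are placeholders).
    source = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://source-db:3306/source_db")
        .option("dbtable", "claims")
        .option("user", "etl_user")
        .option("password", "****")
        .load()
    )

    # Land the data as partitioned Parquet on S3 for downstream consumers.
    source.write.mode("overwrite").partitionBy("load_year").parquet(
        "s3://example-bucket/raw/claims/"
    )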

Demonstrated success in deploying comprehensive solutions, streamlining data processing workflows, and upholding data integrity across varied domains. A collaborative team member with strong analytical and problem-solving abilities, eager to apply a diverse skill set to build inventive solutions and contribute to the success of data-driven initiatives.

Overview

9 years of professional experience

Work History

Data Engineer

Bank of America, NY
11.2022 - Current
  • Designed and implemented scalable data processing pipelines using Hadoop ecosystem tools such as Apache Hadoop and Apache Spark.
  • Managed and optimized Hadoop clusters for efficient storage and processing of large datasets, ensuring fault tolerance in HDFS.
  • Utilized MapReduce programming model to develop custom data processing applications.
  • Wrote complex SQL queries for data retrieval, transformation, and analysis.
  • Designed and maintained relational databases, ensuring data integrity and optimal performance.
  • Performed database administration tasks, including indexing, partitioning, and query optimization.
  • Developed and optimized Hive queries for efficient data analysis and reporting.
  • Implemented data modeling techniques in Hive for creating structured data representations.
  • Collaborated with data analysts to translate business requirements into Hive query logic.
  • Designed and executed data validation scripts to ensure the accuracy and completeness of incoming data.
  • Implemented data quality checks to identify and rectify inconsistencies in large datasets.
  • Spearheaded the development of a robust data processing pipeline using Hadoop ecosystem tools, reducing data processing time by 30%.
  • Developed Spark applications using Scala and PySpark for large-scale data processing, implementing caching and persistence strategies for optimization.
  • Implemented and optimized Hive queries, resulting in a 20% improvement in query performance.
  • Developed validation frameworks to automate the verification of data integrity and adherence to business rules.
  • Led the implementation of data warehousing solutions using Hive, facilitating complex analytics on historical data.
  • Scheduled and monitored ETL workflows using Autosys, optimizing job dependencies for orderly execution.
  • Utilized Oracle databases for data extraction, transformation, and loading, conducting performance tuning on SQL queries.
  • Implemented Spark RDD transformations and actions, optimizing workflows for fault tolerance.
  • Leveraged Spark SQL for querying structured data and Spark Streaming for real-time analytics; a caching-and-query sketch follows this list.
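
A minimal sketch of the Spark SQL caching-and-aggregation pattern referenced above; the database, table, and column names are illustrative assumptions, not production code.

    # Hypothetical Spark SQL job: cache a Hive table that feeds several aggregations.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("txn-aggregation")
        .enableHiveSupport()  # allows reading/writing managed Hive tables
        .getOrCreate()
    )

    # Persist the source table since several downstream queries reuse it.
    txns = spark.table("finance.transactions")
    txns.persist(StorageLevel.MEMORY_AND_DISK)
    txns.createOrReplaceTempView("txns")

    # Aggregate with Spark SQL and write the result back to Hive.
    daily = spark.sql("""
        SELECT txn_date, account_id, SUM(amount) AS total_amount
        FROM txns
        GROUP BY txn_date, account_id
    """)
    daily.write.mode("overwrite").saveAsTable("finance.daily_totals")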

Sr Data Engineer

Fannie Mae
12.2020 - 10.2022
  • Application design, data analysis, migration, development, testing, and deployment for a real estate feature store.
  • Developed data pipelines from Snowflake/MySQL to Databricks and developed an online feature store.
  • Development of data pipelines in Databricks Delta Lake.
  • Developed a PySpark application to transform data and load it into the feature store, using object-oriented Python and shell scripting.
  • Created Databricks notebooks using PySpark, Python, and SQL, and automated them using Databricks jobs.
  • Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau
  • Automated backup jobs on a monthly/daily basis with AWS CloudWatch and Lambda.
  • Identified, designed, and developed statistical data analysis routines to process and visualize large datasets, turning information into insights across multiple platforms with SQLAlchemy, Pandas, and NumPy.
  • Integrated the feature store with data APIs by home type and category.
  • Developed Python APIs to dump array structures in the processor at the failure point for debugging.
  • Developed Snowpipe pipelines for continuous data loading.
  • ETL development with AWS Glue, Athena, CloudWatch, and Lambda.
  • Deployed the application across Dev, Stage, and Prod environments.
  • Configured and set up Airflow DAGs for each workflow and for multiple environments (see the DAG sketch after this list).
  • Estimated cluster sizes and monitored and troubleshot the Spark Databricks clusters.
  • Developed scripts to create Delta table DDL and analyze tables from PySpark jobs.
  • Automated data validation using Great Expectations and stub functions.
  • Developed a Tableau dashboard with weekly, category-wise data views.
  • Environment: PySpark, Python, Glue, Databricks, Snowflake, Tableau, SnowSQL, AWS Lambda, EMR, EC2, S3, CloudWatch, Airflow, shell scripting, Linux, Jenkins, Bitbucket, Hadoop ecosystem, HDFS, Hive, Sqoop, MapReduce, resource managers, Oozie
  • Spearheaded the implementation of an online feature store within Databricks Delta Lake, optimizing storage and query performance for feature retrieval.
  • Integrated Ruby scripts into the overall insurance data engineering workflow to enhance automation.
  • Leveraged Ruby for handling specific data transformations and manipulations.
  • Collaborated with cross-functional teams to architect end-to-end solutions, incorporating Amazon Kinesis into a broader AWS ecosystem.
  • Provided technical expertise in troubleshooting and optimizing Amazon Kinesis applications, ensuring high-performance data streaming.
  • Enhanced the scalability and maintainability of AWS infrastructure using Terraform.
  • Utilized Django for creating interactive and user-friendly interfaces for data visualization and reporting.
  • Integrated Django applications with backend data processing systems for seamless functionality.
  • Implemented Bash scripting for automating backup jobs and other routine data processing tasks.
  • Used the dbt development cycle for iterative development and testing of SQL-based transformations, ensuring robustness and reliability.
  • Applied PyTorch for machine learning tasks, leveraging its capabilities for model development and deployment.
  • Utilized Pandas for data manipulation, cleaning, and preprocessing tasks, ensuring data quality and consistency.
  • Developed Flask applications for creating RESTful APIs, facilitating seamless integration of feature store data into various applications.
  • Integrated Flume for log collection and aggregation, enhancing real-time data ingestion capabilities.
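
A sketch of the kind of Airflow DAG used to schedule these pipelines; the DAG id, task commands, and schedule are hypothetical stand-ins for the production workflow.

    # Hypothetical Airflow DAG: extract from Snowflake, transform with PySpark,
    # then load the feature store, once a day.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="feature_store_refresh",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="extract_from_snowflake",
            bash_command="python extract_snowflake.py",
        )
        transform = BashOperator(
            task_id="transform_with_pyspark",
            bash_command="spark-submit transform_features.py",
        )
        load = BashOperator(
            task_id="load_feature_store",
            bash_command="python load_feature_store.py",
        )

        extract >> transform >> load  # dependencies define the run order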

Big Data Engineer, Hadoop Developer

Freddie Mac
Scottsdale, AZ
01.2020 - 11.2020
  • Data analysis, application design, development, testing, and deployment for workers' compensation insurance.
  • Built data pipelines for insurance claims data from multiple TPAs (third-party administrators).
  • Developed a PySpark application to extract data from multiple TPAs, using Python and shell scripting.
  • Mentored team members and provided technical oversight and input on various projects.
  • Automated daily backup jobs with AWS CloudWatch and Lambda.
  • Developed Spark jobs to classify data by industry type and category.
  • Designed the ETL data pipeline to transform the data from multiple sources to Master tables
  • Created several types of data visualizations using Python and Tableau
  • Developed Python APIs to dump array structures in the processor at the failure point for debugging.
  • Built deployments on the AWS ecosystem with EMR, S3, CloudWatch, Step Functions, and Lambda.
  • Deployed the application across Dev, Stage, and Prod environments.
  • Created Databricks notebooks using PySpark, Python, and SQL, and automated them using Databricks jobs.
  • Developed scripts to create Hive table DDL and analyze tables from PySpark jobs.
  • Integrated Django applications with backend data processing systems for seamless functionality.
  • Developed a Tableau dashboard with weekly, category-wise data views.
  • Implemented ETL processes using Matillion for seamless extraction, transformation, and loading of data from multiple sources.
  • Utilized Matillion's capabilities to enhance data transformation and integration in the insurance data engineering projects.
  • Developed SSIS packages for data extraction, transformation, and loading within the Microsoft SQL Server ecosystem.
  • Implemented SSIS solutions to handle complex data integration scenarios and improve data consistency.
  • Orchestrated infrastructure as code using Terraform for AWS deployments.
  • Automated the provisioning of cloud resources, including EMR clusters, S3 buckets, and Lambda functions.
  • Leveraged AWS Athena for interactive querying and analysis of data stored in Amazon S3.
  • Integrated Athena into the data engineering pipeline, enabling efficient querying and exploration of large datasets (a boto3 Athena sketch follows this list).
  • Configured Fivetran connectors to automate data integration and synchronization between various data sources.
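
A minimal boto3 sketch of the Athena usage described above; the database, query, and output location are assumed for illustration.

    # Hypothetical Athena query via boto3: results land in the given S3 prefix.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Start an asynchronous query against claims data stored in S3.
    response = athena.start_query_execution(
        QueryString=(
            "SELECT industry_type, COUNT(*) AS claim_count "
            "FROM claims_db.claims GROUP BY industry_type"
        ),
        QueryExecutionContext={"Database": "claims_db"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )

    # Check execution status (a real job would poll until SUCCEEDED/FAILED).
    query_id = response["QueryExecutionId"]
    status = athena.get_query_execution(QueryExecutionId=query_id)
    print(status["QueryExecution"]["Status"]["State"])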

Big Data Engineer, Hadoop Developer

Dutch Bank
Malvern, PA
01.2017 - 12.2019
  • Data analysis, application design, development, testing, and deployment for exchange-traded funds (ETFs).
  • Analyzed ETF data, including fund flows and assets under management (AUM) across different categories.
  • Designed a PySpark application to extract data from multiple vendors through REST APIs (see the ingestion sketch after this list).
  • Developed Spark applications with Python, Java, and shell scripting.
  • Designed the base ETL to transform data from source to master tables.
  • Implemented business rules for currency conversions, fund coverage, and daily pricing.
  • Deployed applications across all development, staging, and production environments, ensuring consistency and reliability.
  • Created Databricks notebooks using PySpark, Python, and SQL, automating notebooks using job scheduling for streamlined workflows.
  • Developed scripts to create Hive table DDL and analyze tables from PySpark jobs, optimizing data storage and retrieval.
  • Designed and developed Tableau dashboards with views for weekly data categorization, providing actionable insights.
  • Designed, developed, and implemented performant ETL pipelines using the Python API (PySpark) of Apache Spark on AWS EMR.
  • Structured data and built pipelines for data ingestion and transformation into Hive tables.
  • Developed and maintained AWS CloudWatch Alarms and Logs for monitoring the health and performance of Amazon Kinesis streams.
  • Enhanced security measures by implementing AWS Identity and Access Management (IAM) roles and policies for Amazon Kinesis resources.
  • Implemented data retention and lifecycle policies to manage the storage of streaming data within Amazon Kinesis.
  • Collaborated with stakeholders to gather requirements and designed scalable architectures for streaming data processing using Amazon Kinesis.
  • Performed data modeling: defining schemas, removing duplicates, and filling missing observations.
  • Built deployments on the AWS ecosystem with EMR, S3, EC2, Service Catalog, and the Glue Data Catalog.
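
A minimal sketch of the vendor REST-API ingestion pattern mentioned above; the endpoint, schema, and Hive table are hypothetical.

    # Hypothetical PySpark ingestion: pull vendor fund data over REST, apply a
    # business rule, and append it to a Hive table.
    import requests
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("etf-ingest")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Fetch fund-flow records from a placeholder vendor endpoint.
    records = requests.get("https://api.example-vendor.com/v1/fund-flows").json()

    # Parallelize the JSON payload and drop funds with no reported flows.
    df = spark.createDataFrame(records)
    df = df.filter(df.net_flow.isNotNull())

    df.write.mode("append").saveAsTable("etf.fund_flows")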

Hadoop Developer

Delta Airline
Plymouth Meeting, PA
07.2015 - 01.2017
  • Application design, development, testing, and deployment of a big data solution for customer experience.
  • The customer transactional medical data includes patient visits, medications, prescriptions, problems, orders, and observations.
  • Designed the Spark application to extract data from XML sources and transform OMOP tables to CDD tables.
  • Implemented business rules to read patient data and inferred new rules to derive specific data.
  • Designed business logic for cleaning and structuring incoming customer data initially stored in HDFS as Parquet files, and for transformation to Impala tables.
  • Created dashboards using Tableau and Plotly in Databricks using PySpark
  • Configured a Python producer to ingest data from the Slack API into Kafka for real-time processing with Spark (a producer sketch follows this list).
  • Performed data modeling: defining schemas, removing duplicates, and filling missing observations.
  • Developed the unit test framework for all CDD tables downstream of the business rules.
  • Developed scripts to create Impala table DDL and analyze tables from Spark jobs.
  • Developed Spark jobs with the Scala and Java APIs and performed transformations and actions on RDDs.
  • Designed the patient EMR object model with Java object-oriented programming.
  • Performed data ingestion from RDBMS to HDFS with Sqoop, OraOop, and Spark JDBC applications.
  • Designed, built, and deployed a set of Python modeling APIs for customer analytics that integrate multiple machine learning techniques for user behavior prediction and support multiple marketing segmentation programs.

Education

Bachelor of Health Science

Long Island University
01.2015

Skills

  • Big Data Technologies: Spark, PySpark, Hive, Impala, Sqoop, Tableau, Flume, Oozie, HDFS, MongoDB, Snowflake, Databricks, MapReduce
  • AWS: EMR, Glue, S3, EC2, Lambda, Athena, Step Functions, API Gateway, SNS, Glue Data Catalog, Redshift, DynamoDB, CloudWatch
  • GCP: Dataproc, BigQuery, Cloud Storage, VMs, images
  • Languages: Python, Scala, Java, C, SQL, MySQL, shell scripting
  • Workflow: Airflow, Step functions, Dataflow, Control-M
  • Web Technologies: HTML, CSS, JavaScript, JSON, XML/SOAP, REST, WSDL
  • Operating Systems: Linux (Ubuntu, Fedora, CentOS), Unix, Windows
  • IDEs: Databricks, Jupyter, IntelliJ, PyCharm, Eclipse, Source Insight
  • Version Control: Git, Subversion, CVS, MKS
  • DB Tools: SQL Developer, Squirrel
  • Tracking tools: Redmine, JIRA
  • CI/CD tools: Jenkins, Chef, Confluence, Bitbucket

Technical Summary

Spark, PySpark, Hive, Impala, Sqoop, Tableau, Flume, Oozie, HDFS, MongoDB, Snowflake, Databricks, MapReduce, AWS, EMR, Glue, S3, EC2, Lambda, Athena, Step Functions, API Gateway, SNS, Glue Data Catalog, Redshift, DynamoDB, CloudWatch, GCP, Dataproc, BigQuery, Cloud Storage, VMs, images, Python, Scala, Java, C++, SQL, MySQL, shell scripting, Airflow, Dataflow, Control-M, HTML, CSS, JavaScript, JSON, XML/SOAP, REST, WSDL, Linux (Ubuntu, Fedora, CentOS), Unix, Windows, Jupyter, IntelliJ, PyCharm, Eclipse, Source Insight, Git, Subversion, CVS, MKS, SQL Developer, Squirrel, Redmine, JIRA, Jenkins, Chef, Confluence, Bitbucket

Timeline

Data Engineer

Bank of America
11.2022 - Current

Sr Data Engineer

Fannie Mae
12.2020 - 10.2022

Big Data Engineer, Hadoop Developer

Freddie Mac
01.2020 - 11.2020

Big Data Engineer, Hadoop Developer

Dutch Bank
01.2017 - 12.2019

Hadoop Developer

Delta Airline
07.2015 - 01.2017

Bachelor of Health Science

Long Island University