Mike Husein

Staten Island, NY

Summary

  • Over 6 years of IT experience in the analysis, design, development, and implementation of large-scale applications using Big Data and Java/J2EE technologies such as Apache Spark, Hadoop, Hive, Sqoop, Oozie, HBase, Zookeeper, Python, and Scala
  • Strong experience writing Spark Core, Spark SQL, Spark Streaming, and Java MapReduce applications, including Spark applications written in Java
  • Highly skilled in integrating Kafka with Spark Streaming to build long-running real-time applications
  • Solid understanding of RDD operations in Apache Spark: transformations and actions, persistence (caching), accumulators, broadcast variables, and broadcast optimization
  • In-depth knowledge of handling large volumes of data with the Spark DataFrame/Dataset API and case classes (illustrated in the sketch below)
  • Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop
  • In-depth knowledge of Big Data architecture and the components of Hadoop 1.x and 2.x, including HDFS, JobTracker, TaskTracker, DataNode, and NameNode, and YARN concepts such as ResourceManager and NodeManager
  • Hands-on experience with AWS cloud services (EC2, S3, RDS, Glue, Redshift, Data Pipeline, EMR, WorkSpaces, Lambda)
  • Experienced in writing HiveQL scripts, with a good understanding of MapReduce design patterns and data analysis using Hive
  • Strong knowledge of the Apache Spark Streaming API on Big Data distributions in active cluster environments
  • Very capable with AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs on AWS
  • Proficient in importing and exporting data between relational database systems and HDFS using Sqoop
  • Good understanding of NoSQL databases such as HBase, Cassandra, and MongoDB in enterprise use cases
  • Very capable of processing large sets of structured, semi-structured, and unstructured data and supporting application architectures on Hadoop, Spark, and SQL databases such as Teradata, MySQL, and DB2
  • Experienced with version control and source code management tools such as Git, SVN, and Bitbucket
  • Experience in Java application development and client/server applications using MVC, J2EE, JDBC, JSP, XML methodologies (XML, XSL, XSD), web services, relational databases, and NoSQL databases
  • Hands-on experience in application development using Java, RDBMS, Linux shell scripting, and Perl
  • Hands-on experience with IDE and build tools such as Eclipse, IntelliJ, NetBeans, Visual Studio, Git, and Maven, and experienced in writing cohesive end-to-end applications on Apache Zeppelin
  • Experience working in Waterfall and Agile (Scrum) methodologies
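
The sketch below is a minimal, generic illustration of the Dataset-with-case-classes and broadcast-variable patterns listed above; it is not code from a specific engagement, and the Order schema, input path, and region lookup are assumptions made for illustration.

    import org.apache.spark.sql.SparkSession

    // Hypothetical record layout; field names are assumed to match the CSV header.
    case class Order(orderId: String, customerId: String, amount: Double, region: String)

    object DatasetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
        import spark.implicits._

        // Typed Dataset built from a DataFrame via a case class.
        val orders = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/orders/*.csv")   // assumed input location
          .as[Order]

        // Small lookup table shipped to executors as a broadcast variable.
        val regionNames = spark.sparkContext.broadcast(Map("NE" -> "Northeast", "W" -> "West"))

        // Transformation over the typed Dataset, then an action to materialize the result.
        orders.map(o => (regionNames.value.getOrElse(o.region, "Other"), o.amount))
          .toDF("region", "amount")
          .groupBy("region").sum("amount")
          .show()

        spark.stop()
      }
    }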

Overview

6 years of professional experience

Work History

Data Engineer

Forbes
New York
05.2019 - Current
  • Developed data pipelines using StreamSets Data Collector to store data from Kafka into HDFS, Elasticsearch, HBase, and MapR-DB
  • Implemented event streaming across StreamSets Data Collector stages, running a MapReduce job on event triggers to convert Avro to Parquet
  • Analyzed the Hadoop stack and various big data analytic tools, including Kafka, Hive, HBase, and Sqoop
  • Created project documents such as source-to-target data mapping documents, unit test cases, and a data migration document
  • Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, JVM tuning, and map/reduce slot configuration
  • Designed and implemented Spark test bench application to evaluate quality of recommendations made by the engine
  • Built tooling that monitored log input from several data centers via Spark Streaming; the data was analyzed, parsed, and saved into Cassandra
  • Implemented Cluster balancing
  • Migrated high-volume OLTP transactions from Oracle to Cassandra to reduce the Oracle licensing footprint
  • Handled streaming and complex analytics processing using Spark
  • Implemented test scripts to support test driven development and continuous integration
  • Worked on tuning the performance of Hive
  • Worked with Impala for massively parallel processing of Hive queries
  • Streamed data into Hadoop using Kafka
  • Wrote Java code for custom Partitioner and Writable implementations
  • Worked on the Analytics Infrastructure team to develop a stream filtering system on top of Apache Kafka
  • Simplified jobs by building applications on top of the NoSQL database Cassandra
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS (see the sketch after this role's bullet list)
  • Unit tested and tuned SQLs and ETL Code for better performance
  • Monitored the performance and identified performance bottlenecks in ETL code
  • Used Tableau to generate reports, graphs, and charts summarizing the given data sets
  • Worked on data utilizing a Hadoop, Zookeeper, and Accumulo stack, aiding in the development of specialized indexes for performant queries on big data implementations
  • Created partitioned Hive tables, loaded and analyzed data using Hive queries, and implemented partitioning and bucketing in Hive
  • Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, to evaluate adopting Impala in the project
  • Developed Hive queries to process the data and generate the data cubes for visualizing
  • Implemented schema extraction for Parquet and Avro file formats in Hive
  • Designed ETL jobs for data processing in Talend Open Studio
  • Designed, reviewed, implemented, and optimized data transformation processes in the Hadoop, Talend, and Informatica ecosystems
  • Implemented partitioning, dynamic partitions, and bucketing in Hive
  • Configured Hadoop clusters and coordinated with Big Data Admins for cluster maintenance
  • Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Python, Kafka, Hive, Sqoop, Amazon AWS, Elasticsearch, Impala, Cassandra, Tableau, Informatica, Cloudera, Oracle 10g, Linux
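
A minimal sketch of the Kafka-to-Spark-Streaming-to-HDFS pattern referenced in the bullets above, assuming the spark-streaming-kafka-0-10 integration; the broker address, topic name, consumer group, batch interval, and output path are placeholders, not details from the engagement.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object KafkaToHdfsSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-hdfs-sketch"), Seconds(30))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "broker1:9092",            // placeholder broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "events-consumer",          // placeholder consumer group
          "auto.offset.reset"  -> "latest"
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

        // Persist each non-empty micro-batch to HDFS, keyed by batch time.
        stream.map(_.value).foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty()) {
            rdd.saveAsTextFile(s"hdfs:///landing/events/batch-${time.milliseconds}")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }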

Data Engineer

PayPal
Palo Alto, CA
06.2018 - 05.2019
  • Analyzed the Hadoop cluster and different big data analytic tools, including Hive and Sqoop
  • Developed simple to complex MapReduce jobs using Hive
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms
  • Worked with Senior Engineer on configuring Kafka for streaming data
  • Responsible for building scalable distributed data solutions using Hadoop
  • Worked on a project to retrieve log messages by leveraging Spark Streaming
  • Designed Oozie jobs for the automated processing of similar data
  • Collected data using Spark Streaming
  • Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior
  • Extensively used FORALL and BULK COLLECT to fetch large volumes of data from tables
  • Installed Oozie workflow engine to run multiple Hive and Pig jobs
  • Developed Pig scripts in areas where extensive coding needed to be reduced
  • Worked with Spark Streaming to ingest data into the Spark engine
  • Performed transformations, cleaning and filtering on imported data using Hive, Map Reduce, and loaded final data into HDFS
  • Handled importing of data from various data sources using Sqoop, performed transformations using Hive, MapReduce, and loaded data into HDFS
  • Created HBase tables to store various data formats of PII data coming from different portfolios
  • Configured Sqoop and developed scripts to extract data from MySQL into HDFS
  • Hands-on experience productionalizing Hadoop applications: administration, configuration management, monitoring, debugging, and performance tuning
  • Worked on analyzing the Hadoop cluster and various big data analytic tools, including Pig, HBase, and Sqoop
  • Processed data using Spark
  • Developed Spark scripts in Scala implementing custom RDD transformations and performing actions on the resulting RDDs (see the sketch after this role's bullet list)
  • Parsed high-level design specification to simple ETL coding and mapping standards
  • Provided cluster coordination services through Zookeeper
  • Created Pig Latin scripts to sort, group, join, and filter enterprise-wide data
  • Developed complex Talend job mappings to load data from various sources using different components
  • Designed, developed, and implemented solutions using Talend Integration Suite
  • Partitioned data streams using Kafka
  • Designed and configured a Kafka cluster to accommodate heavy throughput of 1 million messages per second
  • Used Kafka producer 0.8.3 APIs to produce messages
  • Built big data solutions using HBase, handling millions of records across different data trends, and exported the results to Hive
  • Developed scripts in Hive to perform transformations on the data and load to target systems for use by the data analysts for reporting
  • Tested the data coming from the source before processing
  • Became familiar with automated monitoring tools such as Nagios
  • Used Oozie as the workflow engine and Falcon for job scheduling
  • Debugged technical issues and resolved errors
  • Used Apache Kafka for collecting, aggregating, and moving large amounts of data from application servers
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data
  • As part of a POC, set up Amazon Web Services (AWS) to evaluate whether Hadoop was a feasible solution
  • Analyzed user requirements, designed and developed ETL processes to load enterprise data into the Data Warehouse.
  • Identified key use cases and associated reference architectures for market segments and industry verticals.
  • Wrote and coded logical and physical database descriptions, specifying identifiers of database to management systems.
  • Collected, outlined and refined requirements, led design processes and oversaw project progress.
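
A minimal sketch of the custom-RDD-transformation work referenced in the Spark bullet above; the pipe-delimited record layout and the HDFS paths are assumptions for illustration, not artifacts from the project.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddTransformSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-transform-sketch"))

        // Assumed pipe-delimited input: userId|eventType|durationMs
        val raw = sc.textFile("hdfs:///raw/events/*.txt")

        // Transformations are lazy; nothing executes until an action is invoked.
        val totalsByType = raw
          .map(_.split('|'))
          .filter(_.length == 3)
          .map(fields => (fields(1), fields(2).toLong))   // (eventType, durationMs)
          .reduceByKey(_ + _)                             // total duration per event type

        // Actions trigger the computation.
        totalsByType.saveAsTextFile("hdfs:///curated/event_durations")
        println(s"Event types seen: ${totalsByType.count()}")

        sc.stop()
      }
    }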

Data Engineer

Tapestry
San Francisco
01.2018 - 06.2018
  • Launched Amazon EC2 cloud instances using Amazon Web Services (Linux/Ubuntu/RHEL) and configured the launched instances for specific applications
  • Installed application on AWS EC2 instances and configured the storage on S3 buckets
  • Created S3 buckets and bucket policies, worked on IAM role-based policies, and customized the JSON policy templates
  • Implemented and maintained monitoring and alerting for production and corporate servers/storage using AWS CloudWatch
  • Managed server instances on the Amazon Web Services (AWS) platform using Puppet and Chef configuration management
  • Developed Pig scripts to transform raw data into intelligent data as specified by business users
  • Worked in AWS environment for development and deployment of Custom Hadoop Applications
  • Worked closely with the data modelers to model the new incoming data sets
  • Involved in the end-to-end process of Hadoop jobs that used technologies such as Sqoop, Pig, Hive, MapReduce, Spark, and shell scripts (for scheduling of a few jobs)
  • Expertise in designing and deploying Hadoop clusters and various Big Data analytic tools, including Pig, Hive, Oozie, Zookeeper, Sqoop, Flume, Spark, Impala, and Cassandra with the Hortonworks Distribution
  • Involved in creating Hive tables and Pig tables, loading data, and writing Hive queries and Pig scripts (see the sketch after this role's bullet list)
  • Assisted in upgrading, configuration and maintenance of various Hadoop infrastructures like Pig, Hive, and HBase
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters
  • Performed transformations, cleaning and filtering on imported data using Hive, Map Reduce, and loaded final data into HDFS
  • Worked on tuning Hive and Pig to improve performance and resolve performance-related issues in Hive and Pig scripts, with a good understanding of joins, grouping, and aggregation and how they map to MapReduce jobs
  • Imported data from different sources such as HDFS and HBase into Spark RDDs
  • Developed a data pipeline using Kafka and Storm to store data into HDFS
  • Performed real time analysis on the incoming data
  • Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing
  • Implemented Spark applications using Scala for faster testing and processing of data
  • Environment: Apache Hadoop, HDFS, MapReduce, Sqoop, Flume, Pig, Hive, HBase, Oozie, Scala, Spark, Linux
  • Optimized existing queries to improve query performance by creating indexes on tables.
  • Managed the development and implementation of infrastructure automation solutions using AWS technologies such as EC2, S3, ECS, EKS, Lambda.
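
A minimal sketch of creating and loading a partitioned Hive table through Spark SQL, as referenced in the Hive-tables bullet above; the database, table, and column names are hypothetical, and the staging table is assumed to already exist.

    import org.apache.spark.sql.SparkSession

    object HivePartitionSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-partition-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Partitioned target table (hypothetical schema).
        spark.sql(
          """CREATE TABLE IF NOT EXISTS analytics.page_views (
            |  user_id STRING,
            |  url     STRING,
            |  views   BIGINT)
            |PARTITIONED BY (view_date STRING)
            |STORED AS PARQUET""".stripMargin)

        // Dynamic-partition load from a staging table assumed to already exist.
        spark.sql("SET hive.exec.dynamic.partition=true")
        spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
        spark.sql(
          """INSERT OVERWRITE TABLE analytics.page_views PARTITION (view_date)
            |SELECT user_id, url, views, view_date FROM staging.page_views_raw""".stripMargin)

        spark.stop()
      }
    }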

Education

Bachelor of Science - Information Technology

New York City College of Technology of The City University of New York

Skills

  • Requirements Specifications
  • IBM DB2
  • Apache Hive
  • New Project Development
  • Apache Hadoop
  • Apache Spark
  • Sqoop
  • Data Lakes
  • Software Development Methodologies
  • User Profile
  • Software Solutions
  • Data Pipeline Design
  • Real-time Analytics
  • Data Migration

Personal Information

Citizenship: US Citizen

Timeline

Data Engineer

Forbes
05.2019 - Current

Data Engineer

PayPal
06.2018 - 05.2019

Data Engineer

Tapestry
01.2018 - 06.2018

Bachelor of Science - Information Technology

New York City College of Technology of The City University of New York