
Md Jahid Hasan Bhuyian

Queens, NY

Summary

Data Engineer with 9 years of experience designing and implementing big data solutions. Proven track record of quickly adapting to new technologies and cloud platforms, including AWS and Azure, to manage and process large-scale data sets. Skilled in Python, SQL, and Java for building flexible, scalable data pipelines, and experienced with big data technologies such as Hadoop, Spark, and Hive for real-time data processing and analytics. Strong understanding of data warehousing concepts and data governance best practices. Eager to learn and take on new challenges, and committed to delivering high-quality, impactful data solutions that drive business success.

Overview

9 years of professional experience

Work History

Senior Data Engineer

MetLife
01.2022 - Current
  • Designed and implemented highly scalable data processing pipelines on the Azure cloud platform, utilizing Azure Data Factory, Stream Analytics, and Databricks
  • Migrated and integrated data from various sources, including databases, APIs, and file systems, into Azure data storage such as SQL Database, Cosmos DB, and Blob Storage
  • Implemented a real-time pipeline for processing incoming IoT device data using Azure Event Hubs and Stream Analytics, ensuring timely and accurate data analysis
  • Developed predictive models using Azure Machine Learning and deployed them into production, resulting in a 15% increase in efficiency
  • Solid experience in advanced SQL, with a proven track record of designing and implementing complex queries, stored procedures, functions, and triggers in a fast-paced, data-intensive environment
  • Utilized advanced SQL techniques to optimize database performance and enhance data quality, resulting in improved business insights and operational efficiencies
  • Proficient in data visualization, created interactive dashboards using Power BI for business stakeholders, resulting in clearer understanding of business performance
  • Mounted Azure Data Lake containers to Databricks, creating service principals, access keys, and tokens to access the Azure Data Lake Gen2 storage account (see the sketch following this section)
  • Imported raw CSV and JSON files into Azure Data Lake Gen2, performing data ingestion by writing PySpark to extract the flat files
  • Built data transformations in PySpark on Databricks to rename, drop, clean, validate, and reformat data into Parquet files, then loaded them into an Azure Blob Storage container
  • Developed Azure linked services to connect on-premises Oracle Database, SQL Server, and Apache Hive with Azure datasets in the cloud
  • Built ETL data pipelines in Azure Data Factory (ADF) to manage and process more than 1B rows into Azure SQL DW
  • Configured Input & Output bindings of Azure Function with Azure Cosmos DB collection to read and write data from the container whenever the function executes
  • Connected Databricks notebooks with Airflow to schedule and monitor the ETL process
  • Trained NLP question-answering models using BERT transfer learning to answer domain questions and expedite the named-entity recognition process
  • Provided user management and support by administering epics, user stories, and tasks in Jira using Agile methodology; logged process flow documents in Confluence
  • Worked with a team of developers to develop and implement an API for optical character recognition and named entity extraction on medical records
  • Analyzed and extracted relevant information from medical forms to support medical summary reports, compliance, claims settlement, litigation, and predictive analysis
  • Utilized azure.cognitiveservices.vision.computervision, pytesseract, spacy, and pytextrank for NER and OCR technologies
  • Worked with Flask and Werkzeug for web development and REST API implementation
  • Deployed predictive models using the AzureML platform
  • Environment: Azure HDInsight, Databricks, Data Lake, Cosmos DB, MySQL, Azure SQL, Snowflake, Cassandra, Teradata, Ambari, PowerBI, Azure Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, YARN), Spark, Git, PySpark, Airflow, Hive, HBase, AzureML
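
A minimal sketch of the mount-and-ingest pattern above, assuming it runs in a Databricks notebook (where spark and dbutils are predefined) and that a secret scope named adls-scope holds the service-principal credentials; the storage account, tenant, paths, and column names are hypothetical placeholders:

    # Mount the ADLS Gen2 container through a service principal
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": dbutils.secrets.get("adls-scope", "sp-client-id"),
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("adls-scope", "sp-client-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }
    dbutils.fs.mount(
        source="abfss://raw@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/raw",
        extra_configs=configs,
    )

    from pyspark.sql import functions as F

    # Ingest raw CSV, apply rename/drop/clean/validate steps, land as Parquet
    df = spark.read.option("header", "true").csv("/mnt/raw/claims/*.csv")
    clean = (
        df.withColumnRenamed("clm_id", "claim_id")  # hypothetical columns
          .dropDuplicates()
          .na.drop(subset=["claim_id"])
          .filter(F.col("amount").cast("double") > 0)
    )
    clean.write.mode("overwrite").parquet("/mnt/curated/claims/")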

Data Engineer

Fannie Mae
04.2018 - 12.2021
  • Designed, built and managed data pipelines for real-time processing and batch processing using AWS Glue and Amazon Kinesis
  • Implemented data storage in S3 with appropriate partitioning and compression techniques to optimize data retrieval
  • Implemented a scalable data warehousing solution using Amazon Redshift
  • Designed and optimized Redshift tables and ensured high performance through appropriate distribution styles, sort keys, and compression encodings
  • Worked with large datasets using Apache Spark and Amazon EMR
  • Developed Spark scripts to perform complex data transformations and aggregations, and optimized Spark job performance for faster processing times
  • Built scalable NoSQL databases using Amazon DynamoDB, Amazon DocumentDB and Amazon Neptune to store and retrieve semi-structured and unstructured data
  • Developed serverless architectures using AWS Lambda and Amazon SNS to process real-time data and trigger event-driven workflows
  • Used Apache NiFi and AWS Glue to build custom data ingestion and transformation workflows
  • Automated data validation and quality checks using AWS Glue jobs and Apache NiFi processes
  • Implemented security measures to protect sensitive data stored in S3 and managed data access using Amazon IAM and Amazon VPC
  • Monitored pipeline performance and security events using Amazon CloudWatch and AWS CloudTrail
  • Strong SQL, data warehousing, and ETL experience on traditional databases
  • Advanced knowledge of Amazon Web Services and its major components (EC2, S3, RDS, VPC, IAM, etc.)
  • Worked on Amazon Web Services (AWS) technologies including EC2, S3, RDS, ELB, and Elasticache
  • Explored DAGs, their dependencies, and logs using Airflow pipelines for automation (a DAG sketch follows this section)
  • Tracked operations with Airflow sensors until specified criteria were met
  • Developed Spark scripts in Python as per requirements
  • Experience with creating dynamic dashboards, using parameters, filters, and calculated fields in Tableau and PowerBI
  • Experience with using PowerBI's data visualization options, such as charts, tables, and KPIs
  • Knowledge of Tableau's data blending and joining options, including blend and join calculations
  • Experience with creating and managing data sources in Tableau and PowerBI, including Excel spreadsheets and SQL databases
  • Migrated on-premise databases to the cloud using AWS DMS and Amazon EC2
  • Monitored the migration process, ensured data consistency and resolved any issues during migration
  • Deployed machine learning models using Amazon SageMaker and integrated them with existing data pipelines
  • Developed automated processes to retrain models on a regular basis and monitor model performance using Amazon CloudWatch
  • Environment: Hadoop 2.x, Hive, HDFS, Python, Spark, Sqoop, Oozie, AWS S3, Amazon Redshift, MySQL, PostgreSQL, Amazon EMR, AWS Glue
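
A minimal sketch of one such Airflow DAG, assuming Airflow 2.x with the Amazon provider package installed (operator import paths vary across provider versions); the bucket, key pattern, and Glue job name are hypothetical:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

    with DAG(
        dag_id="loans_s3_to_redshift",  # hypothetical pipeline name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        # Sensor polls S3 until the day's landing file appears
        wait_for_file = S3KeySensor(
            task_id="wait_for_landing_file",
            bucket_key="s3://landing-bucket/loans/{{ ds }}/*.csv",
            wildcard_match=True,
            poke_interval=300,    # check every 5 minutes
            timeout=6 * 60 * 60,  # give up after 6 hours
        )

        # Glue job transforms the landed file and loads it into Redshift
        run_glue_transform = GlueJobOperator(
            task_id="run_glue_transform",
            job_name="loans_nightly_transform",
        )

        wait_for_file >> run_glue_transform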

Data Engineer

Visa
03.2016 - 03.2018
  • Worked on analyzing Hadoop clusters and different big data analytic tools including Pig, Hive and Sqoop
  • Developed Spark scripts by using Scala shell commands as per the requirement
  • Created Spark jobs to see trends in data usage by users
  • Used Spark and Spark SQL to read Parquet data and create Hive tables using the Scala API
  • Loaded data pipelines from web servers and Teradata using Sqoop with Kafka and the Spark Streaming API (a streaming sketch follows this section)
  • Developed Kafka pub-sub, Cassandra clients and Spark along with components on HDFS and Hive
  • Populated HDFS and HBase with large volumes of data using Apache Kafka
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters
  • Developed Pig UDFs to pre-process the data for analysis
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig and HiveQL
  • Created Hive tables to store data and wrote Hive queries
  • Extracted the data from Teradata into HDFS using Sqoop
  • Exported the patterns analyzed back to Teradata using Sqoop
  • Involved in installing and configuring the Hadoop ecosystem and Cloudera Manager using the CDH4 distribution
  • Developed Spark code in Scala with Spark SQL for faster processing and testing
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python
  • Used Spark API over Hadoop YARN as execution engine for data analytics using Hive
  • Built data pipelines using Kafka and Akka to handle terabytes of data
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster
  • Developed Scala scripts to extract the data from the web server output files to load into HDFS
  • Designed and implemented MapReduce jobs to support distributed data processing
  • Processed large data sets on the Hadoop cluster
  • Designed NoSQL schemas in HBase
  • Developed MapReduce ETL in Python/Pig
  • Performed data validation using Hive
  • Importing and exporting the data using Sqoop from HDFS to Relational Database systems and vice-versa
  • Involved in weekly walkthrough and inspection meetings, to verify the status of the testing efforts and the project
  • Environment: Hadoop, HDFS, Pig, Sqoop, Shell Scripting, Ubuntu, Red Hat Linux, Spark, Scala, Hortonworks, Cloudera Manager, Apache YARN, Python, Machine Learning, NLP (Natural Language Processing)
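
A minimal sketch of the Kafka-to-HDFS ingestion path above, written with PySpark Structured Streaming rather than the Scala DStream API used on the project; broker addresses, the topic, and paths are hypothetical, and the spark-sql-kafka package must be on the classpath:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("weblog-kafka-ingest")  # hypothetical job name
        .enableHiveSupport()
        .getOrCreate()
    )

    # Subscribe to the web-log topic; Kafka values arrive as bytes
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "web-logs")
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(value AS STRING) AS line", "timestamp")
    )

    # Land the stream on HDFS as Parquet so Hive external tables can query it
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/raw/web_logs")
        .option("checkpointLocation", "hdfs:///checkpoints/web_logs")
        .trigger(processingTime="1 minute")
        .start()
    )
    query.awaitTermination()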

Education

Bachelor of Science - Computer Science

North South University
Dhaka, Bangladesh

Skills

  • Python, R, SQL, Java, Scala
  • Hadoop, Spark, Hive
  • AWS, Azure
  • Machine Learning
  • Tableau, PowerBI
  • Apache Airflow, GIT, Jenkins
  • Business Intelligence
  • Data Modeling
  • Data Analysis
  • Production Work
  • Agile methodologies: Scrum, Kanban

Timeline

Senior Data Engineer

MetLife
01.2022 - Current

Data Engineer

Fannie Mae
04.2018 - 12.2021

Data Engineer

Visa
03.2016 - 03.2018

Bachelor of Science - Computer Science

North South University