Han Wu

Sharon, USA

Summary

Data Engineer with a proven track record at Altria Group, Inc., where I led development of the Audience Selection Tool using Azure Databricks, Python, and PySpark. Experienced in building ETL pipelines and data transformations, with multi-threading work that improved job efficiency by 100%. Led Agile project teams collaboratively, combining leadership with technical depth in the Python and Azure ecosystems.

Overview

4 years of professional experience
1 Certification

Work History

Data engineer, Contractor

Altria Group, Inc.
Remote
10.2021 - Current
  • Developed Altria’s Audience Selection Tool (AST), a system that manages multi-channel marketing campaigns within Azure Data Lake, in Azure Databricks (ADB) notebooks using Python, PySpark, and Spark SQL
  • Implemented ETL and data movement with Azure Databricks and Azure Data Factory pipelines, targeting cloud storage including Azure Blob, Azure Data Lake Storage (ADLS), Hive tables, and Unity Catalog
  • Designed and implemented complex batch data integration in Databricks notebooks using Python, PySpark, and Spark SQL to read, clean, transform, and write structured and semi-structured data, delivering silver tables for data scientists and analysts
  • Introduced multi-threading (Python threading and concurrent.futures) to the AST backend workflow, making the total job run about 2 times faster, and built a monitoring process with exception handling across threads
  • Reworked JSON file writes in Python scripts to use the .dumps method, cutting file-write time to less than half and removing a performance bottleneck inside Python loops
  • Performed complex data transformations for data enrichment: extracted structured data from semi-structured sources in various cloud databases (customers, addresses, product features) and joined it with additional conditions or aggregations using better-performing PySpark built-in functions such as filter, when, regexp_extract, and inline; converted data to RDDs to apply UDFs via rdd.map
  • Worked with ADF teams to monitor AST job performance, and tuned or refactored ADB scripts to speed up the jobs
  • Built a solution to monitor latency issues on the job cluster running the AST batch processing pipeline, using Python logging and multi-threading so that the process writes logs to cloud ABFS in a timely manner
  • Extensive experience building ETL pipelines using Python and related libraries (pandas, NumPy, threading, joblib, pgeocode, logging)
  • Experienced in writing Spark RDD and DataFrame transformations and actions, and writing results to Blob stores
  • Experienced in processing semi-structured data, including CSV, Parquet, and JSON files, using PySpark
  • Worked with business analytics to design and build golden tables or views by filtering and aggregating tables using PySpark and Spark SQL in Databricks
  • Achieved performance tuning and optimization by managing indices, partitioning tables, and optimizing Python scripts
  • Experienced in using Azure Data Lake Storage Gen2 to store CSV and Parquet files and retrieving data via the Blob API
  • Utilized Azure DevOps and Git for CI/CD pipelines to track code version changes and enable more effective collaboration
  • Worked in an Agile development environment with 3-week sprint cycles, dividing, organizing, and discussing tasks for project implementation
  • Led end-to-end troubleshooting of the inbound and outbound processes, ensuring smooth operations and addressing production issues promptly
  • Contributed to the development process through comprehensive code reviews, ensuring adherence to best practices and promoting collaborative improvements
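
As an illustration of the multi-threading approach listed above (concurrent.futures with error handling on each thread), a minimal sketch might look like the following. The names run_in_threads and worker and the logger name are hypothetical and are not taken from the AST codebase; this is an assumption-based example, not the actual implementation.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("ast_jobs")


def run_in_threads(tasks, worker, max_workers=4):
    """Run worker(task) for every task in a thread pool.

    Returns (results, errors), two dicts keyed by task, so that one
    failing task is recorded without stopping the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the task that produced it.
        futures = {pool.submit(worker, task): task for task in tasks}
        for future in as_completed(futures):
            task = futures[future]
            try:
                results[task] = future.result()
            except Exception as exc:
                # Log the failure and keep processing the remaining tasks.
                logger.error("task %r failed: %s", task, exc)
                errors[task] = exc
    return results, errors
```

Collecting exceptions per task, rather than letting the first one propagate, is one way to keep a batch running while still surfacing every failure to a monitoring step.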

Student Tutor

Bentley University
Waltham, USA
03.2021 - 05.2021
  • Tutored the Quantitative Analysis for Business course in Graduate Student Academic Service
  • Answered questions and helped students understand lecture and exam material, including linear regression models, complex regressors, ANOVA, variable transformations, variable selection, and assumption checking

Education

Master of Science - Data Analytics (STEM)

Bentley University
Waltham, MA, USA
05.2021

Bachelor of Accounting

Macau University of Science and Technology
Macau, China
06.2011

Skills

  • Python
  • PySpark
  • T-SQL
  • MySQL
  • Azure Databricks
  • Azure DevOps
  • Git
  • Visual Studio
  • Visual Studio Code
  • Blob
  • ADLS Gen 2
  • Delta Lake
  • Data Factory
  • Databricks
  • SQL Server
  • Synapse Analytics
  • Event Hubs
  • Stream Analytics
  • Monitor
  • Tableau
  • Power BI

Certification

(to be added)

Timeline

Data engineer, Contractor

Altria Group, Inc.
10.2021 - Current

Student Tutor

Bentley University
03.2021 - 05.2021

Master of Science - Data Analytics (STEM)

Bentley University

Bachelor of Accounting

Macau University of Science and Technology