Yu Shao

Indianapolis, IN

Summary

  • Focused data science professional well-versed in identifying strategic opportunities to improve decision making across industries. Works closely with business teams, IT departments and other data teams to solve real-world problems for the enterprise. Expert R/Python/SQL/Tableau user.
  • Six years of hands-on experience with multiple machine learning models (linear models, decision trees, SVM, EM, neural networks, NLP and various ensemble methods) in Python and R (Scikit-Learn and caret). Experienced with data visualization and app creation in R/Python, and with putting services and models into production using Python and Docker.
  • Green card holder

Overview

7 years of professional experience

Work History

Senior Data Scientist

GEICO
Indianapolis, IN
07.2024 - Current
  • Developed a real-time intent detection (RIDE) model for customer service channels, including chat and call centers. Unlabeled data were clustered and labeled with GPT-4, then used to fine-tune a pre-trained BERT model, achieving 90% accuracy/F1 score. The model was subsequently registered and deployed as a real-time service. Currently working with a two-pizza team to further improve the model and adapt it to business needs. The end goal is to integrate the model into an LLM-powered service that requires minimal effort from live agents.

Data Scientist II

GEICO
Indianapolis, IN
09.2022 - 07.2024
  • Used deep Q-learning/reinforcement learning to develop a personalized customer experience recommender called next best action (NBA). The model recommended the best coverage and action items to customers based on customer type and customer journey stage. Deployed the model through Azure Machine Learning. Conducted A/B testing to evaluate the model's performance. Projected net profit of $136,300,000.
  • Built a named entity recognition model based on the large language model BERT. The training data was synthesized with ChatGPT. Achieved an average accuracy of 85% and an F1 score of 86%. The model was intended for use in the new GEICO chatbot and customer service platform.

Data Scientist

DealMachine
Indianapolis, IN
07.2021 - 08.2022
  • Utilized various statistical models to help the company determine its conversion, churn and upgrade triggers as well as the North Star Metric.
  • Utilized XGBoost to predict user churn within a 4-8 week window with 83% accuracy.
  • Utilized a machine learning model (CART) to predict user upgrades to the elite tier with 95% accuracy.
  • Performed cohort analysis to help the leadership team better understand changes in the churn rate.
  • Developed and maintained ten dashboards to help the leadership team closely monitor product, marketing, financial and customer service status.

Data Scientist

FSSA, The State Of Indiana
Indianapolis, IN
05.2019 - 07.2021
  • Developed machine learning models (e.g. provider fraud, inter-departmental record linkage, Covid contact tracing) on cloud-based infrastructure to facilitate continuous integration/continuous deployment.
  • Promoted mutual understanding and awareness between the business and technical sides of the enterprise through web apps, dashboards, executive reports and data palooza. Tailored models and solutions to meet stakeholder needs.

Jr. Data Scientist

FSSA, The State Of Indiana
Indianapolis, IN
10.2018 - 05.2019
  • Extracted Medicaid and other public health data from the data warehouse, census and social media to give stakeholders an overall picture of the Medicaid population in Indiana.
  • Ran descriptive and inferential statistics to help stakeholders gain insight into public health trends such as the opioid crisis.
  • Fitted machine learning models (e.g. NLP, ARIMA, random forest, neural networks) to predict upcoming trends in issues such as Medicaid cost and potential Alzheimer's disease patients, so that necessary legislative action could be taken in advance.

Data Analyst Intern

Covance
Indianapolis, IN
05.2018 - 08.2018

  • Developed an R Shiny app to help physicians interactively visualize sample cancellation data (over 270 million observations). The app identified sites in need of targeted training and tracked their progress towards improvement, increasing site and sponsor interactions and reducing cancellation rates.

Education

Master of Science - Biostatistics

Indiana University
2018

Ph.D. - Medical and Molecular Genetics

Indiana University
2017

Bachelor of Science - Biology

Nanyang Technological University
2009

Skills

  • Python (NumPy, Pandas, Scikit-Learn, LangChain)
  • Machine Learning (LLM, reinforcement learning)
  • A/B testing
  • Effective communication
  • SQL and NoSQL databases
  • Azure Machine Learning/model deployment
  • R programming (caret, shiny, etc.)
  • Prompt engineering
  • Tableau
  • Git

Selected Project Experience

1. Project Title: Next best action

Project Description: Built the next best action recommender based on reinforcement learning which personalized the customer experience and policy coverage options. The goal was to enhance profitability while improving customer satisfaction and retention.

2. Project Title: Named entity recognition

Project Description: Built a named entity recognition model based on BERT. The model was integrated into GEICO's new chatbot and customer service platform. It would also facilitate post-call analysis to better understand customers' intentions and needs.

3. Project Title: Covid contact tracing

Project Description: Used NLP techniques to extract key relations from a survey of Covid-19 patients.

4. Project Title: Social determinants of health (SDOH) and clinical outcomes

Project Description: Used a tree-based model to predict which respondents, based on their answers to an SDOH survey, are more prone to certain clinical outcomes (e.g. substance abuse disorder, preterm labor and emergency department service), so that stakeholders can take preventive action in advance.

5. Project Title: Provider Fraud Detection

Project Description: Used unsupervised learning (local outlier factor) to assign a risk score to each healthcare provider participating in Medicaid, based on overall claim costs, pharmacy costs, specialties and so on.

Affiliations

  • Coleridge Initiative Data Science Program
  • Golden Key International Honor Society
