Sumithra Hariguruprasad

Morrisville, NC

Summary

Certified Data Scientist with dual Master's degrees in Informatics and Analytics and Applied Mathematics, complemented by a Bachelor's in Mathematics. Proficient in Python and R, specializing in data mining, visualization, and predictive analysis. Expertise in managing large datasets and solving complex business challenges through machine learning. Known for meticulous attention to detail and a passion for driving impactful insights in dynamic data science environments.

Overview

6 years of post-secondary education
4 Certifications

Work History

Data Science Intern

ISimcha Health
Durham, NC
03.2024 - 04.2024

Automated PDF to Text Conversion:

  • Developed a script using the `requests` and `fitz` (PyMuPDF) libraries to download PDFs and extract text from each page.
  • Implemented functionality to track page numbers and record the total page count in the output text file.

JSON Data Processing:

  • Created a function using the `json` module to load and parse JSON data into Python dictionaries.

CSS Styling for HTML Documents:

  • Defined CSS styles for `.container` and `.box` classes, enhancing layout with CSS Grid and visual styling for improved document presentation.

HTML Generation from JSON:

  • Developed a function to dynamically create HTML elements from JSON data, applying predefined CSS styles and structuring layout.

File Handling and Automation:

  • Handled file operations to read JSON data, generate HTML content, and save it to new files, with a confirmation message on each successful save.

Education

Master of Science - Informatics and Analytics

UNC
Greensboro, NC
09.2022 - 05.2024

Master of Science - Applied Mathematics

Pondicherry University
Pondicherry, India
07.1998 - 04.2000

Bachelor of Science - Mathematics (Honors)

Sri Sathya Sai Institute of Higher Learning
Anantapur, India
06.1994 - 03.1997

Skills

Programming Languages: Python, R, SQL

Libraries and Frameworks: Pandas, NumPy, SciPy, Statsmodels, Seaborn, Word Cloud, Matplotlib, NLP, Scikit-learn, TensorFlow, PySpark, BlueBERT, BRAT, Hugging Face Transformers

Tools: RStudio, Jupyter Notebook, Tableau, Shiny, Power BI, GitHub, MS Office, Google Colab, ChromaDB

Soft Skills: Team player, Fast learner, Open-minded, Interpersonal skills, Adaptability to change

Certification

Certification in Data Science, SCS, University of Toronto. Key courses: Foundations of Data Science, Statistics, Machine Learning, and Big Data.

Links

  • LinkedIn, http://www.linkedin.com/in/sumithra-hariguruprasad
  • GitHub, https://github.com/sumi19

Projects

Graduate Project (January 2024 - March 2024)


Improving Biomedical Literature Retrieval with User Search Logs 

  • Improved biomedical literature retrieval by using insights from user search logs to optimize query processing, applying NER and query analysis to raise retrieval accuracy.
  • Applied NLP techniques using pandas, scikit-learn, and TensorFlow to preprocess and analyze biomedical text data.
  • Developed and fine-tuned NER models (BlueBERT) for accurate entity recognition in queries.
  • Utilized vector databases to expand query contexts with semantically related terms, enhancing information retrieval.
  • Analyzed large-scale PubMed search logs to understand user behavior patterns and optimize retrieval systems.
  • Leveraged open-source repositories such as GitHub and Hugging Face for model development, fine-tuning, and evaluation.


Graduate Coursework Projects (September 2022 - December 2023)


COVID-19 Analysis Project

  • Conducted comprehensive analysis of COVID-19 cases and deaths using datasets from usafacts.org, Census Demographic ACS, and ourworldindata.org.
  • Employed methodologies including exploratory data analysis, hypothesis testing, and statistical modeling to understand pandemic dynamics.
  • Analyzed trends across geographic levels (county, state, country) to identify variations and potential influencing factors.
  • Uncovered critical patterns and correlations, highlighting the impact of the pandemic across diverse populations.
  • Utilized regression models to forecast trends in COVID-19 cases and deaths.
  • Contributed insights on societal and environmental factors influencing virus spread and the importance of targeted interventions for effective public health strategies.


Impact of Various Factors on COVID-19

  • Analyzed COVID-19 data from 'Our World in Data' to understand global pandemic trends.
  • Utilized logistic regression to assess transmission likelihood and random forest to identify influential factors.
  • Applied clustering techniques to categorize regions based on transmission patterns.
  • Discovered relationships between demographics, interventions, and COVID-19 spread, contributing valuable insights for public health strategies.


Agricultural Industry Analysis

  • Analyzed the agricultural industry focusing on products, demographics, farm economics, and environmental factors.
  • Used the full report of the 2017 Census of Agriculture, published by the USDA's National Agricultural Statistics Service (NASS).
  • Employed Tableau Prep for data cleaning and modification.
  • Explored various aspects of agriculture including economics, environmental factors, crop sales, demographics, and animal products.
  • Provided insights and visual representations through interactive Tableau dashboards.


Mortgage Lending Application

  • Designed a mock mortgage approval app, generating synthetic customer data (credit score, income, property details, monthly EMI) to support loan term predictions.
  • Developed an applied problem statement from the perspective of potential app users, emphasizing customer needs.
  • Designed a comprehensive plan addressing goals, data management strategies, and advanced data analysis methods.
  • Proposed integration with a Lender API for querying pre-approved loans and classifying risk profiles based on market research.
  • Delivered actionable results: predicted customer mortgage terms based on credit score, income, property type, and EMI, and classified loan offers into risk categories to guide customer decisions.


Certification Coursework Projects (September 2018 - December 2019)


Customer Churn Prediction

  • Applied Python and data science techniques to analyze customer churn data.
  • Developed predictive models including Naive Bayes and decision-tree classifiers.
  • Conducted feature engineering and model tuning to enhance accuracy.
  • Provided actionable insights for targeted retention strategies in telecommunications.


Heart Disease Prediction

  • Employed Python and statistical methodologies to analyze heart disease data.
  • Identified key risk factors through comprehensive data analysis.
  • Developed predictive models including Naive Bayes and decision-tree classifiers.
  • Ensured robustness and reliability of predictions through feature selection and evaluation.


Predicting Cuisines Based on Ingredients

  • Employed Python and machine learning techniques to analyze recipe data from Kaggle.
  • Utilized NLP techniques such as stemming, lemmatization, and feature engineering to preprocess and enrich text data for accurate modeling.
  • Trained classification models (SVC, decision tree, random forest, multinomial Naive Bayes) to predict cuisines based on ingredients, ensuring robust performance across diverse culinary datasets.
  • Conducted comprehensive model evaluation using metrics like accuracy, precision, recall, and F1-score to optimize predictive accuracy and validate the effectiveness of the machine learning approach.


Hospital Readmissions of Patients with Diabetes

  • Leveraged PySpark and Big Data technologies to extract and analyze clinical care data from the UCI Machine Learning Repository.
  • Applied regression, random forest, and gradient-boosting models to predict hospital readmissions for patients diagnosed with diabetes.
  • Conducted feature engineering and selection to optimize model performance and interpretability.
  • Contributed insights into factors influencing hospital readmissions, enhancing predictive accuracy for healthcare outcomes.


Timeline

Data Science Intern

ISimcha Health
03.2024 - 04.2024

Master of Science - Informatics and Analytics

UNC
09.2022 - 05.2024

Deep Learning Specialization, DeepLearning.AI, Coursera. Key courses: Neural Networks and Deep Learning, Sequence Models, Hyperparameter Tuning, Regularization and Optimization, Structuring Machine Learning Projects, Convolutional Neural Networks

07.2020

SQL for Data Science, UC Davis, Coursera.

06.2020

Machine Learning, Stanford University, Coursera.

05.2020

Certification in Data Science, SCS, University of Toronto. Key courses: Foundations of Data Science, Statistics, Machine Learning, and Big Data.

12.2019

Master of Science - Applied Mathematics

Pondicherry University
07.1998 - 04.2000

Bachelor of Science - Mathematics (Honors)

Sri Sathya Sai Institute of Higher Learning
06.1994 - 03.1997