Khatija Begum

Worcester, MA

Summary

Data Scientist with 5+ years of experience delivering end-to-end AI/ML solutions across healthcare, financial services, and ed-tech domains. Skilled in cloud-native model development (AWS, Azure), MLOps, NLP, and statistical analysis. Proven ability to deploy scalable machine learning pipelines, automate workflows, and drive data-driven decision-making using advanced analytics and agile methodologies.

Overview

6 years of professional experience

Work History

Data Scientist

CVS Pharmacy
10.2024 - Current
  • Designed and deployed a scalable Healthcare Knowledge Assistant using Azure-based LLMs (OpenAI GPT-4, Bison), improving document search accuracy by 25%.
  • Developed and trained ML models (Random Forest, XGBoost, LSTM) on FHIR patient data for risk prediction and time series forecasting, achieving a 20% improvement in prediction accuracy.
  • Applied NLP (spaCy, NLTK, TextBlob) on medical documents for sensitive entity extraction and sentiment classification, enhancing patient privacy compliance.
  • Built and automated ML pipelines using Azure ML, Flask APIs, Terraform, Jenkins, and GitLab CI/CD, maintaining 99% uptime in production.
  • Conducted feature engineering and PCA to optimize model performance on large-scale healthcare datasets.
  • Implemented Retrieval-Augmented Generation (RAG) techniques using Elasticsearch and Cognitive Search to boost response relevance in healthcare Q&A.
  • Led integration of RESTful APIs and IAM roles to support secure, real-time data access in HIPAA-compliant environments.
  • Documented technical workflows and maintained project tracking using Git, Bitbucket, Confluence, and Jira under Agile/Scrum.

Environment: Python, Pandas, NumPy, Scikit-learn, TensorFlow, Keras, Azure ML Studio, Azure Functions, Azure Blob Storage, Terraform, Flask, Jenkins, GitLab, Jira, spaCy, NLTK, TextBlob, Power BI, REST APIs

Data Scientist

RBL Bank
07.2022 - 07.2023
  • Developed automated credit risk and churn prediction models using Python (Random Forest, Logistic Regression), improving model accuracy by 18%.
  • Created financial dashboards with Tableau and SQL, enabling 30% faster reporting for risk teams.
  • Led ETL development using Informatica PowerCenter (SCD Types 1 & 2, CDC) for reliable data warehousing and history tracking.
  • Built and optimized data models, performed data validation, and coordinated project rollouts using Agile and cross-functional collaboration.
  • Conducted hypothesis testing (A/B, ANOVA) and power analysis to validate customer behavior trends.
  • Enhanced data reporting accuracy through automated validation pipelines integrated with Tableau and database queries.

Environment: Python, SQL, Informatica PowerCenter, Tableau, Oracle, AWS Redshift, Agile/SCRUM

Data Scientist

Unacademy
08.2021 - 06.2022
  • Built and deployed ML models for student churn prediction and behavior segmentation using Scikit-learn, improving student retention strategies.
  • Designed ETL pipelines with Azure Data Factory and Databricks to process large-scale user interaction logs.
  • Developed NLP models using NLTK and spaCy for sentiment classification and text analytics on student feedback.
  • Engineered features using PCA, scaling, and label encoding to prepare clean, high-quality training datasets.
  • Delivered actionable insights through interactive dashboards (Tableau, ggplot2) and advanced statistical analysis.

Environment: Python, Pandas, Scikit-learn, Azure Data Factory, Databricks, NLTK, spaCy, Tableau, SQL, R

Machine Learning Engineer

Care Hospitals, Banjara Hills
02.2019 - 07.2021
  • Predicted patient readmission risks by training ML models (Logistic Regression, SVM, Gradient Boosting) on EMR datasets with >12M records.
  • Built data preprocessing pipelines with Azure Data Factory and performed advanced EDA using Pandas, Matplotlib, and Seaborn.
  • Deployed ML models using Azure ML Studio and Azure Functions for real-time inference, reducing manual triage effort by 30%.
  • Applied PCA and feature scaling techniques to reduce dimensionality and improve model training performance.
  • Integrated model APIs into hospital systems and maintained secure data access through Azure SQL and IAM best practices.

Environment: Azure ML Studio, Azure Data Factory, Azure Functions, Python (Pandas, Scikit-learn, Matplotlib), Tableau, Azure SQL, EMR data

Education

Master of Science - Data Analytics

Clark University
Worcester, MA
05-2025

Master of Science - Applied Statistics

Osmania University
Hyderabad, India
06-2021

Skills

  • Programming Languages & Tools: Python (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), R (RStudio, ggplot2), SQL, HTML, CSS, JavaScript
  • Databases, Data Warehousing & Big Data: SQL, MySQL, PostgreSQL, MongoDB, MS SQL Server, AWS Redshift, Snowflake, Hadoop, Spark, HiveQL, HDFS
  • Cloud Platforms & Services: AWS: EC2, S3, Redshift, EMR, Lambda, SageMaker, Snowflake; Azure: ML Studio, Data Factory, Databricks, Synapse Analytics, Azure Kubernetes Service (AKS)
  • Machine Learning & AI: Linear & Logistic Regression, Naïve Bayes, Decision Trees, Random Forest, SVM, K-Means, KNN, XGBoost, AdaBoost, PCA, LDA, Clustering, Reinforcement Learning, Bayesian Deep Learning
  • Deep Learning: Artificial Neural Networks (ANN), CNN, RNN, DNN, LSTM, TensorFlow, Keras, PyTorch, Deep Learning on AWS
  • Natural Language Processing (NLP): Tokenization, Lemmatization, POS Tagging, Markov Models, WordNet, ConceptNet
  • Statistical Analysis & Data Science: Hypothesis Testing, ANOVA, PCA, Time Series Analysis, Chi-Square, Multivariate & Covariance Analysis, Correlation, Bayesian Inference, Data Cleaning, Wrangling, Transformation
  • Data Visualization & Reporting: Tableau, Power BI, QlikView, Plotly, Dash, Matplotlib, Seaborn, ggplot2
  • Development Tools & DevOps: IDEs: Visual Studio Code, Anaconda, Jupyter Notebooks; DevOps: Git, GitHub, GitLab, Docker, CI/CD Pipelines; Methodologies: Agile, Scrum
  • Soft Skills: Analytical Thinking, Critical Problem-Solving, Strategic Decision-Making, Team Leadership, Time Management, Continuous Learning
