Senior computational data scientist with expertise in Python and SQL, specializing in the design and implementation of complex data pipelines. Demonstrated success in analyzing intricate datasets and troubleshooting workflows, contributing to impactful projects in multimodal deep learning and sentiment analysis.
Overview
1
year of professional experience
Work History
Multimodal Pet Photo Engagement Prediction
Pennsylvania State University
State College
01.2026 - 05.2026
Built a late-fusion multimodal deep learning pipeline in PyTorch combining a ResNet-18 image encoder (512-dim) with a metadata MLP (16-dim), achieving 20.09 RMSE on a 0–100 scale, competitive with Kaggle leaderboard submissions (17.0–20.0 RMSE)
Conducted Pearson correlation analysis across 12 metadata features, finding near-zero individual predictive power, which motivated a fusion architecture designed to capture non-linear feature interactions
Analyzed 9,912 out-of-fold predictions, uncovering RMSE degradation from 11.53 (scores 21–40) to 48.57 (scores 81–100) and pinpointing a spurious open-mouth correlation as a key failure mode
Migrated training pipeline from CPU to T4 GPU with mixed precision (bfloat16 AMP), reducing runtime from 7 hours to 1.5 hours (4.7x speedup)
Used AI tools (Claude, ChatGPT) to support debugging and documentation while retaining full ownership of modeling decisions
StockTwits Sentiment-Based Stock Prediction
Pennsylvania State University
State College
01.2026 - 05.2026
Built an end-to-end pipeline scraping ticker-specific StockTwits posts and aggregating them into 5-minute rolling windows for intraday stock direction prediction
Engineered sentiment and attention features including net sentiment index, bullish share, sentiment momentum, message density, and abnormal density
Merged sentiment features with intraday price data, training a logistic regression classifier that achieved ~57% accuracy and AUC ~0.563 for predicting stock direction
Researched LLM-based sentiment classification with FinBERT, proposing it as a pipeline improvement to enhance accuracy over noisy self-tagged labels
Transformed 1.5M+ NFL play-by-play records using Python (PySpark) on Apache Spark in Jupyter Notebook, building data pipelines that isolated ~70–90K valid fourth-down scenarios for targeted analysis
Built scalable ETL and feature engineering workflows, generating labeled outcomes (GO, PUNT, FG) from game data for downstream analysis and modeling
Conducted exploratory data analysis (EDA) and developed performance metrics to assess the effectiveness of fourth-down decisions across game contexts
Reduced storage footprint by converting a 4 GB CSV dataset to 1.5 GB Parquet, and ran distributed jobs on Penn State’s ICDS HPC cluster
Presented technical findings through reports and presentations, using AI tools (ChatGPT) to speed up development
Course Project: Used Car Price Prediction at Teqanny Training – Machine Learning Program
Heart Disease Prediction Using Machine Learning at Birmingham City University