Sam K Shahi

Dallas, TX

Summary

  • Experienced Software Engineer with over 8 years of hands-on expertise in Scala, Apache Spark, and Databricks, specializing in large-scale data engineering.
  • Designed and deployed real-time and batch ETL pipelines across finance, aviation, and cybersecurity domains.
  • Proficient in Spark Structured Streaming, Delta Lake, and Spark SQL for building fault-tolerant and high-throughput workflows.
  • Developed and operationalized machine learning pipelines using Spark MLlib, MLflow, and feature engineering strategies.
  • Integrated GenAI use cases with Spark-parsed inputs and RAG pipelines for financial document summarization and querying.
  • Strong background in AWS cloud-native architectures, leveraging Lambda, EventBridge, and Kinesis for event-driven design.
  • Implemented behavioral authentication models at BlackBerry using weight-based scoring and Squeezer algorithms in Spark.
  • Optimized Spark jobs using advanced partitioning, caching, and tuning techniques, reducing runtimes by up to 60%.
  • Skilled in developing secure backend APIs using Java, Spring Boot, and GraphQL to support analytical and ML applications.
  • Recognized guest speaker and mentor at UT Dallas, guiding students on real-world data science applications in enterprise systems.

Overview

9 years of professional experience

Work History

Software Engineer – Data Science & Engineering

PNC Bank
Dallas, TX
12.2024 - Current
  • Developed scalable ETL and ML feature pipelines using Scala, Spark, and Databricks to support fraud detection and credit modeling systems.
  • Built Spark-based RAG data preprocessing pipelines to support retrieval-grounded GPT workflows, enriching financial document search capabilities.
  • Partnered with ML and AI teams to deploy Spark MLlib models for transaction classification, integrating outputs with downstream Delta Lake stores.
  • Tuned Spark workloads for 10x faster execution by optimizing joins, partitioning strategies, and shuffle behavior.
  • Used MLflow to track experiments and register models, automating integration into real-time scoring APIs.
  • Developed and orchestrated pipelines with Databricks Jobs and Airflow, ensuring SLA-bound data delivery and visibility.
  • Created validation layers using Spark UDFs for cleansing and deduplicating multi-source financial data.
  • Automated GenAI chatbot integration for customer support scenarios by exposing GPT responses generated from Spark-parsed documents.
  • Deployed prompt-testing frameworks to measure accuracy, drift, and hallucination rates of GPT responses tied to Spark-processed datasets.
  • Built notebooks for ad hoc analytics and troubleshooting of model drift, using Spark SQL on Delta tables.

Software Engineer – Flight Planning

Southwest Airlines
Dallas, TX
02.2022 - 11.2024
  • Implemented event-driven systems for flight planning using AWS Lambda, EventBridge, SQS, and SNS to handle real-time flight operation workflows.
  • Processed and reacted to Deferred Maintenance Incident (DMI) events to adjust flight planning logic based on aircraft airworthiness constraints.
  • Developed functionality to automatically generate and submit flight plans to the FAA, ensuring alignment with compliance and routing protocols.
  • Integrated internal services with FlightKeys, enabling optimal route calculation and aircraft performance-based planning.
  • Built and enhanced backend microservices using Java and Spring Boot, supporting key services in flight planning pipelines.
  • Created and consumed REST and GraphQL APIs to enable secure and real-time communication between planning modules and operational systems.
  • Used Kafka and IBM MQ for asynchronous message passing and orchestration between microservices tied to aircraft events and route generation.
  • Contributed to infrastructure automation using CloudFormation, ensuring consistent provisioning and deployment across environments.
  • Developed interactive front-end tools using Angular, TypeScript, and RxJS to visualize and edit flight data and FAA submission status.
  • Enabled auditing and traceability of flight plans by integrating event metadata into storage and logging layers.
  • Participated in agile ceremonies, backlog grooming, and cross-team planning sessions to iterate on flight dispatch features and system improvements.
  • Performed functional testing and debugging of end-to-end flight planning scenarios, including edge cases like reroutes due to unresolved DMIs.

Software Engineer – Data Science

JPMorgan Chase & Co.
Plano, TX
02.2019 - 10.2021
  • Designed and built real-time and batch ETL pipelines using Apache Spark (Scala) to process billions of payment records daily across merchant services.
  • Engineered feature extraction and transformation pipelines within Databricks for fraud detection and transaction scoring models.
  • Integrated Delta Lake for unified batch and streaming workflows, enabling auditability, rollback, and ACID transactions for compliance datasets.
  • Partnered with data scientists to implement Spark MLlib pipelines for user behavior clustering and risk scoring, with experiments tracked in MLflow.
  • Created Spark-based processing layers for financial documents used in GenAI summarization experiments with OpenAI models.
  • Reduced pipeline runtime by over 60% through memory tuning, job parallelization, and caching of intermediate transformations.
  • Developed schema validation and outlier detection modules using Spark UDFs to ensure data quality before model scoring.
  • Built modular components for ingestion from Kafka, transformation in Spark, and output to Cassandra and Delta Lake tables on S3.
  • Authored Databricks notebooks and job templates for reusable, scalable ETL workflows with integrated lineage tracking.
  • Introduced RAG pipeline experiments using Spark-parsed summaries and vector stores to enable natural language access to high-volume payment logs.

Data Engineer / Scientist

BlackBerry
Irving, TX
05.2016 - 12.2018
  • Developed real-time behavioral authentication algorithms using Scala and Apache Spark to detect anomalous user access patterns across mobile devices.
  • Implemented a Squeezer algorithm-based authentication model leveraging user interaction vectors (typing speed, app usage rhythm, device tilt) to generate challenge scores.
  • Built weight-based challenge-response models in Spark MLlib to calculate probabilistic identity scores from biometric and device-based signals.
  • Designed Spark pipelines to process terabytes of sensor and event data for model training, feature selection, and scoring in Databricks.
  • Tuned model features using correlation filtering, PCA, and Spark UDFs for high-dimensional behavioral vectors.
  • Integrated anomaly scores with BlackBerry's security engine to trigger adaptive authentication or escalation workflows.
  • Created an end-to-end Databricks MLflow pipeline to track authentication model versions, accuracy trends, and feature evolution.
  • Conducted statistical validation and model drift analysis using Spark SQL and Databricks visualizations.
  • Collaborated with mobile app teams to embed lightweight scoring agents and feedback loops for continuous model updates.
  • Reduced false positives by 40% over traditional rules-based systems by introducing streaming model inputs and feedback loops.

Education

Bachelor of Science - Computer Science

The University of Texas at Dallas
Richardson, TX
12.2018

Skills

  • Big Data & ETL: Apache Spark, Databricks, Delta Lake, Spark SQL, Apache Kafka, Airflow, Hive, Cassandra
  • Data Engineering: ETL Pipelines, Data Ingestion, Data Transformation, Feature Engineering, Streaming & Batch Jobs
  • DevOps & CI/CD: GitLab CI/CD, Jenkins, Git, Docker, CloudFormation (IaC)
  • Data Formats: JSON, Parquet, Avro, ORC, CSV
  • Workflow Orchestration: Apache Airflow, Databricks Workflows
  • Project Methodologies: Agile (Scrum), Test-Driven Development (TDD), Behavior-Driven Development (BDD)
  • Programming Languages: Scala, Java, Python, TypeScript, SQL, JavaScript
  • Cloud Platforms: AWS (Lambda, S3, EventBridge, Kinesis, SQS, SNS, CloudFormation), Databricks (Workspace, Jobs)
  • Machine Learning Tools: MLlib, MLflow, Feature Stores, Model Deployment Pipelines
  • Backend Frameworks: Spring Boot, REST APIs, GraphQL, Microservices, Kafka Streams
  • Frontend Technologies: Angular, React, HTML5, CSS3
  • Databases: PostgreSQL, MySQL, Cassandra, MongoDB
  • Version Control & Tools: Git, GitLab, GitHub, Bitbucket
  • Security & Compliance: OAuth2, Spring Security, Audit Logging
  • Infrastructure Automation: AWS CloudFormation, Docker

Accomplishments

  • 2018 Fall Dean's List - Erik Jonsson School of Engineering and Computer Science Jul 2018
  • For the fall 2017 semester, 1,579 undergraduate students made the dean's list at The University of Texas at Dallas. The dean's list is published by the University's Office of Undergraduate Education at the conclusion of each fall and spring semester. It contains the names of students who completed at least 12 credit hours during the semester with a grade-point average among the top 10 percent of all students within their respective schools.
  • Guest Speaker – Machine Learning, Special Topics in Computer Science, Erik Jonsson School of Engineering and Computer Science, UT Dallas, June 2018
  • Invited by faculty to guest lecture and mentor students on real-world machine learning applications as part of a special topics course.
  • Shared hands-on experience as a Data Scientist at BlackBerry, covering behavioral authentication models, Spark-based pipelines, and ML deployment workflows.
  • Assisted the professor in guiding students through industry use cases, technical challenges, and career pathways in data science and AI.
  • Actively mentored students during Q&A and follow-up sessions, focusing on applied Scala, Spark, and Databricks workflows in cybersecurity contexts.

Work Authorization

US Citizen

Languages

English: Professional
Nepali: Professional

References

References available upon request.
