THARUN REDDY

Summary

• Senior Data Scientist with 8+ years of experience building and deploying scalable machine learning solutions, specializing in cloud technologies, statistical analysis, and advanced machine learning algorithms that drive enterprise business intelligence and predictive analytics.
• Experienced in deploying machine learning models with Kubernetes and Docker, ensuring scalability, reliability, and seamless deployment across environments.
• Proficient in Python and R for building advanced machine learning models and performing in-depth data analysis, including predictive modeling, time-series forecasting, and anomaly detection.
• Skilled in data preprocessing and transformation with Pandas, NumPy, and SciPy, cleaning, aggregating, and enriching datasets to improve model accuracy and business insight.
• Adept at designing and implementing ETL pipelines with SQL, Apache Spark, and Hadoop, optimizing extraction, transformation, and loading for high-performance data workflows.
• Strong ability to apply deep learning frameworks such as TensorFlow, Keras, and PyTorch to develop and deploy models for NLP, image classification, and time-series analysis.
• Expertise in automating machine learning pipelines with Docker and Kubernetes, streamlining model deployment and ensuring robust performance under varying traffic loads.
• Proficient in designing scalable, cost-effective data architectures on AWS services such as S3, EC2, Redshift, and Glue, enabling high-performance data processing and storage for machine learning.
• Deep understanding of advanced statistical analysis with SciPy and R, performing regression, hypothesis testing, and probability distribution modeling to support data-driven decisions (a brief illustrative sketch follows this list).
• Proficient in scalable reporting solutions with Tableau and Power BI, delivering real-time business intelligence to non-technical stakeholders.
• Skilled in version control with Git, ensuring smooth collaboration, efficient change tracking, and model reproducibility across data science projects.
• Experienced in applying transfer learning to enhance model performance with pre-trained deep learning models across a range of applications.
• Proficient in building and optimizing machine learning models with XGBoost, Spark MLlib, and TensorFlow for scalable, high-accuracy predictive analytics.
• In-depth experience in SQL scripting for automating complex queries, streamlining reporting, data access, and retrieval of large datasets across departments.
• Expertise in time-series data manipulation with Pandas, improving forecasting accuracy through time-based aggregations, resampling, and model tuning.
• Skilled in Azure Machine Learning, Azure Synapse Analytics, and Azure Data Factory for building and deploying scalable machine learning models for real-time and batch predictions.
• Skilled in geospatial data analysis with SciPy, performing location-based analytics including clustering, distance calculations, and spatial relationships.
• Experienced in automating reporting workflows with Power BI, enabling self-service reporting and real-time data insights for business users.
• Led the integration of data science tools and machine learning models into production environments, promoting best practices in model versioning, documentation, and deployment.
• Trained and mentored teams in Python, R, SQL, and machine learning frameworks, improving team efficiency and knowledge sharing.
• Leveraged Databricks to orchestrate distributed ETL and machine learning pipelines, improving scalability, performance, and collaboration on large-scale datasets.
• Implemented MLflow within Databricks to manage model versioning, track experiments, and improve reproducibility across data science projects.
• Used distributed data processing and pipeline automation on Databricks for faster, large-scale data transformations, improving the efficiency of machine learning workflows.
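
To illustrate the hypothesis-testing workflow named in the statistics bullet above, here is a minimal sketch of a Welch's t-test with SciPy; the samples, sizes, and effect size are hypothetical stand-ins, not figures from any engagement described here.

    import numpy as np
    from scipy import stats

    # Synthetic stand-in samples; real inputs were business KPIs, so the
    # values and effect size below are illustrative assumptions only.
    rng = np.random.default_rng(1)
    control = rng.normal(loc=10.0, scale=2.0, size=500)
    variant = rng.normal(loc=10.3, scale=2.0, size=500)

    # Welch's t-test (unequal variances): is the difference in means significant?
    t_stat, p_value = stats.ttest_ind(variant, control, equal_var=False)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")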

Overview

8 years of professional experience

Work History

Senior Data Scientist/ML

Capital One
Plano, Texas
08.2022 - Current

• Applied machine learning techniques such as regression, classification, clustering, and deep learning to develop robust predictive models.
• Developed end-to-end machine learning pipelines in PyTorch and TensorFlow/Keras, from data preprocessing and augmentation to model training and evaluation, ensuring reproducibility and scalability.
• Leveraged Databricks to develop and orchestrate end-to-end ETL pipelines, enabling large-scale data processing and real-time analytics across multiple data sources.
• Implemented MLflow on Databricks to track experiments, manage model versions, and streamline collaboration across data science teams.
• Optimized Spark jobs in Databricks for faster processing of structured and unstructured datasets, improving model training times and reducing compute costs.
• Optimized data processing pipelines using Apache Spark, enabling the processing of large-scale datasets in parallel and drastically reducing computation time for complex analytics tasks.
• Deployed machine learning models on Azure Machine Learning and automated their packaging in Docker containers, streamlining the path from development to production.
• Automated model deployment and scaling in Kubernetes, ensuring models handled fluctuating traffic by dynamically allocating resources as needed.
• Integrated Azure data services such as Azure Data Lake and Azure SQL Database to manage large datasets and create scalable solutions for business intelligence and predictive analytics.
• Performed data cleaning, transformation, and preprocessing using Hadoop ecosystem tools and Python, ensuring high-quality datasets for analysis and ML tasks.
• Developed and executed ad-hoc SQL queries, collaborated with analysts, and created reusable Python codebases to improve accessibility, efficiency, and consistency.
• Automated Tableau reporting processes and implemented data security best practices using User Filters and Row-Level Security (RLS) for sensitive information protection.
• Implemented advanced statistical analyses using SciPy, such as hypothesis testing, regression analysis, and probability distributions, to support data-driven business decisions.
• Handled time-series data manipulation in Pandas, including time-based aggregations and resampling, to improve forecasting accuracy and business planning (see the sketch after this list).
• Managed automated testing frameworks within Jenkins to validate data models, pipelines, and analytics code before production deployment.
• Used Git/GitHub to track and manage the evolution of analytical models, ensuring version control and smooth collaboration across teams.
• Provided training to team members on R programming, data manipulation, visualization best practices, and model interpretation to improve overall data literacy.
• Explained model results to non-technical stakeholders by visualizing feature importance and interpreting complex ML models for better business understanding.
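
As referenced in the time-series bullet above, the following is a minimal sketch of time-based aggregation and resampling in Pandas; the hourly series, date range, and column names are hypothetical placeholders for internal data.

    import numpy as np
    import pandas as pd

    # Hypothetical hourly series; production data came from internal sources.
    idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
    hourly = pd.DataFrame(
        {"value": np.random.default_rng(0).normal(100, 5, size=len(idx))},
        index=idx,
    )

    # Time-based aggregation: roll hourly observations up to daily statistics.
    daily = hourly.resample("D").agg(["mean", "min", "max"])

    # A 7-day rolling mean smooths short-term noise before forecasting.
    hourly["rolling_7d_mean"] = hourly["value"].rolling("7D").mean()
    print(daily.head())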

Data Scientist

Cognizant
Teaneck, New Jersey
08.2019 - 07.2022
• Designed interactive, visually appealing data visualizations using R's ggplot2 and Plotly, enabling stakeholders to quickly interpret insights and trends from large datasets.
• Partnered with data engineers to architect and build scalable SQL-based ETL pipelines, ensuring smooth, reliable data flow from raw sources to structured formats for analysis.
• Developed automated data extraction and transformation tools using Pandas, streamlining workflows and accelerating time-to-insight for complex datasets.
• Built and deployed scalable machine learning pipelines on AWS Glue, automating ETL and data preparation to support model training and inference.
• Applied machine learning algorithms using Spark MLlib on distributed datasets, improving the speed, scalability, and accuracy of predictive analytics.
• Leveraged Databricks to streamline distributed data processing and ETL workflows, enabling faster, more efficient handling of large-scale datasets.
• Implemented transfer learning in TensorFlow, leveraging pre-trained models such as ResNet to reduce training time and improve accuracy on image and text classification tasks.
• Used Keras to evaluate and validate deep learning models, ensuring robustness and generalizability before production deployment.
• Developed custom loss functions and optimizers in TensorFlow to handle business-specific challenges such as imbalanced classes and non-standard data distributions.
• Maintained robust Jenkins pipelines to automate complex data workflows, ensuring seamless integration, testing, and continuous delivery of data-driven applications.
• Deployed scalable data pipelines with AWS services such as S3, Glue, and Redshift, supporting efficient ETL processes and high-volume data analytics.
• Improved performance of Power BI dashboards on large datasets through query optimization, reducing load times and enhancing user experience.
• Created automated reporting solutions in Power BI, enabling business users to access real-time reports without manual intervention and improving departmental efficiency.
• Optimized Spark jobs through partitioning, caching, and broadcasting, maximizing memory efficiency and performance when processing large-scale datasets.
• Monitored PyTorch model training with TensorBoard and other visualization tools, tracking performance metrics and detecting issues early.
• Tuned XGBoost hyperparameters using grid and random search, enhancing model accuracy and reducing overfitting (a minimal sketch follows this list).
• Collaborated with cross-functional teams to define data requirements and design SQL-based reporting solutions, delivering actionable insights for multiple business units.
• Utilized SciPy, NumPy, and Pandas to develop integrated data analysis pipelines, addressing complex statistical and mathematical challenges.
• Deployed multiple data science tools within Docker containers, ensuring consistent, reproducible development and testing environments.
• Managed Kubernetes namespaces and access controls, isolating machine learning projects and ensuring secure, efficient resource allocation.
• Implemented MLflow tracking on Databricks to manage model versions, monitor experiments, and improve collaboration across the data science team.
• Collaborated using Git branches and pull requests, enabling simultaneous feature development and smooth integration into the main codebase.
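
As referenced in the XGBoost bullet above, here is a minimal sketch of hyperparameter tuning with scikit-learn's RandomizedSearchCV over an XGBoost classifier; the dataset, search space, and scoring choice are illustrative assumptions, not the production configuration.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    # Synthetic stand-in data; the production features are proprietary.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

    # Search space over the usual levers for trading fit against overfitting.
    param_dist = {
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.05, 0.1],
        "n_estimators": [100, 300, 500],
        "subsample": [0.7, 0.9, 1.0],
    }

    search = RandomizedSearchCV(
        XGBClassifier(eval_metric="logloss"),
        param_distributions=param_dist,
        n_iter=20,
        cv=5,
        scoring="roc_auc",
        random_state=42,
    )
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 4))

Random search is generally preferred over exhaustive grid search when the space is large, since it covers more distinct values per dimension for the same compute budget.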

Data Analyst

Apollo Hospitals
Hyderabad, India
05.2017 - 07.2019
• Implemented end-to-end data analysis and ETL pipelines using Python and SQL, automating the extraction, transformation, and loading of large datasets to support analytics and machine learning workflows.
• Automated repetitive data cleaning, preprocessing, and transformation tasks using Python and R, improving accuracy, efficiency, and team productivity.
• Built and deployed scalable machine learning pipelines using AWS Glue, Spark MLlib, TensorFlow, Keras, and PyTorch for classification, regression, clustering, NLP, and time-series forecasting.
• Applied transfer learning with pre-trained models (e.g., ResNet) and developed custom loss functions and optimizers to address business-specific challenges such as imbalanced classes and non-standard data distributions.
• Tuned XGBoost hyperparameters using grid and random search, optimizing predictive accuracy and reducing overfitting.
• Managed cloud-based data pipelines using Azure Data Factory, AWS S3, Glue, Redshift, and Azure SQL, enabling large-scale ETL processes and high-performance ML model training.
• Optimized distributed data processing workflows using Apache Spark and Spark SQL with partitioning, caching, and broadcasting to maximize performance and resource efficiency.
• Developed interactive dashboards in Tableau and Power BI, providing stakeholders with real-time insights, applying row-level security, and training non-technical users to create custom reports.
• Used Docker to standardize development and testing environments, and implemented Kubernetes namespaces and access controls to isolate ML projects and manage resources securely.
• Monitored model training with TensorBoard and other visualization tools to track performance metrics and detect issues early, ensuring robust model deployment.
• Maintained version control and collaboration using Git/GitHub with branches and pull requests, enabling parallel development and smooth integration across teams.
• Built integrated data analysis pipelines using Pandas, NumPy, and SciPy, including geospatial analysis with clustering, distance calculations, and spatial relationship modeling.
• Automated reporting and analytics workflows with Power BI and Tableau, reducing manual effort and improving accessibility for business stakeholders.
• Leveraged Databricks to orchestrate distributed ETL and ML pipelines, improving scalability and efficient handling of large-scale datasets.
• Implemented MLflow on Databricks to track experiments, manage model versions, and enhance collaboration across the data science team (a minimal sketch follows this list).
• Collaborated with cross-functional teams to define requirements, design SQL-based reporting solutions, and deliver actionable insights for strategic business decisions.
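
As referenced in the MLflow bullet above, here is a minimal sketch of experiment tracking with the MLflow API; the experiment name, model, and parameters are placeholders, and in practice a Databricks workspace would supply the tracking backend.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic data and a placeholder experiment name, for illustration only.
    X, y = make_classification(n_samples=1000, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    mlflow.set_experiment("demo-experiment")
    with mlflow.start_run():
        params = {"n_estimators": 200, "max_depth": 8}
        model = RandomForestClassifier(**params).fit(X_tr, y_tr)
        # Log parameters, a metric, and the model artifact for reproducibility.
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", accuracy_score(y_te, model.predict(X_te)))
        mlflow.sklearn.log_model(model, "model")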

Education

The University of Texas at Arlington
Arlington, TX

Skills

  • Programming Languages: Python, R, SQL, Bash
  • Machine Learning & Deep Learning: TensorFlow, Keras, PyTorch, XGBoost, Spark MLlib, Transfer Learning, Predictive Modeling, Time-Series Forecasting, Anomaly Detection, NLP, Image Classification
  • Data Science & Statistical Analysis: Pandas, NumPy, SciPy, Regression Analysis, Hypothesis Testing, Probability Distributions, Feature Engineering, Clustering, Distance Calculations
  • Big Data & Distributed Systems: Databricks, Apache Spark, Hadoop, HDFS, SQL-based ETL pipelines, Spark SQL, Distributed Data Processing & Optimization, Hadoop Ecosystem Tools
  • Cloud Platforms: AWS, Azure, GCP, Databricks on Azure and AWS
  • Data Visualization & Reporting: Tableau, Power BI, ggplot2, Plotly, Interactive Dashboards, Real-Time Analytics, Automated Reporting, Data Integration, Query Optimization
  • Version Control & Collaboration: Git, GitHub, Jenkins, GitLab, Branching, Merging, Version Control Best Practices
  • Containerization & Automation: Docker, Kubernetes, Helm, CI/CD pipelines for Model Deployment & Scaling, Automated ML & Data Pipelines, Databricks MLflow for Model Management
  • SQL & Database Management: Advanced SQL (Joins, Subqueries, Window Functions), Data Extraction, Transformation, and Loading (ETL), SQL Scripting for Analysis & Reporting, Databricks SQL for Big Data Analytics

Timeline

Senior Data Scientist/ML

Capital One
08.2022 - Current

Data Scientist

Cognizant
08.2019 - 07.2022

Data Analyst

Apollo Hospitals
05.2017 - 07.2019

The University of Texas at Arlington