SRIVISHNU JAMMULA

Frisco, USA

Summary

Tech Lead MLOps / DevOps Architect with over 10 years of experience building and operationalizing large-scale ML and AI infrastructure across cloud and hybrid environments. Certified Google Cloud Architect and Kubernetes expert skilled in designing end-to-end MLOps platforms covering model training, validation, deployment, monitoring, and governance.

Expert in Python-based automation, Kubernetes (GKE), container orchestration, and event-driven pipelines, integrating MLflow for model lifecycle management. Proven track record of building cost-optimized GPU clusters, CI/CD frameworks, and auto-remediation systems that drive operational excellence, reproducibility, and scalability.

Collaborative leader bridging data science, AI research, and engineering to ensure secure, compliant, and high-availability ML systems. Passionate about advancing automation, observability, and continuous improvement across the full AI delivery pipeline.

My role sits at the intersection of AI research, DevOps, and SRE. I actively drive AI/ML initiatives using MLflow, PyTorch, and TensorFlow frameworks for training, validating, and deploying production-grade models with strong governance and reproducibility. I’ve also contributed to building pretrained model workflows and documenting reliability frameworks aligned with SRE best practices for AI-driven systems.

Overview

11 years of professional experience
1 Certification

Work History

DevOps Tech Lead

Copart Inc
05.2018 - Current
  • Led a high-performing engineering team to deliver scalable, secure, and fault-tolerant ML and cloud-based solutions under tight deadlines.
  • Designed and implemented scalable MLOps architecture supporting end-to-end model training, validation, and deployment using MLflow, TensorFlow, and PyTorch frameworks.
  • Built model validation pipelines ensuring inference accuracy and latency SLAs before release to production environments.
  • Developed Python-based auto-remediation frameworks integrated with StackStorm and APM alerts to self-heal infrastructure incidents.
  • Implemented governance for model versioning and audit trails to ensure compliance across AI deployments.
  • Optimized ML infrastructure for GPU utilization and cost efficiency using dynamic resource pooling and autoscaling policies.
  • Built CI/CD automation frameworks with GitLab, Jenkins, and Spinnaker for GPU driver and AI workload validation.
  • Implemented monitoring and observability using Prometheus, Grafana, and Sumo Logic for performance and availability insights.
  • Developed Flask/Django-based dashboards for ML service health, training metrics, and CI/CD validation reports.
  • Partnered with NVIDIA hardware and software engineering teams to support GPU testing pipelines for CUDA and TensorRT on Linux and Windows platforms.
  • Designed and led the in-house deployment of a GPU-powered cluster for AI R&D, tripling model iteration speed and reducing cloud compute costs.
  • Built CI automation frameworks to validate new GPU drivers and AI workloads.
  • Collaborated with AI researchers to accelerate deployment of HPC and AI workloads across hybrid infrastructure.
  • Experienced with GCP services (storage, application services, deployment, and management) and managed GKE platform instances using custom Ansible modules and Spinnaker.
  • Set up GCP firewall rules to allow or deny traffic between VMs and improved response times with NGINX caching, enhancing user experience and reducing latency.
  • Created GCP projects and VPC subnetworks; set up GKE cluster environments including Helm charts, ingress controllers, kubectl, CI/CD pipelines deploying services from the registry based on image tags, HPA, and labels for node pools dedicated to high-resource-contention apps.
  • Implemented continuous testing integrated with QA automation to verify application functionality after deployments.
  • Integrated security practices into the CI/CD pipeline to achieve secure application deployments.
  • Maintained a scalable, fault-tolerant container orchestration platform using Kubernetes to orchestrate Docker containers hosting Java (Spring Boot), Node.js, Ruby (JRuby), and Python (Flask, Django) applications.
  • Developed Shell, Python, and Perl scripts and Spinnaker manifests to build out the servers and Docker containers hosting critical Copart applications.
  • Designed and developed defensive mechanisms to prevent attacks such as Cross-Site Request Forgery (CSRF) and Cross-Site Scripting (XSS).
  • Utilized industry-standard OAuth 2.0, LDAP, and Active Directory to authenticate and authorize users across all applications with single sign-on.
  • Developed a data-refresh dashboard using Django and Python.
  • Maintained a Solr cluster and developed Kafka streaming applications to provide high-performing search capabilities, enabling Copart's external and internal users to search across 2 million+ salvage vehicles.
  • Applied knowledge of Java memory management and optimization techniques to improve application performance by tuning Java heap and garbage collection settings.
  • Developed rules and configurations for NGINX web servers for load balancing, caching, proxying, and filtering Copart's customer web requests.
  • Defined and developed continuous integration and continuous deployment (CI/CD) practices across the software development lifecycle, integrating the supporting tools.
  • Utilized and maintained a Redis cluster as an in-memory session cache data store to improve performance of Copart's applications.
  • Managed and maintained a SwiftStack cluster for object storage of imaging and document services, storing over 1 million images and documents of salvage vehicles ready for auction.
  • Researched and identified errors and performance issues in Copart's Java Spring applications using data scraping and analysis.
  • Developed Python- and YAML-based scripts integrating tools such as Slack, StackStorm, Jenkins, and Ansible to streamline and automate the continuous integration and deployment process for application builds and deployments.
  • Wrote StackStorm workflows integrated with Slack to automate application builds, restarts, upgrades, deployments, and management.
  • Developed a custom SSO dashboard for all DevOps tools and for distributing monitoring and metrics to management.
  • Developed a dashboard to track application logging standards and reject Docker images that do not comply with them.
  • Excellent network debugging knowledge.
  • Participated in and supported major project releases for Copart business requirements.
  • Developed Calico network security policies for the Kubernetes cluster to provide simple, scalable, and secure virtual networking for containers.
  • Analyzed and used Prometheus metrics to enable the Horizontal Pod Autoscaler for applications deployed in Docker containers for high availability.
  • Developed auto-remediation tasks in the Kubernetes infrastructure and implemented HPA for Copart's Spring Boot apps.
  • Developed and maintained WSO2 infrastructure with Carbon archive-based applications used within Copart's billing and payment system.
  • Proposed and implemented internationalization and localization principles to ensure the software is usable in all supported countries.
  • Pioneered a log management tool for collecting and monitoring application logs, and developed regular-expression queries to analyze application log data and alert on errors.
  • Introduced the continuous inspection tool SonarQube to enhance CI/CD, systematically improving code quality through static code analysis that detects bugs, code smells, and security vulnerabilities.
  • Introduced continuous testing to validate checked-in code in its respective environment.
  • Developed multiple custom Docker images for Copart's yards business applications.
  • Followed change management audit compliance controls for releasing applications to the production environment.
  • Successfully migrated Spring microservices and app-specific dependency build-outs, including the Redis memory store, RabbitMQ cluster, GKE and its components, and secret managers.
  • Collaborated with systems engineers to improve caching of objects on Google Cloud Storage via NGINX load balancers.

Sr. DevOps Engineer

Crocs
01.2016 - 05.2018
  • Architected the infrastructure deployment on GKE for the cloud migration project.
  • Built a GKE cluster, creating nodes, replication controllers, deployments, labels, health checks, and ingress via YAML, and integrated CI/CD for service deployments.
  • Used Jenkins to prepare Docker images for all microservice build-outs, store them in the Docker registry, and auto-deploy them to the Kubernetes cluster as pods.
  • Conducted security assessments and code reviews to identify vulnerabilities and suggest mitigations.
  • Coordinated and worked with developers on the Git branching strategy, including develop and feature branches and image labelling/naming; performed pull requests and code reviews.
  • Developed and maintained secure infrastructure configurations using tools like Terraform and Ansible.
  • Implemented diversified node pools for GKE, along with autoscaling, disaster recovery, and monitoring for worker nodes.
  • Used Prometheus and Instana (an APM tool) to collect metrics to optimize resource utilization and costs on GCP, and set up monitoring and alerting to address service performance issues.
  • Implemented Horizontal Pod Autoscaling for apps based on metrics exposed to Prometheus.
  • Created environments dynamically for each feature branch with Google Kubernetes Engine (GKE).
  • Used the Nexus and Artifactory repository managers for Maven builds and used Maven dependency management to deploy snapshot and release artifacts to Nexus for sharing across projects.
  • Extensive knowledge of GCP client libraries and the Cloud SDK.
  • Used the Incapsula content delivery platform to route incoming internet traffic to internal applications hosted within the company's IT infrastructure.
  • Set up and maintained logging and monitoring systems using Grafana.
  • Used Git webhooks to trigger Jenkins CI/CD pipelines written in Jenkins Groovy, and authored a code deployment guide for developers, testers, and production management.
  • Conducted and participated in monthly technology talk sessions on new software products and services such as Instana, Zipkin, Spinnaker, SonarQube, ManageEngine, OpenStack, and Sumo Logic; these tools aid cross-functional teams in troubleshooting and debugging software.
  • Collaborated with developers and was involved in unit testing for bug-free code releases.
  • Integrated Stackdriver monitoring of GCP infrastructure with New Relic for observability.

SRE Build and Release Engineer

CVS Pharmacy
11.2014 - 01.2016
  • Experienced in building and scaling observability/monitoring systems.
  • Performed code reviews, evaluated implementations, and provided feedback for tool improvements.
  • Installed, configured, and administered DNS, LDAP, and Sendmail on Red Hat Linux.
  • Partnered with internal and operational counterparts to champion automation efforts.
  • Designed, deployed, and configured automation tooling such as Ansible and Python.
  • Created playbooks and workflows to achieve a high level of automation.
  • Contributed to the enterprise infrastructure vision and strategy.
  • Developed strong knowledge of container technologies (Docker).
  • Leveraged open technologies such as Docker, Bash, Spring Boot, playbooks, Git, Jenkins, Linux, Java, Kafka, MongoDB, and Apache ZooKeeper for distributed synchronization.
  • Resolved all IP network issues to reduce waste and downtime while ensuring client Service Level Agreements were met.
  • Developed automation framework for public cloud infrastructure deployments.
  • Worked with and managed Windows IIS native modules that translate request protocols.
  • Worked with the security team to perform security reviews of hosting environments.
  • Introduced the StackStorm tool to automate repetitive tasks.
  • Metric-driven and focused on continual improvement.
  • Working knowledge of build automation and CI/CD pipelines.
  • Worked with SREs to monitor application performance, create automated system administration monitoring and alerting systems, and respond to service-interruption events for running services.
  • Evaluated and rolled out an AWS OpenSearch/Kibana logging solution and integrated it with SSO.
  • Participated in product sprints, wrote user stories, and delivered on them for a highly available, secure, globally performing service.
  • Strong knowledge of creating Jenkins continuous integration pipelines and automating deployment pipelines.
  • Responsible for performing tasks like branching, tagging, and release activities on version control tool.
  • Managed system builds, server builds, installs, upgrades, security patches, migrations, backups, disaster recovery, performance monitoring, and fine-tuning on Red Hat Linux systems.
  • Worked with other developers to create a knowledge repository and to set up and maintain development environments.
  • Worked continuously to improve the Change Management and Release Management processes, including a planned transition to Agile methodologies.
  • Worked with the Release Manager to improve build automation and reduce bottlenecks in the delivery pipeline.
  • Participated in load and performance tests to ensure the application was production-ready; implemented monitoring, developed runbooks, and built self-healing automation.
  • Provided on-call production support.

Education

Master of Science - Information Technology

University of The Cumberlands
Williamsburg, KY

Skills

  • Programming & Automation: Python (advanced), Go (basic), Bash, StackStorm, Ansible
  • ML Infrastructure & Serving: MLflow, Kubeflow, Vertex AI, Airflow, TensorFlow Serving, TorchServe
  • Cloud Platforms: Google Cloud (GKE, BigQuery, Pub/Sub, Bigtable, Memorystore), AWS (EKS, S3, Lambda)
  • Infrastructure-as-Code & CI/CD: Terraform, Helm, Jenkins, GitLab CI, Spinnaker
  • Monitoring & Observability: Prometheus, Grafana, Sumo Logic, ELK Stack, Instana
  • Data Pipelines & Streaming: Kafka, Spark, RabbitMQ, Redis
  • Container Orchestration: Kubernetes, Docker (secure builds & runtime optimization)
  • Security & Governance: Vault, IAM, SSO (OAuth2, Okta), Model Auditing, Compliance
  • Leadership: Cross-functional mentorship, AI/ML platform road-mapping, best practice evangelism

Accomplishments

  • End-to-End MLOps Platform Buildout: Architected a centralized ML control platform integrating MLflow and Vertex AI Pipelines for training, validation, deployment, and drift monitoring, establishing full model versioning, rollback, and reproducibility.
  • GPU Infrastructure for LLM Training: Designed and deployed NVIDIA runtime GKE clusters enabling scalable LLM and computer-vision workloads, improving training throughput 3× while cutting costs 40%.
  • Automated Model Operations and Remediation: Built StackStorm event-driven frameworks in Python for L1 incident detection and self-healing based on APM signals (Prometheus + Logging), reducing manual intervention 60%.
  • Observability and Governance Enhancement: Implemented real-time metrics, alerting, and drift detection dashboards using Prometheus, Grafana, and custom Django UIs; integrated ML audit logging and model registry governance.
  • Cross-Functional Leadership: Led a team of 5 DevOps and ML engineers, mentoring them on IaC, MLOps automation, and cloud optimization practices, bridging AI research and production engineering.

Certification

  • Google Certified Professional Cloud Architect - 2023
  • Certificate of Recognition for the Copart data center migration
  • Mentored and coached interns on implementing CI/CD, DevOps, DevSecOps, security, cloud, SRE, and containers, helping the organization.

Timeline

DevOps Tech Lead

Copart Inc
05.2018 - Current

Sr. DevOps Engineer

Crocs
01.2016 - 05.2018

SRE Build and Release Engineer

CVS Pharmacy
11.2014 - 01.2016

Master of Science - Information Technology

University of The Cumberlands