Aditya Chavali

Site Reliability Engineer
Fremont

Summary

  • Results-driven Platform Engineer with over 19 years of experience driving large-scale observability, automation, and performance engineering initiatives across enterprise IT environments.
  • Proven expertise in enabling end-to-end application monitoring (APM), cloud-native transformations, and modern deployment practices using tools like AppDynamics, Dynatrace, Splunk, Ansible, OpenShift, and Kubernetes.
  • Instrumental in scaling APM adoption from 100 to over 600 applications by automating onboarding processes, reducing manual effort by 85%, and improving Mean Time to Detect (MTTD) by 30%.
  • Adept at integrating infrastructure, middleware, and app-level monitoring while partnering with product vendors to influence agent-level improvements.
  • Skilled in leading platform modernization initiatives, including migration of legacy systems to containerized deployments on OCP, integrating with enterprise CI/CD pipelines, and enhancing production readiness through AI-driven alerting and real-time telemetry dashboards.
  • Strong background in incident management, root cause analysis, and operational support of critical production environments. Recognized for building automation frameworks, proactive monitoring systems, and performance baselines that ensure resilience, scalability, and reliability across hybrid cloud architectures.

Overview

19 years of professional experience
2 Certifications
3 Languages

Work History

Lead Infrastructure Engineer

Wells Fargo
08.2023 - Current

Overall Summary

  • Played a pivotal role in improving production monitoring maturity, leading to better visibility, more accurate alerting, and increased platform reliability during large-scale releases.
  • Managed and monitored installed systems to maintain the highest level of availability.
  • Improved secret rotation efficiency by 60% via Python-based Vault automation, reducing manual intervention and compliance risk.
  • Enabled safe, controlled deployments across three major platforms through robust CI/CD integrations and advanced deployment strategies (Canary, Blue/Green).


Deployment Automation and Release Engineering

  • Led production deployments using Ansible, partnering with DevOps teams to enable Canary and Blue/Green deployment models, with dynamic traffic routing via F5 and AVI load balancers.
  • Orchestrated seamless deployments across diverse platforms (legacy VMs, PCF, and OpenShift) using tools such as UDeploy and Harness.
  • Contributed key insights to enterprise-wide CI/CD initiatives, ensuring pipeline security, resilience, and cross-platform compatibility.


Platform Modernization and Infrastructure Migration

  • Drove F5 to AVI migration efforts for multiple applications, spanning both traditional infrastructure and PCF-based microservices, improving performance and simplifying the load balancing strategy.
  • Developed Python-based automation to streamline HashiCorp Vault secret management, improving operational efficiency by 60% and reducing human error in sensitive configurations.


Observability, Monitoring, and Performance Engineering

  • Created real-time Grafana dashboards using data from Splunk to monitor traffic distribution and ensure migration success for critical application flows.
  • Drove ongoing improvements in production monitoring and alerting for Line-of-Business (LOB) applications, significantly enhancing end-to-end visibility.
  • Developed runbooks, on-call playbooks, and real-time dashboards to support high-visibility national product launches, ensuring incident readiness and stakeholder communication.


AI-Based Alerting and Monitoring Consolidation

  • Currently leading a major initiative to consolidate fragmented monitoring systems and implement AI-based alerting, aimed at reducing noise and improving alert accuracy across enterprise applications.


Performance Engineering Leadership

  • Acted as a key contributor across all performance engineering domains, including monitoring, alerting, and tuning, using tools like AppDynamics, Dynatrace, Glassbox, Splunk, ThousandEyes, and General Nelson.
  • Collaborated cross-functionally to ensure platform engineering capabilities (resilience, scalability, availability) were fully implemented across mission-critical applications.

Lead Performance Engineer

Tata Consultancy Services
02.2017 - 08.2023

Overall Summary

  • Spearheaded the automation of AppDynamics onboarding, leading to a 6× increase in APM adoption — scaling from approximately 100 to over 600 applications onboarded, including sub-components and supporting services.
  • Observed a 30% improvement in Mean Time to Detect (MTTD) by addressing key observability gaps across the application, infrastructure, and middleware layers.
  • Contributed to a significant reduction in Mean Time to Resolve (MTTR) for critical incidents through enhanced monitoring visibility, root cause triage, and proactive escalation support.
  • While incident volume remained consistent, improved observability enabled earlier detection of performance degradations, preventing escalation to P1/P2 incidents and improving service resilience.


APM Strategy and Cross-Application Enablement

  • Designed tailored APM strategies based on individual application architectures; partnered with development teams to guide monitoring adoption, identify feasibility, and resolve implementation gaps.
  • Developed standardized monitoring blueprints, including agent selection (Java, infrastructure, DB), business transaction detection, custom alerting thresholds, and anomaly detection setup.
  • Provided structured feedback to the AppDynamics product team, leading to enhancements in Java Agent observability for enterprise-grade deployments.


Automation of AppDynamics Onboarding

  • Architected and implemented an automated AppDynamics onboarding platform using the Spring Framework and MongoDB, reducing onboarding time from 20 minutes to under 3 minutes, significantly boosting adoption across Cisco IT.
  • The automation framework was adopted as a reference API design by the AppDynamics product team for wider use cases.


Integration with Event Correlation and Incident Management

  • Developed an alert-forwarding system to push AppDynamics events to a Kafka-based event correlation engine, enabling seamless integration with ServiceNow for automated incident creation and triaging.


APM Agent Enhancements and Product-Level Improvements

  • Played a key role in enhancing APM agent capabilities by identifying platform-specific gaps and providing structured feedback to AppDynamics product teams, resulting in improved agent observability and performance.
  • Led monitoring enablement across a wide range of agents, including: Cluster Agent (OpenShift monitoring), Network Agent (network telemetry to complement APM), Web Server Agent, Database Agents (Oracle, MongoDB, PostgreSQL, Cassandra), Machine Agent with custom extensions (Kafka, RabbitMQ), and Language-specific Agents (Java, Python, Node.js).
  • Conducted deep feasibility studies and PoCs to validate agent functionality, enrich metric collection, and standardize onboarding across diverse environments.
  • Influenced multiple agent roadmap improvements through recurring engagements with AppDynamics engineering teams.


AI Collaboration and Automation Bots

  • Partnered with the TCS internal AI team to develop a proof-of-value 'Smart Sensor' for predictive analytics, leveraging historical APM and performance telemetry.
  • Designed and deployed a command center bot capable of triggering remediation actions, such as auto-scaling, server restarts, and traffic rerouting, based on monitoring insights.


Cloud-Native Transformation and CI/CD Integration

  • Led the transformation of internal platform applications from traditional monolithic architectures to cloud-native, containerized solutions deployed on the OpenShift Container Platform (OCP).
  • Partnered with enterprise DevOps teams to integrate applications into standardized CI/CD pipelines, ensuring smooth, automated deployment workflows aligned with organizational best practices.
  • Enabled platform services to adopt modern development patterns, improving scalability, fault tolerance, and deployment consistency.
  • Ensured cloud-native apps adhered to platform observability, security, and compliance standards as part of the modernization effort.


Incident Management and Root Cause Analysis

  • Actively contributed to P1/P2 incident triaging and on-call rotations; leveraged AppDynamics and performance engineering techniques to conduct RCA and provide both short-term remediations and long-term recommendations to development teams.


Documentation, Enablement, and Standards

  • Authored comprehensive documentation and playbooks covering APM agent onboarding, alert configuration, performance thresholds, and anomaly detection best practices.
  • Established reusable templates and reporting formats to drive consistency in problem detection, observability coverage, and monitoring metrics reporting across teams.


Operational Support for Kubernetes and OpenShift

  • Supported the daily health, stability, and availability of critical production Kubernetes clusters over a focused six-month period, ensuring the seamless operation of containerized applications.
  • Led incident management efforts for a wide range of issues, including application deployment failures, YAML misconfigurations, GSLB-related outages, access control, network policies, container crashes, and autoscaling challenges.
  • Supported the rollout of OpenShift as a Service, enabling development teams to provision target environments through automated, self-service workflows, significantly reducing onboarding and provisioning effort.

Lead Performance Engineer

TCS Pioneer, ITPL
04.2014 - 02.2017
  • Engineered and developed comprehensive monitoring strategies for complex application architectures, ensuring high availability and faster response times for live applications.
  • Continuously evaluated and integrated multiple monitoring tools to enrich end-to-end performance visibility.
  • Performed advanced Java Heap Sizing, Garbage Collection tuning, and Memory leak analysis for microservice-based applications running on Kubernetes clusters.
  • Diagnosed and resolved high response time and thread contention issues using APM tools such as Dynatrace, AppDynamics, and New Relic across various application servers, including Tomcat, JBoss, and WebLogic.
  • Provided recommendations for fine-tuning container-based deployments, including capacity planning, resource sizing, and load balancing strategies.
  • Resolved critical P1/P2 production performance issues and collaborated with development teams on design changes to resolve performance bottlenecks.
  • Mentored and trained a team of engineers on new technologies and skill enrichment, driving a culture of continuous improvement.

Lead Performance Engineer

Oracle India Pvt Ltd
04.2011 - 04.2014
  • Led performance tuning for Oracle BPM applications, diagnosing issues like memory leaks using JProfiler, Eclipse MAT, and Samurai.
  • Engineered production-like multi-tenant environments on Linux, automating installations and monitoring with custom shell scripts.
  • Developed load test scripts with Oracle Application Testing Suite to benchmark key business flows and identify performance bottlenecks.
  • Collaborated with DBAs to analyze database performance via AWR and STS reports, improving system efficiency.
  • Implemented proactive monitoring strategies to enhance system resilience and streamline deployments.
  • Built and proposed prototypes to resolve critical performance issues, while also providing technical guidance and training to junior team members.

Technology Consultant

Hewlett Packard
04.2010 - 05.2011
  • Drove a 48% improvement in application performance by redesigning a legacy Telecom Application MMS Connector with Java and Spring Framework.
  • Architected new middleware layer using Core Java and REST APIs, enhancing Telecom MMS Connector functionality.
  • Built performance workload model to validate REST API migration success, demonstrating 48% application performance increase.
  • Implemented proactive monitoring strategies to ensure high availability in hybrid cloud environments.
  • Collaborated with cross-functional teams to conduct root cause analyses and optimize system resilience.

Product Developer

BMC Software
10.2008 - 04.2010
  • Architected and developed comprehensive monitoring software solutions for VMware infrastructure, enhancing operational visibility and control.
  • Implemented multithreaded Auto-Discovery Engine using Core Java to automate discovery of virtual machines and hosts.
  • Designed VMotion Detector monitoring adapter to precisely track virtual machine migrations across infrastructure.
  • Developed solutions for Host Bus Adapters (HBA) and Host Data Stores, enabling proactive identification of hardware and storage issues.
  • Executed end-to-end performance tests, ensuring robust application performance through monitoring modules.
  • Conducted performance tuning and optimization initiatives pre-launch, contributing to a seamless production environment.
  • Led cross-functional collaboration to drive root cause analysis and implement proactive monitoring strategies.
  • Utilized APM tools and DevOps practices to enhance system resilience in hybrid cloud environments.

Java Performance Engineer

Symphony Services
07.2006 - 10.2008
  • Engineered and maintained production-like lab environments using Oracle Fusion Middleware Suite and Linux servers for robust performance testing.
  • Drove performance tuning and optimization for Oracle Application Development Framework (ADF) applications, achieving a 45% improvement in responsiveness.
  • Authored custom load test scripts with LoadRunner to benchmark critical business flows against production-scale infrastructure.
  • Conducted root cause analysis of performance degradation issues, reducing system downtime by 20% through targeted optimizations.
  • Implemented proactive monitoring strategies with APM tools to enhance system resilience and ensure high availability.

Education

Bachelor of Technology (B.Tech) - Computer Science Engineering and Information Technology

Jawaharlal Nehru Technological University

Master of Technology (M.Tech) - Computer Science Engineering

Jawaharlal Nehru Technological University

Skills

Java

Certification

Certified Kubernetes Administrator, Cloud Native Computing Foundation, Cupertino, March 2023 — March 2026
