Overall Summary
- Spearheaded the automation of AppDynamics onboarding, driving a 6× increase in APM adoption: from approximately 100 to over 600 onboarded applications, including sub-components and supporting services.
- Observed a 30% improvement in Mean Time to Detect (MTTD) by addressing key observability gaps across the application, infrastructure, and middleware layers.
- Contributed to a significant reduction in Mean Time to Resolve (MTTR) for critical incidents through enhanced monitoring visibility, root cause triage, and proactive escalation support.
- While incident volume remained consistent, improved observability enabled earlier detection of performance degradations, preventing escalation to P1/P2 incidents and improving service resilience.
APM Strategy and Cross-Application Enablement
- Designed tailored APM strategies based on individual application architectures; partnered with development teams to guide monitoring adoption, identify feasibility, and resolve implementation gaps.
- Developed standardized monitoring blueprints, including agent selection (Java, infrastructure, DB), business transaction detection, custom alerting thresholds, and anomaly detection setup.
- Provided structured feedback to the AppDynamics product team, leading to enhancements in Java Agent observability for enterprise-grade deployments.
Automation of AppDynamics Onboarding
- Architected and implemented an automated AppDynamics onboarding platform using the Spring Framework and MongoDB, reducing onboarding time from 20 minutes to under 3 minutes, significantly boosting adoption across Cisco IT.
- The automation framework was adopted as a reference API design by the AppDynamics product team for wider use cases.
Integration with Event Correlation and Incident Management
- Developed an alert-forwarding system to push AppDynamics events to a Kafka-based event correlation engine, enabling seamless integration with ServiceNow for automated incident creation and triaging.
APM Agent Enhancements and Product-Level Improvements
- Played a key role in enhancing APM agent capabilities by identifying platform-specific gaps and providing structured feedback to AppDynamics product teams, resulting in improved agent observability and performance.
- Led monitoring enablement across a wide range of agents, including the Cluster Agent (OpenShift monitoring), Network Agent (network telemetry to complement APM), Web Server Agent, Database Agents (Oracle, MongoDB, PostgreSQL, Cassandra), Machine Agent with custom extensions (Kafka, RabbitMQ), and language-specific agents (Java, Python, Node.js).
- Conducted deep feasibility studies and PoCs to validate agent functionality, enrich metric collection, and standardize onboarding across diverse environments.
- Influenced multiple agent roadmap improvements through recurring engagements with AppDynamics engineering teams.
AI Collaboration and Automation Bots
- Partnered with the TCS internal AI team to develop a proof-of-value 'Smart Sensor' for predictive analytics, leveraging historical APM and performance telemetry.
- Designed and deployed a command center bot capable of triggering remediation actions, such as auto-scaling, server restarts, and traffic rerouting, based on monitoring insights.
Cloud-Native Transformation & CI/CD Integration
- Led the transformation of internal platform applications from traditional monolithic architectures to cloud-native, containerized solutions deployed on the OpenShift Container Platform (OCP).
- Partnered with enterprise DevOps teams to integrate applications into standardized CI/CD pipelines, ensuring smooth, automated deployment workflows aligned with organizational best practices.
- Enabled platform services to adopt modern development patterns, improving scalability, fault tolerance, and deployment consistency.
- Ensured cloud-native apps adhered to platform observability, security, and compliance standards as part of the modernization effort.
Incident Management and Root Cause Analysis
- Actively contributed to P1/P2 incident triaging and on-call rotations; leveraged AppDynamics and performance engineering techniques to conduct RCA and provide both short-term remediations and long-term recommendations to development teams.
Documentation, Enablement, and Standards
- Authored comprehensive documentation and playbooks covering APM agent onboarding, alert configuration, performance thresholds, and anomaly detection best practices.
- Established reusable templates and reporting formats to drive consistency in problem detection, observability coverage, and monitoring metrics reporting across teams.
Operational Support for Kubernetes and OpenShift
- Supported the daily health, stability, and availability of critical production Kubernetes clusters for a focused six-month period, ensuring the seamless operation of containerized applications.
- Led incident management efforts for a wide range of issues, including application deployment failures, YAML misconfigurations, GSLB-related outages, access control and network policy issues, container crashes, and autoscaling challenges.
- Supported the rollout of OpenShift as a Service, enabling development teams to provision target environments through automated, self-service workflows, significantly reducing onboarding and provisioning effort.