Sriram Maheswaran

San Ramon, CA

Summary

Principal Engineer with 18+ years in software engineering and 8+ years at Apple, operating Kubernetes infrastructure at fleet scale (2,000+ clusters), architecting AI/ML agent platforms, and rolling out end-to-end observability and distributed tracing. Hands-on technical leader with deep expertise in deployment automation, cluster lifecycle, and OpenTelemetry/Jaeger tracing—mentors engineering teams and drives architecture across organizations.

Overview

21 years of professional experience

Work History

Principal Engineer

Apple Inc
12.2017 - Current
  • Operate Kubernetes infrastructure at fleet scale—2,000+ clusters across multiple regions supporting engineering, AI/ML, and external-facing user workloads. Own cluster lifecycle automation (provisioning, upgrades, decommissioning), capacity planning, and multi-region HA.
  • Led the original design, build-out, and migration of legacy workloads to Kubernetes-based platforms—modernizing the deployment model and improving reliability and scalability fleet-wide.
  • Hardened cluster operations across the 2k+ cluster footprint—HA, fleet-wide rolling upgrades, capacity planning, GitOps-based config management, and incident response—for production-critical services.
  • Manage external-facing Apple web properties on Kubernetes at global scale—owning scalability, high availability, and reliability across high-traffic user-facing workloads.
  • Run the Nginx-based ingress and traffic-handling layer fronting these services for routing, TLS termination, and resilience.
  • Operate Spinnaker-based CI/CD pipelines and Kubernetes deployments that enable rapid, safe rollouts: automated promotion across environments, canary and blue-green strategies, and automated rollback. Led CI/CD automation across multiple internal services, cutting manual release effort and accelerating delivery cycles.
  • Standardized container-based build and test agents, enabling reproducible builds and consistent environments from dev through production.
  • Led the monitoring and observability function—defined alerting standards, designed dashboards, and rolled out telemetry across services using Prometheus, Alertmanager, and Grafana.
  • Designed and rolled out end-to-end distributed tracing with OpenTelemetry (OTEL) instrumentation and Jaeger—giving engineering teams cross-service request traces, latency breakdowns by hop, and dramatically faster root-cause identification across multi-service workflows.
  • Established SLO/SLI-based reliability practices, reducing alert noise and improving mean-time-to-detection and recovery across the fleet.
  • Use Catchpoint synthetic monitoring to validate availability, performance, and end-user experience from diverse global vantage points, detecting regressions before customers are impacted.
  • Architect and own Apple's company-wide AI/ML agent platform—the foundation for agent-based applications across the organization.
  • Provide technical architecture mentorship to a team of 5 engineers: review system designs, guide architectural decisions, and partner on coding and design improvements.
  • Designed the reusable agent framework on Apple's AI/ML platform—intent-based routing dispatches requests to specialist agents via declared capability metadata, automated skill/capability discovery makes new agents routable without manual wiring, and a declarative config-driven onboarding model lets teams add agents by changing config only (no platform code changes).
  • Integrated mem0 for persistent cross-session agent memory; adopted LangGraph for graph-based multi-step orchestration (stateful execution, conditional branching, retries); standardized on the A2A (Agent-to-Agent) protocol for inter-agent communication, enabling multi-agent workflows that span team and organizational boundaries.
  • Built Claude Agent SDK-powered agents for SRE automation—on-call alert analysis correlates signals across systems and ranks root-cause hypotheses (cutting time-to-triage), plus knowledge-graph dependency reasoning over services and infrastructure for impact analysis, change-risk assessment, and faster incident response.
  • Architected and built Apple's internal RAG platform from scratch on Milvus—chunking strategy, pluggable ingestion/retrieval architecture (custom embedders, retrievers, post-processors), and ANN/sharding tuning for production-scale vector search. Own reliability and onboarding; adopted across multiple internal organizations.
  • Manage NVIDIA Triton Inference Server deployments serving the Apple Maps organization—production model-serving infrastructure for Maps' ML/AI workloads.
  • Led cross-functional teams to drive product development and enhance system architecture.

Staff Software Engineer

Juniper Networks
02.2013 - 12.2017
  • Designed and maintained deployment infrastructure (SaltStack/Pillar/Jinja) across the Cloud Service-Platform product; managed Docker, Kubernetes, Keystone, RabbitMQ (cluster + federation), Cassandra, and Swift.
  • Led an AWS provisioning project using Terraform—owned design and team mentoring; delivered in a month and received the Business AMP company-level award.
  • Implemented HA and scalability (HAProxy/Keepalived, Kubernetes master, Salt master, etcd); cut deployment time from 2 hours to 40 minutes by parallelizing deployment steps while avoiding HA race conditions.
  • Introduced Reprepro (local Debian repo) to eliminate internet dependency for library downloads—unblocked a major customer win.
  • Earlier: HA cluster management and MySQL clustering/DR on Junos Space.

Senior Software Engineer

Bluecoat Systems
01.2012 - 01.2013
  • Cloud Services platform—drove Spring + RabbitMQ POCs and adoption, REST web services, DAO/SQL design, Flex3/ExtJS4 GUI.

Senior Software Engineer

Alert Enterprise
01.2011 - 01.2012
  • Alert Controls built from scratch on Spring + Hibernate—DAO layer, scheduled-jobs subsystem, Spring-CXF web services.

Senior Engineer, Product Development

Symphony Services
01.2010 - 01.2011
  • Jobvite Source recruiting / ATS.

Software Engineer

Cisco Systems
01.2008 - 01.2009
  • Built PSMART (PSIRT vulnerability tracking) and the Feature Tracking System for IOS / IOS XE / IOS XR.

Systems Engineer

Tata Consultancy Services
01.2005 - 01.2008
  • SWIFT EAI engine for Clearstream, Fidelity, RJ, FNB.

Education

B.Tech. - Information Technology

SSN College of Engineering, Anna University
Chennai, India
04.2005

Skills

  • AI / ML & AGENTS: LangGraph, A2A protocol, Claude Agent SDK, Mem0, Agent routing, Automated skill discovery, Knowledge-graph reasoning, Config-driven onboarding
  • RAG & RETRIEVAL: Milvus (vector DB), Embedding-based retrieval, Chunking strategies, Pluggable ingestion/retrieval pipelines, NVIDIA Triton Inference Server
  • INFRASTRUCTURE: Kubernetes, Docker, Nginx, Spinnaker, SaltStack, Terraform, Helm, ETCD, CI/CD
  • CLOUD & KNOWLEDGE GRAPH: AWS (EC2, S3, ELB, MSK, ElastiCache), Neo4j, TigerGraph
  • OBSERVABILITY & TRACING: Prometheus, Alertmanager, Grafana, ELK, OpenTelemetry (OTEL), Jaeger, Distributed tracing, Catchpoint (synthetic monitoring), SLO/SLI practices
  • DATA & MESSAGING: PostgreSQL, MySQL, Cassandra, Redis, Oracle, RabbitMQ (cluster federation), JMS/ActiveMQ
  • FRAMEWORKS: Spring, Hibernate, REST/CXF, Jinja
