Hi, I’m

MANASAPRIYA BERI

Site Reliability Engineer (SRE)
SFO, CA

Summary

DevOps and Site Reliability Engineer with over 9 years of experience designing, automating, and managing large-scale cloud-native infrastructure across AWS, Azure, GCP, and on-premises Kubernetes clusters. Applies expertise in Kubernetes (200+ applications), Kafka, CI/CD, Infrastructure as Code (IaC), and SRE practices to improve system reliability and reduce Mean Time to Recovery (MTTR). Proven track record of building scalable, highly available systems in hybrid environments and of delivering AI-assisted DevOps automation that integrates Jira, Jenkins, Kubernetes, PostgreSQL, and LLM-based intent parsing. Committed to improving production reliability through AI-driven monitoring and anomaly detection and through streamlined CI/CD operations.

Overview

9
years of professional experience
10
Certifications

Work History

Williams Sonoma Inc.

Site Reliability Engineer
06.2022 - Current

Job overview

  • Environment: AWS, Azure, On-Premises Data Centers | Kubernetes (EKS, AKS), Helm, Argo CD | Apache Kafka | Jenkins, GitHub Enterprise | Terraform (IaC) | Prometheus, Grafana, ELK, Splunk, OpenSearch, AppDynamics | Linux, TCP/IP, DNS, Load Balancing, TLS/SSL | Python, Shell Scripting, Spring Boot | PostgreSQL | Jira Webhooks | OpenAI APIs, LLM Integration, AI-driven DevOps Automation, AIOps Monitoring
  • Responsibilities & Achievements:
  • Owned reliability, availability, and performance of large-scale production systems across AWS, Azure (AKS), and on-prem Kubernetes clusters, supporting 250+ microservices across production and disaster recovery environments with uptime targets of 99.5%+.
  • Managed hybrid cloud architecture integrating cloud (AWS/Azure) and on-prem infrastructure, ensuring secure connectivity, workload portability, and high availability.
  • Led Kubernetes platform engineering across EKS, AKS, and on-prem clusters, including cluster provisioning, upgrades, namespace governance, network policies, autoscaling (HPA), and resource optimization.
  • Designed and automated Kubernetes deployment workflows using Helm and Argo CD (GitOps), reducing manual deployment efforts and ensuring configuration consistency across environments.
  • Administered and supported Apache Kafka cluster deployments across cloud and on-prem environments, including broker provisioning, topic configuration (partitioning/replication), ACL management, and retention tuning.
  • Monitored Kafka clusters for broker health, ISR sync, throughput, consumer lag, and replication performance using Prometheus and Grafana; proactively resolved rebalance and latency issues.
  • Automated infrastructure provisioning across AWS, Azure, and on-prem environments using Terraform, enabling scalable and repeatable infrastructure deployments.
  • Managed AWS infrastructure including S3, IAM, Route 53, RDS/Aurora (PostgreSQL), VPC, and networking, ensuring secure, scalable, and highly available architectures.
  • Implemented security best practices including IAM policies, secure secrets management, and compliance with IaC-driven provisioning standards.
  • Defined and implemented SRE best practices, including SLIs/SLOs, error budgets, alert tuning, incident response runbooks, and capacity planning.
  • Reduced MTTR by implementing automated alerting, centralized logging, and advanced observability dashboards across Kubernetes, Kafka, and infrastructure layers.
  • Built comprehensive monitoring solutions using Prometheus, Grafana, ELK, OpenSearch, and Splunk to track cluster health, JVM metrics, API latency, resource utilization, and system performance.
  • Provided on-call support and follow-the-sun incident response (P1/P2), ensuring timely resolution aligned with SLA requirements and minimizing service disruption.
  • Automated deployment tasks, scaling operations, log analysis, and health checks using Python and Shell scripting, significantly reducing operational overhead.
  • Diagnosed and resolved issues related to TCP/IP, DNS, TLS/SSL, and load balancing in distributed systems.
  • Led production incident response (L2/L3), conducted root cause analysis (RCA), and implemented preventive reliability enhancements.
  • Executed Disaster Recovery (DR) testing across cloud and on-prem environments, ensuring multi-region resilience and business continuity for stateful and distributed systems.
  • Optimized infrastructure cost and performance through workload right-sizing, Kafka partition tuning, storage optimization, and autoscaling improvements.
  • Partnered with development teams to enhance application resiliency by improving readiness/liveness probes, graceful shutdown handling, retry mechanisms, and deployment strategies (rolling, blue/green).
  • AI-Driven DevOps Automation Platform (Jira → Jenkins → Kubernetes):
  • Designed and implemented a Spring Boot–based Jira Webhook Service that listens for Jira issue events and automatically triggers Jenkins pipelines for application restart and deployment workflows.
  • Built an event-driven DevOps automation framework integrating Jira, Jenkins, PostgreSQL, and GitHub Enterprise to convert Jira tickets into automated CI/CD actions across environments.
  • Implemented application discovery and validation using applicationList.json and a PostgreSQL service catalog table to prevent invalid deployments and enforce service allowlists.
  • Developed scheduled GitHub Enterprise repository synchronization to populate the application catalog nightly, ensuring the automation platform remains aligned with active microservice repositories.
  • Implemented parsing logic to extract application names and supported environments (dev, dit, eqa, prf, prodpilot) from Jira ticket summaries and descriptions to trigger environment-specific Jenkins jobs.
  • Integrated OpenAI-compatible LLM services (gpt-4.1-mini) to perform NLP-based normalization of service names and user inputs, improving ticket-to-action accuracy.
  • Implemented guardrails including Jira project key validation, environment allowlists, and service discovery verification to prevent unauthorized or incorrect automation execution.
  • Added centralized logging and audit tracking for Jira events, parsed actions, Jenkins job triggers, and pipeline responses to support debugging and compliance.
  • Built an AI-assisted DevOps automation system that converts Jira tickets into automated Jenkins deployments and service restarts across multiple Kubernetes environments. Implemented AI-assisted monitoring and anomaly detection to improve incident response and reduce MTTR in production systems.
  • Collaborated with cross-functional teams to develop, test, and deploy scalable software solutions.
  • Conducted root-cause analyses after major incidents to identify areas for process improvement or technical enhancement opportunities.
  • Contributed to the ongoing refinement of internal processes and procedures within the site reliability engineering discipline through regular reviews, updates, and knowledge sharing activities.
  • Designed load testing scenarios to validate application scalability under various traffic patterns and conditions.
  • Collaborated with software developers to refine code deployment strategies, enhancing product reliability and speed to market.
  • Trained and guided onsite and offshore team members in quality assurance standards, policies and procedures.

Cisco

Platform Engineer
12.2021 - 06.2022

Job overview

  • Environment: Ansible, Docker, Kubernetes, Helm, Istio, Jenkins, Argo CD (GitOps), Terraform, AWS, Ant, Maven, Gradle, GitHub, Bitbucket, Prometheus, Grafana, OpenSearch, Kibana, Splunk, MongoDB, Linux, Python, Shell Scripting, Java, Node.js, Rally
  • Responsibilities:
  • Designed and implemented containerization strategies for applications and microservices using Docker and Kubernetes, significantly improving deployment efficiency and environment consistency.
  • Automated infrastructure provisioning across AWS and on-premises environments using Terraform, enabling scalable and repeatable deployment pipelines.
  • Implemented GitOps workflows using Argo CD and Git-based repositories to ensure controlled, versioned, and consistent multi-environment deployments.
  • Built and maintained CI/CD pipelines using Jenkins and Argo CD to orchestrate automated build, test, and release processes for large-scale distributed systems.
  • Deployed and managed Kubernetes workloads using Helm charts and integrated Istio service mesh for traffic management, observability, and secure service-to-service communication.
  • Implemented monitoring and alerting solutions using Prometheus and Grafana to track system performance, availability, and reliability, supporting high uptime requirements.
  • Promoted releases across development, staging, and production environments using automated pipelines, ensuring deployment consistency and reduced manual intervention.
  • Developed and enhanced Python and Shell scripts to automate operational tasks, optimize infrastructure utilization, and streamline internal DevOps workflows.
  • Addressed performance bottlenecks by conducting thorough profiling and analysis of system components.
  • Collaborated with cross-functional teams to drive platform improvements and deliver business value.
  • Designed resilient systems that maintained high availability during peak usage periods.
  • Championed best practices in code quality through code reviews, refactoring initiatives, and documentation efforts.

Fusion Mint Technology (India)

DevOps Engineer
05.2016 - 12.2019

Job overview

  • Environment: Docker, Jenkins, Chef, AWS, Splunk, Java, Python, Linux
  • Responsibilities & Achievements:
  • Built and managed AWS infrastructure for test, pre-production, and production environments. Designed CloudFormation templates for VPC, subnets, EC2, RDS, Route 53, and Auto Scaling.
  • Implemented CI/CD pipelines using Jenkins to automate build and deployment processes.
  • Deployed and managed Docker-based applications across multiple environments.
  • Established monitoring and alerting systems using CloudWatch and Splunk.
  • Performed SRE duties including on-call support, incident management, capacity planning, and root cause analysis.
  • Implemented configuration automation using Chef and Python scripting.

Elvya Technologies (India)

Systems Analyst
10.2012 - 06.2015

Job overview

  • Environment: Java, Linux, Windows, Docker, Jenkins, CI/CD Tools, Chef, SonarQube, NetIQ, Git, GitHub, Maven, Splunk
  • Responsibilities:
  • Defined and implemented CI/CD best practices across the complete Software Development Life Cycle (SDLC), improving build reliability and deployment efficiency.
  • Architected and deployed a scalable, highly available Jenkins Master/Agent infrastructure integrated with SonarQube using Docker to support continuous integration and automated quality checks.
  • Led DevOps transformation initiatives by migrating legacy build and deployment systems to fully automated CI/CD pipelines, enhancing scalability and operational efficiency.
  • Designed and implemented a structured DevOps framework, establishing release governance, deployment standards, and operational controls with a focus on NetIQ-based applications.
  • Provided system administration support including installation, configuration, maintenance, and troubleshooting of CI/CD tools and supporting infrastructure.
  • Monitored and resolved infrastructure and build pipeline issues, performing root cause analysis to prevent recurring incidents.
  • Conducted training sessions and knowledge-sharing workshops to promote DevOps adoption and improve team proficiency with automation tools.
  • Maintained technical documentation, release reports, and compliance records to ensure alignment with organizational policies and industry standards.

Education

Jawaharlal Nehru Technological University
Anantapur, India

Master of Technology (M.Tech) in Computer Science and Engineering
09-2012

University Overview

  • GATE Scholarship recipient
  • GPA: 4

Jawaharlal Nehru Technological University
Kakinada, India

Bachelor of Technology (B.Tech) in Computer Science and Engineering
04-2010

University Overview

  • Awarded a medal for ranking first in the class.
  • GPA: 3.5

Skills

Cloud Platforms: AWS (EKS, EC2, RDS, S3, VPC, IAM), Azure, GCP

Containerization & Orchestration: Docker, Kubernetes, Helm, Istio, OpenShift

Streaming & Messaging: Apache Kafka (topics, partitions, ACLs, consumer groups, lag monitoring)

CI/CD & DevOps Automation: Jenkins, Bitbucket, Azure DevOps, Argo CD (GitOps), GitHub Enterprise, CI/CD Pipelines, Deployment Automation

Build Tools: Maven, Gradle

Infrastructure as Code: Terraform, CloudFormation, Ansible, Chef, Helm Charts

Monitoring & Observability: Prometheus, Grafana, ELK Stack, OpenSearch, Datadog, AppDynamics

SRE: Incident Response, On-call Support, RCA, Runbooks, SLIs/SLOs

Security: DevSecOps, IAM, Secure CI/CD, Compliance Automation

Programming & Scripting: Python, Java, Shell Scripting, SQL

Databases: Oracle, PostgreSQL, MySQL, MongoDB

Systems & Networking: Linux, TCP/IP, DNS, Load Balancing, HTTP/HTTPS, TLS

Operating Systems: Linux, Unix, Windows, macOS

AI & Automation Tools: GitHub Copilot, OpenAI APIs, LLM Integration, NLP-based Service Parsing, AI-driven DevOps Automation, AIOps Monitoring

Incident management

Accomplishments

  • Managed over 200 microservice applications on Kubernetes in production environments.
  • Improved deployment success rate by standardizing CI/CD automation.
  • Reduced production incidents through SRE-driven monitoring and alerting improvements.
  • Led multiple high-severity incident responses, reducing downtime and improving system resilience.
  • Improved MTTR through enhanced monitoring and incident response processes.
  • Implemented Kafka messaging platform for scalable event-driven microservices.
  • Successfully executed disaster recovery drills with zero data loss.
  • Developed an AI-assisted DevOps automation system that converts Jira tickets into automated Jenkins deployments and service restarts across multiple Kubernetes environments.

Certification

Certified Kubernetes Administrator (CKA) – The Linux Foundation

Timeline

Site Reliability Engineer

Williams Sonoma Inc.
06.2022 - Current

Platform Engineer

Cisco
12.2021 - 06.2022

DevOps Engineer

Fusion Mint Technology (India)
05.2016 - 12.2019

Systems Analyst

Elvya Technologies (India)
10.2012 - 06.2015

Jawaharlal Nehru Technological University

Master of Technology (M.Tech) in Computer Science and Engineering

Jawaharlal Nehru Technological University

Bachelor of Technology (B.Tech) in Computer Science and Engineering