Seasoned Lead DevOps Engineer with 10+ years of experience architecting and automating scalable, cloud-native infrastructure and CI/CD pipelines across AWS and GCP. Expert in Kubernetes (EKS, GKE), GitOps, infrastructure as code (Terraform, CloudFormation), and observability tools such as Prometheus and Grafana. Proven leader of cross-functional teams delivering cost-effective cloud solutions. Skilled in MLOps and data pipeline automation, enabling faster, more reliable ML model deployments and improved operational reliability.
· Provisioned and managed over 100 Linux-based servers across development, staging, and production environments, ensuring consistent configuration and operational readiness.
· Monitored and resolved critical system outages and performance issues, reducing average incident response time by 45% and maintaining 99.9% system uptime.
· Performed scheduled patching and upgrades using automated scripts, improving system security posture and maintaining compliance with IT governance standards.
· Led user and access management using Linux account management, LDAP, and role-based access control (RBAC), enhancing system security and operational control.
· Developed automated scripts in Bash and Python to streamline repetitive server maintenance tasks, reducing manual workload by 60% and eliminating configuration drift.
· Collaborated with development and operations teams to support application deployments, troubleshoot environment issues, and optimize system resource usage.
· Maintained configuration consistency across environments and enforced change control processes using Git and Jenkins.
Managed and maintained application server infrastructure on Linux-based platforms, ensuring high availability and optimal performance.
Diagnosed and resolved server-related issues, conducted root cause analysis, and implemented performance tuning strategies to enhance system stability.
Performed regular patching, security updates, and version upgrades on production and non-production servers, ensuring compliance and reducing vulnerabilities.
Coordinated cross-team incident response and production support, ensuring 99%+ uptime, rapid issue resolution, and minimal disruption to business operations.