Dynamic Lead Site Reliability Engineer at Atlassian Inc., recognized for driving multimillion-dollar cost savings and enhancing system reliability. Expert in AWS cloud services and Kubernetes orchestration, with a proven ability to mentor teams and optimize workflows. Achieved a 60% reduction in incident analysis time with a Model Context Protocol-based AI service.
Overview
17 years of professional experience
Work History
Lead Platform Engineer
Atlassian Inc.
Austin, United States
11.2020 - Current
Platform Unification & AI Enablement - Spearheading the design and implementation of a unified Web Asset & Content (WAC) platform to consolidate 8+ duplicative services, targeting an 80% reduction in service sprawl and 30% lower operational costs.
Developer Empowerment and Standardization - Establishing platform-wide standards for localization, CMS, and edge caching to combat fragmentation, accelerate experiment velocity, and provide product teams with a seamless, self-service foundation for AI-ready web experiences.
AI-Powered Data Analysis - Built a Model Context Protocol (MCP)-based AI service, reducing DevOps/incident analysis time by 60%.
Future-Proofing and Governance - Architecting the platform with embedded AI and global readiness (LLMs, Rovo) and implementing platform governance to prevent future sprawl, directly supporting Atlassian's next-generation growth initiatives.
Multimillion-Dollar Cost Savings - Led the migration of AWS resources (EC2, DynamoDB, Elasticsearch) to low-cost regions, saving over $1.2 million annually with zero customer impact.
Global Performance Optimization - Architected edge caching for Atlassian.com, cutting infrastructure costs by 30% and improving time to first byte (TTFB) by 85%.
Security & Scalability - Developed a custom rate-limiting tool (Nginx, Redis, Lua), reducing DDoS attacks by 60%.
Incident Leadership - Resolved 50+ major cloud incidents, boosting Atlassian.com availability by 40% via Linux tuning, SLO management, and Datadog/Splunk monitoring.
Developer Productivity - Built an EC2 provisioning tool (Rust/Terraform) that slashed development environment setup time by 40%.
Senior Site Reliability Engineer
Samsung Ads
Mountain View, United States
09.2018 - 11.2020
Designed real-time bidding (RTB) resilience for ad pipelines, reducing failover time from 5 min to under 30 s.
24/7 Incident Management - Reduced alert fatigue by 60% via enhanced monitoring (Graphite, Grafana, Prometheus) and proactive tuning of PagerDuty workflows.
Kubernetes Migration - Led the shift to Kubernetes (kops, Helm 2/3) with observability via Prometheus/Alertmanager, improving scalability and resilience.
Cross-Cloud Monitoring - Unified metrics across AWS (CloudWatch), Check_MK/Nagios, and custom tools (Python/Shell) for full-stack visibility.
NoSQL Expertise - Managed Cassandra & Vertica clusters for high-throughput ad analytics, optimizing performance and uptime.
CI/CD Optimization - Cut release times by 20% using Jenkins, AWS (EBS, EC2, RDS), and custom PowerShell/Bash scripts, boosting dev team productivity by 70%.
Infrastructure as Code (IaC) - Managed AWS resources via Terraform, ensuring scalable and repeatable deployments.
Security & Code Quality - Ran ZAP Proxy security scans and maintained SonarQube on Azure, improving code quality by 40%.
Multi-Platform CI Setup - Automated builds for ASP.NET, C#, Node.js, Ruby on Rails, and JavaScript across IIS, NGINX, Apache (Windows/Linux).
Linux/Windows Administration - Managed production servers, ensuring uptime and performance.