3+ years of experience supporting AWS-based production systems in high-availability microservice environments. Skilled in Kubernetes (EKS), Docker, Linux systems engineering, observability, and Python automation. Experienced in 24/7 production operations, incident management, Disaster Recovery (DR), and performance troubleshooting using Dynatrace and AWS CloudWatch. MBA candidate with strong business acumen focused on operational excellence, reliability, and risk mitigation. Expertise in scripting, database management, and application performance tuning.
Overview
5 years of professional experience
Work History
Production Support Engineer
MANULIFE BANK OF CANADA
07.2024 - Current
Provide 24/7 production support for AWS-based applications, ensuring SLA/SLO compliance.
Monitor, maintain, review, and fix bugs in Python-based applications using Python APIs, shell scripting, SQL, Git, Dynatrace, and AWS CloudWatch.
Monitored system performance, identifying areas for improvement and implementing solutions.
Lead P1/P2 incident triage calls, representing the application team to ensure impact is mitigated smoothly.
Executed runbooks, coordinated escalations, and restored service within SLA timelines.
Documented processes and procedures to enhance knowledge sharing among team members.
Conducted root cause analysis on recurring incidents, driving long-term resolutions.
Implemented effective incident management strategies that minimized disruption to business operations during system outages or failures.
Follow Agile Scrum methodology with 3-week sprints and participate in sprint grooming and Project Implementation (PI) planning sessions.
Strong experience with Git.
Monitored application hosting services in the AWS console and optimized CPU/memory allocation to improve performance.
Investigated and proposed corrective actions for quality issues.
Automate operational health checks and monitoring workflows with Python, reducing manual effort by 30%.
Support CI/CD releases, deployment validation, rollback procedures, and release cutovers.
Execute Disaster Recovery (DR) testing, failover validation, and update recovery procedures to ensure operational readiness.
Monitor and manage SSL/TLS certificates proactively to prevent expirations and production impact.
Excellent communication and interpersonal skills; able to handle multiple tasks, take initiative to work independently, and contribute proactively as a team member.
Site Reliability Engineer
INFRAMART REALTECH INDIA PVT LTD
09.2021 - 04.2024
Managed AWS cloud infrastructure: EC2 provisioning, IAM configuration, VPC networking.
Streamlined incident response processes, reducing mean time to recovery through effective root cause analysis.
Led on-call rotations, providing critical support during outages and ensuring rapid restoration of services.
Deployed and maintained Docker containers and Kubernetes workloads across development and production environments.
Implemented Dynatrace APM monitoring for performance analysis and proactive incident detection.
Conducted root-cause analyses after major incidents to identify process improvements and technical enhancements.
Implemented cost-saving measures by optimizing resource utilization across cloud-based infrastructure environments.
Optimized database performance, analyzing and restructuring data storage solutions.
Managed Linux server administration: patching, disk management, security hardening, access control.
Developed Python automation scripts to streamline operational workflows, reducing manual effort by 30%.
Assisted in CI/CD deployment troubleshooting and Git-based release workflows.
Monitored infrastructure health using AWS CloudWatch metrics, logs, and alerting mechanisms.
Conducted DR planning and testing to validate failover readiness and system resilience.
Education
Master of Business Administration (MBA)
01-2025
Bachelor of Commerce (B.Com)
01-2019
Skills
AWS Cloud and Infrastructure
Incident Management
Disaster Recovery
Reliability & Production Engineering
SLA/SLO Monitoring
Root Cause Analysis (RCA)
Release & Deployment Validation
Docker
Kubernetes (EKS)
Monitoring & Observability
Dynatrace (APM, Infrastructure, Logs, Alerting)
Proactive Alerting & Dashboarding
Operating Systems
Linux (Ubuntu, RHEL, Amazon Linux)
Bash/Shell Scripting
Automation & DevOps
Python (Automation Scripts, Health Checks, Log Parsing)
Git
CI/CD Support & Troubleshooting
Accomplishments
Achieved 99.95% production uptime through proactive reliability engineering.
Reduced MTTR by implementing structured RCA documentation and incident response improvements.
Improved deployment consistency by 40% through Docker standardization.
Reduced operational workload by 30% through Python automation initiatives.
Increased incident detection speed by 35% through optimized monitoring and alerting strategies.
Executed DR testing and implemented recovery procedures, improving operational resilience.
Additional Information
Experience supporting financial services production systems.
Strong troubleshooting and analytical skills in high-pressure environments.
Eligible to work in Canada.
Available for on-call and rotational shift support.
Senior Technical Analyst, Application and Operational Support at TransUnion LLP