12+ years of experience in Linux administration, system/application monitoring and automation.
• Developing Service Level Objectives (SLOs) for production systems.
• Managing incident response teams.
• Documenting RCAs and corrective actions, and leading postmortems.
• Reviewing system design & architecture documentation.
• Preparing materials addressing security controls.
• Automating tasks and manual steps with Python.
• Tuning monitoring so that every alert that fires requires action.
• Improving observability of relevant system metrics.
• Analyzing observability metrics, logs, and traces.
• Maintaining the team's Git repository.
• Building server farms on Linux and Solaris platforms.
• Troubleshooting network and DNS issues.
• Writing Nagios alerts for effective monitoring.
• Configuring and managing Splunk.
• Racking servers and storage, and installing operating systems.
• Performing server health checks, backups, and general troubleshooting.
• Deploying packages and patches at client sites.