SITE RELIABILITY ENGINEER 01/27/2017 - CURRENT
LUMEN - Operations Support System
Skills: Unix shell scripting, Python, JavaScript, Oracle SQL queries
The objective of this project is to maintain various applications in a telecom IT system. It involves managing the Operations Support System (OSS) applications that handle provisioning, service delivery, service assurance, service quality, network monitoring, and network repair and maintenance.
Responsibilities:
- Initiating application troubleshooting bridges with the development and Unix teams for high-priority application outages.
- Collaborating with software engineers and client SMEs to design and implement deployment approaches using automated continuous integration.
- Collaborating with the development team, key stakeholders, and team members to resolve complex problems.
- Setting up application monitoring and optimization, and managing system performance and resources in the automated monitoring tool.
- Troubleshooting application slowness and business-as-usual workflow issues in the Lumen applications.
- Resolving user-created ServiceNow tickets for functionality errors, application access issues, customer cable line record updates, etc.
- Working with the operating system team to clear caches and bounce applications during OS patch upgrades on the production servers.
- Writing shell and Python scripts that monitor file system usage, application health, JVM heap utilization, etc.
- Bouncing applications and clearing caches when users report application slowness. Investigating scheduled cron job failures using the Linux server logs and application logs.
- Analyzing application logs for user-reported technical errors and suggesting code fixes to the development team.
- Performing half-yearly application failover from the primary server to the secondary server.
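As an illustration of the kind of monitoring script described above, here is a minimal Python sketch. It assumes a Linux host, a hypothetical 85% alert threshold, and a JDK whose `jstat -gc <pid>` output includes the S0C/S1C/EC/OC capacity columns and S0U/S1U/EU/OU usage columns. The function names and threshold are illustrative only and are not taken from the original scripts.

```python
"""Minimal health-check sketch: file system usage and JVM heap utilization.

Function names, mount points, and thresholds are hypothetical examples,
not values from the original environment.
"""
import shutil


def fs_percent_used(path: str = "/") -> float:
    """Percentage of the file system containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple (total, used, free) in bytes
    return usage.used / usage.total * 100 if usage.total else 0.0


def heap_percent_used(jstat_gc_output: str) -> float:
    """Heap usage computed from `jstat -gc <pid>` text (header line + value line).

    Heap = survivor spaces + eden + old generation; metaspace is excluded.
    Columns are looked up by name, and only the needed ones are converted,
    so extra columns (including ones printed as "-") are ignored.
    """
    header, values = jstat_gc_output.strip().splitlines()[:2]
    row = dict(zip(header.split(), values.split()))
    used = sum(float(row[c]) for c in ("S0U", "S1U", "EU", "OU"))
    capacity = sum(float(row[c]) for c in ("S0C", "S1C", "EC", "OC"))
    return used / capacity * 100 if capacity else 0.0


def check(name: str, percent: float, threshold: float) -> bool:
    """Print a status line and return True when `percent` is under `threshold`."""
    ok = percent < threshold
    status = "OK" if ok else "ALERT"
    print(f"{status}: {name} at {percent:.1f}% (threshold {threshold:.0f}%)")
    return ok


def main(path: str = "/", threshold: float = 85.0) -> int:
    """Return 0 when file system usage is under `threshold`, else 1 (cron-friendly)."""
    return 0 if check(f"filesystem {path}", fs_percent_used(path), threshold) else 1
```

A cron entry could run a small wrapper that calls `main()` and exits with its return code, which lets the scheduler or a monitoring agent flag the failure. For heap checks, the text captured from `jstat -gc <pid>` can be passed to `heap_percent_used`. Looking columns up by header name keeps the parser tolerant of the extra columns that newer JDKs print.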