Josh Hess

Centerburg, OH

Summary

An accomplished, solution-oriented Engineering Leader with broad-based IT knowledge, 19 years of relevant experience, and a drive to achieve organizational objectives. Adept at promptly and thoroughly mastering new technologies and adapting to existing architectures to produce immediate results.

Exceptional team leadership, mentoring, communication, and interpersonal skills that inspire teams while instilling a culture of innovation. Proficiency in implementing DevOps, SRE and ITIL principles to streamline processes and achieve organizational goals.

Overview

12 years of professional experience
1 Certification

Work History

Incident & Problem Management Specialist

Bread Financial
11.2024 - Current

Directed cross-functional teams through high-pressure, high-visibility incidents with an outcome-focused approach while providing executive-level real-time communications. Owned the end-to-end Problem Management lifecycle, emphasizing proactive trend analysis, preventative remediation, and data-driven continuous improvement. Partnered with Engineering, Infrastructure, and Operations leaders to identify systemic gaps in monitoring, testing, and change management to abate customer friction and operational risk.

Key Milestones

  • Drove a 50%+ reduction in MTTR and RCA completion timelines through improved facilitation, accountability, and data transparency.
  • Served as Principal Lead in the successful transition of Incident and Problem Management capabilities from an external vendor to an in-house 24x7 team, earning executive-level recognition for execution and outcomes.
  • Designed and implemented standardized documentation, onboarding frameworks, and cross-team training materials to improve operational maturity and scalability.

Site Reliability Engineering Manager

Olo
12.2020 - 06.2023

Instituted a Site Reliability Engineering team of 6 Engineers with a defined Mission/Vision and a multi-year roadmap with supporting OKRs. Played a pivotal role in fostering an SRE culture throughout Olo to improve platform reliability and stability and enhance the overall customer experience.

Developed and implemented an industry-standard Event / Incident Management framework to ensure business SLAs were met or exceeded. Established a Problem Management Framework with defined SLAs and Error Budget to identify trends, continuously improve system availability and reduce MTTR. Oversaw all aspects of the Incident Response and Incident Commander roles, responsibilities and best practices for 25 engineers supporting a 24/7/365 on-call rotation.

Key Milestones

  • Implemented a Game Day Framework to continuously test our Disaster Recovery posture and improve system availability by automating recovery.
  • Cataloged and documented all essential services in Datadog to streamline Event Management process and standardize Incident response tooling.
  • Spearheaded an Operational Readiness Maturity Assessment across all engineering teams to identify areas of opportunity, resulting in the ability to identify and reduce tech debt, elevate system reliability, improve testing quality and standardize the production support model.

Digital Ops Service Delivery Manager

Nike
10.2018 - 12.2020

Responsible for the oversight and execution of a portfolio of platforms and services that drive Global Technology for Nike.com. Provided operational oversight to ensure alignment with organizational governance, goals and strategy for the Commerce Support team. Primary point person collaborating with global technology operations, project management and engineering teams to ensure effective and efficient service delivery across all mobile and web platforms.

Key Milestones

  • Streamlined support processes to increase collaboration during high-volume sneaker launches, high-impact brand moments and the peak holiday season. This optimization resulted in system availability of 99.99% and improved the overall customer experience.
  • Implemented Incident and Problem Management best practices to improve availability metrics for Nike.com and mobile applications.
  • Crafted weekly platform availability reports for Engineering, Product and Executive leadership teams to continuously evaluate service health and error budgets.

Site Reliability Engineering Manager

Nike
06.2016 - 10.2018

Led a team of 8 engineers with the primary focus of improving Observability, developing automation and partnering with Product leadership to balance prioritization of improving system Availability and developing new features. Created and implemented an SRE support model aligned with industry standards and best practices to achieve organizational availability goals. Oversaw the creation of proactive and correlative monitoring to provide service health metrics and reduce mitigation time for high-impact incidents.

Key Milestones

  • Reduced public cloud spend by $1.5 million per year while delivering 100% uptime for all authentication services for 3 consecutive years.
  • Collaborated with Product, Engineering and Security teams to develop policies and software to decrease bot activity during SNKRS launches. This initiative elevated customer satisfaction by an average of 40% month over month, increasing customers’ ability to purchase.
  • Received 1 of 3 Nike Privacy awards in 2018 for contributions in building software, processes and a team to support GDPR compliance.

VP, Major Incident Team Lead

J.P. Morgan Chase
11.2014 - 06.2016

Mentored Incident Managers to promote a risk-aware culture, adhere to standard operating procedures and develop efficient and effective compliance management practices. Facilitated the remediation effort of high-priority incidents by coordinating Senior Incident Managers and Engineering leaders via a technical bridge. Responsible for day-to-day Incident Management from both an application and infrastructure perspective.

Key Milestones

  • Improved Incident Management process in collaboration with global managers to reduce MTTR by an average of 40%.
  • Reduced onboarding time by 30% for new Incident Managers by updating incident management best practices and protocols.

Production Operations Manager

Hewlett-Packard
11.2013 - 11.2014

Provided leadership oversight for day-to-day Service Desk operations, serving as the primary escalation point to ensure rapid issue resolution and consistent achievement of productivity and service targets. Acted as the operations support liaison for large-scale migration initiatives, coordinating maintenance activities and proactively mitigating risks to system availability. Partnered cross-functionally to align production support capabilities with customer requirements and enterprise objectives. Led continuous improvement initiatives focused on enhancing service quality, operational efficiency, and delivery throughput.

Key Milestones

  • Played a key role in defining and executing a comprehensive Business Continuity Plan in advance of a large-scale operating system migration from Solaris to HP-UX Linux, ensuring service stability and risk mitigation.
  • Developed and implemented internal operational support standards and productivity benchmarks to consistently meet and exceed contractual SLAs, earning formal recognition from the HP Account Executive and the CIO of Ohio Medicaid.
  • Defined role responsibilities, governance, and best practices for Incident and Problem Management, standardizing the production support operating model across teams.
  • Drove a 70% reduction in average incident resolution time through the introduction of structured processes, clear ownership, and improved operational workflows.

Skills

  • Site Reliability Engineering
  • DevOps Engineering
  • ITIL Governance & Operational Maturity
  • Strategy & Execution
  • Collaboration & Cross-Functional Leadership
  • Relationship Building & Stakeholder Management
  • Problem Solving & Critical Thinking
  • Innovation & Continuous Improvement
  • Service Reliability & Risk Reduction
  • Software Development Lifecycle
  • Process Redesign & Change Management
  • Disaster Recovery & Backup Solutions

Certification

  • eCornell Technology Leadership
  • DevOps Leadership
  • DevOps Essentials
  • ITIL Strategist: Direct, Plan and Improve
  • ITIL Specialist: Create, Deliver and Support
  • ITIL Specialist: Drive Stakeholder Value
  • ITIL Specialist: High Velocity IT
  • ITIL Foundation: IT Service Management
  • Agile Scrum Essentials

Timeline

Incident & Problem Management Specialist

Bread Financial
11.2024 - Current

Site Reliability Engineering Manager

Olo
12.2020 - 06.2023

Digital Ops Service Delivery Manager

Nike
10.2018 - 12.2020

Site Reliability Engineering Manager

Nike
06.2016 - 10.2018

VP, Major Incident Team Lead

J.P. Morgan Chase
11.2014 - 06.2016

Production Operations Manager

Hewlett-Packard
11.2013 - 11.2014