ProdOps Escalation Flow

Troubleshooting Matrix

Ticket

* A ticket should be created for EVERY Incident (e.g., in ZenDesk).

* The Ticket Subject should be descriptive. The Ticket # and Subject will be used as the unique Incident Name.

* If there are multiple issues related to the same Incident, tickets should be merged if possible, using the main Ticket # and Subject as the Incident Name.

* The Ticket #/link needs to be posted in the appropriate Teams channels (e.g., Sev1, Sev2, Sev3, etc.).
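
The naming rule above (Ticket # plus Subject as the unique Incident Name) can be sketched in a few lines of Python. This is only an illustration: the function name `incident_name` and the exact "#number subject" layout are assumptions, since the source does not prescribe a format.

```python
def incident_name(ticket_number: int, subject: str) -> str:
    """Combine the ticket number and subject into one Incident Name.

    The "#<number> <subject>" layout is a hypothetical convention.
    """
    cleaned = " ".join(subject.split())  # collapse stray whitespace
    if not cleaned:
        # The source asks for a descriptive subject, so reject an empty one.
        raise ValueError("Ticket Subject should be descriptive, not empty")
    return f"#{ticket_number} {cleaned}"


# Hypothetical ticket number and subject, for illustration only.
print(incident_name(12345, "US West  lab launch failures"))
```

When tickets are merged, the same function would be called with the main ticket's number and subject, so every post in Teams refers to the Incident by one name.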

Email Addresses that auto-create ZenDesk tickets

Email Address | Group | Definition
InfrastructureRequests@skillable.com | Infra | For requests that can go directly to the Infra team
InternalItRequests@Skillable.com | IT Help Desk | For internal IT Help Desk requests
Noc@skillable.com | NOC | For requests that can go directly to the NOC team
Solutions&Growth@skillable.com | Solutions & Growth | For requests that can go directly to the Solutions & Growth team (formerly Tier3)
Support@skillable.com | Platform Support | For requests that can go directly to the Platform Support team (typically from clients)
SysOpsRequests@skillable.com | SysOps | For requests that can go directly to the SysOps team
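
As a quick reference, the address-to-group mapping above could be kept as a lookup table. The sketch below is a hypothetical helper (the names `ROUTING` and `group_for` are not from the source) and only mirrors the list; it does not reflect how ZenDesk itself routes mail. Addresses are lower-cased before lookup because the list mixes capitalization.

```python
from typing import Optional

# Address -> owning group, copied from the list above (keys lower-cased).
ROUTING = {
    "infrastructurerequests@skillable.com": "Infra",
    "internalitrequests@skillable.com": "IT Help Desk",
    "noc@skillable.com": "NOC",
    "solutions&growth@skillable.com": "Solutions & Growth",
    "support@skillable.com": "Platform Support",
    "sysopsrequests@skillable.com": "SysOps",
}


def group_for(address: str) -> Optional[str]:
    """Return the group that receives tickets for this address, if any."""
    return ROUTING.get(address.strip().lower())


print(group_for("Noc@skillable.com"))
print(group_for("InternalItRequests@Skillable.com"))
```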

Platform Support (PS)

* Check customer tickets in ZenDesk queues

* Assess Open and New tickets based on age, client level and SLA to prioritize workflow

* Use skillset and product knowledge to resolve issues

* Use macro responses to handle repeat issues that have known resolutions

* Escalate to SysOps/ContentOps if an issue falls outside those criteria (i.e., it cannot be resolved with PS skillset, product knowledge or a known resolution)

* If issues need additional information or testing from the customer side, tickets stay in Pending/HOLD state until resolution
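
The prioritization step above (age, client level and SLA) could be expressed as a sort key. The source does not say how the three factors are weighted, so the ordering below (nearest SLA breach first, then client level, then oldest ticket) is an assumption, as are the `Ticket` fields and the numeric client-level scale.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ticket:
    ticket_id: int
    age_hours: float        # time since the ticket was opened
    client_level: int       # hypothetical scale: 1 = highest-tier client
    sla_hours_left: float   # hypothetical: hours until the SLA is breached


def prioritize(tickets: List[Ticket]) -> List[Ticket]:
    """Order tickets: nearest SLA breach, then client level, then oldest."""
    return sorted(
        tickets,
        key=lambda t: (t.sla_hours_left, t.client_level, -t.age_hours),
    )


# Made-up queue, for illustration only.
queue = [
    Ticket(101, age_hours=2, client_level=2, sla_hours_left=6),
    Ticket(102, age_hours=30, client_level=1, sla_hours_left=1),
    Ticket(103, age_hours=10, client_level=1, sla_hours_left=6),
]
print([t.ticket_id for t in prioritize(queue)])  # prints [102, 103, 101]
```

A team that weighs client level above SLA would only need to reorder the tuple in the sort key.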

NOC

The NOC is tasked with persistent monitoring of the operational status of each data center. They will also handle receipt of system alerts, which may indicate a problem has been detected, by using the NOC Playbook and related documents and procedures. They will report and/or escalate an incident which is identified as requiring action (immediate or for further investigation) and will assist in documenting if no specified procedure has been identified.

Alarm Handling and Escalation:

• All incident/issue escalations, reporting and communications from PS. This includes but is not limited to all the issues above. PS will report all items to the NOC Team for further action and escalation. Ideally, PS will only see customer-reported issues, and the NOC will proactively communicate to PS anything detected by automation.

• All internal and external incident communications via appropriate channels (Teams, status page & status.io). Client ticket communications are to be filtered back to PS.

• The NOC Team creates the initial incident alert and subsequent updates. Communication takes place in Teams. The NOC Team will maintain communications with all engaged teams to ensure everyone remains updated on the incident. There may also be a “Mission Control” chat room with all engaged team members, which should be attended.

• Severity Levels and SLAs for internal and external communications.

• Incident tracking via Zendesk.

Troubleshooting and Resolution:

• The NOC Team, working with other ProdOps members, uses standard procedures to troubleshoot and resolve incidents. In most cases the issue(s) will be too complex, or outside the scope and knowledge of the NOC Team, to be fixed by the NOC Team alone. This is when the NOC Team must engage other teams for assistance.

• The NOC Team ensures that final incident resolution is in place and that internal and external communications are made. Resolution means the issue is completely corrected or a suitable workaround is in place.

• Final incident resolution language for external communication follows templates or language agreed upon with the team working the incident. The NOC Team closes the Teams Broadcast.

Documentation and Reporting:

• Documentation is two-fold: (1) in the designated Teams channel above, and (2) in a Zendesk ticket created at the same time as the Teams post. The Zendesk ticket will mirror the Teams post comments and allow the NOC Team to track and report on incidents through Zendesk. The Zendesk ticket ID number should always be included with the initial Teams post.

Systems Operations (SysOps)

SysOps takes the handoff after initial triage from the NOC. This technical team supports all production systems, processes and equipment, and is responsible for resolving issues as part of escalations or enhancement projects. They will be called upon to help resolve issues and to help document production environment solutions.

Content Operations (Content Ops)

Content Ops acts as Tier 3 for product content related technical issues reported by customers. They take the handoff after initial triage from Platform Support. They perform high-level technical analysis of systems, processes and equipment issues and outages across the production environment. They are also responsible for researching and documenting various mitigation strategies.

Solutions

Available exclusively for our VIP customers. As a component of the escalation process, Solutions acts as both a Platform and Content design subject matter expert, internally for Production Operations and externally for customer content stakeholders. The team solves systemic problems revealed through transactional break/fix tickets while aiding in the triage of high-profile, urgent issues.

Infra (Infrastructure)

Builds and maintains the data centers (including cloud), servers, networking/internet, storage, virtualization (VMs), etc. Infra will get involved once it is determined that the Incident could be infrastructure related.

Their resolution methods could include tweaking/swapping resources and/or re-routing network traffic and offering monitoring/alerting improvements.

BUG?

* Use Bug Hub to Submit a Bug Report => https://holsystems.sharepoint.com/sites/OperationsCenterPortal/

* New process => https://holsystems.sharepoint.com/sites/SkillableIncidentEscalationManagement/SitePages/Bug-Report-Process.aspx

* Bug Reporting Form => https://forms.office.com/Pages/ResponsePage.aspx?id=k7V24NvXjkqVVgS76A33LZTAoYe6UFhJkt8ZrDP81V9UNlBNRVAzOFJOTFQyQUhDTTc1VENERTdWTi4u

* The Bug Form auto creates a ZenDesk (ZD) ticket for the NOC. The NOC will follow the newly formed ZD to ADO integration process [Bug -> NOC ZD Bug Ticket -> ADO Ticket] so these reports can be tracked in both ticketing systems to provide increased visibility for both the NOC and PS, who can then relay pertinent/relevant progress to customers.

* ProdOps will take first 'whack' at the issue (as they do with all other reported issues) to vet this so-called 'bug' before escalating via ticket to Dev/Product.

* All communication on this 'bug' will take place in the ZD ticket, as the sole source of truth.

* Product has provided a SME (subject matter expert) alert path that has been inserted as a layer pre-Dev involvement.

* Future: We are researching if we can create a bug dashboard that will show status of bug reports.

Status IO

Part of the 'Conclude' process as Incident resolves.

The NOC will ensure that pertinent status updates are posted to: https://status.skillable.com/

Examples:

Title: US West and APAC Disruptions

Created: November 4, 2022 3:45PM

Resolved: November 4, 2022 5:59PM

Title: Lab Launch & Scoring Interruption

Created: September 8, 2022 7:05PM

Resolved: September 8, 2022 7:05PM

Posts to Stakeholders

Part of the 'Conclude' process as Incident resolves.

Incident summary should be completed and posted to the respective SEV channel for review by the Stakeholders of the event (determined by the event) within 24 hours.

RCA

Part of the 'Conclude' process as Incident resolves.

An RCA (Root Cause Analysis) should contain the following to properly describe the issues, incident, troubleshooting process, timeline, outcome and remediation efforts:

ITEM/ISSUE | DESCRIPTION/FORMAT | NOTES
INCIDENT DATE | DD/MM/YY |
START TIME | NN:NN | Issues, Incident first reported. Include AM/PM & Time Zone.
RESOLUTION TIME | NN:NN | Incident resolved/closed. Include AM/PM & Time Zone.
INCIDENT NAME | USE TICKET NAME |
TIMELINE | | Grab from ticket body using highlights most pertinent to customer.
FINDINGS & ROOT CAUSE | | Grab from Incident notes.
CORRECTIVE ACTIONS: IMMEDIATE | | Grab from Incident notes.
CORRECTIVE ACTIONS: LONG TERM | | Grab from Incident notes.
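
The RCA items listed above can be captured as a small record that renders the template and reports which sections are still empty. This is a minimal sketch: the class name `RCA`, its method names and the placeholder values are assumptions for illustration, not part of the documented process.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RCA:
    incident_date: str       # DD/MM/YY
    start_time: str          # NN:NN, with AM/PM and time zone
    resolution_time: str     # NN:NN, with AM/PM and time zone
    incident_name: str       # use the ticket name
    timeline: str = ""
    findings_and_root_cause: str = ""
    corrective_actions_immediate: str = ""
    corrective_actions_long_term: str = ""

    def rows(self) -> List[Tuple[str, str]]:
        """Pair each RCA item label with its current value."""
        return [
            ("INCIDENT DATE", self.incident_date),
            ("START TIME", self.start_time),
            ("RESOLUTION TIME", self.resolution_time),
            ("INCIDENT NAME", self.incident_name),
            ("TIMELINE", self.timeline),
            ("FINDINGS & ROOT CAUSE", self.findings_and_root_cause),
            ("CORRECTIVE ACTIONS: IMMEDIATE", self.corrective_actions_immediate),
            ("CORRECTIVE ACTIONS: LONG TERM", self.corrective_actions_long_term),
        ]

    def missing(self) -> List[str]:
        """Labels of items that have not been filled in yet."""
        return [label for label, value in self.rows() if not value.strip()]

    def render(self) -> str:
        """Render the RCA as one 'LABEL: value' line per item."""
        return "\n".join(f"{label}: {value}" for label, value in self.rows())


# Placeholder values only; a real RCA takes these from the ticket and notes.
draft = RCA("DD/MM/YY", "NN:NN AM TZ", "NN:NN PM TZ", "#12345 Example incident")
print(draft.missing())
```

Checking `missing()` before posting is one way to make sure the timeline, findings and both sets of corrective actions have been copied over from the ticket and Incident notes.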

Post Mortem

Part of the 'Conclude' process as Incident resolves.

The Incident Communicator will schedule a postmortem with all incident staff, and any leadership wishing to attend, to discuss the determined cause and the remediation steps to prevent recurrence.
