ITIL Process Workshop

by Tara Brant on 17 January 2014

Transcript of ITIL Process Workshop

ITIL
Information Technology Infrastructure Library
ITIL is a set of best practices for IT service management (ITSM) that focuses on aligning IT services with the needs of the business.
The Service Lifecycle
Service Strategy
Service Design
Service Transition
Service Operation
Continual Service Improvement
ITS Staff
Overview
Service Strategy provides guidance on how to design, develop, and implement Service Management not only as an organizational capability, but also as a strategic asset.
Purpose and Objectives
The purpose of Service Strategy is to design and execute plans to meet business requirements.
Execute the 4 Ps - Perspective, Position, Plans and Patterns.
Objectives of Service Strategy:
Provide knowledge about the concept of strategy
Describe Services and the Customers of these Services
Explain value creation and delivery
Identify opportunities to provide Services and exploit the Services
Value to the Business
The benefits organizations can achieve through Service Strategy:
Link the performance of the organizational activities to their business goals
Know the Service types and levels that will make the organization's customers succeed and organize its Service Assets to deliver and support those Services
Respond quickly and effectively to Changes in the organization and ensure improved competitive advantage in the long run
Overview
Service Design provides guidance for the design and development of Services and Service Management processes.
Purpose and Objectives
The purpose of Service Design is to:
Design IT Services and the governing IT practices, processes and policies
Realize the strategy of the Service Provider
Help introduce the Services into the live environments
The objective of Service Design is to design effective IT Services so that only nominal improvements are required during their Lifecycle.
Value to the Business
With good Service Design it is possible to deliver quality, cost-effective Services to ensure that business requirements are met. The resulting benefits are:
Reduced Total Cost of Ownership (TCO)
Improved Service quality
Easier implementation of new or changed Services
Improved alignment and performance of Service
More effective Service Management and IT processes
Improved information and decision making
Improved alignment with the values and strategies of the customer
Overview
Service Transition provides guidance on the development and improvement of capabilities for transitioning new and changed Services into operations.
Purpose and Objectives
The purpose of Service Transition is to make sure that new, modified, or retired Services meet business expectations, as defined in Service Strategy and Service Design stages of the Lifecycle.
The objectives of Service Transition are:
Plan and manage the Changes in the Service effectively and efficiently
Manage Risks that are related to new, changed, or retired Services
Deploy the Release of the Service into the live environment successfully.
Value to the Business
The benefits of Service Transition include:
Allowing the projects to assess the cost, timing, resources needed, and Risks associated with the Service Transition stage correctly
Enabling higher volumes of successful Changes
Making Changes and their impact easier to adopt and follow
Making sure that the assets of Service Transition are shared and reused across projects and Services
Overview
Service Operation provides guidance on achieving efficiency and effectiveness in the delivery and support of Services to ensure value for the customer and the Service Provider.
Purpose and Objectives
The purpose of Service Operation includes:
Coordinating and performing the activities and processes that are needed to deliver and manage Services at agreed levels for business users and customers
Being responsible for the continuing management of the technology that is used to deliver and support Services
Conducting, controlling, or managing Services appropriately so that well-planned and well-implemented processes are working correctly
Conducting daily Service improvement activities, such as monitoring performance, assessing metrics, and gathering data
The objectives of Service Operation include:
Maintain business satisfaction and confidence in IT through effective and efficient delivery and support of agreed IT services
Minimize the impact of Service outages on day-to-day business activities
Ensure that access to agreed IT services is provided only to authorized users
Value to the Business
The benefits of Service Operation include:
Reduce unplanned labor and costs for both the business and IT through optimized handling of Service outages and identification of their root causes
Reduce the duration and frequency of Service outages which will allow the business to take full advantage of the value created by the services they are receiving
Provide operational results and data that can be used by other ITIL processes to improve services continually and provide justification for investing in on-going Service improvement activities and supporting technologies
Meet the goals and objectives of the organization's security policy by ensuring that IT services will be accessed only by authorized users
Provide quick and effective access to standard services which business staff can use to improve their productivity or the quality of business services and products
Value to the Business
The value of CSI could be described as:
Increased organizational competency
Integration between people and processes
Reduction of redundancy, resulting in increased business throughput
Minimized lost opportunities
Assured regulatory compliance that will minimize costs and reduce Risks
The ability to react to change rapidly
Overview
You cannot manage what you cannot control.
You cannot control what you cannot measure.
You cannot measure what you cannot define.
Purpose and Objectives
The purpose of Continual Service Improvement (CSI) is to continually align and realign IT Services with the changing requirements of the business by identifying and implementing improvements to the IT Services that support the business needs.
The objectives of CSI are:
Review, analyze, prioritize, and make suggestions on improvement opportunities in each stage of the Service Lifecycle
Review and analyze the level of achievement of a Service
Identify and implement activities to improve the quality of IT Service and increase the efficiency and effectiveness of the enabling processes.
Welcome!
Process
Workshop

Incident Management
Processes
Incident Management
Problem Management
Request Fulfillment
Event Management
Access Management
Processes
Service Asset & Configuration
Management
Change Management
Release & Deployment
Management
Knowledge Management
Processes
Service Level Management
Service Catalog Management
Availability Management
Capacity Management
Supplier Management
Change Management
Problem Management
New Process Implementations
Process Improvement Initiatives
Asset & Configuration Management
Event Management
Financial Management
Incident Management
Inputs:
Incident Management
Outputs:
RFC/Change Request
Event Management Monitoring Alert
Metrics
The first trouble with measuring ITIL is that you need a certain level of maturity to gather baseline metrics, and a good many organizations are not in a position to capture them. The second trouble is that most of the metrics they do capture are worthless, as good as marks on a chart.

The true value comes from metrics that:
Inform
Support Decisions
Prompt Action

That metric simply informs you that something is happening but with no insight into how well it is being done.
How do the ITIL processes fit together?
ITSM Roadmap 2014-2015
Documentation
https://sp2010.ucop.edu/sites/its/svcmgmt/Shared%20Documents/Forms/By%20Process.aspx?RootFolder=%2Fsites%2Fits%2Fsvcmgmt%2FShared%20Documents%2FIncident%20Management&FolderCTID=0x012000583C156CC0100047AE0F5B15A9D09E50&View={2BA3C3D2-49F4-45D3-B741-28A66353C197}
ITSM Program Project Plan
Process Documentation:
Project Plan:
Insert link to IM project plan
Responsibilities
Customer, Service Desk, Event Management System, 2nd Level/3rd Level Support
Responsibilities:
Detects the Incident
Contacts the Service Desk through e-mail, ServiceNow.com or phone call
After being notified of a resolution, verifies that the Incident has been resolved
If the Incident is not resolved, notifies the Service Desk
Escalates the issue to the Manager of the affected service (when necessary)

1st Level Support Teams: Service Desk/Data Center Operations
Responsibilities:
Records, categorizes and routes the Incident
Provides initial diagnosis and resolves the Incident (if possible)
Changes the Incident state to “Resolved”
If unable to resolve, assigns the Incident ticket to the appropriate support team
Coordinates and tracks the Incidents through resolution
Escalates the Incident when necessary
Communicates with the customer as necessary to update and obtain additional information or to verify Incident resolution
Notifies the on-call support team and the manager of the support team of the affected area (Incident Coordinator) when a major outage has occurred

2nd Level Support Teams: Application & Infrastructure Teams, Desktop Support, etc.
Responsibilities:
Investigates, diagnoses and implements a resolution or workaround within the target resolution times
Escalates the Incident when necessary
Obtains additional information from customers (as needed)
Documents the Incident resolution within one hour after the Incident has been resolved
Changes the Incident state to “Resolved”
Provides status updates to the Incident Coordinator for Major Incidents
Communicates with the customer as necessary to update and obtain additional information or to verify Incident resolution
Updates the work notes in the Incident ticket
Completes the Outage Report within 48 hours from the start of a Major Incident/P1 resulting in an outage

3rd Level Support Teams: External Vendors
Responsibilities:
Provides technical support and expertise
Investigates, diagnoses and implements a resolution or workaround within the agreed target resolution times
Obtains additional information from 2nd Level Support/customers (as needed)
Documents the Incident resolution within one hour after the Incident has been resolved
Updates the work notes in the Incident ticket. A vendor without access to our ticketing system notifies the 2nd Level Support Team, who will then update the Incident work notes.

Incident Manager: Data Center Operations Manager/Supervisor
Responsibilities:
Responsible for driving the resolution of Major Incident/P1 Incidents
Responsible for the Incident Management process (proactive and reactive Incident Management)
Informs the 1st Level Support Teams of the status of the outage, which communication notifications are required, and the details of those communications, as defined in the Incident Notification Process document
Responsible for reporting on metrics and trends that demonstrate the effectiveness of the Incident Management process and Team

Incident Coordinator/Assignment Group Manager
Responsibilities:
Reviews all tickets assigned to the group on a regular basis
Assigns the tickets to the appropriate individual (may be delegated, but assumes responsibility)
Ensures that the resolutions are well documented in customer-friendly terms and that the Incidents are resolved within the agreed resolution time
During a Major Incident, responsible for providing the status of the outage to the Incident Manager

Assignee
Responsibilities:
Resolves the Incident
Updates the Incident ticket
Changes the Incident state to “Resolved”
Reassigns the ticket as necessary

Objective: Quick resolution of Incidents
Metric:
% of Incidents resolved within target timeframe
By priority
By Assignment group
By Assigned to
By date

Objective: Improved customer satisfaction
Metric:
Number of customer satisfaction surveys sent to customers
Number of responses
Average customer satisfaction survey score
Trend of customer satisfaction score by quarter

Objective: Increased incident resolution efficiency
Metric:
% of Incidents resolved by Service Desk - 1st, 2nd, 3rd Level
% of Incidents escalated by Service Desk - All Levels
Mean time to restore service from point of first call
Mean time to restore Critical Incidents
Mean time to restore High Incidents
Mean time to restore Routine Incidents
Number of Incidents caused by changes

Objective: Meet agreed system availability and performance
Metric:
Availability and performance metrics
Age of Active Incidents
Average duration time for Incident resolution
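
As a rough illustration of two of the metrics listed above, the sketch below computes the percentage of Incidents resolved within target and the mean time to restore for one priority. The target times in `TARGETS`, the record fields and the sample Incidents are all hypothetical; the workshop material does not state the agreed targets.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical target resolution times; the real targets come from the agreed service levels.
TARGETS = {
    "Critical": timedelta(hours=4),
    "High": timedelta(hours=8),
    "Routine": timedelta(days=3),
}


def pct_within_target(incidents, priority):
    """Percentage of resolved Incidents of one priority that met the target time."""
    durations = [i["resolved"] - i["opened"] for i in incidents if i["priority"] == priority]
    if not durations:
        return None  # nothing to measure for this priority
    met = sum(1 for d in durations if d <= TARGETS[priority])
    return 100.0 * met / len(durations)


def mean_time_to_restore(incidents, priority):
    """Mean time to restore, in hours, for one priority."""
    hours = [
        (i["resolved"] - i["opened"]).total_seconds() / 3600
        for i in incidents
        if i["priority"] == priority
    ]
    return mean(hours) if hours else None


# Made-up sample data, for illustration only.
incidents = [
    {"priority": "Critical", "opened": datetime(2014, 1, 6, 9, 0), "resolved": datetime(2014, 1, 6, 11, 0)},
    {"priority": "Critical", "opened": datetime(2014, 1, 7, 9, 0), "resolved": datetime(2014, 1, 7, 15, 0)},
]

print(pct_within_target(incidents, "Critical"))     # 50.0
print(mean_time_to_restore(incidents, "Critical"))  # 4.0
```

The same functions could be grouped by assignment group, assignee or date by filtering the list before calling them.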
Request Fulfillment
What is the definition of an Incident?
An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet affected service is also an Incident.
Purpose & Objective
The goal of Incident Management is to restore the service as quickly as possible to minimize the disruption caused by the service outage. A workaround qualifies as a restoration of the service.

The objective of the Incident Management process is to create a standardized method for coordinating and managing the Incident life cycle.
Scope
The scope of the Incident Management Process includes any event that disrupts normal Data Center Operations of the ITS services environment, including but not limited to hardware, software, voice, applications, databases, servers, virtual machines, mobile services, batch jobs, data centers, etc.

Incident Management will be used in all environments:
• Production
• Staging/Pilot/Lab
• QA/Test
• Development
Business Rules
• Production Incidents always take precedence over non-production Incidents
• Non-production Environments are controlled by the appropriate Development Teams
Documentation
https://sp2010.ucop.edu/sites/its/svcmgmt/Shared%20Documents/Forms/By%20Process.aspx?RootFolder=%2Fsites%2Fits%2Fsvcmgmt%2FShared%20Documents%2FIncident%20Management&FolderCTID=0x012000583C156CC0100047AE0F5B15A9D09E50&View={2BA3C3D2-49F4-45D3-B741-28A66353C197}
Process Documentation:
Project Plan:
Insert link to Request Fulfilment project plan
Service Levels
Priority
Priority = Impact + Urgency
Impact: Measurement of the number of people, business processes, or critical systems affected as a result of the service disruption
Urgency: How quickly the resolution of an Incident is required
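
The relationship Priority = Impact + Urgency can be sketched as a small lookup. The 1 to 3 scales and the score thresholds below are assumptions made for illustration; the material names the two factors and the Critical, High and Routine labels, but not the scale or the cut-offs.

```python
# Assumed scales (1 = most severe); the source does not define them.
IMPACT = {"high": 1, "medium": 2, "low": 3}
URGENCY = {"high": 1, "medium": 2, "low": 3}


def priority(impact, urgency):
    """Combine impact and urgency into a priority label."""
    score = IMPACT[impact] + URGENCY[urgency]  # Priority = Impact + Urgency
    # Hypothetical thresholds on the 2..6 score range.
    if score <= 2:
        return "Critical"
    if score <= 4:
        return "High"
    return "Routine"


print(priority("high", "high"))    # Critical
print(priority("high", "medium"))  # High
print(priority("low", "low"))      # Routine
```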
Outage Report
IM Notification Process
Notification Priority
Notification Process Flowchart
Notification Roles & Responsibilities
Notification Process Steps
Escalation
The objective of the Escalation Policy is to ensure that the appropriate resources are provided as required to resolve the Incident as quickly as possible, within the agreed timeframe based on the priority classification.
Functional Escalation
Functional escalation occurs when one of the Support Teams is unable to resolve the Incident because the resolution requires specialized skills or additional access. Functional escalation can also occur when target times are exceeded.
Hierarchical Escalation
Hierarchical escalation occurs when Incidents are serious in nature (Critical and High priority Incidents) and the appropriate ITS Managers must be notified. It also happens when it is taking longer than expected to investigate and resolve the Incident. Hierarchical escalation should notify Senior Managers and customers. Additional resources may be needed to resolve the Incident.
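
One way to read the two escalation types is as independent checks on the same Incident. The sketch below is an interpretation, not the documented policy: the function name and its three parameters are hypothetical, and treating an overdue Incident as a trigger for both types is an assumption.

```python
def escalation_types(priority, needs_specialist, overdue):
    """Return which escalation types apply to an Incident.

    priority: "Critical", "High" or "Routine"
    needs_specialist: resolution needs specialized skills or additional access
    overdue: target time exceeded / taking longer than expected (assumed to be the same flag)
    """
    types = []
    if needs_specialist or overdue:
        types.append("functional")
    if priority in ("Critical", "High") or overdue:
        types.append("hierarchical")
    return types


print(escalation_types("Routine", True, False))   # ['functional']
print(escalation_types("Critical", False, False)) # ['hierarchical']
```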
IM Procedure Steps
Step (1)
Team reports that an outage has occurred and an Incident Ticket is logged.
Step (2)
Determines if the call is an Incident or Service Request.
Step (3)
If Service Request, the Request Fulfillment process is initiated.
Step (4)
If it is not a Service Request, record and log the customer and Incident details.
Step (5)
The Incident must be categorized. A change in the Incident categorization and/or reassignment to the appropriate team may occur once further investigation is conducted.
Step (6)
Priority is assessed based on the impact and urgency.
Step (7)
Determine whether a P1 outage has occurred.
Step (8)
If a P1 outage has occurred, the 1st Level Support Team will contact 2nd Level and/or 3rd Level Support Teams to verify the outage. If the outage is a Critical or High priority Incident, the Major Incident Notification Process is initiated.
Step (9)
If an outage has not occurred, the 1st Level Support Team will investigate and diagnose the Incident.
Step (10)
Determine whether a workaround or resolution exists. If a workaround does not exist, investigation and diagnosis will continue until a resolution/workaround is found, or escalation occurs. Escalation should occur as soon as possible.
Step (11)
If a workaround or resolution exists, resolve and recover.
Step (12)
Validates and tests that the solution resolves the Incident.
Step (13)
If the Incident cannot be resolved, the 1st Level Support Team will escalate the ticket to the 2nd Level/3rd Level Support Team. The 2nd Level/3rd Level Support Team will investigate and diagnose the Incident. Note: This may result in modifying impact and urgency or reassignment to the appropriate group.
Step (14)
Determine if there is a workaround or solution.
Step (15)
If yes, is a Request for Change (RFC) required?
Step (16)
If an RFC or eRFC is required, then the Change Management process is initiated.
Step (17)
If an RFC is not required, resolve and recover. Validate and test that the solution resolves the Incident.
Step (18)
If the solution is a temporary fix or workaround, initiate the Problem Management Process for further analysis and troubleshooting.
Step (19)
Sends customer notification or contacts customer regarding Incident resolution.
Step (20)
Validates if the Incident is resolved.
Step (21)
If the Incident is resolved, then the Incident state is marked “Resolved” (on the 5th day the Incident state automatically changes to “Closed”).
Step (22)
If the Incident is not resolved, the Incident will be sent back to 2nd Level/3rd Level for further investigation (Step 13), depending on which level provided the resolution or workaround.
Roles involved: 1st Level Support, 2nd/3rd Level Support, Customer
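
The first decision points of the procedure (Steps 2, 3 and 7 to 9) can be sketched as a routing function. This is a simplified reading of the steps, not the ticketing system's logic; the function name, parameters and returned labels are hypothetical.

```python
def route(is_service_request, p1_outage, priority):
    """Decide the next activity for a newly logged call (simplified Steps 2-9)."""
    if is_service_request:
        # Steps 2-3: Service Requests leave the Incident Management process.
        return "Request Fulfillment process"
    if p1_outage:
        # Step 8: Critical or High outages trigger the notification process.
        if priority in ("Critical", "High"):
            return "Major Incident Notification Process"
        return "Verify outage with 2nd/3rd Level Support"
    # Step 9: no outage, so 1st Level investigates and diagnoses.
    return "1st Level investigation and diagnosis"


print(route(True, False, "Routine"))   # Request Fulfillment process
print(route(False, True, "Critical"))  # Major Incident Notification Process
print(route(False, False, "Routine"))  # 1st Level investigation and diagnosis
```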
IM Major/High Procedure Steps
Step (1)
Reports that an outage has occurred and an Incident Ticket is logged.
Step (2)
Verifies that an outage occurred and determines type of outage.
Step (3)
2nd Level/3rd Level will verify the urgency and impact of the Incident within 10 minutes after the outage has been detected.
Step (4)
Contact 1st Level to request the level of notifications required. Provide the details for the outage notification to the 1st Level Support Team.
Step (5)
Sends out email outage notifications (or by phone if email is the affected service) to the following:
IT Support Staff (ITS and non-ITS) & ITS Management:
• IT Support Teams and ITS Management: IM_IT_Support-L@ucop.edu
• Outage email distribution listing: SYSFAIL@LISTSERV.UCOP.EDU
Customers:
For UC-Wide Services:
• All affected campus Service Desks: IM_Campus_Service_Desks-L@ucop.edu
• The affected customer and/or the affected business area managers if known.
For UCOP Only Services:
• UCOP-wide notification, which needs to be sent by IT Service Desk from the alias of “UCOP IT Alert” to: UCOP-L@ucop.edu
• The affected customer and/or the affected UCOP business area managers if known.
Step (6)
Investigate and diagnose the Incident.
Step (7)
Determine if there is a workaround or solution. If no workaround or solution exists, go back to Step 6. If the Incident is not resolved after 1 business hour or 2 non-business hours, the Manager(s) and Director(s) of the affected business area must be contacted. Contact 1st Level to request the notifications, and provide the details for the outage notification to the 1st Level Support Team.
Step (8)
If there is a workaround or solution, is an eRFC required?
Step (9)
If an eRFC is required, the Change Management process is initiated.
Step (10)
If an eRFC is not required, resolve and recover. Make sure the issue is resolved.
Step (11)
Notify 1st Level Support, update the ticket and change the Incident state to “Resolved”.
Step (12)
Customer notification (if necessary).
Step (13)
Validates if the Incident is resolved by accessing the affected applications/systems/websites, etc. This should occur within 10-15 minutes of notification that the applications/systems/websites have been restored.
Step (14)
If the Incident is resolved (test to determine if the issue has been resolved, if possible), the Incident is marked as resolved and resolution email notifications are sent to the following:
IT Support Staff (ITS and non-ITS) & ITS Management:
• IT Support Teams and ITS Management: IM_IT_Support-L@ucop.edu
• Outage email distribution listing: SYSFAIL@LISTSERV.UCOP.EDU
Customers:
For UC-Wide Services:
• All affected campus Service Desks: IM_Campus_Service_Desks-L@ucop.edu
• The affected customer and/or the affected business area managers if known.
For UCOP Only Services:
• UCOP-wide notification, which needs to be sent by IT Service Desk from the alias of “UCOP IT Alert” to: UCOP-L@ucop.edu
• The affected customer and/or the affected UCOP business area managers if known.
Step (15)
If the Incident is not resolved, the Incident will be sent back to 2nd Level and/or 3rd Level (go to Step 6).
Step (16)
Validates Incident details and resolves the Incident.
Roles involved: 1st Level Support, 1st/2nd Level Support, 2nd/3rd Level Support, 1st/2nd/3rd Level Support
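
The outage and resolution notifications in this procedure go to the same distribution lists, chosen by whether the affected service is UC-wide or UCOP-only. A minimal sketch of that selection follows; the addresses are the ones given in the procedure, while the function name and the `scope` values are hypothetical.

```python
# Distribution lists named in the procedure.
IT_SUPPORT = ["IM_IT_Support-L@ucop.edu", "SYSFAIL@LISTSERV.UCOP.EDU"]
UC_WIDE = ["IM_Campus_Service_Desks-L@ucop.edu"]
UCOP_ONLY = ["UCOP-L@ucop.edu"]


def outage_recipients(scope, known_contacts=()):
    """Build the recipient list for an outage or resolution notification.

    scope: "uc-wide" or "ucop-only" (hypothetical labels for the two service types)
    known_contacts: affected customers or business area managers, if known
    """
    if scope == "uc-wide":
        customers = UC_WIDE
    elif scope == "ucop-only":
        customers = UCOP_ONLY
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return IT_SUPPORT + customers + list(known_contacts)


print(outage_recipients("uc-wide"))
```

Using one function for both the initial notification and the resolution notice keeps the two recipient lists identical.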
Major Incident Notification Process
In most cases, the initiator will be the 1st Level Support Team. The 1st Level Support Team will refer to the ITS contact list for the appropriate contact number. In addition to contacting the on call team/individual the 1st Level Support Team will also contact the manager of the support team of the affected area (Incident Coordinator). If an analyst is notified of the Critical Incident directly or notices that a Critical Incident occurred, 1st Level should be contacted to initiate the notification process.

If the 1st Level Support Team is unable to make a “live” contact with the on call person/team within 15 minutes (business and non-business hours) hierarchical escalation is made to the next level (i.e. if unable to contact the primary and secondary person, then escalate to the Manager, if unable to reach the Manager, then escalate to the next level of Management).
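
The 15-minute contact rule above can be sketched as a walk up a contact chain. The assumption here, which the text does not state outright, is that each level gets its own 15-minute window before the next level is tried; the chain entries and the function name are placeholders.

```python
CONTACT_WINDOW_MINUTES = 15  # from the notification process

# Placeholder chain; real names and numbers come from the ITS contact list.
CHAIN = ["on-call primary/secondary", "Manager", "next level of Management"]


def who_to_contact(minutes_without_live_contact, chain=CHAIN):
    """Return the level to contact after a given time without a live contact."""
    level = minutes_without_live_contact // CONTACT_WINDOW_MINUTES
    # Stay at the top of the chain once it is exhausted.
    return chain[min(level, len(chain) - 1)]


print(who_to_contact(5))   # on-call primary/secondary
print(who_to_contact(15))  # Manager
print(who_to_contact(45))  # next level of Management
```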

The 1st Level Support Team will triage the call and attempt to troubleshoot and diagnose the type of outage. The 1st Level Support Team that triages the call depends on which team receives the call or detects the Incident. It may correlate to the time of day as well (service desk = business hours and data center operations = non-business hours). Once the type of outage is determined, the 2nd Level and/or 3rd Level Support Teams are contacted as appropriate. For example, if an Incident occurs with the Mainframe services, the Mainframe manager would be the Incident Coordinator. The Incident Coordinator will be responsible for providing the status of the outage to the Incident Manager. The Incident Manager is responsible for making sure that the 1st Level Support Teams are updated. They also work with the 1st Level Support Teams on which communication notifications are required and the details of those communications. See the Incident Notification Process document for more details.

When Outage Occurs
Within 1 hour of Incident
• The 1st Level Support Team will check with the Incident coordinator to identify the cause and the magnitude of the outage.

• Depending on the magnitude or type of outage, the Incident coordinator will determine and communicate the level of notification necessary.

• A group mailbox (IM_Campus_Service_Desks-L@ucop.edu) is used to notify all campus Service Desks of outage affecting campus wide services.

• For “Critical” Incidents resulting in network and/or email outages, the 1st Level Support Team will use a laptop to access the UCOP wireless network (OPNet). If that service is unavailable, they will use the MiFi to access web mail and from the ITS notifications account send the outage notification to the appropriate e-mail distribution lists.

• If the outage is network related and/or email communication is affected, the 1st Level Support Team will use voicemail to notify the distribution list of the outage and provide status updates.

• The 1st Level Support Team Manager/Supervisor will determine whether the voice announcement should be generated informing callers that “ITS is aware of the outage and working diligently to resolve the issue as quickly as possible.”

• If there is no resolution after 1 hour from the beginning of the Incident, the 1st Level Support Team will contact the Incident Manager. The Incident Manager is responsible for relaying the impact, options and an estimated time of resolution to the 1st Level Support Teams and ITS Management.

• The 1st Level Support will keep a log of the notification steps that were taken during an outage and enter them into the IT Service Hub Incident ticket.
• Verification with the appropriate Support Teams that all affected services are up and fully functional (see the Application Verification Process section below).

• Send the systems restored email and/or voicemail message to the various distribution lists.

• The 1st Level Support Team will turn off the outage announcement.

• The Incident Manager will update the Incident Ticket.

• 1st Level Support will verify that the Incident ticket is updated with all pertinent information.

After services are restored
Once the 2nd Level and/or 3rd Level Support Teams state that the Incident has been resolved, verification that the affected services are fully functional is required.

For outages that occur during business hours and affect the ITS Infrastructure, the 1st Level Support Team will verify the logon page and, in some instances, log into various services and perform simple lookup tests. As appropriate, 2nd Level and 3rd Level Teams will assist in verifying the availability of affected services.

Verification of services will include but is not limited to the following services:
AYSO, PPS, Apply UC, Email, Internet, Phone/voicemail,

For outages that occur off-hours, the 1st Level Support Team will contact the on-call person to verify the availability of the affected services.

The ITS Services’ Support Teams will be responsible for ensuring their services are available. The 1st Level Support will use the contact list to notify the on-call support personnel/team. The on-call person for the affected service(s) will then verify that the service(s) are functioning properly.

The 1st Level Support Team should be notified (by e-mail or phone) within 10 – 15 minutes after the verification has occurred.

The 1st Level Support Team will send an email to the same group that received the first outage notification. It is important that the subject line on the 2nd e-mail is the same as the subject line used on the 1st e-mail. “RESOLVED” should also be added at the beginning of the subject line.
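
The subject-line rule above (reuse the first subject and put "RESOLVED" at the front) can be sketched as a small helper. The function name is hypothetical, and the " - " separator is an assumption modeled on the notification templates; the procedure itself only requires that "RESOLVED" appear at the beginning.

```python
def resolved_subject(original_subject: str, prefix: str = "RESOLVED") -> str:
    """Build the subject of the second e-mail from the subject of the first."""
    subject = original_subject.strip()
    # Avoid double-prefixing if the helper is applied to an already-marked subject.
    if subject.upper().startswith(prefix.upper()):
        return subject
    # Assumption: a " - " separator between the prefix and the original subject.
    return f"{prefix} - {subject}"
```

Keeping the original subject intact means the two messages thread together in most mail clients, which is the point of the rule.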

Application Verification Process

An Outage Report (Appendix B) must be issued within 48 hours (or two business days) of the initiation of a critical or high Incident resulting in an outage. Outages include situations when a service is unavailable or significant performance degradation exists.
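
The two ways of stating the deadline (48 hours, or two business days) can give different dates when an outage starts near a weekend. The sketch below computes both so the difference is visible. It assumes business days are Monday to Friday and ignores holidays; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def outage_report_deadlines(incident_start: datetime) -> dict:
    """Return both readings of the Outage Report deadline."""
    return {
        "48_hours": incident_start + timedelta(hours=48),
        "two_business_days": add_business_days(incident_start, 2),
    }
```

For an Incident that starts on a Friday morning, the 48-hour reading falls on Sunday while the two-business-day reading falls on the following Tuesday.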

Reports will also be issued when a redundant service/component has lost redundancy and a single additional failure results in a critical outage. In these cases, the report will be issued with the title “ITS Outage Report.”

The Outage Report is the responsibility of the Manager responsible for the service or component that caused the outage. The Outage Report will be distributed to the ITS Services’ Management Team.

An Incident must be generated for the outage, with a timeline of events documented in the work notes. Instead of entering the timeline and status updates directly into the work notes, the same can be achieved by replying to the “Incident Opened” email notification for the Incident; the work notes will be updated with the content of the email.

If multiple Incidents exist for the same event, the Incident first generated is the parent and all subsequent Incidents should be attached to the parent Incident. The parent Incident number should be recorded on the Outage Report. It is at the discretion of the Incident coordinator to determine whether a new Incident should be generated to serve as the parent Incident.
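
The default parent rule above (the first Incident generated becomes the parent) can be sketched as follows. The `Incident` class, its fields, and the ticket numbers used with it are hypothetical and do not reflect the IT Service Hub data model. The sketch also does not cover the case where the Incident coordinator chooses to create a new parent Incident.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Incident:
    number: str                    # hypothetical ticket number
    opened_at: datetime            # when the Incident was generated
    parent: Optional[str] = None   # number of the parent Incident, if any


def link_to_parent(incidents: List[Incident]) -> Incident:
    """Attach every Incident for one event to the earliest one and return the parent."""
    if not incidents:
        raise ValueError("at least one Incident is required")
    parent = min(incidents, key=lambda inc: inc.opened_at)
    for inc in incidents:
        # The parent itself has no parent; all others point at it.
        inc.parent = None if inc is parent else parent.number
    return parent
```

The returned parent's number is the one that would be recorded on the Outage Report.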

When multiple services are impacted, only a single report will be issued, by the Manager responsible for the highest-level shared component in the stack. For example, if a single virtual machine that supports a single service fails, the Service Owner would be responsible for issuing the report. If a physical server supporting multiple virtual machines fails, the Manager of the physical server would be responsible for issuing the report. The Outage Report should be attached to the Parent Incident for the outage. Also, a copy of the ITS Outage Report template will be kept in the SharePoint site as follows: https://sp2010.ucop.edu/sites/its/itinfrastructure/datacenter/RCA Archive/Forms/AllItems.aspx
Outage Logs
The monthly Outage Log will be used to ensure that all recorded critical and high outages have a corresponding Outage Report.

The Outage Log is distributed monthly to the ITS Managers and their direct reports.
Critical/High Incident Reviews
The Outage Report will be used to conduct reviews of Critical/High Incidents that resulted in an outage. Once the initial Outage Report is generated, the Service Owner will continue to update the report to reflect additional information.

Updates should reflect additional information such as root cause analysis, prevention steps and opportunities for improvement.

The completed report will be reviewed in team meetings to ensure opportunities for improvement are incorporated into the team procedures.
Incident Management
Metrics
Objective: Quick resolution of Incidents

Metric:
% of Incidents resolved within target timeframe
• By priority
• By Assignment group
• By Assigned to
• By date

Objective: Improved customer satisfaction

Metric:
Number of customer satisfaction surveys sent to customers
Number of responses
Average customer satisfaction survey score
Trend of customer satisfaction score by quarter

Objective: Increased incident resolution efficiency

Metric:
% of Incidents resolved by Service Desk - 1st, 2nd, 3rd Level
% of Incidents escalated by Service Desk – All Levels
Mean time to restore service from point of first call
Mean time to restore Critical Incidents
Mean time to restore High Incidents
Mean time to restore Routine Incidents
Number of Incidents caused by changes

Objective: Meet agreed system availability and performance

Metric:
Availability and performance metrics
Age of Active Incidents
Average duration time for Incident resolution
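
Two of the metrics listed above, the percentage of Incidents resolved within the target timeframe and the mean time to restore, could be computed per priority as in the sketch below. The target timeframes in `TARGETS` are hypothetical placeholders because the source does not state them, and the input format (pairs of priority and time to restore) is an assumption.

```python
from collections import defaultdict
from datetime import timedelta

# Hypothetical targets for illustration only; the real targets are not given here.
TARGETS = {
    "Critical": timedelta(hours=4),
    "High": timedelta(hours=8),
    "Routine": timedelta(days=3),
}


def percent_within_target(incidents):
    """incidents: iterable of (priority, time_to_restore) pairs."""
    totals = defaultdict(int)
    met = defaultdict(int)
    for priority, duration in incidents:
        totals[priority] += 1
        if duration <= TARGETS[priority]:
            met[priority] += 1
    return {p: 100.0 * met[p] / totals[p] for p in totals}


def mean_time_to_restore(incidents):
    """Average time to restore per priority, as a timedelta."""
    sums = defaultdict(timedelta)  # timedelta() is a zero duration
    counts = defaultdict(int)
    for priority, duration in incidents:
        sums[priority] += duration
        counts[priority] += 1
    return {p: sums[p] / counts[p] for p in counts}
```

Grouping by assignment group, assignee, or date would follow the same pattern with a different key.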
Purpose of metrics. Metrics are used to determine the effectiveness of a process. If a metric does not tell you something about the efficiency or effectiveness of the process, discontinue or refine it. It is all too easy to fall into the trap of metrics for the sake of metrics.

Align with Business Functions. Regardless of the IT activity, you need to make sure your metrics tell you something about the business function that depends on what you are measuring.

Keep it simple. A common problem managers face is overloading a metric, that is, trying to get a single metric to report more than one thing. If you want to track more than one thing, create a metric for each. Keep the metric simple and easy to understand. If the metrics are too hard to determine, people often fake the data or the entire report.

Good enough is perfect. Do not waste time polishing your metrics. Instead, select metrics that are easy to track, and easy to understand. Complicated or overloaded metrics often require excessive work, usually confuse people, and do not get used.

A few good metrics. Too many metrics, even if they are effective, can overwhelm a team. For any process, 3 to 6 metrics are usually all that is required. Any more and either the metrics won't get reported or the data gets faked. Too many metrics transform an organization into a reporting factory, focusing on the wrong things for the wrong reasons. In either case, the usefulness of the metrics is compromised.
Purpose
The primary purpose of the Incident Management Notification Process is to establish a standard method of communicating unplanned outages to ITS management and stakeholders. All parties required to restore normal operations are contacted immediately and appropriate notifications are sent to ITS management and stakeholders in an effort to set realistic expectations for Incident resolution.
Objective
The objective of the notification process is to inform stakeholders who are impacted by the outage and to provide periodic updates with current estimates of service recovery. Within 24 hours of an outage, ITS will provide an explanation in an “Outage Report” that will summarize findings, determine the root cause, and communicate preventive measures.
Scope
The Notification Procedure must be followed by all ITS Employees for critical or high outages impacting the IT Services Production Environment.
Disaster Recovery Failover
If the priority is critical and the entire Data Center or UCPC is down, or there is a significant outage to the IT Services Environment, a failover to UCSD must be initiated for all services for which this is an option. The DNS propagation takes 30 – 60 minutes. Failover to UCSD should occur when:

• The outage is estimated to be longer than 1 hour
• Shibboleth is down
• A complete and total outage occurs
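
The failover criteria above can be sketched as a decision function. The source does not say whether all three conditions must hold or any one of them is enough; this sketch assumes any one is sufficient, and it simplifies the precondition to "priority is critical and failover is an option for the service". All names are hypothetical.

```python
def should_fail_over(
    priority: str,
    estimated_outage_minutes: int,
    shibboleth_down: bool,
    total_outage: bool,
    failover_available: bool = True,
) -> bool:
    """Return True when a failover to UCSD should be initiated."""
    # Precondition (simplified): critical priority and a service that can fail over.
    if priority.lower() != "critical" or not failover_available:
        return False
    # Assumption: any single trigger is enough to start the failover.
    return estimated_outage_minutes > 60 or shibboleth_down or total_outage
```

Because DNS propagation takes 30 to 60 minutes, a decision to fail over is worth making early once the estimate passes the one-hour mark.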
Email Notification Templates
Audience: IT Support Staff (ITS & non-ITS) & ITS Management
Initial Email:
Subject line of email:
“[Application, Service Name, System] Performance Issues”
Example: Network Performance Issues

Body of Email:
To: UCOP IT Support Staff
From: IT Service Desk

UCOP is currently experiencing a performance issue with [Application, Service Name, System]. The incident is affecting [all users, users from xx building]. The issue began on [day and date] at [time a.m., p.m.].

We have contacted the support team working to resolve the incident. They will be providing regular updates to the IT Service Desk. There is no need for you to contact them directly. We will provide additional information to you as soon as it becomes available.
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Updates:
Subject line of email:
“[Application, Service Name, System] Performance Issues”
Example: Network Performance Issues

Body of Email:
To: UCOP IT Support Staff
From: IT Service Desk

UCOP is currently experiencing a performance issue with [Application, Service Name, System]. The incident is affecting [all users, users from xx building]. The issue began on [day and date] at [time a.m., p.m.].

We have contacted the support team working to resolve the incident. They will be providing regular updates to the IT Service Desk. There is no need for you to contact them directly. We will provide additional information to you as soon as it becomes available.
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Resolved:
Subject line of email:
“Resolved - [Application, Service Name, System] Performance Issues”
Example: Resolved - Network Performance Issues

Body of email:
To: UCOP IT Support Staff
From: IT Service Desk

The [Application, Service Name, System] is now available. The issue was resolved on [day and date] at [time a.m., p.m.]. The root cause of the issue was [reason or is still being investigated].
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
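
The bracketed placeholders in the templates above lend themselves to simple string substitution. The sketch below fills in the subject line and the first paragraph of the initial notification. The function and field names are hypothetical, and only part of the template body is reproduced.

```python
# Subject and opening paragraph of the initial notification, with the
# bracketed placeholders turned into named fields (names are assumptions).
INITIAL_SUBJECT = "{service} Performance Issues"
INITIAL_BODY = (
    "UCOP is currently experiencing a performance issue with {service}. "
    "The incident is affecting {affected}. "
    "The issue began on {start_date} at {start_time}."
)


def render_initial_notice(service: str, affected: str, start_date: str, start_time: str):
    """Return (subject, body) for the initial outage e-mail."""
    fields = {
        "service": service,
        "affected": affected,
        "start_date": start_date,
        "start_time": start_time,
    }
    # str.format ignores fields the subject does not use.
    return INITIAL_SUBJECT.format(**fields), INITIAL_BODY.format(**fields)
```

For example, `render_initial_notice("Network", "all users", "Friday, January 17", "9:00 AM")` yields the subject from the template's own example, "Network Performance Issues".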

Audience: UCOP Business Customer
Initial Email:
Subject line of email:
“[Application, Service Name, System] Performance Issues”
Example: Network Performance Issues

Body of Email:
To: UCOP Community
From: IT Service Desk

UCOP is currently experiencing a performance issue with [Application, Service Name, System]. The incident is affecting [all users, users from xx building]. The issue began on [day and date] at [time a.m., p.m.].

ITS is working to resolve the incident as quickly as possible. We will provide additional information as soon as it becomes available.
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Updates:
Subject line of email:
“Update - [Application, Service Name, System] Performance Issues”
Example: Update - Network Performance Issues

Body of email:
To: UCOP Community
From: IT Service Desk

We are still experiencing a performance issue with [Application, Service Name, System]. ITS is working to resolve the incident as quickly as possible. We will provide additional information to you as soon as it becomes available.
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Resolved:
Audience: UC Location Customer
Initial Email:
Subject line of email:
“UCOP Alert: [Application, Service Name, System] Performance Issues”
Example: UCOP Alert: AYSO Performance Issues

Body of Email:
To: All UC Location Service Desks
From: UCOP IT Service Desk

UCOP is currently experiencing a performance issue with [Application, Service Name, System], which may include customers from your location. The issue began on [day and date] at [time a.m., p.m.].
UCOP ITS is working to resolve the incident as quickly as possible. We will provide additional information to you as soon as it becomes available. Please share this information with your users as appropriate.

We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Updates:
Subject line of email:
“Update - UCOP Alert: [Application, Service Name, System] Performance Issues”
Example: Update - UCOP Alert: AYSO Performance Issues

Body of email:
To: All UC Location Service Desks
From: UCOP IT Service Desk

UCOP ITS is still working to resolve a performance issue with [Application, Service Name, System]. We will provide additional information to you as soon as it becomes available. Please share this information with your users as appropriate.
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Resolved:
Subject line of email:
“Service Restored - UCOP Alert: [Application, Service Name, System] Performance Issues”
Example: Service Restored - UCOP Alert: AYSO Performance Issues

Body of email:
To: All UC Location Service Desks
From: UCOP IT Service Desk

[Application, Service Name, System] is now available. The issue was resolved on [day and date] at [time a.m., p.m.]. The root cause of the issue was [reason or is still being investigated]. Please share this information with your users as appropriate.
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Disaster Recovery Failover Notification Templates
Initial Email
Subject line of email:
“UCOP Alert - [Application, Service Name, System, Location] Outage: Disaster Recovery Initiated”
Example: UCOP Alert - Datacenter Outage: Disaster Recovery Initiated

Body of email:
UCOP is experiencing an outage that is affecting [Application, Service Name, System]. We have initiated failover procedures to our disaster recovery site at UC San Diego. The transition from UCOP to UCSD will take approximately [30 minutes]. You might experience intermittent availability issues until the transition is complete.
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Failover Complete
Subject line of email:
“UCOP Alert - [Application, Service Name, System, Location] Outage: Disaster Recovery to UCSD Completed”
Example: UCOP Alert - Datacenter Outage: Disaster Recovery to UCSD Completed

Body of email:
The failover transition to the UCSD disaster recovery site is now complete, and you should be able to access your [Application, Service Name, System]. If you are still experiencing connection issues to Shibboleth, please ask your system administrator to manually refresh your DNS cache.
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Failover Back to UCOP
Subject line of email:
“UCOP Alert - [Application, Service Name, System, Location] Outage: Failover Back to UCOP Initiated”
Example: UCOP Alert - Datacenter Outage: Failover Back to UCOP Initiated

Body of email:
We have identified the root cause of the UCOP outage. The incident was caused by [reason or is still being investigated]. We have resolved the issue and are initiating the failover procedures back to UCOP. The transition from UCSD to UCOP will take approximately [30 minutes]. You may experience intermittent availability issues until the transition is complete.
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
Complete Restoration/Failover Back to UCOP Complete
Subject line of email:
“UCOP Alert - [Application, Service Name, System, Location] Outage: Failover Back to UCOP Completed”
Example: UCOP Alert - Datacenter Outage: Failover Back to UCOP Completed

Body of email:
The failover transition from the UCSD disaster recovery site back to UCOP is complete. UCOP systems are fully restored, and you should now be able to access [Application, Service Name, System].
We apologize for any inconvenience and thank you for your understanding. If you have any questions or need additional information, please do not hesitate to contact us at (510) 987-0457 or servicedesk@ucop.edu.
If the network or Exchange Server is unavailable, notifications will be sent by voicemail using the same notification templates and distribution lists.