Problem management in practice

In this presentation you will be introduced to a major incident from the real world and the subsequent problem solving, step by step.
by Thomas Fejfer on 12 June 2016

Transcript of Problem management in practice

ABC Spedition PLC transports and stores food in Scandinavia. The transport is carried out by trucks.
Introduction
1 September
27 August
Major incident
A failure similar to the July 29 incident caused approx. 4 hours of unavailability for most of the IT services. A reboot was tried, but had no effect.

A team was formed to investigate and diagnose the issue. After an hour the team reported that they had found the root cause, a hardware failure in a module in a central network component (distribution switch). The module was replaced, and all IT services were again available.
Problem
Conclusion
Mobile: +45 40 15 97 82
E-mail: tf@bluehat.dk
Web: http://www.bluehat.dk/
The CIO was not convinced that the IT specialists had prevented a similar incident from happening in the future, and therefore initiated a problem investigation.
by Thomas Fejfer BlueHat P/S
Thomas Fejfer
create a simple cause & effect diagram to explain the cause and effect relationships
PROBLEM
MANAGEMENT
IN PRACTICE

Note: This case is simplified to make it easier to communicate, and times, business type, etc. have been changed to anonymize the company.
IT systems provide information about:
What must be loaded on a given truck
Where the truck must deliver its load
Do not confuse the problem with a cause
the switch module was one of several causes - not the problem
Problem solving is team work
and at least one in the team should not have a deep knowledge of the nature of the problem
ITIL® Incident and Problem Management describes well how to manage incidents and problems, but not how to solve them
Major incident
A possible failure in a central network component caused approx. 4 hours of unavailability for most of the IT services.

After a reboot of the component, all of the IT services were available again.
29 July
Whitepaper:
How Kepner-Tregoe can improve your ITIL processes
download at: http://www.bluehat.dk/downloads/
no registration needed
ITIL® is a Registered Trade Mark of the Office of Government Commerce in the United Kingdom and other countries.
ITIL® Problem Management describes well how to manage problems but not how to solve them.

Therefore, it is a huge challenge for many IT organizations to get Problem Management to work in everyday life.

In many cases, the Problem Management process acts as an Incident Management process, and the IT organization achieves at best only a limited part of the value the process could deliver.
Slow down – and think before you act
assess the risk before you start implementing solutions
Return to "normal" operation as quickly as possible
be aware of different perceptions of normal operation
Incident Management and Problem Management
be aware of the fundamental differences
Timeline
06:00 The first user reports slow response times
07:00 Several users have reported slow response times
07:05 Network specialist began investigation
07:15 Sporadic loss of packets over "Distribution Switch A"
07:30 "Distribution Switch A" rebooted. Automatic failover to "Distribution Switch B". All services were available
07:45 "Distribution Switch A" was up and running after the reboot. Automatic fallback to "Distribution Switch A". IT services were unavailable
08:00 Automatic failover/fallback with less than 1 sec. interval
10:00 "Distribution Switch A" turned off. Automatic failover to "Distribution Switch B". All services were available
10:15 HW supplier received the log from the switch
The problem solving team was staffed with:
1 problem coordinator with a limited knowledge of network and servers, but with knowledge and experience of problem solving methodologies
3 subject matter experts:
Cause-effect diagram
12:00 "Distribution Switch A" turned on. Automatic fallback to "Distribution Switch A". All services were available
Economy
Problem description
Sketch
Failover
is switching (automatic or manual) to a redundant or standby computer server, system, or network upon the failure or abnormal termination of the previously active application, server, system, or network.

Fallback
is the process (automatic or manual) of restoring a system, component, or service in a state of failover back to its original state (before failure).

Wiki
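
The two definitions above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class SwitchPair, its method names and the notify callback are hypothetical and do not correspond to any real switch or vendor API. It shows the difference between automatic and manual fallback, and a notification on every switchover.

```python
class SwitchPair:
    """Toy model of a primary/standby pair (hypothetical, not a vendor API)."""

    def __init__(self, primary="A", standby="B", auto_fallback=True, notify=print):
        self.primary = primary
        self.standby = standby
        self.active = primary            # traffic starts on the primary
        self.auto_fallback = auto_fallback
        self.notify = notify             # e.g. a message to the service desk

    def primary_failed(self):
        # Failover: switch to the standby when the primary fails.
        if self.active == self.primary:
            self.active = self.standby
            self.notify(f"failover: {self.primary} -> {self.standby}")

    def primary_recovered(self):
        # Automatic fallback happens only if it is enabled.
        if self.active == self.standby and self.auto_fallback:
            self._fall_back("automatic")

    def manual_fallback(self):
        # Manual fallback: an operator decides when to return to the primary.
        if self.active == self.standby:
            self._fall_back("manual")

    def _fall_back(self, kind):
        self.active = self.primary
        self.notify(f"{kind} fallback: {self.standby} -> {self.primary}")


messages = []
pair = SwitchPair(auto_fallback=False, notify=messages.append)
pair.primary_failed()                    # traffic moves to B
pair.primary_recovered()                 # stays on B, fallback is manual
state_after_recovery = pair.active
pair.manual_fallback()                   # operator moves traffic back to A
```

With auto_fallback=False the pair stays on the standby until someone calls manual_fallback(), so a primary that comes up in a faulty state cannot take the traffic back on its own.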
Problem solving team
10:45 Supplier reported back a hardware failure in a module in "Distribution Switch A"
11:45 Module replaced and tested
Decision
Identify possible solutions
Actions aimed at getting back to "normal operation" as quickly as possible had caused additional unavailability
A possible solution was to address this by training
A timeline (chronological analysis) is a valuable tool for problem solving, but it cannot replace a cause-effect diagram, as it does not explain all cause and effect relationships.
PRINCIPLE:
Always perform an analysis of causality to explain why the problem occurred.
PRINCIPLE:
Problem solving is teamwork, and at least one in the team must not have a deep knowledge of the nature of the problem.
PRINCIPLE:
Problems are not solved in general. Always focus on how to prevent one specific incident.
PRINCIPLE:
Problems can be complicated, but solutions must not be, because complex solutions are new problems.
PRINCIPLE:
Problems are solved when specific solutions are implemented.
PRINCIPLE:
Choose best solution
(extract)
What:
A hardware failure in the primary distribution switch caused extensive unavailability of IT services.

Undesired outcome:
5 complaints of late delivery (€ 3.000)
75 trucks were grounded for approx. 4 hours (€ 60.000)
250 administrative employees each lost 4 work hours (€ 54.000)
Total: € 117.000

Where:
ABC Spedition’s IT data center in Silkeborg

When:
August 27, 2011 from 06:00 am to 12:00 pm (noon)
Always start with defining the business impact to get a unique starting point for problem solving.
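
As a small worked example, the business impact in this case can be added up and broken down per hour. The three amounts come from the problem description; the per-hour rates are derived here for illustration and are not stated in the presentation.

```python
# Amounts in euros, taken from the problem description.
impact = {
    "late delivery complaints": 3_000,
    "grounded trucks": 60_000,
    "lost administrative work hours": 54_000,
}

total = sum(impact.values())

# Derived rates (an illustration, not figures from the presentation).
trucks, outage_hours = 75, 4
employees, lost_hours = 250, 4
cost_per_truck_hour = impact["grounded trucks"] / (trucks * outage_hours)
cost_per_employee_hour = impact["lost administrative work hours"] / (employees * lost_hours)

print(f"Total impact: EUR {total}")
print(f"Per truck-hour: EUR {cost_per_truck_hour:.0f}")
print(f"Per employee-hour: EUR {cost_per_employee_hour:.0f}")
```

A figure like this gives the team one agreed number to weigh the cost of each possible solution against.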
Simple
(extract)
PRINCIPLE:
- 2 network specialists
- 1 platform (server and OS) specialist
Possible solutions are identified by going systematically through each cause and asking:
“What can we do to remove the cause?” or
“What can we do to prevent the cause from having a negative effect?”
PRINCIPLE:
A timeline documents events in chronological order and is very useful to show which events may have been triggered by others – or to discount any claims that are not supported by the sequence of events.
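
A minimal sketch of how such a timeline could be built: the events of this case are entered in the order they were collected and then sorted by time. The event texts are shortened from this transcript, and the function names parse and build_timeline are illustrative.

```python
from datetime import time

# Events from this case, deliberately not in chronological order.
events = [
    ("06:00", "First user reports slow response times"),
    ("07:05", "Network specialist began investigation"),
    ("07:15", "Sporadic loss of packets over Switch A"),
    ("07:00", "Several users have reported slow response times"),
    ("07:30", "Switch A rebooted, failover to Switch B, services available"),
    ("10:15", "HW supplier received the log from the switch"),
    ("10:00", "Switch A turned off, failover to Switch B, services available"),
    ("08:00", "Failover/fallback with less than 1 sec. interval"),
    ("07:45", "Switch A up, fallback to Switch A, services unavailable"),
    ("10:45", "Supplier reported a hardware failure in a module in Switch A"),
    ("11:45", "Module replaced and tested"),
    ("12:00", "Switch A turned on, fallback to Switch A, services available"),
]


def parse(hhmm: str) -> time:
    """Turn an 'HH:MM' string into a time object so events sort correctly."""
    hours, minutes = hhmm.split(":")
    return time(int(hours), int(minutes))


def build_timeline(raw):
    """Return the events sorted by their time stamp."""
    return sorted(raw, key=lambda event: parse(event[0]))


timeline = build_timeline(events)
for stamp, text in timeline:
    print(stamp, text)
```

Sorted this way, it is easy to see that services became unavailable only after the fallback at 07:45, not after the failover at 07:30.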
What are our objectives?
a) Prevent recurrence or minimize the adverse impact, including similar occurrences at e.g. different locations
b) Be within your control
c) Be simple
d) Provide reasonable value for its cost
e) Not cause other unacceptable problems

What alternatives do we have?
1) Replace with better quality HW
2) Establish manual fallback procedure
3) Remove redundant setup
4) Reduce stabilization period

Which alternative best fits our needs?
1) Reduces probability (uncertain), within control, complex, € 20.000
2) Prevents recurrence, within control, simple, € 1.000 (BEST CHOICE)
3) Reduces impact, within control, complex, € 2.000, risk of increased unavailability in other cases
4) Prevents recurrence if reduced to zero, outside of control, cost N/A

What could go wrong with that choice?
If we do not know when a failover has occurred, then we have no redundancy. We need to set up a notification that is sent to the service desk in case of failover (estimated cost € 2.000). Total cost € 3.000.

(Fast version of Kepner-Tregoe Decision Analysis)
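
The evaluation above can be written down as a small screening exercise. This is a sketch only: the ratings are copied from the slide, but the pass/fail rule (within control, simple, no unacceptable side effects, known cost, then lowest cost) is an assumption made for illustration and is not the Kepner-Tregoe method itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alternative:
    name: str
    effect: str
    within_control: bool
    simple: bool
    cost_eur: Optional[int]          # None where the slide says "cost N/A"
    side_effect_risk: bool = False


alternatives = [
    Alternative("Replace with better quality HW",
                "reduces probability (uncertain)", True, False, 20_000),
    Alternative("Establish manual fallback procedure",
                "prevents recurrence", True, True, 1_000),
    Alternative("Remove redundant setup",
                "reduces impact", True, False, 2_000, side_effect_risk=True),
    Alternative("Reduce stabilization period",
                "prevents recurrence if reduced to zero", False, False, None),
]


def acceptable(a: Alternative) -> bool:
    # Assumed screening rule covering criteria b), c) and e).
    return (a.within_control and a.simple
            and not a.side_effect_risk and a.cost_eur is not None)


candidates = [a for a in alternatives if acceptable(a)]
best = min(candidates, key=lambda a: a.cost_eur)   # stands in for criterion d)

# The failover notification to the service desk adds an estimated EUR 2.000.
total_cost = best.cost_eur + 2_000
print(best.name, total_cost)
```

Under these assumptions only the manual fallback procedure passes the screening, which matches the choice marked on the slide.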
Note: This is classic ITIL incident management. An intermediate action (a workaround) is used to restore service.
Note: Many will see this as problem management, but it is not. In this case a corrective action is used to restore service, and therefore it is pure incident management.
Note: This is where ITIL problem management starts!