Service Health Management with HP Unified Operations
Mike Shaw on 8 July 2010
Transcript of Service Health Management with HP Unified Operations
performance automatically discovered service dependency map
If the fault and performance information for each domain (virtualization, Blackberry servers, or database servers) goes to the management tool for that domain, then no one has a single view of what is happening.
When we don't have a single view of what is happening, we get
DUPLICATION OF EFFORT - a number of teams will work on what they believe to be their problem because they got a fault event. What we need is to correlate all related fault events so we can figure out which is the true root fault event. This can't be done if fault events are going to a series of separate management consoles.
SLOW RESOLUTION of issues, especially top-down performance issues. Because no one team can see the whole picture, we need to get all the teams into a room or on the phone in order to figure out the performance problem's root cause.
It's impossible for us to PRIORITIZE which incidents to work on first because we can't "look upwards" and understand the business applications affected by each problem.
We can't know the true, single root cause of the problem - each team that gets a fault event thinks it might be their domain that is causing the problem. Because of this, AUTOMATED fixing of the problem is DANGEROUS - we may actually take the wrong action and make things worse.
Note that we are talking about first level support here - handling the first-line fault events or trying to figure out the cause of a top-down application performance problem. Once we've figured out what's going on, we give it to the subject matter experts, and of course they need to use the management tool for their domain to fix the problem.
As we discussed earlier, if we have a service hierarchy map (also known as a service dependency map) we can figure out the cause of problems and prioritize our fixes.
It is hard to overstate the importance of a service dependency map. With it, incident and problem resolution is faster, more accurate and less people-intensive.
But, service dependencies are complex and humans can't possibly build them manually or, more importantly, keep them up to date. This is especially true now that virtualization makes relationships even more fluid.
So, the discovery of this service dependency map must be automated.
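To make the role of the map concrete, here is a minimal sketch of how a dependency map can separate root faults from subsequent ones and rank them by the business services they touch. The service names, the `DEPENDS_ON` map, and the rule "a fault is subsequent if something it depends on is also faulty" are all illustrative assumptions, not a description of how any HP product works.

```python
# Hypothetical dependency map: each service lists what it depends on.
DEPENDS_ON = {
    "online-banking": ["web-server", "database"],
    "email": ["blackberry-server", "database"],
    "web-server": ["vm-host-1"],
    "database": ["vm-host-1"],
    "blackberry-server": ["vm-host-2"],
    "vm-host-1": [],
    "vm-host-2": [],
}
BUSINESS_SERVICES = {"online-banking", "email"}


def all_dependencies(service, depends_on):
    """Return every service reachable below `service` in the map."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(depends_on.get(node, []))
    return seen


def classify_faults(faulty, depends_on):
    """Split faulty services into root faults and subsequent faults.

    Simplifying assumption: a fault is 'subsequent' when anything it
    depends on (directly or indirectly) is also reporting a fault.
    """
    faulty = set(faulty)
    roots, subsequent = set(), set()
    for service in faulty:
        if all_dependencies(service, depends_on) & faulty:
            subsequent.add(service)
        else:
            roots.add(service)
    return roots, subsequent


def business_impact(root, depends_on, business_services):
    """'Look upwards': which business services sit above this root fault?"""
    return {b for b in business_services
            if root in all_dependencies(b, depends_on)}


faults = {"vm-host-1", "database", "web-server", "blackberry-server"}
roots, subsequent = classify_faults(faults, DEPENDS_ON)

# Work on the root fault that affects the most business services first.
priorities = sorted(
    roots,
    key=lambda r: len(business_impact(r, DEPENDS_ON, BUSINESS_SERVICES)),
    reverse=True,
)
```

With these made-up inputs, four fault events collapse to two actionable ones, and the virtualization host comes first because both business services sit above it.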
When we look at change management, we will see that the service dependency map plays an equally central role there too.
Modern applications are complex, depending on many moving parts. "Moving parts" is actually an apt term here. With virtualization, the parts may well be moving! Monitoring the many components of an application and then assuming that the application is performing at the correct level just doesn't work. Performance isn't a black and white thing like availability. Just because all the components that make up an application are running at the required levels doesn't mean the application itself will be running at the required performance level.
So, if we simply monitor all the parts of an application, there will be situations where "everything looks fine" and in fact, customers are having performance problems (and therefore either getting upset with your company, or simply taking their business elsewhere). We know this is the case because roughly 60% of all service desk performance incidents come from customers and not from our own monitoring. This means that we are using our customers as the most expensive monitoring device money can buy.
If we have a really responsive service desk, won't that help? At HP, we have estimated that the time between a performance problem occurring and the first customer calling in is something like 30 to 45 minutes. 30 to 45 minutes when you are losing business. 30 to 45 minutes when you could have been working on fixing the problem.
We provide integration between the teams solving incidents and problems. Because we pass information and state between first level support, the service desk and second level support, we can solve problems much quicker, stop "allocation ping pong" where the work bounces between groups, and fix the problem more accurately (i.e. not have to "redo" the fix). At HP, we know this as "CLIP" - Closed-loop Incident Process.
multiple management consoles
using customers as monitors
manual analysis of data
We talked earlier about how we need to have all fault and performance information come to one place to get a single view of what is happening.
Let's imagine we do that - we have all fault and performance information come to a consolidated fault and performance console. Do we now have a single view of what is happening?
Possibly - provided the people sitting at the console are very experienced, very alert, and have a very, very large screen.
There will be lots of fault events hitting our consolidated console. Some of these events are "root faults" - the real cause of the problem. But many of them will be subsequent fault events caused by the root cause. The operator has to be able to manually figure out what is an actionable root event and what is a subsequent event they can ignore. This is not easy, especially as there can be lots of subsequent events separated in time by many minutes.
We will also have a ton of performance data coming at us. When we have an application performance problem, we're going to struggle to manually infer what the cause of that application performance problem is from the mountain of performance data.
disjointed incident process
An incident will either start with monitoring detecting a fault event or performance problem or, less preferably, with a customer calling in. For most incidents a number of different teams will work on the incident - the first-line support team, the service desk, and then any number of subject matter experts.
Typically, the information flow between these teams is not good. The teams don't have a common view of the problem - what the first level support team knows about the problem isn't known to the service desk and so on.
This results in longer resolution times than are necessary, which means long down times and inefficient IT Operations.
no single view of the truth
Whenever we are trying to resolve an incident, whether it occurs as a result of fault events or as a result of application performance problems, there are a number of things we need to know to help us resolve the incident quickly and accurately:
We need a view of the hierarchy of services we have in IT Operations. This helps us understand what fault events are root events and what are subsequent. It allows us to prioritize faults. And it allows us to understand the root cause of a performance problem.
We need to be able to map the fault and performance information we receive onto that service hierarchy map so that we can understand the health status in service terms.
All this information needs to be available to everyone involved in resolving the incident.
Without this common, single view of the service hierarchy and its status, resolving the incident will be tough.
We bring all fault and performance information into one place. We then get one view of what is happening, and we can stop duplication of effort between silos and fix application performance problems more quickly because we have all the monitoring information in one place.
We automatically monitor the end-user experience of applications so that we are the first to know there is a problem, not the last. This will give us something like a 30 to 45 minute head start over the first customer calling in.
When we put in place consolidated fault and performance monitoring, we will have a lot of data. We automate the analysis of this data so that:
We get told which fault events are root causes we need to do something about and which are subsequent events we can ignore.
When we get an application performance problem, we can automatically correlate the performance data streams, thereby allowing us to understand what the root cause of the performance problem is.
One of the advantages of automated fault and performance data analysis is that we are left with a true root cause of any problem. We can then apply automated fixing of the problem knowing that we are going to be taking actions against the root cause and not against a subsequent fault.
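As a rough illustration of what "correlating the performance data streams" can mean, the sketch below ranks component metrics by how closely they track the application's response time, using a plain Pearson correlation. The metric names and sample values are invented for the example, and correlation alone only suggests where to look first; it is not presented here as the analysis method HP's tooling uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def rank_suspects(app_response, component_metrics):
    """Order component metrics by strength of correlation with the app."""
    scores = {name: pearson(app_response, series)
              for name, series in component_metrics.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples taken at the same six points in time.
app_response_ms = [200, 210, 205, 400, 650, 900]
component_metrics = {
    "db_lock_wait_ms": [5, 6, 5, 60, 120, 200],
    "web_cpu_percent": [40, 42, 39, 41, 40, 43],
    "vm_host_free_mem_mb": [900, 880, 910, 890, 905, 895],
}

ranking = rank_suspects(app_response_ms, component_metrics)
top_suspect, top_score = ranking[0]
```

In this invented data set the database lock wait rises in step with the application response time, so it is the first place an operator would be pointed to.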
Our customers are finding that they can automate about 20% of their incident fix actions. These are, of course, the more trivial fix actions - but 20% is a good percentage of actions to take out and automate.
There are actually three ways in which automation can help us here:
It can collect data for us as close to the incident's occurrence as possible
It can fix the problem using a run-book workflow created by the subject matter expert
It can automatically verify that the fix actually worked
IT Operations Transformation with HP's Unified Operations solutions
Let's now look at how HP's Unified Operations solutions deliver excellent Service Health - good and predictable availability and performance of IT services. We'll start with an overall picture, and then look at each component in turn.
HP Unified Operations solutions: Service Health Management
Let's start by looking at the causes of poor or unpredictable IT service health.
This presentation is best viewed by clicking "More" and then "Full Screen".