Wednesday, November 25, 2009

Non-priority problems

What do you do with problems that aren't fixed but which the business never prioritise for fixing?

The other week a client of mine was expressing his frustration with his team for having too many open Problem records. Now, the Problems that were causing the worry weren't the ones they were currently working to resolve, but rather a backlog of Problems that were just not important enough to spend any time on. He was inclined to close them, seemingly because they were making his stats look bad.

So I picked on one, as an example, to explore further whether it would be valid to close it or not. It turned out this example was a bona fide problem whereby the service was not performing to the customer's expectations, and may not actually have been performing to spec, but the Business kept refusing to allow resources to be allocated to fixing it. You see, the Business had their roadmap outlined with all the changes they wanted to see, naturally justified with business cases, and they did not want to add any risk to delivering their roadmap by including fixes to Problems which weren't really losing them any significant money. Don't get me wrong: that's a perfectly good business decision, but the pain was being felt by this Operations manager, who got to keep these black marks against his SLA stats when in truth he was not empowered to resolve them.

It's clear to me this has to go one of two ways: (1) move the hot potato to the party that has the power to resolve it; or (2) make it a cold potato.

If your processes treat Problem records as hot potatoes, i.e. it is a bad thing to be in possession of them, particularly for extended periods of time, then you need to be able to allocate them to the party that's blocking progress. In this case, that may be the business owner, whether they have access to the Problem management system or not.

The alternative is to stop them being hot potatoes. Allow the Problem record to be moved to a status where it is no longer a black mark on the SLA reporting, e.g. "on hold", or promoted to a Known Error record. You then need a process to review those records with the Business from time to time, so that they can prioritise them for fixing at some point.
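To make the "cold potato" idea concrete, here is a minimal sketch of how parked Problems might be kept out of the SLA count while still being flagged for periodic review. Everything in it is an assumption for illustration: the status names, the 90-day review interval, and the `Problem`, `sla_backlog` and `due_for_review` names are hypothetical and not taken from any particular ServiceDesk tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Status(Enum):
    OPEN = "open"
    ON_HOLD = "on hold"          # parked pending a Business decision
    KNOWN_ERROR = "known error"  # documented, not scheduled for a fix
    CLOSED = "closed"


# Assumption: only OPEN records count as black marks in SLA reporting.
COUNTS_AGAINST_SLA = {Status.OPEN}
# Assumption: parked records are reviewed with the Business every 90 days.
REVIEW_INTERVAL = timedelta(days=90)


@dataclass
class Problem:
    ref: str
    status: Status
    last_reviewed: date


def sla_backlog(problems):
    """Problems that Operations is actually accountable for resolving."""
    return [p for p in problems if p.status in COUNTS_AGAINST_SLA]


def due_for_review(problems, today):
    """Parked Problems that should go back in front of the Business."""
    parked = {Status.ON_HOLD, Status.KNOWN_ERROR}
    return [
        p for p in problems
        if p.status in parked and today - p.last_reviewed >= REVIEW_INTERVAL
    ]
```

The point of the design is that parked records never disappear: they drop out of one report and turn up in another.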

By simply closing the Problem record, you are denying yourself the opportunity to use your Problem Management tool as a knowledge base. If you've closed the Problem, it looks like the problem no longer exists, so the next time an Incident is raised, your operations team treats it as a new problem, which is a huge waste of resources.

How do you treat these kinds of Problems in your organisation?

Tuesday, October 27, 2009

Close Incident when Problem raised?

What's the relationship between Incident closure and Problem creation? One school of thought, reported by Juan Jimenez, is that

(A) Incident ALWAYS ends when Problem raised.

This is apparently favoured by Management who are keen to show the best possible SLA statistics - so they grasp any possible excuse to close an Incident, even if the service disruption has not yet ended. "Why would you want to keep 2 tickets open?", they might reason. "The Incident has been evaluated and the cause is understood - Incident Management can take a back seat and let Problem Management deal with it." Clearly this is for expediency, not logical thinking. Or maybe it's just that these people work in a world where Incidents and Problems cannot coexist, e.g. because an inadequate ServiceDesk tool uses a single record type for both things, treating 'Problem' as a status in the lifecycle of a ticket which may start off life at an 'Incident' status.

Juan presents an alternative view, that

(B) Incident ALWAYS ends only when Problem ends.

Juan's reasoning seems to be as follows: ITIL documents that Problem closure can include closing any Incidents that were caused by the Problem. Therefore the Incidents must still be open.

Maybe I could agree with this, depending on what you mean by 'open'. The Incident, i.e. the service degradation, may have ended (perhaps by itself, or via the use of a workaround) but the Incident Record may still be held open, so that it can still be worked on in some way - e.g. preventative measures are being worked on. But I don't agree with that thinking. The Incident is over when the Incident is over. Abstracting the Incident record from the Incident itself for administrative purposes is just confusing. We can avoid such confusion by managing the resultant work (e.g. preventative measures) via Problems and/or RFCs.

What I'm saying is that Incidents' and Problems' respective closure dates are not consistently related. Sometimes it will be A, sometimes B, and sometimes something else, depending on whether and when a workaround brings the end-user Service back to normal. Sometimes, even, the Incident will stay open long after the root cause Problem has been corrected, because an RFC is required to rectify some of the damage that had been caused.

Incidents and Problems are different animals with lives of their own, coexisting also with Events, Workarounds and RFCs, and they will often be related, but they won't always start and end in any particular order.
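One way to picture this is as two separate record types, each with its own open and close dates, linked only by reference. The sketch below is purely illustrative: `IncidentRecord`, `ProblemRecord` and `closure_order` are hypothetical names, and a real tool will model these records differently.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ProblemRecord:
    ref: str
    opened: date
    closed: Optional[date] = None


@dataclass
class IncidentRecord:
    ref: str
    opened: date
    closed: Optional[date] = None
    # Linked by reference only; neither record's closure forces the other's.
    problem_refs: List[str] = field(default_factory=list)


def closure_order(incident, problem):
    """Report which record closed first, without assuming any fixed order."""
    if incident.closed is None or problem.closed is None:
        return "still open"
    if incident.closed < problem.closed:
        return "incident first"
    if incident.closed > problem.closed:
        return "problem first"
    return "same day"
```

With a model like this, rule A, rule B and the "Incident outlives the Problem" case are all just different data, not different processes.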

Monday, October 26, 2009

Capacity concern - Problem or RFC?

This started out as a tweet:

ITIL question: if you have a Capacity concern on a system where you have a projected bottleneck, do you raise it as a Problem or an RFC?


So, an Incident is defined by ITIL v3 as a variance from normal service, or the failure of a Configuration Item that reduces Availability, and a Problem is the cause of one or more Incidents. But what happens if you detect a Capacity bottleneck that has not yet caused an Incident, but which you anticipate would cause one in future if usage grows as projected and nothing is done about it?

Surely it cannot strictly be a Problem, because an Incident has not yet occurred. Furthermore, in some situations, calling it a Problem might have financial implications, implying that the service was not designed or implemented correctly; that would be unfair if the service was designed as optimal for a certain user load, and it just so happens that user load has increased since that time. If calling it a Problem is going to lead to financial penalties, wouldn't this end up incentivising people to bury their heads in the sand? That wouldn't be a good thing.

But from a technical standpoint, treating it as a Problem is attractive. It may well be treated in a similar way to a Problem: the same technicians may work on it, the same processes of design and transition may be employed to address the bottleneck. This is a point made well by Stephen Mann when we were discussing this on Twitter. Do we call it a potential future problem? James Finister described this as Proactive Problem Management.

Or should we treat it as an RFC? Stuart Knipe raised some good points regarding the differences between RFCs and Problems. Problems seem to imply starting from some root cause analysis, whereas RFCs seem to start from business cases and requirements specification. But would it be right to be more business-like with a bottleneck that has not caused an Incident yet than with one that has?

Ruth Arnold points us in the right direction, albeit one that may not be conclusive, when she points at the Capacity Management Process. It depends on your local policy. Capacity Management produces many outputs, including metrics and monitoring, plans and projections, but when it also produces a to-do list of capacity concerns, how do you treat them: as RFCs or as future Problems?

As far as I'm aware ITIL does not specify how these concerns should be managed, so I guess it's up to you.

Please let me know...

Why ITIL Agony?

I've been meaning to start this blog for a while. From time to time I have a burning question about ITIL and I scour the web trying to find a definitive answer, but can't find one. Makes me think it would be good if there was a blog that could address these questions. A blog where people could pose their own quandaries and find a community of people who could help with their opinions. A bit like an agony aunt, but for ITIL. So here it starts: ITIL Agony.