Tuesday, October 27, 2009

Close Incident when Problem raised?

What's the relationship between Incident closure and Problem creation? One school of thought, reported by Juan Jimenez, is that

(A) Incident ALWAYS ends when Problem raised.

This is apparently favoured by Management who are keen to show the best possible SLA statistics - so they grasp any possible excuse to close an Incident, even if the service disruption has not yet ended. "Why would you want to keep 2 tickets open?", they might reason. "The Incident has been evaluated and the cause is understood - Incident Management can take a back seat and let Problem Management deal with it." Clearly this is for expediency, not logical thinking. Or maybe it's just that these people work in a world where Incidents and Problems cannot coexist, e.g. because an inadequate ServiceDesk tool uses a single record type for both things, treating 'Problem' as a status in the lifecycle of a ticket which may start off life at an 'Incident' status.

Juan presents an alternative view, that

(B) Incident ALWAYS ends only when Problem ends.

Juan's reasoning seems to be as follows: ITIL documents that Problem closure can include closing any Incidents that were caused by the Problem. Therefore the Incidents must still be open.

Maybe I could agree with this, depending on what you mean by 'open'. The Incident, i.e. the service degradation, may have ended (perhaps by itself, or via a workaround), but the Incident record may still be held open so that it can still be worked on in some way, e.g. while preventative measures are put in place. But I don't agree with that practice. The Incident is over when the Incident is over. Separating the Incident record from the Incident itself for administrative purposes is just confusing. We can avoid such confusion by managing the resultant work (e.g. preventative measures) via Problems and/or RFCs.

What I'm saying is that the closure dates of Incidents and Problems are not consistently related. Sometimes it will be A, sometimes B, and sometimes something else, depending on whether and when a workaround brings the end-user service back to normal. Sometimes the Incident will even stay open long after the root-cause Problem has been corrected, because an RFC is required to rectify some of the damage that had been caused.

Incidents and Problems are different animals with lives of their own, coexisting also with Events, Workarounds and RFCs, and they will often be related, but they won't always start and end in any particular order.
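
To make the point concrete, here is a minimal sketch in Python of Incidents and Problems held as separate records with their own closure dates. The class names, fields and the closure_order helper are my own assumptions for illustration; they are not taken from ITIL or from any particular ServiceDesk tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Problem:
    """Hypothetical root-cause record; closed when the cause is corrected."""
    ref: str
    opened: date
    closed: Optional[date] = None


@dataclass
class Incident:
    """Hypothetical disruption record; closed when service is back to normal."""
    ref: str
    opened: date
    closed: Optional[date] = None
    # Related Problems are linked, not merged into the Incident's own status.
    problems: List[Problem] = field(default_factory=list)


def closure_order(incident: Incident, problem: Problem) -> str:
    """Report which of the two records closed first, if both are closed."""
    if incident.closed is None or problem.closed is None:
        return "still open"
    if incident.closed < problem.closed:
        return "incident first"
    if incident.closed > problem.closed:
        return "problem first"
    return "same day"


# A workaround restores service quickly; the root cause is fixed weeks later.
p1 = Problem("PRB-1", opened=date(2009, 10, 1), closed=date(2009, 10, 20))
i1 = Incident("INC-1", opened=date(2009, 10, 1), closed=date(2009, 10, 2), problems=[p1])

# The root cause is fixed early, but an RFC is still needed to repair the damage.
p2 = Problem("PRB-2", opened=date(2009, 10, 1), closed=date(2009, 10, 5))
i2 = Incident("INC-2", opened=date(2009, 10, 1), closed=date(2009, 10, 12), problems=[p2])

print(closure_order(i1, p1))
print(closure_order(i2, p2))
```

Because each record carries its own dates, neither rule (A) nor rule (B) has to be built into the data model; the order of closure simply falls out of what actually happened.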

Monday, October 26, 2009

Capacity concern - Problem or RFC?

This started out as a tweet:

ITIL question: if you have a Capacity concern on a system where you have a projected bottleneck, do you raise it as a Problem or an RFC?


So, an Incident is defined by ITIL v3 as a variance from normal service, or the failure of a Configuration Item that reduces Availability, and a Problem is the cause of one or more incidents. But what happens if you detect a Capacity bottleneck that has not yet caused an Incident, but which you anticipate in future would cause one if usage grows as projected and nothing is done about it?

Surely it cannot strictly be a Problem, because an Incident has not yet occurred. Furthermore, in some situations, calling it a Problem might have financial implications, implying that the service was not designed or implemented correctly; that would be unfair if the service was designed to be optimal for a certain user load, and it just so happens that user load has increased since that time. If calling it a Problem is going to lead to financial penalties, wouldn't this end up incentivising people to bury their heads in the sand? That wouldn't be a good thing.

But from a technical standpoint, treating it as a Problem is attractive. It may well be treated in a similar way to a Problem: the same technicians may work on it, the same processes of design and transition may be employed to address the bottleneck. This is a point made well by Stephen Mann when we were discussing this on Twitter. Do we call it a potential future problem? James Finister described this as Proactive Problem Management.

Or should we treat it as an RFC? Stuart Knipe raised some good points regarding the differences between RFCs and Problems. Problems seem to imply starting from some root cause analysis, whereas RFCs seem to start from business cases and requirements specification. But would it be right to be more business-like with a bottleneck that has not caused an Incident yet than with one that has?

Ruth Arnold points us in the right direction, albeit one that may not be conclusive, when she points at the Capacity Management Process. It depends on your local policy. Capacity Management produces many outputs, including metrics and monitoring, plans and projections, but when it also produces a to-do list of capacity concerns, how do you treat them: as RFCs or as future Problems?

As far as I'm aware, ITIL does not specify how these concerns should be managed, so I guess it's up to you.
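
For what it's worth, here is a rough Python sketch of how a local policy might turn a capacity projection into a record type. It assumes simple compound growth per period, and the function names, the planning horizon and the "RFC"/"Problem" policy value are all hypothetical; none of this comes from ITIL.

```python
import math
from typing import Optional


def periods_until_bottleneck(current: float, capacity: float,
                             growth_rate: float) -> Optional[int]:
    """Whole periods until compound growth takes usage to or past capacity.

    Returns 0 if usage is already at capacity, and None if usage is not
    growing (so no bottleneck is projected under this simple model).
    """
    if current >= capacity:
        return 0
    if growth_rate <= 0 or current <= 0:
        return None
    return math.ceil(math.log(capacity / current) / math.log(1 + growth_rate))


def classify_concern(periods: Optional[int], horizon: int, policy: str) -> str:
    """Apply a local policy: concerns inside the planning horizon are raised
    as whatever record type the organisation has chosen ("RFC" or "Problem")."""
    if periods is None or periods > horizon:
        return "no action"
    return policy


# Example: 600 users today, capacity for 1000, usage growing 5% per month.
months = periods_until_bottleneck(current=600, capacity=1000, growth_rate=0.05)
print(months, classify_concern(months, horizon=12, policy="RFC"))
```

The arithmetic is the easy part. The policy argument is the bit ITIL appears to leave to you, which is exactly the question here.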

Please let me know...

Why ITIL Agony?

I've been meaning to start this blog for a while. From time to time I have a burning question about ITIL and I scour the web trying to find a definitive answer, but can't find one. Makes me think it would be good if there was a blog that could address these questions. A blog where people could pose their own quandaries and find a community of people who could help with their opinions. A bit like an agony aunt, but for ITIL. So here it starts: ITIL Agony.