Incident Escalation – The Devil in the Details

At a recent service management event I was speaking with a manager who was looking to “improve the way that incidents were handled, in both time reductions and amount”. She said that they had recently reviewed their incident records for a self-audit and felt that the records did not contain enough information to make lasting improvements.

The tool she is using has a built-in template for incidents, with a standard set of questions. The trouble, as she pointed out, was that many of the questions were going unanswered. A review of the questions was conducted, which resulted in some of them being removed because they appeared to provide little value. Now only a few questions are regularly asked, along with additional information which the service desk gathers depending on the escalation. To me, the questions that go unanswered may hold more information than you might think.

I explained that I had had a similar experience; however, I found that the questions that went unanswered could provide as much information as the tangible data that was captured.

Here are some practical examples complete with responses and translations:

When did the issue begin?

Response: “I don’t know”

Translation: We may not know when the issue began, but we should have an idea of when it was last working as intended. From there we can narrow down when the issue began, even if we are unable to pin down the exact time in any quantifiable way. For example, this could be the result of a change gone wrong.
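
The narrowing described above can be sketched in a few lines of Python. This is a minimal illustration, not a feature of any particular service desk tool: the ChangeRecord type, its fields, and the changes_in_window helper are hypothetical names, and the sample changes are made up for the example. Given the last time the caller remembers things working and the time the issue was reported, it lists the changes implemented in between as candidates to investigate.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ChangeRecord:
    """Hypothetical change record; the field names are assumptions."""
    change_id: str
    implemented_at: datetime
    summary: str


def changes_in_window(changes, last_known_good, reported_at):
    """Return changes implemented after the last known good time and
    no later than the report time, oldest first."""
    if last_known_good > reported_at:
        raise ValueError("last_known_good must not be after reported_at")
    candidates = [
        c for c in changes
        if last_known_good < c.implemented_at <= reported_at
    ]
    return sorted(candidates, key=lambda c: c.implemented_at)


# Made-up sample data for illustration only.
changes = [
    ChangeRecord("CHG-1", datetime(2024, 3, 1, 22, 0), "Patch file server"),
    ChangeRecord("CHG-2", datetime(2024, 3, 4, 6, 30), "Update application X"),
    ChangeRecord("CHG-3", datetime(2024, 3, 5, 12, 0), "Replace network switch"),
]

# The caller says it worked on Friday afternoon and reports on Monday morning.
suspects = changes_in_window(
    changes,
    last_known_good=datetime(2024, 3, 1, 16, 0),
    reported_at=datetime(2024, 3, 4, 9, 0),
)
for change in suspects:
    print(change.change_id, change.summary)
```

Even a rough "it worked on Friday" answer is enough to shrink the list of changes worth looking at, which is the point of asking the question in the first place.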

How many people are impacted?

Response: “I don’t know, it’s only me, I am the only one on site”

Translation: Depending on the business operations, you may have sites with very few people, where gathering information from a selection of people is limited. Since this person represents the entire office, you may want to investigate what else they are unable to do in an effort to understand any impact beyond what is reported. For example, they may not be able to access a particular application, but after further questioning you may identify that they can access the network, email, and phones. This will help to identify the scope of the issue occurring at their site, which might not be immediately clear otherwise.
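
One way to make this follow-up questioning repeatable is a simple checklist. The sketch below is an assumption about how such a checklist might be recorded, not part of any specific tool; the service names and the summarize_scope helper are hypothetical. It keeps a question that has not been answered yet as "unknown" rather than silently dropping it.

```python
def summarize_scope(checks):
    """Group checklist results into working, failing, and not yet checked.

    `checks` maps a service name to True (works), False (fails),
    or None (the caller has not been asked or does not know).
    """
    scope = {"working": [], "failing": [], "unknown": []}
    for service, result in checks.items():
        if result is True:
            scope["working"].append(service)
        elif result is False:
            scope["failing"].append(service)
        else:
            scope["unknown"].append(service)
    return scope


# Made-up answers from a caller who is alone on site.
checks = {
    "application X": False,
    "network": True,
    "email": True,
    "phones": None,  # not yet asked
}

scope = summarize_scope(checks)
print("Failing:", scope["failing"])
print("Working:", scope["working"])
print("Still to check:", scope["unknown"])
```

Keeping the "unknown" group visible in the incident record shows the next person what has not been ruled out yet.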

What steps were taken to reach this error?

Response: “I just can’t perform activity X”

Translation: Typically, when people escalate an issue they indicate what they are not able to do. In some cases we need to know all the steps leading up to the particular activity which is not working in order to diagnose the underlying cause. You are likely not that familiar with all the workflows of your business. Get the caller to take you through, step by step, what is working and then what is not. Get a screen capture whenever possible.

In some cases the issue may have nothing to do with infrastructure or applications. It could boil down to communication or training.

Communication

We have all heard that a large majority of incidents are generated by changes, but what about the escalations that come in as a result of a change which is underway? For example, someone calls in because they are unable to access application X. In the initial check you identify that they can’t access the application because a scheduled outage is underway. Be careful how you communicate this to them. We don’t want to belittle them, but we do want to understand why they were not aware. It could be that they simply did not see the communication, or that they did not receive it in the first place. This information should be shared with the manager of the service desk as well as any pertinent service management teams, such as change management.

Training

Applications tend to change in both functionality and appearance over time. Because of this, some escalations are not actual issues but rather the result of the person not having the appropriate training for the new functionality. Again, we want to capture this information so that we can build a knowledge article in the event that anyone else has a similar challenge. We should also communicate this information back to the team responsible for the update, as well as to change and release management.

In the end, we want to position IT to respond to escalations quickly, but also to make lasting improvements by understanding all of the moving parts of the issues which we are working to reduce for our customers.