In my last post, I discussed aspects of problem management in the context of a real-life situation involving the first vehicle I owned. Throughout this series of posts, I’ve examined that situation from the standpoint of the incident and problem management processes that ITIL describes.
As I mentioned, during the one year that I owned that car, I periodically experienced interruptions in service. In other words, the vehicle stopped running. ITIL calls such an interruption in service an incident. If the interruption happened every two weeks over that year, then I experienced twenty-six separate incidents.
In the previous post, I also described the proactive and reactive aspects of problem management as they related to this situation. In this final post of the series, I will discuss workarounds.
Workarounds
ITIL defines a workaround as a temporary fix. Workarounds temporarily restore service. ITIL doesn’t specify how long “temporary” is, just that a workaround does not correct the root cause of a problem. “Temporary” could be any time frame from one second to fifteen years and beyond.
In the case of my car, when it stopped running I applied a workaround to get it running again quickly. The workaround was simple: I would clear the blockage from the fuel filter and restart the vehicle. This temporary fix would work for almost precisely two weeks, until enough sediment from the fuel tank worked its way back into the fuel filter.
Workarounds often carry risk. Consider the risk that I was taking when I stopped on the side of the road, raised the hood on my car, disconnected the fuel filter, cleared the blockage, and then put the fuel filter back. I understand now exactly how dangerous it is to work on a car on the side of the road while other cars are passing by, not to mention the risk associated with sticking a fuel filter that contained fuel, rust and dirt in my mouth. These are all things I wouldn’t do now, because the risk is too high.
IT organizations often do similar things. It’s not uncommon for an IT organization to become accustomed to applying a workaround regularly and never investigating the root cause. A common example is the “server reboot”. Usually some process or application on the server has a memory leak, and the organization has decided that rather than pursue the root cause of the leak, it makes more sense to reboot the server periodically. This can be a reasonable choice, especially when cost-effectiveness is in question, but the organization takes on significant risk because the reboot does not address the root cause. In other words, it’s very possible that the server will reboot right back into the error condition. Not only that, but IT organizations have a tendency to automate workarounds like server reboots and process restarts through cron jobs or the equivalent. This is a really bad idea for many reasons, one of which is that eventually the person who automated the workaround will leave the company, and because these things are usually poorly documented, they ultimately break with nobody left who understands them.
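If a workaround like a periodic restart does get automated, one way to soften the documentation risk is to record every application of it against the problem it masks. What follows is only a minimal sketch in Python, not something ITIL prescribes. The `WorkaroundLog` class, the problem ID, the owner name and the review threshold are all hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class WorkaroundLog:
    """Tracks how often a workaround is applied for one problem record."""
    problem_id: str            # hypothetical identifier of the problem record
    description: str           # what the workaround does, in plain language
    owner: str                 # who to ask when the automation breaks
    review_threshold: int = 5  # assumed: applications before a root-cause review
    applied_at: list = field(default_factory=list)

    def record(self, when=None):
        """Note one application of the workaround."""
        self.applied_at.append(when or datetime.now(timezone.utc))

    def needs_review(self):
        """True once the workaround has been applied often enough to question it."""
        return len(self.applied_at) >= self.review_threshold


log = WorkaroundLog(
    problem_id="PRB-0001",
    description="Restart the application server to reclaim leaked memory",
    owner="ops-team",
)
for _ in range(5):
    log.record()
print(log.needs_review())  # True: five applications reached the assumed threshold
```

The point of the sketch is that the workaround carries its own description and owner, and that repeated use eventually raises the question of fixing the root cause instead.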
Sometimes it makes sense to keep doing a workaround. In the case of my car, it didn’t. I was simply avoiding fixing the root cause of the problem. Fixing the root cause would have cost only a couple of hundred dollars, and it would have ultimately saved me time and money and eliminated significant risk. There were many potential corrections to the root cause. In fact, I used to tell my father that the “correction” was to buy me a new car. I’m still waiting for that “correction”. It could even be said that when I ultimately sold the car, I corrected the root cause, at least from my perspective. I never heard from the people who bought the car how they felt about this.
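To make the cost argument concrete, here is a rough back-of-the-envelope sketch in Python. The two-week interval and the one-year period come from my story, and I have taken “a couple of hundred dollars” as 200. The cost of each workaround (my time, the risk, the inconvenience) is a made-up figure of 15 dollars, so treat the result as an illustration rather than a measurement.

```python
import math

WEEKS_PER_YEAR = 52


def incidents_per_year(interval_weeks):
    """Number of incidents in a year if one occurs every `interval_weeks`."""
    return WEEKS_PER_YEAR // interval_weeks


def break_even_incidents(fix_cost, cost_per_workaround):
    """Smallest number of workarounds whose total cost reaches the fix cost."""
    return math.ceil(fix_cost / cost_per_workaround)


incidents = incidents_per_year(2)
fix_cost = 200.0             # "a couple of hundred dollars", taken as 200
cost_per_workaround = 15.0   # assumed figure, not from the original story

print(incidents)                                            # 26
print(break_even_incidents(fix_cost, cost_per_workaround))  # 14
print(incidents * cost_per_workaround)                      # 390.0
```

Under these assumptions, the permanent fix would have paid for itself after fourteen incidents, a little over half of the year I owned the car.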
One last note: sometimes workarounds can be applied preventatively. For example, when I had a date, I would usually apply the workaround to my car before picking up my date for the evening. In my teenage mind, it made sense that I didn’t want the car to stop running in the middle of a date. Thinking back, it might have made more sense to let the car fail in these situations…
Summary
In this series of posts I’ve discussed the differences between incidents and problems. Incidents are interruptions to service, whereas problems are the unknown causes of one or more incidents. I also discussed various aspects of the incident and problem management processes, all in the context of a real-life situation. Mapping real-life situations onto the best practices that ITIL presents is an excellent way to understand them.