Ancient societies observed many rituals, some meant specifically to recognize the passage of an individual from youth to recognized membership in the adult community. These ceremonies or events are often referred to as a “rite of passage” and mark an important milestone in the life of that person as well as the group of which they are a part. In most Western cultures, such ceremonies are no longer directly relevant, but certain experiences play a similar role. For almost every network engineer, the “rite of passage” is a network outage or problem that was unpleasantly memorable and particularly difficult, and that is retold in stories for years to come. The important point is that while these things can and do happen, they should remain rare. Keeping them rare is essentially what network troubleshooting and problem resolution is all about. Here we will examine five steps for addressing issues and provide some tools for dealing with them when they arise.
Step 1: Prevent as Much as Possible
Issues and problems with networks of any shape and size are fundamentally inevitable, almost entirely because of human imperfection. Imperfect people created the networking technology used in the world today, which guarantees that flaws and imperfections will exist in that technology.
The key is to prevent as many issues as possible before they arise, through activities such as proactive maintenance and device monitoring. The best solution to a problem is to keep it from occurring in the first place. Prevention narrows the remaining problems to those that truly could not be foreseen and builds confidence among end-users that matters are well in hand.
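As one illustration of device monitoring, the following is a minimal sketch of a reachability check, assuming each device exposes a TCP management port. The device names, the addresses (taken from the 192.0.2.0/24 documentation range), and the function names `is_reachable` and `check_devices` are hypothetical; a production network would more likely rely on a dedicated monitoring system.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_devices(devices, probe=is_reachable):
    """Return the names of devices whose probe failed.

    devices maps a device name to a (host, port) tuple.
    """
    return [name for name, (host, port) in devices.items()
            if not probe(host, port)]


if __name__ == "__main__":
    # Hypothetical inventory; replace with real management addresses.
    inventory = {
        "core-switch-1": ("192.0.2.10", 22),
        "access-switch-7": ("192.0.2.27", 22),
    }
    for name in check_devices(inventory):
        print(f"ALERT: {name} is not answering on its management port")
```

Run on a schedule, a check like this can surface an unreachable device before end-users report it. The `probe` parameter is there so the check can be swapped for another test, such as an SNMP query, without changing the loop.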
Step 2: Diagnose the Issue Effectively
The investigation and diagnosis phase is the most critical part of the process, as it sets the stage for rallying resources and helps to narrow the scope of the actual issue. Without being condescending, understand that end-users will probably not understand networking technology at even a fundamental level, and that the complaint may not even have a technical foundation. For example, a user may report network slowness and may even be impatient, but when you question further, you may discover that they are downloading large files or streaming video from the Internet. That individual may have felt that the network was at fault when the problem was of their own making. The skill required when interacting with end-users is to ask the right questions and get the information you need without giving offense.
Step 3: Identify the Root Cause(s)
Your task is to take a collection of reported symptoms, understand the potential causes (sometimes more than one), and discover the root problem. While this sounds straightforward enough, this certainly can present challenges, particularly in a complex network. Having other competent and experienced peers to rely on (whether on staff with you or in other organizations) can help close the gap on understanding the core issue. The acronym RCA (Root Cause Analysis) is important to know in this regard. The RCA may be performed after a particular issue is resolved, in order to prevent it from recurring.
Step 4: Apply the Most Effective Solution
While this might appear to be self-evident, in actual practice the pressures of time, urgency, and management may encourage a “quick fix” rather than an actual solution. In many networks, temporary fixes for specific issues are implemented and, if undocumented, can persist for years while other parts of the network come to depend on them. With the turnover of network staff, which happens often, the knowledge of these phantom fixes is lost. If such a fix is later removed as part of a cleanup process, it may create immediate problems, with no documentation to guide the staff in addressing the new issues.
Step 5: Document the Issue and Resolution
As mentioned before, staff turnover is an inevitable part of life for the networking team at any organization. When it happens, information is often lost, which underscores the importance of a documentation repository. As a network professional, I have often been amazed at the lack of accurate network information (diagrams, addressing schemes, configurations, etc.) and documentation in place. Preserving critical information ensures that future staff will have an easier time dealing with the infrastructure they inherit, and it can potentially help address liability and compliance issues should they arise.
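One way to keep such records consistent is to give each issue a fixed structure. The sketch below is only an illustration: the `IncidentRecord` fields, the helper names, and the idea of storing entries as JSON are assumptions, not a format prescribed by the white paper.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class IncidentRecord:
    """One entry in a hypothetical troubleshooting log."""
    opened: str                  # ISO date the issue was reported
    summary: str                 # what the end-users reported
    root_cause: str              # outcome of the root cause analysis
    resolution: str              # what was changed to resolve the issue
    temporary_fix: bool = False  # True if this is a stopgap awaiting a real fix
    affected_devices: list = field(default_factory=list)


def to_json(record):
    """Serialize a record to JSON text for a documentation repository."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def open_temporary_fixes(records):
    """Return the records still flagged as temporary fixes."""
    return [r for r in records if r.temporary_fix]
```

Flagging temporary fixes explicitly makes it possible to list them later, so a stopgap is less likely to become one of the undocumented phantom fixes described in Step 4.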
Reproduced from Global Knowledge White Paper: Troubleshooting Cisco Switches