A major challenge of problem determination is dealing with unanticipated problems. It is much like detective work: finding clues, making educated guesses, and verifying suspicions. An ideal strategy for problem prevention is to monitor the system regularly. Use the strategies outlined in this paper to minimize downtime and detective work so you can maximize performance.
A troubleshooter's job doesn't start after a problem occurs; preparations should be made well in advance. Troubleshooters often see a problem (for example, the system crashes) and then launch long, complex analyses that can take days or even weeks. Prudent system administrators and troubleshooters, however, start planning long before a problem occurs. In other words, they prepare the environment so that troubleshooting can be done more quickly and effectively if and when problems arise. The following key strategies will help prevent problems from occurring with IBM Business Process Manager V8.5.
Three Strategies for Problem Prevention
#1: Tune and Monitor the Environment Regularly
Monitoring tools and a plan are needed to effectively detect problems or anomalies when they emerge. Monitoring is a trade-off. You want to detect important events, and yet not adversely impact the normal operation of the system. Monitoring is an entire technical area in itself, different from problem determination. This paper covers only a few points on this topic.
Passive monitoring can be done at all levels: network, operating system, application server, and application. The main system log files of dependent systems such as databases and LDAP directories can also be monitored for errors and events. For example, you might detect application server restarts that indicate the server is failing. Some tools for passive monitoring include the Tivoli Performance Viewer and IBM Tivoli Composite Application Manager for Application Diagnostics.
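The log-file side of passive monitoring can be automated with a small scanner. The sketch below is a minimal illustration, not a production monitor: the sample log lines are invented, and it assumes WebSphere-style SystemOut.log formatting, where message IDs ending in E or F indicate errors and the WSVR0001I "open for e-business" message marks a server start (so a second occurrence suggests a restart).

```python
import re

# Hypothetical sample of WebSphere-style SystemOut.log lines; real logs
# vary by product version and configuration.
LOG_LINES = [
    "[1/1/24 9:00:00:000 GMT] 00000001 WsServerImpl  A   WSVR0001I: Server server1 open for e-business",
    "[1/1/24 9:05:00:000 GMT] 00000042 ConnLeakLogic E   J2CA0045E: Connection not available",
    "[1/1/24 9:30:00:000 GMT] 00000001 WsServerImpl  A   WSVR0001I: Server server1 open for e-business",
]

def scan_log(lines):
    """Return (error_lines, start_count) from WebSphere-style log lines."""
    # Severity letter E or F, followed by a message ID ending in E or F.
    errors = [l for l in lines if re.search(r"\b[EF]\s+\w+\d{4}[EF]:", l)]
    # Each WSVR0001I marks a server start; more than one implies a restart.
    starts = sum(1 for l in lines if "WSVR0001I" in l)
    return errors, starts

errors, starts = scan_log(LOG_LINES)
```

In practice a monitor like this would tail the live log and raise an alert when the error count grows or a start message appears outside a planned maintenance window.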
Active monitoring goes beyond passive monitoring: you periodically test the operation of the entire system from end to end. One technique is to ping system components, such as one server or one database connection. Another technique is end-to-end pinging: you periodically send an entire "dummy" transaction through the system and verify that it completes. Some tools for active monitoring include IBM Tivoli Composite Application Manager for Transactions and web-based, load-generating programs like Rational Performance Tester. Make sure that you:
- Monitor the system at all levels, from the network to applications and dependent systems
- Monitor the main system log files for errors and events such as detecting application server restarts
- Use monitors and alerts on the following key system metrics: memory usage, default PMI statistics, and performance advisors
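The end-to-end ping technique described above amounts to timing a probe and flagging failures or slow responses. A minimal sketch follows; the timeout threshold and the health URL in the comment are assumptions for illustration, and the demo deliberately uses a stub probe rather than a real network call.

```python
import time

PING_TIMEOUT_SECS = 5.0  # assumed threshold; tune to your environment's baseline

def ping(probe):
    """Run one end-to-end probe; return (ok, elapsed_seconds).

    `probe` is any callable that returns True on success. An exception
    or a False return is treated as a failed ping.
    """
    start = time.monotonic()
    try:
        ok = bool(probe())
    except Exception:
        ok = False
    return ok, time.monotonic() - start

# A real probe might submit a dummy transaction, for example (hypothetical URL):
#   from urllib.request import urlopen
#   probe = lambda: urlopen("https://bpm.example.com/health",
#                           timeout=PING_TIMEOUT_SECS).status == 200
# For illustration, a stub probe that always succeeds:
ok, elapsed = ping(lambda: True)
```

Run on a schedule, a probe like this catches the case where every component looks healthy in isolation but a complete transaction no longer flows through the system.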
Examples of ongoing system "health" monitoring include:
- Watching for significant errors in the logs that the various components emit
- Verifying that the metrics each component produces remain within acceptable norms, for example operating system processor and memory statistics, IBM Business Process Manager performance metrics, and transaction rates through the application
- Watching for the spontaneous appearance of special artifacts that are generated only when a problem occurs, such as Java dumps or heap dumps
- Periodically sending a "ping" through various system components or the application and verifying that it continues to respond as expected
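The "metrics within acceptable norms" check above is easy to mechanize once you have baselines. The sketch below is illustrative only: the metric names and ranges are assumptions, and real thresholds should come from your own baseline measurements of the healthy system.

```python
# Assumed acceptable ranges for a few key metrics (min, max); real values
# are site-specific and should be derived from baseline measurements.
NORMS = {
    "cpu_percent": (0, 85),      # OS processor utilization
    "heap_used_mb": (0, 1536),   # JVM heap usage
    "txn_per_sec": (10, 5000),   # application transaction rate
}

def out_of_norm(sample):
    """Return the names of metrics whose values fall outside the norms."""
    return [name for name, value in sample.items()
            if name in NORMS and not (NORMS[name][0] <= value <= NORMS[name][1])]

# Example sample, as might be gathered from OS tools and PMI counters:
alerts = out_of_norm({"cpu_percent": 92, "heap_used_mb": 900, "txn_per_sec": 120})
```

Note that a lower bound matters too: a transaction rate that drops to zero is as strong a health signal as a processor that pegs at 100 percent.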
Be prepared to actively generate more diagnostics when a problem occurs. In addition to dealing with diagnostic artifacts that are present when an incident occurs, your troubleshooting plan should consider any additional explicit actions to take as soon as an incident is detected. You want these actions to take place before the data disappears or the system is restarted. Here are some examples of explicit actions to generate more diagnostics:
- Actively trigger various system dumps, if they are not generated automatically (such as Java dump, heap dump, system dump, or other dumps that various products and applications might provide). For example, when a system is believed to be "hung," it is common practice to collect three consecutive Java dumps for each potentially affected JVM process.
- Take a snapshot of key operating system metrics, such as process states, sizes, or processor usage.
- Enable and collect information from the IBM Business Process Manager Performance Monitoring Infrastructure instrumentation.
- Dynamically enable a specific trace, and collect that trace for a specified interval while the system is in the current "unhealthy" state.
- Actively test or "ping" various aspects of the system to see how their behavior changes compared to normal conditions. This activity is done to try to isolate the source of the problem in a multicomponent system.
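The "three consecutive Java dumps" practice mentioned above can be scripted. On Linux and UNIX systems, sending SIGQUIT (`kill -3`) to a JVM process asks it to write a thread dump (a javacore file for the IBM JDK) without stopping the process; comparing dumps taken a short interval apart shows whether threads are actually moving. The sketch below only builds the command sequence rather than running it, so the pid and interval shown are illustrative.

```python
def javacore_commands(pid, count=3, interval_secs=30):
    """Build the shell commands to request `count` consecutive Java dumps.

    kill -3 sends SIGQUIT, which a JVM interprets as a dump request;
    the sleeps between dumps let thread states diverge if the JVM is
    not truly hung.
    """
    cmds = []
    for i in range(count):
        cmds.append(f"kill -3 {pid}")
        if i < count - 1:
            cmds.append(f"sleep {interval_secs}")
    return cmds

# Hypothetical pid for illustration:
cmds = javacore_commands(12345)
# -> ["kill -3 12345", "sleep 30", "kill -3 12345", "sleep 30", "kill -3 12345"]
```

Having such a script staged in advance matters because the window to collect this data closes as soon as someone restarts the "hung" server.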