For years, vSphere has had alarms to notify you when metrics cross thresholds you specify. This is a useful first step in identifying items that may require attention or further investigation. The problem is that these thresholds are static: you set a value, such as CPU utilization > 75%, and are notified whenever it is exceeded. That leads to many false alarms if you have a virtual machine (VM) that routinely runs above that value, or one that spikes to it during a batch-processing window. In those cases, that level of CPU utilization is expected and normal; investigating it repeatedly wastes time, and administrators soon start ignoring alarms as probable false positives, including the occasional real one. Static thresholds also fail to alert you when utilization is below normal, such as when a service failure causes processing to stop. Knowing when you have a “real” issue is the key. There have been case studies of companies that spent a year tuning thresholds to find the “right” values for their VMs. That is a huge waste of time and money, and it does not hold up in a dynamic environment.
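To make the contrast concrete, here is a minimal sketch (in Python, not any VMware API) of the idea behind a dynamic threshold: instead of a fixed cutoff, derive upper and lower bounds from each VM's own history, so a machine that normally runs hot is not flagged, while one that goes unusually quiet is. The function names and the three-standard-deviation rule are illustrative assumptions, not how vCOPS actually computes its baselines.

```python
import statistics

def dynamic_thresholds(history, k=3.0):
    """Derive upper/lower alert bounds from a VM's observed history.

    A crude stand-in for an automatically learned baseline:
    'normal' is the historical mean, and anything more than k
    standard deviations away is abnormal in either direction
    (too busy OR suspiciously idle).
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return mean - k * stdev, mean + k * stdev

def is_abnormal(value, history, k=3.0):
    lower, upper = dynamic_thresholds(history, k)
    return value < lower or value > upper

# A VM that routinely runs hot: ~80% CPU is normal for it,
# so a static ">75%" alarm would fire constantly.
history = [78, 82, 80, 79, 81, 83, 80, 77, 82, 81]
print(is_abnormal(80, history))  # False: within its own normal band
print(is_abnormal(5, history))   # True: suspiciously idle, worth a look
```

Note that the second check catches exactly the case a static high-water-mark alarm misses: a stalled service driving utilization far below normal.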
Another issue for many deployments is that the VMs may have wasted resources – VMs that are over-provisioned, powered off, or even removed from the inventory but still on disk. This wastes resources and drives up the Total Cost of Ownership (TCO). The question is how to identify those resources and get them back. Conversely, other machines may not have the required resources to run well – which ones are they and what do they need? How do I know which VMs consume a lot of resources and impact the performance of other VMs?
All of these, and many other questions, can be answered by carefully studying performance graphs in the vSphere Client (or the new Web Client) and monitoring alarms. The problem is these tasks are time-intensive, and administrator time is both expensive and at a premium. Enter vCenter Operations Manager (vCOPS). Let the computer do what it does best: monitoring and alerting, figuring out what is “normal,” and then notifying administrators when things are abnormal.
What Is It?
vCOPS is a tool from VMware that is designed to analyze your environment, figure out what is “normal,” and alert you when abnormalities occur. These abnormalities can be at the VM, host, cluster, or datastore level. vCOPS is designed to help you find both undersized and oversized VMs as well as wasted resources, spot issues early, and provide root-cause analysis for the issues it detects. If you also have vCenter Configuration Manager (vCM – part of the vCOPS Management Suite), it can correlate configuration events in the environment with their effects on the VM, host, etc. The tool gathers data over time, dynamically sets thresholds, and reports back when they are exceeded. vCOPS is designed to alert you when things are not normal and not bother you about small events that are normal (and probably common) in your environment.
The vCOPS Management Suite includes several other tools that are also useful in analyzing and diagnosing issues and relationships in your environment.
Why Do I Need It?
Most environments with more than a few servers and a few dozen VMs need vCOPS (or a similar tool). There are just too many things going on and too few administrators to watch all that is happening to effectively manage issues. Also, too many false alarms are raised for things that may be normal in an environment, such as a server using a lot of CPU doing batch processing overnight. This could be fixed by adjusting alarm values for those VMs, but that requires a lot of data-gathering and analysis to figure out what is “normal” for each VM, and then implementing those custom alarms on every affected VM, managing each one as an exception. The more data-gathering and analysis that take place in an environment, the more complex and costly it is to manage.
vCOPS does what computers do best: it gathers and analyzes data and alerts you to abnormal conditions. You can then fight the real fires by doing what people do best. The costs of not managing an environment well are high: unplanned outages, capacity limitations that degrade performance, insufficient resources when a failover event occurs, or a full datastore, which freezes every VM running on it. For those who don’t monitor and then run into these problems, days of downtime will quickly convince them of the need for some sort of monitoring.
As of vSphere 5.1, the foundation edition of vCOPS is included in the package for free, so there is really very little reason to not have at least a basic level of monitoring in place.
According to VMware, the following are some of the benefits of deploying vCOPS:
- A 36% reduction in downtime at the application level
- A 26% reduction in the time it takes to diagnose and resolve a problem
- A 40% increase in capacity utilization
- A 37% increase in consolidation ratios (the number of VMs hosted per ESXi server)
There have been case studies where organizations wanted to reduce false alarms to as close to zero as possible, and it often took many months (sometimes as long as a year) of data-gathering, weekly meetings to discuss the data, incremental changes to alarms, and analysis of the results before the false-alarm rate reached an acceptable level. The cost of doing so is much higher than implementing vCOPS, which also brings many other capabilities to the table.