Which factors should we consider as monitoring metrics?
Choosing a monitoring solution is one of the most important decisions an IT manager has to make. But before selecting the software or platform, it’s vital to identify the critical infrastructure components that need to be monitored. With the growth of virtualization and cloud computing, many aspects need attention, especially physical resources and virtualization components. Analyzing detailed metrics for each of these parts brings benefits such as protecting virtual machines against failures and increasing availability across the virtualization infrastructure. The factors to consider include the following:
Physical components:
- Computing resources consumption: Physical processor and memory
- Cluster available capacity: the cluster’s remaining resources during host failures
- Availability of hosts: ESXi heartbeat issues, failover network infrastructure, and VMkernel settings
- Storage usage during work hours and backup operations: Datastore IOPS and rate of space usage
- Over-allocation and under-allocation of each physical resource: CPU, RAM, NIC, Disk
- Memory ballooning and dedicated datastores for swap files
- Extra VM log files and their unexpected storage usage
- Missing or outdated VMware Tools on the virtual machines
- Old leftover snapshots and long chains of parent-child VMDK files
- Inactive or unused virtual machines
- Unused mounted ISO files and old connections to physical media
- Orphaned VM files, especially VMDK files
- Rate of VM’s memory swapping and overall memory performance
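Two of the items above, cluster capacity during host failures and over-allocation of physical resources, come down to simple arithmetic, so a rough sketch may help. The example below is plain Python and does not call any VMware API. The `Host` class, the function names, and all the numbers are hypothetical and only serve as illustration. In practice the inputs would come from whatever monitoring tool you use.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical record of one physical host's total capacity."""
    name: str
    cpu_mhz: int
    memory_mb: int


def capacity_after_failures(hosts, failures=1):
    """Return (cpu_mhz, memory_mb) left if `failures` hosts go down.

    Conservative estimate: for each resource we assume the hosts with
    the largest capacity are the ones that fail.
    """
    keep = max(len(hosts) - failures, 0)
    cpu = sorted(h.cpu_mhz for h in hosts)[:keep]
    mem = sorted(h.memory_mb for h in hosts)[:keep]
    return sum(cpu), sum(mem)


def overcommit_ratio(allocated, physical):
    """Ratio of resources allocated to VMs over physical resources.

    A value above 1.0 means over-allocation, a value far below 1.0
    suggests under-allocation.
    """
    if physical <= 0:
        raise ValueError("physical capacity must be positive")
    return allocated / physical


# Illustrative numbers only.
cluster = [
    Host("esxi-01", cpu_mhz=10000, memory_mb=65536),
    Host("esxi-02", cpu_mhz=20000, memory_mb=131072),
    Host("esxi-03", cpu_mhz=10000, memory_mb=65536),
]

cpu_left, mem_left = capacity_after_failures(cluster, failures=1)
print(f"After one host failure: {cpu_left} MHz, {mem_left} MB remain")
print(f"vCPU to core ratio: {overcommit_ratio(48, 32):.2f}")
```

Comparing the remaining capacity with the current demand of the running virtual machines shows whether the cluster can survive a host failure without starving workloads. Which ratio counts as too high depends on the workload, so treat any threshold as a local decision rather than a fixed rule.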
Although some of these issues are very easy to resolve, they require a real-time 24/7 monitoring system as well as dedicated response teams that can react properly to expected or even unexpected problems. Regardless of the monitoring solution chosen for your infrastructure, it’s more important to have well-prepared plans for responding to forecasted challenges, availability issues, and every detected incident that puts the infrastructure or data center at risk.