I am writing another post to emphasize how important vCenter Server availability is, since it is the most critical component of any virtualized environment based on VMware products. Recently we ran into a strange situation: all ESXi hosts in different datacenters (sites) were suddenly disconnected from the vCenter Server, and the reconnect operation stayed pending at 0% in the Tasks section. Browsing the sub-objects under a datacenter was slow, and removing and re-adding the affected hosts was not possible. While investigating this catastrophic issue, we checked all of the following:
- DNS configuration on both sides, the VCSA and several of the disconnected ESXi hosts: everything seemed to work correctly.
- Restarting the management agents (VPXA and Hostd) did not help; the problem persisted. All related services on the VCSA, such as VPXD, also restarted successfully.
- The VCSA guest OS partitions, especially the /storage/archive directory, had enough free space. (I described how to resolve a VCSA service interruption caused by low disk space in two earlier posts: Part1 and Part2.)
- There was no reason to suspect recent infrastructure changes such as modified firewall rules, because all ESXi hosts from all clusters in every datacenter were affected. Rolling back to the last network device configuration did not help either. Running tcpdump on both sides and examining the output also gave us enough evidence that the issue was not related to the network configuration.
- Even restoring a backup taken in a healthy state could not resolve the issue: after the vCenter Server booted, the hosts were still disconnected. Repeating the agent restarts described in the second item did not bring us back to a normal state either.
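The free-space item in the checklist above is easy to script. Here is a minimal sketch in Python, assuming it runs in a shell on the VCSA where a Python interpreter is available. The 20% threshold and the function name `free_space_report` are my own illustrative choices, not VMware defaults.

```python
import shutil


def free_space_report(path, min_free_fraction=0.2):
    """Return (free_fraction, ok) for the filesystem that holds `path`.

    min_free_fraction is an assumed threshold, not an official value.
    """
    usage = shutil.disk_usage(path)
    # Guard against pseudo filesystems that report a total size of zero.
    free_fraction = usage.free / usage.total if usage.total else 0.0
    return free_fraction, free_fraction >= min_free_fraction


if __name__ == "__main__":
    # Both mount points are taken from the paths mentioned in this post.
    for mount in ("/storage/archive", "/storage/log"):
        try:
            fraction, ok = free_space_report(mount)
        except OSError as error:
            print(f"{mount}: cannot check ({error})")
            continue
        status = "OK" if ok else "LOW"
        print(f"{mount}: {fraction:.0%} free [{status}]")
```

Running something like this from a scheduled job would only tell you that a partition is filling up; it does not replace the cleanup steps themselves.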
Continuing the troubleshooting, I decided to dive deeper into the vCenter log files, especially those in this directory: /storage/log/vmware/vpxd.
There were only some info and warning messages. For example, on every resync between the VCSA and the hosts belonging to a datacenter, it logged a message like this for each ESXi host: “info vpxd [id] [originator … HeartbeatModuleStart - …] Certificate not available, starting hostsync for host: host-id”
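To see which hosts trigger this message and how often, the log can be scanned with a few lines of Python. This is only a sketch: it matches the literal text quoted above and makes no assumption about the rest of the line format. The function name is my own, and the log file path is passed on the command line rather than guessed.

```python
import re
import sys
from collections import Counter

# Matches only the message text quoted in this post; the host id is
# whatever non-space token follows "for host:".
HOSTSYNC_RE = re.compile(
    r"Certificate not available, starting hostsync for host: (\S+)"
)


def count_hostsync_messages(lines):
    """Count hostsync messages per host id across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HOSTSYNC_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python script.py <path to a log file under /storage/log/vmware/vpxd>
    with open(sys.argv[1], errors="replace") as log_file:
        for host_id, count in count_hostsync_messages(log_file).most_common():
            print(f"{host_id}: {count}")
```

If every host shows up with a similar count, as it did for us, the message points to something common to all hosts rather than to one broken certificate.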
However, we checked all the certificates in use and found nothing related. So I decided to go back to square one: bring up an older restore point that was in a better state and repeat everything I had done, with one important additional step: ignore the current time settings on the ESXi hosts (even NTP) and synchronize the “hardware clock/system time” of a selected host exactly with the time currently configured on the vCenter Server. I then restarted the VPXA/Hostd agents again, and after a few moments I saw the ESXi object respond in the vSphere Web Client. At that point we could run the “connect the host” action: it no longer stayed pending, and it even popped up a warning that the vpxuser account was not correct: “cannot complete login due to an incorrect username or password”. Finally, the connect wizard completed easily and the ESXi host stayed in its normal state.
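Since aligning the host clock with the vCenter Server was the step that made the difference, it is worth comparing clocks early in any similar investigation. The following Python sketch only does the comparison; it assumes you have already collected the current UTC time from the vCenter Server and from each host yourself. The host names, the sample times, and the 30-second tolerance are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def skewed_hosts(vcenter_time, host_times, tolerance=timedelta(seconds=30)):
    """Return {host name: skew} for hosts whose clock differs from the
    vCenter time by more than `tolerance` (an assumed value)."""
    skewed = {}
    for name, host_time in host_times.items():
        skew = abs(host_time - vcenter_time)
        if skew > tolerance:
            skewed[name] = skew
    return skewed


if __name__ == "__main__":
    # Hypothetical readings, all expressed in UTC.
    vcenter_now = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    readings = {
        "esxi-a": vcenter_now + timedelta(seconds=5),
        "esxi-b": vcenter_now - timedelta(minutes=12),
    }
    for name, skew in skewed_hosts(vcenter_now, readings).items():
        print(f"{name}: clock differs from vCenter by {skew}")
```

Comparing everything in UTC avoids false alarms from time zone differences between the appliance and the hosts.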
So what was the root cause of this disaster? I have not found it yet. However, I still suspect a MAC duplication/mismatch issue between the vCenter and the ESXi hosts. In any case, there are some essential tips that we should always keep in mind:
- vCenter is the core component of vSphere management for VDS, Cluster, vSAN, and Template objects, and it is also the primary connection point for other solutions such as NSX, Horizon View, and vRealize, or even third-party solutions such as virtual machine backup and replication software. They do not depend on the vCenter Server for every operation, but any vCenter interruption can cost them critical parts of their own management functions. For example, desktops already deployed in a Horizon View environment are always reached through their Horizon Agent, but to generate new desktops the Connection Server has to call vCenter to create them from its Template/VM, depending on the desktop pool type.
- Having a backup system and a scheduled job to protect the vCenter Server is not enough to keep this primary component of virtualization safe. Even a vCenter HA setup cannot guarantee every aspect of availability (in our case, the VCHA solution could not help us). So we always need to monitor the whole system: spot-check for warnings in the VAMI interface, review the detailed log files from the shell, configure Syslog, and inspect its output regularly.
- Some important configurations are easy, but ignoring them is even easier: DNS, NTP, Syslog, and so on. Never postpone configuring them, because each one can lead our infrastructure to a sudden interruption. Other settings, such as SNMP, are a little more complex than these, but we can always take advantage of automation. Writing scripts with PowerCLI cmdlets is not easy for every administrator, but it is enough to write them once and reuse them forever. If your vSphere license allows it, you can also use features like Host Profiles to configure these settings for all managed ESXi hosts.
- vCenter Server restore points (via VAMI or third-party solutions) must be scheduled according to how often the vSphere infrastructure changes, so we need a reliable Change Management procedure that aligns the backup system with every type of modification: removing a host from a Cluster, adding a host to a Distributed vSwitch, changing permissions and credentials, and so on.
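As a small illustration of the "write it once, use it forever" idea from the tips above, here is a Python sketch that flags hosts with a missing DNS, NTP, or Syslog setting. It is not PowerCLI and does not talk to vCenter: it assumes you have already exported the settings into a dictionary by whatever means you prefer, and the key names, host names, and addresses are hypothetical.

```python
# Setting names are illustrative keys for an exported inventory,
# not property names from any VMware API.
REQUIRED_SETTINGS = ("dns_servers", "ntp_servers", "syslog_target")


def find_unconfigured(hosts):
    """Return {host name: [missing setting names]} for hosts where a
    required setting is absent or empty."""
    gaps = {}
    for name, settings in hosts.items():
        missing = [key for key in REQUIRED_SETTINGS if not settings.get(key)]
        if missing:
            gaps[name] = missing
    return gaps


if __name__ == "__main__":
    # Hypothetical inventory for two hosts.
    inventory = {
        "esxi-a": {
            "dns_servers": ["192.0.2.10"],
            "ntp_servers": ["192.0.2.20"],
            "syslog_target": "udp://192.0.2.30:514",
        },
        "esxi-b": {"dns_servers": ["192.0.2.10"], "ntp_servers": []},
    }
    for name, missing in find_unconfigured(inventory).items():
        print(f"{name}: missing {', '.join(missing)}")
```

The same check could be expressed with Host Profile compliance where the license allows it; the point is only that the check should run regularly instead of relying on memory.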