Thursday, March 26, 2020

Irrational ESXi host disconnecting problem

In relation to the VMware Hardware Compatibility Guide (VMware HCL), I want to describe a strange experience I had with this prerequisite check.
Some months ago we had a major issue in the virtual infrastructure of one of our projects, which caused many irregular problems. Everything appeared normal, but the web client suddenly showed all hosts (inside and outside the cluster) as disconnected from the vCenter Server, even though they were actually working correctly. When I checked them, all of the following were OK:
  • Health status of all hosts and VMs was normal.
  • VPXA and Hostd were running without problems and every ESXi host was reachable on the network.
  • All distributed switches (VDS) and dvPortGroups appeared healthy and all virtual machines were connected.
  • The VCSA management interface (VAMI) showed the vCenter components as healthy.
After two days of checking and investigating all logs related to the hosts and the vCenter Server, we still couldn't find the cause of the problem. So we decided to restart them one by one. But right after restarting the vCenter Server, we ran into another unexplained problem: the vCenter virtual machine was gone.
Because the hosts had been added in Lockdown Mode (Normal type), I was forced to register that VM from the CLI, via shell access, using the following vim-cmd command:

vim-cmd solo/registervm /vmfs/volumes/SAN1/VCSA/VCSA.vmx
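
For reference, the register and power-on steps can be sketched as a small shell function. This is only a minimal sketch, not the exact sequence we ran: the function name register_and_power_on and the VIM_CMD variable are hypothetical helpers for illustration, and it assumes that vim-cmd solo/registervm prints the numeric ID of the newly registered VM on success.

```shell
#!/bin/sh
# Sketch: register a VM from its .vmx file and power it on from the ESXi shell.
# VIM_CMD is a hypothetical override so the function can be dry-run off-host.
VIM_CMD="${VIM_CMD:-vim-cmd}"

register_and_power_on() {
    vmx_path="$1"

    # Assumption: registervm prints the numeric VM ID on success.
    vmid="$($VIM_CMD solo/registervm "$vmx_path")" || return 1

    # Refuse to continue if the output is empty or not a plain number.
    case "$vmid" in
        ''|*[!0-9]*) echo "unexpected registervm output: $vmid" >&2; return 1 ;;
    esac

    # Power on the VM; send progress messages to stderr so that
    # stdout carries only the VM ID.
    $VIM_CMD vmsvc/power.on "$vmid" >&2 || return 1
    echo "$vmid"
}
```

On the host this would be called as register_and_power_on /vmfs/volumes/SAN1/VCSA/VCSA.vmx, and the printed ID can then be passed to vim-cmd vmsvc/power.getstate to confirm the power state.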

But re-adding and powering on the VCSA fixed nothing. Unfortunately, the dvPortGroups were not of the Ephemeral port binding type, so we couldn't connect the vCenter vNIC to the VDS.
For the next step of troubleshooting, I thought it necessary to go back to the starting point and check everything about the hosts in more detail. After looking them up in the VMware HCL, I saw that the physical hosts were not compatible with the ESXi version (the servers are HP ProLiant DL380 G8 and the ESXi version was 6.7 U3). So the only solution left was to downgrade the hosts to an earlier version. After doing this, the situation finally returned to normal and everything worked correctly.
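
To look a host up in the HCL you need its exact ESXi release and server model. As a rough sketch, the helper below condenses the output of esxcli system version get into one line; the function name summarize_esxi_version is hypothetical, and the parsing assumes the output has "Version:" and "Update:" lines, so verify it against your own host before relying on it.

```shell
#!/bin/sh
# Sketch: reads `esxcli system version get` output on stdin and prints
# the release in the form "<version> U<update>".
# Assumption: the output contains lines such as "Version: ..." and "Update: ...".
summarize_esxi_version() {
    awk -F': *' '
        $1 ~ /Version$/ { version = $2 }
        $1 ~ /Update$/  { update = $2 }
        END {
            # Fail if no Version line was seen at all.
            if (version == "") exit 1
            print version " U" update
        }
    '
}
```

On the host it would be used as esxcli system version get | summarize_esxi_version, and esxcli hardware platform get shows the vendor and product name to search for in the HCL.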
The only problem left was related to the VDS version. Having settled on the downgrade, we unfortunately found that the ESXi hosts couldn't attach to the VDS version 6.6, because that version is incompatible with ESXi 6.5 U3. So we were forced to deploy the VDS structure again.
As a general conclusion: to avoid disruptions or problems like this one, always remember to check the VMware Hardware Compatibility List before you choose an ESXi version as the hypervisor for your physical server. This is not just a recommendation; it is a requirement. So do not ignore the VMware HCL information.
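
The VDS constraint can be checked before a downgrade with a small lookup. The sketch below is illustrative only: the function names are hypothetical and the table covers just the VDS versions relevant to this story (VDS 6.6.0 needs ESXi 6.7 or later, while VDS 6.5.0 and 6.0.0 need ESXi 6.5 and 6.0 respectively). Host versions are expected in major.minor form.

```shell
#!/bin/sh
# Sketch: minimum ESXi release required by a given VDS version.
# Only the versions relevant to this post are listed.
min_esxi_for_vds() {
    case "$1" in
        6.0|6.0.0) echo 6.0 ;;
        6.5|6.5.0) echo 6.5 ;;
        6.6|6.6.0) echo 6.7 ;;
        *) return 1 ;;   # unknown VDS version
    esac
}

# Succeeds (exit 0) if an ESXi host at version $2 can join a VDS at version $1.
# Returns 1 if the host is too old, 2 if the VDS version is not in the table.
host_can_join_vds() {
    min="$(min_esxi_for_vds "$1")" || return 2

    # Split "major.minor[.patch]" into integer parts.
    host_major="${2%%.*}"
    host_rest="${2#*.}"
    host_minor="${host_rest%%.*}"
    min_major="${min%%.*}"
    min_minor="${min#*.}"

    [ "$host_major" -gt "$min_major" ] && return 0
    [ "$host_major" -eq "$min_major" ] && [ "$host_minor" -ge "$min_minor" ]
}
```

For our case, host_can_join_vds 6.6 6.5 fails, which is exactly why the downgraded hosts could not rejoin the existing switch.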

