Saturday, February 29, 2020

vExpert 2020


It came and, unfortunately, went faster than I could believe :/  Just because of my country: "Iran"
I hope ... and I will keep it to remind me and other Iranian IT men and women like me all over the world: "One day will come and everything will be OK on that day ... no need to be ashamed, no need to be quiet"


 



Wednesday, February 19, 2020

What is the vSphere Fault Domain Manager (FDM) agent

The FDM agent is the part of vSphere HA that monitors the availability of the ESXi hosts and their VMs, and that handles the power operations of protected VMs in case of failures. With the release of vSphere 5.0, one of the greatest changes happened inside the VMware clustering architecture: the operation of vSphere HA. The AAM agent (Automated Availability Manager) was replaced by the FDM agent (Fault Domain Manager), which brought several differences, including:
  1. Reduced time for cluster configuration.
  2. Introduction of datastore heartbeating to prevent restarting of VMs in case of host isolation. The HA mechanism no longer decides based on the network situation alone; there must be at least two heartbeat datastores to help answer the question "Has a real failure actually happened?"
  3. No more dependency on the VPXA agent.
  4. Unlike the old AAM agent, the FDM agent does not rely on DNS to set up an HA cluster.
  5. There are no more primary/secondary nodes; this was changed to a master/slave relationship with an automated election system to choose the master node.
The FDM agent on each ESXi host (whether a master or slave node) is responsible for communication with the vCenter Server. However, you should understand one important point about HA operation: even if vCenter is down, the HA agents on the ESXi hosts will still respond to any host or VM failures; but without the vCenter Server you cannot reconfigure or restructure the cluster settings (including HA as a part of them).
FDM also supports jumbo frames for communication with an MTU size larger than 1500 bytes, but be careful when you decide to change this value: the MTU size must be consistent on every device in the path of the HA failover operation from host to host.
Also, enabling Lockdown mode on an ESXi host can sometimes interrupt FDM operation. So during FDM troubleshooting, in addition to reconfiguring HA, you may need to disable Lockdown mode temporarily.
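As a quick illustration of the election mechanism mentioned above, the master election events can be spotted in fdm.log. Here is a minimal Python sketch over captured log text; note that the sample lines below are hypothetical examples, not real FDM log wording, which differs per build:

```python
# Sketch: scan fdm.log-style text for master election events.
# The sample lines are made up for illustration only.
sample_log = """\
2020-02-19T10:00:01Z info fdm[1001] Election starting
2020-02-19T10:00:03Z info fdm[1001] Slave connected to master
2020-02-19T10:00:05Z info fdm[1001] Datastore heartbeat established
"""

def election_events(log_text):
    """Return the log lines that mention the master election."""
    keywords = ("election", "master")
    return [line for line in log_text.splitlines()
            if any(k in line.lower() for k in keywords)]

for event in election_events(sample_log):
    print(event)
```

On a real host you would feed this the contents of /var/log/fdm.log instead of the sample text.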

Related files to the FDM
  • First of all, there is the fdm.log file in the /var/log directory, which records every event related to FDM agent operations.
  • Also, fdm-profiler-1.log in /var/log/vmware/fdm shows FDM information about the version, build and service PID.
  • All of the configuration files, including the FDM agent config, cluster settings and the list of member nodes in the cluster, are in the following path: /etc/opt/vmware/fdm
At last, to check the status of the FDM VIB package, run the following ESXCLI command: 
esxcli software vib list | grep vmware-fdm
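If you capture that command's output to a file, a small Python sketch can pull out the FDM entry. The sample output line below is illustrative only; the real version and install date values will differ on your host:

```python
# Sketch: find the vmware-fdm VIB in captured `esxcli software vib list` output.
# The sample version/date values are made up for illustration.
sample_output = """\
Name        Version                Vendor  Acceptance Level  Install Date
----------  ---------------------  ------  ----------------  ------------
vmware-fdm  6.7.0-1.0.12345678     VMware  VMwareCertified   2020-02-19
"""

def find_vib(output, name):
    """Return the (name, version) of a VIB, or None if it is not installed."""
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return fields[0], fields[1]
    return None

print(find_vib(sample_output, "vmware-fdm"))
```

A `None` result would mean the FDM VIB is missing, which usually calls for a "Reconfigure for vSphere HA" on the host.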


Wednesday, February 12, 2020

vSphere Storage Troubleshooting - Part 1: HBA & Connectivity

 Storage infrastructure is one of the main parts of an IT environment, so a good design and principled configuration lead to better and easier troubleshooting of every possible issue in this area. One of the primary components of storage infrastructure is the HBA, the connector between the servers and the storage area. So we can trace some of the most likely storage-related problems back to the Host Bus Adapter installed in the ESXi host and its physical connections to the SAN storage or SAN switches. So let's begin a step-by-step investigation of storage troubleshooting inside the VMware infrastructure.

 The first situation may occur with a local array of disks that is not detected as a local datastore. You can check the status of the internal disk controller (for example, in an HP ProLiant server) by running the following command:
  
cat /proc/driver/hpsa/hpsa0

The result will be shown like this 
(please note where I use the lowercase hba and where the capital form):


 But if the considered datastore is not local, but rather a shared volume on an existing SAN storage in our infrastructure, then we must check the HBA status:

/usr/lib/vmware/vmkmgmt_keyval -a | less
 The last mentioned command works in ESXi version 5.5 and higher, so for older versions you must check the following directories for the two most popular HBA vendors:
  •    Qlogic:   /proc/scsi/qla2xxx
  •    Emulex: /proc/scsi/lpfc




 

 Also, if you don't find the related vmhba adapter in the output of the following commands, it means the ESXi host has not detected your HBA yet:
  • vmkchdev -l | grep hba
  • esxcfg-info | grep HBA
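For a scripted check, the captured `vmkchdev -l` output can be scanned for vmhba names. This is a minimal Python sketch; the PCI IDs and device names in the sample are made up for illustration:

```python
# Sketch: check whether any vmhba adapter appears in captured
# `vmkchdev -l` output. The sample lines are illustrative only.
sample_output = """\
0000:00:01.0 8086:1521 15d9:1521 vmkernel vmnic0
0000:00:02.0 1077:2532 1077:015c vmkernel vmhba2
"""

def detected_hbas(output):
    """Return the vmhba device names found in the listing."""
    return [field for line in output.splitlines()
            for field in line.split() if field.startswith("vmhba")]

print(detected_hbas(sample_output))
```

An empty list here would mean the host has not detected any HBA at all.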

 
 You can also run the swfw.sh command and combine it with grep to find information about the HBA devices connected to the ESXi host, including the device model, driver, firmware and also the WWNN for an FC HBA (the InstanceID value):

 /usr/lib/vmware/vm-support/bin/swfw.sh | grep HBA

 
 In another situation, imagine you have deployed a new SAN storage inside a vSphere cluster, but you are not sure whether the HBA could detect the provided LUN or not. As the first step, run the below ESXCLI command:
esxcli storage core device list

 In the shown result, please check the important fields, like these ones: Display Name, Device Type, Devfs Path, Vendor & Model. 
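Those fields can also be extracted programmatically from the captured output, since each device block uses "Key: Value" lines. A minimal Python sketch; the NAA identifier and values in the sample are made up for illustration:

```python
# Sketch: pull the interesting fields out of captured
# `esxcli storage core device list` output. The sample values are made up.
sample_output = """\
naa.600508b1001c16aa
   Display Name: Local HP Disk (naa.600508b1001c16aa)
   Device Type: Direct-Access
   Devfs Path: /vmfs/devices/disks/naa.600508b1001c16aa
   Vendor: HP
   Model: LOGICAL VOLUME
"""

WANTED = ("Display Name", "Device Type", "Devfs Path", "Vendor", "Model")

def device_fields(output):
    """Return a dict of the fields we care about from a device block."""
    fields = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.strip().partition(":")
            if key in WANTED:
                fields[key] = value.strip()
    return fields

print(device_fields(sample_output))
```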

 Next, you can run the following command; it will give you back more information about the HBA adapters and the state of each one of them:

esxcli storage core adapter list

 VMware Definition Tip1: NAA (Network Addressing Authority) or EUI (Extended Unique Identifier)  is the preferred method of identifying LUNs and the number that follows is generated by the storage device itself. Since the NAA or EUI is unique to the LUN, if the LUN is presented the same way across all ESXi hosts, the NAA or EUI identifier remains the same.
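To illustrate the naming convention from the tip above, here is a tiny Python sketch that classifies a device identifier by its prefix; the identifier values used below are made up for illustration:

```python
# Sketch: recognize NAA/EUI-style device identifiers by their prefix.
# The example identifiers are made up for illustration only.
def id_scheme(device_id):
    """Return 'naa', 'eui' or 'other' based on the identifier prefix."""
    for prefix in ("naa.", "eui."):
        if device_id.startswith(prefix):
            return prefix.rstrip(".")
    return "other"

print(id_scheme("naa.600508b1001c16aa0123456789abcdef"))
print(id_scheme("mpx.vmhba32:C0:T0:L0"))
```

Because the identifier is generated by the storage device itself, the same LUN keeps the same naa./eui. name on every ESXi host it is presented to.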

 Also, this command will show you the list of partitions available and detected by the ESXi host: 
esxcli storage core device partition list


 VMware Definition Tip2: You may see two partition type IDs, fb & fc: fb is the system ID for VMFS and fc is the VMkernel core dump partition (vmkcore)
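The two type IDs from the tip above can be turned into a small lookup when post-processing the partition list output. A minimal Python sketch covering only those two IDs:

```python
# Sketch: map the partition type IDs seen in `esxcli storage core device
# partition list` output to their meaning (only the two types from the tip).
PARTITION_TYPES = {
    0xfb: "VMFS",
    0xfc: "vmkcore (VMkernel core dump)",
}

def describe_partition(type_id):
    """Return a human-readable name for a partition type ID."""
    return PARTITION_TYPES.get(type_id, "unknown")

print(describe_partition(0xfb))
print(describe_partition(0xfc))
```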






 There are more useful storage commands, like the old-school CLI esxcfg-scsidevs (-a shows the HBA devices, -m shows the mapped VMFS volumes and -l lists all known logical devices).


 So finally, as the conclusion of this first part of troubleshooting problems related to the storage side of a vSphere environment: we understood that we need to check the status of the HBAs, how they are performing, and the disk devices, LUNs & volumes connected through each one of them. I hope it can be helpful for you all ;)

I will start a new journey soon ...