
Thursday, June 23, 2022

vCenter Server suddenly loses all host connectivity

Once again I am writing a new post to emphasize the importance of vCenter Server availability as the most critical component of any virtualized environment based on VMware products. Recently we faced a strange situation: all ESXi hosts from different datacenters (sites) were suddenly disconnected from the vCenter Server, and the reconnect operation hung at 0% in the Tasks section. When we tried to explore the sub-objects inside a datacenter, the operation was delayed, and removing and re-adding the hosts was not possible. While investigating this catastrophic issue, we checked all of the following:

  1. DNS configuration on both sides (VCSA and some of the disconnected ESXi hosts): everything seemed to work correctly.
  2. Resetting the management agents (vpxa and hostd) did not help and the problem persisted. All related services in the VCSA, such as vpxd, could also be restarted successfully.
  3. The VCSA guest OS partitions, especially the /storage/archive directory, had enough free space. (I have described how to resolve VCSA service interruptions caused by low disk space in two other posts: Part1 and Part2.)
  4. There was no reason to suspect recent infrastructure changes such as modified firewall rules, since ESXi hosts from all clusters in every datacenter were affected. A rollback to the latest network device configuration brought no success either. Running tcpdump on both sides and watching the results also gave us enough evidence that this issue was not related to the network configuration.
  5. Even restoring a backup to a known-good state could not resolve the issue: after booting the vCenter Server, all hosts were still disconnected, and repeating the actions from item 2 never brought us back to a normal situation.
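The packet-capture check in item 4 can be sketched as follows. This is a hedged outline, not an exact transcript of our session: vmk0, eth0, and the vCenter IP are placeholders for your management VMkernel port, the VCSA network interface, and your VCSA address. ESXi-to-vCenter host heartbeats travel over UDP 902, so seeing them arrive on both sides is a strong hint that the network path and firewall rules are fine.

```shell
# On the ESXi side (requires an ESXi shell):
# watch the heartbeat traffic toward the vCenter Server on UDP 902.
tcpdump-uw -n -i vmk0 host <vcenter-ip> and udp port 902

# On the VCSA side, plain tcpdump is available in the appliance shell:
tcpdump -n -i eth0 udp port 902
```

If the heartbeats leave the host but never arrive at the VCSA (or vice versa), the problem is on the network; if they arrive on both sides, as in our case, look elsewhere.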

Continuing the troubleshooting, I decided to dive deep into the vCenter log files, especially those inside the /storage/log/vmware/vpxd directory.

There were just some info and warning messages; for example, on every resync between the VCSA and the hosts belonging to a datacenter, it logged a line like this for each ESXi host: “info vpxd [id] [originator … HeartbeatModuleStart - …] Certificate not available, starting hostsync for host: host-id”

However, we checked all the certificates in use and found nothing relevant. So I decided to go back to square one: bring up an older restore point in a better state and redo everything I had done before, with one more important operation: ignore all current ESXi host time settings (even NTP) and synchronize the hardware clock/system time of a selected host exactly with the currently configured time of the vCenter Server. Then I restarted the vpxa/hostd agents again, and after a few moments I saw the ESXi object react in the vSphere web client. This time we could run the “connect the host” action completely: it no longer hung, and it even popped up a warning that the vpxuser credentials were incorrect: “cannot complete login due to an incorrect username or password”. Finally, the connect wizard completed easily and the ESXi host stayed in its normal state.
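The recovery steps above can be sketched as ESXi shell commands. This is a hedged outline under assumptions, not an exact transcript: the date/time values below are examples and must be replaced with the vCenter Server's actual current time.

```shell
# Set the ESXi system time to mirror the vCenter Server's clock
# (example values; adjust year/month/day/hour/minute to match the VCSA).
esxcli system time set -y 2022 -M 06 -d 23 -H 10 -m 30 -s 00

# Keep the hardware clock in sync with the system time,
# otherwise the host can drift back after a reboot.
esxcli hardware clock set -y 2022 -M 06 -d 23 -H 10 -m 30 -s 00

# Restart the management agents so vCenter re-handshakes with the host.
/etc/init.d/vpxa restart
/etc/init.d/hostd restart
```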

So what was the root cause of this disaster? I have not found it yet. I still suspect a MAC duplication/mismatch issue between the vCenter Server and the ESXi hosts, but there are some essential tips that we should always keep in mind:

  1. vCenter is the core component of vSphere management for VDS, Cluster, vSAN, and Template objects, and it is also the primary connection point for other solutions such as NSX, Horizon View, and vRealize, and even 3rd-party products like virtual machine backup & replication software. Although they do not depend on the vCenter Server for all operations, any interruption of vCenter can cost them some critical parts of their own management actions. For example, desktops already deployed in a Horizon View environment keep working through their connected Horizon Agent, but to generate new desktops, the Connection Server needs to call vCenter to create them from a Template/VM, depending on the desktop pool type.
  2. Having a backup system and a scheduled job to protect the vCenter Server is not enough to keep this primary component of virtualization safe. Even a vCenter HA setup cannot guarantee every aspect of availability (in our case, the VCHA solution couldn't help us). So we need to monitor the whole system continuously: periodically check for warnings in the VAMI interface, review the detailed log files in the shell, configure Syslog, and inspect the logs regularly.
  3. Some important configurations are easy, but ignoring them is even easier! DNS, NTP, Syslog, and so on. Never postpone their configuration, because each of them can lead your infrastructure to a sudden interruption. Although some other settings, like SNMP, are a little more complex than the ones just mentioned, we can always take advantage of automation. Writing scripts with PowerCLI cmdlets is not easy for every administrator, but it's enough to create them once and use them forever. If your vSphere license allows it, you can also use features like Host Profiles to apply these settings to all managed ESXi hosts.
  4. vCenter Server restore points (via the VAMI or 3rd-party solutions) must be scheduled based on how often the vSphere infrastructure changes, so we need a reliable change management procedure that ties the backup system to every type of modification: removing a host from a cluster, adding a host to a Distributed vSwitch, changing permissions and credentials, and so on.

Friday, November 29, 2019

tcpdump-uw vs pktcap-uw: How to use them

tcpdump-uw & pktcap-uw are two different tools for capturing and analyzing packets/frames received by or transmitted from an ESXi host. In some troubleshooting situations, especially networking and communication problems, you will need these tools. In this post I want to demonstrate how to work with these useful CLIs.
tcpdump-uw is a great CLI that exists on the ESXi host for packet capturing. Most of the time we need to know the details of the network traffic of each VMkernel port on the ESXi host, but before that, you need to understand, verify, and analyze the results of the tcpdump-uw command.
Before working with tcpdump-uw, we need to list the existing VMkernel ports on the host by running:

esxcli network ip interface ipv4 get
or you can list the capture interfaces via tcpdump-uw -D. Some useful options:
-i  select the interface/network adapter on which to listen for Rx/Tx packets
-n  do not resolve names
-t  do not print timestamps
-c  specify the count of captured packets
-e  print the Ethernet header (MAC addresses) for each frame
-w  write the captured packets to a file
-s0 capture the entire packet (no truncation)

Also, if you need to exclude a specific protocol or port, for example HTTP traffic on TCP port 80, you can add not tcp port 80 to the filter expression.
It's possible to show more details of the captured data by adding the -v option (or -vv / -vvv for even more detail).

To filter on TCP flag states (SYN, PUSH, FIN, RST), match on the TCP header flags in the capture filter; with -q you can print quick, less verbose output instead:
tcp[tcpflags] & tcp-syn != 0  (likewise tcp-push / tcp-fin / tcp-rst)
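For example, flag-based capture filters on the management VMkernel port could look like this (vmk0 is an assumption for your management interface; these are sketches to adapt, run from an ESXi shell):

```shell
# Show only segments with the SYN flag set (connection attempts)
tcpdump-uw -n -i vmk0 'tcp[tcpflags] & tcp-syn != 0'

# Show segments with either FIN or RST set (connection teardowns)
tcpdump-uw -n -i vmk0 'tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'

# -q: quick output, suppressing most per-protocol detail
tcpdump-uw -n -q -i vmk0 tcp port 443
```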

Some examples of tcpdump-uw usage:

# tcpdump-uw -i vmk0 icmp
# tcpdump-uw -i vmk0 -w capturedpackets.pcap
# tcpdump-uw -i vmk0 host x.x.x.x
# tcpdump-uw -i vmk0 not arp and not port 22 and not port 53
# tcpdump-uw -i vmk0 -c 10

Just remember this CLI can only capture packets/frames at the VMkernel level, so to capture frames at the uplinks, the vSwitch, or a virtual port, pktcap-uw can be used for the other traffic of the ESXi host. By default pktcap-uw captures only inbound traffic, but since the release of ESXi 6.7 you can specify the direction:
 --dir 0 (Incoming) / --dir 1 (Outgoing) / --dir 2 (In/Out)
(Remember that in earlier versions you can capture in only one direction at a time.) Here is a list of useful pktcap-uw options:

--vmk vmk0  capture traffic on VMkernel port vmk0
--uplink vmnic0  capture traffic on physical NIC vmnic0
-o capturedfile.pcap  write the output to a file
-G 10  specify the capture duration in seconds
-C 100  specify the capture file size in megabytes
--switchport 11  specify an exact port on the virtual switch

Here are two examples of pktcap-uw (note that each run takes a single capture point, so --vmk and --switchport are used separately):
pktcap-uw --vmk vmk0 -o /vmfs/volumes/datastore1/_export_/capture.pcap -c 1000
pktcap-uw --switchport 6666 -o /vmfs/volumes/datastore1/_export_/capture.pcap -c 1000
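On ESXi 6.7 and later, a bidirectional uplink capture could be sketched as follows (vmnic0 and the datastore path are assumptions to replace with your own; run from an ESXi shell):

```shell
# Capture traffic in both directions (--dir 2) on physical NIC vmnic0,
# stopping after 60 seconds (-G) or when the file reaches 100 MB (-C).
pktcap-uw --uplink vmnic0 --dir 2 -G 60 -C 100 \
    -o /vmfs/volumes/datastore1/uplink-capture.pcap
```

The resulting .pcap file can then be copied off the host and opened in Wireshark for analysis.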

For more information you can refer to the following link:

https://www.virten.net/2015/10/esxi-network-troubleshooting-with-tcpdump-uw-and-pktcap-uw/


Wednesday, January 9, 2019

Time difference between an ESXi host & the NTP server

 Yes, exactly: another post about the NTP service and the important role of time synchronization between virtual infrastructure components. In another post I described a problem with ESXi 6.7 time settings and also covered some useful CLIs for configuring time, manually or automatically. But in a lab scenario with many versions of ESXi hypervisors (because of the server hardware types, we cannot upgrade some of them to a higher version of ESXi), we planned to configure an NTP server as the "Time Source" of the whole virtual environment (PSC/VC/ESXi hosts and so on).
  But our first deployed NTP server was a Microsoft Windows Server 2012 machine, and there was a deceptive issue. Although the time configuration had been done correctly and time synchronization occurred successfully, when I was monitoring the NTP packets with tcpdump, I suddenly saw the time shift to another timestamp.
 


  As the first step of troubleshooting, I thought it might be caused by the time zone of the vCenter Server (but that was correct), or by the NTP client and NTP server not being the same version. (To check the NTP version on ESXi, use the NTP query utility: ntpq --version; you can also edit the ntp.conf file to pin an exact NTP version: vi /etc/ntp.conf and append "version #" to the end of the server line.) But NTP is a backward-compatible protocol, so I decided this was not the reason.

 So after more investigation into the cause of the problem, we decided to change our NTP server to, for example, a MikroTik router appliance. After the initial setup and NTP configuration on the MikroTik OVF, we changed our time source. Then, after setting the time manually again with "esxcli hardware clock" and "esxcli system time", we configured host time synchronization with NTP. The initial manual setting must be done because your time delta with the NTP server must be less than about one minute.



 Then, after restarting the NTP service on the host (/etc/init.d/ntpd restart), I checked again to make sure the problem had been resolved.



Thursday, October 11, 2018

NTP setting revert problem with ESXi 6.7

Last weekend we encountered a big problem with ESXi host time settings after upgrading one of the test servers to version 6.7 (and also the latest build, 8941472, though I'm not really sure that this was the cause of the problem). After the server started up, the NTP service stopped working correctly and, sadly, there was no way to change the time (manually or automatically via any NTP server): every change fell back to the defaults.
So after trying the GUI method, we edited the /etc/ntp.conf file and unfortunately nothing happened. As the last resort of NTP troubleshooting, useful ESXCLI commands helped us fix it with the command below:
 
# esxcli system time set -d 11 -H 01 -m 55 -M 05 -y 2018

To prevent possible host revert to old-time setting, you must ensure that the hardware clock is the same as the system time:

# esxcli hardware clock set -d 11 -H 01 -m 55 -M 05 -y 2018

There are some other commands, available via the shell or SSH, that are useful to know for ESXi NTP settings:

Enable NTP service:
# chkconfig ntpd on
# chkconfig --list | grep ntpd

Restart NTP service:
# /etc/init.d/ntpd restart

Display NTP peers:
# ntpq -p

Check ESXi time:
# esxcli hardware clock get
# esxcli system time get

Monitor NTP traffic between the host (as NTP client) and the NTP server:
# watch ntpq -p
# tcpdump-uw -c 5 -n -i vmk0 host NTP_server and port 123

I hope this is gonna be useful for you ;) ... and never lose your host's time as I did :D

Thursday, March 15, 2018

Analyze SNMP Traffic inside the ESXi

As a network admin, it's recommended to consider monitoring "ESXi hardware usage and network traffic" as one of your virtual infrastructure management tasks. Whether or not you use monitoring tools, the SNMP traffic generated by your host may run into an error. After reviewing your "community string" (SNMP v1/v2) or "credentials" (SNMP v3) and checking the network connection, if there is still a problem, you can run a useful command to inspect the SNMP traffic.
After logging in to the ESXi host directly (DCUI) or over an SSH connection (e.g. PuTTY), run this command to investigate the problem:

tcpdump-uw -vvv -i vmk0 -T snmp udp and port 162
 

This shows each SNMP packet transferred over UDP port 162. Also note the repeated "-vvv" option, which means you want to see more information in the command's output; you can also use just "-v" or "-vv".
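Keep in mind that UDP 162 carries only the traps the host sends to a trap receiver; polling from a monitoring server (get/get-next requests and their responses) uses UDP 161. A variant covering both cases could look like this (vmk0 is an assumption for your management VMkernel port; run from an ESXi shell):

```shell
# Traps sent by the ESXi SNMP agent to the trap receiver (UDP 162)
tcpdump-uw -vv -n -i vmk0 -T snmp udp port 162

# Poll traffic from a monitoring server querying the host (UDP 161)
tcpdump-uw -vv -n -i vmk0 -T snmp udp port 161
```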


I will start a new journey soon ...