Monday, December 16, 2019

Best practice for a good Virtualized Datacenter Design - Part 1


In this post and the other parts of this series, I will review some important guidelines for a good datacenter virtualization design. But before anything else, I want to ask a few major questions:
  1. What are the key components of an ideal virtual infrastructure for different IT environments? 
  2. How will you set up the virtual infrastructure?
  3. And what elements require attention before and after the deployment and implementation phases?
In this post and the other parts of this series, I want to dive deep into the details of a good virtual infrastructure design based on VMware products.
In this first part, I investigate the basic requirements and prerequisites for migrating an IT infrastructure to virtualization. In the other parts, I will review VMware's primary services and their impact on achieving this goal.

1. Physical to Virtual
The first step is to estimate the real physical resource needs of the services being provided. Processor clock rate (GHz), memory and disk usage (GB), and network transmission rate (Gbps) must be calculated separately for each existing service; only then can we talk about the resources required for server virtualization. We should also consider the hypervisor (ESXi host) overhead and add it to the total estimate.
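As a purely illustrative sizing example (all of the figures below are assumptions, not measurements from a real environment), the per-service estimates simply add up and then get a hypervisor margin on top:

3 x web server       : 3 x 2.0 GHz CPU, 3 x 4 GB RAM, 3 x 60 GB disk  =  6.0 GHz, 12 GB, 180 GB
1 x database server  : 8.0 GHz CPU, 32 GB RAM, 500 GB disk            =  8.0 GHz, 32 GB, 500 GB
Subtotal             : 14.0 GHz, 44 GB RAM, 680 GB disk
+ ~10% hypervisor/VM overhead (assumed figure)                        =  ~15.4 GHz, ~48 GB RAM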
P2V migration always impacts service availability and usually requires an operational downtime window for the migrated service/OS. There are also some complexities in this area, including:
  1. The OS type and whether it is supported by the converter application.
  2. Application dependencies on specific hardware (hardware-locked applications).
  3. Software licensing problems.
  4. SID/GUID change issues for services like Active Directory.
So below I have provided a questionnaire about the P2V operation; answer each question carefully before executing the real migration:
  1. Is it necessary to virtualize everything? And are you really sure about your answer? Why or why not: what is the reason for keeping a server in the physical area, or for migrating it to the virtual world? The answer depends on your infrastructure requirements, and you should answer it honestly for each important component and server in your infrastructure.
  2. Have you organized and prioritized the physical servers? Which ones must be at the top of the list, and which ones are good candidates for the pilot and test phase? I think selecting servers with low-risk, non-critical workloads is a good option for this stage.
Finally, you should build a checklist like the following to specify the servers' priority order:
  1. Application servers with low storage requirements and simpler network and OS configuration.
  2. Web servers with a normal request-handling rate and fewer dependencies to/from other servers.
  3. Network infrastructure services like VPN, DHCP, NPS.
  4. Mission-critical and organizational application servers.
  5. Database servers based on SQL Server, Oracle and so on.
  6. Unified communication services like mailbox, VoIP and IM servers.
  7. The most important services in the IT infrastructure, like directory services.
 
2. Storage resources… How to provision?
If the physical server is attached to a storage device/LUN/volume, two difficulties may exist:
  1. Lack of space, if all of the used storage must be migrated along with the server to the new space provided by the hypervisor's local storage.
  2. Access to the storage management system for zoning re-configuration and for providing storage access to the newly deployed VM.
On the other side, for services with critical transaction log files like Exchange Server, mailbox database migration must take the rate of sudden log space growth into account. Finally, in every kind of P2V migration, we need to pay close attention to temporary and permanent storage space.
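Before the migration it is also worth checking how much free space the target datastores really have; a quick check from the ESXi shell could look like this:

esxcli storage filesystem list      shows each mounted datastore with its total and free space
df -h                               a quick human-readable view of the same volumes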

3. Security considerations on par with the physical and traditional deployment
When choosing the virtualization platform, the selected solution must provide every security technology that is deployed in the physical network. It is recommended that every physical switch security feature, like MAC learning, Private VLANs and so on, be supported by the virtual switches. The Distributed vSwitch technology used in the VMware vSphere platform is an ideal virtual networking solution, supporting many advanced security and monitoring concepts like port mirroring and NetFlow. Besides the VMware Distributed Switch (VDS), products from vendors like Cisco, HP and IBM are supported by the vSphere networking platform; for example, the Cisco Nexus 1000V is designed as an integrated distributed vSwitch for the VMware platform. Of course, VDS design and migration from the vSphere Standard Switch (VSS) to the VDS has its own implementation considerations (which I reviewed in this video playlist on my YouTube channel).

4. Provide suitable physical resources for the virtual infrastructure
One of the important advantages of server virtualization over traditional server provisioning is the increase in service availability, and this requires building a VMware cluster. As a result, you must comply with the deployment prerequisites, such as using the same CPU generation and technologies in all ESXi members of the cluster.
It is also recommended to use a larger number of similar physical servers rather than fewer servers with more physical resources. For that reason, blade servers are a better choice for hypervisor hardware than other form factors like tower servers.

5. Do not forget the cleanup operation
After the migration has completed successfully, you should start the post-migration operations, including checking the virtual hardware devices detected in the VM and removing everything that is no longer required on the newly converted VM. For example, in a Windows guest OS you can run set devmgr_show_nonpresent_devices=1, then run devmgmt.msc, go to View > Show hidden devices, and finally remove the unnecessary or hidden (non-present) items.
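A short sketch of that cleanup on a Windows guest (run both commands from the same elevated command prompt, since the variable only affects Device Manager instances launched from it):

C:\> set devmgr_show_nonpresent_devices=1
C:\> devmgmt.msc

Then enable View > Show hidden devices and uninstall the grayed-out (non-present) devices left over from the old physical hardware.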
In the next part, I will talk about the power supply used for the computing and storage racks and how to calculate it.

Saturday, December 7, 2019

ESXCLI Networking - Part 2 (Video Series)

And now you can watch the next part of the ESXCLI video series. This is the second and last part on the ESXCLI network namespace (Networking); after this, I will address the other existing namespaces of ESXCLI based on ESXi 6.7.
I hope you enjoy it:
 

Tuesday, December 3, 2019

Virtualization Tip 1: The relation between physical CPU & virtual CPU

 Many people are confused about how a host really assigns CPU resources to the virtual machines; more precisely, how the processing operations of a VM are executed on the physical CPU resources. In Intel terminology the physical processor is a CPU socket, but in this post I consider a pCPU to be a physical core in one of the server's existing sockets.
By default, each vCPU added to a VM is scheduled onto one of the existing pCPUs. So if we configure 8 vCPUs for a VM, at least 8 pCPUs must exist in the host; in other words, if there are not enough pCPUs for the VM, it cannot be powered on.
 By design, VMware ESXi can handle CPU oversubscription (more requested vCPUs than existing processors/pCPUs), which means the pCPU:vCPU ratio is no longer one to one (1:1). In a vSphere environment, the ESXi host executes the processing requests of every VM, so it has to schedule processing time for each of them. But the question is: what ratio should be configured as the best setting? The answer depends on whether you prioritize capacity or performance, and it can vary widely based on the requirements of the virtualized applications...
 Each VM needs pCPU resources, so running many VMs, especially heavily used and resource-hungry virtual machines, demands more CPU cycles. So if you provision more VMs and also increase the pCPU:vCPU ratio (1:2, 1:4 or greater), the performance of the ESXi host will be affected.
 As VMware mentions, the vSphere ESXi scheduler prefers to keep the same vCPU-to-pCPU mapping to boost performance through CPU caching on the socket. If there is no specific documentation for the application's CPU sizing, you can start it with a single vCPU and then scale up as required; that way oversubscription will not have a serious negative impact.
 Also, remember that CPU Ready Time is just as important a metric as CPU utilization. Generally, the vCPU:pCPU ratio depends on many factors, like the following:
  1. The ESXi host version; each newer version supports a higher ratio.
  2. The features and technologies supported by the physical processor.
  3. The workload rates of critical applications implemented in the virtual environment.
  4. The capacity of the processor resources in the other members of the cluster and their current performance, especially when we require a higher level of host fault tolerance in the virtualization infrastructure. The available resources in the cluster determine which host each VM can be placed on in the event of a host failure.

Should we use Hyperthreading or not?!
 Hyperthreading is a great technology that makes a single physical core act as two logical processors. When the ESXi host is not heavily loaded, each physical core can run two independent threads at the same time. So if you have 16 physical cores in the ESXi host, after enabling HT (in both the BIOS configuration and the ESXi settings) you will see that the host has 32 logical processors. But using HT does not always increase performance; it is highly dependent on the application architecture, so in some cases you may even encounter performance degradation with HT enabled. Before enabling HT on your ESXi hosts, review the critical virtualized applications deployed in their VMs.
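As a minimal check from the ESXi shell (the output fields can vary slightly between ESXi versions), you can compare physical cores against logical threads, confirm whether HT is active, and keep an eye on CPU Ready:

esxcli hardware cpu global get      shows CPU packages, cores, threads and the Hyperthreading status
esxtop                              press 'c' for the CPU view and watch the %RDY column per VM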


Friday, November 29, 2019

tcpdump-uw vs pktcap-uw: How to use them

tcpdump-uw & pktcap-uw are two different tools for capturing and analyzing packets/frames received by or transmitted from an ESXi host. In some troubleshooting situations, especially networking and communication problems, you will need these tools. In this post I want to demonstrate how to work with these useful CLIs.
tcpdump-uw is a great CLI that ships with the ESXi host for packet capturing. Most of the time we need the details of the network traffic on each VMkernel port of the ESXi host, but before that, you need to understand, verify and analyze the results of the tcpdump-uw command.
Before working with tcpdump-uw, we need to list the existing VMkernel ports on the host by running:

esxcli network ip interface ipv4 get
or you can check it via tcpdump-uw -D. The most useful tcpdump-uw options are:
-i    select the interface/network adapter to listen on for Rx/Tx packets
-n    do not resolve names
-t    do not print timestamps
-c    specify the count of captured packets
-e    include the Ethernet frame header (MAC addresses) for each packet
-w    write the captured packets to a file
-s 0  capture entire packets (no truncation)

Also, if you need to exclude a specific protocol or port, for example HTTP traffic on TCP port 80, you can add the filter expression not tcp port 80.
It is possible to show more details of the captured data by adding the -v option (or -vv and -vvv for even more detail).

To match TCP headers and TCP flag states, use standard pcap filter expressions such as tcp[tcpflags] & tcp-syn != 0 (and likewise tcp-push, tcp-fin and tcp-rst); with -q you get quieter output with less protocol detail.

Some examples of tcpdump-uw usage:

# tcpdump-uw -i vmk0 icmp
# tcpdump-uw -i vmk0 -w capturedpackets.pcap
# tcpdump-uw -i vmk0 host x.x.x.x
# tcpdump-uw -i vmk0 not arp and not port 22 and not port 53
# tcpdump-uw -i vmk0 -c 10

Just remember this CLI can only capture packets/frames at the VMkernel level, so to capture frames at the uplinks, the vSwitch or a virtual port, pktcap-uw can be used for the other traffic of the ESXi host. By default pktcap-uw captures only inbound traffic, but starting with ESXi 6.7 you can specify the direction:
 --dir 0 (Incoming) / --dir 1 (Outgoing) / --dir 2 (In/Out)
(Remember that in earlier versions you can only capture one direction at a time.) Here is a list of useful pktcap-uw options:

--vmk vmk0       capture traffic on VMkernel port vmk0
--uplink vmnic0  capture traffic on physical uplink vmnic0
-o capturedfile.pcap  write the output to a file
-G 10            specify the capture duration in seconds
-C 100           limit the capture file size in megabytes
--switchport 11  specify an exact port on the virtual switch

Here is an example of pktcap-uw:
pktcap-uw --vmk vmk0 -o /vmfs/volumes/datastore1/_export_/capture.pcap --switchport 6666 -c 1000
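If you prefer to read a capture live instead of writing it to a file, a common trick (a sketch, assuming an ESXi 6.x host where both tools are available) is to pipe the pktcap-uw output straight into tcpdump-uw:

pktcap-uw --uplink vmnic0 -o - | tcpdump-uw -r -      stream frames from the uplink and decode them on the fly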

For more information you can refer to the following link:

https://www.virten.net/2015/10/esxi-network-troubleshooting-with-tcpdump-uw-and-pktcap-uw/


Thursday, November 14, 2019

VMware Tanzu Mission Control (TMC)

Do you need to accelerate the building of cloud-native applications or improve their deployment in your cloud without any limitation?
VMware Tanzu was one of the biggest topics of VMworld 2019; it aims to answer and resolve these questions and guide your cloud app provisioning with this motto: any app, in any cloud, on any cluster, even Kubernetes!
VMware explains Tanzu Mission Control (TMC) as follows:
1. Automatically provision new clusters and attach existing clusters running in multiple environments—including vSphere, VMC, public clouds, and managed Kubernetes services—for centralized management and operations.
2. Easily set policies for access, backup, networking, and more, and enforce the right configuration across fleets of clusters and applications running in multiple clouds.
3. With policies and configuration in place, safely enable developers with self-service access to the resources they need to deploy their applications in multiple clouds—without changing their native workflows. 

As Tom Fenton mentioned in this link, it looks like TMC will be a SaaS-based control plane that treats your K8s clusters as a new layer of abstraction. TMC will give you lifecycle management and control, role-based access control and the ability to inspect the health of your K8s clusters (most of the Day 2 operations). You'll also be able to manage the entire lifecycle of your K8s clusters, from instantiation to decommissioning.
It seems that managing multiple Kubernetes clusters regardless of their location is a complex challenge that VMware TMC wants to resolve.
Also see the following overview video of VMware CEO Pat Gelsinger about VMware Tanzu:

Tuesday, October 29, 2019

vSphere Distributed Switch Design & Configuration - Part IV: Add & Migrate VMkernel ports

VCSA Low Disk Space Problem - Part 2

In the previous post, part 1 of the VCSA Low Disk Space Problem, I described the situation where no space is left on the vCenter Server Appliance volumes and the server runs into complex problems, like interruption of the vCenter services.
Unfortunately, in some cases you need to find the voluminous files that occupy the vCenter Server's VMDK space and remove them manually (large log files, for example). In this post, I want to show you how to find and remove them when you need to start the vCenter Server immediately. So let's begin:
1. First of all, before taking any other action, run a backup job from the vCenter Server VAMI web interface to generate a new full backup of the VCSA.


2. Connect to the VCSA with SSH/Shell and check the remaining space on each of its volumes by running the disk free command (df -h).

3. Check every large directory and search for unnecessary large files, like old log files, by running the disk usage command (du -chx).

4. Remove some of the old files and check the remaining space again, then retry starting the vCenter services or restart the VM.
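A minimal sketch of steps 2-4 from the VCSA shell (the /storage/log path and the number of results are only examples; adjust them to your own appliance):

df -h                                          check the free space on each volume
du -a /storage/log | sort -n -r | head -20     list the 20 largest items under /storage/log
service-control --status --all                 review the state of the vCenter services after cleanup
service-control --start --all                  start all services again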

Now you can work with the vCenter Server Appliance without any problem. But consider this method a worst-case recovery for vCenter Server operation, because you have to remove files from the server, even if you consider them low priority, and the chance of wrongly deleting an important file increases. So it is very important to avoid this method whenever possible. In the VCSA deployment phase, you must calculate and provision a suitable storage size for this critical virtual appliance to prevent this issue.

Wednesday, October 23, 2019

History of vCenter Server - From v4.0 to v6.7

In this post I want to review the major and critical features published with each version of vCenter Server, in every released version of VMware vSphere from version 4.0 to the latest version, 6.7:

1. Version 4.0: The first release of vCenter Server as the virtual infrastructure management system, shipped with vSphere 4.0.
2. Version 4.1: The great features Storage I/O Control (SIOC) & Network I/O Control (NIOC) were released, as well as the vStorage APIs for Array Integration (VAAI) & Data Protection (VADP).
3. Version 5.0: The awesome VMFS5, with support for VMFS datastores larger than 2 TB. It also included new features like vSphere Auto Deploy, Storage DRS, software FCoE and swap to host cache.
VMware FDM was also introduced, so HA could use the datastore heartbeating mechanism to avoid detecting host isolation as a host failure.
The first release of VCSA was announced in this version.
4. Version 5.1: The first release of vCenter Single Sign-On (SSO), separating the authentication service and preventing default access for local or AD domain administrator accounts.
From this version on, the core vCenter services could run on separate nodes (six nodes including the VCDB & VUM servers).
It also added support for Single-Root I/O Virtualization (SR-IOV) and introduced vSphere Replication (VR) for replicating VMs over LAN/WAN and vSphere Data Protection (VDP), based on EMC Avamar technology, for VM backup & recovery operations.
5. Version 5.5: VSAN was introduced in this release, along with the new Big Data Extensions (BDE) feature for Hadoop clusters in the Ent/Ent+ editions. From this version on, HA is aware of DRS anti-affinity rules when restarting virtual machines.
6. Version 6.0: Introduced the Platform Services Controller (PSC) & Enhanced Linked Mode (ELM), as well as the Content Library & Virtual Volumes (VVOL) features.
7. Version 6.5: The very useful vCenter HA (VCHA) feature, which acts as a special cluster for the VCSA, and the native backup and restore functionality were released in this version. A management interface based on HTML5 was also introduced in this release.
8. Version 6.7: Starting with this version, the embedded deployment of the PSC supports ELM too (U1 lets you converge the PSC deployment from external to embedded with the VCSA Converge CLI).
Domain repoint is another great feature of this release (via cmsso-util).
Per-VM EVC and vSphere Health are two other features of this version.
 
In a follow-up to this post, I will explain the history of ESXi since the release of vSphere 4.0, as well as the SSO domain changes in every version since the release of vSphere 5.1.

Sunday, October 13, 2019

hostd & vpxa & vpxd

One of my students asked me about the difference between vpxa & hostd.
hostd (a daemon) is responsible for performing the main management tasks of the ESXi host, like virtual machine operations (such as Power-On, Migration and so on).
But what is going on when we join the ESXi host to the vCenter Server? 
Now it's time for vpxa (the agent) to come in. vpxa is the agent responsible for communication between ESXi and vCenter Server, so whenever you add the host to vCenter, vpxa is started automatically.
Although hostd is used for managing most ESXi operations, vCenter calls vpxa to send its commands to the hostd service. Actually, vpxa is an intermediary between hostd on the ESXi host and vpxd (the daemon) on the vCenter Server, passing commands from the vCenter Server to the ESXi host (TCP/UDP port 902). If you manage the ESXi host directly, the management communication is handled by the host itself (UDP port 902).
vpxd (a daemon) runs as part of the vCenter Server and is responsible for sending commands via the vpxa agent to the ESXi hostd service. (If vpxd is stopped, you cannot connect to the vCenter Server via the vSphere Client.)
vCenter Server --> vpxd --> vpxa --> hostd --> ESXi
For restarting the ESXi host daemon and vCenter Agent services, you can run the following commands:

/etc/init.d/hostd restart
/etc/init.d/vpxa restart
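Before restarting them, it can also be useful to check their current state first; a quick sketch (the status action is available on recent ESXi versions):

/etc/init.d/hostd status
/etc/init.d/vpxa status
services.sh restart        restart all ESXi management agents if restarting a single service is not enough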


Tuesday, October 1, 2019

Great Memorial for the vSphere version 6.0



At last, VMware announced the end of general support for vSphere version 6.0: March 12, 2020...
However, I believe this version was the most impressive release of vSphere because of its great improvements. There were so many enhancements in many areas of the virtualization infrastructure, like:
  1. Architecture & Platform Service Controller (PSC)
  2. Support NVIDIA GRID vGPU
  3. Truly unbelievable scalability (64 hosts and 8,000 VMs per cluster, and 480 logical CPUs, 12 TB RAM and 1,024 VMs per host)
  4. Certificate Management by VMCA
  5. Improvements to the vSphere Web Client
  6. Storage enhancements, including Virtual Volumes & VDP/VR improvements
  7. Network enhancements, including NIOCv3 & multiple TCP/IP stacks
  8. Availability improvements (vMotion, HA & FT)
  9. Security improvements, like smart card authentication to the DCUI & the two lockdown modes
  10. Multisite Content Library
Goodbye, vSphere 6.0... I hope VMware soon publishes another vSphere version as stable as this one.

Saturday, September 28, 2019

ESXCLI Networking - Part 1 (Video Series)

At last I have prepared and uploaded the first part of this video tutorial series. It is about how to work with the ESXCLI command line and its related syntax to manage ESXi networking and communications.

I hope you enjoy it


Thursday, September 26, 2019

Migrate from Standard vSwitch to Distribute vSwitch


Today one of my students asked me to describe a procedure for safely migrating from a VSS to a VDS, so I decided to write a checklist for it. A few days ago I also talked about it in some VMware community threads, here: 618874 & 613321
  1. First of all, we must design & create the vSphere Distributed Switches based on the vSphere networking assessment & virtual infrastructure requirements, and also configure the distributed uplink port group. We must set the maximum number of required uplinks (VMNICs) for that VDS.
  2. Add every ESXi host in the datacenter that you need to connect to the VDS. Do not forget that there is no dependency between an ESXi host's VDS membership and its membership in the existing clusters, so any host from any cluster can be added to the VDS.
  3. Consider redundancy on the physical uplinks of the existing VSS! Yes, it's better to have it. Although it is not a prerequisite for VDS migration, for the safety of the migration procedure I always try to follow this rule, so that when we start the major migration operation we can do it without any risk of interrupting network connectivity. Otherwise, if we cannot provide a redundant uplink for the VSS, we should run this step and the next three in a single run of the VSS-to-VDS migration wizard.
  4. Move the first set of physical uplinks: assign one of the VMNICs from each host as the first VDS uplink. Then check the co-existence of the VSS and VDS and their connectivity.
  5. Create a new distributed port group and then migrate the VMkernel ports to the newly assigned dvPortGroup, especially the one designed for the management port. Then check the host management connection after a successful migration.
  6. Create the designed port groups for virtual machine communication, then migrate the VMs together with their associated uplinks to keep their connectivity.
  7. Move all the remaining objects, like the other VMNICs, to the VDS. After a final check of all network communication (a quick verification sketch follows below), you can remove everything related to the VSS if you don't need it anymore.
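A minimal verification sketch from each host's ESXi shell after the migration (vmk names and port groups are of course environment-specific):

esxcli network vswitch dvs vmware list    the host should list the VDS with the expected uplinks
esxcli network ip interface list          each VMkernel port should now reference a dvPort/dvSwitch
esxcli network vswitch standard list      the old standard vSwitch should be empty or already removed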
I hope it can be useful for you all ;)

Saturday, September 21, 2019

ESXi Networking Management & Troubleshooting by ESXCLI - Part 1

In this series I want to demonstrate how to work with the ESXCLI command line tools to manage and troubleshoot the ESXi network configuration. Let's start step by step; first of all, regarding NIC status, I want to explain the related syntax:

1. List all of the physical interfaces (pNICs) belonging to the host, with additional information about them:
esxcli network nic list
 
2. Detailed information about a specific pNIC:
esxcli network nic get -n vmnic0

3. Software settings of the NICs, including VLAN tagging and VXLAN encapsulation:
esxcli network nic software list
 
4. Sent/received packet statistics for the VLANs associated with that pNIC:
esxcli network nic vlan stats get -n vmnic0 

5. Configure pNIC attributes, including speed, Wake-on-LAN settings, duplex and so on:
esxcli network nic set -n vmnic0 -S 1000 -D full 

6. List the VMs with active network ports, along with their associated networks:
esxcli network vm list
 
7. Retrieve details of a VM's connected ports by its world ID (-w), including vSwitch, port group, IP and MAC addresses and the related uplink: 
esxcli network vm port list -w 136666

8. Detailed information about the Distributed vSwitches associated with the host:
esxcli network vswitch dvs vmware list

In the next post I will show you how to work with ESXCLI for networking in a new video series.

Friday, September 13, 2019

vRealize Network Insight - New Posters

Thanks to VMware: they recently released two fantastic posters about how to use the vRealize Network Insight (vRNI) search engine:

1. Network flows search
2. Virtual Machine search guide



https://blogs.vmware.com/management/files/2019/09/vRNI-Flow-Search.png
vRNI VM Search Poster

Monday, September 9, 2019

An Example of Importance of Management and Controlling Virtual Infrastructure Resources


In one of my projects I had a bad problem with a vSphere environment. The issue occurred in the following situation:
In the first episode, the VCSA server ran into a low disk space problem and suddenly crashed. After increasing the size of the VMDK files and fixing that first problem, I saw that one of the ESXi hosts belonging to the cluster was unreachable (disconnected, and vCenter could not connect to it), although both of them were reachable from my client system. Over SSH I verified that the ESXi host was accessible, but the vCenter Server could not connect to this particular host.
All network parameters, storage zoning settings, time settings and service configuration were the same on every host. Sadly, syslog was not configured and we did not have access to the scratch logs for the period in which the issue occurred (I don't know why). Trying to restart all management agents of the host hung: the services.sh restart process got stuck and nothing really happened, and restarting vpxa and hostd did not fix the issue either.
There was only one error in the summary tab of the disconnected host, saying that vSphere HA was not configured and asking to remove and re-add the host to vCenter. But I couldn't reconnect it. My only guess is that it was related to the startup sequence of the ESXi hosts and storage systems, because the tech support unit had restarted some of them after running into the problem, so HA automatically tried to migrate the VMs of the offline host to other online hosts; this is the moment I like to call a "complex disaster". Stuck, I decided to disable HA and DRS in the cluster settings: nothing changed, the problem still existed. After fixing the VCSA problem I knew that if we restarted that host the second problem might be solved, but because of a VM operation we couldn't do it. Migration did not work and we were confused.
Then I tried to shut down some non-essential VMs belonging to the disconnected host. After releasing some CPU/RAM resources, this time the management agent restart completed successfully (the services.sh restart operation).
After that, connecting the VCSA to the problematic ESXi host was possible and the problem was gone for good!
Afterwards, I wrote a procedure for that company's IT department as a virtualization checklist:
1. Pay attention to your VI assets' logs. Don't forget to keep them locally in a safe repository and also on a syslog server.
2. Always monitor the used and free CPU/memory resources of the cluster. Never exceed their thresholds, because a host failure may then cause consecutive failures.
3. Monitor the status of the virtual infrastructure management services, including vCenter Server and NSX Manager, as well as their disk usage. Execute "df -h" in the CLI or check the status of their VMDKs in the GUI. (I explained how to do this in this post.)
4. In critical situations or even maintenance operations, always shut down your ESXi hosts first and then the storage systems; to bring the environment back up, first start the storage, then the hosts.
5. Finally, please DO NOT disconnect the VCSA's vNIC from its associated port group if it is part of a Distributed vSwitch. They did, and it made me suffer a lot to reconnect the VCSA. Even if you restore a new backup of the VCSA, don't remove the network connectivity of the failed VCSA until the problem is solved.

Thursday, August 29, 2019

vSphere HA vs vCenter HA



 
 Many times I have heard my students ask what VCHA really is and what the difference is between this feature and vSphere HA.
vSphere HA is a cluster-level feature that can be enabled to increase the overall availability of the VMs inside the cluster. Whenever an ESXi host crashes, HA moves the VMs of that failed host to other available resources inside the cluster and reboots them on the new hosts. HA interacts directly with the ESXi HA agent and monitors the status of each host in the cluster by examining its heartbeats. So if network segmentation/partitioning/downtime happens and the ESXi host also cannot provide its heartbeat to the shared datastore, HA considers the host failed and restarts its VMs elsewhere.
vCenter HA, on the other hand, is a feature introduced with vSphere 6.5 and directly related to the vCenter Server Appliance. It creates a cluster of the VCSA VM in a three-node structure: an Active node (the primary vCenter Server), a Passive node (the secondary vCenter, taking over after a disaster) and a Witness (acting as a quorum). It is only about VCSA availability. vCenter HA can be enabled only for the VCSA (because of the native PostgreSQL replication mechanism) and provides higher availability for this mission-critical service inside the virtualization infrastructure.
 As VMware states, whenever VCHA is enabled, in the case of a vCenter failure operation is recovered after about 2-4 minutes, depending on the vCenter configuration and inventory size. The VCHA activation process itself can be completed in less than 10 minutes.

Now I want to compare these two features with respect to several related IT infrastructure concepts:

1. Network Complexity:

The vCenter HA configuration needs a dedicated network to work, totally separated from the vCenter management network. To run the VCHA cluster successfully you only need three static IPs or dedicated FQDNs, one assigned to each cluster node (I always prefer to choose a /29 subnet for them). After an Active node failure, the Passive node automatically handles the vCenter management traffic and users just need to log in to vCenter again (VPXD through the API or the Web Client).
A good vSphere HA operation, however, depends highly on the cluster settings, so you don't need to do much extra network configuration specifically for HA. (In some situations you may need to separate the host management and vMotion port groups based on network throughput.)


2. Network Isolation:

In a situation where there is partitioning between the hosts of a cluster, if a host cannot send any heartbeat to the shared datastore, it is considered a failed host, so HA tries to restart all the running VMs of that host on other healthy hosts. I want to emphasize that, with respect to the availability of the VMs in the cluster, there are two mechanisms for checking failures: network connections (between hosts and vCenter) and storage communication (inside the SAN area).
But if there is network segmentation between the vCenter HA nodes, we must look carefully at what is really going on; I mean, which nodes of the cluster are separated from each other? If the Active-Passive or even the Active-Witness nodes are still connected, there is no need to worry, because the Active node is still responsible for the VI management operation. But what happens if the Active node is the isolated one?! Operationally it drops out of the VCHA cluster and stops serving, and the Passive node continues its job.

3. Multiple failures:

In the case of consecutive failures, if there are enough resources (RAM & CPU) inside the cluster, vSphere HA can handle the problem, because it will keep restarting VMs on the remaining available ESXi hosts. Just remember that you must check the Admission Control policy settings with respect to handling multiple ESXi failures.
With vCenter HA, however, you should know that VCHA is not designed for multiple failures, so after a second failure the VCHA cluster is no longer available or functional.
4. Utilization, Performance and Overhead:

There is a little overhead on the primary vCenter when VCHA is enabled, especially whenever there are many tasks for the vCenter Server to process.
The Witness needs the least CPU, because it runs only the VCHA service; it is almost the same for the Passive node, which runs just VCHA and PostgreSQL. There is no particular concern about memory usage.
But if you want vSphere HA to work at its best, you must pay attention to the remaining resources in the cluster, because a bad HA configuration can make the cluster unstable. For the best performance of the whole cluster you need to calculate the availability rate based on the used and remaining physical resources; specifying at least two dedicated failover ESXi hosts to absorb a failure can be a suitable HA configuration.





Sunday, August 4, 2019

Virtual Machine Snapshot Details Investigation - Part 1

What really is a snapshot? Let's dig into the details. A virtual machine snapshot is a technology used to save a specific state of a VM, with the purpose of preserving the VM's data, its power state and also its virtual memory. You can generate many snapshots to keep different states of your VM; snapshots are required for the VM backup procedure and are a great capability in test/pilot scenarios. You can revert to any snapshot state if you need to via the Snapshot Manager, but remember that every change made in the meantime (from the snapshot until the present moment) will be discarded.
 But what is affected whenever we create a new snapshot, and what are the pros and cons of this feature? In this post I want to describe the virtual machine snapshot feature inside vSphere environments in more detail…
In a more detailed view, a snapshot is essentially a copy of the VMDK at a specific moment, so it can be used to recover a system from a failure state. All backup solutions work with snapshots: every time they start a VM backup task they take a snapshot to get a consistent copy. So, as I said before, snapshot generation preserves the contents of these components:
1.    VM Settings (Hardware settings or any changes to the VM itself)
2.    VMDK state (data that has been written inside the VM guest OS)
3.    VMEM content (virtual memory, like clipboard or swap contents)
So be careful when using the revert action, because it will return all of these objects to the snapshot state. During snapshot generation, a delta file with a .vmdk extension (also called a redo log or delta disk) is created, which acts as a child of its parent .vmdk (the main VMDK before the snapshot was taken). The guest OS can no longer write to the parent VMDK; from that point, any disk write goes into the delta/child disk. First, a child disk is created from its parent, and then each successive snapshot creates a new child from the latest delta .vmdk in the chain. As the name shows, the delta is the difference between the current state of the VM disk and the moment the last snapshot was taken. Any change in the VM (and its guest OS) is written to this new VMDK (delta file) from that moment on, so the delta files are as important as their parent.
But what is the exact content of the snapshot files, and where is data written after taking snapshots? Consider A as the primary VMDK of a virtual machine with no snapshots. The B series (B1, B2) are children of A, and the C files (C1, C2) are descendants of B. If you are currently past the C2 snapshot, the data you see after reverting to the C1 state consists of the base VMDK (the flat file) plus the previous delta files: A+B1+B2+C1. The flat.vmdk holds the raw data of the base disk, but it is not shown as a separate file when you browse the datastore.
On the Virtual Machine File System (VMFS), a delta disk acts as a sparse disk, and it is worth knowing how data is stored in virtual disks. There is a mechanism called COW (copy-on-write) for optimizing storage space: nothing is stored in the delta VMDK until a data copy occurs. I will explain the COW mechanism and sparse disks more deeply in another post.
When you create many snapshots and the parent/child relations between them become complex, you may need to execute a consolidation to reduce this confusing situation. Consolidation merges the redo logs/delta VMDKs into a single VMDK to avoid a complex snapshot-management state. If the child disks are large, the consolidation operation may take a long time.
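As a small illustration from the ESXi shell (the VM ID, snapshot name and datastore path below are only examples), you can take a snapshot and then watch the delta files appear next to the base disk:

vim-cmd vmsvc/getallvms                                      find the VM ID
vim-cmd vmsvc/snapshot.create 12 "Clean" "before patch" 0 0  take a snapshot without memory or quiescing
vim-cmd vmsvc/snapshot.get 12                                show the snapshot tree of that VM
ls -lh /vmfs/volumes/datastore1/vm/                          the *-000001.vmdk delta, .vmsd and .vmsn files show up here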
There are also some other files related to the snapshot operation:
VMSN: This is the container for the memory contents of the VM. As VMware says, if the snapshot includes the memory option, the ESXi host writes the memory of the virtual machine to disk. The VM is stunned while the memory is being written, and sadly you cannot pre-calculate how long that takes, because it depends on many factors such as disk performance and the amount of memory.
Remember that the VMSN file is always generated, even if you don't select the memory option during snapshot creation, but its size is much smaller in the non-memory case. So the VMSN size is an overhead in the total space calculation for snapshot generation on the datastore.
VMSD: This is the snapshot database and the primary source used by the Snapshot Manager; its content is the relationship tree of the snapshots. The snapshot .vmsd file holds the current snapshot configuration and active state of the virtual machine.
This is a sample VMSD file:
.encoding = "UTF-8"
snapshot.lastUID = "1"
snapshot.current = "1"
snapshot0.uid = "1"
snapshot0.filename = "vm-Snapshot1.vmsn"
snapshot0.displayName = "Clean"
snapshot0.createTimeHigh = "360682"
snapshot0.createTimeLow = "2032299104"
snapshot0.numDisks = "1"
snapshot0.disk0.fileName = "vm.vmdk"
snapshot0.disk0.node = "scsi0:0"
snapshot.numSnapshots = "1"


I will talk about the snapshot quiescing option and some more details in the next part as soon as possible. For more information you can refer to the following links:







Monday, July 29, 2019

vCenter and other VI Components (Drawing Diagram)

A new diagram (drawn by me) about the vCenter Server's role in the virtual infrastructure and how this server integrates with other VMware components inside the management layer. 
I spent 40 minutes drawing it ;)



VMware VCA-DCV study class - 28 July 2019

Saturday, July 20, 2019

vSphere Distributed Switch Design & Configuration - Part I: Create & Basic Setup

As I promised, I am publishing my second video today. It is about creating and managing VMware vSphere Distributed Switches, and this is the first video of the series. I hope you enjoy it.


Thursday, July 11, 2019

Storage Pools: Homogeneous vs Heterogeneous

 Homogeneous, in a storage system, means using identical disks in a single array, where identical means the same characteristics: disk type (SATA, SAS, flash), throughput (I/O speed), capacity (hundreds of GB, TB or even more), vendor (EMC, HP, NetApp) and so on. In contrast to homogeneous, heterogeneous simply means a mixed and more complex array of disks.
 Storage tiering is implemented based on the required throughput and functionality, the total price of the disks and the Service Level Agreement, and it makes it possible for storage pools to use a variety of disk types. But what is the best option? Is there a definitive answer between a mixture (heterogeneous) and simplicity (homogeneous)? There are benefits to heterogeneous pools at enterprise scale, because we can create pools with different disk types and different levels of storage service. A big problem, however, is how to transfer data between different disk tiers when required, without interruption and without the manual action needed to migrate data between homogeneous storage pools.
  EMC provides a good answer for this situation: FAST VP. This great feature offers the highest flexibility a storage device can reach, by allocating a storage pool across different tiers based on performance requirements. Remember that tiers consist of different disk types, but it is strongly recommended to use the same speed within each tier rather than mixing rotational speeds. The storage system then decides the best place for data based on its importance and required throughput. Heterogeneous pools are the fundamental storage infrastructure for EMC FAST VP. (Note that a thick LUN is suitable as the FAST VP underlay.)
 For applications and services with similar requirements, a homogeneous pool (a tier with a single disk type) is the better option, because there is a predictable I/O bandwidth and storage behavior, so there is no need to tier data between different storage types. Heterogeneous pools can consist of different drive types; as Tomek said in this post, FAST VP facilitates automatic data movement to the appropriate drive tier depending on the I/O activity of that data. As a result, NL-SAS or even SATA disks are better for data with a low I/O rate or a high capacity requirement (like backup files), flash-based disks (like SSDs) are more suitable for low capacity and the highest required I/O (so probably the more important data), and SAS disks are the best candidate for the mid-range of the data storage infrastructure (medium capacity density and medium throughput). EMC calls these, respectively, the Capacity Tier, the Extreme Performance Tier and the Performance Tier.
 From another point of view, we can say a homogeneous pool is a good choice for predictable data (with respect to both capacity and performance), because you know exactly what to expect from your storage underlay; each of these pools uses a single drive type, unlike heterogeneous pools. But speaking generally about storage products from any vendor, without considering any advanced features, sometimes you may prefer the simplest way: create homogeneous pools, assign LUNs to the storage initiators and then, over a period of time, measure the performance metrics and storage space. Then you can say what disk type, speed and capacity you really require. Later, if you need to expand your storage infrastructure in a complex environment, a heterogeneous pool may be used.



Friday, July 5, 2019

VMware VDI (Horizon View) Troubleshooting - Part IV




VMware recently released a new poster about troubleshooting the network connectivity of VMware Horizon View; it includes all the required protocols with their related connections and network ports. This is a great diagram, very impressive and useful for troubleshooting VDI.


Tuesday, July 2, 2019

Upgrade vCenter Server 6.0 to VCSA 6.7

Yesss... I did it after many, many years of laziness :D
Finally, the VirtualUndercity YouTube channel has been created, and I tried to reorganize and edit many of my raw video recordings, so this is the link to the first video. As you will hear, out of respect for my people, the language of my first published video is Persian (پارسی), although after this I will publish the other videos in English ;)
I hope it will be helpful for you

Sunday, June 30, 2019

What is the VMware Photon OS

Photon OS is an open-source Linux developed by VMware for cloud-native applications, such as vCloud Air, and for virtual infrastructure services, like vSphere. Photon OS is used in particular as the guest OS of the VCSA 6.x & SRM 8.x OVFs. VMware announced it because of customers' need for an environment that provides consistency from development through production. Considering all aspects of the infrastructure (compute, networking and storage), Photon OS provides a fully integrated platform to make sure all the capabilities required by the VMware platform's app developers and customers are available. 
Photon OS supports running popular container runtimes (rkt, Docker & Garden) and also the developer apps that are deployed into those containers. Together with Project Lightwave (another open-source project, for access/identity management), containers deployed on Photon OS and all of their workloads are protected by security enforcement.
The current version of Photon OS is 3.0, and historically each version has introduced many optimization features for VMware environments (like the kernel message dumper in version 2.0). Updates for Photon OS are always delivered as packages (yum & rpm are supported), and you can also upgrade this product in place with an offline downloaded package and then run (there is no patch):

# tdnf install photon-upgrade   
# photon-upgrade.sh

If you install the VCSA (with its built-in Photon OS) you need to provide almost 10 GB of RAM, but the minimum recommended memory for Photon itself is 2 GB. As VMware mentions, the resource requirements depend highly on the installation type (Minimal, Full, OSTree Server), the virtualization environment (ESXi, Workstation or Fusion), the Linux kernel (hypervisor-optimized or generic) and the distribution file (a preinstalled OVA/OVF or a more complex setup with the ISO). It's good to know the installation types of Photon OS:
 1. Minimal: A lightweight version and the best choice for hosting containers.
 2. Full: With additional packages; the better option for developing container-based applications.
 3. OSTree Server: Suitable as a repository and management node for the other Photon OS hosts.
In the hypervisor-optimized kernel, everything that is not required for running on a VMware hypervisor is removed, while the generic kernel keeps everything. To use the Docker feature you need to run:

# systemctl start docker           Run the daemon service
# systemctl enable docker        Enable service startup
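As a quick smoke test (just a suggestion, not part of the official procedure), you can confirm the daemon is active and able to run a container:

# systemctl status docker          Check that the Docker daemon is active
# docker run hello-world           Pull and run a tiny test container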

With respect to the discussions about the development and security of virtual infrastructure services, VMware released Photon OS as an open-source product, so it can also support other public cloud environments, for example Amazon Elastic Compute Cloud (EC2), Google Compute Engine (GCE) and Microsoft Azure. To read more about Photon OS you can refer to the following links:
And you can also download its source from the following GitHub link:

Wednesday, June 26, 2019

VMworld 2019 (US) - Less than 60 days


Now less than two months remain before the start of VMworld 2019 US (San Francisco), 25-29 August.

Be ready for this great event. You can read about this:
https://www.vmworld.com/en/us/index.html

Of course, the Europe edition (Barcelona) will begin in November 2019: 
https://www.vmworld.com/en/europe/index.html




Monday, June 24, 2019

VMware VDI (Horizon View) Troubleshooting - Part III


In the third part of the VDI troubleshooting series, unlike the last two parts, I want to talk about client-side connection problems. For instance, if there is a dedicated subnet of IP addresses for Zero Client devices, then an incorrect setup or misconfiguration of the routing settings can be the reason for the connection problem between the VDI clients and servers. In the same way, wrong VLAN configs (ID, subnet, inter-VLAN routing) can be the main reason for the trouble. So I have provided a checklist of "What to do if you have a problem with your Horizon Connection Servers?"

1. Check the correctness of the Zero/Thin clients' communication infrastructure (routing, switching, etc.) to the VDI servers (Connection Server, Security Server).
2. Check the network connection between the Connection Server subnet and the deployed virtual machines of the desktop pool, if they are separated. Of course, there is logically no need to connect their dedicated hosts/clusters to each other, so you can have separate ESXi clusters, one for the desktop pools and another for the VDI servers.
3. Verify that the vCenter Server is accessible from the Connection Server and that its related credentials are valid.
4. If you have a Composer Server, check its services. Many times I have seen the Composer Server service fail to start after a server reboot, even though it is set to automatic and no warning/error event is reported. You also need to check the ODBC connection between the Composer Server and its database.
5. Investigate the state of the View Agent installed inside the desktop pool's VMs. If you need to provide direct client connections to the desktop (without the presence of the Connection Server), the View Agent Direct-Connection component is needed too.
6. A TCP connection on port 4001 (non-SSL) / 4002 (SSL) between the desktop's View Agent and the Connection Server must be established. It is required for the connection, and you can check it by running netstat -ano | findstr "4001".
7. Review the user entitlements for the provided desktop pools; maybe there is a mistake, especially when you add AD groups instead of AD users. (Also check the desktops themselves: are they still available, or already assigned to other users?)
8. The type of virtual desktop provisioning is also important. Except for full clones, for the linked clone and instant clone models you need to check the status of the virtual desktops under Inventory\Resources\Machines in the View Admin web page.
9. If there is an interruption in connected sessions, you need to review their states under Inventory\Monitoring in the View Admin web page.
10. As a last note: DO NOT FORGET TO CONFIGURE THE EVENT DATABASE! I have encountered too many Horizon View deployments with no event database configured, so in troubleshooting situations we had NOTHING to tell us what really happened.
I hope it can be helpful for you all, buddies...

Saturday, June 15, 2019

Manage VCSA Certificates - Chapter I

Every part of the virtual infrastructure environment needs a communication channel, and a safe and secure channel always requires a certificate. ESXi hosts, vCenter Server, NSX Manager, Horizon Connection Server and so on: each of them has at least a machine certificate or a web-access management portal with a self-signed SSL certificate. Since the introduction of vSphere 6.0, the Platform Services Controller (PSC) handles the vSphere-generated certificates through the VMware Certificate Authority (VMCA). But in this post I want to introduce some CLIs to manage VMware certificates:
  1.  VECS-CLI: A useful CLI to manage (create, get, list, delete) certificate stores and private keys. VECS (VMware Endpoint Certificate Store) is the VMware SSL certificate repository. Pic 1 shows the usage of some of its syntax:
  2. DIR-CLI: Manage (create, list, update, delete) everything inside the VMware Directory Service (vmdir): solution user accounts, certificates, and passwords.
  3. Certool: View, generate and revoke certificates.
There are several types of stores inside VECS:
  1. Trusted Root: Includes all of the default or added trusted root certificates.
  2. Machine SSL: Since the release of vSphere 6.0, all communication of the VC & PSC services goes through a reverse proxy, so they need a machine SSL certificate, which is also backward compatible (version 5.x). An embedded PSC also requires the machine certificate for its vmdir management tasks.
  3. Solution users: VECS stores a separate certificate with a unique subject for each of the solution users, like vpxd. These user certificates are used for authentication with vCenter SSO.
  4. Backup: Provides a revert action to restore (only) the last state of the certificates.
  5. Others: Contains VMware or third-party solution certificates.
Now let me ask: what are the roles of the solution users? There are five solution users:
  1. machine: The license server and logging service are its main roles. It's important to know that the machine solution user certificate is totally different from the machine SSL certificate, which is required for the secure connections (like LDAP for vmdir / HTTPS for web access) on each VI node (VC / PSC instance).
  2. SMS: Storage Monitoring Service.
  3. vpxd: The vCenter daemon activity (managing vpxa, the ESXi host agents).
  4. vpxd-extensions: Extensions like Auto Deploy and the Inventory Service.
  5. vsphere-webclient: Obviously the Web Client, plus some additional services like the performance charts.
The default paths of the certificate management utilities are shown below:
    /usr/lib/vmware-vmafd/bin/vecs-cli
    /usr/lib/vmware-vmafd/bin/dir-cli
    /usr/lib/vmware-vmca/bin/certool
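For example (a minimal sketch; the store names can differ slightly between versions), you can list the stores and inspect the machine SSL certificate like this:

    /usr/lib/vmware-vmafd/bin/vecs-cli store list
    /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store MACHINE_SSL_CERT --text | less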

And for the Windows type of vCenter Server, the default path is:
   "%programfiles%\vmware\vcenter server\vmafdd"

Surely I will talk about what vmafd itself is, and about another useful CLI in this path, vdcpromo, in another post. I will also provide a video about how to work with certificate-manager.
As a last note, always remember that you should never delete the trusted roots, because doing so can cause some subtle problems in your VMware certificate infrastructure.

I will start a new journey soon ...