Implement vSphere Metro Storage Cluster with Hitachi Virtual Storage Platform (VSP) Storage Array Platforms and Hitachi NAS (GEfN) cluster

Article ID: 312119

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information about deploying a VMware vSphere Metro Storage Cluster across two data centers using Hitachi Virtual Storage Platform (VSP) storage array platforms in vSphere 6.5, vSphere 6.7, vSphere 7.0, and vSphere 8.0 environments.

Environment

VMware ESXi 4.1.x Embedded
VMware ESXi 3.5.x Embedded
VMware vSphere ESXi 5.0
VMware vSphere ESXi 7.0.0
VMware vSphere ESXi 5.1
VMware ESXi 4.0.x Embedded
VMware vSphere ESXi 5.5
VMware vSphere ESXi 6.5
VMware ESXi 4.0.x Installable
VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.7
VMware ESXi 4.1.x Installable
VMware ESXi 3.5.x Installable

Resolution

vMSC with VSP array

A VMware vSphere Metro Storage Cluster architecture on the Hitachi Virtual Storage Platform storage array platform uses the Global-Active Device (GAD) feature, an active-active stretched clustering technology, to maximize availability and uptime by clustering physical data centers within metro distances. The metro storage cluster solution from Hitachi consists of storage systems presenting replicated storage as a single LUN from different geographically distributed sites. This design enables high availability of services by allowing virtual machine migration between sites with no downtime.
A combination of Hitachi software and hardware provides the following key functions to a vSphere infrastructure:
  • Host multipathing
  • Internal and external storage provisioning
  • Synchronous storage replication across metro cluster distances
  • Storage failover
  • Host access via uniform (recommended) or non-uniform topology
[Figure: uniform and non-uniform host access topologies]

These functions work together with VMware vSphere vMotion, vSphere High Availability, and vSphere Distributed Resource Scheduler to deliver this solution for a VMware vSphere Metro Storage Cluster.

vMSC with GEfN (using HDRS)

vMSC architecture can also be implemented using a Hitachi NAS 5x00 stretched cluster with four HNAS nodes spread across two separate sites, each site having its own storage array. The data between the data centers is replicated using GAD. In the stretched configuration, NFS datastores (NFS v3) are highly available and can be accessed by vSphere hosts on both sites simultaneously.
A four-node stretched cluster adds resiliency in case of a single HNAS node failure at either site, avoiding the need for a site failover. The product name for the four-node solution is GAD Enhanced for NAS (GEfN).
[Figure: GEfN four-node stretched cluster architecture]
A GEfN solution is typically implemented using the Hitachi Disaster Recovery Solution (HDRS) software package.
Hitachi Disaster Recovery Solution (HDRS) simplifies the deployment and maintenance of a GEfN cluster attached to Hitachi storage replicating with Global-Active Device (GAD) technology across two sites. This solution also significantly improves disaster readiness and eases the recovery process.

HDRS helps configure the GAD pair for the GEfN environment on the VSP and presents the GAD-paired volumes as system drives on the GEfN cluster. There is a “one button” install option wherein the HDRS tool configures the VSP, including LUN provisioning for GAD, host group provisioning, and GAD pair creation for both Site A and Site B.
 
For failure scenarios where the GAD pair is destroyed or suspended (for example, test scenario 4 in Table 2.2), HDRS is useful for quick recovery of GAD pair health and restores the lost LUN path in GEfN. HDRS also helps in analysis of the entire recovery process after a disaster or breakdown. This capability is critical for disaster management.

If the entire GEfN environment needs to be torn down, HDRS provides a “one-click” option to dismantle the GAD pair and HNAS configuration, which is very handy for error correction during initial deployment.

For a complete implementation solution of GEfN using HDRS, the following guide is recommended for review:
https://knowledge.hitachivantara.com/Documents/Data_Protection/Disaster_Recovery_Solution/6.0.x/Disaster_Recovery_Solution_User_Guide 

Configuration Requirements 

vMSC using GAD with VMware Native Multi-Pathing (NMP) and the ALUA SATP has a minimum hypervisor requirement of ESXi 6.5 or later (up to ESXi 8.0), or

vMSC using GAD with the Hitachi Dynamic Link Manager (HDLM) SATP has a minimum hypervisor requirement of ESXi 6.5 or later (up to ESXi 8.0).

Note:
  • ESXi hosts using the FC protocol are fully supported for vMSC with Global-Active Device (GAD) configurations on the Hitachi Virtual Storage Platform (VSP) storage array platforms previously noted.
  • ESXi hosts using the iSCSI protocol with GAD are not supported on any Hitachi VSP array.
  • Remote connectivity for GAD is supported for both FC and iSCSI protocols.
The following requirements must be satisfied to support these configurations:
  • The round-trip latency between the Ethernet networks in the two sites must be less than 10 milliseconds (uniform host access).  The IP network supports the VMware ESXi hosts and the VSP/GAD management interface.
  • The round-trip latency for synchronous storage replication must be less than 5 milliseconds. 
  • Host Mode Options 54, 63, and 114 must be set on the Hitachi storage systems for all GAD configurations.
  • The ATS (Atomic Test & Set) setting must be disabled on all ESXi hosts for all GAD configurations (a verification sketch follows this list).
  • The minimum throughput available between the two sites should be 622 Mbps in order to support vMotion of virtual machines across ESXi hosts spread across both data centers.
  • Management network: connectivity to vCenter and other hosts. Can be Layer 2 (same subnet) or Layer 3 (routed).
  • VM network: a Layer 2 stretched network is recommended. In the event of a site failure, VMs restarting on the secondary site do not require IP address changes (that is, the same IP network on which the virtual machines reside must be accessible to ESXi hosts on both sites, so that clients accessing virtual machines running on ESXi hosts at either site continue to function smoothly after any VMware HA-triggered virtual machine restart event).
  • vMotion network: if vMotion is desired between data sites, Layer 2 or Layer 3 is supported.
  • The data storage locations, including the boot device used by the virtual machines, must be accessible from ESXi hosts in both data centers. 
  • Consult the VMware vSphere maximums guide for the maximum number of vSphere hosts in a HA cluster for the specific vSphere release.
  • NMP or HDLM multipathing can be used. NMP is the default.
  • Please note that Hitachi Vantara support recommends using RR (Round Robin) instead of MRU (Most Recently Used).
  • For an NMP/ALUA configuration with the PSP set to RR, the following additional rule must be set on the ESXi hosts (prior to ESXi 6.7 U1) for Hitachi LUNs, if not already present:
    • esxcli storage nmp satp rule add -V HITACHI -M "OPEN-V" -P VMW_PSP_RR -s VMW_SATP_ALUA -c tpgs_on
    • esxcli storage core claimrule load
[This may be an additional rule alongside the existing claim rule for local devices, i.e.
esxcli storage nmp satp rule add --satp "VMW_SATP_DEFAULT_AA" -V HITACHI -M "OPEN-V" -P "VMW_PSP_RR"]
  • For ESXi 6.7 Update 1 and later hosts, the ALUA rule is already enabled in the OS and no additional command is required to enable or configure ALUA on the hosts.
  • However, the ALUA setting must be enabled on the P-VOLs at both sites. The following is the command for enabling ALUA on Hitachi LUNs:
    • raidcom modify ldev -ldev_id <ldev_id> -alua enable -fx -IH<horcm_instance>
    • For example: raidcom modify ldev -ldev_id 08:10 -alua enable -fx -IH4545
  • Also, path optimization settings must be configured on the primary and secondary storage host groups as shown below:
    • raidcom modify lun -port cl1-d HOSTGROUP -lun_id all -asymmetric_access_state optimized -I10 (on the P-VOL host group)
    • raidcom modify lun -port cl1-d HOSTGROUP -lun_id all -asymmetric_access_state non_optimized -I10 (on the S-VOL host group)
  • HMO 78 must be set on the host group containing the S-VOLs.
  • For ESXi 6.0 Update 2, the minimum microcode required is 80-03-32. For ESXi 6.5 on G/F1x00, the minimum microcode level required is v80-05-xx.
  • Hitachi HUS 150 microcode 0977/H or newer is required if it is selected as the quorum storage.
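The following ESXi shell commands are a minimal verification sketch for the host-side items above (the claim rule for pre-ESXi 6.7 U1 hosts and the ATS setting). The ATS option shown is the VMFS ATS heartbeat setting, which is an assumption here; confirm the exact setting against the Hitachi implementation guide before changing it.

  # Confirm a HITACHI/OPEN-V SATP rule is present (pre-ESXi 6.7 U1 hosts)
  esxcli storage nmp satp rule list | grep -i HITACHI

  # Check the current ATS heartbeat value (0 = disabled, 1 = enabled)
  esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5

  # Disable the ATS heartbeat if it is still enabled (assumption: this is the ATS setting referenced above)
  esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5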
Notes:
  • While the Hitachi storage cluster solution supports both uniform and non-uniform host access topologies, Hitachi Vantara recommends a uniform host access deployment where feasible for the highest availability requirements.
  • Uniform host access configuration: ESXi hosts at both sites are connected to a storage node in the storage cluster across all sites. Paths presented to ESXi hosts are stretched across the distance.
  • Non-uniform host access configuration: ESXi hosts at each site are connected only to storage node(s) at the same site. Paths presented to ESXi hosts from storage nodes are limited to the local site.
  • Using Hitachi Dynamic Link Manager with host mode option (HMO) 78 enables you to specify non-preferred paths to a certain storage array. A benefit of this is preventing I/O traffic across long distances from the ESXi host to the non-local storage, which minimizes response time and the cost of WAN traffic. It is recommended to turn on this feature when site distances are greater than 20 miles (32 km).

For any additional requirements for the Hitachi Storage Cluster, see the complete implementation guide: https://www.hitachivantara.com/en-us/pdf/architecture-guide/gad-vmware-vsphere-metro-storage-cluster-configuration-on-storage-implementation-guide.pdf

The following components create a VMware vSphere Metro Storage Cluster environment:
  • Hypervisor: vSphere 6.x, 7.x, or 8.0. ESXi hosts in both data centers running virtual machines and managed by vCenter Server(s).
  • Storage array: Hitachi Virtual Storage Platform, a high-performance and scalable storage solution.
  • Replication software: Hitachi Global-Active Device®, providing synchronous bi-directional storage replication between two storage systems to create an active-active storage cluster. GAD provides read/write copies of the same data in two places at the same time. This active-active design enables production workloads on all systems while maintaining full data consistency and protection.
  • GAD quorum disk: options include an iSCSI disk from a virtual machine, or a separate storage system from the Hitachi VSP family or other supported third-party storage.
  • Fibre Channel switch: SAN connectivity to the datacenter storage network 
  • Network switch: LAN connectivity to the datacenter network
  • For the GEfN architecture:
    • Four Hitachi NAS 5x00 nodes (NAS OS 14.x or later).
    • A dedicated VLAN for NAS communication created on a top-of-rack (ToR) network switch. The ESXi clients use this dedicated VLAN for NAS workloads to the HNAS nodes. NAS shares are mounted on the ESXi hosts as NFS v3 datastores through this network (a sample mount command follows this list).
    • One rack server or virtual machine running CentOS 8 Stream, configured as a virtual SMU for NAS cluster configuration and management.
    • HDRS software v6.x for GEfN cluster deployment and management.
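As a minimal sketch for the NFS v3 datastore step above, an HNAS export can be mounted on an ESXi host with the standard esxcli NFS commands. The EVS address, export path, and datastore name below are placeholders, not values from this solution.

  # Mount an HNAS NFS v3 export as a datastore (placeholder address, share, and name)
  esxcli storage nfs add --host=192.0.2.10 --share=/gefn_export01 --volume-name=GEFN-DS01

  # Confirm the datastore is mounted and accessible
  esxcli storage nfs list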
Note: The traditional metro storage cluster is shown in Figure 1, with a single vCenter Server leveraging vSphere HA and DRS across the stretched cluster for VM availability. An alternate configuration, Figure 2, uses distributed vCenter Servers (typically in Enhanced Linked Mode) with a vSphere cluster at each site. This also provides an active-active storage solution, but it relies on a runbook tool such as Site Recovery Manager to provide a managed failover environment rather than single-cluster vSphere HA technology. This is a supported configuration with Hitachi Site Recovery Adapters (SRAs). Note that this KB is focused on the Figure 1 configuration.
[Figure 1: vSphere Metro Storage Cluster with a single vCenter Server and a stretched vSphere HA/DRS cluster]
 
Hitachi Virtual Storage Platform (VSP) Storage Array Platforms

Hitachi Virtual Storage Platform (VSP) Storage Array Platforms provide an always-available, hybrid and all-flash storage array platform across different model lines to deliver a continuously available infrastructure for cloud solutions. At the time of publication, this includes the VSP models VSP E590/E590H, E790/E790H, E990, E1090/E1090H, VSP 5100/5100H, VSP 5500/5500H, VSP 5200/5200H, VSP 5600/5600H, G/F1500, G/F1000, G/F900, G/F800, G/F700, G/F600, G/F400, G/F370, G/F350, G200, and G130 running SVOS or SVOS RT.
[Figure: Hitachi VSP storage array model lineup]

As part of the Hitachi Storage Virtualization Operating System (SVOS), Virtual Storage Machine (VSM) technology ensures that two physical systems are logically presented as one system. The Global-Active Device (GAD) feature implements cross-mirrored storage volumes between two Virtual Storage Platform systems, accepting read/write I/Os on both sides, which are continuously updated. If a disk controller failure occurs at one site, the controller at the other site automatically takes over and accepts read/write I/Os. This enables production workloads on both systems while maintaining full data consistency and protection. The Global-Active Device feature ensures that an active and up-to-date storage volume is available to a production application in spite of the loss of a virtualized controller, system, or site. More details: https://www.youtube.com/watch?v=xhC7CIKdr4M

For a complete installation and implementation guide to deploying vSphere Metro Storage stretched clusters with Hitachi Global-Active Device (GAD) on VSP storage, the following guide is highly recommended for review. It also includes details on test cases and observations. https://www.hitachivantara.com/en-us/pdf/architecture-guide/gad-vmware-vsphere-metro-storage-cluster-configuration-on-storage-implementation-guide.pdf

What are the Multipathing and Quorum Disk options?

VMware Native Multi-Pathing (NMP) or Hitachi Dynamic Link Manager (HDLM) is the multipathing software that integrates with Global-Active Device to provide load balancing, path optimization, path failover, and path failback capabilities for vSphere hosts. NMP or HDLM load-balances I/O across all available preferred (active) paths to the P-VOL and keeps all paths to the S-VOL as active non-optimized paths.
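From the ESXi side, these path states can be inspected with the following commands, a minimal sketch in which the naa identifier is a placeholder for an actual GAD device. Based on the behavior described in this article, P-VOL paths are expected to report as active (I/O) and S-VOL paths as active unoptimized.

  # Show which SATP/PSP claimed the GAD device and its path selection policy
  esxcli storage nmp device list -d naa.60060e80xxxxxxxxxxxxxxxxxxxxxxxx

  # List each path and its ALUA group state (active vs. active unoptimized)
  esxcli storage nmp path list -d naa.60060e80xxxxxxxxxxxxxxxxxxxxxxxx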

A quorum entity that is external to both systems, normally in a separate location, is used to determine operational control when certain failures occur, in order to avoid split-brain scenarios. In a vSphere Metro Storage Cluster using the VSP platform, there are various options for providing quorum services, including a separate storage system (including any supported third-party storage that can be attached to the VSP platform) or presenting an iSCSI disk from a physical or virtual machine at a third site or in the cloud.

For example, with an external storage system: a 12 GB LUN is created on an external storage array, such as a VSP Gx00 or other supported external third-party storage array, for use as a quorum disk. This LUN is presented to the Site 1 VSP and the Site 2 VSP as externalized storage by virtue of the VSP platform SVOS storage virtualization capability. The quorum disk stores continually updated information about data consistency in Hitachi Global-Active Device P-VOLs and S-VOLs for use during site failover operations. Global-Active Device uses this information in the event of a failure to direct host operations to the other volume within the pair.

Steps to configure a Windows Server local disk as an iSCSI quorum disk for VSP storage systems
  1. Go to Server Manager > File and Storage Services > iSCSI > New iSCSI Virtual Disk.
Follow the steps to create the iSCSI disk.
[Screenshot: New iSCSI Virtual Disk wizard]
 
  2. Under iSCSI Targets, right-click View all Targets > Properties > Initiators. Add the IQNs of the storage ports that will be used as external ports for the quorum.
[Screenshot: iSCSI target initiator properties]
The software components handling the management for host path failover and storage replication control are listed in Table 1.

Table 1. Metro Cluster Software Components
 
Metro Cluster Software Component | Version
NMP or Hitachi Dynamic Link Manager | 8.0.1-00 or newer *
VMware vSphere command-line interface | 6.0 Update 2, 6.5, 6.7, 7.0, 8.0
Command control interface for Hitachi products | Microcode dependent

* ESXi 6.5 / ESXi 6.7 with HDLM v8.6.0 is a supported configuration for Fibre Channel solution.
* ESXi 7.0 with HDLM v8.7.7 is a supported configuration for Fibre Channel solution.

Check https://compatibility.hitachivantara.com/products/interop-matrix for Hitachi interoperability support matrix.
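The versions to check against the interoperability matrix can be gathered with the commands below, a minimal sketch run on the CCI management host and on each ESXi host respectively.

  # Show the installed CCI (RAID Manager) version on the command device host
  raidqry -h

  # Show the ESXi version and build on each host
  esxcli system version get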

What is HNAS?

The Hitachi NAS platform, or HNAS system, is known for its high performance and scalability. It benefits from a state-of-the-art object file system and a unique field-programmable gate array (FPGA) offload engine hardware architecture, which is used to accelerate compute-intensive tasks. Hitachi Vantara has updated this FPGA hardware and the NAS software in its latest HNAS systems. The latest HNAS platforms can scale to more NAS capacity and I/O performance than previous-generation systems, and the new software improves small-file performance and capacity utilization.
The HNAS 5000 series supports both file and block (iSCSI) workloads for greater consolidation and operational simplicity. File controllers are designed with a hardware-accelerated architecture, using field-programmable gate arrays (FPGAs) for active, critical, and sensitive file services. Individual file systems belong to a Hitachi Enterprise Virtual Server (EVS) for NAS, and each server has its own set of IP addresses, policies, and individual port assignments.
HNAS systems come in NAS gateway configurations with an attached Hitachi Virtual Storage Platform (VSP). For more information on the HNAS 5000 series architecture, refer to the document below.

https://www.hitachivantara.com/en-us/pdf/datasheet/nas-platform-5000-series-datasheet.pdf

Tested Scenarios

Table 2 outlines the tested and supported failure scenarios when using a Hitachi storage cluster for VMware vSphere with Hitachi Virtual Storage Platform and Global-Active Device. The table below documents the uniform host access configuration. Non-uniform host access behaves the same, except that a local storage failure follows the site failure scenario.

Table 2. Tested Scenarios NMP/ALUA or HDLM Configuration
Scenario | Global-active device / MP behavior | Observed VMware behavior
Using VMware vMotion or VMware Distributed Resource Scheduler to migrate virtual machines between Site 1 and Site 2
  • No Impact
  • Virtual machine migrates to Site 2 hosts and I/O is directed to the local storage S-VOL on Site 2. 
Using VMware High Availability (VMware HA) to fail over virtual machines between Site 1 and Site 2.
  • No impact
  • Virtual machine fails over to Site 2 hosts and I/O is directed to the local storage S-VOL on Site 2. 
An active path in a single host fails.
  • Host I/O is redirected to an available active path via HDLM/NMP PSP
  • Another active path is used
  • No disruption to virtual machines
Site 1 storage system fails 
  • Storage failover
    • Global-active device verifies data integrity with the quorum disk before failover
    • Global-active device splits the pair replication and S-VOL is converted to SSWS (S Local) 
    • Host I/O is redirected via SATP to the standby S-VOL paths on the Site 2 storage system. 
  • Active paths to P-VOL are reported dead
  • Standby paths to S-VOL become active
  • No disruption to virtual machines
All active paths to the local storage system fail for any ESXi host in the cluster.
  • Host I/O at each site is redirected to available standby (non-preferred) paths on the remote storage system via the HDLM/NMP PSP.
  • Active paths to the local storage system are reported dead
  • Standby paths to the remote storage system become active
  • No disruption to virtual machines
All paths down (APD) occurs on any ESXi host in the cluster
  • Storage failover does not occur.
  • ESXi hosts must be shut down manually for VMware High Availability to restart virtual machines on the other hosts.
Quorum disk fails or all paths to quorum disk removed
  • Replication between PVOL and S-VOL continues and the PVOL and S-VOL stays in pair state
  • No disruption to virtual machines
Storage replication link failure*
  • Global-active device verifies data integrity with the quorum disk and determines which of the two volumes (P-VOL or S-VOL) continues in Local I/O mode; the other is blocked (Block I/O). The decision depends on the state of both volumes, which is reported to and written on the quorum disk.
  • When a volume (for example, the P-VOL) is chosen to continue to serve host I/O, all access to the other volume (the S-VOL) is blocked and I/O is failed over to the P-VOL (P Local).
  • Host I/O at Site 2 is redirected to standby paths to the P-VOL on the remote storage at Site 1.
  • There is an important distinction in P-VOL survival behavior, depending on whether NMP or HDLM is used, that was observed in releases later than vSphere 6.5. See the note below.
Case when quorum determines that P-VOL becomes local I/O mode
  • Active paths to S-VOL are reported dead
  • Standby paths to P-VOL become active
  • No disruption to virtual machines
Case when quorum determines that S-VOL becomes local I/O mode
  • Active paths to P-VOL are reported dead
  • Standby paths to S-VOL become active
  • No disruption to virtual machines
WAN storage connection failure
  • Storage failover occurs, the same as for a storage replication link failure, except for path behavior.
  • Path failover does not occur.
  • When the P-VOL is chosen to convert to P Local, host I/O at Site 1 continues to be processed using the paths to the local storage at Site 1.
  • Site 1: after storage failover, the P-VOL processes host I/O for Site 1 hosts because local site access remains active. Virtual machines on Site 1 can access the local P-VOL.
  • Virtual machines on Site 2 hosts are unable to access their virtual disks at Site 1. Site 2 hosts must be shut down manually for VMware High Availability to restart virtual machines on Site 1 hosts.
  • Note: For ESXi 7.0, HA may restart all virtual machines from Site 2 on Site 1 automatically.
Site 1 failure
  • Same as “Site 1 storage system failure” in terms of storage behavior.
  • Storage replication between P-VOL and S-VOL stops (pairsplit) and storage failover occurs. S-VOL is converted to SSWS(S Local).
VMware High Availability fails over virtual machines to available Site 2 hosts.
Site 2 failure
  • Storage replication between P-VOL and S-VOL stops (pair split) and storage failover occurs.
  • P-VOL is converted to PSUE (P Local). 
VMware High Availability fails over virtual machines to available Site 1 hosts.

Table 2.2. Tested Scenarios: HNAS-GEfN Configuration
Scenario | HNAS behavior | Global-active device / MP behavior | Observed VMware behavior

Using VMware vMotion or VMware Distributed Resource Scheduler to migrate virtual machines between Site 1 and Site 2

  • No Impact
  • No Impact
  • Virtual Machine migrates to any of alternate Site hosts.
Using VMware High Availability (VMware HA) to fail over virtual machines between Site 1 and Site 2.
  • No Impact
  • No Impact
  • Virtual Machine migrates to any of alternate Site hosts.
NAS network outage for one ESXi client.
  • No Impact
  • No Impact
  • NAS datastores become inaccessible for the ESXi client.
  • I/O on virtual disks from the NAS datastore terminates for the ESXi client.
  • VMware High Availability fails over virtual machines of the failed host to other available hosts.
Site 1 storage system fails 
  • All system drives in HNAS remain online through other available paths
  • Storage failover
  • Global-active device verifies data integrity with the quorum disk before failover
  • Global-active device splits the pair replication and S-VOL is converted to SSWS (S Local)
  • No disruption to virtual machine IO.
All FC connections fail from a single NAS node.
  • The Fibre Channel connection is lost on the impacted HNAS node of the GEfN cluster.
  • No disruption to system drives.
  • Storage pool status becomes unhealthy.
  • No impact
  • No disruption to virtual machine IO.
Single NAS Node removed from GEfN cluster
  • The EVS of the removed NAS node is migrated to other nodes in the cluster, so disruption to clients may occur.
  • No disruption to system drives.
  • No impact
  • No disruption to virtual machine IO.
Quorum disk fails or all paths to quorum disk removed
  • No impact.
  • Replication between PVOL and S-VOL continues and the PVOL and S-VOL stays in pair state
  • No disruption to virtual machines
Storage replication link failure*
  • System drives in HNAS nodes remain healthy through other available paths.
  • Global-active device verifies data integrity with the quorum disk and determines which of the two volumes (P-VOL or S-VOL) continues in Local I/O mode; the other is blocked (Block I/O). The decision depends on the state of both volumes, which is reported to and written on the quorum disk.
  • When a volume (for example, the P-VOL) is chosen to continue to serve host I/O, all access to the other volume (the S-VOL) is blocked and I/O is failed over to the P-VOL (P Local).
  • No disruption to virtual machine I/O.
WAN storage connection failure
  • System drive status at Site A remains OK through other available paths.
  • At Site B, the system drive status is not healthy, and path failover does not occur.
  • Global-active device splits the pair replication and S-VOL is converted to SSWS (S Local)
  • No disruption to virtual machine IO.
Site 1 failure
  • Cluster status is degraded as two nodes of the cluster are down.
  • No disruption to system drives.
  • Same as “Site 1 storage system failure” in terms of storage behavior.
  • Storage replication between P-VOL and S-VOL stops (pairsplit) and storage failover occurs. S-VOL is converted to SSWS(S Local).
  • VMware High Availability fails over virtual machines to available Site 2 hosts
Site 2 failure
  • Cluster status is degraded as two nodes of the cluster are down.
  • No disruption to system drives. 
  • Storage replication between P-VOL and S-VOL stops (pair split) and storage failover occurs.
  • P-VOL is converted to PSUE (P Local).
  • VMware High Availability fails over virtual machines to available Site 1 hosts

*** This occurs due to a change in the default HA property in ESXi 7.0 compared to ESXi 6.7.

In ESXi 7.0:

[Screenshot: default HA response settings in ESXi 7.0]

In ESXi 6.7:

[Screenshot: default HA response settings in ESXi 6.7]

NMP and HDLM: one distinction was observed with replication link failures and remote site failures.
The following observation was recorded in vSphere releases later than vSphere 6.5. In summary, if P-VOL survival on each respective side is required in the event of a remote site failure or replication link failure, then leverage HDLM.

Detailed Observation

Storage replication link failure simulation: the following sequence is observed when simulating replication link or Site 2 failures. Testing leveraged a script to fail the switch ports connected to the replication links, and the results are listed below for NMP/ALUA and HDLM respectively.
  1. The replication link failure occurred while the ESXi hosts' NMP was issuing read/write commands to the P-VOLs (the active paths under the ALUA setting) and no I/O to the S-VOLs (the passive paths under the ALUA setting).
  2. Both VSP DKCs (P-VOL and S-VOL) stop responding to I/O (including ACKs) from the ESXi hosts and trigger the countdown of the Block Monitoring Path (PBW) setting, set at 5 seconds.
  3. During this time, the ESXi hosts' vmkernel.log reports error messages for multiple disk write command failures and retries on the same P-VOL active path.
  4. After the 5-second PBW countdown, both the P-VOL and S-VOL storage proceed to suspend, with the P-VOL transitioning to PSUE (local read/write) and the S-VOL to SSWS (block).
  5. The P-VOL responds and acknowledges ESXi read/write requests, I/O resumes on the active path (P-VOL), and two minutes later the hosts report permanent device loss (PDL) for the S-VOL paths.

HDLM and NMP follow the same behavior for vSphere 6.5/6.7/7.0/8.0 as described above, although for HDLM with the S-VOL set to non-optimized, the P-VOL is always preserved and the S-VOL is always blocked in the case of a link failure.
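During such tests, host-side evidence of the sequence above can be gathered with commands like the following, a minimal sketch in which the naa identifier is a placeholder and the grep patterns are illustrative rather than exact log strings.

  # Look for the retried write failures and the eventual permanent device loss messages
  grep -iE "failed|permanent" /var/log/vmkernel.log | tail -n 20

  # Check the state the host reports for the GAD device after the S-VOL paths are lost
  esxcli storage core device list -d naa.60060e80xxxxxxxxxxxxxxxxxxxxxxxx | grep -i status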

The GAD pair behaves differently when remote connections fail at each site.
Test simulation results for storage TC port failures (ALUA/NMP) for a particular site (A or B):
  1. Disabled Site A storage TC ports:
    → Site A PVOLs win and Site B corresponding SVOLs block.
    → Site B PVOLs win and Site A corresponding SVOLs block.
  2. Disabled Site B storage TC ports:
    → Site A PVOLs block and Site B corresponding SVOLs win.
    → Site B PVOLs win and Site A corresponding SVOLs block.

Test simulation results for storage TC port failures (HDLM) for a particular site (A or B):
  1. Disabled Site A storage TC ports:
    → Site A PVOLs win and Site B corresponding SVOLs block.
    → Site B PVOLs win and Site A corresponding SVOLs block.
  2. Disabled Site B storage TC ports:
    → Site A PVOLs win and Site B corresponding SVOLs block.
    → Site B PVOLs win and Site A corresponding SVOLs block.
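After any of these simulations, the resulting GAD pair states can be confirmed from the CCI host with pairdisplay. The device group name and HORCM instance number below are placeholders; the expected output is PAIR during normal operation, or the PSUE/SSWS split states described in the tables above after a failure.

  # Display GAD pair status for the device group (placeholder group name and HORCM instance)
  pairdisplay -g GAD_GROUP01 -fxc -IH10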
Note:
  • For ESXi 6.7, the parameter action_OnRetryErrors is ON by default.
  • For ESXi 6.7 U3B, the same parameter is OFF by default.
  • When using NMP/ALUA as the multipathing option, set HMO 78 = OFF.
  • For NMP/ALUA, ensure ALUA is enabled per LUN, and that dedicated ports for PVOLs and SVOLs are enabled with host group optimized and non-optimized paths.
  • The vSphere GUI on all ESXi hosts shows the LUN status as “Active (I/O)” for PVOLs and “Active” for SVOLs.
  • Zero IOPS is observed on the SVOL storage ports, and the generated I/O workload is observed on the PVOL storage ports.
  • For NMP/ALUA, the host sends CMD=A30A (the ALUA RTPG command) to all paths, and the storage system that notifies the quorum first survives.
  • With HDLM, it was confirmed that no ALUA RTPG (A3h) command is sent, and therefore the PVOLs survived on both storage systems.
This is noted in the implementation guide https://www.hitachivantara.com/en-us/pdf/architecture-guide/gad-vmware-vsphere-metro-storage-cluster-configuration-on-storage-implementation-guide.pdf

For more information about Hitachi products and services, contact your sales representative or visit the Hitachi Vantara website.

Additional Information


Implement vSphere Metro Storage Cluster using Hitachi Storage Cluster for VMware vSphere with Hitachi Virtual Storage Platform G1000/G1500/F1500/Gx00/Fx00