VMware support with NetApp MetroCluster (1001783)
Details
Solution
What is a MetroCluster?
What happens to an ESX host in the event of a single storage component failure?
Note: LUN 1 is used as an example in a MetroCluster configuration with two storage controllers with two ports each.
vmhba1:1:1 (FAS controller 2 port 2)
vmhba2:0:1 (FAS controller 1 port 2)
vmhba2:1:1 (FAS controller 2 port 1)
- Storage controller failure (disk shelves have not failed)
For ESX hosts accessing the NetApp FAS storage controller via iSCSI or NFS protocol, the surviving storage controller performs a "takeover", meaning the target IP address (used as the iSCSI target or for mounting the NFS datastore) is brought up on the surviving storage controller. No manual intervention is required on the ESX hosts, and data availability is not disrupted.
For ESX hosts accessing the NetApp FAS storage controller via FCP protocol, the HBAs see the two clustered FAS storage controller nodes as one storage array unit, with the same WWNN (World Wide Node Name). A given LUN, if configured properly, sees paths in the following manner.
Under normal operation, LUN 1 is active on the vmhba1:0:1 path. In the event of a storage controller failure, the path vmhba1:0:1 becomes unavailable and the active path fails over to vmhba1:1:1. With the appropriate multipathing policy in place, no manual intervention is required on the ESX hosts, and again data availability is not disrupted.
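The path failover described above can be illustrated with a small Python model. This is only a sketch of the selection logic, not VMware's multipathing implementation. The `Path` class, the `LUN1_PATHS` list, and the `select_active_path` function are hypothetical names introduced for illustration. The article does not state which port vmhba1:0:1 uses, so its port is left unset, and its placement on controller 1 is an assumption inferred from the fact that it is lost when that controller fails.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Path:
    """One path to a LUN, as listed in the note for LUN 1."""
    name: str
    controller: int
    port: Optional[int]  # None where the article does not state the port


# Paths for LUN 1, in preference order (the ordering is an assumption).
LUN1_PATHS = [
    Path("vmhba1:0:1", 1, None),  # assumption: controller 1, port not stated
    Path("vmhba1:1:1", 2, 2),
    Path("vmhba2:0:1", 1, 2),
    Path("vmhba2:1:1", 2, 1),
]


def select_active_path(paths: Iterable[Path],
                       failed_controllers: Set[int]) -> Optional[Path]:
    """Return the first path whose controller has not failed, or None."""
    for path in paths:
        if path.controller not in failed_controllers:
            return path
    return None


if __name__ == "__main__":
    print(select_active_path(LUN1_PATHS, set()).name)  # normal operation
    print(select_active_path(LUN1_PATHS, {1}).name)    # controller 1 failed
```

With no failed controller the model picks vmhba1:0:1, and with controller 1 marked as failed it picks vmhba1:1:1, which matches the behaviour described in this article.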
- Disk shelf failure (storage controller has not failed)
In the event of an entire disk shelf failure, the storage controller accesses the mirrored interconnected enclosure. The ESX host continues to use the same HBA/NIC to access the same storage controller and port. No manual intervention is required for ESX hosts accessing the NetApp FAS storage array via NFS/iSCSI/FCP protocol.
What happens to an ESX host in the event of complete filer/array failure?
- For NetApp FAS storage controller running Data ONTAP 7.2.4 or newer:
After you perform a manual MetroCluster failover, the UUIDs of the mirrored LUNs are retained. Perform a rescan for each ESX host to detect the VMFS volumes running on the mirrored LUNs. When the VMFS volumes are detected, power on the virtual machines.
- For NetApp FAS storage controller running Data ONTAP older than 7.2.4:
After you perform a manual MetroCluster failover, the mirrored LUNs do not retain the same LUN UUID as the original LUNs. When these LUNs house the VMFS-3 file system, ESX 3.x detects the volumes as being on snapshot LUNs. Similarly, if a raw LUN that is mapped as an RDM (Raw Device Mapping) is replicated or mirrored through MetroCluster, the metadata entry for the RDM must be recreated to map to the replicated or mirrored LUN.
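The version-dependent behaviour above can be summarised as a simple decision function. The following Python sketch is illustrative only: `parse_ontap_version` and `post_failover_steps` are hypothetical helper names, the version string is assumed to be plain dotted numbers (for example "7.2.4", without patch suffixes), and the returned text merely restates what this article says rather than adding any procedure.

```python
from typing import List, Tuple

# Data ONTAP release from which LUN UUIDs are retained after failover.
UUID_RETAINED_FROM = (7, 2, 4)


def parse_ontap_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '7.2.4' into (7, 2, 4)."""
    return tuple(int(part) for part in version.split("."))


def post_failover_steps(ontap_version: str) -> List[str]:
    """List what to expect or do on the ESX hosts after a manual failover."""
    if parse_ontap_version(ontap_version) >= UUID_RETAINED_FROM:
        return [
            "Mirrored LUNs retain their UUIDs.",
            "Rescan each ESX host to detect the VMFS volumes.",
            "Power on the virtual machines.",
        ]
    return [
        "Mirrored LUNs do not retain the original LUN UUIDs.",
        "ESX 3.x detects VMFS-3 volumes as being on snapshot LUNs.",
        "Recreate the RDM metadata entry for each mirrored raw LUN.",
    ]


if __name__ == "__main__":
    for step in post_failover_steps("7.2.4"):
        print(step)
```

Tuple comparison treats "7.2" as older than "7.2.4" and "7.3" as newer, which is the ordering this sketch relies on.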
Running VMware HA and FT in a MetroCluster environment
A MetroCluster configuration supports active workloads on both sites. The “failed site” is the site that experiences a failure or complete outage. The “remote site” is the site that has not failed.
| # | Failure Scenario | Data Availability Impact |
| 1 | Complete loss of power to disk shelf | None |
| 2 | Loss of one link on one disk loop | None |
| 3 | Failure and failback of storage controller | None |
| 4 | Loss of mirrored storage, network isolation | None |
| 5 | Total network isolation, including all ESX hosts (FT or non-FT enabled) and loss of hard drive | Applications or data on the non-FT virtual machines running on the affected ESX hosts are available after the virtual machines automatically come up on the surviving nodes of the VMware HA cluster. FT-enabled virtual machines run uninterrupted. |
| 6 | Loss of all ESX hosts in one site | Applications or data on the non-FT virtual machines running on the affected ESX hosts are available after the virtual machines automatically come up on the surviving nodes of the VMware HA cluster. FT-enabled virtual machines run uninterrupted. |
| 7 | Loss of one Brocade Fabric Interconnect switch (applicable for continuous availability solution with Fabric MetroCluster only) | None |
| 8 | Loss of one ISL between the Brocade Fabric Interconnect switches (applicable for continuous availability solution with Fabric MetroCluster only) | None |
| 9 | Loss of an entire site | Applications or data on the virtual machines (both FT-enabled and non-FT) that were running in the failed site are available after you execute the force takeover command from the surviving site and manually power on the virtual machines. |
| 10 | Loss of all ESX hosts in one site and loss of storage controller in the other site | None |
|  | Loss of disk pool 0 in both sites | None |
|  | Loss of storage controller in one site and loss of disk pool 0 in other | None |
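For planning purposes, the failure scenarios above can be grouped by the kind of recovery they involve. The Python sketch below encodes the table as data and filters it. The `Impact` categories and the `scenarios_with_impact` function are hypothetical names chosen for this illustration, and the three-way grouping is an interpretation of the impact column rather than terminology from VMware or NetApp.

```python
from enum import Enum
from typing import List, Tuple


class Impact(Enum):
    NONE = "no data availability impact"
    HA_RESTART = "non-FT VMs restart automatically via VMware HA"
    MANUAL_TAKEOVER = "force takeover and manual VM power-on required"


# (failure scenario, impact), in the order listed in the table.
SCENARIOS: List[Tuple[str, Impact]] = [
    ("Complete loss of power to disk shelf", Impact.NONE),
    ("Loss of one link on one disk loop", Impact.NONE),
    ("Failure and failback of storage controller", Impact.NONE),
    ("Loss of mirrored storage, network isolation", Impact.NONE),
    ("Total network isolation, including all ESX hosts and loss of hard drive",
     Impact.HA_RESTART),
    ("Loss of all ESX hosts in one site", Impact.HA_RESTART),
    ("Loss of one Brocade Fabric Interconnect switch", Impact.NONE),
    ("Loss of one ISL between the Brocade Fabric Interconnect switches",
     Impact.NONE),
    ("Loss of an entire site", Impact.MANUAL_TAKEOVER),
    ("Loss of all ESX hosts in one site and loss of storage controller "
     "in the other site", Impact.NONE),
    ("Loss of disk pool 0 in both sites", Impact.NONE),
    ("Loss of storage controller in one site and loss of disk pool 0 in other",
     Impact.NONE),
]


def scenarios_with_impact(impact: Impact) -> List[str]:
    """Return the scenario descriptions that fall into the given category."""
    return [name for name, kind in SCENARIOS if kind is impact]


if __name__ == "__main__":
    for name in scenarios_with_impact(Impact.MANUAL_TAKEOVER):
        print(name)
```

Only the loss of an entire site falls into the manual category in this grouping; every other listed scenario is either transparent or handled automatically by VMware HA.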
Disclaimer: The partner products referenced in this article are hardware devices that are developed and supported by the stated partners. Use of these products is also governed by the end user license agreements of the partners. You must obtain the application, support, and licensing for using these products from the partners. For more information, see Support Information in this article.

