VMware VI3 support with NetApp MetroCluster
Details
Solution
What is a MetroCluster?
MetroCluster provides synchronous mirroring of volumes between two storage controllers, delivering storage high availability and disaster recovery. A MetroCluster configuration consists of two NetApp FAS controllers clustered together, residing either in the same datacenter or in two different physical locations. It provides recovery from any single storage component failure or multiple-point failure, and single-command recovery in the event of a complete site disaster. For additional information, such as the maximum supported distance and configuration requirements, contact NetApp.
What happens to an ESX host in the event of a single storage component failure?
Note: The following paths use LUN 1 as an example in a MetroCluster configuration of two storage controllers, each with two ports.
vmhba1:0:1 (FAS controller 1, port 1)
vmhba1:1:1 (FAS controller 2, port 2)
vmhba2:0:1 (FAS controller 1, port 2)
vmhba2:1:1 (FAS controller 2, port 1)
- Storage controller failure (disk shelves have not failed)
For ESX hosts accessing the NetApp FAS storage controller via the iSCSI or NFS protocol, the surviving storage controller performs a "takeover", meaning the target IP address (used as the iSCSI target or for mounting the NFS datastore) is brought up on the surviving storage controller. No manual intervention is required on the ESX hosts, and there is no disruption to data availability.
For ESX hosts accessing the NetApp FAS storage controller via the FCP protocol, the HBAs see the two clustered FAS storage controller nodes as one storage array unit with the same WWNN (World Wide Node Name). A given LUN, if configured properly, is visible through the paths listed in the note at the beginning of this section.
Under normal operation, LUN 1 is active on the vmhba1:0:1 path. In the event of a storage controller failure, the path vmhba1:0:1 becomes unavailable and the active path fails over to vmhba1:1:1. With the appropriate multipathing policy in place, no manual intervention is required on the ESX hosts. Again, this causes no disruption to data availability.
- Disk shelf failure (storage controller has not failed)
In the event of an entire disk shelf failure, the storage controller accesses the mirrored interconnected enclosure. The ESX host continues to use the same HBA/NIC to access the same storage controller and port. No manual intervention is required for ESX Servers accessing the NetApp FAS storage array via NFS/iSCSI/FCP protocol.
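The FCP path failover described for a storage controller failure can be illustrated with a small Python model. This is only a sketch: the `Path` class and the `parse_path` and `select_active_path` functions are hypothetical names, not part of any VMware or NetApp tool, and the selection rule (use the first listed path whose controller is still available) is an assumption made for illustration rather than the actual ESX multipathing policy. Path names follow the ESX 3.x `vmhbaA:T:L` notation (adapter, target, LUN) used in the example paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str        # ESX 3.x path name, e.g. "vmhba1:0:1"
    controller: int  # FAS controller that owns the target port
    port: int        # port number on that controller


# Example paths to LUN 1, taken from the note in this article.
LUN1_PATHS = [
    Path("vmhba1:0:1", controller=1, port=1),
    Path("vmhba1:1:1", controller=2, port=2),
    Path("vmhba2:0:1", controller=1, port=2),
    Path("vmhba2:1:1", controller=2, port=1),
]


def parse_path(name):
    """Split 'vmhbaA:T:L' into (adapter, target, lun) integers."""
    hba, target, lun = name.split(":")
    return int(hba[len("vmhba"):]), int(target), int(lun)


def select_active_path(paths, failed_controllers=frozenset()):
    """Return the first path whose controller has not failed (assumed rule)."""
    for path in paths:
        if path.controller not in failed_controllers:
            return path
    return None  # no usable path remains


normal = select_active_path(LUN1_PATHS)
after_failure = select_active_path(LUN1_PATHS, failed_controllers={1})
print(normal.name)         # vmhba1:0:1
print(after_failure.name)  # vmhba1:1:1
```

With controller 1 marked as failed, the model moves the active path from vmhba1:0:1 to vmhba1:1:1, matching the behavior described for LUN 1.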
What happens to an ESX host in the event of complete filer/array failure?
In the event of a complete failure of a storage controller and/or all of its disk shelves (the storage controller and its associated local disk shelves), you must perform a manual failover of the MetroCluster. Contact NetApp for documentation and detailed steps. Additional steps are required on the ESX hosts, depending on the version of NetApp Data ONTAP running on the FAS storage controllers.
- For NetApp FAS storage controllers running Data ONTAP 7.2.4 or newer:
After you perform a manual MetroCluster failover, the UUIDs of the mirrored LUNs are retained. Perform a rescan for each ESX host to detect the VMFS volumes running on the mirrored LUNs. When the VMFS volumes are detected, power on the virtual machines.
- For NetApp FAS storage controllers running a Data ONTAP version older than 7.2.4:
After you perform a manual MetroCluster failover, the mirrored LUNs do not keep the same LUN UUID as the original LUNs. When these LUNs house the VMFS-3 file system, ESX 3.x detects the volumes as being on snapshot LUNs. Similarly, if a raw LUN that is mapped as an RDM (Raw Device Mapping) is replicated or mirrored through MetroCluster, the metadata entry for the RDM must be recreated to map to the replicated or mirrored LUN.
To ensure the ESX hosts have access to the VMFS volumes on the mirrored LUNs, set the advanced VMkernel option LVM.DisallowSnapshotLUN to 0 and perform a rescan for each ESX host. After the ESX host(s) detect the VMFS volumes, power on the virtual machines.
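The version-dependent host steps above can be summarized as a small decision helper. The following Python sketch is illustrative only: `parse_version` and `post_failover_steps` are hypothetical names, the helper assumes a plain dotted numeric version string such as "7.2.4" (patch suffixes are not handled), and the position of the RDM step in the sequence is an assumption, since this article does not state when it must be done.

```python
def parse_version(text):
    """Turn a dotted version such as '7.2.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Threshold stated in this article: LUN UUIDs are retained from 7.2.4 onward.
UUID_RETAINED_FROM = (7, 2, 4)


def post_failover_steps(ontap_version):
    """List the ESX host steps that follow a manual MetroCluster failover."""
    steps = []
    older = parse_version(ontap_version) < UUID_RETAINED_FROM
    if older:
        # Mirrored LUNs get new UUIDs, so ESX 3.x treats them as snapshot LUNs.
        steps.append("Set LVM.DisallowSnapshotLUN to 0 on each ESX host")
    steps.append("Rescan storage on each ESX host")
    if older:
        # Placement of this step is an assumption, not stated in the article.
        steps.append("Recreate the metadata entry for any RDM on a mirrored LUN")
    steps.append("Power on the virtual machines after the VMFS volumes are detected")
    return steps


for step in post_failover_steps("7.2.3"):
    print(step)
```

Calling `post_failover_steps("7.2.4")` returns only the rescan and power-on steps, because the mirrored LUNs retain their UUIDs on that version and newer.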
Running VMware HA in a MetroCluster environment
The following configurations and operational scenarios are tested and supported by VMware and NetApp.
A MetroCluster configuration supports active workloads on both sites. The term “failed site” refers to the site that experiences a failure or complete outage. The term “remote site” refers to the site that has not failed.
Stretch Configuration
Note: For details of configuration requirements and maximum distance supported, please contact NetApp.
| Operational Scenario | MetroCluster Behavior | VMware High Availability (HA) Behavior |
|---|---|---|
| Loss of one link on one disk loop | No MetroCluster event | No VMware HA event |
| Complete loss of power to disk shelf | Storage controller accesses the disk shelf from the remote site; there is no disruption to data access | No VMware HA event |
| Failure of storage controller | Controller in remote site performs automatic takeover; there is no disruption of data access to either site. Note: For details on how ESX hosts respond to controller failure, see the "What happens to an ESX host in the event of a single storage component failure?" section of this article. | No VMware HA event |
| Failback of storage controller | Controller in failed site reclaims its original role prior to failure; there is no disruption of storage access to either site. Note: For details on how ESX hosts respond to controller failure, see the "What happens to an ESX host in the event of a single storage component failure?" section of this article. | No VMware HA event |
| “Failed site” experiences complete outage; disaster declared | No automatic cluster takeover; user must perform manual force failover. Note: For details on how remote site ESX hosts respond to complete site failure, see the "What happens to an ESX host in the event of complete filer/array failure?" section of this article. | No VMware HA event; virtual machines that are accessing storage in failed site need to be powered on manually after manual force takeover |
| Failed back to “failed site” after site restoration | The controller in “failed site” reclaims its original role prior to failure; there is no disruption of storage access to either site. Note: For details on how ESX hosts respond to controller failure, see the "What happens to an ESX host in the event of a single storage component failure?" section of this article. | No VMware HA event |
| Test ESX server failures (combination of failures between hosts from both sites) | No MetroCluster event | VMware HA event; auto power on of virtual machines from failed hosts on the surviving nodes |
Fabric Configuration
Note: For details of configuration requirements and maximum distance supported, please contact NetApp.
Operational scenarios for Brocade switch and Interswitch Link (ISL) are specific to the cluster interconnect fabric for the storage controllers.
| Operational Scenario | MetroCluster Behavior | VMware High Availability (HA) Behavior |
|---|---|---|
| Loss of one link on one disk loop | No MetroCluster event | No VMware HA event |
| Complete loss of power to disk shelf | Storage controller accesses the disk shelf from the remote site; there is no disruption to data access | No VMware HA event |
| Loss of one Brocade Fabric Interconnect Switch | No MetroCluster event | No VMware HA event |
| Loss of one inter-switch link (ISL) between the Brocade Fabric Interconnect Switches | No MetroCluster event | No VMware HA event |
| Failure of storage controller | Controller in remote site performs automatic takeover; there is no disruption of data access to either site. Note: For details on how ESX hosts respond to controller failure, see the "What happens to an ESX host in the event of a single storage component failure?" section of this article. | No VMware HA event; virtual machines that are accessing storage in failed site need to be powered on manually after manual force takeover |
| Failback of storage controller | Controller in failed site reclaims its original role prior to failure; there is no disruption of storage access to either site. Note: For details on how ESX hosts respond to controller failure, see the "What happens to an ESX host in the event of a single storage component failure?" section of this article. | No VMware HA event |
| “Failed site” experiences complete outage; disaster declared | No automatic cluster takeover; user must perform manual force failover. Note: For details on how remote site ESX hosts respond to complete site failure, see the "What happens to an ESX host in the event of complete filer/array failure?" section of this article. | No VMware HA event; virtual machines that are accessing storage in failed site need to be powered on manually after manual force takeover |
| Failed back to “failed site” after site restoration | The controller in “failed site” reclaims its original role prior to failure; there is no disruption of storage access to either site | No VMware HA event |
| Test ESX server failures (combination of failures between hosts from both sites) | No MetroCluster event | VMware HA event; auto power on of virtual machines from failed hosts on the surviving nodes |
Note: If Distributed Resource Scheduler (DRS) is enabled and set to Fully Automated or Partially Automated for the HA cluster in a MetroCluster environment (Stretch or Fabric), a virtual machine may access storage that is not local to its ESX host. In that case, data access goes through the storage controller in the remote site, over the fabric interconnect.
- KB Article: 1001783
- Updated: Aug 25, 2009
- Products: VMware ESX
- Product Versions: VMware ESX 3.0.x, VMware ESX 3.5.x

