MetroCluster provides synchronous mirroring of volumes between two storage arrays for storage high availability and disaster recovery. A MetroCluster configuration consists of two NetApp FAS storage arrays clustered together, residing either in the same datacenter or in two different physical locations. It provides recovery from any single storage component failure, and single-command recovery in case of a complete site disaster. For additional information, such as the maximum distance supported and configuration requirements, contact NetApp.
What happens to an ESX Server in the event of a single storage component failure?
Storage controller failure (disk shelves have not failed)
For ESX Servers accessing the NetApp FAS storage controller via the iSCSI or NFS protocol, the surviving storage controller performs a "takeover", meaning the target IP address (used as the iSCSI target or for mounting the NFS datastore) is brought up on the surviving storage controller. No manual intervention is required on the ESX Server hosts.
For ESX Servers accessing the NetApp FAS storage controller via the FCP protocol, the HBAs see the two clustered FAS storage controller nodes as one storage array unit, with the same WWNN (World Wide Node Name). When configured properly, a given LUN is reachable over paths such as the following (LUN 1 is used as an example, in a MetroCluster configuration of two storage controllers with two ports each):
vmhba1:0:1 (FAS controller 1, port 1)
vmhba1:1:1 (FAS controller 2, port 2)
vmhba2:0:1 (FAS controller 1, port 2)
vmhba2:1:1 (FAS controller 2, port 1)
Under normal operation, LUN 1 is active on the vmhba1:0:1 path. In the event of a storage controller failure, the path vmhba1:0:1 becomes unavailable and the active path fails over to vmhba1:1:1. No manual intervention is required on the ESX Server hosts.
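The path failover described above can be illustrated with a short Python sketch. It uses the four example paths for LUN 1 from this article and assumes each path name has the form adapter:target:LUN. The names Path, parse_path and pick_active_path are hypothetical, introduced only for illustration. The sketch models the behavior with a simple first-available rule; it does not query a real host, and actual path selection is handled by the ESX Server multipathing logic.

```python
from typing import NamedTuple, Optional, Sequence, Set


class Path(NamedTuple):
    name: str        # path name as shown on the host, e.g. "vmhba1:0:1"
    controller: int  # FAS controller that serves this path
    port: int        # port on that controller


# The four example paths for LUN 1 listed in this article, in the same order.
PATHS = [
    Path("vmhba1:0:1", 1, 1),
    Path("vmhba1:1:1", 2, 2),
    Path("vmhba2:0:1", 1, 2),
    Path("vmhba2:1:1", 2, 1),
]


def parse_path(name: str):
    """Split 'vmhbaA:T:L' into (adapter, target, lun). Assumed naming form."""
    adapter, target, lun = name.split(":")
    return adapter, int(target), int(lun)


def pick_active_path(paths: Sequence[Path],
                     failed_controllers: Set[int] = frozenset()) -> Optional[Path]:
    """Return the first path whose controller has not failed, or None.

    Simplified model (an assumption): the first usable path in list order
    becomes the active path.
    """
    for path in paths:
        if path.controller not in failed_controllers:
            return path
    return None
```

Calling pick_active_path(PATHS) models normal operation, and pick_active_path(PATHS, {1}) models the failure of controller 1, where every path through that controller is skipped.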
Disk shelf failure (storage controller has not failed)
In the event of an entire disk shelf failure, the storage controller accesses the mirrored interconnected enclosure. No manual intervention is required for ESX Servers accessing the NetApp FAS storage array via NFS/iSCSI/FCP protocol.
What happens to an ESX Server in the event of complete filer/array failure?
In the event of complete filer/array failure (storage controller and associated local disk shelves), you must perform a manual failover of the MetroCluster. Contact NetApp for documentation and detailed steps. Additional steps are required for ESX Servers depending on the version of NetApp Data ONTAP running on the FAS storage controllers.
For NetApp FAS storage controller running Data ONTAP 7.2.4 or newer:
After you perform a manual MetroCluster failover, the UUIDs of the mirrored LUNs are retained. Perform a rescan on each ESX Server to detect the VMFS volumes on the mirrored LUNs. After the volumes are detected, power on the virtual machines.
For NetApp FAS storage controller running Data ONTAP older than 7.2.4:
After you perform a manual MetroCluster failover, the mirrored LUNs do not maintain the same LUN UUID as the original LUNs. When these LUNs house the VMFS-3 file system, ESX Server 3.x detects the volumes as being on snapshot LUNs. Similarly, if a raw LUN that is mapped as an RDM (Raw Device Mapping) is replicated or mirrored through MetroCluster, the metadata entry for the RDM must be recreated to map to the replicated or mirrored LUN.
To ensure that the ESX hosts on the remote site have access to the VMFS volumes on the mirrored LUNs, set the advanced VMkernel option LVM.DisallowSnapshotLUN to 0 and perform a rescan on each ESX host. After the ESX hosts detect the VMFS volumes, power on the virtual machines.
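The version-dependent steps above can be summarized in a minimal Python sketch. It only builds a list of commands to review and does not execute anything. Several points are assumptions not stated in this article: that the Data ONTAP version is a purely numeric dotted string, that the option is set from the ESX 3.x service console with esxcfg-advcfg, and that the rescan is done per adapter with esxcfg-rescan. The function names are hypothetical. Verify the commands against the documentation for your ESX Server release before use.

```python
from typing import List, Sequence, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '7.2.4' into (7, 2, 4). Assumes a purely numeric dotted version."""
    return tuple(int(part) for part in text.split("."))


def retains_lun_uuid(ontap_version: str) -> bool:
    """Per this article, mirrored LUNs keep their UUIDs on 7.2.4 or newer."""
    return parse_version(ontap_version) >= (7, 2, 4)


def recovery_commands(ontap_version: str,
                      adapters: Sequence[str]) -> List[List[str]]:
    """Build the per-host command list to run after the manual failover.

    The command names and arguments are assumptions about the ESX 3.x
    service console, not taken from this article.
    """
    commands: List[List[str]] = []
    if not retains_lun_uuid(ontap_version):
        # Older Data ONTAP: volumes appear to be on snapshot LUNs, so set
        # LVM.DisallowSnapshotLUN to 0 before rescanning.
        commands.append(["esxcfg-advcfg", "-s", "0", "/LVM/DisallowSnapshotLUN"])
    for adapter in adapters:
        # Rescan each adapter so the VMFS volumes are detected.
        commands.append(["esxcfg-rescan", adapter])
    return commands
```

For example, recovery_commands("7.2.3", ["vmhba1", "vmhba2"]) includes the option change followed by two rescans, while recovery_commands("7.2.4", ["vmhba1"]) contains only the rescan. Powering on the virtual machines remains a separate step after the volumes are detected.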
Note: VMware High Availability (HA) and Distributed Resource Scheduler (DRS) have not been tested with MetroCluster by VMware.
Keywords
snapshot, DR, Disaster Recovery, Metro Cluster, NetApp, VMFS3