vSphere 5.x support with NetApp MetroCluster (2031038)

Purpose

This article provides information about deploying a vSphere Metro Storage Cluster (vMSC) across two datacenters or sites using the NetApp MetroCluster solution with vSphere 5.0, 5.1, or 5.5. For ESXi 5.0, 5.1, and 5.5, the article applies to FC, iSCSI, and NFS implementations of Stretch and Fabric MetroCluster.

Resolution

What is vMSC?

vSphere Metro Storage Cluster (vMSC) is a new certified configuration for NetApp MetroCluster storage architectures. A vMSC configuration is designed to maintain data availability beyond a single physical or logical site. A storage device configured in a vMSC configuration is supported after successful vMSC certification. All supported storage devices are listed in the VMware Storage Compatibility Guide.

What is a NetApp MetroCluster?

NetApp MetroCluster is a synchronous replication solution between two NetApp controllers that provides storage high availability and disaster recovery in a campus or metropolitan area. A MetroCluster (MC) configuration consists of two NetApp controllers clustered together, residing either in the same data center or at two different physical locations. MC handles any single failure in the storage configuration, and certain multiple failures, without disrupting data availability, and provides single-command recovery in the event of a complete site disaster.

What is MetroCluster TieBreaker?

The MetroCluster TieBreaker (MCTB) solution is a plug-in that runs in the background as a Windows service or UNIX daemon on an OnCommand Unified Manager (OC UM) host. The OC UM host can be a physical machine or a virtual machine. MCTB provides automated failover for a MetroCluster solution in scenarios where automatic failover is otherwise not possible, such as an entire site failure.

MCTB continuously monitors the MetroCluster controllers and their corresponding network gateways from an OnCommand server at a third location. When MCTB detects conditions that require a Cluster Failover on Disaster (CFOD), it issues the commands necessary to initiate the CFOD. Log messages and OnCommand events are generated as needed to keep the operator informed of the state of the MetroCluster and MCTB.
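
As a rough illustration of the decision logic involved (this is not NetApp's implementation; the addresses and the takeover helper are assumptions for this sketch), the following Python example shows the kind of quorum check a tiebreaker performs: a disaster takeover is initiated only when both the controller and the network gateway of one site stop responding while the other site remains reachable.

    import subprocess
    import time

    POLL_INTERVAL = 10  # seconds between health checks

    # Placeholder addresses for the two MetroCluster sites, monitored from the third site.
    SITES = {
        "site1": {"controller": "10.0.1.10", "gateway": "10.0.1.1"},
        "site2": {"controller": "10.0.2.10", "gateway": "10.0.2.1"},
    }

    def is_reachable(address):
        # Single ICMP echo request with a 2-second timeout (Linux ping syntax).
        return subprocess.call(
            ["ping", "-c", "1", "-W", "2", address],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

    def issue_cfod(surviving_site):
        # Stub: a real tiebreaker would run the disaster takeover on the surviving
        # controller (cf forcetakeover -d in 7-Mode) and raise an OnCommand event.
        print("CFOD required: initiating takeover on " + surviving_site)

    while True:
        alive = {name: is_reachable(s["controller"]) or is_reachable(s["gateway"])
                 for name, s in SITES.items()}
        # Declare a site disaster only if BOTH its controller and gateway are down
        # while the other site is still reachable; this avoids a takeover on a simple
        # ISL or monitoring-link outage.
        if not alive["site1"] and alive["site2"]:
            issue_cfod("site2")
        elif not alive["site2"] and alive["site1"]:
            issue_cfod("site1")
        time.sleep(POLL_INTERVAL)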

Configuration Requirements

These requirements must be satisfied to support this configuration:
  • For distances under 500 m, a Stretch MetroCluster configuration can be used. For distances over 500 m but under 160 km, on systems running Data ONTAP 8.1.1, a Fabric MetroCluster configuration can be used.
  • The maximum round-trip latency for the Ethernet network between the two sites must be less than 10 ms, and the maximum round-trip latency for SyncMirror replication must be less than 3 ms (see the latency check sketch after this list).
  • The storage network must provide a minimum of 1 Gbps of throughput between the two sites for ISL connectivity.
  • ESXi hosts in the vMSC configuration should be configured with at least two different IP networks, one for storage and the other for management and virtual machine traffic. The Storage network handles NFS and iSCSI traffic between ESXi hosts and NetApp Controllers. The second network (VM Network) supports virtual machine traffic as well as management functions for the ESXi hosts. End users can choose to configure additional networks for other functionality such as vMotion/Fault Tolerance. VMware recommends this as a best practice, but it is not a strict requirement for a vMSC configuration.
  • FC switches are used for vMSC configurations where datastores are accessed via the FC protocol; ESX management traffic is carried on an IP network. End users can choose to configure additional networks for other functionality, such as vMotion/Fault Tolerance. This is recommended as a best practice but is not a strict requirement for a vMSC configuration.
  • For NFS/iSCSI configurations, a minimum of two uplinks for the controllers must be used. An interface group (ifgroup) should be created using the two uplinks in multimode configurations.
  • The VMware datastores and NFS volumes configured for the ESX servers must be provisioned on mirrored aggregates.
  • vCenter Server must be able to connect to the ESX servers at both sites.
  • The number of hosts in an HA cluster must not exceed 32.
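
To illustrate how the latency requirements above might be spot-checked before deployment, here is a minimal Python sketch. The hostnames are placeholders, and an ICMP-based measurement is only an approximation of the latency seen on the actual storage and replication links.

    import re
    import subprocess

    # Thresholds from the requirements above.
    MAX_ETHERNET_RTT_MS = 10.0   # ESXi management/VM network between sites
    MAX_SYNCMIRROR_RTT_MS = 3.0  # SyncMirror replication links

    def average_rtt_ms(address, count=10):
        """Return the average ICMP round-trip time in ms (parses Linux ping output)."""
        out = subprocess.run(["ping", "-c", str(count), address],
                             capture_output=True, text=True, check=True).stdout
        # Final line looks like: rtt min/avg/max/mdev = 0.431/0.502/0.611/0.052 ms
        match = re.search(r"= [\d.]+/([\d.]+)/", out)
        return float(match.group(1))

    for label, address, limit in [
        ("inter-site Ethernet", "site2-esxi-mgmt.example.com", MAX_ETHERNET_RTT_MS),
        ("SyncMirror link", "site2-controller.example.com", MAX_SYNCMIRROR_RTT_MS),
    ]:
        rtt = average_rtt_ms(address)
        status = "OK" if rtt < limit else "EXCEEDS LIMIT"
        print("%s: avg RTT %.2f ms (limit %.1f ms) -> %s" % (label, rtt, limit, status))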

Notes:
  • A MetroCluster TieBreaker machine should be deployed at a third site and must be able to access the storage controllers at Site 1 and Site 2 in order to initiate a CFOD in the event of an entire site failure.

  • vMSC certification testing was conducted on vSphere 5.0 and NetApp Data ONTAP version 8.1 operating in 7-Mode. For ESXi 5.5, vMSC certification testing was successfully completed on vSphere 5.5 and NetApp Data ONTAP version 8.2 operating in 7-Mode.

  • For more information on NetApp MetroCluster design and implementation, see the NetApp Technical Report, Best Practices for MetroCluster Design and Implementation. For information about NetApp in a vSphere environment, see NetApp Storage Best Practices for VMware vSphere.

Solution Overview

The NetApp Unified Storage Architecture offers an agile and scalable storage platform. All NetApp storage systems use the Data ONTAP operating system to provide SAN (FC, iSCSI) and NFS access.

MetroCluster leverages NetApp HA cluster failover (CFO) functionality to automatically protect against controller failures. Additionally, MetroCluster layers local SyncMirror, cluster failover on disaster (CFOD), hardware redundancy, and geographical separation to achieve extreme levels of availability. Local SyncMirror synchronously mirrors data across the two halves of the MetroCluster configuration by writing data to two plexes: the local plex (on the local shelf), which actively serves data, and the remote plex (on the remote shelf), which normally does not serve data. If the local shelf fails, the remote shelf seamlessly takes over data-serving operations. Because mirroring is synchronous, no data loss occurs. Hardware redundancy is provided for all MetroCluster components: controllers, storage, cables, switches (Fabric MetroCluster), and adapters are all redundant.

A VMware HA/DRS cluster is created across the two sites using ESXi 5.x hosts and managed by vCenter Server 5.x. The vSphere Management, vMotion, and virtual machine networks are connected using a redundant network between the two sites. It is assumed that the vCenter Server managing the HA/DRS cluster can connect to the ESXi hosts at both sites.
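
As one possible way to build such a stretched cluster programmatically, the following pyVmomi sketch enables HA and DRS on a new cluster and adds hosts from both sites. The hostnames, credentials, and cluster name are placeholders, and adding a host may additionally require its SSL thumbprint in the connect spec; treat this as a sketch rather than a definitive procedure.

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab use only; validate certificates in production
    si = SmartConnect(host="vcenter.example.com", user="administrator",
                      pwd="password", sslContext=ctx)
    datacenter = si.content.rootFolder.childEntity[0]  # first datacenter; adjust as needed

    # Enable HA and DRS on the new cluster that will span both sites.
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo(enabled=True)   # vSphere HA
    spec.drsConfig = vim.cluster.DrsConfigInfo(enabled=True)   # DRS
    cluster = datacenter.hostFolder.CreateClusterEx(name="vMSC-Cluster", spec=spec)

    # Add ESXi hosts from Site 1 and Site 2 to the same cluster.
    for esxi in ["esxi01-site1.example.com", "esxi01-site2.example.com"]:
        connect_spec = vim.host.ConnectSpec(hostName=esxi, userName="root",
                                            password="password")
        cluster.AddHost(spec=connect_spec, asConnected=True)  # returns a Task to monitor

    Disconnect(si)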

Based on the distance considerations, NetApp MetroCluster can be deployed in two different configurations:
  • Stretch MetroCluster
  • Fabric MetroCluster

Stretch MetroCluster

This is a Stretch MetroCluster configuration:



Fabric MetroCluster

This is a Fabric MetroCluster configuration:



Note: These illustrations are simplified representations and do not indicate the redundant front-end components, such as Ethernet and fibre channel switches.

The vMSC configuration used in this certification program was configured in Uniform Host Access mode. In this configuration, the ESXi hosts at each site are configured to access the storage at both sites.

In cases where RDMs are configured for virtual machines residing on NFS volumes, a separate LUN must be configured to hold the RDM mapping files. Ensure that this LUN is presented to all ESXi hosts.
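
As a sanity check that such a LUN is actually visible from every host, a pyVmomi sketch along the following lines could be used. It reuses the si connection and cluster object from the earlier sketch, and the NAA canonical name is a placeholder.

    # Placeholder canonical name of the LUN that holds the RDM mapping files.
    RDM_LUN_NAA = "naa.60a98000486e5334524a6c4f63624f55"

    def hosts_missing_lun(cluster, naa_id):
        """Return the names of cluster hosts that cannot see the given LUN."""
        missing = []
        for host in cluster.host:
            luns = host.config.storageDevice.scsiLun or []
            if not any(lun.canonicalName == naa_id for lun in luns):
                missing.append(host.name)
        return missing

    missing = hosts_missing_lun(cluster, RDM_LUN_NAA)
    if missing:
        print("RDM mapping LUN is not presented to: " + ", ".join(missing))
    else:
        print("All hosts in the cluster can see the RDM mapping LUN.")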

vMSC test scenarios

These are the vMSC test scenarios, with the corresponding NetApp controller behavior and VMware HA behavior for each:

  • Controller single path failure
    NetApp controller behavior: Controller path failover occurs. All LUNs and volumes remain connected. For FC datastores, path failover is triggered from the host and the next available path to the same controller becomes active. All ESXi iSCSI/NFS sessions remain active in multimode configurations of two or more network interfaces.
    VMware HA behavior: No impact.

  • ESXi single storage path failure
    NetApp controller behavior: No impact on LUN and volume availability. The ESXi storage path fails over to the alternative path. All sessions remain active.
    VMware HA behavior: No impact.

  • Site 1 controller failure
    NetApp controller behavior: LUN availability remains unaffected. FC datastores fail over to the alternate available path of the surviving controller. ESXi iSCSI sessions affected by the node failure fail over to the surviving controller. ESXi NFS volumes continue to be accessible through the surviving node.
    VMware HA behavior: No impact.

  • Site 2 controller failure
    NetApp controller behavior: LUN availability remains unaffected. FC datastores fail over to the alternate available path of the surviving controller. ESXi iSCSI sessions affected by the node failure fail over to the surviving controller. ESXi NFS volumes continue to be accessible through the surviving node.
    VMware HA behavior: No impact.

  • MCTB virtual machine failure
    NetApp controller behavior: No impact on LUN and volume availability. All sessions remain active.
    VMware HA behavior: No impact.

  • MCTB virtual machine single link failure
    NetApp controller behavior: No impact. Controllers continue to function normally.
    VMware HA behavior: No impact.

  • Complete Site 1 failure, including ESXi hosts and controller
    NetApp controller behavior: LUN and volume availability remains unaffected. FC datastores fail over to the alternate available path of the surviving controller. iSCSI sessions to surviving ESXi nodes remain active. After the failed controller comes back online and giveback is initiated, all affected aggregates resynchronize automatically.
    VMware HA behavior: Virtual machines on the failed Site 1 ESXi nodes fail. HA restarts the failed virtual machines on ESXi hosts at Site 2.

  • Complete Site 2 failure, including ESXi hosts and controller
    NetApp controller behavior: LUN and volume availability remains unaffected. FC datastores fail over to the alternate available path of the surviving controller. iSCSI sessions to surviving ESXi nodes remain active. After the failed controller comes back online and giveback is initiated, all affected aggregates resynchronize automatically.
    VMware HA behavior: Virtual machines on the failed Site 2 ESXi nodes fail. HA restarts the failed virtual machines on ESXi hosts at Site 1.

  • Single ESXi host failure (shutdown)
    NetApp controller behavior: No impact. Controllers continue to function normally.
    VMware HA behavior: Virtual machines on the failed ESXi node fail. HA restarts the failed virtual machines on surviving ESXi hosts.

  • Multiple ESXi host management network failure
    NetApp controller behavior: No impact. Controllers continue to function normally.
    VMware HA behavior: A new master is elected within the network partition. Virtual machines remain running and do not need to be restarted.

  • Site 1 and Site 2 simultaneous failure (shutdown) and restoration
    NetApp controller behavior: Controllers boot up and resynchronize. All LUNs and volumes become available. All iSCSI sessions and FC paths to the ESXi hosts are re-established and virtual machines restart successfully. As a best practice, power on the NetApp controllers first and allow the LUNs/volumes to become available before powering on the ESXi hosts.
    VMware HA behavior: No impact.

  • Failure of all ISL links on the ESXi management network
    NetApp controller behavior: No impact to controllers. LUNs and volumes remain available.
    VMware HA behavior: If the HA host isolation response is set to Leave Powered On, virtual machines at each site continue to run because the storage heartbeat is still active. Partitioned hosts at the site that does not contain the Fault Domain Manager (master) elect a new master.

  • Failure of all storage ISL links
    NetApp controller behavior: No impact to controllers. LUNs and volumes remain available. When the ISL links come back online, the aggregates resynchronize.
    VMware HA behavior: No impact.

  • System Manager (management server) failure
    NetApp controller behavior: No impact. Controllers continue to function normally. The NetApp controllers can still be managed from the command line.
    VMware HA behavior: No impact.

  • vCenter Server failure
    NetApp controller behavior: No impact. Controllers continue to function normally.
    VMware HA behavior: No impact on HA. However, DRS rules cannot be applied.
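
The host isolation response referenced in the scenarios above (Leave Powered On) can be applied as the cluster-wide HA default. As an illustrative pyVmomi sketch, reusing the cluster object from the earlier example (the API value "none" corresponds to Leave Powered On):

    from pyVmomi import vim

    # "none" is the API value for the "Leave Powered On" host isolation response.
    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo(
        enabled=True,
        defaultVmSettings=vim.cluster.DasVmSettings(isolationResponse="none"),
    )
    cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)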
