NetApp ONTAP with NetApp SnapMirror Business Continuity (SM-BC) with VMware vSphere Metro Storage Cluster (vMSC).

Article ID: 312014

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information about deploying a vSphere Metro Storage Cluster (vMSC) across two failure domains while using NetApp SnapMirror Business Continuity (SMBC) as the underlying storage solution. This solution applies only to VMFS datastores. SMBC can also protect iSCSI and FCP storage used by guest operating systems, either as RDMs or NPIV devices, or using in-guest iSCSI software initiators.

NetApp ONTAP with NetApp SnapMirror Business Continuity (SMBC) with VMware vSphere Metro Storage Cluster (vMSC) is a Partner Verified and Supported Products (PVSP) solution provided and supported by NetApp Inc. The NetApp solution provides multi-site disaster recovery for VMs, Kubernetes (K8s) pods, and traditional applications that are running on NetApp storage in the VMware vSphere ecosystem.

Note: The PVSP policy implies that the solution is not directly supported by VMware. For issues with this configuration, contact NetApp Inc directly. See the Support Workflow on how partners can engage with VMware. It is the partner's responsibility to verify that the configuration functions with future vSphere major and minor releases, as VMware does not guarantee that compatibility with future releases is maintained.

Disclaimer: The partner products referenced in this article are developed and supported by a partner. The use of these products is also governed by the end-user license agreements of the partner. You must obtain the storage system, support, and licensing for using these products from the partner.
For more information, see:



Environment

VMware vSphere ESXi 6.0
VMware vSphere ESXi 7.x

Resolution

What is vMSC?

A VMware vSphere Metro Storage Cluster configuration is a VMware vSphere certified solution that combines synchronous replication with array-based clustering. These solutions are implemented to provide the same benefits that high-availability clusters provide to a local site, but in a geographically dispersed model with two locations in separate failure domains. At its core, a VMware vMSC infrastructure is a stretched cluster. The architecture extends what is defined as local in terms of network and storage, so that these subsystems span geographies and present a single, common set of base infrastructure resources to the vSphere cluster at both sites. All supported storage devices are listed in the VMware Compatibility Guide.

What is NetApp® SnapMirror® Business Continuity (SMBC)?

Introduced in NetApp ONTAP 9.9.1, SnapMirror® Business Continuity (SMBC) is a business continuity solution that provides a zero Recovery Point Objective (RPO) and a near-zero Recovery Time Objective (RTO). SMBC offers application-level granularity and automatic failover. Using SnapMirror® synchronous replication technology, SMBC replicates data over IP networks (LAN, WAN) for your enterprise's business-critical applications. SMBC can be deployed in an asymmetric active-active configuration in which both failure domains are online, but only one failure domain has the designated owning controller for a specific LUN. The owning controller can issue I/O commands directly to the LUN, optimizing data flow and performance. The secondary failure domain, hosting the non-owning controller, can accept I/O commands and proxy them to the LUN.

SMBC’s asymmetric active-active architecture integrates seamlessly with critical infrastructure services such as VMware vSphere, and key applications such as Oracle and Microsoft SQL Server in both virtual and physical environments. SMBC ensures uninterrupted operation of mission-critical business services even in the event of a complete site failure. The solution incorporates Transparent Application Failover (TAF) capabilities, enabling automatic switchover to the secondary copy without the need for manual intervention. This streamlined process eliminates the requirement for additional scripting, simplifying disaster recovery and ensuring seamless continuity of operations.

What is ONTAP Mediator?
The ONTAP Mediator for SMBC extends the ability of ONTAP to detect and handle failures through seamless application failover to the secondary failure domain. The ONTAP Mediator runs on a RHEL or CentOS physical or virtual machine in a third failure domain, separate from the two ONTAP clusters. Its purpose is to establish consensus across the following three-party quorum:
1.    Primary ONTAP cluster: the host for the primary consistency group in SMBC
2.    Secondary ONTAP cluster: the host for the mirrored consistency group in SMBC
3.    ONTAP Mediator

Figure 1) SMBC three-party quorum

The ONTAP Mediator serves as a repository for heartbeat and replication status information. Agents on the source and destination clusters write and read node health information, while the mediator and partner nodes contribute to the three-party quorum process. In case of failure, the nodes trigger an alert and facilitate automated failover to the mirror copy, ensuring uninterrupted application I/O. SMBC prevents "split brain" scenarios, where each storage copy assumes sole survival and claims to be the primary. The ONTAP Mediator is required to perform automated failover for site failures.
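
The role of the mediator in avoiding split brain can be illustrated with a small Python sketch. This is a simplified model written for this article's explanation only, not ONTAP's actual algorithm; the names `QuorumView` and `secondary_action` and the three boolean inputs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumView:
    """What the secondary cluster can observe (hypothetical model)."""
    primary_reachable: bool           # secondary reaches the primary directly
    mediator_reachable: bool          # secondary reaches the ONTAP Mediator
    primary_alive_per_mediator: bool  # mediator still records primary health


def secondary_action(view: QuorumView) -> str:
    """Return what the secondary cluster does in this simplified model."""
    if view.primary_reachable:
        # Normal operation: the relationship stays in sync.
        return "stay-secondary"
    if not view.mediator_reachable:
        # Without the mediator there is no quorum, so the secondary
        # cannot rule out split brain and must not promote itself.
        return "stay-secondary"
    if view.primary_alive_per_mediator:
        # Only the inter-site link is down; the primary keeps serving I/O.
        return "stay-secondary"
    # The mediator confirms that the primary is down: automated failover.
    return "failover"
```

In this model, a primary site failure leads to failover only when the mediator is reachable, which mirrors the statement above that the ONTAP Mediator is required for automated failover.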

For more information, review the Documentation for the SnapMirror Business Continuity.

Automatic Unplanned Failover in SnapMirror Business Continuity 

In a SnapMirror Business Continuity configuration, an automatic unplanned failover (AUFO) operation occurs in the event of a site failure. During an AUFO, the secondary cluster is converted to the primary and begins serving clients. This operation is performed only with assistance from the ONTAP Mediator.
In addition, an AUFO can be triggered if all nodes at a site fail for any of the following reasons:
•    Node power down
•    Node power loss
•    Node panic
You can reestablish the protection relationship and resume operations on the original source cluster using System Manager or the ONTAP CLI.

Configuring SnapMirror Business Continuity (SMBC)

Pre-requisites:
Hardware

SMBC exclusively supports 2-node high availability (HA) clusters, either AFF or ASA models. It is crucial to note that both primary and secondary clusters must be the same type, either AFF or ASA. Protection for business continuity utilizing FAS models is not supported. 
Refer to the following for more information on SMBC supported NetApp models.
•    NetApp AFF A-Series
•    NetApp AFF C-Series
•    NetApp ASA

Note: The purpose of SnapMirror Business Continuity (SMBC) is to safeguard against failures that can render a site inoperable, such as disasters, and ensure uninterrupted business operations. Consequently, SMBC is not supported within the same cluster. To establish effective protection, the source and destination clusters must be separate entities and in separate failure domains.

License
You are entitled to use SMBC if you have the Data Protection, Premium Bundle, or ONTAP One licenses on both source and destination storage clusters.

Software

•    All nodes within the source and destination clusters should be installed or upgraded to ONTAP 9.9.1 or a later version.
•    ONTAP Mediator 
•    VM or physical server running supported O/S for the ONTAP Mediator

Note: For more details, please refer to NetApp’s documentation for SnapMirror Business Continuity (SMBC).
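
The hardware, license, and software prerequisites above can be expressed as a simple pre-flight check. The Python sketch below is illustrative only: the `Cluster` record and its fields are hypothetical, the values would have to be gathered by hand or from your own inventory, and version strings are assumed to be plain dotted numbers such as 9.12.1.

```python
from dataclasses import dataclass

SUPPORTED_PLATFORMS = {"AFF", "ASA"}  # FAS models are not supported
ELIGIBLE_LICENSES = {"Data Protection", "Premium Bundle", "ONTAP One"}
MIN_ONTAP = (9, 9, 1)


@dataclass(frozen=True)
class Cluster:
    """Hypothetical inventory record for one ONTAP cluster."""
    name: str
    platform: str        # "AFF", "ASA", or "FAS"
    node_count: int
    ontap_version: str   # assumed to be dotted numbers, e.g. "9.12.1"
    licenses: frozenset


def parse_version(text):
    """Turn '9.12.1' into (9, 12, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def prerequisite_problems(source, destination):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    if source.name == destination.name:
        problems.append("source and destination must be separate clusters")
    if source.platform != destination.platform:
        problems.append("both clusters must be the same type (AFF or ASA)")
    for cluster in (source, destination):
        if cluster.platform not in SUPPORTED_PLATFORMS:
            problems.append(f"{cluster.name}: {cluster.platform} is not supported")
        if cluster.node_count != 2:
            problems.append(f"{cluster.name}: a 2-node HA cluster is required")
        if parse_version(cluster.ontap_version) < MIN_ONTAP:
            problems.append(f"{cluster.name}: ONTAP 9.9.1 or later is required")
        if not cluster.licenses & ELIGIBLE_LICENSES:
            problems.append(f"{cluster.name}: no eligible license installed")
    return problems
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "9.10.1" would sort before "9.9.1".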

Multipathing
SMBC leverages Asymmetric Logical Unit Access (ALUA), a standard SCSI mechanism that enables application host multipathing software to communicate with the storage array through paths with priorities and access availability. ALUA designates active optimized paths to the LUN's owning controllers and identifies others as active non-optimized paths. The non-optimized paths are utilized in cases where the primary path fails.
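
As a rough illustration of the ALUA behavior described above, the following Python sketch picks the paths a host would use for I/O: active optimized paths when any exist, otherwise active non-optimized paths. It models the principle only and is not the ESXi multipathing implementation; the `AluaState` enum, the `select_paths` function, and the path names are hypothetical.

```python
from enum import Enum


class AluaState(Enum):
    ACTIVE_OPTIMIZED = "AO"        # path to the LUN's owning controller
    ACTIVE_NON_OPTIMIZED = "ANO"   # path through a non-owning controller
    UNAVAILABLE = "unavailable"    # path is down


def select_paths(paths):
    """Return the sorted names of the paths to use for I/O.

    `paths` maps a path name to its AluaState. Optimized paths are
    preferred; non-optimized paths are used only when none remain.
    """
    optimized = sorted(
        name for name, state in paths.items()
        if state is AluaState.ACTIVE_OPTIMIZED
    )
    if optimized:
        return optimized
    return sorted(
        name for name, state in paths.items()
        if state is AluaState.ACTIVE_NON_OPTIMIZED
    )
```

For example, with one optimized path to site A and one non-optimized path to site B, only the site A path is selected until it becomes unavailable.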

VMware vSphere
vSphere version 6.x, 7.x, or 8.x
FC and iSCSI implementations on ONTAP 9.9.1 or later

Host access topology
vMSC solutions are classified into two distinct types of topologies, depending on how the vSphere hosts access the storage systems:
1.    Uniform host access, where the vSphere hosts at both sites are connected to the storage systems at both sites, and the LUN paths presented to the vSphere hosts are stretched across the sites.
2.    Non-uniform host access, where the vSphere hosts at each site are connected only to the local storage system, and the LUN paths presented are limited to the local site.
SMBC only supports uniform host access topology.

Network
Storage array-based replication is transported over a TCP/IP network, with a maximum round-trip time (RTT) latency of 10 ms between the source and destination storage systems.
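
A minimal Python sketch of how measured latency could be compared against this limit is shown below. It assumes you have already collected RTT samples in milliseconds between the two storage systems by some other means; the function name `summarize_rtt` and the result keys are hypothetical.

```python
from statistics import mean

MAX_RTT_MS = 10.0  # maximum supported round-trip time stated above


def summarize_rtt(samples_ms, limit_ms=MAX_RTT_MS):
    """Summarize RTT samples and compare the worst case to the limit.

    The worst sample is used rather than the average, because the
    requirement is stated as a maximum.
    """
    if not samples_ms:
        raise ValueError("at least one RTT sample is required")
    worst = max(samples_ms)
    return {
        "worst_ms": worst,
        "average_ms": mean(samples_ms),
        "within_limit": worst <= limit_ms,
    }
```

A link that averages well under 10 ms but shows occasional spikes above it would be flagged by this check and is worth investigating before deployment.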

ONTAP cluster configuration
Ensure that the source and destination clusters are configured properly. For more details, refer to "Confirm the ONTAP cluster configuration".
Protection for business continuity
Protection for business continuity involves creating a data protection relationship between two ONTAP storage systems and adding the LUNs specific to the application to the consistency group, known as a protection group. 

Note: LUNs must reside within the same storage virtual machine (SVM).
In ONTAP System Manager:

1.    Navigate to Protection > Overview > Protect for Business Continuity > Protect LUNs, or to Storage > Consistency Groups, select <cg name>, and click Protect.
2.    Select one or more LUNs to protect on the source cluster.
3.    Select the destination cluster and SVM.
4.    Initialize relationship is selected by default. Click Save to begin protection.
5.    In ONTAP System Manager on the destination cluster, go to Protection > Relationships and verify that the protection for business continuity relationship is "In sync."
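
Before selecting LUNs in step 2, it can help to confirm the same-SVM requirement noted above. The Python sketch below is a hypothetical helper that works on a hand-built list of (SVM name, LUN path) pairs; it does not query ONTAP, and the names used are examples only.

```python
def common_svm(luns):
    """Return the single SVM that owns every LUN, or raise ValueError.

    `luns` is an iterable of (svm_name, lun_path) pairs, for example
    ("svm1", "/vol/vol1/lun1").
    """
    svms = {svm_name for svm_name, _lun_path in luns}
    if not svms:
        raise ValueError("no LUNs were given")
    if len(svms) > 1:
        found = ", ".join(sorted(svms))
        raise ValueError(f"LUNs span more than one SVM: {found}")
    return svms.pop()
```

If the helper raises an error, the LUNs cannot be placed in one protected consistency group as described in this article.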

Solution Overview
NetApp SnapMirror® Business Continuity (SMBC) leverages an advanced asymmetric active-active storage architecture to enhance the inherent high availability and non-disruptive operations of NetApp hardware and ONTAP storage software. This powerful solution provides an additional layer of protection for your entire storage and host environment, ensuring uninterrupted application availability.
SMBC seamlessly maintains application availability regardless of your environment's composition, including stand-alone servers, high-availability server clusters, or virtualized servers. It protects against storage outages, array shutdowns, power or cooling failures, network connectivity issues, or operational errors. By combining storage array-based clustering with synchronous replication, SMBC ensures continuous availability even during a disaster.
With its asymmetric active-active storage architecture, SMBC delivers continuous availability and zero data loss at a cost-effective price. Managing the array-based cluster becomes simpler as it eliminates the dependencies and complexities associated with host-based clustering. By immediately duplicating mission-critical data, SMBC provides uninterrupted access to your applications and data. It seamlessly integrates with your host environment, offering continuous data availability without the need for complex failover scripts.
In the example shown in Figure 2, two workloads are protected, each in a separate failure domain: the peach-colored workload (CG) on site A and the gold-colored workload (CG) on site B. The peach paths are ALUA active/optimized (AO) for the peach workload while it is running on site A. Conversely, the gold paths are AO for the gold workload while it is running on site B. The green paths are ALUA active/non-optimized (ANO) for both workloads.

Figure 2) SMBC and vMSC (asymmetric active-active)
The key features of SMBC are:
•    Protection of SAN applications (iSCSI or FC) across two separate failure domains, ensuring robust business continuity.
•    Application-focused granularity with intuitive workflows, non-disruptive instantaneous clones, and the flexibility of individual application DR tests.
•    Transparent application failover of business-critical applications such as Oracle, Microsoft SQL Server, and VMware vSphere Metro Storage Cluster (vMSC).
•    Consistency groups (CGs) that ensure dependent write order for a collection of volumes containing application data.
•    Tight integration with ONTAP, which leverages robust NetApp technologies to create a highly scalable, enterprise-level data protection solution.
•    Simplified data management for storage provisioning, host connections, and creation of snapshots and clones for both sites.
•    Enhanced business continuity by using SnapMirror Asynchronous (SM-A) to create a copy of data at a third failure domain, ensuring long-distance DR and backup support.

vMSC test scenarios
This table outlines vMSC test scenarios:

| Scenario | NetApp Controller Behavior | VMware HA Behavior |
| --- | --- | --- |
| Controller single path failure. | Controller path failover occurs through the established ALUA paths. LUNs and volumes remain connected. | No impact. |
| ESXi single storage path failure. | No impact on LUN and volume availability. ESXi storage path fails over to the alternative path. All sessions remain active. | No impact. |
| Site 1 storage failure. | Transparent Application Failover of storage to Site 2. | Site 1 ESXi nodes will incur higher latency due to RTT in accessing storage at Site 2. |
| Complete Site 1 failure, including ESXi and controller. | Transparent Application Failover of storage to Site 2. | Virtual machines on failed Site 1 ESXi nodes fail. HA restarts failed virtual machines on ESXi hosts on Site 2. |
| Complete Site 2 failure, including ESXi and controller. | No impact to controllers in Site 1, which continue to serve client I/O, unaffected by the Site 2 failure. | Virtual machines on failed Site 2 ESXi nodes fail. HA restarts failed virtual machines on ESXi hosts on Site 1. |
| Single ESXi node failure (shutdown). | No impact. Controllers continue to function normally. | Virtual machines on the failed ESXi node fail. HA restarts failed virtual machines on surviving ESXi nodes. |
| Multiple ESXi hosts become isolated due to a management network failure. | No impact. Controllers continue to function normally. | If the HA host isolation response is set to Leave Powered On, virtual machines at each site continue to run because the storage heartbeat is still active. A new cluster master will be selected within the network partition. |
| ESXi management network becomes partitioned between sites. | No impact to controllers. LUNs and volumes remain available. | If the HA host isolation response is set to Leave Powered On, virtual machines at each site continue to run because the storage heartbeat is still active. Partitioned hosts on the site that does not have a cluster master will elect a new cluster master. |
| Storage and management network ISL failure results in complete datacenter partition. | No impact to controllers. LUNs and volumes remain available from the primary path. When the storage links are back online, the volumes will automatically perform a delta resync. | No impact. |
| ONTAP System Manager - Management Server failure. | No impact. Controllers continue to function normally. NetApp controllers can be managed using the command line. | No impact. |
| vCenter Server failure. | No impact. Controllers continue to function normally. | No impact on HA. However, the DRS rules cannot be applied. |
| Mediator failure. | No impact. Controllers continue to function normally. | No impact. |
| Mediator + primary site storage failure. | Disruption to storage access for LUNs and volumes with primary path at failed site. Secondary site will be isolated. | ESXi nodes utilizing primary paths at failed site will experience service disruption. |
| Mediator + secondary site storage failure. | Primary site will be isolated. LUNs and volumes with primary path will continue to serve I/O. | No impact. |