Protection and Recovery Limits of SRM 4.x to 5.5.x in a Shared Recovery Site (N:1) Configuration (2008061)

Purpose

A standard VMware vCenter Site Recovery Manager (SRM) configuration has one protection site and one recovery site, with one vCenter Server instance and one SRM Server instance on each site. A shared recovery site (N:1) configuration has multiple protection sites that all recover virtual machines to a single, shared recovery site. In an N:1 configuration, each protection site has its own vCenter Server and SRM Server instances. The recovery site in an N:1 configuration has one shared vCenter Server instance and multiple SRM Server instances that are all registered as extensions to the same shared vCenter Server instance. If you use vSphere Replication, the recovery site has one shared vSphere Replication management server. You can connect a maximum of 10 protected sites to a shared recovery site.

This article provides information about the scalability limits for an N:1 configuration with SRM 4.x to 5.5.x, for both array-based replication and vSphere Replication. 

Using Array-based Replication in an N:1 Configuration

You can use array-based replication to perform recovery and reprotect in an N:1 configuration with all SRM 5.x releases. Performing recovery and reprotect with array-based replication in an N:1 configuration is subject to the same protection and recovery limits as for a standard 1:1 configuration. See Operational Limits for SRM and vSphere Replication (2034768).

Using vSphere Replication in an N:1 Configuration

In an N:1 configuration with vSphere Replication, the secondary vSphere Replication management server is shared across the different SRM Server pairs. You can use vSphere Replication to perform recovery and reprotect in an N:1 configuration with certain limitations.
     
  • Reprotect with vSphere Replication is not supported in SRM 5.0.x. You cannot perform reprotect by using vSphere Replication with SRM 5.0.x in either a 1:1 or an N:1 configuration. Reprotect with vSphere Replication is supported in SRM 5.1 and later, for both 1:1 and N:1 configurations. 
  • In vSphere Replication 1.0.x and 5.1, the vSphere Replication management server cannot handle concurrent recoveries or reprotects from more than three sites. Additional recoveries can result in an operation timeout error in SRM, with the error message Operation timed out: -1 seconds. To avoid operation timeout errors, avoid running concurrent recoveries or reprotects for more than 3 sites. If you do run concurrent recoveries for more than 3 sites and you encounter an operation timeout error, rerun the failed recovery or reprotect operation. This issue has been fixed in vSphere Replication 5.1.1 and vSphere Replication 5.5.
  • In vSphere Replication 1.0.x and 5.1, due to restrictions in the number of database connections, the vSphere Replication management server cannot handle concurrent recovery or reprotect operations for more than 80 virtual machines (80 LROs). If you run concurrent recovery or reprotect operations on more than 80 virtual machines, the vSphere Replication management server can encounter a deadlock while waiting for a free database connection. This causes operations to time out in SRM. It sometimes happens during recovery, but it always happens during reprotect if the number of LROs exceeds 80. This issue has been fixed in vSphere Replication 5.1.1 and vSphere Replication 5.5.
  • SRM handles the recovery or reprotect of a virtual machine as a long running operation (LRO). Each SRM Server instance on the recovery site throttles the number of LROs that it can send simultaneously to the vSphere Replication management server to a maximum of 40. This applies to all releases of vSphere Replication.
  • In SRM and vSphere Replication 5.5, test recovery, recovery, and reprotect operations can fail in a shared recovery site configuration if the vSphere Replication server experiences a heavy load, resulting in the error The connection to the remote server is down. When using SRM and vSphere Replication 5.5, do not perform concurrent operations on more than 200 virtual machines in total, with a maximum of 20 virtual machines per protected site.
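
The SRM and vSphere Replication 5.5 guidance in the last bullet can be checked before starting a batch of operations. The following Python sketch is illustrative only: the function name within_vr55_limits and its parameters are hypothetical and not part of any VMware tool, and the limits are simply the values of 200 and 20 stated above.

```python
def within_vr55_limits(vms_per_site, total_limit=200, per_site_limit=20):
    """Return True if a planned set of concurrent operations stays within
    the SRM / vSphere Replication 5.5 guidance from this article.

    vms_per_site: number of virtual machines to operate on concurrently,
                  one entry per protected site (hypothetical input format).
    """
    total = sum(vms_per_site)
    # Both conditions must hold: the overall total and the per-site maximum.
    return total <= total_limit and all(n <= per_site_limit for n in vms_per_site)


# Ten protected sites with 20 virtual machines each: 200 in total.
print(within_vr55_limits([20] * 10))
# One site plans 21 virtual machines, which is over the per-site maximum.
print(within_vr55_limits([21, 5]))
```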

How SRM Throttles Concurrent Long Running Operations

Due to the throttling of the number of LRO requests that SRM Server sends to the vSphere Replication management server, the key factor is not the number of simultaneous recovery or reprotect operations that you start for each SRM site pair. The key factor is the total number of concurrent recovery or reprotect operations (LROs) that all SRM site pairs send to the vSphere Replication management server on the recovery site.

Example 1: Excessive LROs

In an N:1 configuration with vSphere Replication 1.0.x or 5.1, you can have more than 3 sites, but you can only start recovery or reprotect operations from 3 sites simultaneously.                                                

SRM Site Pairs | Number of Virtual Machines to Recover or Reprotect Simultaneously | Number of Concurrent LRO Requests that SRM Server Sends to vSphere Replication Management Server
SRM Site Pair 1 | 100 | 40 (SRM throttles 100 LRO requests down to 40)
SRM Site Pair 2 | 20 | 20
SRM Site Pair 3 | 0 | 0
SRM Site Pair 4 | 0 | 0
SRM Site Pair 5 | 45 | 40 (SRM throttles 45 LRO requests down to 40)
TOTALS | 165 | 100

In Example 1, the number of recovery or reprotect operations that are started simultaneously is 165. However, the total number of concurrent LRO requests that the SRM Server instances send to the vSphere Replication management server is 100. This total exceeds the limit of 80 concurrent recovery or reprotect operations (80 LROs) that the vSphere Replication management server can handle in vSphere Replication 1.0.x and 5.1.

The SRM Server logs on the recovery site show messages that indicate that the connection to the vSphere Replication management server is down. For example:

2011-09-15T09:51:14.062-07:00 [03988 error 'LocalHMS'] Response handler '167319' handling unexpected error: (dr.fault.ConnectionDownFault)

In the SRM interface, the recovery or reprotect of virtual machines fails with an error:

The connection to the remote server is down.

The vSphere Replication management server logs contain the message:

2011-09-15 16:47:10.756 TRACE  hms.monitor.threads [hms-thread-monitor-thread-0]  (..hms.util.ThreadPoolMonitorScheduler) |Thread pool state  [hms-vlsi-server]: { max: 100, active: 100, queued: 2 }

Example 2: Successful LROs

In an N:1 configuration with vSphere Replication 1.0.x or 5.1, you can start more than 80 simultaneous recovery or reprotect operations, as long as the total number of LRO requests that the SRM Server instances send to the vSphere Replication management server after throttling does not exceed 80.

SRM Site Pairs | Number of Virtual Machines to Recover or Reprotect Simultaneously | Number of Concurrent LRO Requests that SRM Server Sends to vSphere Replication Management Server
SRM Site Pair 1 | 100 | 40 (SRM throttles 100 LRO requests down to 40)
SRM Site Pair 2 | 20 | 20
SRM Site Pair 3 | 0 | 0
SRM Site Pair 4 | 0 | 0
SRM Site Pair 5 | 15 | 15
TOTALS | 135 | 75

In Example 2, the number of recovery or reprotect operations that are started simultaneously is 135. However, the total number of concurrent LRO requests that the SRM Server instances send to the vSphere Replication management server is 75. This total does not exceed the limit of 80 concurrent recovery or reprotect operations (80 LROs) that the vSphere Replication management server can handle simultaneously. The requests for recovery or reprotect operations succeed, even though 135 operations are started simultaneously.

Resolution

To avoid operation timeout errors in SRM caused by a deadlocked vSphere Replication management server in vSphere Replication 1.0.x and 5.1, do not perform recovery or reprotect operations on more than 80 virtual machines concurrently (80 LROs).

To determine whether the vSphere Replication management server in an N:1 configuration exceeds the limit of 80 LROs, calculate the total number of simultaneous recovery or reprotect operations (T) from all SRM site pairs:

T = Sum [ min(40, N[i]) ]
In this formula, N[i] is the number of virtual machines to recover or reprotect simultaneously on the i-th SRM site pair, and min is the function that returns the smallest of the numbers given to it:
min(40, N[i]) = 40 if N[i] >= 40
min(40, N[i]) = N[i] if N[i] < 40
You can use this formula to calculate T with the data from Example 2:
T = min(40, N[1]) +
    min(40, N[2]) +
    min(40, N[3]) +
    min(40, N[4]) +
    min(40, N[5])
 
T = 40 + 20 + 0 + 0 + 15 = 75
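
The same calculation can be sketched in Python and applied to both examples. This is a minimal illustration, not a VMware utility: the function name total_concurrent_lros and the constant names are assumptions introduced here, and the values 40 and 80 are the throttle and limit stated in this article for vSphere Replication 1.0.x and 5.1.

```python
SRM_LRO_THROTTLE = 40  # maximum LROs each SRM Server sends at once (from this article)
VRMS_LRO_LIMIT = 80    # concurrent LRO limit in vSphere Replication 1.0.x and 5.1


def total_concurrent_lros(vms_per_site_pair, throttle=SRM_LRO_THROTTLE):
    """Compute T = Sum[min(throttle, N[i])] over all SRM site pairs."""
    return sum(min(throttle, n) for n in vms_per_site_pair)


# N[i] values taken from the two example tables in this article.
example_1 = [100, 20, 0, 0, 45]
example_2 = [100, 20, 0, 0, 15]

for name, counts in (("Example 1", example_1), ("Example 2", example_2)):
    t = total_concurrent_lros(counts)
    status = "exceeds" if t > VRMS_LRO_LIMIT else "is within"
    print(f"{name}: T = {t}, which {status} the limit of {VRMS_LRO_LIMIT} LROs")
```

For Example 1 the sketch gives T = 100, which is over the limit, and for Example 2 it gives T = 75, which is within the limit, matching the totals in the tables.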

If you encounter a timeout caused by a deadlocked vSphere Replication management server on the secondary site, restart the vSphere Replication management server.

If a recovery plan fails due to an overloaded vSphere Replication management server, you can rerun the plan. SRM retries the virtual machines that failed. The virtual machines that have already been recovered are left untouched and continue running. This workaround is available only for real recoveries. If a test recovery fails, you cannot rerun the test from the point of failure. You must clean up the test and then start it again.

You can also change the database settings to allow more database connections from the vSphere Replication management server, up to a maximum of 500.

Reconfiguring vSphere Replication Management Server Database Connection Settings in SRM 5.1

In SRM 5.1, you can reconfigure the hms-max-db-connections and the vPostgres max_connections settings to allow more database connections. This is not necessary in SRM 5.1.1 and later or in SRM 5.5.

You calculate the required settings as follows:
     
  • hms-max-db-connections = number of parallel reprotect operations (maximum of 40 LRO per SRM site) * 2 + 10 
  • vPostgres max_connections = at least hms-max-db-connections + 1, to allow administrative access to the database in addition to the connections from the vSphere Replication management server.
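
The two calculations above can be sketched in Python as follows. The function name db_connection_settings is hypothetical, and the cap of 500 is the maximum number of database connections stated in this article. The sketch only computes candidate values and does not modify any configuration file.

```python
MAX_DB_CONNECTIONS = 500  # maximum stated in this article


def db_connection_settings(parallel_lros):
    """Return (hms-max-db-connections, vPostgres max_connections) for a
    given number of parallel reprotect operations (LROs)."""
    hms_max = parallel_lros * 2 + 10
    if hms_max > MAX_DB_CONNECTIONS:
        raise ValueError(
            f"{hms_max} connections exceeds the maximum of {MAX_DB_CONNECTIONS}"
        )
    # One extra connection is reserved for administrative access.
    postgres_max = hms_max + 1
    return hms_max, postgres_max


# 80 parallel LROs, for example two SRM sites each throttled to 40.
print(db_connection_settings(80))
# 245 parallel LROs reaches the maximum of 500 connections.
print(db_connection_settings(245))
```

With 245 parallel LROs the result is 500 and 501, which are the values used in the maximum-value procedure.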
For example, to set the number of database connections to the maximum value of 500, perform the following steps:
     
  1. Log in to the vSphere Replication management server virtual machine. 
  2. Stop the vSphere Replication management server: /etc/init.d/hms stop 
  3. Open /opt/vmware/hms/conf/hms-configuration.xml in a text editor and change the hms-db-max-connections setting from 99 to 500. 
  4. Open /var/lib/vrmsdb/postgresql.conf in a text editor and change the max_connections setting from 100 to 501. 
  5. Restart the embedded vPostgres database: /etc/init.d/hms-vpostgres stop, then /etc/init.d/hms-vpostgres start 
  6. Start the vSphere Replication management server: /etc/init.d/hms start 
  7. Rerun the failed operation. 

NOTE: If you increase the number of database connections, the limits on the numbers of replications and on performing concurrent recovery or reprotect operations from multiple sites still apply. You can configure 500 replications per vSphere Replication appliance and perform concurrent recovery or reprotect operations from a maximum of 3 sites.

Reconfiguring vSphere Replication Management Server Database Connection Settings in SRM 5.0.x

In SRM 5.0.x, you can reconfigure the hibernate.c3p0.max_size setting to allow more database connections. You calculate the required setting as follows:

hibernate.c3p0.max_size = number of parallel recovery operations (maximum of 40 LRO per SRM site) * 2 + 10
For example, to set the number of database connections to the maximum value of 500, perform the following steps:
     
  1. Log in to the vSphere Replication management server virtual machine. 
  2. Stop the vSphere Replication management server: /etc/init.d/hms stop 
  3. Copy /opt/vmware/hms/libs/hms.jar to another location. 
  4. Open the archive by using a ZIP file utility. 
  5. Open META-INF/persistence.xml file in an editor. 
  6. Change the hibernate.c3p0.max_size setting from 40 to 500. 
  7. Rebuild the updated hms.jar and copy it into /opt/vmware/hms/libs on the appliance. 
  8. Start the vSphere Replication management server: /etc/init.d/hms start 
  9. Rerun the failed operation. 

NOTE: If you increase the number of database connections, the limits on the number of replications and on performing concurrent recovery operations from multiple sites still apply. You can configure 500 replications per vSphere Replication appliance and perform concurrent recovery operations from a maximum of 3 sites.
