Scale up Migration Concurrency

To improve concurrent migration scalability, resources on the HCX Connector and Cloud Manager must be increased as described below:
Baseline Migration Concurrency: Supports 300 migrations (Bulk & RAV) per HCX Manager.

| vCPU | RAM (GB) | Disk Size (GB) | Tuning |
|---|---|---|---|
| 4 | 12 | 60 | N/A |
Extended Migration Concurrency: Supports 600 migrations (Bulk & RAV) per HCX Manager.

| vCPU | RAM (GB) | Disk Size (GB) | Tuning |
|---|---|---|---|
| 32 | 48 | 300 | Y |
Increase resources on the HCX Connector/Cloud Manager

Use the following procedure to increase resource allocation on both the HCX Connector and HCX Cloud Manager VMs.
Requirements and Considerations before increasing resources on the HCX Connector & Cloud Manager
- Do NOT exceed recommended allocations as that may cause the HCX Connector/Cloud Manager to malfunction.
- Both HCX Cloud Manager and Connector must be running version HCX 4.7.0 or later.
- There should be NO active migration or configuration workflows when making these resource changes.
- Changes must be made during a scheduled Maintenance Window.
- There is NO impact to Network Extension services.
- There is NO change of concurrency for HCX vMotion/Cold Migration workflow.
- The concurrent migration limit specified for HCX Replication Assisted vMotion (RAV) applies ONLY to the initial and delta sync. During the RAV switchover stage, relocations are serviced one at a time, on a serial basis.
- Additional service meshes/IX appliances should be deployed for distinct workload clusters to aggregate the replication capacity of multiple IX appliances. A separate Service Mesh can be deployed for each workload cluster at the source and/or target.
- If there are multiple service meshes/IX appliances, RAV switchovers can proceed in parallel; however, per Service Mesh/IX pair, switchover is always sequential.
Procedure
IMPORTANT: It is recommended to take snapshots of the HCX Connector and Cloud Manager VMs prior to executing these steps.
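If you prefer to script the snapshot step, a minimal sketch using the govc CLI is shown below (govc availability and the VM names "HCX-Connector" and "HCX-Cloud-Manager" are assumptions; any supported vSphere snapshot method works):
govc snapshot.create -vm HCX-Connector pre-scaleup
govc snapshot.create -vm HCX-Cloud-Manager pre-scaleup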
Step 1: Increase the vCPU and memory of the HCX Manager to 32 vCPU and 48 GB respectively.
- Log in to the vCenter Server that hosts the HCX Manager.
- Shut down the HCX Manager VM's guest OS using the vCenter UI.
- Edit the HCX Manager VM's settings to increase the vCPU and memory reservations.
- Power ON the HCX Manager VM.
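As an alternative to the vCenter UI, the same resize can be scripted with the govc CLI (a sketch; govc and the VM name "HCX-Manager" are assumptions, memory is given in MB, and the guest shutdown requires VMware Tools):
govc vm.power -s HCX-Manager
govc vm.change -vm HCX-Manager -c 32 -m 49152
govc vm.power -on HCX-Manager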
Step 2: Add a 300GB disk to HCX Connector & Cloud Manager.
IMPORTANT: The following steps can be used to add a 300GB disk to both HCX Managers. Refer to VMware Knowledge Base article 1003940 for adding a new virtual disk to an existing Linux virtual machine.
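For orientation only, a minimal sketch of the disk preparation covered by the KB, assuming the new disk enumerates as /dev/sdc (verify with lsblk before partitioning):
lsblk
# Identify the new 300GB disk; /dev/sdc is assumed below
parted /dev/sdc mklabel msdos
parted /dev/sdc mkpart primary ext3 0% 100%
mkfs.ext3 /dev/sdc1
# Format as ext3 to match the fstab entry used later in this step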
- Mount the created disk on the HCX Managers.
mkdir -p /common_ext
# Create the mount point if it does not already exist
mount /dev/sdc1 /common_ext
df -hT
# Check that /common_ext has been mounted and has the correct filesystem type
- Add an entry to "/etc/fstab" to ensure the mounted disk persists across reboots and HCX Manager upgrades (a quick validation sketch follows the note below).
vi /etc/fstab
/dev/sdc1 /common_ext ext3 rw,nosuid,nodev,exec,auto,nouser,async 1 2
Note: Use the Linux vi editor to modify the file.
1. Press "i" to enter insert mode.
2. Press the ESC key to return to normal mode.
3. Type ":q!" to exit the editor without saving the file.
4. Type ":wq!" to save the updated file and exit the editor.
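To confirm the new fstab entry parses correctly before relying on it across reboots, remount from fstab (a sketch, assuming the entry above has been saved):
umount /common_ext
mount -a
df -hT /common_ext
# An error from "mount -a" indicates a bad fstab entry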
Step 3: Stop HCX services as below:
# systemctl stop postgresdb
# systemctl stop zookeeper
# systemctl stop kafka
# systemctl stop app-engine
# systemctl stop web-engine
# systemctl stop appliance-management
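You can confirm all six services stopped cleanly before proceeding; a small sketch, where "inactive" is the expected output for each service:
for svc in postgresdb zookeeper kafka app-engine web-engine appliance-management; do
    echo -n "$svc: "; systemctl is-active "$svc"
done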
Step 4: Redirect existing contents under "kafka-db" and "postgres-db" to the newly created disk.
- Move directory "/common/kafka-db" to "/common/kafka-db.bak".
cd /common
mv kafka-db kafka-db.bak
- Create a new directory "/common_ext/kafka-db".
cd /common_ext
mkdir kafka-db
Note: The contents of the Kafka directory do not need to be copied; they are regenerated after the kafka/app-engine services restart.
- Change the ownership and permissions of this directory to match "/common/kafka-db.bak".
chmod 755 kafka-db
chown kafka:kafka kafka-db
- Make a soft link from "/common/kafka-db" to "/common_ext/kafka-db".
cd /common
ln -s /common_ext/kafka-db kafka-db
- Move directory "/common/postgres-db" to "/common/postgres-db.bak" as a backup.
cd /common
mv postgres-db postgres-db.bak
- Copy the contents of directory "/common/postgres-db.bak" to "/common_ext/postgres-db" and change the ownership to postgres.
Note: Use the "-R" option to change the ownership of "/common_ext/postgres-db" recursively, as below:
cp -r /common/postgres-db.bak /common_ext/postgres-db
chown -R postgres:postgres /common_ext/postgres-db
- Make a soft link from "/common/postgres-db" to "/common_ext/postgres-db".
cd /common
ln -s /common_ext/postgres-db postgres-db
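Before restarting services, it is worth confirming that both symlinks resolve and that ownership carried over (a sketch based on the paths created above):
ls -ld /common/kafka-db /common/postgres-db
# Both should appear as symlinks pointing into /common_ext
ls -ld /common_ext/kafka-db /common_ext/postgres-db
# Owners should be kafka and postgres respectively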
Step 5: Start HCX services as below:
# systemctl start postgresdb
# systemctl start zookeeper
# systemctl start kafka
# systemctl start app-engine
# systemctl start web-engine
# systemctl start appliance-management
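A quick way to verify that all services came back is shown below ("systemctl is-active" prints a state per unit and exits non-zero if any is not running):
systemctl is-active postgresdb zookeeper kafka app-engine web-engine appliance-management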
Performance Tuning on the HCX Manager

In addition to increasing HCX resources, you must perform the following tuning steps to scale concurrent migrations.
IMPORTANT: The steps performed in this procedure are not persisted after an HCX Manager upgrade.
Procedure

Step 6: Stop HCX services again.
Log in to the HCX Connector/Cloud Manager root console.
# systemctl stop postgresdb
# systemctl stop zookeeper
# systemctl stop kafka
# systemctl stop app-engine
# systemctl stop web-engine
# systemctl stop appliance-management
Step 7: Increase the Java memory allocation in the app-engine framework.
- Edit the "app-engine-start" file to increase the Java memory allocation and max perm size.
vi /etc/systemd/app-engine-start
JAVA_OPTS="-Xmx4096m -Xms4096m -XX:MaxPermSize=1024m ...
Step 8: Increase thread pooling for Mobility Migration services.
- Edit "MobilityMigrationService.zql" and "MobilityTransferService.zql" to increase thread numbers.
vi /opt/vmware/deploy/zookeeper/MobilityMigrationService.zql
"numberOfThreads": "50",
vi /opt/vmware/deploy/zookeeper/MobilityTransferService.zql
"numberOfThreads":50,
Step 9: Increase the message size limit for the Kafka framework.
- Edit "vchsApplication.zql" and update "kafkaMaxMessageSizeBytes" from "2097152" to "4194304".
vi /opt/vmware/deploy/zookeeper/vchsApplication.zql
"kafkaMaxMessageSizeBytes":4194304
- Edit "kafka server.properties" and update "message.max.bytes" from "2097152" to "4194304".
vi /etc/kafka/server.properties
message.max.bytes=4194304
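Before restarting services, the tuning edits from Steps 7-9 can be spot-checked in one pass (a sketch using the file paths above):
grep -n "Xmx\|MaxPermSize" /etc/systemd/app-engine-start
grep -n "numberOfThreads" /opt/vmware/deploy/zookeeper/MobilityMigrationService.zql /opt/vmware/deploy/zookeeper/MobilityTransferService.zql
grep -n "kafkaMaxMessageSizeBytes" /opt/vmware/deploy/zookeeper/vchsApplication.zql
grep -n "message.max.bytes" /etc/kafka/server.properties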
Step 10: Start HCX services.
# systemctl start postgresdb
# systemctl start zookeeper
# systemctl start kafka
# systemctl start app-engine
# systemctl start web-engine
# systemctl start appliance-management
Step 11: Verify that the following services are running on the HCX Connector/Cloud Manager:
admin@hcx [ ~ ]$ systemctl --type=service | grep "zoo\|kaf\|web\|app\|postgres"
app-engine.service loaded active running App-Engine
appliance-management.service loaded active running Appliance Management
kafka.service loaded active running Kafka
postgresdb.service loaded active running PostgresDB
web-engine.service loaded active running WebEngine
zookeeper.service loaded active running Zookeeper
IMPORTANT: If the HCX Manager fails to reboot, or any of the services listed above fails to start, revert the configuration changes immediately and ensure the system comes back online. Alternatively, the snapshots can be used to revert the above configuration in case of any failure while applying these steps.
Note: Reverting a snapshot won't restore the HCX Connector/Cloud Manager's compute resources (vCPU/memory). If needed, follow "Step 1" to restore the vCPU and memory of the HCX Manager to the baseline values of 4 vCPU and 12 GB respectively, per the baseline table above.
Recommendations for operating concurrent migrations at scale
- As a best practice, use vSphere Monitoring and Performance to monitor CPU utilization and memory usage on the HCX Connector and Cloud Manager.
- Do NOT exceed the recommended limits as that could cause system instability and failed migration workflows.
- In a scaled-up environment, when migration operations are being processed, expect CPU utilization to increase significantly for short periods of time, and there may be a temporary delay in the UI response for migration progress events.
- Limit the concurrency of MON operations on the target cloud when making configuration changes while active concurrent Bulk migrations into MON-enabled segments are in switchover.
- Follow the migration events and estimates in the HCX UI to identify any slowness that may be caused by the infrastructure or the network.
- Additionally, vSphere Replication status can be monitored from the source ESXi host. Refer to VMware Knowledge Base article 87028.
- If a source ESXi host is heavily loaded from a memory or I/O-rate perspective, replication performance will be affected. As a result, the Bulk/RAV workflow may take more time to complete the initial base sync even when there is no slowness in the underlying datapath.
Note: In such cases, the recommendation is to relocate the source VM's compute resources to another, less loaded ESXi host using native vCenter vMotion. This action won't impact the ongoing replication process and does not require any changes to the migration workflow.
- The Bulk/RAV migration workflow consists of multiple stages (initial/delta sync, offline sync, disk consolidation, data checksum, VM instantiation, etc.), most of which do not depend on the network infrastructure. Hence the time to complete a migration for any given VM, from start to finish, may vary with conditions; it is not a simple calculation based on the size of the VM and the assumed network bandwidth.