HCX : Bulk Migration operations and best practices
Article ID: 323663

Products

VMware HCX, VMware Cloud on AWS

Issue/Introduction


The bulk migration capability of HCX uses vSphere Replication (vSR) to migrate disk files while re-creating the VM on the destination HCX-enabled vSphere instance.

How it works

HCX Bulk Migration has the following stages:

1. Placeholder disk creation: At the target vCenter, placeholder disks are created on the specified datastore so that the source VM's disk data can be replicated into them.
Note: During the early stage of migration for a given VM, the empty (placeholder) disks are created by HCX on the target side, whereas the config files (cfg) are created by Host Based Replication (HBR).

2. Push LWD config: On the source and target IX appliances, lightweight delta (LWD) configuration rules are pushed so that the HBR server (also known as the VR server) inside the target IX appliance can accept and process vSphere Replication traffic and forward it to the target ESXi host via an NFC connection on port 902.
Note: In the case of a reverse bulk migration from cloud to on-premises, the HBR server is enabled on the on-premises IX appliance.
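As a hedged connectivity check (the host name is a placeholder), the NFC port on the target ESXi host can be spot-checked with a simple TCP probe from the IX appliance shell or another shell on the replication path, assuming nc is available there:

# Probe TCP port 902 (NFC) on the target ESXi host; -z tests the port without sending data
nc -z target-esxi.example.com 902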

3. Enable replication: At the source side, HCX signals vCenter to enable replication. The replication status can be verified with the following commands in the ESXi root shell:
vim-cmd vmsvc/getallvms
vim-cmd hbrsvc/vmreplica.getState <VM ID>
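For example, to resolve the numeric VM ID by name before querying its replication state (VM_NAME is a placeholder):

# Filter the inventory by VM name; the numeric VM ID is the first column
vim-cmd vmsvc/getallvms | grep VM_NAME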


4. Start Full sync/Base sync: Once replication is enabled on the source ESXi/HBR, a full sync event on the source vCenter starts replicating data to the target via the HCX-IX appliance.

5. RPO cycle/Delta Sync: After the initial base sync completes, an RPO cycle of 2 hours is set to perform delta syncs.
Note: Depending upon the data churn on the source disks, additional snapshots are created during the RPO cycle.
Note: After each RPO cycle, disk consolidation takes place and creates an "hbrdisk.RDID*.vmdk" file, called a replica instance vmdk, on the target datastore.
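A hedged way to observe the replica instance files created after each RPO cycle, assuming shell access to an ESXi host that mounts the target datastore (DATASTORE and VM_DIR are placeholders):

# Replica instance files appear as hbrdisk.RDID-*.vmdk in the VM's target directory
ls -lh /vmfs/volumes/DATASTORE/VM_DIR/ | grep hbrdisk.RDID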

6. CONT Replication: If the migration switchover is planned for a scheduled maintenance window, Delta Sync keeps running every 2 hours until the scheduled time is reached.
Note: The switchover schedule can be changed from the migration wizard at runtime.
Note: The user may force an immediate switchover by selecting the "Ignore failover window and start migration as soon as possible." option under the "Schedule failover window for migrations" tab; however, HCX will wait for any ongoing replication transfer to complete before transitioning to the cutover stage.

7. Switchover: After completion of the initial or full sync, an image is created on the target side and switchover is triggered automatically, unless a specific schedule is set as described above.
Note: The image is constituted from the VMX and NVRAM files, including the VMXF if applicable.

Switchover performs the following tasks in the backend:
a. Power off source VM: To perform the offline sync, HCX signals the source vCenter to power off the VM, which is required to stop further data churn on the source VM.
b. Offline Sync: The replica instance vmdk files are consolidated (deleted). This is a time-consuming process whose duration depends on the target vCenter infrastructure, cannot be predicted, and is mostly unrelated to the HCX bulk migration workflow.
c. Instantiate VM: After the offline sync completes successfully, the VMX/VMXF and NVRAM files are copied from the HBR config files to the target datastore and used to instantiate the VM.
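These backend tasks surface as events in hbrsrv.log on the target IX appliance (log locations are described in the log-analysis section below); a simple grep can follow them:

# Follow switchover progress: per-disk instance completion, image creation, and cfg copy events
grep -E "Instance complete|Creating image|Copying cfg" /var/log/vmware/hbrsrv.log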

8. Clean up: Upon successful instantiation of the VM on the target side, the migration workflow transitions into the clean-up workflow, which removes all instances and configurations corresponding to the migration and transfer of the given VM:
a. The LWD config is removed from both the source and target IX appliances.
b. A disable-replication task is performed on the source ESXi host for the given virtual machine.
c. The network is disconnected and the VM is renamed for backup at the source side.
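As a quick verification from the source ESXi shell that cleanup completed: once replication has been disabled, getState returns the "notConfigured" fault shown in the log-analysis section below (VM ID 852 is an example):

# After cleanup, this should report "Virtual machine is not configured for replication."
vim-cmd hbrsvc/vmreplica.getState 852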
 

Log analysis during bulk migration workflow

1. HCX: To monitor the events:
a. Go to the HCX source/target migration wizard >> Migration Management page (Mobility Groups) >> Migration >> Events

b. Go to the HCX source/target admin shell >> /common/log/admin/app.log
c. Go to the HCX IX appliance shell using ccli >> /var/log/vmware/hbrsrv.log
Note: You can also find /tmp/Fleet-appliances/<Service-Mesh>/<IX-Appliance>/var/log/vmware/hbrsrv.log in the HCX tech bundle.
Note: For a forward migration, look for hbrsrv.log from the target/cloud IX appliance.
  • Empty disks are created on target:
2022-02-17T15:16:18.152Z info hbrsrv[6AB2AD852700] [Originator@6876 sub=Host opID=hs-285dd448] Getting disk type for /vmfs/volumes/vsan:UUID/VM_UUID/VM_NAME

2022-02-17T15:16:18.202Z info hbrsrv[6AB2AD956700] [Originator@6876 sub=Host opID=hs-4ac4bf6f] Getting disk type for /vmfs/volumes/vsan:UUID/VM_UUID/VM_NAME
  • Disk RDIDs obtained from the source ESXi HBR:
2022-02-17T15:17:18.919Z info hbrsrv[6AB2AD956700] [Originator@6876 sub=Delta] Configured disks for group VRID-XXXXXX:
2022-02-17T15:17:18.919Z info hbrsrv[6AB2AD956700] [Originator@6876 sub=Delta]         RDID-XXXXX
2022-02-17T15:17:18.919Z info hbrsrv[6AB2AD956700] [Originator@6876 sub=Delta]         RDID-XXXXXX
  • Indication of Full Sync completion:
2022-02-17T15:17:33.881Z info hbrsrv[6AB2ADA9B700] [Originator@6876 sub=Delta opID=hsl-10579a55] Full sync complete for disk RDID-XXXXXXX (198057984 bytes transferred, 209715200 bytes checksummed)
 
2022-02-17T15:17:55.078Z info hbrsrv[6AB2AD8D4700] [Originator@6876 sub=Delta opID=hsl-1057c4a8] Full sync complete for disk RDID-XXXXXXX (827564032 bytes transferred, 838860800 bytes checksummed)
  • Image creation
2022-02-17T15:20:04.403Z info hbrsrv[6AB2AD9D8700] [Originator@6876 sub=Delta opID=hsl-1057c4bc] Instance complete for disk RDID-XXXXXXX

2022-02-17T15:20:04.738Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=Delta opID=hsl-1057c4ee] Instance complete for disk RDID-XXXXXXX

2022-02-17T15:20:14.508Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Image opID=hs-4f3c1b62:hs-d5da:hs-4252] Creating image from group VRID-XXXXXXXX, instance 49, in XXXXXXX
  • Creation of replica disks
2022-02-17T15:20:14.526Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Host opID=hs-4f3c1b62:hs-d5da:hs-4252] Getting disk type for /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk

2022-02-17T15:20:14.822Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Host opID=hs-4f3c1b62:hs-d5da:hs-4252] Getting disk type for /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk
  • VMX/VMXF and NVRAM file download event
2022-02-17T15:20:15.123Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Image opID=hs-4f3c1b62:hs-d5da:hs-4252] Copying cfg /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk.vmx.137 to /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk.vmx

2022-02-17T15:20:15.410Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Image opID=hs-4f3c1b62:hs-d5da:hs-4252] Copying cfg /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk.vmxf.138 to /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk.vmxf

2022-02-17T15:20:15.430Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Image opID=hs-4f3c1b62:hs-d5da:hs-4252] Copying cfg /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk.nvram.139 to /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk.nvram
  • Replica disks consolidation/Deletion
2022-02-17T15:20:32.891Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The disk '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk' (key=186) was cleaned up successfully.
 
2022-02-17T15:20:33.004Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The disk '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmdk' (key=187) was cleaned up successfully.
 
  • Hbrcfg file deletion
2022-02-17T15:20:33.148Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The file '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmx.137' (key=189) was cleaned up successfully.
 
2022-02-17T15:20:33.220Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The file '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.vmxf.138' (key=190) was cleaned up successfully.
 
2022-02-17T15:20:33.291Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The file '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-XXXXXXXX.nvram.139' (key=191) was cleaned up successfully.
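To trace a single VM's migration end to end in hbrsrv.log, filtering on its group and disk identifiers is usually sufficient (the VRID/RDID values below are placeholders, taken from the "Configured disks for group" event above):

# Collect every hbrsrv event for one replication group and its disks
grep -E "VRID-XXXXXX|RDID-XXXXX" /var/log/vmware/hbrsrv.log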

2. ESXi HBR: Run the vim-cmd CLIs to verify the status of replication.
  • Get all VM details and track specific VM ID
[root@cia-vmc-esx-015:~] vim-cmd vmsvc/getallvms
852    VM_Name    rhel6_64Guest    vmx-14
  • Get replication state for VM ID 852 from source ESX/HBR
[root@cia-vmc-esx-015:~] vim-cmd hbrsvc/vmreplica.getState 852
Retrieve VM running replication state:
(vim.fault.ReplicationVmFault) {
   faultCause = (vmodl.MethodFault) null, 
   faultMessage = <unset>, 
   reason = "notConfigured", 
   state = <unset>, 
   instanceId = <unset>, 
   vm = 'vim.VirtualMachine:852'
   msg = "Received SOAP response fault from [<cs p:000000081ef64380, TCP:localhost:8307>]: getGroupState
vSphere Replication operation error: Virtual machine is not configured for replication."
}
[root@cia-vmc-esx-015:~] vim-cmd hbrsvc/vmreplica.getState 852
Retrieve VM running replication state:
	The VM is configured for replication. Current replication state: Group: VRID-XXXXX (generation=32459820918756983)
	Group State: full sync (74% done: checksummed 614 MB of 1000 MB, transferred 569.3 MB of 593.8 MB)
		DiskID RDID-XXXXXX State: full sync (checksummed 414 MB of 800 MB, transferred 380.4 MB of 404.9 MB)
		DiskID RDID-XXXXXX State: inactive
[root@cia-vmc-esx-015:~] vim-cmd hbrsvc/vmreplica.getState 852
Retrieve VM running replication state:
	The VM is configured for replication. Current replication state: Group: VRID-XXXXX (generation=32459820918756983)
	Group State: lwd delta (instanceId=replica-XXXXXXXX) (0% done: transferred 0 bytes of 40 KB)
		DiskID RDID-XXXXXXX State: lwd delta (transferred 0 bytes of 40 KB)
		DiskID RDID-XXXXXXX State: lwd delta (transferred 0 bytes of 0 bytes)
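A minimal polling sketch, assuming the ESXi busybox shell, to follow full-sync progress for VM ID 852 (the 60-second interval is arbitrary):

# Print the replication state every 60 seconds; stop with Ctrl+C
while true; do vim-cmd hbrsvc/vmreplica.getState 852; sleep 60; done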

3. Target vCenter: Check the VM's datastore file location and verify the status of the hbrdisk & hbrcfg files.
  • Placeholder disks are created by HCX.
  • Config files (nvram, vmx & vmxf) are created by HBR.
  • There may be multiple instances of images created on the target datastore, depending upon snapshots on the source VM, but those will be downloaded and consolidated during VM instantiation.
  • After successful consolidation of the disks, only the original images and vmdk files are retained on the target datastore.
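As a hedged alternative to the datastore browser, the same files can be listed from the shell of an ESXi host that mounts the target datastore (DATASTORE and VM_DIR are placeholders):

# List placeholder disks, replica instances (hbrdisk.RDID*), and HBR config/nvram files
ls -lh /vmfs/volumes/DATASTORE/VM_DIR/ | grep -E "hbrdisk|vmx|nvram"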

Resolution

Migration Best Practices

  • Use Bulk migration with "Seed Checkpoint" enabled, which is available from the HCX 4.1.0 release onwards.
Note: In the event that the bulk migration workflow fails and rolls back at the cutover stage, the workflow will try to reuse the seed data already copied in the previous attempt when the migrations are rescheduled.
Note: The recommendation is to not perform the cleanup operation on the failed job, as that will lead to the removal of the seed data.
  • Quiesce the VM before scheduling the migration to minimize data churn.
  • Migrate the VM by itself so all infrastructure resources are dedicated to that single workflow.
  • Ensure there is sufficient space in the target datastore; up to 20% extra space may be used temporarily during the migration (see the space-check sketch after this list).
  • Follow the migration events and estimates on the HCX UI to detect any slowness that may be caused by the infrastructure or the network.
  • Additionally, vSphere Replication status can be monitored from the source ESXi host as described in earlier sections.
  • If a source ESXi host is heavily occupied from a memory or I/O-rate perspective, replication performance will be affected. As a result, the Bulk migration workflow may take more time to complete the initial base sync even when there is no slowness in the underlying datapath.
Note: In such cases, the recommendation is to relocate the source VM's compute resources to another, less-loaded ESXi host using vCenter vMotion. This action won't impact the ongoing replication process and does not require any changes to the migration workflow.
  • The switchover window should be overestimated to accommodate the lengthy data checksum process and the instantiation of the VM on the target.
  • Co-location with other VMs must be planned accordingly to accommodate the expected downtime in services, so the migration workflow can be committed to completion.
  • In certain cases, the Bulk migration workflow may take more time to complete the switchover stage when an extremely large VM is being migrated using HCX.
  • Do not restart the app/web engine on the source or target HCX Manager during the course of an ongoing migration, as it may impact the migration workflow.
  • Do not power off the source VM manually after completion of the initial base sync, as it may impact the offline sync workflow.
  • In the event that a VM takes longer than expected or cannot be shut down gracefully from the guest OS, the recommendation is for the customer to enable "Force Power Off" when scheduling the migrations. Refer to KB 86026.
Alternatively:
  • The VM can be migrated using Cold migration to ensure completion, despite the service disruption.
  • In case Bulk migration fails, the recommendation is for the customer to use DR instead and ensure the protection is completed before manually triggering recovery to bring up the VM instance on the target site.
  • DR Protection Recovery should be used only as a last resort, to fail over all required VMs once they are protected.
Note: DR Protection Recovery is a more manual and lengthy process, but it has a higher chance of success given any infrastructure and network limitations.
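As a sketch for the space check called out earlier in this list (DATASTORE is a placeholder), free capacity on the target datastore can be confirmed from an ESXi shell:

# Verify the target datastore has roughly 20% free headroom before scheduling the migration
df -h /vmfs/volumes/DATASTORE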

IMPORTANT: A migration cannot be guaranteed under ANY circumstances; therefore, these and other considerations must be taken to maximize the chances of a successful migration by minimizing the impact of infrastructure and network limitations.