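What is live migration
Virtual machine migration falls into two broad categories: offline migration and live migration.
Offline migration pauses the virtual machine before migrating, copies its state to the destination host, re-establishes the running state there, and resumes execution. This technique suits scenarios where service availability requirements are low.
Live migration copies a virtual machine from one physical host to another while the services inside it keep running normally. The whole process is transparent to users, who do not notice that the virtual machine has changed location.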
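Basic flow of QEMU VM live migration
QEMU VM live migration means using the migrate command on the QEMU monitor command line on the host to move a given virtual machine from one host to another.
Live migration still involves a brief period of downtime, and the goal is to keep that downtime as short as possible.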
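As a rough illustration of driving the migrate command programmatically, the sketch below builds the QMP (JSON) messages a management tool might send to the source QEMU. The helper name `qmp_migrate_commands`, the destination address, and the numeric values are placeholders chosen for this example; it only constructs the messages and does not talk to a real QEMU instance.

```python
import json


def qmp_migrate_commands(dest_uri, max_bandwidth_bytes, downtime_limit_ms):
    """Build the QMP messages for a bandwidth-limited live migration.

    Hypothetical helper: it only returns the messages; sending them over
    the QMP socket is left to the caller.
    """
    return [
        # QMP requires capability negotiation before any other command.
        {"execute": "qmp_capabilities"},
        # max-bandwidth is in bytes per second, downtime-limit in milliseconds.
        {"execute": "migrate-set-parameters",
         "arguments": {"max-bandwidth": max_bandwidth_bytes,
                       "downtime-limit": downtime_limit_ms}},
        # Start the migration towards the destination QEMU.
        {"execute": "migrate", "arguments": {"uri": dest_uri}},
        # Poll migration status and statistics.
        {"execute": "query-migrate"},
    ]


# Placeholder destination; the destination QEMU is assumed to have been
# started with a matching -incoming option.
for message in qmp_migrate_commands("tcp:192.0.2.10:4444", 128 * 1024 * 1024, 300):
    print(json.dumps(message))
```

The downtime limit is the knob that expresses how short the final pause should be, while the bandwidth limit caps how much network the memory transfer may use.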
The basic flow is as follows:
- Setup
  - Start guest on destination, connect, enable dirty page logging and more
- Transfer Memory
  - Guest continues to run
  - Bandwidth limitation (controlled by the user)
  - First transfer the whole memory
  - Iteratively transfer all dirty pages (pages that were written to by the guest)
- Stop the guest
  - Sync VM image(s) (guest’s hard drives)
- Transfer State
  - As fast as possible (no bandwidth limitation)
  - All VM devices’ state and dirty pages yet to be transferred
- Continue the guest
  - On destination upon success, broadcast “I’m over here” Ethernet packet to announce new location of NIC(s)
  - On source upon failure (with one exception)
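The "Transfer Memory" step converges only if the guest dirties memory more slowly than the link can send it. The following is a minimal simulation of that loop under simplifying assumptions: constant dirty rate, constant bandwidth, and everything counted in pages. The function name and parameters are invented for this sketch and are not QEMU code.

```python
def simulate_precopy(total_pages, dirty_pages_per_s, bandwidth_pages_per_s,
                     downtime_limit_ms, max_rounds=30):
    """Return (live_rounds, pages_left_for_stop_phase, converged)."""
    to_send = total_pages  # first pass: transfer the whole memory
    rounds = 0

    def fits_in_downtime(pages):
        # Time to send `pages` (in ms) must not exceed the downtime budget.
        return pages * 1000 <= bandwidth_pages_per_s * downtime_limit_ms

    while rounds < max_rounds:
        if fits_in_downtime(to_send):
            return rounds, to_send, True
        # While this round is being sent, the running guest dirties more pages:
        # dirty_rate * (to_send / bandwidth), capped at the size of RAM.
        to_send = min(total_pages,
                      dirty_pages_per_s * to_send // bandwidth_pages_per_s)
        rounds += 1
    return rounds, to_send, fits_in_downtime(to_send)


# Dirty rate well below bandwidth: each round shrinks by a factor of 5.
print(simulate_precopy(1_000_000, 20_000, 100_000, 300))
# Dirty rate equal to bandwidth: the remaining set never shrinks.
print(simulate_precopy(1_000_000, 100_000, 100_000, 300, max_rounds=5))
```

In the first call the remaining set drops from 1,000,000 to 200,000, 40,000, and then 8,000 pages, which fits the 300 ms budget, so the guest can be stopped. In the second call the loop gives up after `max_rounds` without converging.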
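Implementation of QEMU VM live migration
The key stages of live migration are stages 2, 3, and 5 in the figure above. The corresponding implementation in QEMU is as follows: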
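1. Mark all of the virtual machine's RAM pages as dirty. Main function: ram_save_setup
2. Iteratively send the virtual machine's dirty pages to dst until some condition is met, for example the number of dirty pages becoming small enough. Main function: ram_save_iterate
3. Stop the guest on src, send the remaining dirty pages to dst, and then send the device state. Main function: qemu_savevm_state_complete_precopy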
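Here, 1 and 2 correspond to the gray area in the figure above, and 3 corresponds to the gray area plus the area to its left.
After that, QEMU can continue running the guest on dst.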
Key implementation on the source side
migration_thread
  qemu_savevm_state_setup // mark all RAM pages as dirty
    ram_save_setup[save_setup]
      ram_init_all
        ram_init_bitmaps
          ram_list_init_bitmaps
            bitmap_new
            bitmap_set
  migration_iteration_run // iteratively copy dirty memory
    qemu_savevm_state_pending
      ram_save_pending[save_live_pending] // determine how many bytes still have to be transferred
    qemu_savevm_state_iterate
      ram_save_iterate[save_live_iterate] // send dirty pages to dst
        ram_find_and_save_block
          send // call the concrete transport, such as TCP or RDMA
    migration_completion
      vm_stop_force_state
      qemu_savevm_state_complete_precopy
        qemu_savevm_state_complete_precopy_iterable
          ram_save_complete[save_live_complete_precopy]
            ram_find_and_save_block // copy the last dirty memory
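To make the roles of the setup, pending, iterate, and complete stages in the call tree above more concrete, here is a toy model of dirty-page tracking. It is only a sketch: the class `RamSaver`, its method names, and the use of a Python set in place of a bitmap are assumptions made for illustration and do not mirror QEMU's data structures.

```python
PAGE_SIZE = 4096  # assumed page size for the byte accounting


class RamSaver:
    """Toy model of the source-side RAM save stages."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.dirty = set()  # stands in for the dirty bitmap
        self.sent = []      # stands in for the migration stream

    def setup(self):
        # Setup stage: mark every page dirty so the first pass sends all RAM.
        self.dirty = set(range(self.num_pages))

    def guest_write(self, page):
        # Dirty logging: a guest write marks the page for (re)transmission.
        self.dirty.add(page)

    def pending_bytes(self):
        # Pending stage: how many bytes still have to be transferred.
        return len(self.dirty) * PAGE_SIZE

    def iterate(self, max_pages):
        # Iterate stage: send up to max_pages dirty pages, lowest index first.
        batch = sorted(self.dirty)[:max_pages]
        for page in batch:
            self.dirty.discard(page)
            self.sent.append(page)
        return len(batch)

    def complete(self):
        # Completion stage: the guest is stopped, flush whatever is still dirty.
        return self.iterate(len(self.dirty))


saver = RamSaver(num_pages=8)
saver.setup()
saver.iterate(max_pages=8)   # first full pass while the guest runs
saver.guest_write(3)         # guest dirties two pages in the meantime
saver.guest_write(5)
print(saver.pending_bytes())
saver.complete()             # final pass after the guest is stopped
print(saver.sent)
```

Pages 3 and 5 appear twice in the stream, once from the first pass and once from the final pass, which is the cost pre-copy pays for keeping the guest running.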
Key implementation on the destination side
migration_incoming_process
  qemu_loadvm_state
    qemu_loadvm_section_start_full
      find_se // handle each section sent by the source
      vmstate_load
        ram_load[load_state] // copy the received data into the destination VM's memory
          ram_load_precopy
            qemu_get_buffer
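The destination side is essentially a dispatcher: it reads a section from the stream, looks up who owns it, and hands the payload to that owner's load callback. The sketch below shows that pattern with a wire format invented for this example (length-prefixed name and payload); it is not QEMU's actual stream format, and `pack_section`, `load_stream`, and `make_ram_loader` are hypothetical names.

```python
import io
import struct


def pack_section(name, payload):
    """Encode one section as: u8 name length, name, u32 payload length, payload."""
    raw = name.encode()
    return struct.pack(">B", len(raw)) + raw + struct.pack(">I", len(payload)) + payload


def load_stream(stream, handlers):
    """Read sections until EOF and dispatch each to its registered loader."""
    while True:
        head = stream.read(1)
        if not head:
            return
        name = stream.read(head[0]).decode()
        (size,) = struct.unpack(">I", stream.read(4))
        payload = stream.read(size)
        loader = handlers.get(name)  # plays the role of the section lookup
        if loader is None:
            raise ValueError(f"unknown section: {name}")
        loader(payload)              # plays the role of the load_state callback


def make_ram_loader(ram):
    """Return a loader that copies (u32 offset, bytes) payloads into `ram`."""
    def load_ram(payload):
        (offset,) = struct.unpack(">I", payload[:4])
        data = payload[4:]
        ram[offset:offset + len(data)] = data
    return load_ram


guest_ram = bytearray(8)
wire = pack_section("ram", struct.pack(">I", 2) + b"\xaa\xbb")
load_stream(io.BytesIO(wire), {"ram": make_ram_loader(guest_ram)})
print(guest_ram.hex())
```

Rejecting unknown sections rather than skipping them is a deliberate choice in this sketch: silently dropping state on the destination would leave the guest inconsistent.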
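QEMU VFIO device live migration
For a VFIO passthrough device, live migration mainly involves marking memory written by device-initiated DMA as dirty, stopping the device (quiescing its traffic), and resuming it.
Live migration relies on many callbacks, most of them defined through the SaveVMHandlers structure. The callbacks for memory are mostly in savevm_ram_handlers.
If the virtual machine has a VFIO passthrough device, the same callbacks have to be implemented for it as well. For VFIO devices they are in savevm_vfio_handlers.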
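The callback-table idea can be sketched as follows: each migratable component registers the same set of callbacks, and a generic loop drives all of them without knowing whether it is handling RAM or a passthrough device. The field names reuse the callback names shown in brackets in the call trees earlier in this post; the `Handlers` and `ToyDevice` classes and the `run_migration` loop are simplifications invented for this example, not QEMU's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Handlers:
    """Callback table, loosely modeled on the callback names in the call trees."""
    save_setup: Callable[[], None]
    save_live_pending: Callable[[], int]  # bytes still to send
    save_live_iterate: Callable[[], None]
    save_live_complete_precopy: Callable[[], None]


class ToyDevice:
    """Hypothetical component with some state to migrate."""

    def __init__(self, state_bytes, chunk):
        self.left = state_bytes
        self.chunk = chunk  # bytes sent per iterate call; must be > 0
        self.log: List[str] = []

    def handlers(self):
        return Handlers(
            save_setup=lambda: self.log.append("setup"),
            save_live_pending=lambda: self.left,
            save_live_iterate=self._iterate,
            save_live_complete_precopy=self._complete,
        )

    def _iterate(self):
        sent = min(self.chunk, self.left)
        self.left -= sent
        self.log.append(f"iterate:{sent}")

    def _complete(self):
        self.log.append(f"complete:{self.left}")
        self.left = 0


def run_migration(registered, threshold):
    for h in registered:
        h.save_setup()
    # Keep iterating while the guest runs and too much state is pending.
    while sum(h.save_live_pending() for h in registered) > threshold:
        for h in registered:
            h.save_live_iterate()
    # The guest would be stopped here; every component sends what is left.
    for h in registered:
        h.save_live_complete_precopy()


ram = ToyDevice(state_bytes=1000, chunk=400)
vfio = ToyDevice(state_bytes=100, chunk=50)
run_migration([ram.handlers(), vfio.handlers()], threshold=200)
print(ram.log)
print(vfio.log)
```

Because the loop only sees the table, adding a new device type means registering another set of callbacks rather than changing the migration core.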
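Actual VF live migration flow on the Intel E810 NIC
Reducing downtime with pre-copy
Because saving and restoring device state happens during the stop phase, pre-copy is also considered for device state in order to keep VM downtime as short as possible.
The official QEMU website has a dedicated description of how VFIO device migration is implemented inside QEMU.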
Patch sets related to VFIO device migration
The history of VFIO live migration can be traced through the following patch sets:
- vfio: Define device migration protocol v2
- vfio: VFIO migration support with vIOMMU
- Multifd: device state transfer support with VFIO consumer
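Future evolution
- Evolution of ARM architecture features, for example: FEAT_TLBIRANGE to reduce the number of TLBI operations, FEAT_BBM level 2, and hardware dirty tracking in the MMU/IOMMU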
References
- Live Migration of Virtual Machines
- Live Migrating QEMU-KVM Virtual Machines
- Paper notes: Live Migration of Virtual Machines, NSDI 2005
- An introduction to QEMU live migration
- Journey of advancing device migration for virtio PCI hardware devices
- Journey of advancing virtio live migration
- VFIO device migration
- A Perfect Solution for Live Migration with Pass-through Devices by Quan Xu
- VF live migration based on the E810 NIC
- VM live migration explained in detail
- Unleashing VFIO’s Potential: Code Refactoring and New Frontiers in Device Virtualization
- QEMU live migration device state transfer parallelization via multifd channels
- virtio-net: add support for SR-IOV emulation