Xen 4.6 Live Migration and Hotplugging Issues

Tim Evers
Hi,

I am trying to set up two Ubuntu 16.04 / Xen 4.6 machines to perform
live migration and CPU / memory hotplug. So far I have encountered several
catastrophic issues. They are so severe that I am thinking I might be on
the wrong track altogether.

Any input is highly appreciated!

The setup:

2 Dell M630 with Ubuntu 16.04 and Xen 4.6, 64bit Dom0 (node1 + node2)

2 DomUs, Debian Jessie 64bit PV and Debian Jessie 64bit HVM

Now create a PV DomU on node1 with 1 CPU core and 2 GB RAM and plenty of
room for hot-add / hotplug:

Config excerpt:

kernel       = "/home/xen/shared/boot/tests/vmlinuz-3.16.0-4-amd64"
ramdisk      = "/home/xen/shared/boot/tests/initrd.img-3.16.0-4-amd64"
maxmem       = 16384
memory       = 2048
maxvcpus     = 8
vcpus        = 1
cpus         = "18"

xm list:

root1823     97  2048     1     -b----      15.1

All is fine. Now migrate to node2. Immediately after the migration we see:

xm list:

root182      360 16384     1     -b----      10.5

So the DomU immediately ballooned up to its maxmem after the migration
and, even better, inside the DomU we see that all CPUs are suddenly
hotplugged (but not online, due to missing udev rules):

root@debian8:~# ls /sys/devices/system/cpu/ | grep cpu
cpu0
cpu1
cpu2
cpu3
cpu4
cpu5
cpu6
cpu7

So this is already not how it is supposed to be (DomU should look the
same before and after migration).
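
For what it is worth, the obvious workarounds would look something like the
untested sketch below ("debian8" stands in for the domain name, and the udev
rule is the commonly used auto-online form):

# On the (destination) Dom0: push the balloon target back to the configured 2048 MiB
xl mem-set debian8 2048m

# Inside the DomU: auto-online hot-added vCPUs via udev
cat > /etc/udev/rules.d/99-xen-cpu-online.rules <<'EOF'
SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1"
EOF
udevadm control --reload-rules

Of course, onlining is exactly what triggers the crash below, so the udev rule
only becomes useful once the underlying bug is gone.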

Now we take cpu1 online:

echo 1 > /sys/devices/system/cpu/cpu1/online

Result, as seen on the DomU's hvc console (attached from the Dom0):

[  373.360949] installing Xen timer for CPU 1
[  400.032003] BUG: soft lockup - CPU#0 stuck for 22s! [bash:733]
[  400.032003] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc evdev pcspkr x86_pkg_temp_thermal thermal_sys coretemp crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd autofs4 ext4 crc16 mbcache jbd2 crct10dif_pclmul crct10dif_common xen_netfront xen_blkfront crc32c_intel
[  400.032003] CPU: 0 PID: 733 Comm: bash Not tainted 3.16.0-4-amd64 #1 Debian 3.16.43-2+deb8u3
[  400.032003] task: ffff88000470e1d0 ti: ffff88006acec000 task.ti: ffff88006acec000
[  400.032003] RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
[  400.032003] RSP: e02b:ffff88006acefdd0  EFLAGS: 00000246
[  400.032003] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff810013aa
[  400.032003] RDX: ffff88007d640000 RSI: 0000000000000000 RDI: 0000000000000000
[  400.032003] RBP: ffff88006bcf6000 R08: ffff88007d03d5c8 R09: 0000000000000122
[  400.032003] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[  400.032003] R13: 000000000000cd60 R14: ffff88006d1dca20 R15: 000000000007d649
[  400.032003] FS:  00007fe4b215e700(0000) GS:ffff88007d600000(0000) knlGS:0000000000000000
[  400.032003] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  400.032003] CR2: 00000000016de6d0 CR3: 0000000004a67000 CR4: 0000000000042660
[  400.032003] Stack:
[  400.032003]  ffff88006acefb3e 0000000000000000 ffffffff81010dc1 0000000001323d35
[  400.032003]  0000000000000000 0000000000000000 0000000000000001 0000000000000001
[  400.032003]  ffff88006d1dca20 0000000000000000 ffffffff81068cac 000000306aceff3c
[  400.032003] Call Trace:
[  400.032003]  [<ffffffff81010dc1>] ? xen_cpu_up+0x211/0x500
[  400.032003]  [<ffffffff81068cac>] ? _cpu_up+0x12c/0x160
[  400.032003]  [<ffffffff81068d59>] ? cpu_up+0x79/0xa0
[  400.032003]  [<ffffffff8150b615>] ? cpu_subsys_online+0x35/0x80
[  400.032003]  [<ffffffff813a608d>] ? device_online+0x5d/0xa0
[  400.032003]  [<ffffffff813a6145>] ? online_store+0x75/0x80
[  400.032003]  [<ffffffff8121b56a>] ? kernfs_fop_write+0xda/0x150
[  400.032003]  [<ffffffff811aaf32>] ? vfs_write+0xb2/0x1f0
[  400.032003]  [<ffffffff811aba72>] ? SyS_write+0x42/0xa0
[  400.032003]  [<ffffffff8151a48d>] ? system_call_fast_compare_end+0x10/0x15
[  400.032003] Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc
cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00
0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc

The same happens on the HVM DomU, but always only _after_ live migration.
Hotplugging works flawlessly if done on the Dom0 where the DomU was
originally started.

Any idea what might be happening here? Has anyone managed to migrate and
afterwards hotplug a DomU?

Thanks

Tim

_______________________________________________
Xen-users mailing list
[hidden email]
https://lists.xen.org/xen-users

Re: Xen 4.6 Live Migration and Hotplugging Issues

Dongli Zhang
Regarding the CPU hotplug issue: I am able to reproduce it as well.

I think the lockup is due to the following code in xen_cpu_up()
(arch/x86/xen/smp.c), as it keeps spinning until the cpu_hotplug_state of
the new vcpu is CPU_ONLINE:

    while (cpu_report_state(cpu) != CPU_ONLINE)
        HYPERVISOR_sched_op(SCHEDOP_yield, NULL);

cpu_hotplug_state is set to CPU_ONLINE with cpu_set_state_online().
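
If the new vcpu never actually starts running after the migration, that loop
spins forever on the CPU doing the cpu_up(), which matches the soft lockup
above. A quick way to check from the Dom0 what the hypervisor thinks of the
vcpus (domain name is a placeholder, diagnostic only):

xl vcpu-list debian8    # per-vcpu state as the hypervisor sees it
xl dmesg | tail -n 50   # recent hypervisor console messages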

Have you tried the latest mainline Linux? As far as I remember, when I
tried the latest mainline Linux I got a warning related to blk-mq when
onlining a vcpu.

I am not sure if the patch below would help:

commit ae039001054b34c4a624539b32a8b6ff3403aaf9
Author: Ankur Arora <[hidden email]>
Date:   Fri Jun 2 17:06:02 2017 -0700

    xen/vcpu: Handle xen_vcpu_setup() failure at boot

    On PVH, PVHVM, at failure in the VCPUOP_register_vcpu_info hypercall
    we limit the number of cpus to MAX_VIRT_CPUS. However, if this
    failure had occurred for a cpu beyond MAX_VIRT_CPUS, we continue
    to function with > MAX_VIRT_CPUS.

    This leads to problems at the next save/restore cycle when there
    are > MAX_VIRT_CPUS threads going into stop_machine() but coming
    back up there's valid state for only the first MAX_VIRT_CPUS.

    This patch pulls the excess CPUs down via cpu_down().

    Reviewed-by: Boris Ostrovsky <[hidden email]>
    Signed-off-by: Ankur Arora <[hidden email]>
    Signed-off-by: Juergen Gross <[hidden email]>
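
To check whether a given kernel already contains that commit, something like
the following works in a mainline Linux git checkout (v4.14 is only an
example tag):

# list release tags that already contain the commit
git tag --contains ae039001054b34c4a624539b32a8b6ff3403aaf9
# or test one specific version
git merge-base --is-ancestor ae039001054b34c4a624539b32a8b6ff3403aaf9 v4.14 \
    && echo "commit included" || echo "commit not included"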

Dongli Zhang

On 10/31/2017 12:14 AM, Tim Evers wrote:

> Hi,
>
> I am trying to set up two Ubuntu 16.04 / Xen 4.6 Machines to perform live
> migration and CPU / memory hotplug. So far I encountered several catastrophic
> issues. They are so severe that I am thinking I might be on the wrong track
> alltogether.
>
> Any input is highly appreciated!
>
> The setup:
>
> 2 Dell M630 with Ubuntu 16.04 and Xen 4.6, 64bit Dom0 (node1 + node2)
>
> 2 Domus, Debian Jessie 64bit PV and Debian Jessie 64bit HVM
>
> Now create a PV Domu on node1 with 1 CPU Core and 2 GB RAM and plenty of room
> for hot-add / hotplug:
>
> Config excerpt:
>
> kernel       = "/home/xen/shared/boot/tests/vmlinuz-3.16.0-4-amd64"
> ramdisk      = "/home/xen/shared/boot/tests/initrd.img-3.16.0-4-amd64"
> maxmem       = 16384
> memory       = 2048
> maxvcpus     = 8
> vcpus        = 1
> cpus         = "18"
>
> xm list:
>
> root1823     97  2048     1     -b----      15.1
>
> All is fine. Now migrate to node2. Immediately after the migratiion we see:
>
> xm list:
>
> root182      360 16384     1     -b----      10.5
>
> So the DomU immediately ballooned to its maxmem after the migration, and even
> better, inside the Domu we see all CPUs are suddenly hotplugged (but not online
> due to missing udev rules):
>
> root@debian8:~# ls /sys/devices/system/cpu/ | grep cpu
> cpu0
> cpu1
> cpu2
> cpu3
> cpu4
> cpu5
> cpu6
> cpu7
>
> So this is already not how it is supposed to be (DomU should look the same
> before and after migration).
>
> Now we take cpu1 online:
>
> echo 1 > /sys/devices/system/cpu/cpu1/online
>
> Result as seen through hvc on the Dom0:
>
> [  373.360949] installing Xen timer for CPU 1
> [  400.032003] BUG: soft lockup - CPU#0 stuck for 22s! [bash:733]
> [  400.032003] Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs
> lockd fscache sunrpc evdev pcspkr x86_pkg_temp_thermal thermal_sys coretemp
> crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd
> autofs4 ext4 crc16 mbcache jbd2 crct10dif_pclmul crct10dif_common xen_netfront
> xen_blkfront crc32c_intel
> [  400.032003] CPU: 0 PID: 733 Comm: bash Not tainted 3.16.0-4-amd64 #1 Debian
> 3.16.43-2+deb8u3
> [  400.032003] task: ffff88000470e1d0 ti: ffff88006acec000 task.ti:
> ffff88006acec000
> [  400.032003] RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>]
> xen_hypercall_sched_op+0xa/0x20
> [  400.032003] RSP: e02b:ffff88006acefdd0  EFLAGS: 00000246
> [  400.032003] RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff810013aa
> [  400.032003] RDX: ffff88007d640000 RSI: 0000000000000000 RDI: 0000000000000000
> [  400.032003] RBP: ffff88006bcf6000 R08: ffff88007d03d5c8 R09: 0000000000000122
> [  400.032003] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> [  400.032003] R13: 000000000000cd60 R14: ffff88006d1dca20 R15: 000000000007d649
> [  400.032003] FS:  00007fe4b215e700(0000) GS:ffff88007d600000(0000)
> knlGS:0000000000000000
> [  400.032003] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  400.032003] CR2: 00000000016de6d0 CR3: 0000000004a67000 CR4: 0000000000042660
> [  400.032003] Stack:
> [  400.032003]  ffff88006acefb3e 0000000000000000 ffffffff81010dc1 0000000001323d35
> [  400.032003]  0000000000000000 0000000000000000 0000000000000001 0000000000000001
> [  400.032003]  ffff88006d1dca20 0000000000000000 ffffffff81068cac 000000306aceff3c
> [  400.032003] Call Trace:
> [  400.032003]  [<ffffffff81010dc1>] ? xen_cpu_up+0x211/0x500
> [  400.032003]  [<ffffffff81068cac>] ? _cpu_up+0x12c/0x160
> [  400.032003]  [<ffffffff81068d59>] ? cpu_up+0x79/0xa0
> [  400.032003]  [<ffffffff8150b615>] ? cpu_subsys_online+0x35/0x80
> [  400.032003]  [<ffffffff813a608d>] ? device_online+0x5d/0xa0
> [  400.032003]  [<ffffffff813a6145>] ? online_store+0x75/0x80
> [  400.032003]  [<ffffffff8121b56a>] ? kernfs_fop_write+0xda/0x150
> [  400.032003]  [<ffffffff811aaf32>] ? vfs_write+0xb2/0x1f0
> [  400.032003]  [<ffffffff811aba72>] ? SyS_write+0x42/0xa0
> [  400.032003]  [<ffffffff8151a48d>] ? system_call_fast_compare_end+0x10/0x15
> [  400.032003] Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc
> cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59
> c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
>
> The same happens on the HVM DomU but always only _after_ live migration.
> Hotplugging works flawlessly if done on the Dom0 where the DomU is started on.
>
> Any idea what might be happening here? Anyone who has managed to migrate and
> afterwards hotplug a DomU?
>
> Thanks
>
> Tim
>
> _______________________________________________
> Xen-users mailing list
> [hidden email]
> https://lists.xen.org/xen-users

_______________________________________________
Xen-users mailing list
[hidden email]
https://lists.xen.org/xen-users