high CPU stolen time after live migrate

high CPU stolen time after live migrate

Olivier Bonvalet
Hi,

With Xen 4.8.2, migrated domains report wrong statistics for stolen time.

For example:

root@laussor:/proc# cat stat
cpu  87957976 32463 6483874 361169794 3075806 0 937004 183086986113 0 0
cpu0 2828867 2434 226580 43754152 786651 0 42766 1778722573131 0 0
cpu1 2611496 3545 232529 42364047 594846 0 2441 1598822754769 0 0
cpu2 21958330 14923 1338283 40315990 424045 0 127421 1086605305497 0 0
cpu3 2725745 339 271471 42426113 319132 0 1408 41756798508 0 0
cpu4 21738600 3858 1099222 39528759 508996 0 682044 321900840424 0 0
cpu5 9543155 1668 1165194 53274011 231727 0 8130 1652751948052 0 0
cpu6 14223984 4014 970888 48826880 126790 0 67909 254194505219 0 0
cpu7 12327794 1680 1179702 50679764 83618 0 4883 827029885923 0 0
intr 7866132625 99868037 0 434116966 617 0 363800 0 94983174 425603035 1016 0 390335 0 0 122245733 369053953 1041 0 358301 0 0 77267482 422334777 1015 0 371084 0 0 96031342 218747945 993 0 331440 0 0 77622794 333046334 1139 0 444912 0 0 57536561 237035283 1126 0 407725 0 0 41162673 306795645 1196 0 431960 0 0 775 0 0 0 0 479 0 0 0 952 0 0 0 17046 30405 15526 94781 14025 3849 16848 4485 16352 1801 17611 2104 14016 1807 16868 2235 3680 0 0 0 0 0 0 0 110370 107659 101208 355716 107895 62726 105345 70387 112823 62653 103888 70642 109312 63799 103388 65918 0 0 0 5619 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 6178259009
btime 1506309567
processes 6457002
procs_running 2
procs_blocked 0
softirq 2893820956 8 574619811 808518311 848843354 0 0 12041195 205459383 0 444338894


root@laussor:/proc# cat /proc/uptime
652005.23 2631328.82


The values for "stolen time" in /proc/stat seem impossible with only 7 days of uptime.
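A quick plausibility check makes the mismatch concrete. The sketch below assumes USER_HZ is 100 (the usual value, but an assumption here; `sysconf(_SC_CLK_TCK)` reports the real one), and the function names are hypothetical. No single CPU can accumulate more ticks in any /proc/stat column than the uptime allows:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: /proc/stat counts in USER_HZ ticks and USER_HZ is 100. */
#define ASSUMED_USER_HZ 100

/* Upper bound on the ticks one CPU can accumulate since boot. */
uint64_t max_ticks_per_cpu(double uptime_seconds)
{
    return (uint64_t)(uptime_seconds * ASSUMED_USER_HZ);
}

/* A per-CPU steal counter above that bound cannot be genuine. */
bool steal_is_plausible(uint64_t steal_ticks, double uptime_seconds)
{
    return steal_ticks <= max_ticks_per_cpu(uptime_seconds);
}
```

With the uptime of 652005.23 seconds shown above, the bound is roughly 65 million ticks per CPU, while cpu0 reports 1778722573131 steal ticks, so `steal_is_plausible(1778722573131ULL, 652005.23)` is false.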


Is this a known bug?

thanks,

Olivier

_______________________________________________
Xen-users mailing list
[hidden email]
https://lists.xen.org/xen-users

Re: [Xen-devel] high CPU stolen time after live migrate

Dario Faggioli-2
On Mon, 2017-10-02 at 18:37 +0200, Olivier Bonvalet wrote:
> root@laussor:/proc# cat /proc/uptime
> 652005.23 2631328.82
>
>
> The values for "stolen time" in /proc/stat seem impossible with only 7
> days of uptime.
>
I think it could be this:
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/

What's the version of your guest kernel?

Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Re: [Xen-devel] high CPU stolen time after live migrate

Olivier Bonvalet
On Tuesday, 03 October 2017 at 11:22 +0200, Dario Faggioli wrote:
> What's the version of your guest kernel?

4.9.52, so yes, that seems to be it. I will try the patch and follow
this issue.

Thanks!

Olivier


Re: [Xen-devel] high CPU stolen time after live migrate

Dongli Zhang
In reply to this post by Olivier Bonvalet
Hi Dario and Olivier,

I have encountered this issue in the past. While the fix mentioned in the
link is effective, I assume it was derived from upstream Linux, and it will
introduce a new error, as described below.

While there is a bug in the guest kernel, I think the root cause is on
the hypervisor side.

In my own tests, the issue is reproducible even when migrating a VM locally
within the same dom0. Once the guest VM is migrated, the RUNSTATE_offline
time looks normal, while RUNSTATE_runnable moves backward and decreases.
Therefore, the value returned by paravirt_steal_clock() (actually
xen_steal_clock()), which is the sum of RUNSTATE_offline and
RUNSTATE_runnable, decreases as well. However, a kernel such as 4.8 cannot
handle this situation correctly, because the code in cputime.c is not
written specifically for the Xen hypervisor.
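To see why a backward-moving steal clock produces absurd values, here is a minimal user-space sketch (not kernel code; the function names and sample numbers are hypothetical). It models the steal clock as the sum of the two runstate times and computes the delta with unsigned arithmetic, as an unguarded `steal -= prev_steal_time` on a u64 would:

```c
#include <stdint.h>

/* Model of the steal clock: runnable time plus offline time (nanoseconds). */
uint64_t steal_clock_ns(uint64_t runnable_ns, uint64_t offline_ns)
{
    return runnable_ns + offline_ns;
}

/*
 * Unsigned delta between two steal clock readings. If the clock moved
 * backward, the subtraction wraps around modulo 2^64 and yields a huge
 * positive value instead of a small negative one.
 */
uint64_t unsigned_steal_delta(uint64_t now_ns, uint64_t prev_ns)
{
    return now_ns - prev_ns;
}
```

For example, if the runnable time drops from 300 to 100 while the offline time stays at 50, the delta is not -200 but 2^64 - 200.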

For a kernel like v4.8-rc8, would something like the following be better?

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a846cf8..3546e21 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -274,11 +274,17 @@ static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 	if (static_key_false(&paravirt_steal_enabled)) {
 		cputime_t steal_cputime;
 		u64 steal;
+		s64 steal_diff;
 
 		steal = paravirt_steal_clock(smp_processor_id());
-		steal -= this_rq()->prev_steal_time;
+		steal_diff = steal - this_rq()->prev_steal_time;
 
-		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
+		if (steal_diff < 0) {
+			this_rq()->prev_steal_time = steal;
+			return 0;
+		}
+
+		steal_cputime = min(nsecs_to_cputime(steal_diff), maxtime);
 		account_steal_time(steal_cputime);
 		this_rq()->prev_steal_time += cputime_to_nsecs(steal_cputime);
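The logic of the patch above can be tried outside the kernel with a small user-space model. This is only a sketch: `struct steal_state` and `account_steal_delta()` are hypothetical names, plain nanoseconds replace cputime_t, and the per-runqueue state is reduced to a single field.

```c
#include <stdint.h>

/* Stand-in for the prev_steal_time field kept per runqueue. */
struct steal_state {
    uint64_t prev_steal_ns;
};

/*
 * Account the steal time elapsed since the last call, capped at max_ns.
 * If the steal clock moved backward (for example after a migration),
 * resynchronise the baseline and account nothing.
 */
uint64_t account_steal_delta(struct steal_state *st,
                             uint64_t steal_now_ns, uint64_t max_ns)
{
    int64_t diff = (int64_t)(steal_now_ns - st->prev_steal_ns);
    uint64_t accounted;

    if (diff < 0) {
        st->prev_steal_ns = steal_now_ns;
        return 0;
    }

    accounted = (uint64_t)diff < max_ns ? (uint64_t)diff : max_ns;
    st->prev_steal_ns += accounted;
    return accounted;
}
```

Starting from a baseline of 1000 ns, a reading of 1500 ns with a cap of 400 ns accounts 400 ns; a later reading of 200 ns accounts nothing and moves the baseline to 200 ns, so subsequent deltas are small and positive again.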


This issue does not seem to be completely fixed in the most recent upstream
Linux (I have tested with 4.12.0-rc7), although the symptom there is
different. After live migration, the steal counter no longer overflows (that
is, it does not become a very large unsigned number), but the steal counter
in /proc/stat still moves backward (e.g., from 329 to 311).

test@vm:~$ cat /proc/stat
cpu  248 0 240 31197 893 0 1 329 0 0
cpu0 248 0 240 31197 893 0 1 329 0 0
intr 39051 16307 0 0 0 0 0 990 127 592 1004 1360 40 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0
ctxt 59400
btime 1506731352
processes 1877
procs_running 1
procs_blocked 0
softirq 38903 0 15524 1227 6904 0 0 6 0 0 15242

After live migration, the steal counter in the Ubuntu guest running 4.12.0-rc7 had decreased to 311.

test@vm:~$ cat /proc/stat
cpu  251 0 242 31245 893 0 1 311 0 0
cpu0 251 0 242 31245 893 0 1 311 0 0
intr 39734 16404 0 0 0 0 0 1440 128 0 8 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0
ctxt 60880
btime 1506731352
processes 1882
procs_running 3
procs_blocked 0
softirq 39195 0 15618 1286 6958 0 0 7 0 0 15326
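To compare the two outputs programmatically, the steal value can be pulled out of a "cpu" line. The helper below is a hypothetical sketch; it relies on steal being the eighth numeric field (user, nice, system, idle, iowait, irq, softirq, steal), as documented in proc(5):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the steal column from one "cpu" or "cpuN" line of /proc/stat.
 * Returns false if the line is not a cpu line or has too few fields.
 */
bool parse_steal_ticks(const char *line, uint64_t *steal_out)
{
    const char *p = line;

    if (strncmp(line, "cpu", 3) != 0)
        return false;

    /* Skip the label ("cpu" or "cpuN"). */
    while (*p != '\0' && *p != ' ')
        p++;

    for (int field = 1; field <= 8; field++) {
        char *end;
        unsigned long long value = strtoull(p, &end, 10);

        if (end == p)
            return false;   /* ran out of numbers */
        if (field == 8) {
            *steal_out = value;
            return true;
        }
        p = end;
    }
    return false;
}
```

Applied to the two aggregate "cpu" lines above, it yields 329 before the migration and 311 after it.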

I assume this is not expected behavior. A different patch to upstream Linux
(similar to the one I suggested above) would fix this issue.

---------------------------------------------------------

Whatever fix is applied on the guest kernel side, I think the root cause
is that the Xen hypervisor returns a RUNSTATE_runnable time after live
migration that is lower than the one it returned before.

As I am not familiar enough with Xen scheduling, I do not understand why
the RUNSTATE_runnable time decreases after live migration.

Dongli Zhang




Re: [Xen-devel] high CPU stolen time after live migrate

Jan Beulich-2
>>> On 08.10.17 at 07:29, <[hidden email]> wrote:
> Whatever the fix would be applied to guest kernel side, I think the root cause
> is because xen hypervisor returns a RUNSTATE_runnable time less than the
> previous one before live migration.
>
> As I am not clear enough with xen scheduling, I do not understand why
> RUNSTATE_runnable cputime is decreased after live migration.

Isn't this simply because accounting starts from zero again in the
new (migrated) domain? If so, that's nothing that ought to change;
it would still be the guest kernel's responsibility to handle this
if it matters to it.
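If the accounting does restart from zero, one way a guest could keep its own view monotonic is to carry the last pre-migration reading forward as an offset. This is only an illustrative sketch with hypothetical names, not what the Linux kernel actually does, and it assumes any backward step means a restart from zero:

```c
#include <stdint.h>

/* Guest-side bookkeeping for a steal clock that may restart from zero. */
struct mono_steal {
    uint64_t last_raw_ns;   /* last raw value read from the hypervisor */
    uint64_t offset_ns;     /* total accumulated before the last restart */
};

/* Feed in a raw reading and get back a value that never decreases. */
uint64_t mono_steal_update(struct mono_steal *m, uint64_t raw_ns)
{
    if (raw_ns < m->last_raw_ns) {
        /* Counter restarted: keep everything seen so far. */
        m->offset_ns += m->last_raw_ns;
    }
    m->last_raw_ns = raw_ns;
    return m->offset_ns + raw_ns;
}
```

With raw readings of 329, then 5 after a restart, then 20, the reported values are 329, 334 and 349.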

Jan

