[xen master] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[xen master] xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.

patchbot
commit 6e7f6155b8ab5a513b549788d6818cacd56873e8
Author:     Dario Faggioli <[hidden email]>
AuthorDate: Wed Aug 30 12:06:21 2017 +0100
Commit:     George Dunlap <[hidden email]>
CommitDate: Wed Aug 30 12:06:21 2017 +0100

    xen: RCU/x86/ARM: discount CPUs that were idle when grace period started.
   
    Xen is a tickless (micro-)kernel, i.e., when a CPU becomes
    idle there is no timer tick that will periodically wake the
    CPU up.
    OTOH, when we imported RCU from Linux, Linux was (on x86) a
    ticking kernel, i.e., there was a periodic timer tick always
    running, even on idle CPUs. This was bad for power consumption,
    but, for instance, made it easy to monitor the quiescent states
    of all the CPUs, and hence tell when RCU grace periods ended.
   
    In Xen, that is impossible, and that's particularly problematic
    when the system is very lightly loaded, as some CPUs may never
    have the chance to tell the RCU core logic about their quiescence,
    and grace periods could extend indefinitely!
   
    This has led, on x86, to long (and unpredictable) delays between
    RCU callbacks queueing and their actual invokation. On ARM, we've
    even seen infinite grace periods (e.g., complate_domain_destroy()
    never being actually invoked!). See here:
   
     https://lists.xenproject.org/archives/html/xen-devel/2017-01/msg02454.html
   
    The first step for fixing this situation is for RCU to record,
    at the beginning of a grace period, which CPUs are already idle.
    In fact, being idle, they can't be in the middle of any read-side
    critical section, and we don't have to wait for their quiescence.
   
    This is tracked in a cpumask, in a similar way to how it was also
    done in Linux (on s390, which was tickless already). It is also
    basically the same approach used for making Linux x86 tickless,
    in 2.6.21 on (see commit 79bf2bb3 "tick-management: dyntick /
    highres functionality").
   
    For correctness, wee also add barriers. One is also present in
    Linux, (see commit c3f59023, "Fix RCU race in access of nohz_cpu_mask",
    although, we change the code comment to something that makes better
    sense for us). The other (which is its pair), is put in the newly
    introduced function rcu_idle_enter(), right after updating the
    cpumask. They prevent races between CPUs going idle during the
    beginning of a grace period.
   
    Signed-off-by: Dario Faggioli <[hidden email]>
    Reviewed-by: Tim Deegan <[hidden email]>
    Reviewed-by: George Dunlap <[hidden email]>
---
 xen/common/rcupdate.c      | 41 +++++++++++++++++++++++++++++++++++++++--
 xen/common/schedule.c      |  2 ++
 xen/include/xen/rcupdate.h |  3 +++
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c
index 8cc5a82..12ae7da 100644
--- a/xen/common/rcupdate.c
+++ b/xen/common/rcupdate.c
@@ -52,7 +52,8 @@ static struct rcu_ctrlblk {
     int  next_pending;  /* Is the next batch already waiting?         */
 
     spinlock_t  lock __cacheline_aligned;
-    cpumask_t   cpumask; /* CPUs that need to switch in order    */
+    cpumask_t   cpumask; /* CPUs that need to switch in order ... */
+    cpumask_t   idle_cpumask; /* ... unless they are already idle */
     /* for current batch to proceed.        */
 } __cacheline_aligned rcu_ctrlblk = {
     .cur = -300,
@@ -248,7 +249,16 @@ static void rcu_start_batch(struct rcu_ctrlblk *rcp)
         smp_wmb();
         rcp->cur++;
 
-        cpumask_copy(&rcp->cpumask, &cpu_online_map);
+       /*
+        * Make sure the increment of rcp->cur is visible so, even if a
+        * CPU that is about to go idle, is captured inside rcp->cpumask,
+        * rcu_pending() will return false, which then means cpu_quiet()
+        * will be invoked, before the CPU would actually enter idle.
+        *
+        * This barrier is paired with the one in rcu_idle_enter().
+        */
+        smp_mb();
+        cpumask_andnot(&rcp->cpumask, &cpu_online_map, &rcp->idle_cpumask);
     }
 }
 
@@ -474,7 +484,34 @@ static struct notifier_block cpu_nfb = {
 void __init rcu_init(void)
 {
     void *cpu = (void *)(long)smp_processor_id();
+
+    cpumask_clear(&rcu_ctrlblk.idle_cpumask);
     cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu);
     register_cpu_notifier(&cpu_nfb);
     open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
 }
+
+/*
+ * The CPU is becoming idle, so no more read side critical
+ * sections, and one more step toward grace period.
+ */
+void rcu_idle_enter(unsigned int cpu)
+{
+    ASSERT(!cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
+    cpumask_set_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+    /*
+     * If some other CPU is starting a new grace period, we'll notice that
+     * by seeing a new value in rcp->cur (different than our quiescbatch).
+     * That will force us all the way until cpu_quiet(), clearing our bit
+     * in rcp->cpumask, even in case we managed to get in there.
+     *
+     * Se the comment before cpumask_andnot() in  rcu_start_batch().
+     */
+    smp_mb();
+}
+
+void rcu_idle_exit(unsigned int cpu)
+{
+    ASSERT(cpumask_test_cpu(cpu, &rcu_ctrlblk.idle_cpumask));
+    cpumask_clear_cpu(cpu, &rcu_ctrlblk.idle_cpumask);
+}
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index e83f4c7..c6f4817 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -1903,6 +1903,7 @@ void sched_tick_suspend(void)
 
     sched = per_cpu(scheduler, cpu);
     SCHED_OP(sched, tick_suspend, cpu);
+    rcu_idle_enter(cpu);
 }
 
 void sched_tick_resume(void)
@@ -1910,6 +1911,7 @@ void sched_tick_resume(void)
     struct scheduler *sched;
     unsigned int cpu = smp_processor_id();
 
+    rcu_idle_exit(cpu);
     sched = per_cpu(scheduler, cpu);
     SCHED_OP(sched, tick_resume, cpu);
 }
diff --git a/xen/include/xen/rcupdate.h b/xen/include/xen/rcupdate.h
index 557a7b1..561ac43 100644
--- a/xen/include/xen/rcupdate.h
+++ b/xen/include/xen/rcupdate.h
@@ -146,4 +146,7 @@ void call_rcu(struct rcu_head *head,
 
 int rcu_barrier(void);
 
+void rcu_idle_enter(unsigned int cpu);
+void rcu_idle_exit(unsigned int cpu);
+
 #endif /* __XEN_RCUPDATE_H */
--
generated by git-patchbot for /home/xen/git/xen.git#master

_______________________________________________
Xen-changelog mailing list
[hidden email]
https://lists.xenproject.org/xen-changelog