Fix the occasional xen-blkfront deadlock when irqbalancing.


Fix the occasional xen-blkfront deadlock when irqbalancing.

Daniel Stodden-5
Hi.

Please pull upstream/xen/blkfront from
git://xenbits.xensource.com/people/dstodden/linux.git

Cheers,
Daniel



[PATCH] blkfront: Move blkif_interrupt into a tasklet.

Daniel Stodden-5
Response processing doesn't really belong into hard irq context.

Another potential problem this avoids is that switching interrupt cpu
affinity in Xen domains can presently lead to event loss, if
RING_FINAL_CHECK is run from hard irq context.

Signed-off-by: Daniel Stodden <[hidden email]>
Cc: Tom Kopec <[hidden email]>
---
 drivers/block/xen-blkfront.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 6c00538..75576d3 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -86,6 +86,7 @@ struct blkfront_info
  struct blkif_front_ring ring;
  struct scatterlist sg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
  unsigned int evtchn, irq;
+ struct tasklet_struct tasklet;
  struct request_queue *rq;
  struct work_struct work;
  struct gnttab_free_callback callback;
@@ -676,13 +677,14 @@ static void blkif_completion(struct blk_shadow *s)
  gnttab_end_foreign_access(s->req.seg[i].gref, 0, 0UL);
 }
 
-static irqreturn_t blkif_interrupt(int irq, void *dev_id)
+static void
+blkif_do_interrupt(unsigned long data)
 {
+ struct blkfront_info *info = (struct blkfront_info *)data;
  struct request *req;
  struct blkif_response *bret;
  RING_IDX i, rp;
  unsigned long flags;
- struct blkfront_info *info = (struct blkfront_info *)dev_id;
  int error;
 
  spin_lock_irqsave(&info->io_lock, flags);
@@ -743,6 +745,15 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
 
 out:
  spin_unlock_irqrestore(&info->io_lock, flags);
+}
+
+
+static irqreturn_t
+blkif_interrupt(int irq, void *dev_id)
+{
+ struct blkfront_info *info = (struct blkfront_info *)dev_id;
+
+ tasklet_schedule(&info->tasklet);
 
  return IRQ_HANDLED;
 }
@@ -893,6 +904,7 @@ static int blkfront_probe(struct xenbus_device *dev,
  info->connected = BLKIF_STATE_DISCONNECTED;
  INIT_WORK(&info->work, blkif_restart_queue);
  spin_lock_init(&info->io_lock);
+ tasklet_init(&info->tasklet, blkif_do_interrupt, (unsigned long)info);
 
  for (i = 0; i < BLK_RING_SIZE; i++)
  info->shadow[i].req.id = i+1;
--
1.7.0.4
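
For reference, the RING_FINAL_CHECK the description refers to is the
RING_FINAL_CHECK_FOR_RESPONSES() macro from the shared-ring headers
(include/xen/interface/io/ring.h). A condensed sketch of how the response
loop relies on it, paraphrased from the existing handler rather than copied
from the patch:

static void blkif_do_interrupt(unsigned long data)
{
        struct blkfront_info *info = (struct blkfront_info *)data;
        RING_IDX i, rp;
        unsigned long flags;
        int more_to_do;

        spin_lock_irqsave(&info->io_lock, flags);
again:
        rp = info->ring.sring->rsp_prod;
        rmb();          /* see the responses up to rp before reading them */

        for (i = info->ring.rsp_cons; i != rp; i++) {
                /* ... complete the request named by the shadow entry ... */
        }
        info->ring.rsp_cons = i;

        /*
         * Re-arm rsp_event and look once more.  A response landing after
         * the loop but before the re-arm raises no further event, so
         * skipping this final check could leave work stranded on the ring.
         */
        RING_FINAL_CHECK_FOR_RESPONSES(&info->ring, more_to_do);
        if (more_to_do)
                goto again;

        spin_unlock_irqrestore(&info->io_lock, flags);
}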



Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Daniel Stodden-5

This is upstream/xen/blkfront at
git://xenbits.xensource.com/people/dstodden/linux.git

Daniel

On Mon, 2010-08-23 at 02:54 -0400, Daniel Stodden wrote:

> Response processing doesn't really belong into hard irq context.
>
> Another potential problem this avoids is that switching interrupt cpu
> affinity in Xen domains can presently lead to event loss, if
> RING_FINAL_CHECK is run from hard irq context.
>
> Signed-off-by: Daniel Stodden <[hidden email]>
> Cc: Tom Kopec <[hidden email]>
> ---
>  drivers/block/xen-blkfront.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 6c00538..75576d3 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -86,6 +86,7 @@ struct blkfront_info
>   struct blkif_front_ring ring;
>   struct scatterlist sg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>   unsigned int evtchn, irq;
> + struct tasklet_struct tasklet;
>   struct request_queue *rq;
>   struct work_struct work;
>   struct gnttab_free_callback callback;
> @@ -676,13 +677,14 @@ static void blkif_completion(struct blk_shadow *s)
>   gnttab_end_foreign_access(s->req.seg[i].gref, 0, 0UL);
>  }
>  
> -static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> +static void
> +blkif_do_interrupt(unsigned long data)
>  {
> + struct blkfront_info *info = (struct blkfront_info *)data;
>   struct request *req;
>   struct blkif_response *bret;
>   RING_IDX i, rp;
>   unsigned long flags;
> - struct blkfront_info *info = (struct blkfront_info *)dev_id;
>   int error;
>  
>   spin_lock_irqsave(&info->io_lock, flags);
> @@ -743,6 +745,15 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>  
>  out:
>   spin_unlock_irqrestore(&info->io_lock, flags);
> +}
> +
> +
> +static irqreturn_t
> +blkif_interrupt(int irq, void *dev_id)
> +{
> + struct blkfront_info *info = (struct blkfront_info *)dev_id;
> +
> + tasklet_schedule(&info->tasklet);
>  
>   return IRQ_HANDLED;
>  }
> @@ -893,6 +904,7 @@ static int blkfront_probe(struct xenbus_device *dev,
>   info->connected = BLKIF_STATE_DISCONNECTED;
>   INIT_WORK(&info->work, blkif_restart_queue);
>   spin_lock_init(&info->io_lock);
> + tasklet_init(&info->tasklet, blkif_do_interrupt, (unsigned long)info);
>  
>   for (i = 0; i < BLK_RING_SIZE; i++)
>   info->shadow[i].req.id = i+1;




Re: Fix the occasional xen-blkfront deadlock when irqbalancing.

Jeremy Fitzhardinge
In reply to this post by Daniel Stodden-5
 On 08/22/2010 11:54 PM, Daniel Stodden wrote:
> Please pull upstream/xen/blkfront from
> git://xenbits.xensource.com/people/dstodden/linux.git

 I think this change is probably worthwhile on its own merits, but it
just papers over the irqbalancing problem.  I'd like to make sure that's
nailed down before pulling this patch.

Thanks,
    J


Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Jeremy Fitzhardinge
In reply to this post by Daniel Stodden-5
 On 08/22/2010 11:54 PM, Daniel Stodden wrote:
> Response processing doesn't really belong into hard irq context.
>
> Another potential problem this avoids is that switching interrupt cpu
> affinity in Xen domains can presently lead to event loss, if
> RING_FINAL_CHECK is run from hard irq context.

I just got this warning from a 32-bit pv domain.  I think it may relate
to this change.  The warning is

void blk_start_queue(struct request_queue *q)
{
        WARN_ON(!irqs_disabled());


Oddly, I only saw this pair once at boot, and after that the system
seemed fine...

[    4.376451] ------------[ cut here ]------------
[    4.377415] WARNING: at /home/jeremy/git/linux/block/blk-core.c:337 blk_start_queue+0x20/0x36()
[    4.377415] Modules linked in: xfs exportfs xen_blkfront [last unloaded: scsi_wait_scan]
[    4.377415] Pid: 0, comm: swapper Not tainted 2.6.32.21 #32
[    4.377415] Call Trace:
[    4.377415]  [<c1039f74>] warn_slowpath_common+0x65/0x7c
[    4.377415]  [<c11b3ae1>] ? blk_start_queue+0x20/0x36
[    4.377415]  [<c1039f98>] warn_slowpath_null+0xd/0x10
[    4.377415]  [<c11b3ae1>] blk_start_queue+0x20/0x36
[    4.377415]  [<edc74712>] kick_pending_request_queues+0x1c/0x2a [xen_blkfront]
[    4.377415]  [<edc74ec4>] blkif_do_interrupt+0x176/0x189 [xen_blkfront]
[    4.377415]  [<c103e063>] tasklet_action+0x63/0xa8
[    4.377415]  [<c103f2d5>] __do_softirq+0xac/0x152
[    4.377415]  [<c103f3ac>] do_softirq+0x31/0x3c
[    4.377415]  [<c103f484>] irq_exit+0x29/0x5c
[    4.377415]  [<c121a1b6>] xen_evtchn_do_upcall+0x29/0x34
[    4.377415]  [<c100a027>] xen_do_upcall+0x7/0xc
[    4.377415]  [<c10023a7>] ? hypercall_page+0x3a7/0x1005
[    4.377415]  [<c10065a9>] ? xen_safe_halt+0x12/0x1f
[    4.377415]  [<c10042cb>] xen_idle+0x27/0x38
[    4.377415]  [<c100877e>] cpu_idle+0x49/0x63
[    4.377415]  [<c14a6427>] rest_init+0x53/0x55
[    4.377415]  [<c179c814>] start_kernel+0x2d4/0x2d9
[    4.377415]  [<c179c0a8>] i386_start_kernel+0x97/0x9e
[    4.377415]  [<c179f478>] xen_start_kernel+0x576/0x57e
[    4.377415] ---[ end trace 0bfb98f0ed515cdb ]---
[    4.377415] ------------[ cut here ]------------
[    4.377415] WARNING: at /home/jeremy/git/linux/block/blk-core.c:245 blk_remove_plug+0x20/0x7e()
[    4.377415] Modules linked in: xfs exportfs xen_blkfront [last unloaded: scsi_wait_scan]
[    4.377415] Pid: 0, comm: swapper Tainted: G        W  2.6.32.21 #32
[    4.377415] Call Trace:
[    4.377415]  [<c1039f74>] warn_slowpath_common+0x65/0x7c
[    4.377415]  [<c11b3961>] ? blk_remove_plug+0x20/0x7e
[    4.377415]  [<c1039f98>] warn_slowpath_null+0xd/0x10
[    4.377415]  [<c11b3961>] blk_remove_plug+0x20/0x7e
[    4.377415]  [<c11b39ca>] __blk_run_queue+0xb/0x5e
[    4.377415]  [<c11b3af4>] blk_start_queue+0x33/0x36
[    4.377415]  [<edc74712>] kick_pending_request_queues+0x1c/0x2a [xen_blkfront]
[    4.377415]  [<edc74ec4>] blkif_do_interrupt+0x176/0x189 [xen_blkfront]
[    4.377415]  [<c103e063>] tasklet_action+0x63/0xa8
[    4.377415]  [<c103f2d5>] __do_softirq+0xac/0x152
[    4.377415]  [<c103f3ac>] do_softirq+0x31/0x3c
[    4.377415]  [<c103f484>] irq_exit+0x29/0x5c
[    4.377415]  [<c121a1b6>] xen_evtchn_do_upcall+0x29/0x34
[    4.377415]  [<c100a027>] xen_do_upcall+0x7/0xc
[    4.377415]  [<c10023a7>] ? hypercall_page+0x3a7/0x1005
[    4.377415]  [<c10065a9>] ? xen_safe_halt+0x12/0x1f
[    4.377415]  [<c10042cb>] xen_idle+0x27/0x38
[    4.377415]  [<c100877e>] cpu_idle+0x49/0x63
[    4.377415]  [<c14a6427>] rest_init+0x53/0x55
[    4.377415]  [<c179c814>] start_kernel+0x2d4/0x2d9
[    4.377415]  [<c179c0a8>] i386_start_kernel+0x97/0x9e
[    4.377415]  [<c179f478>] xen_start_kernel+0x576/0x57e
[    4.377415] ---[ end trace 0bfb98f0ed515cdc ]---

        J



Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Daniel Stodden-5
On Thu, 2010-09-02 at 18:46 -0400, Jeremy Fitzhardinge wrote:
> On 08/22/2010 11:54 PM, Daniel Stodden wrote:
> > Response processing doesn't really belong into hard irq context.
> >
> > Another potential problem this avoids is that switching interrupt cpu
> > affinity in Xen domains can presently lead to event loss, if
> > RING_FINAL_CHECK is run from hard irq context.
>
> I just got this warning from a 32-bit pv domain.  I think it may relate
> to this change.  The warning is

We clearly hold spin_lock_irqsave all the way through the blkif_do_interrupt frame.

It follows that something underneath is unconditionally re-enabling
interrupts again (?)

Either: Can you add a bunch of similar WARN_ONs along that path?

Or: This lock is quite coarse-grained. The lock only matters for queue
access, and we know irqs are enabled in tasklet context, so there is no
need for the flags. In fact we only need to spin_lock_irq around the
__blk_end_ calls and kick_pending_.
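
A minimal sketch of that narrower locking (illustrative only; the names
follow the driver's existing helpers, and the response decode is elided):

        /* tasklet context: irqs are known to be enabled, so plain
         * spin_lock_irq() suffices and the flags word goes away */
        for (i = info->ring.rsp_cons; i != rp; i++) {
                /* ... decode the response into req/error ... */
                spin_lock_irq(&info->io_lock);
                __blk_end_request_all(req, error);
                spin_unlock_irq(&info->io_lock);
        }

        spin_lock_irq(&info->io_lock);
        kick_pending_request_queues(info);
        spin_unlock_irq(&info->io_lock);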

But I don't immediately see what's to blame, so I'd be curious.

Daniel


> void blk_start_queue(struct request_queue *q)
> {
> WARN_ON(!irqs_disabled());


> Oddly, I only saw this pair once at boot, and after that the system
> seemed fine...
>
> [    4.376451] ------------[ cut here ]------------
> [    4.377415] WARNING: at /home/jeremy/git/linux/block/blk-core.c:337 blk_start_queue+0x20/0x36()
> [    4.377415] Modules linked in: xfs exportfs xen_blkfront [last unloaded: scsi_wait_scan]
> [    4.377415] Pid: 0, comm: swapper Not tainted 2.6.32.21 #32
> [    4.377415] Call Trace:
> [    4.377415]  [<c1039f74>] warn_slowpath_common+0x65/0x7c
> [    4.377415]  [<c11b3ae1>] ? blk_start_queue+0x20/0x36
> [    4.377415]  [<c1039f98>] warn_slowpath_null+0xd/0x10
> [    4.377415]  [<c11b3ae1>] blk_start_queue+0x20/0x36
> [    4.377415]  [<edc74712>] kick_pending_request_queues+0x1c/0x2a [xen_blkfront]
> [    4.377415]  [<edc74ec4>] blkif_do_interrupt+0x176/0x189 [xen_blkfront]
> [    4.377415]  [<c103e063>] tasklet_action+0x63/0xa8
> [    4.377415]  [<c103f2d5>] __do_softirq+0xac/0x152
> [    4.377415]  [<c103f3ac>] do_softirq+0x31/0x3c
> [    4.377415]  [<c103f484>] irq_exit+0x29/0x5c
> [    4.377415]  [<c121a1b6>] xen_evtchn_do_upcall+0x29/0x34
> [    4.377415]  [<c100a027>] xen_do_upcall+0x7/0xc
> [    4.377415]  [<c10023a7>] ? hypercall_page+0x3a7/0x1005
> [    4.377415]  [<c10065a9>] ? xen_safe_halt+0x12/0x1f
> [    4.377415]  [<c10042cb>] xen_idle+0x27/0x38
> [    4.377415]  [<c100877e>] cpu_idle+0x49/0x63
> [    4.377415]  [<c14a6427>] rest_init+0x53/0x55
> [    4.377415]  [<c179c814>] start_kernel+0x2d4/0x2d9
> [    4.377415]  [<c179c0a8>] i386_start_kernel+0x97/0x9e
> [    4.377415]  [<c179f478>] xen_start_kernel+0x576/0x57e
> [    4.377415] ---[ end trace 0bfb98f0ed515cdb ]---
> [    4.377415] ------------[ cut here ]------------
> [    4.377415] WARNING: at /home/jeremy/git/linux/block/blk-core.c:245 blk_remove_plug+0x20/0x7e()
> [    4.377415] Modules linked in: xfs exportfs xen_blkfront [last unloaded: scsi_wait_scan]
> [    4.377415] Pid: 0, comm: swapper Tainted: G        W  2.6.32.21 #32
> [    4.377415] Call Trace:
> [    4.377415]  [<c1039f74>] warn_slowpath_common+0x65/0x7c
> [    4.377415]  [<c11b3961>] ? blk_remove_plug+0x20/0x7e
> [    4.377415]  [<c1039f98>] warn_slowpath_null+0xd/0x10
> [    4.377415]  [<c11b3961>] blk_remove_plug+0x20/0x7e
> [    4.377415]  [<c11b39ca>] __blk_run_queue+0xb/0x5e
> [    4.377415]  [<c11b3af4>] blk_start_queue+0x33/0x36
> [    4.377415]  [<edc74712>] kick_pending_request_queues+0x1c/0x2a [xen_blkfront]
> [    4.377415]  [<edc74ec4>] blkif_do_interrupt+0x176/0x189 [xen_blkfront]
> [    4.377415]  [<c103e063>] tasklet_action+0x63/0xa8
> [    4.377415]  [<c103f2d5>] __do_softirq+0xac/0x152
> [    4.377415]  [<c103f3ac>] do_softirq+0x31/0x3c
> [    4.377415]  [<c103f484>] irq_exit+0x29/0x5c
> [    4.377415]  [<c121a1b6>] xen_evtchn_do_upcall+0x29/0x34
> [    4.377415]  [<c100a027>] xen_do_upcall+0x7/0xc
> [    4.377415]  [<c10023a7>] ? hypercall_page+0x3a7/0x1005
> [    4.377415]  [<c10065a9>] ? xen_safe_halt+0x12/0x1f
> [    4.377415]  [<c10042cb>] xen_idle+0x27/0x38
> [    4.377415]  [<c100877e>] cpu_idle+0x49/0x63
> [    4.377415]  [<c14a6427>] rest_init+0x53/0x55
> [    4.377415]  [<c179c814>] start_kernel+0x2d4/0x2d9
> [    4.377415]  [<c179c0a8>] i386_start_kernel+0x97/0x9e
> [    4.377415]  [<c179f478>] xen_start_kernel+0x576/0x57e
> [    4.377415] ---[ end trace 0bfb98f0ed515cdc ]---
>
> J
>




blktap lockdep hiccup

Jeremy Fitzhardinge
 On 09/03/2010 09:08 AM, Daniel Stodden wrote:

> On Thu, 2010-09-02 at 18:46 -0400, Jeremy Fitzhardinge wrote:
>> On 08/22/2010 11:54 PM, Daniel Stodden wrote:
>>> Response processing doesn't really belong into hard irq context.
>>>
>>> Another potential problem this avoids is that switching interrupt cpu
>>> affinity in Xen domains can presently lead to event loss, if
>>> RING_FINAL_CHECK is run from hard irq context.
>> I just got this warning from a 32-bit pv domain.  I think it may relate
>> to this change.  The warning is
> We clearly spin_lock_irqsave all through the blkif_do_interrupt frame.
>
> It follows that something underneath quite unconditionally chose to
> reenable them again (?)
>
> Either: Can you add a bunch of similar WARN_ONs along that path?
>
> Or: This lock is quite coarse-grained. The lock only matters for queue
> access, and we know irqs are reenabled, so no need for flags. In fact we
> only need to spin_lock_irq around the __blk_end_ calls and
> kick_pending_.
>
> But I don't immediately see what's to blame, so I'd be curious.

I haven't got around to investigating this in more detail yet, but
there's also this long-standing lockdep hiccup in blktap:

Starting auto Xen domains: lurch  alloc irq_desc for 1235 on node 0
  alloc kstat_irqs on node 0
block tda: sector-size: 512 capacity: 614400
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
Pid: 4266, comm: tapdisk2 Not tainted 2.6.32.21 #146
Call Trace:
 [<ffffffff8107f0a4>] __lock_acquire+0x1df/0x16e5
 [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff81010082>] ? check_events+0x12/0x20
 [<ffffffff810f0359>] ? apply_to_page_range+0x295/0x37d
 [<ffffffff81080677>] lock_acquire+0xcd/0xf1
 [<ffffffff810f0359>] ? apply_to_page_range+0x295/0x37d
 [<ffffffff810f0259>] ? apply_to_page_range+0x195/0x37d
 [<ffffffff81506f7d>] _spin_lock+0x31/0x40
 [<ffffffff810f0359>] ? apply_to_page_range+0x295/0x37d
 [<ffffffff810f0359>] apply_to_page_range+0x295/0x37d
 [<ffffffff812ab37c>] ? blktap_map_uaddr_fn+0x0/0x55
 [<ffffffff8100d0cf>] ? xen_make_pte+0x8a/0x8e
 [<ffffffff812ac34e>] blktap_device_process_request+0x43d/0x954
 [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff81010082>] ? check_events+0x12/0x20
 [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff81010082>] ? check_events+0x12/0x20
 [<ffffffff8107d687>] ? mark_held_locks+0x52/0x70
 [<ffffffff81506ddb>] ? _spin_unlock_irq+0x30/0x3c
 [<ffffffff8107d949>] ? trace_hardirqs_on_caller+0x125/0x150
 [<ffffffff812acba6>] blktap_device_run_queue+0x1c5/0x28f
 [<ffffffff812a0234>] ? unbind_from_irq+0x18/0x198
 [<ffffffff81010082>] ? check_events+0x12/0x20
 [<ffffffff812ab14d>] blktap_ring_poll+0x7c/0xc7
 [<ffffffff81124e9b>] do_select+0x387/0x584
 [<ffffffff81124b14>] ? do_select+0x0/0x584
 [<ffffffff811255de>] ? __pollwait+0x0/0xcc
 [<ffffffff811256aa>] ? pollwake+0x0/0x56
 [<ffffffff811256aa>] ? pollwake+0x0/0x56
 [<ffffffff811256aa>] ? pollwake+0x0/0x56
 [<ffffffff811256aa>] ? pollwake+0x0/0x56
 [<ffffffff8108059b>] ? __lock_acquire+0x16d6/0x16e5
 [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff81010082>] ? check_events+0x12/0x20
 [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff81010082>] ? check_events+0x12/0x20
 [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
 [<ffffffff811252a4>] core_sys_select+0x20c/0x2da
 [<ffffffff811250d6>] ? core_sys_select+0x3e/0x2da
 [<ffffffff81010082>] ? check_events+0x12/0x20
 [<ffffffff8101006f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81108661>] ? kmem_cache_free+0x18e/0x1c8
 [<ffffffff8141e912>] ? sock_destroy_inode+0x19/0x1b
 [<ffffffff811299bd>] ? destroy_inode+0x2f/0x44
 [<ffffffff8102ef22>] ? pvclock_clocksource_read+0x4b/0xa2
 [<ffffffff8100fe8b>] ? xen_clocksource_read+0x21/0x23
 [<ffffffff81010003>] ? xen_clocksource_get_cycles+0x9/0x16
 [<ffffffff81075700>] ? ktime_get_ts+0xb2/0xbf
 [<ffffffff811255b6>] sys_select+0x96/0xbe
 [<ffffffff81013d32>] system_call_fastpath+0x16/0x1b
block tdb: sector-size: 512 capacity: 20971520
block tdc: sector-size: 512 capacity: 146800640
block tdd: sector-size: 512 capacity: 188743680
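
For what it's worth, "trying to register non-static key" is lockdep's way of
saying that a lock's class key lives in dynamically allocated (non-static)
storage and was never registered, typically because the lock was not run
through spin_lock_init() or given an explicit class.  A generic sketch of the
usual annotation (the structure and names here are made up for illustration,
not the actual blktap code):

#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/lockdep.h>

struct foo {                            /* hypothetical example structure */
        spinlock_t lock;
};

static struct lock_class_key foo_lock_key;

static struct foo *foo_alloc(void)
{
        struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

        if (!f)
                return NULL;
        spin_lock_init(&f->lock);                   /* gives lockdep a valid key */
        lockdep_set_class(&f->lock, &foo_lock_key); /* optional: a named class */
        return f;
}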

        J



Re: blktap lockdep hiccup

Daniel Stodden-5
On Mon, 2010-09-06 at 21:39 -0400, Jeremy Fitzhardinge wrote:

> On 09/03/2010 09:08 AM, Daniel Stodden wrote:
> > On Thu, 2010-09-02 at 18:46 -0400, Jeremy Fitzhardinge wrote:
> >> On 08/22/2010 11:54 PM, Daniel Stodden wrote:
> >>> Response processing doesn't really belong into hard irq context.
> >>>
> >>> Another potential problem this avoids is that switching interrupt cpu
> >>> affinity in Xen domains can presently lead to event loss, if
> >>> RING_FINAL_CHECK is run from hard irq context.
> >> I just got this warning from a 32-bit pv domain.  I think it may relate
> >> to this change.  The warning is
> > We clearly spin_lock_irqsave all through the blkif_do_interrupt frame.
> >
> > It follows that something underneath quite unconditionally chose to
> > reenable them again (?)
> >
> > Either: Can you add a bunch of similar WARN_ONs along that path?
> >
> > Or: This lock is quite coarse-grained. The lock only matters for queue
> > access, and we know irqs are reenabled, so no need for flags. In fact we
> > only need to spin_lock_irq around the __blk_end_ calls and
> > kick_pending_.
> >
> > But I don't immediately see what's to blame, so I'd be curious.
>
> I haven't got around to investigating this in more detail yet, but
> there's also this long-standing lockdep hiccup in blktap:

Ack. Let's fix that sometime this week and see if we can clean up the
spin-locking problem then, too.

Daniel

> Starting auto Xen domains: lurch  alloc irq_desc for 1235 on node 0
>   alloc kstat_irqs on node 0
> block tda: sector-size: 512 capacity: 614400
> INFO: trying to register non-static key.
> the code is fine but needs lockdep annotation.
> turning off the locking correctness validator.
> Pid: 4266, comm: tapdisk2 Not tainted 2.6.32.21 #146
> Call Trace:
>  [<ffffffff8107f0a4>] __lock_acquire+0x1df/0x16e5
>  [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff81010082>] ? check_events+0x12/0x20
>  [<ffffffff810f0359>] ? apply_to_page_range+0x295/0x37d
>  [<ffffffff81080677>] lock_acquire+0xcd/0xf1
>  [<ffffffff810f0359>] ? apply_to_page_range+0x295/0x37d
>  [<ffffffff810f0259>] ? apply_to_page_range+0x195/0x37d
>  [<ffffffff81506f7d>] _spin_lock+0x31/0x40
>  [<ffffffff810f0359>] ? apply_to_page_range+0x295/0x37d
>  [<ffffffff810f0359>] apply_to_page_range+0x295/0x37d
>  [<ffffffff812ab37c>] ? blktap_map_uaddr_fn+0x0/0x55
>  [<ffffffff8100d0cf>] ? xen_make_pte+0x8a/0x8e
>  [<ffffffff812ac34e>] blktap_device_process_request+0x43d/0x954
>  [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff81010082>] ? check_events+0x12/0x20
>  [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff81010082>] ? check_events+0x12/0x20
>  [<ffffffff8107d687>] ? mark_held_locks+0x52/0x70
>  [<ffffffff81506ddb>] ? _spin_unlock_irq+0x30/0x3c
>  [<ffffffff8107d949>] ? trace_hardirqs_on_caller+0x125/0x150
>  [<ffffffff812acba6>] blktap_device_run_queue+0x1c5/0x28f
>  [<ffffffff812a0234>] ? unbind_from_irq+0x18/0x198
>  [<ffffffff81010082>] ? check_events+0x12/0x20
>  [<ffffffff812ab14d>] blktap_ring_poll+0x7c/0xc7
>  [<ffffffff81124e9b>] do_select+0x387/0x584
>  [<ffffffff81124b14>] ? do_select+0x0/0x584
>  [<ffffffff811255de>] ? __pollwait+0x0/0xcc
>  [<ffffffff811256aa>] ? pollwake+0x0/0x56
>  [<ffffffff811256aa>] ? pollwake+0x0/0x56
>  [<ffffffff811256aa>] ? pollwake+0x0/0x56
>  [<ffffffff811256aa>] ? pollwake+0x0/0x56
>  [<ffffffff8108059b>] ? __lock_acquire+0x16d6/0x16e5
>  [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff81010082>] ? check_events+0x12/0x20
>  [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff81010082>] ? check_events+0x12/0x20
>  [<ffffffff8100f955>] ? xen_force_evtchn_callback+0xd/0xf
>  [<ffffffff811252a4>] core_sys_select+0x20c/0x2da
>  [<ffffffff811250d6>] ? core_sys_select+0x3e/0x2da
>  [<ffffffff81010082>] ? check_events+0x12/0x20
>  [<ffffffff8101006f>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81108661>] ? kmem_cache_free+0x18e/0x1c8
>  [<ffffffff8141e912>] ? sock_destroy_inode+0x19/0x1b
>  [<ffffffff811299bd>] ? destroy_inode+0x2f/0x44
>  [<ffffffff8102ef22>] ? pvclock_clocksource_read+0x4b/0xa2
>  [<ffffffff8100fe8b>] ? xen_clocksource_read+0x21/0x23
>  [<ffffffff81010003>] ? xen_clocksource_get_cycles+0x9/0x16
>  [<ffffffff81075700>] ? ktime_get_ts+0xb2/0xbf
>  [<ffffffff811255b6>] sys_select+0x96/0xbe
>  [<ffffffff81013d32>] system_call_fastpath+0x16/0x1b
> block tdb: sector-size: 512 capacity: 20971520
> block tdc: sector-size: 512 capacity: 146800640
> block tdd: sector-size: 512 capacity: 188743680
>
> J
>




Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Jeremy Fitzhardinge
In reply to this post by Daniel Stodden-5
 On 09/03/2010 09:08 AM, Daniel Stodden wrote:

> We clearly spin_lock_irqsave all through the blkif_do_interrupt frame.
>
> It follows that something underneath quite unconditionally chose to
> reenable them again (?)
>
> Either: Can you add a bunch of similar WARN_ONs along that path?
>
> Or: This lock is quite coarse-grained. The lock only matters for queue
> access, and we know irqs are reenabled, so no need for flags. In fact we
> only need to spin_lock_irq around the __blk_end_ calls and
> kick_pending_.
>
> But I don't immediately see what's to blame, so I'd be curious.

It looks like __blk_end_request_all(req, error); (line 743) is returning
with interrupts enabled sometimes (not consistently).  I had a quick
look through the code, but I couldn't see where it touches the interrupt
state at all.
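
One way to pin down exactly where interrupts come back on, along the lines of
the WARN_ON suggestion earlier in the thread (a debugging sketch only, not
part of the patch):

        /* inside the response loop of blkif_do_interrupt(), around the
         * completion call at line 743 */
        WARN_ON_ONCE(!irqs_disabled());
        __blk_end_request_all(req, error);
        WARN_ON_ONCE(!irqs_disabled());  /* fires if the completion path
                                          * re-enabled interrupts under us */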

    J


Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Daniel Stodden-5
On Tue, 2010-09-07 at 22:03 -0400, Jeremy Fitzhardinge wrote:

> On 09/03/2010 09:08 AM, Daniel Stodden wrote:
> > We clearly spin_lock_irqsave all through the blkif_do_interrupt frame.
> >
> > It follows that something underneath quite unconditionally chose to
> > reenable them again (?)
> >
> > Either: Can you add a bunch of similar WARN_ONs along that path?
> >
> > Or: This lock is quite coarse-grained. The lock only matters for queue
> > access, and we know irqs are reenabled, so no need for flags. In fact we
> > only need to spin_lock_irq around the __blk_end_ calls and
> > kick_pending_.
> >
> > But I don't immediately see what's to blame, so I'd be curious.
>
> It looks like __blk_end_request_all(req, error); (line 743) is returning
> with interrupts enabled sometimes (not consistently).  I had a quick
> look through the code, but I couldn't see where it touches the interrupt
> state at all.

Oha. Was this found on 2.6.32 or later?

Thanks,
Daniel



Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Jeremy Fitzhardinge
 On 09/08/2010 12:21 PM, Daniel Stodden wrote:

> On Tue, 2010-09-07 at 22:03 -0400, Jeremy Fitzhardinge wrote:
>> On 09/03/2010 09:08 AM, Daniel Stodden wrote:
>>> We clearly spin_lock_irqsave all through the blkif_do_interrupt frame.
>>>
>>> It follows that something underneath quite unconditionally chose to
>>> reenable them again (?)
>>>
>>> Either: Can you add a bunch of similar WARN_ONs along that path?
>>>
>>> Or: This lock is quite coarse-grained. The lock only matters for queue
>>> access, and we know irqs are reenabled, so no need for flags. In fact we
>>> only need to spin_lock_irq around the __blk_end_ calls and
>>> kick_pending_.
>>>
>>> But I don't immediately see what's to blame, so I'd be curious.
>> It looks like __blk_end_request_all(req, error); (line 743) is returning
>> with interrupts enabled sometimes (not consistently).  I had a quick
>> look through the code, but I couldn't see where it touches the interrupt
>> state at all.
> Oha. Was this found on 2.6.32 or later?

Yeah, xen/next-2.6.32.

    J


Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Andrew Jones-15
In reply to this post by Jeremy Fitzhardinge
On 09/03/2010 12:46 AM, Jeremy Fitzhardinge wrote:

>  On 08/22/2010 11:54 PM, Daniel Stodden wrote:
>> Response processing doesn't really belong into hard irq context.
>>
>> Another potential problem this avoids is that switching interrupt cpu
>> affinity in Xen domains can presently lead to event loss, if
>> RING_FINAL_CHECK is run from hard irq context.
>
> I just got this warning from a 32-bit pv domain.  I think it may relate
> to this change.  The warning is
>
> void blk_start_queue(struct request_queue *q)
> {
> WARN_ON(!irqs_disabled());
>
>
> Oddly, I only saw this pair once at boot, and after that the system
> seemed fine...
>
> [    4.376451] ------------[ cut here ]------------
> [    4.377415] WARNING: at /home/jeremy/git/linux/block/blk-core.c:337 blk_start_queue+0x20/0x36()
> [    4.377415] Modules linked in: xfs exportfs xen_blkfront [last unloaded: scsi_wait_scan]
> [    4.377415] Pid: 0, comm: swapper Not tainted 2.6.32.21 #32
> [    4.377415] Call Trace:
> [    4.377415]  [<c1039f74>] warn_slowpath_common+0x65/0x7c
> [    4.377415]  [<c11b3ae1>] ? blk_start_queue+0x20/0x36
> [    4.377415]  [<c1039f98>] warn_slowpath_null+0xd/0x10
> [    4.377415]  [<c11b3ae1>] blk_start_queue+0x20/0x36
> [    4.377415]  [<edc74712>] kick_pending_request_queues+0x1c/0x2a [xen_blkfront]
> [    4.377415]  [<edc74ec4>] blkif_do_interrupt+0x176/0x189 [xen_blkfront]
> [    4.377415]  [<c103e063>] tasklet_action+0x63/0xa8
> [    4.377415]  [<c103f2d5>] __do_softirq+0xac/0x152
> [    4.377415]  [<c103f3ac>] do_softirq+0x31/0x3c
> [    4.377415]  [<c103f484>] irq_exit+0x29/0x5c
> [    4.377415]  [<c121a1b6>] xen_evtchn_do_upcall+0x29/0x34
> [    4.377415]  [<c100a027>] xen_do_upcall+0x7/0xc
> [    4.377415]  [<c10023a7>] ? hypercall_page+0x3a7/0x1005
> [    4.377415]  [<c10065a9>] ? xen_safe_halt+0x12/0x1f
> [    4.377415]  [<c10042cb>] xen_idle+0x27/0x38
> [    4.377415]  [<c100877e>] cpu_idle+0x49/0x63
> [    4.377415]  [<c14a6427>] rest_init+0x53/0x55
> [    4.377415]  [<c179c814>] start_kernel+0x2d4/0x2d9
> [    4.377415]  [<c179c0a8>] i386_start_kernel+0x97/0x9e
> [    4.377415]  [<c179f478>] xen_start_kernel+0x576/0x57e
> [    4.377415] ---[ end trace 0bfb98f0ed515cdb ]---
> [    4.377415] ------------[ cut here ]------------
> [    4.377415] WARNING: at /home/jeremy/git/linux/block/blk-core.c:245 blk_remove_plug+0x20/0x7e()
> [    4.377415] Modules linked in: xfs exportfs xen_blkfront [last unloaded: scsi_wait_scan]
> [    4.377415] Pid: 0, comm: swapper Tainted: G        W  2.6.32.21 #32
> [    4.377415] Call Trace:
> [    4.377415]  [<c1039f74>] warn_slowpath_common+0x65/0x7c
> [    4.377415]  [<c11b3961>] ? blk_remove_plug+0x20/0x7e
> [    4.377415]  [<c1039f98>] warn_slowpath_null+0xd/0x10
> [    4.377415]  [<c11b3961>] blk_remove_plug+0x20/0x7e
> [    4.377415]  [<c11b39ca>] __blk_run_queue+0xb/0x5e
> [    4.377415]  [<c11b3af4>] blk_start_queue+0x33/0x36
> [    4.377415]  [<edc74712>] kick_pending_request_queues+0x1c/0x2a [xen_blkfront]
> [    4.377415]  [<edc74ec4>] blkif_do_interrupt+0x176/0x189 [xen_blkfront]
> [    4.377415]  [<c103e063>] tasklet_action+0x63/0xa8
> [    4.377415]  [<c103f2d5>] __do_softirq+0xac/0x152
> [    4.377415]  [<c103f3ac>] do_softirq+0x31/0x3c
> [    4.377415]  [<c103f484>] irq_exit+0x29/0x5c
> [    4.377415]  [<c121a1b6>] xen_evtchn_do_upcall+0x29/0x34
> [    4.377415]  [<c100a027>] xen_do_upcall+0x7/0xc
> [    4.377415]  [<c10023a7>] ? hypercall_page+0x3a7/0x1005
> [    4.377415]  [<c10065a9>] ? xen_safe_halt+0x12/0x1f
> [    4.377415]  [<c10042cb>] xen_idle+0x27/0x38
> [    4.377415]  [<c100877e>] cpu_idle+0x49/0x63
> [    4.377415]  [<c14a6427>] rest_init+0x53/0x55
> [    4.377415]  [<c179c814>] start_kernel+0x2d4/0x2d9
> [    4.377415]  [<c179c0a8>] i386_start_kernel+0x97/0x9e
> [    4.377415]  [<c179f478>] xen_start_kernel+0x576/0x57e
> [    4.377415] ---[ end trace 0bfb98f0ed515cdc ]---
>
> J
>

Hi Jeremy,

Any developments with this? I've got a report of the exact same warnings
on a RHEL6 guest. See

https://bugzilla.redhat.com/show_bug.cgi?id=632802

RHEL6 doesn't have the 'Move blkif_interrupt into a tasklet' patch, so
that can be ruled out. Unfortunately I don't have this reproducing on a
test machine, so it's difficult to debug.  The report I have shows that
in at least one case it occurred at boot, right after initializing the
block device. I'm trying to get confirmation that that's always the case.

Thanks in advance for any pointers you might have.

Drew


Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Jeremy Fitzhardinge
 On 09/23/2010 09:08 AM, Andrew Jones wrote:

> On 09/03/2010 12:46 AM, Jeremy Fitzhardinge wrote:
>>  On 08/22/2010 11:54 PM, Daniel Stodden wrote:
>>> Response processing doesn't really belong into hard irq context.
>>>
>>> Another potential problem this avoids is that switching interrupt cpu
>>> affinity in Xen domains can presently lead to event loss, if
>>> RING_FINAL_CHECK is run from hard irq context.
>> I just got this warning from a 32-bit pv domain.  I think it may relate
>> to this change.  The warning is
>>
>> void blk_start_queue(struct request_queue *q)
>> {
>> WARN_ON(!irqs_disabled());
>>
>>
>> Oddly, I only saw this pair once at boot, and after that the system
>> seemed fine...
>>
>> [    4.376451] ------------[ cut here ]------------
>> [    4.377415] WARNING: at /home/jeremy/git/linux/block/blk-core.c:337 blk_start_queue+0x20/0x36()
>> [    4.377415] Modules linked in: xfs exportfs xen_blkfront [last unloaded: scsi_wait_scan]
>> [    4.377415] Pid: 0, comm: swapper Not tainted 2.6.32.21 #32
>> [    4.377415] Call Trace:
>> [    4.377415]  [<c1039f74>] warn_slowpath_common+0x65/0x7c
>> [    4.377415]  [<c11b3ae1>] ? blk_start_queue+0x20/0x36
>> [    4.377415]  [<c1039f98>] warn_slowpath_null+0xd/0x10
>> [    4.377415]  [<c11b3ae1>] blk_start_queue+0x20/0x36
>> [    4.377415]  [<edc74712>] kick_pending_request_queues+0x1c/0x2a [xen_blkfront]
>> [    4.377415]  [<edc74ec4>] blkif_do_interrupt+0x176/0x189 [xen_blkfront]
>> [    4.377415]  [<c103e063>] tasklet_action+0x63/0xa8
>> [    4.377415]  [<c103f2d5>] __do_softirq+0xac/0x152
>> [    4.377415]  [<c103f3ac>] do_softirq+0x31/0x3c
>> [    4.377415]  [<c103f484>] irq_exit+0x29/0x5c
>> [    4.377415]  [<c121a1b6>] xen_evtchn_do_upcall+0x29/0x34
>> [    4.377415]  [<c100a027>] xen_do_upcall+0x7/0xc
>> [    4.377415]  [<c10023a7>] ? hypercall_page+0x3a7/0x1005
>> [    4.377415]  [<c10065a9>] ? xen_safe_halt+0x12/0x1f
>> [    4.377415]  [<c10042cb>] xen_idle+0x27/0x38
>> [    4.377415]  [<c100877e>] cpu_idle+0x49/0x63
>> [    4.377415]  [<c14a6427>] rest_init+0x53/0x55
>> [    4.377415]  [<c179c814>] start_kernel+0x2d4/0x2d9
>> [    4.377415]  [<c179c0a8>] i386_start_kernel+0x97/0x9e
>> [    4.377415]  [<c179f478>] xen_start_kernel+0x576/0x57e
>> [    4.377415] ---[ end trace 0bfb98f0ed515cdb ]---
>> [    4.377415] ------------[ cut here ]------------
>> [    4.377415] WARNING: at /home/jeremy/git/linux/block/blk-core.c:245 blk_remove_plug+0x20/0x7e()
>> [    4.377415] Modules linked in: xfs exportfs xen_blkfront [last unloaded: scsi_wait_scan]
>> [    4.377415] Pid: 0, comm: swapper Tainted: G        W  2.6.32.21 #32
>> [    4.377415] Call Trace:
>> [    4.377415]  [<c1039f74>] warn_slowpath_common+0x65/0x7c
>> [    4.377415]  [<c11b3961>] ? blk_remove_plug+0x20/0x7e
>> [    4.377415]  [<c1039f98>] warn_slowpath_null+0xd/0x10
>> [    4.377415]  [<c11b3961>] blk_remove_plug+0x20/0x7e
>> [    4.377415]  [<c11b39ca>] __blk_run_queue+0xb/0x5e
>> [    4.377415]  [<c11b3af4>] blk_start_queue+0x33/0x36
>> [    4.377415]  [<edc74712>] kick_pending_request_queues+0x1c/0x2a [xen_blkfront]
>> [    4.377415]  [<edc74ec4>] blkif_do_interrupt+0x176/0x189 [xen_blkfront]
>> [    4.377415]  [<c103e063>] tasklet_action+0x63/0xa8
>> [    4.377415]  [<c103f2d5>] __do_softirq+0xac/0x152
>> [    4.377415]  [<c103f3ac>] do_softirq+0x31/0x3c
>> [    4.377415]  [<c103f484>] irq_exit+0x29/0x5c
>> [    4.377415]  [<c121a1b6>] xen_evtchn_do_upcall+0x29/0x34
>> [    4.377415]  [<c100a027>] xen_do_upcall+0x7/0xc
>> [    4.377415]  [<c10023a7>] ? hypercall_page+0x3a7/0x1005
>> [    4.377415]  [<c10065a9>] ? xen_safe_halt+0x12/0x1f
>> [    4.377415]  [<c10042cb>] xen_idle+0x27/0x38
>> [    4.377415]  [<c100877e>] cpu_idle+0x49/0x63
>> [    4.377415]  [<c14a6427>] rest_init+0x53/0x55
>> [    4.377415]  [<c179c814>] start_kernel+0x2d4/0x2d9
>> [    4.377415]  [<c179c0a8>] i386_start_kernel+0x97/0x9e
>> [    4.377415]  [<c179f478>] xen_start_kernel+0x576/0x57e
>> [    4.377415] ---[ end trace 0bfb98f0ed515cdc ]---
>>
>> J
>>
> Hi Jeremy,
>
> Any developments with this? I've got a report of the exact same warnings
> on RHEL6 guest. See
>
> https://bugzilla.redhat.com/show_bug.cgi?id=632802
>
> RHEL6 doesn't have the 'Move blkif_interrupt into a tasklet' patch, so
> that can be ruled out. Unfortunately I don't have this reproducing on a
> test machine, so it's difficult to debug.  The report I have showed that
> in at least one case it occurred on boot up, right after initting the
> block device. I'm trying to get confirmation if that's always the case.
>
> Thanks in advance for any pointers you might have.
>

Yes, I see it even after reverting that change.  However, I only see it
on my domain with an XFS filesystem; I haven't dug any deeper to see
whether that's relevant.

Do you know when this appeared?  Is it recent?  What changes are in the
RHEL6 kernel in question?

    J


Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

pbonzini
On 09/23/2010 06:23 PM, Jeremy Fitzhardinge wrote:

>> Any developments with this? I've got a report of the exact same warnings
>> on RHEL6 guest. See
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=632802
>>
>> RHEL6 doesn't have the 'Move blkif_interrupt into a tasklet' patch, so
>> that can be ruled out. Unfortunately I don't have this reproducing on a
>> test machine, so it's difficult to debug.  The report I have showed that
>> in at least one case it occurred on boot up, right after initting the
>> block device. I'm trying to get confirmation if that's always the case.
>>
>> Thanks in advance for any pointers you might have.
>
> Yes, I see it even after reverting that change as well.  However I only
> see it on my domain with an XFS filesystem, but I haven't dug any deeper
> to see if that's relevant.
>
> Do you know when this appeared?  Is it recent?  What changes are in the
> rhel6 kernel in question?

It's got pretty much everything in stable-2.6.32.x, up to the 16-patch
blkfront series you posted last July.  There are some RHEL-specific
workarounds for PV-on-HVM, but for PV domains everything matches upstream.

Paolo


Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Jeremy Fitzhardinge
 On 09/23/2010 09:38 AM, Paolo Bonzini wrote:

> On 09/23/2010 06:23 PM, Jeremy Fitzhardinge wrote:
>>> Any developments with this? I've got a report of the exact same
>>> warnings
>>> on RHEL6 guest. See
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=632802
>>>
>>> RHEL6 doesn't have the 'Move blkif_interrupt into a tasklet' patch, so
>>> that can be ruled out. Unfortunately I don't have this reproducing on a
>>> test machine, so it's difficult to debug.  The report I have showed
>>> that
>>> in at least one case it occurred on boot up, right after initting the
>>> block device. I'm trying to get confirmation if that's always the case.
>>>
>>> Thanks in advance for any pointers you might have.
>>
>> Yes, I see it even after reverting that change as well.  However I only
>> see it on my domain with an XFS filesystem, but I haven't dug any deeper
>> to see if that's relevant.
>>
>> Do you know when this appeared?  Is it recent?  What changes are in the
>> rhel6 kernel in question?
>
> It's got pretty much everything in stable-2.6.32.x, up to the 16 patch
> blkfront series you posted last July.  There are some RHEL-specific
> workarounds for PV-on-HVM, but for PV domains everything matches
> upstream.

Have you tried bisecting to see when this particular problem appeared?
It looks to me like something is accidentally re-enabling interrupts -
perhaps a stack overrun is corrupting the "flags" argument between a
spin_lock_irqsave()/restore pair.
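
To make that concrete: spin_lock_irqsave() stashes the current interrupt
state in the caller's on-stack flags word, and the matching restore trusts it
blindly (simplified sketch):

        unsigned long flags;    /* lives on the caller's stack */

        spin_lock_irqsave(&info->io_lock, flags);   /* saves IF, disables irqs */
        /*
         * ... if a stack overrun (or anything else) scribbles on 'flags'
         * here, the restore below can turn interrupts back on even though
         * they were disabled on entry ...
         */
        spin_unlock_irqrestore(&info->io_lock, flags);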

Is it only on 32-bit kernels?

    J


Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Andrew Jones-15
On 09/23/2010 08:36 PM, Jeremy Fitzhardinge wrote:

>  On 09/23/2010 09:38 AM, Paolo Bonzini wrote:
>> On 09/23/2010 06:23 PM, Jeremy Fitzhardinge wrote:
>>>> Any developments with this? I've got a report of the exact same
>>>> warnings
>>>> on RHEL6 guest. See
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=632802
>>>>
>>>> RHEL6 doesn't have the 'Move blkif_interrupt into a tasklet' patch, so
>>>> that can be ruled out. Unfortunately I don't have this reproducing on a
>>>> test machine, so it's difficult to debug.  The report I have showed
>>>> that
>>>> in at least one case it occurred on boot up, right after initting the
>>>> block device. I'm trying to get confirmation if that's always the case.
>>>>
>>>> Thanks in advance for any pointers you might have.
>>>
>>> Yes, I see it even after reverting that change as well.  However I only
>>> see it on my domain with an XFS filesystem, but I haven't dug any deeper
>>> to see if that's relevant.
>>>
>>> Do you know when this appeared?  Is it recent?  What changes are in the
>>> rhel6 kernel in question?
>>
>> It's got pretty much everything in stable-2.6.32.x, up to the 16 patch
>> blkfront series you posted last July.  There are some RHEL-specific
>> workarounds for PV-on-HVM, but for PV domains everything matches
>> upstream.
>
> Have you tried bisecting to see when this particular problem appeared?
> It looks to me like something is accidentally re-enabling interrupts -
> perhaps a stack overrun is corrupting the "flags" argument between a
> spin_lock_irqsave()/restore pair.
>

Unfortunately I don't have a test machine where I can do a bisection
(yet). I'm looking for one. I only have this one report so far, and it's
on a production machine.

> Is it only on 32-bit kernels?
>

This one report I have is a 32b guest on a 64b host.

Drew



Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Jeremy Fitzhardinge
 On 09/24/2010 12:14 AM, Andrew Jones wrote:

> On 09/23/2010 08:36 PM, Jeremy Fitzhardinge wrote:
>>  On 09/23/2010 09:38 AM, Paolo Bonzini wrote:
>>> On 09/23/2010 06:23 PM, Jeremy Fitzhardinge wrote:
>>>>> Any developments with this? I've got a report of the exact same
>>>>> warnings
>>>>> on RHEL6 guest. See
>>>>>
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=632802
>>>>>
>>>>> RHEL6 doesn't have the 'Move blkif_interrupt into a tasklet' patch, so
>>>>> that can be ruled out. Unfortunately I don't have this reproducing on a
>>>>> test machine, so it's difficult to debug.  The report I have showed
>>>>> that
>>>>> in at least one case it occurred on boot up, right after initting the
>>>>> block device. I'm trying to get confirmation if that's always the case.
>>>>>
>>>>> Thanks in advance for any pointers you might have.
>>>> Yes, I see it even after reverting that change as well.  However I only
>>>> see it on my domain with an XFS filesystem, but I haven't dug any deeper
>>>> to see if that's relevant.
>>>>
>>>> Do you know when this appeared?  Is it recent?  What changes are in the
>>>> rhel6 kernel in question?
>>> It's got pretty much everything in stable-2.6.32.x, up to the 16 patch
>>> blkfront series you posted last July.  There are some RHEL-specific
>>> workarounds for PV-on-HVM, but for PV domains everything matches
>>> upstream.
>> Have you tried bisecting to see when this particular problem appeared?
>> It looks to me like something is accidentally re-enabling interrupts -
>> perhaps a stack overrun is corrupting the "flags" argument between a
>> spin_lock_irqsave()/restore pair.
>>
> Unfortunately I don't have a test machine where I can do a bisection
> (yet). I'm looking for one. I only have this one report so far, and it's
> on a production machine.

The report says that it's repeatedly killing the machine, though?  In my
testing, it seems to hit the warning once at boot, but is OK after that
(not that I'm doing anything very stressful on the domain).

>> Is it only on 32-bit kernels?
>>
> This one report I have is a 32b guest on a 64b host.

Is it using XFS by any chance?  So far I've traced the re-enable to
xfs_buf_bio_end_io().  However, my suspicion is that it might be related
to the barrier changes we did.

    J


Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Andrew Jones-15
On 09/24/2010 08:50 PM, Jeremy Fitzhardinge wrote:

>  On 09/24/2010 12:14 AM, Andrew Jones wrote:
>> On 09/23/2010 08:36 PM, Jeremy Fitzhardinge wrote:
>>>  On 09/23/2010 09:38 AM, Paolo Bonzini wrote:
>>>> On 09/23/2010 06:23 PM, Jeremy Fitzhardinge wrote:
>>>>>> Any developments with this? I've got a report of the exact same
>>>>>> warnings
>>>>>> on RHEL6 guest. See
>>>>>>
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=632802
>>>>>>
>>>>>> RHEL6 doesn't have the 'Move blkif_interrupt into a tasklet' patch, so
>>>>>> that can be ruled out. Unfortunately I don't have this reproducing on a
>>>>>> test machine, so it's difficult to debug.  The report I have showed
>>>>>> that
>>>>>> in at least one case it occurred on boot up, right after initting the
>>>>>> block device. I'm trying to get confirmation if that's always the case.
>>>>>>
>>>>>> Thanks in advance for any pointers you might have.
>>>>> Yes, I see it even after reverting that change as well.  However I only
>>>>> see it on my domain with an XFS filesystem, but I haven't dug any deeper
>>>>> to see if that's relevant.
>>>>>
>>>>> Do you know when this appeared?  Is it recent?  What changes are in the
>>>>> rhel6 kernel in question?
>>>> It's got pretty much everything in stable-2.6.32.x, up to the 16 patch
>>>> blkfront series you posted last July.  There are some RHEL-specific
>>>> workarounds for PV-on-HVM, but for PV domains everything matches
>>>> upstream.
>>> Have you tried bisecting to see when this particular problem appeared?
>>> It looks to me like something is accidentally re-enabling interrupts -
>>> perhaps a stack overrun is corrupting the "flags" argument between a
>>> spin_lock_irqsave()/restore pair.
>>>
>> Unfortunately I don't have a test machine where I can do a bisection
>> (yet). I'm looking for one. I only have this one report so far, and it's
>> on a production machine.
>
> The report says that its repeatedly killing the machine though?  In my
> testing, it seems to hit the warning once at boot, but is OK after that
> (not that I'm doing anything very stressful on the domain).
>

It looks like the crash is from failing to read swap due to a bad page
map. It's possibly a separate issue, but I wanted to try to clean this
one up first to see what happens.

>>> Is it only on 32-bit kernels?
>>>
>> This one report I have is a 32b guest on a 64b host.
>
> Is it using XFS by any chance?  So far I've traced the re-enable to
> xfs_buf_bio_end_io().  However, my suspicion is that it might be related
> to the barrier changes we did.
>

I'll check on the xfs and let you know.

>     J
>



Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Daniel Stodden-5
On Mon, 2010-09-27 at 03:41 -0400, Andrew Jones wrote:

> On 09/24/2010 08:50 PM, Jeremy Fitzhardinge wrote:
> >  On 09/24/2010 12:14 AM, Andrew Jones wrote:
> >> On 09/23/2010 08:36 PM, Jeremy Fitzhardinge wrote:
> >>>  On 09/23/2010 09:38 AM, Paolo Bonzini wrote:
> >>>> On 09/23/2010 06:23 PM, Jeremy Fitzhardinge wrote:
> >>>>>> Any developments with this? I've got a report of the exact same
> >>>>>> warnings
> >>>>>> on RHEL6 guest. See
> >>>>>>
> >>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=632802
> >>>>>>
> >>>>>> RHEL6 doesn't have the 'Move blkif_interrupt into a tasklet' patch, so
> >>>>>> that can be ruled out. Unfortunately I don't have this reproducing on a
> >>>>>> test machine, so it's difficult to debug.  The report I have showed
> >>>>>> that
> >>>>>> in at least one case it occurred on boot up, right after initting the
> >>>>>> block device. I'm trying to get confirmation if that's always the case.
> >>>>>>
> >>>>>> Thanks in advance for any pointers you might have.
> >>>>> Yes, I see it even after reverting that change as well.  However I only
> >>>>> see it on my domain with an XFS filesystem, but I haven't dug any deeper
> >>>>> to see if that's relevant.
> >>>>>
> >>>>> Do you know when this appeared?  Is it recent?  What changes are in the
> >>>>> rhel6 kernel in question?
> >>>> It's got pretty much everything in stable-2.6.32.x, up to the 16 patch
> >>>> blkfront series you posted last July.  There are some RHEL-specific
> >>>> workarounds for PV-on-HVM, but for PV domains everything matches
> >>>> upstream.
> >>> Have you tried bisecting to see when this particular problem appeared?
> >>> It looks to me like something is accidentally re-enabling interrupts -
> >>> perhaps a stack overrun is corrupting the "flags" argument between a
> >>> spin_lock_irqsave()/restore pair.
> >>>
> >> Unfortunately I don't have a test machine where I can do a bisection
> >> (yet). I'm looking for one. I only have this one report so far, and it's
> >> on a production machine.
> >
> > The report says that its repeatedly killing the machine though?  In my
> > testing, it seems to hit the warning once at boot, but is OK after that
> > (not that I'm doing anything very stressful on the domain).
> >
>
> It looks like the crash is from failing to read swap due to a bad page
> map. It's possibly another issue, but I wanted to try and clean this
> issue up first to see what happens.

Uh oh. Sure this was a frontend crash? If you see it again, a stack
trace to look at would be great.

Thanks,
Daniel






Re: [PATCH] blkfront: Move blkif_interrupt into a tasklet.

Andrew Jones-15
On 09/27/2010 11:46 AM, Daniel Stodden wrote:

> On Mon, 2010-09-27 at 03:41 -0400, Andrew Jones wrote:
>> On 09/24/2010 08:50 PM, Jeremy Fitzhardinge wrote:
>>>  On 09/24/2010 12:14 AM, Andrew Jones wrote:
>>>> On 09/23/2010 08:36 PM, Jeremy Fitzhardinge wrote:
>>>>>  On 09/23/2010 09:38 AM, Paolo Bonzini wrote:
>>>>>> On 09/23/2010 06:23 PM, Jeremy Fitzhardinge wrote:
>>>>>>>> Any developments with this? I've got a report of the exact same
>>>>>>>> warnings
>>>>>>>> on RHEL6 guest. See
>>>>>>>>
>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=632802
>>>>>>>>
>>>>>>>> RHEL6 doesn't have the 'Move blkif_interrupt into a tasklet' patch, so
>>>>>>>> that can be ruled out. Unfortunately I don't have this reproducing on a
>>>>>>>> test machine, so it's difficult to debug.  The report I have showed
>>>>>>>> that
>>>>>>>> in at least one case it occurred on boot up, right after initting the
>>>>>>>> block device. I'm trying to get confirmation if that's always the case.
>>>>>>>>
>>>>>>>> Thanks in advance for any pointers you might have.
>>>>>>> Yes, I see it even after reverting that change as well.  However I only
>>>>>>> see it on my domain with an XFS filesystem, but I haven't dug any deeper
>>>>>>> to see if that's relevant.
>>>>>>>
>>>>>>> Do you know when this appeared?  Is it recent?  What changes are in the
>>>>>>> rhel6 kernel in question?
>>>>>> It's got pretty much everything in stable-2.6.32.x, up to the 16 patch
>>>>>> blkfront series you posted last July.  There are some RHEL-specific
>>>>>> workarounds for PV-on-HVM, but for PV domains everything matches
>>>>>> upstream.
>>>>> Have you tried bisecting to see when this particular problem appeared?
>>>>> It looks to me like something is accidentally re-enabling interrupts -
>>>>> perhaps a stack overrun is corrupting the "flags" argument between a
>>>>> spin_lock_irqsave()/restore pair.
>>>>>
>>>> Unfortunately I don't have a test machine where I can do a bisection
>>>> (yet). I'm looking for one. I only have this one report so far, and it's
>>>> on a production machine.
>>>
>>> The report says that its repeatedly killing the machine though?  In my
>>> testing, it seems to hit the warning once at boot, but is OK after that
>>> (not that I'm doing anything very stressful on the domain).
>>>
>>
>> It looks like the crash is from failing to read swap due to a bad page
>> map. It's possibly another issue, but I wanted to try and clean this
>> issue up first to see what happens.
>
> Uh oh. Sure this was a frontend crash? If you see it a again, a stack
> trace to look at would be great.
>

Hi Daniel,

You can take a look at this bug

https://bugzilla.redhat.com/show_bug.cgi?id=632802

there are stacks for the swap issue in the comments, and also this attached
dmesg:

https://bugzilla.redhat.com/attachment.cgi?id=447789


Thanks,
Drew



> Thanks,
> Daniel
>
>
>
>

