Debugging sudden hangs

Liwei Xie
Hi list,
    We recently updated our system and started experiencing random
hangs. They happen, on average, once every 1.5 days (sometimes taking 2
days to occur, other times happening multiple times a day, somewhat
proportional to IO load).

    Before troubling the developers too much, I'd like to collect more
information. The problem is that the hangs occur without any symptoms,
crashes or panics. I've booted Xen and dom0 with
"loglvl=all guest_loglvl=all" and "loglevel=10 debug initcall_debug"
respectively.

    When the hang occurs, all domUs and dom0 simply stop responding to
key presses and network traffic, and there is no IO activity. Nothing
out of the ordinary appears in the console or logs. Even hitting
ctrl+a multiple times in the console does nothing (indicating Xen is
dead too). On the video console, we just have a blinking cursor after
the last console log (though my understanding is that the cursor blink
might be generated by the video card, so it does not indicate that
anything is still running). If the hardware WDT is on, the watchdog
eventually bites and reboots the system.
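
Because nothing reaches the logs when the hang occurs, one way to pin down when it starts is to have dom0 send a small heartbeat to a second machine and look for the point where the heartbeats stop. The following is only a minimal sketch of that idea and is not something used in this thread: the datagram format, the function names and the collector host are all hypothetical.

```python
import socket
import time


def format_heartbeat(seq, now):
    """Encode a sequence number and a UNIX timestamp as one small datagram."""
    return f"{seq} {now:.3f}".encode("ascii")


def parse_heartbeat(data):
    """Decode a datagram produced by format_heartbeat()."""
    seq, timestamp = data.decode("ascii").split()
    return int(seq), float(timestamp)


def find_gaps(timestamps, interval, tolerance=2.0):
    """Return (last_seen, next_seen) pairs where heartbeats were missing.

    A gap is reported when two consecutive timestamps are further apart
    than tolerance * interval seconds.
    """
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        if current - previous > interval * tolerance:
            gaps.append((previous, current))
    return gaps


def send_heartbeats(host, port, interval=1.0):
    """Run on dom0: send one datagram per interval until the machine stops.

    `host` and `port` identify a hypothetical collector on another box
    that records every datagram it receives.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = 0
    while True:
        sock.sendto(format_heartbeat(seq, time.time()), (host, port))
        seq += 1
        time.sleep(interval)
```

On the collector side, feeding the recorded timestamps to find_gaps() gives the last moment dom0 was still alive, which can then be compared against IO load or sensor readings from around the same time.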

    Although I believe it isn't related (since dom0 stalls too, and
we're looking at a completely stalled system rather than just domUs
having issues with disk IO), I added "gnttab_max_frames=256" to the
Xen boot arguments anyway. It didn't seem to change anything.

    Then, grasping at straws, I turned off HWPM in the BIOS, which we
had to do on another machine hosting VMware ESX. That didn't seem to
change anything either.

    At this point, I'd like to know the best way to approach this.
Can I enable further levels of debugging so that I can at least begin
to look towards a likely culprit? Is there a good way to determine
whether it may be the hardware?

    I've tried running the same kernel without Xen and simulating
heavy IO on the disk array, without issues, which makes me lean
towards Xen being part of the equation. But then again, doing random
file reads/writes isn't a good simulation of the type of workload the
domUs put on the server.

    OS: Debian Buster
    Kernel: 4.17.0-1-amd64
    Xen: 4.8.4-pre (Debian 4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9)
    CPU: Xeon E5-2699 v4
    RAM: Samsung 96GB ECC Registered
    MB: Supermicro X10SRi-F

    In case it is relevant, since it might be IO related...
    Net: Chelsio T520-CR (2 x XGB links, shared to domU using VF)
    RAID: LSI SAS3224 with 10 SAS3 drives

Warm regards,
Liwei

_______________________________________________
Xen-users mailing list
[hidden email]
https://lists.xenproject.org/mailman/listinfo/xen-users

Re: Debugging sudden hangs

Eric Duncan
In my experience, as a non-Xen user on a nearly identical motherboard (X10SRA), I would suspect the motherboard.

I've purchased 4 of these boards and run various Windows and Linux kernels.  They all have different CPUs (some Retail, some Engineering Samples), different ECC RAM and different storage setups (some using onboard SATA, some using LSI cards, etc).

They all, every single one of them, experience random hard lockups just like you describe: the system becomes completely unresponsive, the screen freezes, etc.

I don't run Xen on any of them.  I've swapped all sorts of hardware, tried several beta BIOS versions from support, RMA'd 3 of them...  They all continued to lock up.

This went on for about two years until I had enough.  I swapped all boards out for the X10DLA, using the exact same components, and I have had zero issues since.

Again, this is just one user's experience - and I just happened to be on the Xen mailing list and saw this.

Re: Debugging sudden hangs

Konrad Eisele
Systems that overheat also exhibit this kind of behavior. I experienced it once with an EPYC motherboard that was crammed into a 1U case.
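
To check the overheating theory, it can help to keep a temperature log that survives a hard hang, so the last readings before the lockup can be inspected after the reboot. The following is a rough sketch, assuming that dom0 exposes Linux thermal zones under /sys/class/thermal (values in millidegrees Celsius); whether those readings are meaningful under Xen is an assumption, and the board's BMC sensors may be a better source. The function names and the log format are hypothetical.

```python
import os
import time
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            # The kernel reports the temperature in millidegrees Celsius.
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone missing or unreadable: skip it
        readings[zone.name] = millidegrees / 1000.0
    return readings


def append_sample(log_path, readings, now=None):
    """Append one timestamped line and force it to disk before returning."""
    now = time.time() if now is None else now
    fields = " ".join(
        f"{name}={temp:.1f}" for name, temp in sorted(readings.items())
    )
    line = f"{now:.0f} {fields}\n"
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()
        # fsync so the sample is not lost in the page cache if the box hangs.
        os.fsync(log.fileno())
    return line
```

Calling append_sample("/var/log/temps.log", read_thermal_zones()) every few seconds (the path is only an example) leaves a trail whose final lines show how hot the system was just before it stopped.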

In reply to Eric Duncan's message of Sun, Aug 19, 2018, 15:12.

Re: Debugging sudden hangs

Roger Pau Monné
In reply to this post by Liwei Xie
On Sun, Aug 19, 2018 at 08:46:44PM +0800, Liwei wrote:

> Hi list,
>     We recently updated our system and started experiencing random
> hangs. It happens, on average, once every 1.5 days (sometimes taking 2
> days to occur, other times happening multiple times a day, somewhat
> proportional to IO load).
>
>     Before troubling the developers too much, I'd like to collect more
> information, however, the problem is the hangs occur without any
> symptoms/crashes/panics. I've booted xen and dom0 with:
> "loglvl=all guest_loglvl=all" and "loglevel=10 debug initcall_debug"
> respectively.

You should add iommu=debug to the command line.

>
>     When the hang occurs, all domUs and dom0 just stop responding to
> key presses, networking and there is no IO activity. Nothing gets
> generated in the console/logs (no symptoms either, no logs out of the
> ordinary). Even hitting ctrl+a multiple times in the console does
> nothing (indicating xen is dead too). On the video console, we just
> have a blinking cursor after the last console log (though my
> understanding is that the cursor blink might be generated by the video
> card rather than any indication that at least something is still
> running). If the hardware WDT is on, the watchdog eventually bites and
> reboots the system.

It would be interesting to get the crash trace printed by the watchdog.
It would also help to use a debug build of the hypervisor, which might
trigger some assertions inside Xen that could point to the cause of the
issue.

Roger.
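
The options discussed in this thread can be collected into a single Xen command line. The following is a small sketch of doing that without duplicating options that are already present. It assumes that the watchdog meant above is Xen's own NMI watchdog, enabled with the "watchdog" option, and that the result goes into GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub, as is usual on Debian; both are assumptions, and the helper names are hypothetical.

```python
def merge_cmdline(existing, extra):
    """Return `existing` with the `extra` options added.

    Options are compared by the name before '='. A later option replaces
    an earlier one with the same name and moves to the end of the line.
    """
    merged = {}
    for option in existing.split() + list(extra):
        name = option.split("=", 1)[0]
        merged.pop(name, None)  # drop the older setting, if any
        merged[name] = option
    return " ".join(merged.values())


def grub_line(cmdline):
    """Format a command line as a GRUB_CMDLINE_XEN_DEFAULT assignment."""
    return f'GRUB_CMDLINE_XEN_DEFAULT="{cmdline}"'


# Options already in use according to the original post.
XEN_CURRENT = "loglvl=all guest_loglvl=all gnttab_max_frames=256"

# iommu=debug is suggested above; "watchdog" is an assumption (see lead-in).
XEN_DEBUG_EXTRA = ["iommu=debug", "watchdog"]

if __name__ == "__main__":
    print(grub_line(merge_cmdline(XEN_CURRENT, XEN_DEBUG_EXTRA)))
```

After editing /etc/default/grub, the GRUB configuration has to be regenerated (update-grub on Debian) and the machine rebooted before the new options take effect.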
