I run a few Xen hosts (Supermicro X8D*, i.e. Intel Tylersburg) with
a few guests. From time to time strange things happen; this one seems
easy to describe:
I migrated a PV Linux guest (Slackware 14.1, kernel 4.12.2) from one
host (Xen 4.8.1, kernel 4.11.0) to another (Xen 4.9.0, kernel 4.12.4).
Shortly after migration some daemons on the guest machine crashed; this
is what dmesg shows:
[18440203865.237646] Suspended for 2.605 seconds
[18440203865.267996] PM: noirq restore of devices complete after 0.212 msecs
[18440203865.268170] PM: early restore of devices complete after 0.115 msecs
[18440203865.282097] PM: restore of devices complete after 12.602 msecs
[18440203865.282167] OOM killer enabled.
[18440203865.282168] Restarting tasks ... done.
[18440203865.283838] xen:manage: Unable to read sysrq code in control/sysrq
[18440203865.379604] dbus-daemon: segfault at 0 ip (null) sp 00007ffce31aaf10 error 14 in dbus-daemon[400000+61000]
[18440203865.381385] ntpd: segfault at 8 ip 00007f81f24c3dc9 sp 00007ffdee851c90 error 4 in ld-2.17.so[7f81f24b5000+23000]
[18440204017.834883] bash: segfault at 0 ip 00007fe5bd185c2d sp 00007fff675a6b78 error 4 in libc-2.17.so[7fe5bd0f9000+1bf000]
[18440204017.865750] sshd: segfault at 7fa09372afa8 ip 00007fa093517429 sp 00007ffc4b605838 error 7 in
[18440204228.000316] automount: segfault at 8 ip 00007f46a9a03153 sp 00007f46a975f990 error 4 in libc-2.17.so[7f46a9983000+1bf000]
[18440204729.291952] fail2ban-server: segfault at 0 ip 00007ff8339f7c2c sp 00007ff82f7861c0 error 4 in
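For what it's worth, the "error" field in those segfault lines can be decoded. The sketch below assumes that the kernel prints the x86 page-fault error code in hex and that the usual bit layout applies (bit 0 present/protection, bit 1 write, bit 2 user mode, bit 3 reserved bit, bit 4 instruction fetch). The table and the function name decode_pf_error are only my own illustration, not anything taken from the kernel source:

```python
# Assumed x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write", "read"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_pf_error(hex_text):
    """Decode the 'error N' field of a segfault line (N is assumed to be hex)."""
    code = int(hex_text, 16)
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        label = if_set if code & mask else if_clear
        if label is not None:
            parts.append(label)
    return parts


# The three distinct codes that appear in the log above.
for field in ("14", "4", "7"):
    print(field, "->", ", ".join(decode_pf_error(field)))
```

If that reading is right, "error 14" on dbus-daemon is a user-mode instruction fetch from an unmapped page (consistent with the null ip), "error 4" is a user-mode read of an unmapped page, and "error 7" on sshd is a user-mode write to a page that is present but not writable.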
What seems suspicious to me are the timestamps: I'm quite sure that none
of the machines has been up for more than 500 years.
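To put numbers on that, here is a small sketch of the arithmetic. It assumes the dmesg timestamp comes from an unsigned 64-bit nanosecond counter, which I believe is the case but have not verified for this kernel. The helper names parse_ns, as_signed_ns and describe are made up for this example:

```python
from decimal import Decimal

NS_PER_S = 10**9
WRAP_NS = 2**64                      # size of an unsigned 64-bit counter
SECONDS_PER_YEAR = 365.25 * 86400


def parse_ns(stamp):
    """Convert a dmesg 'seconds.microseconds' string to integer nanoseconds."""
    return int(Decimal(stamp) * NS_PER_S)


def as_signed_ns(ns):
    """Reinterpret an unsigned 64-bit value as a signed two's-complement value."""
    return ns - WRAP_NS if ns >= 2**63 else ns


def describe(stamp):
    """Return the timestamp as unsigned years and as signed days."""
    ns = parse_ns(stamp)
    return {
        "years_unsigned": ns / NS_PER_S / SECONDS_PER_YEAR,
        "days_signed": as_signed_ns(ns) / NS_PER_S / 86400,
    }


# "Suspended for ..." timestamps from the crashing guest and from the other guest.
for stamp in ("18440203865.237646", "18439429870.442886"):
    print(stamp, describe(stamp))
```

Read as unsigned values, both timestamps are a bit over 584 years. Read as signed 64-bit nanoseconds, they come out at roughly -76 and -85 days, i.e. just short of the point where the counter would wrap. That looks to me more like a negative clock offset after the restore than a real uptime, but that is only a guess.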
The only other thing I found is that xl dmesg is full of this message:
"(XEN) tmem: operation requested on uncreated pool"
A different guest (kernel 4.12.3) seems fine so far; its dmesg says:
[18439429870.442886] Suspended for 2.513 seconds
[18439429870.443112] PM: noirq restore of devices complete after 0.157 msecs
[18439429870.443249] PM: early restore of devices complete after 0.116 msecs
[18439429870.464423] PM: restore of devices complete after 19.453 msecs
[18439429870.464498] OOM killer enabled.
[18439429870.464498] Restarting tasks ... done.
[18439429870.466351] xen:manage: Unable to read sysrq code in control/sysrq