vif-bridge errors when creating and destroying dozens of VMs simultaneously

vif-bridge errors when creating and destroying dozens of VMs simultaneously

Antony Saba-2
Hello xen-users,

We are seeing the following errors repeatedly while trying to create
domains using a script, with the end result that 2 or 3 out of about
20 VMs fail to start, and there are stale entries in the iptables for
domains that have been destroyed.


   2017-05-10 11:45:40 UTC libxl: error: libxl_exec.c:118:libxl_report_child_exitstatus: /etc/xen/scripts/vif-bridge remove [18767] exited with error status 4
   2017-05-10 11:50:52 UTC libxl: error: libxl_exec.c:118:libxl_report_child_exitstatus: /etc/xen/scripts/vif-bridge offline [1554] exited with error status 4

I've been testing the following patch of vif-common.sh over the last
day and it appears to resolve the issue.  iptables exits with status 4
when "Another app is currently holding the xtables lock."

Does this solution seem reasonable?

Thanks.

--- /etc/xen/scripts/vif-common.sh.bak 2017-05-15 18:57:34.549288900 +0000
+++ /etc/xen/scripts/vif-common.sh 2017-05-15 18:58:01.361208788 +0000
@@ -154,12 +154,13 @@
# binary is not sufficient, because the user may not have the appropriate
# modules installed. If iptables is not working, then there's no need to do
# anything with it, so we can just return.
+ claim_lock "iptables"
if ! iptables -L -n >&/dev/null
then
+ release_lock "iptables"
return
fi
- claim_lock "iptables"
if [ "$ip" != "" ]
then
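
A minimal sketch of the ordering this patch aims for, not the Xen implementation: the probe and the update both run while the lock is held, and the early-return path releases the lock. The names claim_sketch_lock, release_sketch_lock, guarded_update and LOCK_DIR are hypothetical, and the mkdir-based lock is only a stand-in for claim_lock/release_lock.

```shell
# Hypothetical stand-ins for claim_lock/release_lock, built on mkdir,
# which atomically either creates the directory or fails.
LOCK_DIR=${LOCK_DIR:-/tmp/iptables-sketch.lock}

claim_sketch_lock() {
    # Poll until we are the caller that created the directory.
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        sleep 1
    done
}

release_sketch_lock() {
    rmdir "$LOCK_DIR"
}

# guarded_update PROBE_CMD UPDATE_CMD
# Runs the probe and the update inside one critical section.
guarded_update() {
    claim_sketch_lock
    if ! "$1" >/dev/null 2>&1; then
        release_sketch_lock    # the early return must not leak the lock
        return 0
    fi
    rc=0
    "$2" || rc=$?              # remember the update's status
    release_sketch_lock
    return "$rc"
}
```

Here PROBE_CMD would be a function wrapping "iptables -L -n" and UPDATE_CMD a function that applies the rule changes. Note that this only serializes callers that use the same lock; it does nothing about other programs touching iptables.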



--
Antony Saba, [hidden email]

_______________________________________________
Xen-users mailing list
[hidden email]
https://lists.xen.org/xen-users

Re: vif-bridge errors when creating and destroying dozens of VMs simultaneously

George Dunlap-5
cc'ing xen-devel & some relevant people

On Tue, May 16, 2017 at 4:21 PM, Antony Saba <[hidden email]> wrote:

> Hello xen-users,
>
> We are seeing the following errors repeatedly while trying to create
> domains using a script, with the end result that 2 or 3 out of about
> 20 VMs fail to start, and there are stale entries in the iptables for
> domains that have been destroyed.
>
>
>    2017-05-10 11:45:40 UTC libxl: error:
> libxl_exec.c:118:libxl_report_child_exitstatus:
> /etc/xen/scripts/vif-bridge remove [18767] exited with error status 4
>    2017-05-10 11:50:52 UTC libxl: error:
> libxl_exec.c:118:libxl_report_child_exitstatus:
> /etc/xen/scripts/vif-bridge offline [1554] exited with error status 4
>
> I've been testing the following patch of vif-common.sh over the last
> day and it appears to resolve the issue.  iptables exits with status 4
> when "Another app is currently holding the xtables lock."
>
> Does this solution seem reasonable?
>
> Thanks.
>
> --- /etc/xen/scripts/vif-common.sh.bak 2017-05-15 18:57:34.549288900 +0000
> +++ /etc/xen/scripts/vif-common.sh 2017-05-15 18:58:01.361208788 +0000
> @@ -154,12 +154,13 @@
> # binary is not sufficient, because the user may not have the appropriate
> # modules installed. If iptables is not working, then there's no need to do
> # anything with it, so we can just return.
> + claim_lock "iptables"
> if ! iptables -L -n >&/dev/null
> then
> + release_lock "iptables"
> return
> fi
> - claim_lock "iptables"
> if [ "$ip" != "" ]
> then


Re: vif-bridge errors when creating and destroying dozens of VMs simultaneously

Roger Pau Monné-3
On Wed, May 17, 2017 at 10:04:40AM +0100, George Dunlap wrote:
> cc'ing xen-devel & some relevant people

Please bear with me, my knowledge of iptables is 0.

> On Tue, May 16, 2017 at 4:21 PM, Antony Saba <[hidden email]> wrote:
> > Hello xen-users,
> >
> > We are seeing the following errors repeatedly while trying to create
> > domains using a script, with the end result that 2 or 3 out of about
> > 20 VMs fail to start, and there are stale entries in the iptables for
> > domains that have been destroyed.
> >
> >
> >    2017-05-10 11:45:40 UTC libxl: error:
> > libxl_exec.c:118:libxl_report_child_exitstatus:
> > /etc/xen/scripts/vif-bridge remove [18767] exited with error status 4
> >    2017-05-10 11:50:52 UTC libxl: error:
> > libxl_exec.c:118:libxl_report_child_exitstatus:
> > /etc/xen/scripts/vif-bridge offline [1554] exited with error status 4
> >
> > I've been testing the following patch of vif-common.sh over the last
> > day and it appears to resolve the issue.  iptables exits with status 4
> > when "Another app is currently holding the xtables lock."

So, an iptables command can fail randomly because there's someone else holding
an iptables internal lock?

Isn't there any way to tell the iptables command to just block until it can get
the lock? This seems extremely racy; aren't people then forced to use something
like:

while true; do
        iptables <...>
        rc=$?   # save the status; the test below overwrites $?
        if [ $rc -eq 0 ]; then
                break
        elif [ $rc -ne 4 ]; then
                error ...
        fi
done

When dealing with iptables?
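
A bounded variant of that loop might look like the sketch below. It is only an illustration under stated assumptions: retry_on_lock and RETRY_DELAY are hypothetical names, and the only fact taken from this thread is that status 4 is what iptables returns when the xtables lock is held.

```shell
# retry_on_lock MAX_TRIES CMD [ARGS...]
# Re-runs CMD while it exits with status 4, at most MAX_TRIES times.
retry_on_lock() {
    local max_tries=$1
    shift
    local tries=0 rc
    while true; do
        rc=0
        "$@" || rc=$?          # capture the status at once; later tests overwrite $?
        if [ "$rc" -eq 0 ]; then
            return 0
        elif [ "$rc" -ne 4 ]; then
            return "$rc"       # some other failure: retrying will not help
        fi
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            return 4           # still locked after the last attempt
        fi
        sleep "${RETRY_DELAY:-1}"
    done
}
```

Usage would be along the lines of "retry_on_lock 10 iptables <...>". The attempt limit is what gives the caller a timeout of roughly MAX_TRIES times the delay.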

> > Does this solution seem reasonable?

I'm not sure. This protects you from other hotplug scripts poking concurrently
at iptables, but what about the system administrator? It still seems racy to
me.

> > Thanks.
> >
> > --- /etc/xen/scripts/vif-common.sh.bak 2017-05-15 18:57:34.549288900 +0000
> > +++ /etc/xen/scripts/vif-common.sh 2017-05-15 18:58:01.361208788 +0000
> > @@ -154,12 +154,13 @@
> > # binary is not sufficient, because the user may not have the appropriate
> > # modules installed. If iptables is not working, then there's no need to do
> > # anything with it, so we can just return.
> > + claim_lock "iptables"
> > if ! iptables -L -n >&/dev/null
> > then
> > + release_lock "iptables"
> > return
> > fi
> > - claim_lock "iptables"
> > if [ "$ip" != "" ]
> > then


Re: vif-bridge errors when creating and destroying dozens of VMs simultaneously

George Dunlap-5
On 17/05/17 10:45, Roger Pau Monné wrote:

> On Wed, May 17, 2017 at 10:04:40AM +0100, George Dunlap wrote:
>> cc'ing xen-devel & some relevant people
>
> Please bear with me, my knowledge of iptables is 0.
>
>> On Tue, May 16, 2017 at 4:21 PM, Antony Saba <[hidden email]> wrote:
>>> Hello xen-users,
>>>
>>> We are seeing the following errors repeatedly while trying to create
>>> domains using a script, with the end result that 2 or 3 out of about
>>> 20 VMs fail to start, and there are stale entries in the iptables for
>>> domains that have been destroyed.
>>>
>>>
>>>    2017-05-10 11:45:40 UTC libxl: error:
>>> libxl_exec.c:118:libxl_report_child_exitstatus:
>>> /etc/xen/scripts/vif-bridge remove [18767] exited with error status 4
>>>    2017-05-10 11:50:52 UTC libxl: error:
>>> libxl_exec.c:118:libxl_report_child_exitstatus:
>>> /etc/xen/scripts/vif-bridge offline [1554] exited with error status 4
>>>
>>> I've been testing the following patch of vif-common.sh over the last
>>> day and it appears to resolve the issue.  iptables exits with status 4
>>> when "Another app is currently holding the xtables lock."
>
> So, an iptables command can fail randomly because there's someone else holding
> an iptables internal lock?
>
> Isn't there any way to tell the iptables command to just block until it can get
> the lock? This seems extremely racy; aren't people then forced to use something
> like:
>
> while true; do
>         iptables <...>
>         rc=$?   # save the status; the test below overwrites $?
>         if [ $rc -eq 0 ]; then
>                 break
>         elif [ $rc -ne 4 ]; then
>                 error ...
>         fi
> done
>
> When dealing with iptables?

This seems to be a common problem ([1][2][3] come up right away).

The basic solution seems to be to add the '-w' option to have it wait
for the lock.  It does seem like that should be the default though.
Having commands normally run inside of scripts randomly fail unless you
add the special "don't randomly fail" option seems a bit mad.

 -George


[1] https://github.com/kubernetes/kubernetes/issues/7370
[2] https://github.com/docker/for-mac/issues/285
[3] https://serverfault.com/questions/805718/iptables-another-app-is-currently-holding-the-xtables-lock



Re: [Xen-devel] vif-bridge errors when creating and destroying dozens of VMs simultaneously

George Dunlap-5
On Wed, May 17, 2017 at 11:10 AM, George Dunlap
<[hidden email]> wrote:

> On 17/05/17 10:45, Roger Pau Monné wrote:
>> On Wed, May 17, 2017 at 10:04:40AM +0100, George Dunlap wrote:
>>> cc'ing xen-devel & some relevant people
>>
>> Please bear with me, my knowledge of iptables is 0.
>>
>>> On Tue, May 16, 2017 at 4:21 PM, Antony Saba <[hidden email]> wrote:
>>>> Hello xen-users,
>>>>
>>>> We are seeing the following errors repeatedly while trying to create
>>>> domains using a script, with the end result that 2 or 3 out of about
>>>> 20 VMs fail to start, and there are stale entries in the iptables for
>>>> domains that have been destroyed.
>>>>
>>>>
>>>>    2017-05-10 11:45:40 UTC libxl: error:
>>>> libxl_exec.c:118:libxl_report_child_exitstatus:
>>>> /etc/xen/scripts/vif-bridge remove [18767] exited with error status 4
>>>>    2017-05-10 11:50:52 UTC libxl: error:
>>>> libxl_exec.c:118:libxl_report_child_exitstatus:
>>>> /etc/xen/scripts/vif-bridge offline [1554] exited with error status 4
>>>>
>>>> I've been testing the following patch of vif-common.sh over the last
>>>> day and it appears to resolve the issue.  iptables exits with status 4
>>>> when "Another app is currently holding the xtables lock."
>>
>> So, an iptables command can fail randomly because there's someone else holding
>> an iptables internal lock?
>>
>> Isn't there any way to tell the iptables command to just block until it can get
>> the lock? This seems extremely racy; aren't people then forced to use something
>> like:
>>
>> while true; do
>>       iptables <...>
>>       rc=$?   # save the status; the test below overwrites $?
>>       if [ $rc -eq 0 ]; then
>>               break
>>       elif [ $rc -ne 4 ]; then
>>               error ...
>>       fi
>> done
>>
>> When dealing with iptables?
>
> This seems to be a common problem ([1][2][3] come up right away).
>
> The basic solution seems to be to add the '-w' option to have it wait
> for the lock.  It does seem like that should be the default though.
> Having commands normally run inside of scripts randomly fail unless you
> add the special "don't randomly fail" option seems a bit mad.

Hmm, looking more into it:

* The -w option was introduced at the same time that the locking was
introduced [1].  So any version that has locking will have the -w
option.

* The bare -w option doesn't introduce a timeout, so in the case that
the xtables lock wasn't released, the script will hang indefinitely.
A '-W' option was introduced in 2016 to introduce a timeout, but this
is on even fewer systems than the -w option.  (My desktop, running
Debian Jessie, doesn't seem to have the -W option for instance.)

* The return code, RESOURCE_PROBLEM, is also returned for other
reasons; but it looks like retrying would not be a bad strategy in
most of those cases either.

* But the -w option was only introduced in 2013, so it's likely there
are still old versions of iptables around that don't have it.

The good news is that versions without the -w option will *also* not
fail with error code 4 (although they may fail in other ways in the
case of concurrent accesses instead).
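
For reference, a small sketch that puts names to the exit statuses discussed here. The numeric values assume the xtables_exittype enum in the iptables headers (OTHER_PROBLEM = 1 and counting up from there); describe_iptables_status is a hypothetical helper, not part of any script in this thread.

```shell
# describe_iptables_status STATUS
# Prints the symbolic name for an iptables exit status (assumed mapping).
describe_iptables_status() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "OTHER_PROBLEM" ;;
        2) echo "PARAMETER_PROBLEM" ;;   # e.g. an option iptables does not know
        3) echo "VERSION_PROBLEM" ;;
        4) echo "RESOURCE_PROBLEM" ;;    # includes "xtables lock is held"
        *) echo "UNKNOWN" ;;
    esac
}
```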

So we have three options:

1. Always add -w.  This will effectively drop support for systems
which don't have iptables -w.  It also wouldn't allow us to reliably
set a timeout.

2. Always do a loop.  This should work on all systems, but is
redundant for systems with -w and unnecessary on systems without.  On
the other hand, it would allow us to implement our own timeout even on
systems without the -W option.

3. Try to check to see if the version of iptables we have supports -w,
and use it if available.  This should also work on all systems, but
introduces a bit of complication.  It also doesn't allow us to
reliably use a timeout.
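
Option 3 could be sketched roughly as follows. This is an illustration only, not the patch that was later attached to this thread: IPTABLES, IPTABLES_W and detect_wait_flag are hypothetical names, and the probe simply reuses the "-L -n" listing that vif-common.sh already runs.

```shell
# Command to probe; overridable so the sketch can be tried with a fake.
IPTABLES=${IPTABLES:-iptables}

# detect_wait_flag
# Sets IPTABLES_W to "-w" when the installed iptables accepts it, else "".
# Returns non-zero when iptables is unusable with or without -w.
detect_wait_flag() {
    IPTABLES_W=""
    if "$IPTABLES" -w -L -n >/dev/null 2>&1; then
        IPTABLES_W="-w"
        return 0
    fi
    # The -w probe failed. If the plain probe works, -w was the problem and
    # we carry on without it; otherwise iptables is not usable at all.
    "$IPTABLES" -L -n >/dev/null 2>&1
}
```

Later calls would then be written as "$IPTABLES" $IPTABLES_W <...>, leaving $IPTABLES_W unquoted so that an empty value disappears from the command line.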

Any thoughts?

 -George

[1] https://git.netfilter.org/iptables/commit/?id=93587a04d0f2511e108bbc4d87a8b9d28a5c5dd8


Re: [Xen-devel] vif-bridge errors when creating and destroying dozens of VMs simultaneously

Ian Jackson-2
George Dunlap writes ("Re: [Xen-devel] [Xen-users] vif-bridge errors when creating and destroying dozens of VMs simultaneously"):
> So we have three options:
...
> 3. Try to check to see if the version of iptables we have supports -w,
> and use it if available.  This should also work on all systems, but
> introduces a bit of complication.  It also doesn't allow us to
> reliably use a timeout.

I think this is best.  Eventually we can get rid of the check for -w.

I think a timeout in this context is not very helpful.

Also, a loop, on a busy system, might need to have many attempts,
because it will be polling.

As I said on irc:

  If iptables fails to release its lock, then surely everything is going
  to be bust forever more, at least until someone manages to unstick it
  and get the lock released ?

  I'm not sure it's worth a lot of effort to try to contain the
  consequences.

Ian.


Re: [Xen-devel] vif-bridge errors when creating and destroying dozens of VMs simultaneously

George Dunlap-5
On 17/05/17 13:43, Ian Jackson wrote:

> George Dunlap writes ("Re: [Xen-devel] [Xen-users] vif-bridge errors when creating and destroying dozens of VMs simultaneously"):
>> So we have three options:
> ...
>> 3. Try to check to see if the version of iptables we have supports -w,
>> and use it if available.  This should also work on all systems, but
>> introduces a bit of complication.  It also doesn't allow us to
>> reliably use a timeout.
>
> I think this is best.  Eventually we can get rid of the check for -w.
>
> I think a timeout in this context is not very helpful.
>
> Also, a loop, on a busy system, might need to have many attempts,
> because it will be polling.

FWIW the iptables internal mechanism will try to grab the lock, and if
it fails (and -w is set), will call sleep(1) before trying again.  My
bash loop would do exactly the same thing.

But I agree that if timeouts are not important, doing it via iptables is
probably cleaner.  Let me work up a patch.

 -George



Re: [Xen-devel] vif-bridge errors when creating and destroying dozens of VMs simultaneously

George Dunlap-5
On Wed, May 17, 2017 at 1:46 PM, George Dunlap <[hidden email]> wrote:

> On 17/05/17 13:43, Ian Jackson wrote:
>> George Dunlap writes ("Re: [Xen-devel] [Xen-users] vif-bridge errors when creating and destroying dozens of VMs simultaneously"):
>>> So we have three options:
>> ...
>>> 3. Try to check to see if the version of iptables we have supports -w,
>>> and use it if available.  This should also work on all systems, but
>>> introduces a bit of complication.  It also doesn't allow us to
>>> reliably use a timeout.
>>
>> I think this is best.  Eventually we can get rid of the check for -w.
>>
>> I think a timeout in this context is not very helpful.
>>
>> Also, a loop, on a busy system, might need to have many attempts,
>> because it will be polling.
>
> FWIW the iptables internal mechanism will try to grab the lock, and if
> it fails (and -w is set), will call sleep(1) before trying again.  My
> bash loop would do exactly the same thing.
>
> But I agree that if timeouts are not important, doing it via iptables is
> probably cleaner.  Let me work up a patch.

Antony,

Attached is a patch to add the -w option if it's available.  I've
smoke-tested that it works under normal conditions; but my simplistic
attempts to get the bug to trigger have failed.  Can you give it a try
and see if it works?

Thanks,
 -George


Attachment: 0001-vif-common.sh-Have-iptables-wait-for-the-xtables-loc.patch (4K)

Re: [Xen-devel] vif-bridge errors when creating and destroying dozens of VMs simultaneously

Adam Goryachev-3
On 17/5/17 23:44, George Dunlap wrote:

>
> Antony,
>
> Attached is a patch to add the -w option if it's available.  I've
> smoke-tested that it works under normal conditions; but my simplistic
> attempts to get the bug to trigger have failed.  Can you give it a try
> and see if it works?
>
> Thanks,
>   -George
>

Apologies if this is just noise, but perhaps it is easier to fix this
before it is committed... (small typo in the patch)

+ # If we fail with PARAMETER_PROBLEM with -w and don't fail
+ # with PARAMETER_PRIBLEM without it, then it's the -w option

That should be:
+ # If we fail with PARAMETER_PROBLEM with -w and don't fail
+ # with PARAMETER_PROBLEM without it, then it's the -w option

Regards,
Adam



Re: [Xen-devel] vif-bridge errors when creating and destroying dozens of VMs simultaneously

George Dunlap-5
On Wed, May 17, 2017 at 3:09 PM, Adam Goryachev
<[hidden email]> wrote:

> On 17/5/17 23:44, George Dunlap wrote:
>>
>>
>> Antony,
>>
>> Attached is a patch to add the -w option if it's available.  I've
>> smoke-tested that it works under normal conditions; but my simplistic
>> attempts to get the bug to trigger have failed.  Can you give it a try
>> and see if it works?
>>
>> Thanks,
>>   -George
>>
>
> Apologies if this is just noise, but perhaps it is easier to fix this before
> it is committed... (small typo in the patch)
>
> +               # If we fail with PARAMETER_PROBLEM with -w and don't fail
> +               # with PARAMETER_PRIBLEM without it, then it's the -w option
>
> That should be:
> +               # If we fail with PARAMETER_PROBLEM with -w and don't fail
> +               # with PARAMETER_PROBLEM without it, then it's the -w option

Definitely appreciated, thanks. :-)

 -George


Re: [Xen-devel] vif-bridge errors when creating and destroying dozens of VMs simultaneously

Antony Saba-2
In reply to this post by George Dunlap-5
On Wed, May 17, 2017 at 7:44 AM, George Dunlap <[hidden email]> wrote:

>
> Antony,
>
> Attached is a patch to add the -w option if it's available.  I've
> smoke-tested that it works under normal conditions; but my simplistic
> attempts to get the bug to trigger have failed.  Can you give it a try
> and see if it works?
>
> Thanks,
>  -George

No problem, I'll apply it to one of the machines showing the issue and
run it overnight.

Thanks.

-Tony




--
Antony Saba, [hidden email]


Re: [Xen-devel] vif-bridge errors when creating and destroying dozens of VMs simultaneously

Antony Saba-2
In reply to this post by George Dunlap-5
George,

Patch works as expected, no failures on create and no stale iptables
rules after running under the same load that was producing the errors
previously.

Ubuntu 16.04
Linux 3.13.0-83-generic
iptables v1.6.0
Xen 4.6.5 from distro packages

Thanks!

-Tony

On Wed, May 17, 2017 at 7:44 AM, George Dunlap <[hidden email]> wrote:

> On Wed, May 17, 2017 at 1:46 PM, George Dunlap <[hidden email]> wrote:
>> On 17/05/17 13:43, Ian Jackson wrote:
>>> George Dunlap writes ("Re: [Xen-devel] [Xen-users] vif-bridge errors when creating and destroying dozens of VMs simultaneously"):
>>>> So we have three options:
>>> ...
>>>> 3. Try to check to see if the version of iptables we have supports -w,
>>>> and use it if available.  This should also work on all systems, but
>>>> introduces a bit of complication.  It also doesn't allow us to
>>>> reliably use a timeout.
>>>
>>> I think this is best.  Eventually we can get rid of the check for -w.
>>>
>>> I think a timeout in this context is not very helpful.
>>>
>>> Also, a loop, on a busy system, might need to have many attempts,
>>> because it will be polling.
>>
>> FWIW the iptables internal mechanism will try to grab the lock, and if
>> it fails (and -w is set), will call sleep(1) before trying again.  My
>> bash loop would do exactly the same thing.
>>
>> But I agree that if timeouts are not important, doing it via iptables is
>> probably cleaner.  Let me work up a patch.
>
> Antony,
>
> Attached is a patch to add the -w option if it's available.  I've
> smoke-tested that it works under normal conditions; but my simplistic
> attempts to get the bug to trigger have failed.  Can you give it a try
> and see if it works?
>
> Thanks,
>  -George



--
Antony Saba, [hidden email]
