High IO on the system prevents MAAS from providing PXE config in a timely manner (less than 30 secs)

Bug #1743249 reported by Jason Hobbs
This bug affects 1 person
Affects         Status        Importance  Assigned to              Milestone
MAAS            Fix Released  Undecided   Unassigned
grub2 (Ubuntu)  Fix Released  Medium      Mathieu Trudel-Lapierre

Bug Description

A node failed to deploy after it failed to retrieve a grub.cfg from MAAS due to a timeout. In the logs, it's clear that the server tried to retrieve the grub cfg many times, over about 30 seconds:

http://paste.ubuntu.com/26387256/

We see the same thing for other hosts around the same time:

http://paste.ubuntu.com/26387262/

It seems like MAAS is taking way too long to respond to these requests.

This is very similar to bug 1724677, which was happening pre-meltdown/spectre. The only difference is we don't see "[critical] TFTP back-end failed" in the logs anymore.

I connected to the console on this system and it had errors about timing out retrieving the grub cfg, then an error message along the lines of "error not an ip", and then "double free". After I connected, but before I could get a screenshot, the system rebooted and was directed by MAAS to power off, which it did successfully after booting to Linux.

Full logs are available here: https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar

This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Hit this again, here's an image: https://imgur.com/a/as8Sx

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Jason,

Can you confirm:

1. I'm guessing you are running MAAS in HA. Can you remind us of your HA configuration (3 regions, 2 racks? Do the racks connect to the regions via the VIP?)
2. From what rack controller did it boot?
3. Was the chosen region under high load?
4. Could this be another HAProxy issue?

Changed in maas:
status: New → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Also, I see this in the log. Could this potentially be something that's confusing the bootloader?

Jan 14 19:33:59 meinfoo maas.rpc.rackcontrollers: [warn] meinfoo: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on meinfoo': 10.245.208.5
Jan 14 19:35:16 meinfoo maas.rpc.rackcontrollers: [warn] swoobat: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on swoobat': 10.245.208.5
Jan 14 19:39:22 meinfoo maas.rpc.rackcontrollers: [warn] leafeon: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on leafeon': 10.245.208.5
Jan 14 19:43:59 meinfoo maas.rpc.rackcontrollers: [warn] meinfoo: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on meinfoo': 10.245.208.5
Jan 14 19:45:16 meinfoo maas.rpc.rackcontrollers: [warn] swoobat: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on swoobat': 10.245.208.5
Jan 14 19:53:59 meinfoo maas.rpc.rackcontrollers: [warn] meinfoo: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on meinfoo': 10.245.208.5
Jan 14 19:55:16 meinfoo maas.rpc.rackcontrollers: [warn] swoobat: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on swoobat': 10.245.208.5
Jan 14 19:59:22 meinfoo maas.rpc.rackcontrollers: [warn] leafeon: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on leafeon': 10.245.208.5

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Also, in another of your logs, I saw this:

2018-01-14 18:12:38 maasserver.bootresources: [critical] Importing boot resources failed.

Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 842, in worker
    return target()
  File "/usr/lib/python3/dist-packages/twisted/_threads/_threadworker.py", line 46, in work
    task()
  File "/usr/lib/python3/dist-packages/twisted/_threads/_team.py", line 190, in doWork
    task()
--- <exception caught here> ---
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 875, in callInContext
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 232, in wrapper
    result = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 169, in wrapper
    return func_in_reactor(*args, **kwargs).wait(LONGTIME)
  File "/usr/lib/python3/dist-packages/crochet/_eventloop.py", line 231, in wait
    result.raiseException()
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 368, in raiseException
    raise self.value.with_traceback(self.tb)
  File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3/dist-packages/maasserver/bootresources.py", line 1262, in eb_import
    failure.trap(DatabaseLockNotHeld)
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 342, in trap
    self.raiseException()
  File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 368, in raiseException
    raise self.value.with_traceback(self.tb)
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 875, in callInContext
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twist...


Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

On Wed, Jan 17, 2018 at 11:13 AM, Andres Rodriguez
<email address hidden> wrote:
> Also, I see this in the log. Could this potentially be something that's
> confusing the bootloader?
>
> [external DHCP server warnings quoted from the previous comment]

No, this is expected. This vlan is not connected to the booting machine.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

> Are you using a custom image mirror or are you using the upstream mirror ? Is there any customization that would mean that the keys are not available?

We're using the default upstream mirror, without any customization.
Note that it works fine usually, and on other machines around the same
time.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

On Wed, Jan 17, 2018 at 11:11 AM, Andres Rodriguez
<email address hidden> wrote:
> Hi Jason,
>
> Can you confirm:
>
> 1. I'm guessing you are running MAAS in HA. Can you remind us of your HA configuration (3 regions, 2 racks? The racks connect to the regions via the VIP ?).

We have 3 region controllers and 2 active rack controllers (rack
controller packages installed on all 3 but only configured to be
active on 2).

> 2. From what rack controller did it boot?

It looks like from 10.244.40.32, but I expect you're more skilled at
interpreting the logs than I am.

> 3. Was the chosen region with high load?

We don't have load monitoring software installed on the MAAS servers.
The MAAS servers are 10-core, 128 GB RAM Gen9 HPE servers. They're
hosting MAAS, postgresql, and some VMs.

> 4. Could this be another HAProxy issue?

I don't know. It's timing out retrieving the grub cfg from the rack
controller - where would HAproxy come into play there? I don't know
enough about MAAS internals to answer. I know rack has to get some
info from the region controller for this, but I would expect this goes
through RPC, which does not go through HAproxy as far as I know.

Changed in maas:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

We hit this again, and on a single deploy had 3 nodes fail trying to retrieve grub cfg, and another one failed to get network info from MAAS data source after it made it to cloud-init:

https://paste.ubuntu.com/26409138/

Despite the fact that it made it that far and that event got logged, there is nothing in rsyslog.

I watched load on the machine during an earlier test and it was up around 11 on a 10 core (20 thread) system, so that's pretty high. There was also a constant 5-35MB/s disk write. We're using bcache, but that's still a high write load and things weren't feeling very responsive.

I think this high load is likely partially to blame, but it's strange that some operations in MAAS seem to work fine and others don't.

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Ok, so, as I understand it, this is not an issue that was present before. In other words, this issue only started appearing recently.

Looking at grub updates, it seems that a new grub update was made available shortly before you started seeing this issue:

https://launchpad.net/ubuntu/+source/grub2/2.02~beta2-36ubuntu3.15

I'd assume that since this issue wasn't happening before and is happening now, and since MAAS hasn't changed, a change somewhere else (such as grub) has broken things.

Can you confirm a few more things:

1. Is your system configured to only UEFI boot, or is it also configured to *secure* boot?
2. What happens if you change your system to *secure* boot? Can you try that?

Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

That's not a sound assumption. It's a race condition - if MAAS has had an issue where it responds slowly to grub cfg requests for a long time, and it's now sometimes taking just a little longer than it used to, that can be enough to push it over the edge. If it used to take 29 seconds, the timeout is 30 seconds, and it now sometimes takes 31 seconds, we will see failures we didn't see before due to a 1 second change in time - not necessarily a new bug anywhere else. We did just get kernel patches for meltdown/spectre that reduce performance... that seems like a possible culprit to me.

Note also that one system made it to running cloud-init and was unable to retrieve network config info from the maas data source. How could that possibly be a grub (or secure boot) issue?

Can you please explain the relevance of secure boot? The system is running a grub it retrieved from the network, and the systems usually boot - it's a race condition where they sometimes don't. This isn't something that either always works or always fails - I don't see how secure boot could possibly be related.

Changed in maas:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

The servers are booting in UEFI, but not secure boot. They do not
have TPMs so they are not capable of secure boot.


Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

Indeed, the meltdown kernel fix was released on the 9th of January. This will definitely have had a performance impact on MAAS, which could explain why it's slower now and we're getting timeouts.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We've been testing with kpti disabled on our maas servers, and we haven't been hitting failed deployments due to this anymore. We're still seeing many repeated requests for grub-cfg, but it's not timing out anymore. It seems like it was only barely working before, and kpti's performance hit pushed it over the edge.

Revision history for this message
Chris Gregan (cgregan) wrote :

Escalated to Field High.

Changed in maas:
status: New → Incomplete
Revision history for this message
Andres Rodriguez (andreserl) wrote :

As per IRC discussion over the last week:

 - This issue started presenting itself after the security upgrades for meltdown.
 - This is suspected to be a performance issue caused by that security update.
 - We know that due to meltdown, postgresql performance is about 40% lower.
 - MAAS runs with all 5 processes (4 regionds, 1 rackd).
 - There are 5 VMs running on the same system (details unknown) that seem to be booting at the same time.
 - The system has 10 cores.

So, based on basic analysis, this issue could be a direct result of the VMs contending for resources, MAAS itself using significant resources alongside other processes, and the meltdown patches.

So far, through our investigation, we have not identified any specific issue in MAAS to fix in order to improve/avoid this issue.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We are seeing another variant of this, where nodes end up powered off instead of hung. What happens there is:

1) Node tries to get grub.cfg-<mac> over and over until it gives up.
2) Node tries to get grub.cfg-default-amd64, and succeeds.
3) This results in the node getting directed to enlist as a new node, since MAAS doesn't have anything to recognize it by at this point.
4) After running the enlistment scripts, the node powers off.

We are seeing this variant even with kpti disabled.

Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Do you have logs of the *exact* moment this happened on the *exact* rack &
region controller?

Since you know when this happened, please provide the logs from that time.

Thanks!

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

Well, there are 3 controller machines so it's more complicated than that.

Sure, here's the rack controller logs from the rack controller that was receiving the tftp requests, with just the requests from that machine:
http://paste.ubuntu.com/26485770/

Same time period with everything from that rackd.log:
http://paste.ubuntu.com/26485789/

Finally all of the maas/regiond/rackd log files from those two minutes:
http://paste.ubuntu.com/26485806/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

No matter what the load on MAAS, directing a machine that is supposed to be deploying to go through enlistment and then power off is a bug.

The attached infra-logs (1).tar is the full set of logs from a run where this happened.

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

kpti only addresses meltdown. It doesn't address spectre, which could also be
causing performance degradation.

What's happening there is simple:

1. The machine asks for grub.cfg-<mac>.
2. MAAS makes a DB request.
3. If it finds the MAC in the DB, it returns the kernel params for the specific boot purpose.
4. If it doesn't find the MAC, it goes to enlistment.

So the fact that it's going through enlistment is NOT a bug, because that's
what is intended. The fact that it is not finding its MAC address could be
an indication that the query is taking too long to process because of
performance degradation, causing the *client* to time out (i.e. it didn't
receive the response it was expecting), so it boots off the "default" config
and not the "mac"-based config.
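The decision described above can be sketched roughly as follows. This is illustrative only, not actual MAAS code: `get_boot_config`, `find_node_by_mac`, and the `"xinstall"` purpose below are hypothetical stand-ins for the DB lookup and boot-purpose logic being discussed.

```python
def get_boot_config(mac, find_node_by_mac):
    """Sketch of the grub.cfg-<mac> decision described above.

    find_node_by_mac stands in for the DB request (step 2); it should
    return a node record, or None when the MAC is unknown.
    """
    node = find_node_by_mac(mac)
    if node is not None:
        # Step 3: known machine -> kernel params for its boot purpose.
        return {"purpose": node["purpose"], "params": node["params"]}
    # Step 4: unknown MAC -> direct the machine to enlist as new.
    return {"purpose": "enlist", "params": []}
```

In the failure mode in this bug, the lookup (or the TFTP response carrying its result) is too slow, so grub gives up on grub.cfg-&lt;mac&gt; and fetches the default config, which maps to the enlistment branch.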


Revision history for this message
Andres Rodriguez (andreserl) wrote :

Is it possible for you to provide the dump of `machines read` from before
this issue happens?


Revision history for this message
Andres Rodriguez (andreserl) wrote :

Also, can you provide the MAC address of the machine that's failing?


Revision history for this message
Chris Gregan (cgregan) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

Escalated to Critical based on the frequency of failure detected in CI and the inability to work around it.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

The MAC address from comment #18 is 14:02:ec:41:d7:44. We can add a dump of machine read, but I can tell you right now that MAAS knows about the machines that are failing and will show them in the DB, so that is not the issue.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is very similar to bug 1724677, which was happening pre-meltdown/spectre. The only difference is we don't see "[critical] TFTP back-end failed" in the logs anymore.

description: updated
Revision history for this message
Blake Rouse (blake-rouse) wrote :

I know it's almost impossible for you, but we need a network capture. If we could reproduce this locally and reliably I would not be asking for it. But since you all are the only ones running into this issue, we must have a network capture.

I know that TFTP moves to a random port after the initial request to port 69, so it will be a lot of traffic, but you need to capture all UDP traffic.

We also need to know the MAC addresses or received IP addresses of the machines that are failing, along with the regiond, rackd, and maas logs for that time range, to correlate.

This sounds like I am asking for the world, but that is the only way I know to look into this further - with no guarantee that it will tell me anything more.
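A minimal sketch of the capture being asked for, as a small Python helper that builds a tcpdump command line. All UDP traffic is captured because TFTP moves to a random ephemeral port after the initial request to port 69. The interface name "eno49" used in the example is taken from the logs above and is only an assumption about the relevant NIC.

```python
def tcpdump_argv(interface, outfile):
    """Build a tcpdump command to capture all UDP traffic to a pcap file.

    -i: interface to listen on; -w: write raw packets to a pcap file;
    the trailing "udp" filter keeps every UDP packet, which covers the
    ephemeral TFTP data ports as well as the initial port-69 request.
    """
    return ["tcpdump", "-i", interface, "-w", outfile, "udp"]

# To actually run it (requires root on the rack controller):
#   import subprocess
#   subprocess.run(tcpdump_argv("eno49", "maas-tftp.pcap"), check=True)
```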

Changed in maas:
importance: Undecided → Critical
status: New → Incomplete
milestone: none → 2.4.x
Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

We're working on getting a packet capture right now.


Changed in maas:
importance: Critical → Undecided
Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

Here is a log of MAAS taking a while to respond to the request: https://paste.ubuntu.com/26491661/ - you can see grub request the file 14 times before MAAS responds.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In the log from #28, grub's behavior on retries looks reasonable - it does appear to delay 400ms the first time, 420ms the second, etc. One thing that's strange is that it looks like it still ends up retrieving the file 4 times.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :

tcpdump.pcap is a capture from an idle maas showing grub making 4 requests even though it gets the config every time.

Revision history for this message
Steve Langasek (vorlon) wrote :

Regarding grub requesting the same file 4 times, a surprising finding: I'm able to reproduce this with files of a certain length. By chance my grub.cfg was 1 byte shorter than the one maas serves (269 bytes instead of 270), and I saw multiple requests for this file.

To reproduce this in a VM using UEFI:
- set up dhcp to point to bootx64.efi
- set up tftp with bootx64.efi and grubx64.efi but not grub/grub.cfg
- create files of varying sizes and access them using 'source (pxe)/config-file-on-server'

A simple file consisting of nothing but newlines is sufficient.

confirmed "good" file lengths: 1,2,3,4,266,268,270
confirmed "bad" file lengths: 267,269,271,584,595,627

No pattern established yet.
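The repro files above can be generated with a short sketch; the filenames (testfile-&lt;size&gt;) are my invention, the sizes are the ones listed in this comment, and the TFTP root path is whatever your setup uses:

```python
import os

GOOD = [1, 2, 3, 4, 266, 268, 270]    # lengths that transferred cleanly
BAD = [267, 269, 271, 584, 595, 627]  # lengths that triggered re-requests

def write_test_files(tftp_root, sizes):
    """Create newline-only files of the given sizes under the TFTP root,
    to be fetched from grub with 'source (pxe)/testfile-<size>'."""
    paths = []
    for size in sizes:
        path = os.path.join(tftp_root, "testfile-%d" % size)
        with open(path, "wb") as f:
            f.write(b"\n" * size)  # a file of nothing but newlines suffices
        paths.append(path)
    return paths
```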

Revision history for this message
Steve Langasek (vorlon) wrote :

Since the grub.cfg files served by maas tend to be in a rather narrow range of file sizes, it may be possible to work around this bug by having maas pad to known "good" file lengths.

584,585,587,589,591,593-601,603: bad. 586,588,590,592,602: good.
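The padding workaround could look like this minimal sketch; KNOWN_BAD only covers the lengths reported in comments #33/#34, and since the full pattern is not established, this is illustrative rather than a real fix:

```python
# Lengths confirmed "bad" so far in this bug; hypothetical, incomplete set.
KNOWN_BAD = {267, 269, 271, 584, 585, 587, 589, 591,
             593, 594, 595, 596, 597, 598, 599, 600, 601, 603, 627}

def pad_to_good_length(config):
    """Append newlines until the config's length is not a known-bad size.
    Trailing newlines are harmless in a grub.cfg."""
    while len(config) in KNOWN_BAD:
        config += b"\n"
    return config
```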

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Attaching a pcap from a failure case. In this case, grub tried for 30 seconds to retrieve the interface specific grub.cfg, but never got a response from MAAS. It then gave up and got the amd64-default one instead, which caused the machine to try to enlist and then power off, leading to a failed deployment.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In the pcap from comment #35, MAAS eventually does respond to the interface specific grub request, 61 seconds after the request, after it's already sent the grub.cfg-default-amd64, kernel, and initrd. You can see the responses to the interface specific grub.cfg requests coming back starting at packet 90573.

While Steve's findings in #33/34 seem to indicate a grub bug, this seems like a MAAS problem occurring before that grub bug even has a chance

From the maas logs, the requests start at 01:02:49
logs-2018-02-01-01.04.49/10.244.40.30/var/log/maas/rackd.log

There are some "critical" tftp errors logged in the same file not long afterwards:
http://paste.ubuntu.com/26501394/

There are errors in postgresql's log around the same time too:
(logs-2018-02-01-01.04.49/10.244.40.30/var/log/postgresql$ vim postgresql-9.5-ha.log)

http://paste.ubuntu.com/26501399/

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

@Jason,

Packet 90573 doesn't seem to me to be an indication of what you are describing. What I see is this:

1. grub makes ~30 requests for PXE config on grub.cfg-<mac>, after which it gives up because it didn't receive a response.
2. grub moves on and requests grub.cfg-default-amd64, and it receives a response from MAAS.

Now, the difference between the above is that 1 does *database* lookups, while 2 does not. In other words, 1 causes a request to obtain the 'node' object based on the MAC address provided, and if grub is making 30+ requests, then this can definitely flood the db with requests.

That said, based on my understanding of how your environment is configured, you have 3 other VMs in the system PXE booting from MAAS, plus other machines at the same time, where each VM has 8 CPUs assigned to it on a system that has 20 CPUs (in other words, you are over-committing CPU). Combined with the performance implications of the recent kernel, it does seem to me that all of these other things could be contending with MAAS for resources, when we already know postgresql runs with degraded performance on the newer kernels.

That said, did you disable the spectre features and reboot your machine?
Did you test this by NOT running VMs on the same system as MAAS, or at least by reducing the number of cores each VM has access to (since there are 3 VMs with 8 cores each, that means 24 cores on a 20-core system)?

Also, do you have any CPU load data from the time of failure?

Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Andres,

You can tell packet 90573 is a response to the requests for grub.cfg-<mac> because its destination port (25305) is the src port the request for grub.cfg-<mac> was coming from (packets 2 through 38).

We're running another test now to collect load information.
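The port-matching logic described above can be sketched as follows; packets are modeled as (src_port, dst_port) tuples rather than parsed from the pcap, and the server source ports (49152, 49153) are made up for illustration:

```python
def replies_for_request(packets, client_tid):
    """Return server replies addressed to the client's TID.

    Per RFC 1350, the client's ephemeral source port (its TID) becomes the
    destination port of every server reply, which is how a reply in a
    capture is tied back to its originating request.
    """
    return [p for p in packets if p[1] == client_tid]

# Illustrative capture: client port 25305 taken from the pcap in comment #35.
capture = [
    (25305, 69),     # RRQ for grub.cfg-<mac> to the well-known TFTP port
    (25305, 69),     # retransmit of the same request
    (49152, 25305),  # late reply from the server
    (49153, 25305),  # another late reply, from a different source port
]
```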

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Your description of our VM setup is not accurate.

We have 20 cpu threads on the machine and 18 cpu threads given to VMs.

Here's the VMs we create on each infra node:
http://paste.ubuntu.com/26501925/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We can reduce the # of cores we give to VMs, btw. How many does MAAS need for postgres, region controller, and rack controller on the same machine?

Steve Langasek (vorlon)
Changed in grub2 (Ubuntu):
status: New → Triaged
assignee: nobody → Mathieu Trudel-Lapierre (cyphermox)
tags: added: patch
Chris Gregan (cgregan)
Changed in maas:
status: Incomplete → New
Changed in maas:
status: New → Incomplete
Changed in grub2 (Ubuntu):
status: Triaged → In Progress
importance: Undecided → Critical
Steve Langasek (vorlon)
Changed in grub2 (Ubuntu):
importance: Critical → Medium
Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

@Jason,

On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs <email address hidden>
wrote:

> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
> <email address hidden> wrote:
> > No new data was provided to mark this New in MAAS:
> >
> > 1. Changes to the storage seem to have improved things
>
> Yes, it has. That doesn't change whether or not there is a bug in
> MAAS. Can you please address the critical log errors that I mentioned
> in comment #36? This seems like enough to establish something is
> going wrong in MAAS.
>

The tftp issue shows no evidence of causing any booting failures. We
have seen this issue before and confirmed that it doesn't cause boot
issues. See [1]. If you want to try the fix, it is available in
ppa:maas/proposed.

[1].https://bugs.launchpad.net/maas/+bug/1376483

As for the postgresql logs with "maas@maasdb ERROR: could not serialize
access due to concurrent update", that is *not* a bug in MAAS or an issue.
Those are perfectly normal messages given the isolation level the MAAS DB
is running with. This basically means one transaction is trying to update
the db while another is updating it, and MAAS already handles this by
doing retries.
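The retry-on-serialization-failure pattern described here can be sketched as follows; this is not MAAS code, and SerializationError is a stand-in for the real driver exception (e.g. psycopg2's serialization-failure error):

```python
class SerializationError(Exception):
    """Stand-in for the database driver's serialization-failure error."""

def run_with_retries(txn, max_retries=10):
    """Run a transaction function, rerunning it on serialization failures.

    Under SERIALIZABLE/REPEATABLE READ isolation, a concurrent-update
    conflict is an expected transient condition, not a bug: the losing
    transaction is simply retried.
    """
    for _ in range(max_retries):
        try:
            return txn()
        except SerializationError:
            continue  # another writer won the race; rerun the transaction
    raise SerializationError("gave up after %d attempts" % max_retries)
```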

> > 2. No tests have been run with fixed grub that have caused boot
> failures.
>
> The comments from #56 were testing with the fixed grub - sorry if that
> wasn't clear.
>
> > 3. AFAIK, the VM config has not changed to use less CPU to compare
> results and whether this config change causes the bugs in question.
>
> The CPU load data from comments #48 and #50 shows that CPU load is not
> the problem. The max load average was under 12 on a 20 thread system.
> That means there was lots of free CPU time, and that this workload is
> not CPU bound.
>

CPU load is not CPU utilization. We know that at the time there are 6 other
VMs with 150%+ CPU usage writing to the disk, because they are being
deployed and/or configured (e.g. software installation). Correct me if I'm
wrong, but this can cause whatever is writing to disk to be prioritized
over anything else, like the MAAS processes' access to resources.

That being said, because CPU load doesn't show as high, we are making the
*assumption* that it is not impacting MAAS - but again, this is an
assumption. Making the requested change of giving MAAS at least 4 CPUs
(ideally 6) would allow us to determine the effects, see whether there's
any difference in behavior, and help identify other issues.

Without that comparison, we are making it more difficult to isolate the
problem.

>
> Jason
>
>
> ** Changed in: maas
> Status: Incomplete => New
>

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez
<email address hidden> wrote:
> @Steve,
>
> MAAS already has a mechanism to collapse retries into the initial request.
> In this case, it is the rack that grabs the requests and makes a request to
> the region. If retries come within the time that the rack is waiting for a
> response from the region, these request get "ignored" and the Rack will
> only answer the first request. This is what the logs show after testing
> with fixed grub, where grub makes multiple requests and MAAS answers
> seconds after those requests, but only answers once. This is because the
> requests were collapsed on the maas side.
>
> If, however, the retries come in after the region has answered the rack,
> then these requests will be served.

This is not true. MAAS is responding to every single request grub
makes for the file - the tcpdump logs show it. And these are not
"read 4 times" requests - they are retries because grub didn't get a
response.

This pcap shows MAAS responding to every request for grub.cfg-<mac>:
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap

Jason

>
> On Mon, Feb 5, 2018 at 2:34 PM, Steve Langasek <<email address hidden>
>> wrote:
>
>> Jason's feedback was that, after making the changes to the storage
>> configuration of his environment, deploying the test grubx64.efi doesn't
>> have any effect on the MAAS server's response time to tftp requests. So
>> at this point it's not at all clear that the grub change, while correct,
>> helps with this high-level symptom.
>>
>> It has also been suggested that each udp retry is generating a separate
>> database query from MAAS. That is absolutely a MAAS bug if true, and
>> not something that can or should be fixed in GRUB.
>>
>> ** Changed in: grub2 (Ubuntu)
>> Importance: Critical => Medium
>>
>
>
> --
> Andres Rodriguez (RoAkSoAx)
> Ubuntu Server Developer
> MSc. Telecom & Networking
> Systems Engineer
>

Revision history for this message
Andres Rodriguez (andreserl) wrote :

I think there's a misunderstanding of how the network boot process happens.
Let's look at pxelinux first. pxelinux does this:

1. tries the UUID first # if no answer, it moves on
2. tries the MAC # if no answer, it moves on
3. tries the full hex IP address # if no answer, it moves on
4. tries a partial IP address # if no answer, it moves on
5. repeats 4 with one fewer character
6. repeats again, and so on
[...]
7. boots default.

This can be seen in here:

/mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
/mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
/mybootdir/pxelinux.cfg/C0A8025B
/mybootdir/pxelinux.cfg/C0A8025
/mybootdir/pxelinux.cfg/C0A802
/mybootdir/pxelinux.cfg/C0A80
/mybootdir/pxelinux.cfg/C0A8
/mybootdir/pxelinux.cfg/C0A
/mybootdir/pxelinux.cfg/C0
/mybootdir/pxelinux.cfg/C
/mybootdir/pxelinux.cfg/default
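The search order above can be reproduced with a short sketch (not pxelinux or MAAS code); C0A8025B is 192.168.2.91 in hex:

```python
def pxelinux_config_names(uuid, mac, ip):
    """Return pxelinux's config search order: UUID, 01-<mac>, the hex IP
    progressively truncated by one character, and finally 'default'."""
    hex_ip = "%02X%02X%02X%02X" % tuple(int(o) for o in ip.split("."))
    names = [uuid, "01-" + mac.replace(":", "-")]
    names += [hex_ip[:n] for n in range(len(hex_ip), 0, -1)]
    names.append("default")
    return names
```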

That said, in the case of grub, the behavior is similar. You have
described this behavior in comment #16. So what is happening is:

1. grub tries grub.cfg-<mac> multiple times, but since it doesn't get a
response, it gives up.
2. Once it gives up, grub.cfg-default-amd64 is tried instead.

That said, the two requests are handled completely differently. The -<mac>
request actually accesses the *node* object in the database by searching
for it with the MAC address the request came from. With this node object,
we generate the config file.

In comparison, the -default-amd64 request does *not* access the node
object. It just accesses two config settings, so the db query is *much*
cheaper. Also, keep in mind that after grub has done many retries, this
returns rather fast in comparison because it is not only cheaper, but by
that point MAAS may have far fewer queued DB requests. Either way, grub
giving up means that it no longer expects a response to the initial
request, but it will expect a new response for the new file it asked for.

That said, this is working *exactly* as expected, because this effectively
tells grub "if config for your MAC address was not returned, you can safely
assume you are an unknown machine to MAAS", hence grub requests a different
config file to start the enlistment process.

So this is *not* a race condition in MAAS. This is working as designed and
is expected. The problem here is that MAAS takes too long to answer the
initial request, which causes grub to timeout and move on to request a
different config file.

On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs <email address hidden>
wrote:

> The packetdump (comment #35) of MAAS not responding to grub's request
> for the mac specific grub.cfg before grub times out, and then responding
> immediately to the generic-amd64 grub cfg, clearly shows a race
> condition in MAAS.
>
> MAAS's design of dynamically generating the interface specific grub
> config only after it receives the tftp request for it is susceptible to
> a race condition where grub times out before MAAS can respond.
>
> That design is not the only possible design. All the information
> required for the interface specific grub.cfg is available before the
> machine ever powers on, and could be made available on the rack
> controllers at that time too.
>
> Doing so would eliminate that race condition, or at least reduce the
> opportunity great...


Revision history for this message
Andres Rodriguez (andreserl) wrote :

@Jason,

The pcap shows exactly the behavior I was expecting to see: grub tries to
get config X first and, since it didn't get a response, it moves on and
tries to get config Y.

On Mon, Feb 5, 2018 at 4:45 PM, Jason Hobbs <email address hidden>
wrote:

> On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez
> <email address hidden> wrote:
> > @Steve,
> >
> > MAAS already has a mechanism to collapse retries into the initial
> request.
> > In this case, it is the rack that grabs the requests and makes a request
> to
> > the region. If retries come within the time that the rack is waiting for
> a
> > response from the region, these request get "ignored" and the Rack will
> > only answer the first request. This is what the logs show after testing
> > with fixed grub, where grub makes multiple requests and MAAS answers
> > seconds after those requests, but only answers once. This is because the
> > requests were collapsed on the maas side.
> >
> > If, however, the retries come in after the region has answered the rack,
> > then these requests will be served.
>
> This is not true. MAAS is responding to every single request grub
> makes for the file - the tcpdump logs show it. And these are not
> "read 4 times" requests - they are retries because grub didn't get a
> response.
>
> This pcap shows MAAS responding to every request for grub.cfg-<mac>:
> https://bugs.launchpad.net/maas/+bug/1743249/+attachment/
> 5046952/+files/spearow-fall-back-to-default-amd64.pcap
>
> Jason
>
> >
> > On Mon, Feb 5, 2018 at 2:34 PM, Steve Langasek <
> <email address hidden>
> >> wrote:
> >
> >> Jason's feedback was that, after making the changes to the storage
> >> configuration of his environment, deploying the test grubx64.efi doesn't
> >> have any effect on the MAAS server's response time to tftp requests. So
> >> at this point it's not at all clear that the grub change, while correct,
> >> helps with this high-level symptom.
> >>
> >> It has also been suggested that each udp retry is generating a separate
> >> database query from MAAS. That is absolutely a MAAS bug if true, and
> >> not something that can or should be fixed in GRUB.
> >>
> >> ** Changed in: grub2 (Ubuntu)
> >> Importance: Critical => Medium
> >>


Revision history for this message
Steve Langasek (vorlon) wrote :

On Mon, Feb 05, 2018 at 09:27:15PM -0000, Andres Rodriguez wrote:
> MAAS already has a mechanism to collapse retries into the initial request.

Are we certain that this is working correctly? If so, why are packet
captures showing that MAAS is sending stacked tftp OACK responses, 1:1 for
the duplicate incoming requests?

It's clear to me that MAAS's handling at the wire level is incorrect - 10
retries of the same tftp request should result in a single OACK, not 10 of
them (unless MAAS receives a retry *after* it has sent its OACK). I don't
know if that also means that MAAS is inefficiently translating these into
database requests on the backend. It had been suggested in this bug log and
on IRC that MAAS *was* sending duplicate db requests for each of these
packets; OTOH the timing of the stacked responses shows no latency in
between them that would imply additional db round-trips.

I think someone needs to directly inspect the behavior of a running MAAS
server in this scenario to be sure.

> In this case, it is the rack that grabs the requests and makes a request to
> the region. If retries come within the time that the rack is waiting for a
> response from the region, these request get "ignored" and the Rack will
> only answer the first request.

That is absolutely contradicted by the packet captures. The rack does not
ignore the additional requests, it answers *ALL* of the requests. It's only
the *client* that consolidates the duplicate responses from MAAS. (And
then, because of a grub bug higher up the stack, re-requests the same file
that it has already received.)

Revision history for this message
Mike Pontillo (mpontillo) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

Steve, can you be more specific about which packet capture showed the "stacked OACK" behavior?

I looked at a packet capture Andres pointed me to, and don't see the "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is indicated by the (source port, dest port) tuple, and I see that MAAS correctly OACKs each individual transaction (per RFC 2347) - not the retry packets within the same transaction. Subsequently (in the same second, after the client ACKs the data packet) it re-requests the same file (which is the bug in grub that I understand is fixed), and then the client starts a new transaction and MAAS correctly issues another OACK.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

On Mon, Feb 5, 2018 at 3:45 PM, Andres Rodriguez
<email address hidden> wrote:
> @Jason,
>
> On Mon, Feb 5, 2018 at 3:38 PM, Jason Hobbs <email address hidden>
> wrote:
>
>> On Mon, Feb 5, 2018 at 11:58 AM, Andres Rodriguez
>> <email address hidden> wrote:
>> > No new data was provided to mark this New in MAAS:
>> >
>> > 1. Changes to the storage seem to have improved things
>>
>> Yes, it has. That doesn't change whether or not there is a bug in
>> MAAS. Can you please address the critical log errors that I mentioned
>> in comment #36? This seems like enough to establish something is
>> going wrong in MAAS.
>>
>
> The tftp issue shows no evidence this is causing any booting failures. We
> have seen this issue before and confirmed that it doesn't cause boot
> issues. See [1]. If you want to try it, it is available in
> ppa:maas/proposed.
>
> [1].https://bugs.launchpad.net/maas/+bug/1376483

It's been "Fixed" multiple times before, in your link above and also
in bug 1724677, but we still see them, very suspiciously around the
time of failures. I'm not convinced these are actually understood.
Do you have a specific commit, or some idea of what change addresses
these in 2.4?

> As far as the postgresql logs with "maas@maasdb ERROR: could not serialize
> access due to concurrent update" that is *not* a bug in MAAS or an issue.
> That's perfectly normal messages with the isolation level the MAAS DB is
> running with. This basically means something else is trying to update the
> db while something else is updating it, and MAAS already handles this by
> doing retries.

That is just one type of db error in the log. There are many more.
Here's one that says there was a deadlock detected. That's not normal,
OK behavior, is it?

http://paste.ubuntu.com/26527181/

>> > 2. No tests have been run with fixed grub that have caused boot
>> failures.
>>
>> The comments from #56 were testing with the fixed grub - sorry if that
>> wasn't clear.
>>
>> > 3. AFAIK, the VM config has not changed to use less CPU to compare
>> results and whether this config change causes the bugs in question.
>>
>> The CPU load data from comments #48 and #50 shows that CPU load is not
>> the problem. The max load average was under 12 on a 20 thread system.
>> That means there was lots of free CPU time, and that this workload is
>> not CPU bound.
>>
>
> CPU load is not CPU utilization. We know that at the time there's 6 other
> VM's with 150%+ CPU usage are writing to the disk because they are being
> deployed and/or configured (e.g. software installation). Correct me if
> wrong, but this can cause the prioritization of whatever is writing to disk
> over anything else, like the MAAS processes access for resources.

The 150%+ number you are seeing means that process is using all of 1 core
(hyperthread) and 50% of another (it's a multithreaded process). It does
not mean the process is using 150% of the entire CPU capacity.

We don't just have load average - we also have a breakdown of CPU
utilization from top, every 5 seconds:

%Cpu(s): 23.5 us, 6.3 sy, 0.0 ni, 62.9 id, 7.2 wa, 0.0 hi, 0.1 si, 0.0 st

The top man page has more to say about this line, but have a l...


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

@Mike, you can see the stacked response behavior in
https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap

You can tell packet 90573 is a response to the requests for
grub.cfg-<mac> because its destination port (25305) is the src port
the request for grub.cfg-<mac> was coming from (packets 2 through 38).

On Mon, Feb 5, 2018 at 6:11 PM, Mike Pontillo
<email address hidden> wrote:
> Steve, can you be more specific about which packet capture showed the
> "stacked OACK" behavior?
>
> I looked at a packet capture Andres pointed me to, and don't see the
> "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is
> indicated by the (source port, dest port) tuple, and I see that MAAS
> correctly OACKs each individual transaction (per RFC 2347) - not the
> retry packets within the same transaction. Subsequently (in the same
> second, after the client ACKs the data packet) it re-requests the same
> file (which is the bug in grub that I understand is fixed), and then the
> client starts a new transaction and MAAS correctly issues another OACK.
>

Revision history for this message
Steve Langasek (vorlon) wrote :

On Tue, Feb 06, 2018 at 12:11:21AM -0000, Mike Pontillo wrote:
> Steve, can you be more specific about which packet capture showed the
> "stacked OACK" behavior?

This was the first packet capture that Jason posted, in comment #30. The
udp retransmits shown in packets 6262-6268 each receive an answering packet
in 6270-6271,6273-6277, in addition to 6269 as an answer to 6261. For
whatever reason, wireshark here does not decipher these duplicate OACK
packets as OACK, but an examination of the raw packets shows that's clearly
what they are.

> I looked at a packet capture Andres pointed me to, and don't see the
> "stacked OACKs" you describe. Each TFTP transaction (per RFC 1350) is
> indicated by the (source port, dest port) tuple, and I see that MAAS
> correctly OACKs each individual transaction (per RFC 2347) - not the
> retry packets within the same transaction.

Packets 6269-6271,6273-6277 are all answers to the same port on the client.
They don't have the same source port, because MAAS has allocated a separate
source port for each of these. It's not acking a separate individual
transaction, it's MAAS /creating/ a separate transaction (with the
allocation of a separate source port) for each one.

RFC2347 does not speak to this; the discussion of the port negotiation is in
RFC1350 §4:

   In order to create a connection, each end of the connection chooses a
   TID for itself, to be used for the duration of that connection. The
   TID's chosen for a connection should be randomly chosen, so that the
   probability that the same number is chosen twice in immediate
   succession is very low. Every packet has associated with it the two
   TID's of the ends of the connection, the source TID and the
   destination TID. These TID's are handed to the supporting UDP (or
   other datagram protocol) as the source and destination ports. A
   requesting host chooses its source TID as described above, and sends
   its initial request to the known TID 69 decimal (105 octal) on the
   serving host. The response to the request, under normal operation,
   uses a TID chosen by the server as its source TID and the TID chosen
   for the previous message by the requestor as its destination TID.
   The two chosen TID's are then used for the remainder of the transfer.

MAAS responds to 8 udp retransmits on srcport=25305, dstport=69 by sending 8
independent OACK packets back to dstport=25305 each from a different source
port.

Since Andres confirms that these duplicate acks still only result in one
database query, this may be a negligible bug if the only impact is duplicate
small udp packets. OTOH, depending on how MAAS implements this, it could
also result in port exhaustion on the server if unanswered OACKs are allowed
to linger.

Revision history for this message
Mike Pontillo (mpontillo) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

Ah, I see what you mean there; I used the following filter in Wireshark:

    udp.dstport == 25305 or udp.srcport == 25305

This is not the behavior I saw if the TFTP request is answered in a timely manner, so I suspect that the long delay between the initial request and the answer is causing the timeouts to occur in the TFTP code, which causes this separate "stacked OACK" bug.

By the time the "stacked OACKs" are sent, it's been over one minute, and the client isn't listening for a reply any more. So yes, "stacked OACKs" are a real bug, but right now I think they're just distracting from the root cause (the long delay).

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

+1 Mike. I agree it's a bug, but there isn't real evidence that it's
what's causing the long delay.


Revision history for this message
Andres Rodriguez (andreserl) wrote :

>
>
> > That being said, because CPU load doesn't show high we are making the
> > *assumption* that it is not impacting MAAS, but again, this is an
> > assumption. Making the requested change for having at least 4 CPUs
> > (ideally 6) would allow us to determine what the effects are, see
> > whether there's any difference in behavior, and help identify what
> > other issues there are.
> >
> > Without having the comparison we are making it more difficult to
> > isolate the problem.
>
> To improve performance the typical pattern is 1) identify the
> bottleneck 2) eliminate that as the bottleneck 3) repeat.
>
> We have not identified CPU as a bottleneck. The top data says it is
> not!
>

Jason,

That doesn't change the fact that we are requesting tests to be run with
a different CPU configuration for the VMs, so we can make a *comparison* and
see whether there is any material difference, or none at all, under the
current conditions. While I agree with you that the data /seems/ to show
that there is no issue with CPU, that doesn't change the fact that we don't
have any data to compare with, as there could still be an impact even if it
is minimal.

Without the data, we cannot assert with certainty that there's no issue
caused by CPU usage, because we don't have a reference or point of
comparison. So while all fingers seem to be pointing to storage, I strongly
believe it is worth gathering the data now to fully rule CPU out.

If this is something that your environment is unable to do, I would
appreciate it if you clarified that, instead of asserting that there's no
performance impact on MAAS due to CPU usage when we don't really know for
sure (e.g. we don't know if MAAS behaves differently with less CPU usage
under the current conditions, and that's data worth gathering to be able to
better support you in the future).

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Download full text (7.2 KiB)

On Mon, Feb 5, 2018 at 4:07 PM, Andres Rodriguez
<email address hidden> wrote:
> I think there's a misunderstanding on how the network boot process happens:
> Let's look at pxe linux first. Pxe linux does this:
>
> 1. tries UUID first # if no answer, it moves on
> 2. Tries mac # if no answer, it moves on
> 3. tries full IP address # if no answer, it moves on
> 4. tries partial IP address # if no answer, it moves on
> 5. does 4
> 6. does 4
> [...]
> 7. boots default.
>
> This can be seen in here:
>
> /mybootdir/pxelinux.cfg/b8945908-d6a6-41a9-611d-74a6ab80b83d
> /mybootdir/pxelinux.cfg/01-88-99-aa-bb-cc-dd
> /mybootdir/pxelinux.cfg/C0A8025B
> /mybootdir/pxelinux.cfg/C0A8025
> /mybootdir/pxelinux.cfg/C0A802
> /mybootdir/pxelinux.cfg/C0A80
> /mybootdir/pxelinux.cfg/C0A8
> /mybootdir/pxelinux.cfg/C0A
> /mybootdir/pxelinux.cfg/C0
> /mybootdir/pxelinux.cfg/C
> /mybootdir/pxelinux.cfg/default
>
>
> That said, in the case of grub, the behavior is similar. You have
> described this behavior in comment #16. So this is what's happening:
>
> 1. grub tries grub.cfg-<mac> multiple times, but since it doesn't get a
> response, it gives up.
> 2. Once it gives up, grub.cfg-default-amd64 is tried instead.
>
> That said, the two requests are handled completely differently. The -<mac>
> request actually accesses the *node* object in the database by searching
> for it with the MAC address the request came from. With this node object,
> we generate the config file.
>
> In comparison, the -default-amd64 request does *not* access the node
> object. It just accesses two config settings, and the DB query is *much*
> cheaper. Also, we have to keep in mind that after grub has done many
> retries, this returns rather fast in comparison, because it is not only
> cheaper, but at that point MAAS may have far fewer queued DB requests.
> Either way, grub giving up means that it won't wait for the initial
> request; it will instead expect a response for the new file it asked for.
>
> That said, this is working *exactly* as expected, because this effectively
> tells grub "if config for your MAC address was not returned, you can safely
> assume you are an unknown machine to MAAS", hence grub requests a different
> config file to start the enlistment process.

Except it's not an unknown machine, and MAAS treating it like one is
bad behavior and a bug.

This is not "working exactly as expected". "Working exactly as
expected" would be my machine being deployed when I asked for it to
be.

> So this is *not* a race condition in MAAS. This is working as designed and
> is expected. The problem here is that MAAS takes too long to answer the
> initial request, which causes grub to timeout and move on to request a
> different config file.

Yes, because there is a race condition in the design - the MAC
specific file has to be generated before grub times out. It could
instead be generated before the node ever starts booting, allowing it
to be served just as fast as the -default-amd64 file is, eliminating
that race condition.

Jason

> On Mon, Feb 5, 2018 at 4:30 PM, Jason Hobbs <email address hidden>
> wrote:
>
>> The packetdump (comment #35) of MAAS not responding to g...

Read more...
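The pxelinux fallback sequence quoted in this exchange can be reproduced mechanically. A small sketch (the `192.168.2.91` address used below is inferred from the `C0A8025B` hex in the listing; this is an illustration, not pxelinux's code):

```python
def pxelinux_config_paths(uuid, mac, ip):
    """Filenames pxelinux requests, in order, until one answers."""
    names = [uuid, "01-" + mac.replace(":", "-").lower()]
    # Dotted-quad IP rendered as 8 uppercase hex digits, e.g. 192.168.2.91
    # -> C0A8025B, then successively shorter prefixes down to one digit.
    hexip = "%02X%02X%02X%02X" % tuple(int(octet) for octet in ip.split("."))
    names += [hexip[:n] for n in range(8, 0, -1)]
    names.append("default")
    return ["/mybootdir/pxelinux.cfg/" + name for name in names]
```

Each miss (no timely answer) advances to the next name, which is why a slow server response to the MAC-specific name silently degrades into the default config.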

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Download full text (4.2 KiB)

Andres,

I ran the test with VMs limited to 9 of 20 cores (cut the core limit
in half for VMs). The first time range from this dump is with the
cores at their normal limit (18).

As you can see, the behavior didn't change much from one set to the
other. Both sets had instances where grub started doing retries,
although in neither case did it take very long.

http://paste.ubuntu.com/26530737/

So it seems that changing the CPU limits for the VMs doesn't change
the results drastically, which lines up with the data showing CPU
utilization never gets over 50%.

Jason


Read more...

Revision history for this message
Andres Rodriguez (andreserl) wrote :
Download full text (10.3 KiB)


Yes, it is not an unknown machine, but that doesn't change the fact that
this is working as designed. If the client didn't get a response for the
request it made, and the client decides to move on and make a different
request, then it is working as designed. Again, the bug here is not in the
client's behavior; the bug is that the response is not being delivered in a
timely manner.


Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

Can you please comment on the deadlock detected error from the db log in posted in #36

http://paste.ubuntu.com/26530761/

That is not expected behavior is it? Also the fact that MAAS thinks its losing rack/region connections seems like it could be related to this behavior.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Download full text (5.4 KiB)


Read more...

Revision history for this message
Andres Rodriguez (andreserl) wrote :

The deadlock is not expected behavior.

Due to the isolation level, the number of workers (e.g. 12 workers across 3
regions) and the fact that there could be IO starvation, this issue is
surfacing. That said, the changes that improve this and prevent the
deadlocks are not backportable to 2.3 and are targeted for 2.4
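One common mitigation for deadlock aborts of this kind is to re-run the transaction with jittered backoff so competing workers desynchronize. This is a generic pattern, not necessarily the 2.4 fix; `DeadlockDetected` below is a stand-in for the DB driver's error class (e.g. psycopg2's):

```python
import random
import time

class DeadlockDetected(Exception):
    """Stand-in for the database driver's deadlock error."""

def retry_on_deadlock(txn, attempts=5, base=0.01):
    # Re-run the transaction function when the database aborts it with a
    # deadlock, sleeping a random jittered interval that doubles each try.
    for attempt in range(attempts):
        try:
            return txn()
        except DeadlockDetected:
            if attempt == attempts - 1:
                raise
            time.sleep(random.uniform(0, base * 2 ** attempt))
```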


Revision history for this message
Andres Rodriguez (andreserl) wrote :
Download full text (4.2 KiB)

>
> >
> > Yes, it is not an unknown machine, but that doesn't change the fact that
> > this is working as designed. If the client didn't get a response for the
> > request it makes, and the client decides to move on and makes a different
> > request, then it is working as designed. Again, the bug here is not on
> the
> > clients behavior, the bug here is on the fact that the response is not
> > being done in a timely manner.
>
> Yes, agreed 100%. It's not a client bug, it's a server bug.
>
> >
> >>
> >> > So this is *not* a race condition in MAAS. This is working as
> >> > designed and is expected. The problem here is that MAAS takes too
> >> > long to answer the initial request, which causes grub to time out
> >> > and move on to request a different config file.
> >>
> >> Yes, because there is a race condition in the design - the MAC
> >> specific file has to be generated before grub times out. It could
> >> instead be generated before the node ever starts booting, allowing it
> >> to be served just as fast as the -default-amd64 file is, eliminating
> >> that race condition.
> >>
> >
> > It is not a race condition. It is doing exactly what it was told to do.
> > It requested X thing, didn't get a response, then it requested Y thing,
> > and got a response. The fact that there's no response when X happens in
> > a /timely/ manner is not a race, it's a bug on the server side. So, if
> > the machine were not known to MAAS, it would work as expected. But
> > since it is known and the response doesn't come in a timely manner for
> > grub, it moves on. This is the same behavior pxe, uboot and other
> > network bootloaders follow.
>
> Right - it's a bug on the server side! That's what I've been saying.
>
> > And yes, you could argue that the config could be generated before the
> > node starts booting, but what you are not considering is that the node
> > can boot from any rack controller, really, and that would require MAAS
> > to send the same file to all rack controllers in the same VLAN the
> > machine is booting from and write files onto the disk dynamically,
> > which, in fact, can impact performance even more. The config is
> > generated on the fly because it is generated for the specific rack
> > controller the machine is booting from, and that's the intended design.
>
> I never suggested the files had to be written to disk, but yes, they
> would need to be sent to each rack controller that it could boot from.
>
> I know it's the intended design, but it has a race condition built in
> that could be eliminated with another design. That's all I'm saying.
>
> It sounds like you agree and you point out there would be trade offs,
> and that's fine.
>

Actually, we don't believe this is a good change. In fact, it would cause
booting issues and overall performance issues.

We already know of two areas where this can be improved. One is
non-backportable to 2.3, the other one is this:

https://paste.ubuntu.com/26530972/

Is there any chance you can test that patch, or do you want me to put a
patched package somewhere?

>
> Jason
>
> --
> You received this bug notification because you are subscribed to MAAS.
> h...

Read more...

Revision history for this message
Andres Rodriguez (andreserl) wrote :

fwiw, the deadlock issues are regiond trying to determine which process
should send updates to which racks for *dhcp* changes, so this is not at
all related to the RPC boot requests for PXE.


Revision history for this message
Andres Rodriguez (andreserl) wrote :
Download full text (5.7 KiB)

@Jason,

Are these tests with archive grub or patched grub?


Read more...

Revision history for this message
Launchpad Janitor (janitor) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

This bug was fixed in the package grub2 - 2.02-2ubuntu6

---------------
grub2 (2.02-2ubuntu6) bionic; urgency=medium

  [ Steve Langasek ]
  * debian/patches/bufio_sensible_block_sizes.patch: Don't use arbitrary file
    sizes as block sizes in bufio: this avoids potentially seeking back in
    the files unnecessarily, which may require re-opening files that cannot
    be seeked into, such as via TFTP. (LP: #1743249)

 -- Mathieu Trudel-Lapierre <email address hidden> Mon, 05 Feb 2018 11:58:09 -0500

Changed in grub2 (Ubuntu):
status: In Progress → Fix Released
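The bufio fix can be illustrated abstractly: a TFTP-backed file is effectively forward-only, so a buffered reader whose block size is derived from an arbitrary file size can force a backward seek (and hence a transfer restart), while strictly sequential reads with a fixed power-of-two block size never do. A toy model of the constraint, not grub's code:

```python
import io

class ForwardOnlyFile:
    """Toy model of a TFTP-backed file: a backward seek forces a re-open."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.reopens = 0  # how many times the "transfer" had to restart

    def read_at(self, offset, size):
        if offset < self._buf.tell():
            # A real TFTP transfer cannot rewind; it must be restarted.
            self.reopens += 1
            self._buf.seek(0)
        self._buf.seek(offset)
        return self._buf.read(size)

# Sequential reads at a fixed power-of-two block size stay forward-only,
# so the stream is never re-opened.
f = ForwardOnlyFile(bytes(10000))
for off in range(0, 10000, 1024):
    f.read_at(off, 1024)
```

Any access pattern that revisits an earlier offset, by contrast, pays the re-open cost on every rewind, which is the repeated-timeout behavior the patch addresses.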
Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg
Download full text (3.5 KiB)

Ok - and what about the "region controller losing contact with the rack
controller" log messages? What is that about?


Read more...

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

The patch from #84 adds a cache for reading the template file on the rack controller. I don't understand why this change is being made.

This file will almost certainly be in the page cache anyhow, as these systems have a lot of free RAM. Usually it's best to just let the page cache do its thing and not try to re-implement it in userspace, unless you really know what you're doing.

I haven't seen any logs that indicate that there was a bottleneck reading the template file. Do you have some data along those lines?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Anyhow, I tested with the patch from #84 as requested; here are the results: http://paste.ubuntu.com/26531873/

We're still seeing some retries with it, same as before.

But I think the test is of limited value: it didn't make things worse, but we also have no evidence from it that things got better.

We're not seeing big delays on a regular basis anymore after changing the storage configuration to reduce contention. With or without this patch, we don't see any big delays.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Actually, caching does make a difference. That method is not just caching the reading of a file: it caches the search for the file based on the purpose, the reading of that file from disk (which, sure, can be in the kernel cache), and the parsing of the template by tempita.

All of that is redundant work being done on every single request. Searching the filesystem and reading the file are all syscalls, even if the data comes from the kernel cache. Since MAAS is async based, that means the coroutine is placed on hold while we wait for the result to be loaded from the kernel into the memory of the process. That gives other coroutines time to do other things, which means that coroutine doesn't get to execute until the others are done or blocked by their own async requests.

Caching this information can greatly improve that by not requiring the coroutine to be pushed back into the event loop while it waits for data from the kernel. Without this change, when the data comes back it still has to be processed by tempita, which takes time and blocks the event loop from completing other work.

So it's not simply that we should let the kernel cache reads from the disk; there is a lot more involved here. We have noticed improvements with this change on systems being run with a large number of VMs, because of the reduction in IO.
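The caching Blake describes can be sketched with `functools.lru_cache` (a hypothetical illustration, not MAAS's actual helper; the directory, the filename patterns, and the use of `string.Template` as a stand-in for tempita are all assumptions):

```python
from functools import lru_cache
from pathlib import Path
from string import Template  # stand-in for tempita in this sketch

TEMPLATE_DIR = Path("/tmp/maas-templates")  # hypothetical location

@lru_cache(maxsize=512)
def get_template(purpose: str, subarch: str) -> Template:
    """Cache the search, read, and parse of a boot-config template.

    Without the cache, every TFTP request repeats the filesystem
    search, the read() syscall, and the template parse; with it,
    repeat requests are served straight from process memory.
    """
    # Search most-specific filename first, falling back to a generic one.
    for name in (f"config.{purpose}.{subarch}.template",
                 f"config.{purpose}.template"):
        path = TEMPLATE_DIR / name
        if path.is_file():
            return Template(path.read_text())
    raise FileNotFoundError(f"no template for {purpose}/{subarch}")
```

Note that the trade-off Jason raises still applies: cached entries never see later edits to the files, so a real implementation would also need an invalidation story.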

Revision history for this message
Andres Rodriguez (andreserl) wrote :

@Jason,

I'm comparing the pastebin in #79 vs the one in #90.

#79 (non-patched): https://paste.ubuntu.com/26530737/
#90 (patched with lru_cache): https://paste.ubuntu.com/26531873/

Examples I see in #79:

14:02:ec:42:38:dc # makes 9 requests, on line 160+
14:02:ec:42:28:70 # makes 8 requests, on line 72
14:02:ec:41:d7:38 # makes 7 requests, on line 92

In #90 I see:

14:02:ec:41:d7:44 # makes 6 requests on line 7
14:02:ec:42:38:dc # makes 5 requests on line 19

So it is interesting that machines in #79 make more requests than those in #90. Obviously we need more data over time to really tell the difference between the two scenarios, but based on the current logs, it does /apparently/ show an improvement.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Blake, that's great. Do you have before and after numbers showing the
improvement this change made?

Do you have any data or logs that led you to believe this was the
culprit in the slow responses I saw on my cluster?


Revision history for this message
Blake Rouse (blake-rouse) wrote :

Andres did the testing of the changes and has logs to prove the improvement.


Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: Failed Deployment after timeout trying to retrieve grub cfg

Andres, it was a single test in both cases, and in both cases there was almost no delay from MAAS. That's not significant enough to call the results positive.

Since neither of you answered yes, I'll assume the answer was no to my question of whether there was anything in my logs or data that showed reading the template from disk on the rack controller was the culprit, and that this fix just represents a guess at what might be causing the delay.

Revision history for this message
Andres Rodriguez (andreserl) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

I don't have logs anymore as I have since rebuilt my environment, but I can
confirm seeing improvements on a maas server running with high IO (note it
was a single region/rack).

see inline:

On Tue, Feb 6, 2018 at 5:17 PM, Jason Hobbs <email address hidden>
wrote:

> Andres, it was a single test in both cases, and in both cases there was
> almost no delay from MAAS. It's not significant enough to call it
> positive results.
>
>
Comment #93 shows there are /some/ improvements when comparing those two
samples only, but as I have already said, we need data over time in both
scenarios to properly compare and determine whether the changes make any
material performance improvement under the current conditions (both
samples were taken with the IO starvation on the environment already fixed).

> Since neither of you answered yes, I'll assume the answer was no to my
> question of whether there was anything in my logs or data that showed
> reading the template from disk on the rack controller was the culprit,
> and that this fix just represents a guess at what might be causing the
> delay.
>

To be fair, your logs do not provide anything concrete to determine the
culprit of the issue on the MAAS side. They provide a lot of clues, and
we have since determined that those issues were a result of IO
starvation (from the VMs writing to disk). As such, the only way we can
*really* see whether the patch brings any significant performance
improvement is to run tests in the environment where you were seeing the
issues in the first place.

As such, if you are willing to test whether these changes make any material
difference, I would unfix your environment and do two runs (one without the
fix and one with it). That's the only way we can really compare and be
certain in *your* environment.


--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I didn't think my logs provided anything concrete! That's because the
logging built into MAAS is not sufficient to do so.

I can't break that environment to test anymore - we got it working
thanks to you guys' help, and it's a production environment that needs
to keep running other tests.

It might be possible to recreate this on another MAAS server, using
'stress' or a similar tool to cause disk contention.

Jason


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

dm-delay looks very interesting along those lines.

https://www.enodev.fr/posts/emulate-a-slow-block-device-with-dm-delay.html

https://www.kernel.org/doc/Documentation/device-mapper/delay.txt
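Following the kernel documentation linked above, a dm-delay device can be set up roughly like this (a sketch; the loop device, backing file, and 500 ms delay are illustrative, and the commands need root):

```shell
# Create a backing file and loop device to experiment on (illustrative paths).
dd if=/dev/zero of=/tmp/slowdisk.img bs=1M count=256
losetup /dev/loop7 /tmp/slowdisk.img

# Map it through dm-delay, adding 500 ms to every read and write.
# Table line: <start> <length> delay <device> <offset> <delay_ms>
echo "0 $(blockdev --getsz /dev/loop7) delay /dev/loop7 0 500" | \
    dmsetup create slowdev

# Mount /dev/mapper/slowdev where MAAS keeps its files, then watch
# how long the TFTP responses take under the artificial IO delay.
```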


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

BTW, to be clear: I don't think the path forward on this issue is
reasoning about how MAAS works and throwing out patches that might
improve performance here and there. The path forward is to instrument
MAAS on a system with slow I/O and figure out exactly where it's
getting hung up.

Jason
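A hypothetical sketch of such instrumentation (not MAAS code): a decorator that logs any wrapped call slower than a threshold, so the slow spot shows up in the logs instead of being guessed at.

```python
import functools
import logging
import time

log = logging.getLogger("maas.timing")  # hypothetical logger name

def timed(threshold=0.5):
    """Log a warning for any call slower than `threshold` seconds."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                if elapsed >= threshold:
                    # Name and duration of the slow call, for later correlation
                    # with the client-side TFTP retries.
                    log.warning("%s took %.3fs", fn.__qualname__, elapsed)
        return wrapper
    return deco
```

Decorating, say, the template lookup and the region RPC call with `@timed()` would show which leg of the request is eating the 30 seconds.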


Revision history for this message
Andres Rodriguez (andreserl) wrote :

That's what we have done to test the difference. So, for the greater
audience: this patch was tested on a 4-core NUC with an SSD, deploying 6 VMs
while 4 other nodes were PXE booting from MAAS.

Before the fix we saw:
1. The client would make multiple requests for the same file.
2. MAAS would run up to 3 DB requests for the node object used to render
the config.
3. We inspected why we had 3 DB requests for the same config.

With this behavior, we determined what happens: the rack queries the
region, obtains the object, takes a while to generate the config, and
returns it to the client. But before it returns it to the client, the
client makes another request, and that causes another DB query. With
this, we confirmed that the collapsing works as expected, provided that
the collapsing happens within the region/rack communication; but the rack
had already received a response, so it treats the new request as a new DB
query.

With the fix we saw:
1. The client would make multiple requests for the same file.
2. MAAS would always perform 1 DB request for the node object to render
the config.

With this, we were able to identify that the rack was taking too long to
answer the client, which meant that if a new request came in, it was
treated as a new request to be served by the region. With the changes the
rack responds faster, hence MAAS collapses the multiple requests and
responds in a timely fashion, before the client is driven to make another
request that hits the DB.

So the fix does improve things for sure, and we believe it is one of the
reasons this happened while there's IO starvation. That said, it is not
the only thing to improve; as I said earlier, there are other sections
that need improvement, and those involve improving the DB as well.
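The request collapsing described above can be illustrated with a small in-flight deduplication map (a hypothetical sketch, not MAAS's actual implementation): while a fetch for a key is pending, later callers await the same result instead of triggering another DB query.

```python
import asyncio

class Collapser:
    """Collapse concurrent requests for the same key into one backend call."""

    def __init__(self, fetch):
        self._fetch = fetch   # async function doing the real (DB) work
        self._inflight = {}   # key -> in-flight Task

    async def get(self, key):
        task = self._inflight.get(key)
        if task is not None:
            # A fetch for this key is already running: piggyback on it.
            return await asyncio.shield(task)
        task = asyncio.ensure_future(self._fetch(key))
        self._inflight[key] = task
        try:
            return await task
        finally:
            del self._inflight[key]
```

If the first request is answered before the client retries, later requests never reach the backend at all, which is why a faster rack response also cuts the DB query count.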


Revision history for this message
Andres Rodriguez (andreserl) wrote :

and fwiw, I'm not saying this is *the* solution for a problem like this one
where there is IO starvation. But it is definitely a step forward.


summary: - Failed Deployment after timeout trying to retrieve grub cfg
+ High IO on the system can cause MAAS to not provide PXE config in a
+ timely manner (less than 30 secs)
Changed in maas:
status: New → Triaged
tags: added: performance
summary: - High IO on the system can cause MAAS to not provide PXE config in a
- timely manner (less than 30 secs)
+ High IO on the system prevents MAAS to provide PXE config in a timely
+ manner (less than 30 secs)
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Andres, there is more to this bug than just the slow grub response. MAAS could do more after the fact to recover in this situation. For example, once a machine boots to the ephemeral environment, it has plenty of time to talk to MAAS and find out it's not actually supposed to be enlisting, and to do something other than just power off. I will file another bug for that.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Ok - I filed a few more bugs to cover the aspects of this failure other than the slow response for grub.cfg.

bug 1747927 - grub should not hang waiting for user input when booting from MAAS.

bug 1747928 - When a known server in Deploying state boots to the enlisting environment, it should not just power off.

bug 1747930 - MAAS should recognize servers that are Powered Off in Deploying state, and do something about it.

Revision history for this message
Christian Reis (kiko) wrote :

To summarize the changes we've put in place here:

1. We started doing our juju bootstrap --to a VM on a node which isn't the PostgreSQL master. This didn't stop the problem from occurring frequently.

2. We have a bcache on /var in write-back mode. This seems to have made the issue stop occurring.

Did the lru change Blake described get committed?

Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
milestone: 2.4.x → 2.4.0beta1
Changed in maas:
status: Fix Committed → Fix Released