Comment 69 for bug 1743249

Andres Rodriguez (andreserl) wrote : Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

@Jason,

The pcap shows exactly the behavior I was hoping to see: grub tries to
get X config first and, since it doesn't get a response, it moves on and
tries to get Y config.
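
As a rough illustration of that fallback order (a sketch only, not grub's
actual code): the function below tries each candidate config name in turn
and moves on when a fetch yields no response. The helper names and the
"grub.cfg-default-amd64" file name are assumptions made for the example;
only "grub.cfg-<mac>" is taken from this thread.

```python
from typing import Callable, Optional, Sequence, Tuple


def fetch_first_available(
    candidates: Sequence[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Optional[Tuple[str, bytes]]:
    """Try each config name in order; return the first one that yields data.

    `fetch` is a hypothetical stand-in for a TFTP read that returns None
    when the request times out or gets no response.
    """
    for name in candidates:
        data = fetch(name)
        if data is not None:
            return name, data
    # Every candidate timed out.
    return None


def fake_fetch(name: str) -> Optional[bytes]:
    """Simulated server: only the default config ever gets a response."""
    responses = {"grub.cfg-default-amd64": b"set default=0\n"}
    return responses.get(name)


# The per-MAC config gets no response, so the default config is used.
result = fetch_first_available(
    ["grub.cfg-<mac>", "grub.cfg-default-amd64"], fake_fetch
)
```

Here `result` names the second candidate, since the simulated server never
answers for the first one.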

On Mon, Feb 5, 2018 at 4:45 PM, Jason Hobbs <email address hidden>
wrote:

> On Mon, Feb 5, 2018 at 3:27 PM, Andres Rodriguez
> <email address hidden> wrote:
> > @Steve,
> >
> > MAAS already has a mechanism to collapse retries into the initial request.
> > In this case, it is the rack that grabs the requests and makes a request to
> > the region. If retries come in while the rack is waiting for a
> > response from the region, these requests get "ignored" and the rack will
> > only answer the first request. This is what the logs show after testing
> > with the fixed grub, where grub makes multiple requests and MAAS answers
> > seconds after those requests, but only answers once. This is because the
> > requests were collapsed on the MAAS side.
> >
> > If, however, the retries come in after the region has answered the rack,
> > then these requests will be served.
>
> This is not true. MAAS is responding to every single request grub
> makes for the file - the tcpdump logs show it. And these are not
> "read 4 times" requests - they are retries because grub didn't get a
> response.
>
> This pcap shows MAAS responding to every request for grub.cfg-<mac>:
> https://bugs.launchpad.net/maas/+bug/1743249/+attachment/5046952/+files/spearow-fall-back-to-default-amd64.pcap
>
> Jason
>
> >
> > On Mon, Feb 5, 2018 at 2:34 PM, Steve Langasek <email address hidden> wrote:
> >
> >> Jason's feedback was that, after making the changes to the storage
> >> configuration of his environment, deploying the test grubx64.efi doesn't
> >> have any effect on the MAAS server's response time to tftp requests. So
> >> at this point it's not at all clear that the grub change, while correct,
> >> helps with this high-level symptom.
> >>
> >> It has also been suggested that each udp retry is generating a separate
> >> database query from MAAS. That is absolutely a MAAS bug if true, and
> >> not something that can or should be fixed in GRUB.
> >>
> >> ** Changed in: grub2 (Ubuntu)
> >> Importance: Critical => Medium
> >>
> >> --
> >> You received this bug notification because you are subscribed to MAAS.
> >> https://bugs.launchpad.net/bugs/1743249
> >>
> >> Title:
> >> Failed Deployment after timeout trying to retrieve grub cfg
> >>
> >> To manage notifications about this bug go to:
> >> https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
> >>
> >> Launchpad-Notification-Type: bug
> >> Launchpad-Bug: product=maas; milestone=2.4.x; status=Incomplete;
> >> importance=Undecided; assignee=None;
> >> Launchpad-Bug: distribution=ubuntu; sourcepackage=grub2; component=main;
> >> status=In Progress; importance=Medium; <email address hidden>;
> >> Launchpad-Bug-Tags: cdo-qa cdo-qa-blocker foundations-engine patch
> >> Launchpad-Bug-Information-Type: Public
> >> Launchpad-Bug-Private: no
> >> Launchpad-Bug-Security-Vulnerability: no
> >> Launchpad-Bug-Commenters: andreserl blake-rouse cgregan jason-hobbs vorlon
> >> Launchpad-Bug-Reporter: Jason Hobbs (jason-hobbs)
> >> Launchpad-Bug-Modifier: Steve Langasek (vorlon)
> >> Launchpad-Message-Rationale: Subscriber (MAAS)
> >> Launchpad-Message-For: andreserl
> >>
> >
> >
> > --
> > Andres Rodriguez (RoAkSoAx)
> > Ubuntu Server Developer
> > MSc. Telecom & Networking
> > Systems Engineer
> >
> > --
> > You received this bug notification because you are subscribed to the bug
> > report.
> > https://bugs.launchpad.net/bugs/1743249
> >
> > Title:
> > Failed Deployment after timeout trying to retrieve grub cfg
> >
> > Status in MAAS:
> > New
> > Status in grub2 package in Ubuntu:
> > In Progress
> >
> > Bug description:
> > A node failed to deploy after it failed to retrieve a grub.cfg from
> > MAAS due to a timeout. In the logs, it's clear that the server tried
> > to retrieve the grub cfg many times, over about 30 seconds:
> >
> > http://paste.ubuntu.com/26387256/
> >
> > We see the same thing for other hosts around the same time:
> >
> > http://paste.ubuntu.com/26387262/
> >
> > It seems like MAAS is taking way too long to respond to these
> > requests.
> >
> > This is very similar to bug 1724677, which was happening
> > pre-Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
> > back-end failed" in the logs anymore.
> >
> > I connected to the console on this system and it had errors about
> > timing out retrieving the grub-cfg, then it had an error message along
> > the lines of "error not an ip" and then "double free". After I
> > connected but before I could get a screenshot the system rebooted and
> > was directed by maas to power off, which it did successfully after
> > booting to linux.
> >
> > Full logs are available here:
> > https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
> >
> > This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>

--
Andres Rodriguez (RoAkSoAx)
Ubuntu Server Developer
MSc. Telecom & Networking
Systems Engineer