MAAS

Bug #1743249
Comment #94

Blake Rouse (blake-rouse) wrote on 2018-02-06: Re: [Bug 1743249] Re: Failed Deployment after timeout trying to retrieve grub cfg

Andres did the testing of the changes and has logs to prove the improvement.

On Tue, Feb 6, 2018 at 4:43 PM, Jason Hobbs <email address hidden>
wrote:

> Blake, that's great. Do you have before and after numbers showing the
> improvement this change made?
>
> Do you have any data or logs that led you to believe this was the
> culprit in the slow responses I saw on my cluster?
>
> On Tue, Feb 6, 2018 at 3:12 PM, Blake Rouse <email address hidden>
> wrote:
> > Actually, caching does make a difference. That method does not just cache
> > the reading of a file; it caches the search for the file based on the
> > purpose, the reading of that file from disk (which, sure, can be in the
> > kernel cache), and the parsing of the template by tempita.
> >
> > All of that is redundant work that is being done on every single request.
> > Searching the filesystem and reading the file are all syscalls, even if
> > the data comes from the kernel cache. Since MAAS is async-based, that
> > means the coroutine will be placed on hold while we wait for the result
> > to be loaded from the kernel into the memory of the process. That gives
> > other coroutines time to do other things, which means that coroutine
> > doesn't get to execute until the others are done or blocked by their own
> > async request.
> >
> > Caching this information can greatly improve that by not requiring the
> > coroutine to be pushed back into the event loop while it is waiting for
> > data from the kernel. Without this change, when the data comes back it
> > still has to be processed by tempita, which takes time and blocks the
> > event loop from completing other work.
> >
> > So it's not simply that we should use the kernel to cache reads from the
> > disk; there is a lot more involved here. We have noticed improvements
> > with this change on systems that are being run with a large number of VMs
> > because of the reduction in I/O.
> >
> > --
> > You received this bug notification because you are subscribed to the bug
> > report.
> > https://bugs.launchpad.net/bugs/1743249
> >
> > Title:
> > Failed Deployment after timeout trying to retrieve grub cfg
> >
> > Status in MAAS:
> > New
> > Status in grub2 package in Ubuntu:
> > Fix Released
> >
> > Bug description:
> > A node failed to deploy after it failed to retrieve a grub.cfg from
> > MAAS due to a timeout. In the logs, it's clear that the server tried
> > to retrieve the grub cfg many times, over about 30 seconds:
> >
> > http://paste.ubuntu.com/26387256/
> >
> > We see the same thing for other hosts around the same time:
> >
> > http://paste.ubuntu.com/26387262/
> >
> > It seems like MAAS is taking way too long to respond to these
> > requests.
> >
> > This is very similar to bug 1724677, which was happening
> > pre-Meltdown/Spectre. The only difference is we don't see "[critical] TFTP
> > back-end failed" in the logs anymore.
> >
> > I connected to the console on this system and it had errors about
> > timing out retrieving the grub cfg, then it had an error message along
> > the lines of "error not an ip" and then "double free". After I
> > connected, but before I could get a screenshot, the system rebooted and
> > was directed by MAAS to power off, which it did successfully after
> > booting to Linux.
> >
> > Full logs are available here:
> > https://10.245.162.101/artifacts/14a34b5a-9321-4d1a-b2fa-ed277a020e7c/cpe_cloud_395/infra-logs.tar
> >
> > This is with 2.3.0-6434-gd354690-0ubuntu1~16.04.1.
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/maas/+bug/1743249/+subscriptions
>
