MAAS

Bug #1743249
Comment #37

Andres Rodriguez (andreserl) wrote on 2018-02-01: Re: Failed Deployment after timeout trying to retrieve grub cfg

@Jason,

Packet 90573 doesn't look to me like an indication of what you are describing. What I see is this:

1. grub makes ~30 requests for PXE config on grub.cfg-<mac>, after which it gives up because it didn't receive a response.
2. grub moves on and requests grub.cfg-default-amd64, and it receives a response from MAAS.

Now, the difference between the two is that 1 does *database* lookups, while 2 does not. In other words, 1 triggers a request to obtain the 'node' object based on the MAC provided, and if grub is making 30+ requests, this can definitely flood the db with requests.

That said, based on my understanding of how your environment is configured, you have 3 other VMs on the system PXE booting from MAAS at the same time as other machines. Each VM is assigned 8 CPUs on a system that has 20 CPUs, which means the VMs alone are overcommitting CPU. Combined with the other machines PXE booting off MAAS at the same time, plus the performance implications of the recent kernel, it does seem to me that all of these could be impacting MAAS through resource contention, when we already know postgresql is running with degraded performance due to the newer kernels.

That said, did you disable the Spectre features and reboot your machine?
Did you test this by NOT running VMs on the same system as MAAS, or at least by reducing the number of cores each VM has access to? (There are 3 VMs with 8 cores each, which means 24 cores on a 20-core system.)
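
To make the overcommit arithmetic concrete, here is a minimal sketch. The function name `overcommit_ratio` is made up for illustration, and the numbers are the ones quoted above (3 VMs, 8 cores each, 20 host CPUs); it does not account for whatever MAAS and postgresql need for themselves on the host.

```python
def overcommit_ratio(vm_count, vcpus_per_vm, host_cpus):
    """Return total vCPUs assigned to VMs divided by the host CPU count.

    A value above 1.0 means the VMs alone are assigned more cores
    than the host physically has.
    """
    return (vm_count * vcpus_per_vm) / host_cpus


# Figures from this comment: 3 VMs x 8 cores on a 20-CPU host.
total_vcpus = 3 * 8
ratio = overcommit_ratio(3, 8, 20)
print(f"{total_vcpus} vCPUs on 20 host CPUs, ratio {ratio:.2f}")
```

Anything above 1.0 here means the VMs can starve the host services when they are all busy at once, which is the case during simultaneous PXE boots.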

Also, do you have any CPU load at the time of failure?
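
If you don't have load data yet, something along these lines could be run on the MAAS host while reproducing the failure. This is only a rough sketch: `load_per_cpu` and `describe` are made-up helper names, and treating a per-CPU load of 1.0 or more as "saturated" is my own rule of thumb, not something MAAS defines. `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Normalize a load average by the number of CPUs."""
    return load_avg / cpu_count


def describe(load_avg, cpu_count):
    """Summarize load; the 1.0 per-CPU threshold is an assumed rule of thumb."""
    ratio = load_per_cpu(load_avg, cpu_count)
    state = "saturated" if ratio >= 1.0 else "has headroom"
    return f"load {load_avg:.2f} on {cpu_count} CPUs ({ratio:.2f} per CPU): {state}"


if __name__ == "__main__":
    # 1-, 5- and 15-minute load averages as reported by the kernel.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(describe(one_min, cpus))
```

Running it in a loop (or simply capturing `uptime` or `top` output) around the time grub gives up would show whether the host is CPU-bound when the requests go unanswered.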