Large dhcp leases file leads to tftp timeouts

Bug #1366212 reported by David Britton
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Gavin Panella

Bug Description

I run a MAAS cluster that does a lot of work. Eventually the leases file grew to about 3M. At this point I installed 1.7 for testing purposes, and noticed that almost every boot would time out in TFTP. Finally I discovered that truncating the leases file fixed the issue.

I have more debugging if you need it, but I suspect you probably know about this issue already and am just filing this defect for tracking.

David Britton (dpb)
tags: added: cloud-installer
Julian Edwards (julian-edwards) wrote :

I think we can blame the GIL. The leases parser is running for so long in its own thread that it's blocking I/O in the reactor thread. We need a way to defer leases parsing to a separate process I think.
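
As a rough illustration of deferring the parse to a separate process: the sketch below hands the leases text to a child Python interpreter, which has its own GIL, and reads back a JSON result. It is not MAAS's implementation; `parse_leases_out_of_process` and the tiny regex-based child program are hypothetical stand-ins for the real parser, and a Twisted service would more likely use the reactor's own process support than a blocking `subprocess.run` call.

```python
import json
import subprocess
import sys

# Source of the child program. This is a hypothetical stand-in for a slow,
# CPU-bound leases parser: it only collects the IP of each "lease <ip> {" line.
CHILD_SOURCE = r"""
import json, re, sys
ips = re.findall(r"^lease (\S+) \{", sys.stdin.read(), flags=re.M)
json.dump(ips, sys.stdout)
"""


def parse_leases_out_of_process(leases_text, timeout=300):
    """Parse leases text in a child interpreter and return the list of IPs.

    The CPU-bound work happens in the child, so it cannot hold the GIL of
    the calling process.
    """
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=leases_text,
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError if the child fails
        timeout=timeout,  # give up if the child runs for too long
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    sample = "lease 10.0.0.1 {\n  binding state active;\n}\nlease 10.0.0.2 {\n}\n"
    print(parse_leases_out_of_process(sample))
```

The calling thread releases the GIL while it waits for the child, so other threads in the parent can keep servicing I/O in the meantime.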

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
Gavin Panella (allenap) wrote :

David, if you find your MAAS suffering from this again, please can you attach a copy of the leases file here?

David Britton (dpb) wrote :

I saved the too-big leases file. We only have 10 machines, but we use a lot of lxcs, and cycle tests a lot.

Jeroen T. Vermeulen (jtv) wrote :

By the way, there was a known bug in the dhcpd packaging that stopped it from cleaning up its leases file. There's a workaround in 1.7, so the leases file should truncate itself and parsing should speed up significantly.

Jeroen T. Vermeulen (jtv) wrote :

The dhcpd problem was with the AppArmor config, so making AppArmor reload its configuration and then restarting the DHCP server ought to cure this situation. (Which is a way of saying “have you tried turning it off and on again?”)

Julian Edwards (julian-edwards) wrote : Re: [Bug 1366212] Re: Large dhcp leases file leads to tftp timeouts

On Monday 08 September 2014 03:24:47 you wrote:
> By the way, there was a known bug in the dhcpd packaging that stopped it
> from cleaning up its leases file. There's a workaround in 1.7, so the
> leases file should truncate itself and parsing should speed up
> significantly.

The leases are still only truncated every hour, so with a rapidly recycled set
of nodes it will still get large.

Gavin Panella (allenap)
Changed in maas:
assignee: nobody → Gavin Panella (allenap)
status: Triaged → In Progress
David Britton (dpb) wrote :

On Mon, Sep 08, 2014 at 05:07:25AM -0000, Julian Edwards wrote:
> The leases are still only truncated every hour, so with a rapidly recycled set
> of nodes it will still get large.

One thing that may have been different: the leases file was already
large before I upgraded to 1.7. In any case, there were multiple
hours that passed between when I installed 1.7 and when I had a working
system (pretty much all day on Friday). Many rounds of booting,
installing on other machines, etc.

@Jeroen: I actually did try rebooting the system once to clear any sort
of network error, since the symptoms appeared very similar to

https://bugs.launchpad.net/ubuntu/+source/maas/+bug/1246236

And that mentioned that reboots tended to fix the underlying networking
error. But once I figured out the root cause was the DHCP leases file, I
disregarded that.


Gavin Panella (allenap) wrote :

FWIW, I tried parsing that leases file with MAAS's lease parser, and it took nearly 2 minutes on my laptop. The parser is, unfortunately, very slow. In the medium term we need to come up with a different solution for lease parsing. In the short term I'm breaking the parsing out of the pserv process into its own process so that it doesn't hog the GIL.
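
For the medium-term question of a different parsing approach, one possible direction is a single regex pass that keeps only the last record per IP, since dhcpd appends updated records to the end of the leases file and the last one for a given lease is the one in effect. This is only a sketch under assumptions: `latest_leases` is a hypothetical helper, not MAAS's parser; it assumes each record closes with `}` at the start of a line and contains no nested blocks; and it has not been measured against the file discussed here.

```python
import re

# One "lease <ip> { ... }" record; the body ends at the first line that
# starts with "}". Assumes no nested blocks inside a record.
LEASE_RE = re.compile(r"^lease\s+(\S+)\s*\{(.*?)^\}", re.M | re.S)
MAC_RE = re.compile(r"hardware ethernet\s+([0-9a-fA-F:]+);")


def latest_leases(text):
    """Map each IP to the MAC address from its last record in the text."""
    latest = {}
    for ip, body in LEASE_RE.findall(text):
        mac = MAC_RE.search(body)
        if mac:
            # Later records overwrite earlier ones for the same IP.
            latest[ip] = mac.group(1).lower()
    return latest


if __name__ == "__main__":
    sample = (
        "lease 10.0.0.5 {\n  hardware ethernet 00:16:3e:00:00:01;\n}\n"
        "lease 10.0.0.5 {\n  hardware ethernet 00:16:3e:00:00:03;\n}\n"
    )
    print(latest_leases(sample))
```

The sketch ignores lease start and end times and binding state, which a real replacement would need to honour before treating a lease as active.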

Andreas Hasenack (ahasenack) wrote :

I see a lot of branches merged, but the bug is still "in progress". Is there something else missing?

Gavin Panella (allenap) wrote :

Sorry Andreas, I forgot to update the status.

Changed in maas:
status: In Progress → Fix Committed
milestone: none → 1.7.0
Changed in maas:
status: Fix Committed → Fix Released