MAAS

[2.0] TFTP issues when rackd — Missing connections to 1 region controller(s).

Bug #1626654 reported by Larry Michel on 2016-09-22

This bug affects 9 people

Affects		Status	Importance	Assigned to	Milestone
	MAAS	Fix Released	Critical	Blake Rouse	MAAS 2.2.0

Bug Description

We started seeing our EFI systems failing to deploy. It looks to be an issue with the TFTP exchange with the maas server. the errors started happening around 1800 UTC on 9/21 and appears to impact EFI systems only.

Per juju status, this was the error:

machines:
  "0":
    juju-status:
      current: pending
      since: 22 Sep 2016 13:03:55Z
    dns-name: 10.245.0.153
    instance-id: 4y3xat
    machine-status:
      message: 'Failed deployment: bootx64.efi'
      since: 22 Sep 2016 13:45:27Z
    series: trusty
    containers:
      0/lxd/0:
        juju-status:
          current: pending
          since: 22 Sep 2016 13:04:11Z
        instance-id: pending
        machine-status:
          current: pending
          since: 22 Sep 2016 13:04:11Z
        series: trusty
      0/lxd/1:
        juju-status:
          current: pending
          since: 22 Sep 2016 13:04:13Z
        instance-id: pending
        machine-status:
          current: pending
          since: 22 Sep 2016 13:04:13Z
        series: trusty
      0/lxd/2:
        juju-status:
          current: pending
          since: 22 Sep 2016 13:04:15Z
        instance-id: pending
        machine-status:
          current: pending
          since: 22 Sep 2016 13:04:15Z
        series: trusty

In the rackd.log, only seeing:
2016-09-22 16:36:44 [TFTP (UDP)] Datagram received from ('10.245.0.226', 1705): <RRQDatagram(filename=b'bootx64.efi', mode=b'octet', options=OrderedDict([(b'tsize', b'0'), (b'blksize', b'1468')]))>

and monitoring the console of the booting server:
..PXE-E23: Client received TFTP error from server.

On the rackd page, it shows:

Region Controller

   regiond
   bind9
   proxy
Rack Controller

   rackd — Missing connections to 1 region controller(s).
   http
   tftp
   dhcpd
   dhcpd6
   tgt

ubuntu@maas2-production:~$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-==============================-============-=================================================
ii maas 2.0.0+bzr5189-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.0.0+bzr5189-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.0.0+bzr5189-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.0.0+bzr5189-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Tags:

Related branches

lp:~blake-rouse/maas/rpc-client-now

Merged into lp:~maas-committers/maas/trunk at revision 5725

Mike Pontillo (community): Approve on 2017-02-14

lp:~blake-rouse/maas/rpc-client-now-2.1

Merged into lp:maas/2.1 at revision 5585

Blake Rouse (community): Approve on 2017-02-14

Superseded for merging into lp:~maas-committers/maas/trunk

MAAS Maintainers: Pending requested 2017-02-14

Diff: 3861 lines (+2983/-3) (has conflicts)

80 files modified

docs/changelog.rst (+148/-0)
src/maasserver/api/spaces.py (+7/-0)
src/maasserver/api/subnets.py (+49/-0)
src/maasserver/forms/tests/test_interface_link.py (+6/-1)
src/maasserver/models/staticipaddress.py (+160/-0)
src/maasserver/models/tests/test_staticipaddress.py (+4/-0)
src/maasserver/static/assets/images/icons/account.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/add-logical-volume.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/add-partition.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/add.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/cross-orange.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/cross.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/cross_orange.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/debug.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/delete.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/edit.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/green-tick.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/help.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/info-mono.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/info.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/logical-volume.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/magnifying_glass.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/mount.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/partition.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/power-error.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/power-off.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/power-on.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/remove.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/settings.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/success-mono.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/success.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/success_colour_black.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/success_colour_white.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/success_mono.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/sync.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/system-shutdown.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/tags.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/tick.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/tooltip.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/unmount.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/warning-mono.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/warning.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/warning_colour_black.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/warning_colour_white.svg.OTHER (+1/-0)
src/maasserver/static/assets/images/icons/warning_mono.svg.OTHER (+1/-0)
src/maasserver/static/css/build.css (+1/-0)
src/maasserver/static/js/angular/maas.js (+5/-0)
src/maasserver/static/partials/dashboard.html (+207/-2)
src/maasserver/static/partials/domain-details.html (+4/-0)
src/maasserver/static/partials/domains-list.html (+4/-0)
src/maasserver/static/partials/error.html (+4/-0)
src/maasserver/static/partials/fabric-details.html (+8/-0)
src/maasserver/static/partials/intro-user.html (+33/-0)
src/maasserver/static/partials/intro.html (+71/-0)
src/maasserver/static/partials/networks-list.html (+4/-0)
src/maasserver/static/partials/node-details.html (+8/-0)
src/maasserver/static/partials/node-events.html (+4/-0)
src/maasserver/static/partials/node-result.html (+8/-0)
src/maasserver/static/partials/nodes-list.html (+4/-0)
src/maasserver/static/partials/settings.html (+4/-0)
src/maasserver/static/partials/space-details.html (+8/-0)
src/maasserver/static/partials/ssh-keys.html (+11/-0)
src/maasserver/static/partials/subnet-details.html (+12/-0)
src/maasserver/static/partials/vlan-details.html (+10/-0)
src/maasserver/static/scss/_base.forms.scss.OTHER (+377/-0)
src/maasserver/static/scss/_components.buttons.scss.OTHER (+149/-0)
src/maasserver/static/scss/_components.flashmessages.scss.OTHER (+59/-0)
src/maasserver/static/scss/_components.forms.scss.OTHER (+201/-0)
src/maasserver/static/scss/_components.icons.scss.OTHER (+274/-0)
src/maasserver/static/scss/_components.tables.scss.OTHER (+668/-0)
src/maasserver/static/scss/_helpers.display.scss.OTHER (+101/-0)
src/maasserver/static/scss/_helpers.spacing.scss.OTHER (+48/-0)
src/maasserver/static/scss/_patterns.page-header.scss.OTHER (+177/-0)
src/maasserver/templates/maasserver/base.html (+4/-0)
src/maasserver/tests/test_preseed.py (+17/-0)
src/metadataserver/api.py (+30/-0)
src/provisioningserver/boot/uefi_amd64.py (+4/-0)
src/provisioningserver/import_images/boot_resources.py (+15/-0)
src/provisioningserver/import_images/tests/test_boot_resources.py (+28/-0)
versions.cfg (+8/-0)

lp:~blake-rouse/maas/reap-old-regiond-workers

Merged into lp:~maas-committers/maas/trunk at revision 5728

Andres Rodriguez (community): Approve on 2017-02-14

Lee Trager (community): Approve on 2017-02-14

lp:~blake-rouse/maas/reap-old-regiond-workers-2.1

Merged into lp:maas/2.1 at revision 5587

Blake Rouse (community): Approve on 2017-02-16

Revision history for this message

Larry Michel (lmic) wrote on 2016-09-22:

logs.tar.gz Edit (5.7 MiB, application/x-tar)

Andres Rodriguez (andreserl) on 2016-09-22

summary:

- [2.0] PXE EFI boot failures: message: 'Failed deployment: bootx64.efi' -
- PXE-E23 Client received TFTP error from server
+ [2.0] 'Failed deployment: bootx64.efi' - rackd — Missing connections to
+ 1 region controller(s).

Andres Rodriguez (andreserl) on 2016-09-22

summary:	- [2.0] 'Failed deployment: bootx64.efi' - rackd — Missing connections to - 1 region controller(s). + [2.0] TFTP issues when rackd — Missing connections to 1 region + controller(s).
Changed in maas:
importance:	Undecided → Critical
status:	New → Confirmed

Gavin Panella (allenap) on 2016-09-26

Changed in maas:
status:	Confirmed → Triaged

Revision history for this message

Larry Michel (lmic) wrote on 2016-10-05:

logs.tar.gz Edit (3.6 MiB, application/x-tar)

This actually turns out to be re-creatable. It happens when doing an image upload.

I recreated it with:

maas root boot-resources create name=windows/win2012hvr2 title="Windows \"Hyper-V 2012 R2\"" architecture=amd64/generic filetype=ddtgz content@=/home/ubuntu/images/win2012hvr2-dd-img-root-dd --debug

In retrospect, original issue happened around the time that I was loading the previous 1.6Gig image. This image is larger 2.9Gig and I saw unresponsiveness immediately after starting the upload and seeing the multiple disconnections and attempts to reconnect. Attaching the latest logs..

For 2.1, I uploaded the same issue and did not observe this.

Revision history for this message

Larry Michel (lmic) wrote on 2016-10-06:

Screenshot from 2016-10-06 08-25-30.png Edit (52.8 KiB, image/png)

Andres, I recreated with 2.1 this morning. Yesterday, it went without a hitch but this morning, the things became sluggish as the upload image was completing or shortly after; I couldn't open the rackd.log for a short while nor could I load any page on the webui. I did see a message "rack controller importing ... " flash on the screen before things from bad to worse. My guess is that it may have been due to part of the upload image overlapping a bit with the import of daily images based on the completion timestamp of the ladder. Once things settled down, I then saw the error in the attached screenshot. All of this would have happened around the ~1320UTC and ~1328UTC time frame. I'll attach the logs shortly.

Revision history for this message

Larry Michel (lmic) wrote on 2016-10-06:

logs.tar.gz Edit (1.9 MiB, application/x-tar)

logs before restarting maas-rackd attached.

Revision history for this message

Blake Rouse (blake-rouse) wrote on 2016-10-06:

The logs don't provide much help at the moment. It does look like the rack is trying to sync often from the region, so the rack believes that the region has information that it does not. Your maas.log is also unhelpful as it full of messages about an IP address moving between two MAC addresses. I recommend checking your network to make sure that same IP address is not assign to two interfaces.

Revision history for this message

mahmoh (mahmoh) wrote on 2016-10-10:

Also hitting this on other arches with a previously working server on MAAS 2.0 beta:

2016-10-10 10:05:44 [TFTP (UDP)] Datagram received from ('172.16.3.193', 1583): <RRQDatagram(filename=b'grubaa64.efi', mode=b'octet', options=OrderedDict([(b'tsize', b'0'), (b'blksize', b'1468')]))>

Revision history for this message

mahmoh (mahmoh) wrote on 2016-10-10:

^Spoke too soon, the image was not downloaded on the correct maas server, it worked after download: 2.0.0+bzr5189-0ubuntu1~16.04.1

Andres Rodriguez (andreserl) on 2016-10-13

Changed in maas:
milestone:	2.1.0 → 2.1.1

Revision history for this message

Larry Michel (lmic) wrote on 2016-10-27:

Recreating with 2.1. I will attach the logs shortly.

Revision history for this message

Blake Rouse (blake-rouse) wrote on 2016-10-27:

2016-10-27 19:03:53 [-] Unhandled Error
Traceback (most recent call last):
   File "/usr/lib/python3/dist-packages/twisted/application/app.py", line 364, in startReactor
     self.config, oldstdout, oldstderr, self.profiler, reactor)
   File "/usr/lib/python3/dist-packages/twisted/application/app.py", line 285, in runReactorWithLogging
     reactor.run()
   File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1194, in run
     self.mainLoop()
   File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
     self.runUntilCurrent()
--- <exception caught here> ---
   File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 798, in runUntilCurrent
     f(*a, **kw)
   File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 512, in wsgiError
     self.request.finish()
   File "/usr/lib/python3/dist-packages/twisted/web/server.py", line 224, in finish
     return http.Request.finish(self)
   File "/usr/lib/python3/dist-packages/twisted/web/http.py", line 865, in finish
     "Request.finish called on a request after its connection was lost; "
builtins.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.

2016-10-27 19:03:53 [-] WSGI application error
	Traceback (most recent call last):
	  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 118, in callWithContext
	    return self.currentContext().callWithContext(ctx, func, *args, **kw)
	  File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 81, in callWithContext
	    return func(*args,**kw)
	  File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 857, in callInContext
	    return func(*args, **kwargs)
	  File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 513, in run
	    self.reactor.callFromThread(wsgiError, self.started, *exc_info())
	--- <exception caught here> ---
	  File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 496, in run
	    appIterator = self.application(self.environ, self.startResponse)
	  File "/usr/lib/python3/dist-packages/django/core/handlers/wsgi.py", line 189, in __call__
	    response = self.get_response(request)
	  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 240, in get_response
	    request = reset_request(request)
	  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 90, in reset_request
	    wsgi_input._wrapped.seek(0)
	builtins.ValueError: I/O operation on closed file.
	
2016-10-27 19:03:53 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/usr/lib/python3/dist-packages/twisted/application/app.py", line 364, in startReactor
	    self.config, oldstdout, oldstderr, self.profiler, reactor)
	  File "/usr/lib/python3/dist-packages/twisted/application/app.py", line 285, in runReactorWithLogging
	    reactor.run()
	  File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1194, in run
	    self.mainLoop()
	  File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
	    self.runUntilCurrent()
	--- <exception caught here> ---
	  File "/usr/lib/python3/dist-packages/twisted/internet/base.py", line 798, in runUntilCurrent
	    f(*a, **kw)
	  File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 512, in wsgiError
	    self.request.finish()
	  File "/usr/lib/python3/dist-packages/twisted/web/server.py", line 224, in finish
	    return http.Request.finish(self)
	  File "/usr/lib/python3/dist-packages/twisted/web/http.py", line 865, in finish
	    "Request.finish called on a request after its connection was lost; "
	builtins.RuntimeError: Request.finish called on a request after its connection was lost; use Request.notifyFinish to keep track of this.

Revision history for this message

Blake Rouse (blake-rouse) wrote on 2016-10-27:

#10

>>> [process.pid for process in RegionControllerProcess.objects.all()]
[509, 512, 518, 522]

>>> [(conn.endpoint.address, conn.endpoint.port, conn.endpoint.process.pid) for conn in RegionRackRPCConnection.objects.all()]
[('192.168.224.241', 5250, 522), ('10.245.0.10', 5252, 512), ('10.245.0.10', 5253, 518)]

You can see that the rack controller is not connected to pid 509. Now if this causes TFTP issues that is unknown as at the moment OIL is still deploying fine. But it is possible that it causes issues because the rack controller should have tried to re-connect to pid 509. Since it didn't that means the rack is in a bad state and might think that the connection to pid 509 is good. If the rack controller tries to use that connection then it would have an issue. I believe a log would appear if it did and at the moment I am not seeing anything in the logs.

Revision history for this message

Blake Rouse (blake-rouse) wrote on 2016-10-27:

#11

The region is showing all endpoints over the RPC endpoint:

{"eventloops": {"maas2-production:pid=512": [["10.245.0.10", 5252], ["192.168.224.241", 5252]], "maas2-production:pid=522": [["192.168.224.241", 5250], ["10.245.0.10", 5250]], "maas2-production:pid=518": [["10.245.0.10", 5253], ["192.168.224.241", 5253]], "maas2-production:pid=509": [["192.168.224.241", 5251], ["10.245.0.10", 5251]]}}

Blake Rouse (blake-rouse) on 2016-11-04

Changed in maas:
milestone:	2.1.1 → 2.1.2

Andres Rodriguez (andreserl) on 2016-11-23

Changed in maas:
milestone:	2.1.2 → 2.1.3

Ryan Beisner (1chb1n) on 2017-01-24

tags:

added: uosci

Andres Rodriguez (andreserl) on 2017-02-21

no longer affects:

maas/2.0

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2017-02-21:

#13

In our OpenStack Charms test lab, the controller periodically and repeatedly goes into error state for rackd in the web ui. Restarting maas-rackd resolves it for some time.

MAAS Version 2.1.3+bzr5573-0ubuntu1 (16.04.1)

Andres Rodriguez (andreserl) on 2017-11-02

no longer affects:

maas/2.1

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.