Master Cluster fails to connect after importing multiple images and multiple subarchs in 1.7 and 1.8

Bug #1472707 reported by Sean Feole
This bug affects 1 person
Affects  Status        Importance  Assigned to  Milestone
MAAS     Fix Released  Critical    Blake Rouse
1.8      Fix Released  Critical    Blake Rouse

Bug Description

MAAS version: 1.7.5+bzr3369-0ubuntu1~trusty1
Boot Images: http://maas.ubuntu.com/images/ephemeral-v2/daily/

Problem:

Last night, while monitoring our MAAS server, I was notified that the master cluster had disconnected. After restarting the maas-clusterd service, the following error scrolls past over and over in the logs:

2015-07-08 12:48:44-0400 [ClusterClient,client] Amp server or network failure unhandled by client application. Dropping connection! To avoid, add errbacks to ALL remote commands!
 Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 913, in ampBoxReceived
     self._commandReceived(box)
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 892, in _commandReceived
     deferred.addCallback(self._safeEmit)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 306, in addCallback
     callbackKeywords=kw)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 295, in addCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 924, in _safeEmit
     aBox._sendTo(self.boxSender)
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 577, in _sendTo
     proto.sendBox(self)
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 2153, in sendBox
     self.transport.write(box.serialize())
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 555, in serialize
     raise TooLong(False, True, v, k)
 twisted.protocols.amp.TooLong:

2015-07-08 12:48:44-0400 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 35866) PEER:IPv4Address(TCP, u'127.0.0.1', 40140))

==> /var/log/maas/maas-django.log <==
INFO 2015-07-08 12:48:44,186 twisted RegionServer connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 40140) PEER:IPv4Address(TCP, '127.0.0.1', 35866))
ERROR 2015-07-08 12:48:44,187 django.request Internal Server Error: /MAAS/clusters/
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 137, in get_response
    response = response.render()
  File "/usr/lib/python2.7/dist-packages/django/template/response.py", line 105, in render
    self.content = self.rendered_content
  File "/usr/lib/python2.7/dist-packages/django/template/response.py", line 82, in rendered_content
    content = template.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 140, in render
    return self._render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 134, in _render
    return self.nodelist.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 840, in render
    bit = self.render_node(node, context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 854, in render_node
    return node.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/loader_tags.py", line 123, in render
    return compiled_parent._render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 134, in _render
    return self.nodelist.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 840, in render
    bit = self.render_node(node, context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 854, in render_node
    return node.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/loader_tags.py", line 62, in render
    result = block.nodelist.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 840, in render
    bit = self.render_node(node, context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 854, in render_node
    return node.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/defaulttags.py", line 203, in render
    nodelist.append(node.render(context))
  File "/usr/lib/python2.7/dist-packages/django/template/loader_tags.py", line 155, in render
    return self.render_template(self.template, context)
  File "/usr/lib/python2.7/dist-packages/django/template/loader_tags.py", line 137, in render_template
    output = template.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 140, in render
    return self._render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 134, in _render
    return self.nodelist.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 840, in render
    bit = self.render_node(node, context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 854, in render_node
    return node.render(context)
  File "/usr/lib/python2.7/dist-packages/django/template/defaulttags.py", line 504, in render
    six.iteritems(self.extra_context)])
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 585, in resolve
    obj = self.var.resolve(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 735, in resolve
    value = self._resolve_lookup(context)
  File "/usr/lib/python2.7/dist-packages/django/template/base.py", line 789, in _resolve_lookup
    current = current()
  File "/usr/lib/python2.7/dist-packages/maasserver/models/nodegroup.py", line 274, in get_state
    images = get_boot_images(self)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/twisted.py", line 148, in wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/clusterrpc/boot_images.py", line 93, in get_boot_images
    return call.wait(30).get("images")
  File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 219, in wait
    result.raiseException()
  File "<string>", line 2, in raiseException
ConnectionDone: Connection was closed cleanly.
ERROR 2015-07-08 12:48:46,083 maasserver Unable to get RPC connection for cluster 'Cluster master' (87685582-8844-48f2-a7a9-cae73fdb578f)
ERROR 2015-07-08 12:48:46,084 maasserver Unable to get RPC connection for cluster 'Cluster master' (87685582-8844-48f2-a7a9-cae73fdb578f)
ERROR 2015-07-08 12:48:46,086 maasserver Unable to get RPC connection for cluster 'Cluster master' (87685582-8844-48f2-a7a9-cae73fdb578f)
ERROR 2015-07-08 12:48:46,089 maasserver Unable to get RPC connection for cluster 'Cluster master' (87685582-8844-48f2-a7a9-cae73fdb578f)

The master cluster had been operating fine for the last few weeks, with no issues; this appeared out of the blue. Because the errors appeared to be related to images, I removed the Wily images and reimported. After doing this, the master cluster reconnected and synced as expected.

I cleared the contents of /var/lib/maas/boot-resources/cache and tried to import the Wily images again. After importing, I received the same error as above.

For now I have removed the 15.10 images and am filing this bug, hoping to get to the bottom of the problem.

Tags: hyperscale amp

Sean Feole (sfeole) wrote :

Since I updated to MAAS 1.7.5, version 1.8.0+bzr4001-0ubuntu2~trusty1 has hit the stable PPA. I'm still a little hesitant to do a somewhat massive upgrade, since this MAAS controller manages much of our lab hardware.

Sean Feole (sfeole) wrote :

Here is a second attempt. Again, reverting and removing the 15.10 images temporarily resolves the problem.

INFO 2015-07-08 13:36:35,766 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:14.10:i386:hwe-u: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:35,766 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:15.10:i386:hwe-w: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:35,767 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:15.10:amd64:hwe-w: to_add=[u'20150706'] to_remove=[]
INFO 2015-07-08 13:36:35,867 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:14.04:i386:hwe-u: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:35,867 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:14.04:i386:hwe-t: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:35,868 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:14.04:i386:hwe-v: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:35,868 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:14.10:arm64:hwe-u: to_add=[u'20150608'] to_remove=[]
INFO 2015-07-08 13:36:35,963 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:14.10:armhf:generic-lpae: to_add=[u'20150608'] to_remove=[]
INFO 2015-07-08 13:36:36,058 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:12.04:amd64:hwe-t: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:36,058 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:14.04:amd64:hwe-v: to_add=[u'20150706'] to_remove=[]
INFO 2015-07-08 13:36:36,153 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:14.04:ppc64el:hwe-u: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:36,154 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:12.04:amd64:hwe-p: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:36,154 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:12.04:amd64:hwe-q: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:36,154 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:12.04:amd64:hwe-r: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:36,155 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:12.04:amd64:hwe-s: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:36,155 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:13.10:i386:hwe-s: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:36,155 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:12.04:i386:hwe-q: to_add=[] to_remove=[]
INFO 2015-07-08 13:36:36,156 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:15.04:armhf:hwe-v: to_add=[u'20150707'] to_remove=[]
INFO 2015-07-08 13:36:36,250 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:14.04:amd64:hwe-u: to_add=[u'20150706'] to_remove=[]
INFO 2015-07-08 13:36:36,344 sstreams com.ubuntu.maas:daily:v2:download/com.ubuntu.maas.daily:v2:boot:15.04:amd64:hwe-v: to_add=[u'20150707'] to_remove=[]

==> /var/log/maas/maas.log <==
Jul 8 13:50:35 localhost maas.bootresources: [INFO] Finished importing of boot images ...
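
To get a rough picture of how many release/arch/subarch combinations an import covers, the sstreams lines above can be tallied. This is a minimal sketch that assumes the log format shown in this comment; the names parse_sstreams_line and summarize are made up for illustration and are not part of MAAS or simplestreams.

```python
import re
from collections import Counter

# Matches the product id and the to_add list in sstreams lines like those above.
LINE_RE = re.compile(
    r"sstreams \S+/(?P<product>\S+): to_add=\[(?P<to_add>[^\]]*)\]"
)


def parse_sstreams_line(line):
    """Return (release, arch, subarch, versions_to_add), or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # Product ids in this log end in ...:boot:<release>:<arch>:<subarch>
    release, arch, subarch = match.group("product").split(":")[-3:]
    # Entries look like u'20150706'; an empty list yields no versions.
    versions = re.findall(r"u?'([^']+)'", match.group("to_add"))
    return release, arch, subarch, versions


def summarize(lines):
    """Count product lines per release and collect products with pending downloads."""
    per_release = Counter()
    pending = []
    for line in lines:
        parsed = parse_sstreams_line(line)
        if parsed is None:
            continue
        release, arch, subarch, versions = parsed
        per_release[release] += 1
        if versions:
            pending.append((release, arch, subarch, versions))
    return per_release, pending
```

Feeding the lines of maas-django.log to summarize() gives a per-release count and the list of products that still have versions to download.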


Gavin Panella (allenap) wrote :

This is a problem when messages passed between cluster and region become too large.
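
For context, twisted.protocols.amp limits each value in a box to 65535 bytes (a 16-bit length prefix), and serialize() raises TooLong once a value exceeds that, which matches the tracebacks in this report. The sketch below only illustrates how a list with one record per release/arch/subarch/purpose can outgrow that limit. The record fields and the JSON encoding are assumptions made for illustration, not the actual wire format MAAS uses.

```python
import json
from itertools import product

# Per-value size limit of an AMP box in twisted.protocols.amp.
AMP_MAX_VALUE_LENGTH = 0xFFFF  # 65535 bytes


def fits_in_amp_value(payload):
    """Return True if the serialized payload fits in a single AMP value."""
    return len(payload) <= AMP_MAX_VALUE_LENGTH


def fake_image_list(releases, arches, subarches, purposes):
    """Build one hypothetical record per release/arch/subarch/purpose combination."""
    return [
        {"release": r, "architecture": a, "subarchitecture": s, "purpose": p}
        for r, a, s, p in product(releases, arches, subarches, purposes)
    ]


def serialized_size(images):
    """Size in bytes of the image list encoded as JSON (an assumed encoding)."""
    return len(json.dumps(images).encode("utf-8"))


if __name__ == "__main__":
    images = fake_image_list(
        ["trusty", "utopic", "vivid", "wily"],
        ["amd64", "armhf", "arm64"],
        ["generic", "hwe-t", "hwe-u", "hwe-v", "hwe-w"],
        ["commissioning", "install", "xinstall"],
    )
    size = serialized_size(images)
    print(len(images), "records,", size, "bytes, limit", AMP_MAX_VALUE_LENGTH)
```

The record count is the product of the four list lengths, so each additional release or subarch multiplies the payload size rather than adding to it.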

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
Gavin Panella (allenap) wrote :

Sean, if you're in a position to cowboy-patch the installation, can you give http://paste.ubuntu.com/11843690/ a go? After applying it to all machines running maas-regiond and/or maas-clusterd, restart those services and let me know if you're able to import Wily images. Ta!
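
The contents of the paste are not reproduced in this thread, so the following is not the patch itself. It is a minimal sketch of one general way to get an oversized payload under a per-value limit: compress it, then split it into chunks that each fit. The function names and the JSON encoding are assumptions made for illustration.

```python
import json
import zlib

AMP_MAX_VALUE_LENGTH = 0xFFFF  # 65535 bytes per AMP value


def to_chunks(obj, chunk_size=AMP_MAX_VALUE_LENGTH):
    """Serialize obj as JSON, compress it, and split it into size-limited chunks."""
    compressed = zlib.compress(json.dumps(obj).encode("utf-8"))
    return [
        compressed[i:i + chunk_size]
        for i in range(0, len(compressed), chunk_size)
    ]


def from_chunks(chunks):
    """Reassemble the chunks in order, decompress, and decode the JSON payload."""
    return json.loads(zlib.decompress(b"".join(chunks)).decode("utf-8"))
```

Each chunk would travel as its own value, and the receiver joins them in order before decompressing. Both ends need to agree on the scheme, which is consistent with the request to apply the change to every machine running maas-regiond and/or maas-clusterd.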

Sean Feole (sfeole) wrote :

Hey Gavin, that worked!!

Thanks. After importing the images, I'm good to go. I appreciate the timely response!

==> /var/log/maas/maas.log <==
Jul 8 18:23:06 localhost maas.bootresources: [INFO] Finished importing of boot images from 1 source(s).
Jul 8 18:23:06 localhost maas.import-images: [INFO] Started importing boot images.

==> /var/log/maas/maas-django.log <==
ERROR 2015-07-08 18:23:06,545 twisted {}

==> /var/log/maas/maas.log <==
Jul 8 18:34:19 localhost maas.import-images: [INFO] Writing boot image metadata and iSCSI targets.
Jul 8 18:34:20 localhost maas.import-images: [INFO] Installing boot images snapshot /var/lib/maas/boot-resources/snapshot-20150708-222308
Jul 8 18:34:31 localhost maas.import-images: [INFO] Updating boot image iSCSI targets.
Jul 8 18:34:32 localhost maas.import-images: [INFO] Cleaning up old snapshots and cache.
Jul 8 18:34:32 localhost maas.import-images: [INFO] Finished importing boot images.

Brad Figg (brad-figg) wrote :

This is on 1.8:

2015-07-09 09:26:07-0700 [ClusterClient,client] Unhandled failure during AMP request. This is probably a bug. Please ensure that this error is handled within application code.
 Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 913, in ampBoxReceived
     self._commandReceived(box)
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 892, in _commandReceived
     deferred.addCallback(self._safeEmit)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 306, in addCallback
     callbackKeywords=kw)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 295, in addCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 924, in _safeEmit
     aBox._sendTo(self.boxSender)
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 577, in _sendTo
     proto.sendBox(self)
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 2153, in sendBox
     self.transport.write(box.serialize())
   File "/usr/lib/python2.7/dist-packages/twisted/protocols/amp.py", line 555, in serialize
     raise TooLong(False, True, v, k)
 twisted.protocols.amp.TooLong:

Sean Feole (sfeole) wrote :

Hey Gavin, it would appear that I have the same issue with MAAS 1.8, just as Brad mentioned. After importing Trusty, Utopic, Vivid, and Wily for armhf/arm64/amd64, the cluster disconnects again. Will the patch work for MAAS 1.8 as well?

summary: - Master Cluster fails to connect after importing wiley images
+ Master Cluster fails to connect after importing multiple images and
+ multiple subarchs in 1.7 and 1.8
Gavin Panella (allenap) wrote :

> Will the patch work for maas 1.8 as well?

It should apply cleanly to 1.8, and apply with a little fuzz to 1.7.

Raghuram Kota (rkota)
tags: added: hs-arm64
tags: removed: hs-arm64
Changed in maas:
milestone: none → 1.9.0
Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Gavin Panella (allenap)
tags: added: amp
no longer affects: maas/1.7
Changed in maas:
status: Fix Committed → Fix Released