[2.5] Stale WebSocket connections do not automatically reconnect

Bug #1802325 reported by Mark Shuttleworth
This bug affects 1 person
Affects   Status         Importance   Assigned to     Milestone
MAAS      Fix Released   Critical     Mike Pontillo

Bug Description

I am unable to add a device from the Devices listing; clicking the "Add Device" button does nothing.

Tags: ui

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.5.0rc1
tags: added: ui
Revision history for this message
Anthony Dillon (ya-bo-ng) wrote :

I can't seem to replicate the issue. On beta4 I can add a device fine although the form could do with some love.

Does the Add device panel not open for you? What browser are you using?

Revision history for this message
Anthony Dillon (ya-bo-ng) wrote :
Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Mike Pontillo (mpontillo) wrote :

I reproduced this issue after a suspend/resume cycle while the MAAS UI was in use in Google Chrome.

I see the "Loading..." indicator on the devices tab, and it doesn't go away.

Changed in maas:
status: Incomplete → Confirmed
Revision history for this message
Mike Pontillo (mpontillo) wrote :

In the browser's inspector, I see the following text:

Active resource loading counts reached to a per-frame limit while the tab is in background. Network requests will be delayed until a previous loading finishes, or the tab is foregrounded. See https://www.chromestatus.com/feature/5527160148197376 for more details

Revision history for this message
Mike Pontillo (mpontillo) wrote :

It seems the websocket request is wedged - "not finished yet", according to the inspector.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Could possibly be related to bug #1802390.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Unfortunately, this doesn't have the same root cause as bug #1802390. After fixing that related issue (which caused websocket reconnects to fail more visibly) I can reliably reproduce this bug with the following steps:

(1) Log into the MAAS UI while connected to WiFi (leave it connected to the machines tab)
(2) Close laptop lid, wait for websocket TCP connection to time out
(3) Plug laptop into Ethernet port (causing client IP address to change upon wakeup)
(4) Click "Devices" tab. Click "Add device" button. Observe that nothing happens.

I'm not totally sure that all these steps are /required/ to see the issue, but I can easily reproduce the problem this way.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Reproduced the issue without changing networks or IP addresses - simply letting the websocket connection time out seemed to be enough.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

I think the root cause of this issue is that WebSocket.onclose() doesn't fire when a WebSocket connection times out. If a WebSocket connection is *explicitly* closed, MAAS receives the WebSocket.onclose() event from the browser and reconnects. If the underlying TCP connection remains open but stops communicating (such as might happen in the event of a network interruption or change), the MAAS UI gets into this state.

If I use the Chrome inspector to get a reference to the WebSocket object and call ws.onclose({}), I observe that a reconnect happens and the button works again.
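
To illustrate why the manual ws.onclose({}) call revives the UI, here is a minimal sketch of a reconnect-on-close pattern. It is not the MAAS code: ReconnectingClient, createSocket, and the stand-in socket object are all hypothetical names used only for this example.

```javascript
// Hypothetical sketch; not the MAAS implementation.
// createSocket is any factory that returns a WebSocket-like object.
class ReconnectingClient {
  constructor(createSocket) {
    this.createSocket = createSocket;
    this.connectCount = 0;
    this.socket = null;
  }

  connect() {
    const socket = this.createSocket();
    this.connectCount += 1;
    // Reconnection hangs entirely off the close event. If the browser never
    // delivers it (stalled TCP connection), nothing below ever runs.
    socket.onclose = () => {
      if (this.socket === socket) {
        this.connect();
      }
    };
    this.socket = socket;
    return socket;
  }
}

// Stand-in for a browser WebSocket so the sketch runs outside a browser.
const client = new ReconnectingClient(() => ({ onclose: null }));
const first = client.connect();

// Equivalent of calling ws.onclose({}) by hand from the inspector.
first.onclose({});
```

In this sketch the only path to a new connection is the close handler, so a connection that stalls without closing is never replaced.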

We could work around this issue by developing an active keepalive mechanism to ensure the websocket attempts to reconnect when a timeout occurs.
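
A rough sketch of what such a keepalive could look like follows. Everything in it is an assumption made for illustration: the KeepaliveMonitor name, the JSON ping message, and the timeout values are hypothetical and do not reflect the MAAS WebSocket protocol or the fix that eventually landed.

```javascript
// Hypothetical sketch; names, message format, and intervals are assumptions.
class KeepaliveMonitor {
  constructor(socket, onStale, { timeoutMs = 30000, now = () => Date.now() } = {}) {
    this.socket = socket;
    this.onStale = onStale;
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastSeen = now();
  }

  // Call from the socket's message handler: any traffic proves liveness.
  noteActivity() {
    this.lastSeen = this.now();
  }

  // Call periodically (for example from setInterval). Returns true when the
  // connection was judged stale and onStale was invoked.
  check() {
    if (this.now() - this.lastSeen < this.timeoutMs) {
      // Still inside the window: send a ping so a healthy server replies.
      this.socket.send(JSON.stringify({ type: "ping" }));
      return false;
    }
    // A silent timeout never fires onclose, so trigger the reconnect path here.
    this.onStale();
    return true;
  }
}

// Demonstration with a fake socket and a controllable clock.
let clock = 0;
const sent = [];
let staleCount = 0;
const monitor = new KeepaliveMonitor(
  { send: (msg) => sent.push(msg) },
  () => { staleCount += 1; },
  { timeoutMs: 1000, now: () => clock }
);

clock = 500;
const staleEarly = monitor.check(); // within the window: sends a ping
clock = 1600;
const staleLate = monitor.check();  // no traffic for 1600 ms: reports stale
```

The onStale callback is where the UI would tear down the old socket and reconnect, in the same way it already does when the browser delivers a close event.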

Changed in maas:
status: Confirmed → Triaged
summary: - [2.5b4] Add device button doesn't work
+ [2.5] Stale WebSocket connections do not automatically reconnect
Revision history for this message
Mark Shuttleworth (sabdfl) wrote : Re: [Bug 1802325] [NEW] [2.5b4] Add device button doesn't work

Just to update you, what I see is that the form seems to take a very
long (and weirdly random) amount of time to open. I press the button,
nothing happens. Nothing happens for tens of seconds, perhaps minutes.
Then *pling* the form appears. The idea that it's some sort of
connection timeout seems reasonable.

Mark

Changed in maas:
assignee: nobody → Mike Pontillo (mpontillo)
Revision history for this message
Mike Pontillo (mpontillo) wrote :

Thanks for the additional details. I'm moving forward with this bug assuming it's a WebSocket timeout issue that can be addressed by being more proactive about connectivity checking. That said, I've been doing some reading on more general issues with the WebSocket approach[1]. I don't think now is the time to revisit that choice, but it made me think of a few more questions for you:

 - Does this issue only occur for you when you leave MAAS open in a browser tab for a long time, or does it happen on first login?

 - What browsers do you see this issue with? Are you using any browser plugins that might be relevant? (especially ad blocking, privacy, etc)

 - Are any HTTP proxies involved in your communication with this MAAS? Any use of a transparent proxy, privacy filter, etc?

 - Any other special network conditions? (e.g., does it happen every time you reconnect to a VPN and access MAAS again?)

[1]: https://samsaffron.com/archive/2015/12/29/websockets-caution-required

Changed in maas:
status: Triaged → In Progress
Changed in maas:
milestone: 2.5.0rc1 → 2.5.0rc2
Revision history for this message
Mike Pontillo (mpontillo) wrote :

After experimenting with a partial fix for the timeout issue I reproduced when triaging this, I think there are multiple layers to this problem. Even with the more aggressive timeout/reconnect, I still see a problem with the "Add device" button in certain cases, especially after the WebSocket disconnects and subsequently reconnects.

When you click the "Add device" button, the MAAS UI attempts to fetch all the subnets and domains, so that it can populate the drop-down boxes for specifying the DNS name and static IP address of the new device.

In previous releases of MAAS, this data was preloaded with the device listing. In MAAS 2.5, as part of (well-intentioned) attempts to improve the performance, this was changed; the MAAS UI now waits until "Add device" is clicked to load these objects. This made the originally-described bug more visible, especially in cases when the WebSocket has timed out, and thus the UI must wait to attempt to fetch the data. Additionally, it exposed the UI to another class of issues: race conditions while trying to keep that cache coherent between clicks of "Add device", whereby the managers for those objects might be unloaded and then reloaded again, depending on the order of UI interactions.
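
One possible way to limit this kind of race when loading on demand is to let every click share a single in-flight request instead of starting its own. The sketch below shows only that idea; SharedLoader and fetchAll are hypothetical names, the subnet value is a placeholder, and none of this is how the MAAS managers are actually implemented.

```javascript
// Hypothetical sketch; fetchAll stands in for a request over the WebSocket.
class SharedLoader {
  constructor(fetchAll) {
    this.fetchAll = fetchAll;
    this.pending = null;
  }

  // Returns the same promise to every caller until the load fails.
  load() {
    if (this.pending === null) {
      this.pending = Promise.resolve(this.fetchAll()).catch((err) => {
        this.pending = null; // allow a later click to retry
        throw err;
      });
    }
    return this.pending;
  }
}

// Demonstration: two rapid clicks trigger only one request.
let requests = 0;
const subnets = new SharedLoader(() => {
  requests += 1;
  return Promise.resolve(["10.0.0.0/24"]); // placeholder data
});

const firstClick = subnets.load();
const secondClick = subnets.load();
```

Because both clicks receive the same promise, the form cannot observe a half-loaded or reloaded cache from a competing request in this sketch.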

Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → In Progress
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Mike Pontillo (mpontillo) wrote :

Quick update on this: in addition to adding better timeout/retry behavior, we've separately landed performance improvements with regard to fetching subnets, which will also impact this specific issue (the "Add device" button).

I think there is still room for improvement with regard to error handling in cases where the WebSocket disconnects and reconnects. In my testing, the behavior in the release candidate is much better.

If you see further issues, let's address them under a separate bug report. Thanks!

Revision history for this message
Mark Shuttleworth (sabdfl) wrote : Re: [Bug 1802325] Re: [2.5] Stale WebSocket connections do not automatically reconnect

Unfortunately, I still see exactly the same behaviour in 2.5.0-rc1 on
the Garage MAAS.

Go to the Devices page, click on Add Device, nothing happens. This bug
is not fixed.

Mark

Changed in maas:
status: Fix Committed → Fix Released