Merge ~andersson123/autopkgtest-cloud:improve-handling-cache-amqp-failures into autopkgtest-cloud:master

Proposed by Tim Andersson
Status: Needs review
Proposed branch: ~andersson123/autopkgtest-cloud:improve-handling-cache-amqp-failures
Merge into: autopkgtest-cloud:master
Diff against target: 93 lines (+41/-5)
3 files modified
charms/focal/autopkgtest-web/webcontrol/helpers/exceptions.py (+11/-0)
charms/focal/autopkgtest-web/webcontrol/helpers/utils.py (+17/-0)
charms/focal/autopkgtest-web/webcontrol/request/submit.py (+13/-5)
Reviewer Review Type Date Requested Status
Skia Approve
Brian Murray Needs Information
Review via email: mp+458799@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Tim Andersson (andersson123) wrote :

Mmmm, I think I can write this a bit better

Revision history for this message
Tim Andersson (andersson123) wrote :

Okay, amended, please review

Revision history for this message
Tim Andersson (andersson123) wrote :

ah, nah, actually that's my bad, one sec I amend

Revision history for this message
Brian Murray (brian-murray) wrote :

We (you!) previously did some development around raising appropriate exit codes when submit.py fails. What will the exit code of this TimeoutError be?

review: Needs Information
Revision history for this message
Paride Legovini (paride) wrote :

It is not completely clear to me how this improves the UX.

As of today: if rabbitmq is broken, users experience a timeout error after a while, because the request can't be enqueued. Isn't this "right" already?

Is it actually useful to users that the "behind the scenes" error is surfaced?

(Is this useful for other reasons I'm not getting?)

Revision history for this message
Tim Andersson (andersson123) wrote :

I amended this as requested :)

I just had a thought though, I think I'll move the timeout class to helpers/utils.py

Revision history for this message
Tim Andersson (andersson123) wrote :

Okay, I reworked this mp a little bit, please re-review

Revision history for this message
Brian Murray (brian-murray) wrote :

> It is not completely clear to me how this improves the UX.
>
> As of today: if rabbitmq is broken, users experience a timeout error after a
> while, because the request can't be enqueued. Isn't this "right" already?
>
> Is it actually useful to users that the "behind the scenes" error is surfaced?
>
> (Is this useful for other reasons I'm not getting?)

I asked for this feature because it wasn't obvious that rabbimq was the culprit or something else and because it took an extended period of time to debug. So by surfacing where the error actually is we'll know what part of the service to look at.

Revision history for this message
Tim Andersson (andersson123) wrote :

I think what you stated is actually done in the latest version of the diff. It surfaces the error to /var/log/apache2/error.log, but not to the user via the webpage. Let me know if this is appropriate.

Revision history for this message
Skia (hyask) wrote :

See inline comment. I'm aligned with Paride's view of surfacing the error up to user, and not hide it in some obscure log file that is not monitored.

Moreover, I think our all_exception handler is not behaving correctly in the sense that the error is not logged anymore in our files. That would be another much needed improvement.

review: Needs Fixing
Revision history for this message
Tim Andersson (andersson123) :
Revision history for this message
Skia (hyask) wrote :

Thanks!

review: Approve
Revision history for this message
Tim Andersson (andersson123) wrote :

I had to fix a merge conflict, and this also requires testing somehow. I'll wait till staging is fully functional again and test this, then merge it.

Unmerged commits

f679675... by Tim Andersson

fix: web: make send_amqp_request function have a timeout

We were recently having some issues with production, where
requesting a test via the webpage would result in the page
eventually telling the user the server has timed out. This
was because the send_amqp_request function was failing, but
the library used to send the request has no internal timeout.

send_amqp_request was failing because the rabbitmq-server ran
out of memory.

This commit amends the above issue by adding a timeout for the
send_amqp_request function is the form of an signal handler,
via the `signal` python module.

This commit also introduces a new exception parent class,
InternalException. And a child class of this class, QueueDead.

This commit also introduces a new child class of WebControlException,
QueueDead.

This exception is raised when the rabbitmq connection times out, as
described above. Now, when this happens, the exception is raised and the
error message defined in the QueueDead declaration is displayed to the
user.

Failed
[FAILED] pre_commit:0 (build)
[WAITING] unit_tests:0 (build)
[WAITING] build_charms:0 (build)
13 of 3 results

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/charms/focal/autopkgtest-web/webcontrol/helpers/exceptions.py b/charms/focal/autopkgtest-web/webcontrol/helpers/exceptions.py
2index e94164e..96a206a 100644
3--- a/charms/focal/autopkgtest-web/webcontrol/helpers/exceptions.py
4+++ b/charms/focal/autopkgtest-web/webcontrol/helpers/exceptions.py
5@@ -103,3 +103,14 @@ class InvalidArgs(WebControlException):
6 ),
7 400,
8 )
9+
10+
11+class QueueDead(WebControlException):
12+ def __init__(self):
13+ super().__init__(
14+ (
15+ "Rabbitmq unresponsive. Perhaps it needs restarting. "
16+ "Please contact the QA team on #ubuntu-quality (highlight qa-help)."
17+ ),
18+ 500,
19+ )
20diff --git a/charms/focal/autopkgtest-web/webcontrol/helpers/utils.py b/charms/focal/autopkgtest-web/webcontrol/helpers/utils.py
21index 9b95cc8..67eb3c2 100644
22--- a/charms/focal/autopkgtest-web/webcontrol/helpers/utils.py
23+++ b/charms/focal/autopkgtest-web/webcontrol/helpers/utils.py
24@@ -8,6 +8,7 @@ import logging
25 import os
26 import pathlib
27 import random
28+import signal
29 import sqlite3
30 import time
31 import typing
32@@ -84,6 +85,22 @@ def get_supported_releases():
33 )
34
35
36+class timeout:
37+ def __init__(self, seconds=1, error_message="Timeout"):
38+ self.seconds = seconds
39+ self.error_message = error_message
40+
41+ def handle_timeout(self, signum, frame):
42+ raise TimeoutError(self.error_message)
43+
44+ def __enter__(self):
45+ signal.signal(signal.SIGALRM, self.handle_timeout)
46+ signal.alarm(self.seconds)
47+
48+ def __exit__(self, type, value, traceback):
49+ signal.alarm(0)
50+
51+
52 def setup_key(app, path):
53 """Create or load app.secret_key for cookie encryption."""
54 try:
55diff --git a/charms/focal/autopkgtest-web/webcontrol/request/submit.py b/charms/focal/autopkgtest-web/webcontrol/request/submit.py
56index e7f1613..7e4943f 100644
57--- a/charms/focal/autopkgtest-web/webcontrol/request/submit.py
58+++ b/charms/focal/autopkgtest-web/webcontrol/request/submit.py
59@@ -23,10 +23,14 @@ from helpers.exceptions import (
60 ForbiddenRequest,
61 InvalidArgs,
62 NotFound,
63+ QueueDead,
64 RequestInQueue,
65 RequestRunning,
66 )
67-from helpers.utils import get_autopkgtest_cloud_conf
68+from helpers.utils import (
69+ get_autopkgtest_cloud_conf,
70+ timeout,
71+)
72
73 # Launchpad REST API base
74 LP = "https://api.launchpad.net/1.0/"
75@@ -348,10 +352,14 @@ class Submit:
76 password=self.amqp_creds.password,
77 ) as amqp_con:
78 with amqp_con.channel() as ch:
79- ch.basic_publish(
80- amqp.Message(body, delivery_mode=2), # persistent
81- routing_key=queue,
82- )
83+ try:
84+ with timeout(seconds=60):
85+ ch.basic_publish(
86+ amqp.Message(body, delivery_mode=2), # persistent
87+ routing_key=queue,
88+ )
89+ except TimeoutError as e:
90+ raise QueueDead from e
91 return params["uuid"]
92
93 @classmethod

Subscribers

People subscribed via source and target branches