Add service w/ watchdog to handle usd-importer failures
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
git-ubuntu | Fix Released | High | Bryce Harrington |
Bug Description
Various error situations can cause the importer to hang (see LP: #1745211) or fail (this bug). LP: #1765219 dealt with some system failure conditions, such as running out of disk space, but other situations could still cause the importer to stop.
Currently, these failures are dealt with by manual monitoring and restarting the importer script, but a more robust (if a bit brute-force) solution would be to invoke the script via a systemd service, with a watchdog to detect if the importer's main loop is operational and if not to restart the service.
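As a rough illustration of the watchdog idea, the sketch below shows how a Python main loop could ping systemd after each completed pass. It is not the code that landed for this bug. It assumes the unit is configured with `Type=notify` and a `WatchdogSec=` timeout, and it speaks the `NOTIFY_SOCKET` datagram protocol directly rather than relying on any helper library. `run_main_loop` and `do_one_pass` are hypothetical names standing in for the importer's real loop.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send one notification datagram to systemd.

    Returns True if a message was sent, or False when NOTIFY_SOCKET is
    unset (for example, when the script is run by hand).
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes a Linux abstract-namespace socket.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), path)
    return True


def run_main_loop(do_one_pass, passes: int) -> int:
    """Hypothetical main loop: ping the watchdog after each completed pass.

    If a pass hangs, no WATCHDOG=1 is sent, and systemd restarts the
    service once the WatchdogSec= timeout expires.
    """
    sd_notify("READY=1")
    pings = 0
    for _ in range(passes):
        do_one_pass()
        if sd_notify("WATCHDOG=1"):
            pings += 1
    return pings
```

Pinging only after a pass completes means a hang anywhere inside the pass is detected, at the cost of needing a watchdog timeout longer than the slowest legitimate pass.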
Error detection is currently handled manually as well, by visual inspection of the screen session for stack traces or evidence of hangs. With the introduction of a service daemon, the script output would be logged to a (logrotate'd) file. A new error detection/reporting process would need to be added to email the relevant log snippet to the administration mailing list.
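The error-reporting step could look something like the following sketch, which pulls the most recent Python traceback out of captured log text and wraps it in an email message. It only illustrates the idea and is not the script that was eventually merged. The function names, the subject line, and the example.com addresses are all hypothetical, and actually sending the message (for instance via `smtplib` or a local mailer) is left out.

```python
from email.message import EmailMessage

TRACEBACK_HEADER = "Traceback (most recent call last):"


def extract_last_traceback(log_text: str) -> str:
    """Return the last traceback in log_text, or "" if there is none."""
    lines = log_text.splitlines()
    starts = [i for i, line in enumerate(lines)
              if line.strip() == TRACEBACK_HEADER]
    if not starts:
        return ""
    start = starts[-1]
    snippet = [lines[start]]
    for line in lines[start + 1:]:
        snippet.append(line)
        # Frame lines are indented; the first unindented, non-empty line
        # is the exception summary that ends the traceback.
        if line and not line[0].isspace():
            break
    return "\n".join(snippet)


def build_failure_email(log_text, sender, recipient):
    """Build an email carrying the last traceback, or None if none found."""
    snippet = extract_last_traceback(log_text)
    if not snippet:
        return None
    msg = EmailMessage()
    msg["Subject"] = "usd-importer failure"  # hypothetical subject line
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(snippet)
    return msg
```

A caller might use it as `build_failure_email(log_text, "importer@example.com", "admins@example.com")` and skip sending whenever the result is `None`.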
Installation of the service daemon script will eventually need to be done by the snapd installation process, but initially we'll just let the script be manually installed by the system admins.
[Original Report]
Hi,
I found the importer down with the following message:
Examining publishes in debian since 2019-08-04 04:39:52
Traceback (most recent call last):
File "/snap/
cli_main()
File "/snap/
only_
File "/snap/
request_
File "/snap/
dist = launchpad.
File "/snap/
shim_
File "/snap/
representation = self._root.
File "/snap/
response, content = self._request(url, extra_headers=
File "/snap/
str(url), method=method, body=data, headers=headers)
File "/snap/
url, method=method, body=body, headers=headers)
File "/snap/
cachekey,
File "/snap/
LaunchpadOA
File "/snap/
redirections, cachekey)
File "/snap/
conn, request_uri, method, body, headers
File "/snap/
response = conn.getresponse()
File "/snap/
response.
File "/snap/
version, status, reason = self._read_status()
File "/snap/
line = str(self.
File "/snap/
return self._sock.
File "/snap/
return self.read(nbytes, buffer)
File "/snap/
return self._sslobj.
File "/snap/
v = self._sslobj.
ConnectionReset
P.S. I haven't found another import tag bug with that signature, feel free to dup if there is one.
Related branches
- Server Team CI bot: Needs Fixing (continuous-integration)
- Bryce Harrington: Pending requested
- Diff: 125 lines (+89/-13), 2 files modified
  doc/README.testing (+3/-0)
  snap-wrappers/wrappers/git-ubuntu-self-test (+86/-13)
- Server Team CI bot: Approve (continuous-integration)
- Robie Basak: Needs Information
- Diff: 298 lines (+212/-0), 7 files modified
  bin/failure-email.sh (+12/-0)
  bin/import-source-packages.py (+10/-0)
  doc/usd-importer-service.md (+164/-0)
  setup.py (+1/-0)
  snap/snapcraft.yaml (+2/-0)
  usd-importer-failure-email@.service (+8/-0)
  usd-importer.service (+15/-0)
tags: added: import
Changed in usd-importer:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Bryce Harrington (bryce)
summary: - importer failed with "Connection reset by peer"
+ Add service w/ watchdog to handle usd-importer failures
description: updated
Fixed in the following commit (and refined in a couple of subsequent commits). The change has landed in production, and the snaps have been updated.
commit 5d104bce612db56f3274c667ee7b008da5e68c18
Author: Bryce Harrington <email address hidden>
Date: Thu Oct 3 07:19:19 2019 -0700
Implement a systemd watchdog daemon to run import-source-packages.py
Git Ubuntu's package importing functionality is invoked via the
import-source-packages.py script. Previously, this script would be
manually started, and on error needed manual intervention.
Instead, wrap the script in a systemd service that starts it up
initially and restarts it on crash. A watchdog timer is used to detect
if the script has hung, and restarts it after a suitable delay.
Another service is added for sending emails when the service crashes,
extracting status from the journal. Errors can also be reviewed using
journalctl normally.
By default, everything is configured to be installable in production,
but configuration considerations are covered in documentation. There
are no unit tests for this, however some testing/validation tips are
identified in the documentation.
LP: #1838954