Merge ~cjwatson/launchpad:charm-librarian-doc-migration into launchpad:master

Proposed by Colin Watson
Status: Merged
Approved by: Colin Watson
Approved revision: 03315672e08d3120410a73e295958b5918d21c0f
Merge reported by: Otto Co-Pilot
Merged at revision: not available
Proposed branch: ~cjwatson/launchpad:charm-librarian-doc-migration
Merge into: launchpad:master
Diff against target: 80 lines (+72/-0)
1 file modified
charm/launchpad-librarian/README.md (+72/-0)
Reviewers:
Jürgen Gmach: Approve
Simone Pelosi: Approve
Review via email: mp+442615@code.launchpad.net

Commit message

charm: Document launchpad-librarian migration process

Description of the change

This is loosely based on our last production migration, although that was lifting and shifting a pre-cloud deployment into a manual VM, so a number of the details differ.

Simone Pelosi (pelpsi) wrote :

LGTM!

review: Approve
Jürgen Gmach (jugmac00) wrote :

Who is the target audience of this document?

In its current high-level state, I assume only you would be able to perform such a migration.

When you do the actual migration, I would love to join a meeting and take additional notes.

Or maybe even better, we could have a dry-run before the actual migration, as then there would be more time for questions.

review: Approve
Colin Watson (cjwatson) wrote :

Honestly? Mostly my future self plus IS, for now - but I wanted to have the notes somewhere more useful than my private collection augmented by searching through Mattermost history, and I didn't want to let the perfect be the enemy of the good. I figured that if we started to establish a pattern of writing down roughly how to redeploy things, even if it's incomplete, then refining that documentation over time as we gain experience will be much easier.

We normally do this sort of migration in some suitable MM channel, so observing is no problem. The last one was in ~launchpad-ps5-migration and you can look back through that channel for a bunch of interesting history - I put together much of this document using that.

Preview Diff

diff --git a/charm/launchpad-librarian/README.md b/charm/launchpad-librarian/README.md
index 6567dc0..bd2a657 100644
--- a/charm/launchpad-librarian/README.md
+++ b/charm/launchpad-librarian/README.md
@@ -23,3 +23,75 @@ You will normally want to mount a persistent volume on
 `/srv/launchpad/librarian/`. (Even when writing uploads to Swift, this is
 currently used as a temporary spool; it is therefore not currently valid to
 deploy more than one unit of this charm.)
+
+## Migrating between instances
+
+Only one instance of the librarian may be active at any one time, and very
+little downtime is acceptable on production. This means that we have to be
+especially careful when redeploying. The general procedure is as follows:
+
+1. Deploy a new unit with `active=false`. This will run the librarian in
+ more or less a read-only mode: downloads are possible, and cron jobs that
+ would modify the database, the contents of the librarian, or the contents
+ of Swift are disabled. If uploads happen they will be spooled locally,
+ but that's low-risk since only Launchpad itself uploads to the librarian.
+ This allows testing connectivity.
+
+1. Ensure that a Ceph volume is mounted persistently on
+ `/srv/launchpad/librarian/`. On production this should be a 2 TiB volume
+ to allow some breathing room if uploading to Swift is temporarily
+ unavailable.
+
+1. On the librarian unit, run `systemctl status
+ launchpad-librarian@1.service` to ensure that the librarian is running.
+ (If it crashes then it will restart automatically, so make sure that it's
+ been running for at least a few minutes.) You may need to ensure that
+ the appropriate firewall rules exist to give it access to the Launchpad
+ database, Launchpad's XML-RPC appserver, Swift, and some other details;
+ for Canonical's production deployment, it should be enough to add the new
+ unit to `services/lp/librarian/servers` in our firewall configuration.
+
+1. Find a librarian URL of something at least a day old from Launchpad (the
+ `.dsc` of an older source package in Ubuntu will do) and check that you
+ can fetch it from any of the public download ports of the new unit.
+ There is one public download port per worker, assigned sequentially
+ starting from `port_download_base`. This checks basic database and Swift
+ connectivity.
+
+1. Use `rsync` to copy the temporary spool from the old unit; on pre-Juju
+ production instances this lived in
+ `/srv/launchpadlibrarian.net/production/librarian/`, while on instances
+ of this charm it lives in `/srv/launchpad/librarian/`. The `rsync`
+ process should run as the `launchpad` user on the new unit, and should
+ _not_ use the `--delete` option; extra copies of files aren't a problem,
+ and will be cleaned up by automatic garbage collection after the
+ migration is complete. Keep this running in a loop throughout the
+ migration; once it has caught up it should only take a minute or so per
+ iteration.
+
+1. As `stg-launchpad@launchpad-bastion-ps5.internal`, run `lpndt
+ service-stop cron-fdt` to disable all cron jobs, then (after a minute)
+ `lpndt service-stop buildd-manager` to stop `buildd-manager`.
+
+1. Comment out the `librarian-gc` and `librarian-feed-swift` cron jobs on
+ the old unit (if it was deployed using this charm and is in a different
+ Juju application, you can do this by setting `active=false` using Juju),
+ and wait for the associated processes to stop.
+
+1. Switch the `haproxy` frontends over to the new unit. On production,
+ you'll need to update the IP addresses of the `dl_librarian_[1-6]`,
+ `ul_librarian_[1-6]`, `dl_librarian_internal_[1-6]`, and
+ `ul_librarian_internal_[1-6]` servers.
+
+1. Ensure that librarian access via the web frontend still works.
+
+1. Set `active=true` on the new unit using Juju.
+
+1. Check that logs from `librarian-feed-swift` (and later `librarian-gc`,
+ which only runs daily) look good.
+
+1. Stop the `rsync` loop.
+
+1. As `stg-launchpad@launchpad-bastion-ps5.internal`, run `lpndt
+ service-start buildd-manager` to start `buildd-manager`, then (after a
+ minute) `lpndt service-start cron-fdt` to enable all cron jobs.
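
Editorial sketch (not part of the diff): the download check and the spool copy in the procedure above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not something the charm ships. The function names, the `RSYNC` and `SYNC_INTERVAL` variables, the stop-file convention, the use of plain `http://`, and the shape of the URL path are all hypothetical. Only `port_download_base`, the one-port-per-worker rule, and the advice to avoid `--delete` come from the README text.

```shell
#!/bin/sh
# Sketch only: all helper names and variables below are hypothetical.

# Print one candidate download URL per librarian worker.
# Ports are assigned sequentially from port_download_base, one per worker.
#   $1 host of the new unit        $2 value of port_download_base
#   $3 number of workers           $4 path of a known librarian URL ("/...")
librarian_download_urls() {
    host=$1 base=$2 workers=$3 path=$4
    i=0
    while [ "$i" -lt "$workers" ]; do
        # Assumption: the download ports speak plain HTTP.
        printf 'http://%s:%d%s\n' "$host" "$((base + i))" "$path"
        i=$((i + 1))
    done
}

# Copy the spool once. There is deliberately no --delete: extra files on
# the new unit are harmless and are garbage-collected after the migration.
#   $1 source, e.g. OLD_UNIT:/srv/launchpad/librarian/ (the trailing slash
#      makes rsync copy the directory contents)
#   $2 destination directory on the new unit
sync_spool() {
    src=$1 dest=$2
    ${RSYNC:-rsync} -a "$src" "$dest"
}

# Repeat the copy until the operator creates the stop file given as $3.
sync_spool_loop() {
    while [ ! -e "$3" ]; do
        sync_spool "$1" "$2" || echo "rsync failed, will retry" >&2
        sleep "${SYNC_INTERVAL:-60}"
    done
}

# Example (placeholder host, port, worker count, and path):
#   for url in $(librarian_download_urls 10.0.0.5 8000 6 /12345/example.dsc); do
#       curl -fsS -o /dev/null "$url" || echo "FAILED: $url"
#   done
```

Run the loop as the `launchpad` user on the new unit, as the README requires, and create the stop file when the procedure reaches "Stop the `rsync` loop". Checking every port rather than just one goes slightly beyond the README, which only asks for a successful fetch from any of them.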
