OCI multiarch upload leads to temporary invalid manifest

Bug #1929693 reported by Thomas Bechtold
This bug affects 2 people
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Tom Wardill

Bug Description

I'm building a multiarch base OCI image. The builds for the different architectures take different amounts of time, so the web UI for the OCI build shows the value "Partial" under "upload status".
That's correct because (in my case) the amd64 image has been uploaded while arm64 is still building.

The problem is that (in my example) arm64 can no longer use the edge tag, because the manifest list for that tag no longer has an arm64 entry:

$ docker pull --platform arm64 toabctl/ubtest2:base-21.10_edge
base-21.10_edge: Pulling from toabctl/ubtest2
no matching manifest for linux/arm64 in the manifest list entries

But amd64 works because the image is already there:

$ docker pull --platform amd64 toabctl/ubtest2:base-21.10_edge
base-21.10_edge: Pulling from toabctl/ubtest2
Digest: sha256:63a789e2a398bbf8c9419666ce9af07dd7b79fbd250f669c5c532ef3679a14f0
Status: Image is up to date for toabctl/ubtest2:base-21.10_edge
docker.io/toabctl/ubtest2:base-21.10_edge

So this is a race (and potentially a very long one, depending on the build/upload speed).

Here's the full manifest:

$ docker manifest inspect toabctl/ubtest2:base-21.10_edge
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 446,
         "digest": "sha256:b50f003e829cdd530e77a570b94ead370fab9de263b8cdf47bf43102c3a2ee2b",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      }
   ]
}
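
As a rough illustration of the symptom, the following Python sketch reports which expected architectures are absent from a manifest list. The manifest is the one shown above; the function name `missing_architectures` and the list of expected architectures are assumptions made for this example only.

```python
import json

# The manifest list returned by "docker manifest inspect" above.
MANIFEST_LIST = json.loads("""
{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 446,
         "digest": "sha256:b50f003e829cdd530e77a570b94ead370fab9de263b8cdf47bf43102c3a2ee2b",
         "platform": {"architecture": "amd64", "os": "linux"}
      }
   ]
}
""")


def missing_architectures(manifest_list, expected):
    """Return the expected architectures that have no entry in the list."""
    present = {
        entry["platform"]["architecture"]
        for entry in manifest_list.get("manifests", [])
    }
    return sorted(set(expected) - present)


# Hypothetical expectation: the recipe builds for amd64 and arm64.
print(missing_architectures(MANIFEST_LIST, ["amd64", "arm64"]))
```

With the manifest above, only arm64 is reported as missing, which matches the failing `docker pull --platform arm64`.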

Tags: lp-oci

tags: added: oci
Thomas Bechtold (toabctl) wrote :

This gets worse when one build succeeded and was already pushed while another build failed (see e.g. https://launchpad.net/~cloud-images-release-managers/cloud-images/+oci/ubuntu-base/+recipe/21.04/ and the attached screenshot, where s390x failed to upload).

In that case, some architectures are uploaded to the target registry and others are not. That leads to an inconsistent state on the registry where some arch images are up-to-date and others are still old.

Colin Watson (cjwatson) wrote :

We talked about this a bit in our last team meeting, and I promised to write up the conclusions.

Design
======

There seems to have been some disagreement about exactly what we're trying to achieve, e.g. a suggestion that this is by design and that we want to avoid a given tag showing image versions from different build sets across architectures (which would inherently conflict with fixing this bug).

My design statement is that we want the behaviour to be as close to the observed behaviour of the snap store as possible within the semantics of OCI registries: as such, uploading a new image to a tag for one architecture should in general not affect other architectures. If you're following a tag such as "edge" that's used as a direct upload target for builds, then it is expected that that tag may sometimes refer to builds from a different build set across architectures. People who want a consistent view, in the sense of only ever seeing builds from a single build set for the same tag on different architectures, should use some other tag (such as "stable" or "candidate") that's maintained by promoting full build sets from elsewhere.

This corresponds to the behaviour of the snap store (where "edge" is typically a direct upload target for builds and may skew across architectures, but less risky channels such as "stable" are normally supposed to be maintained by promotion of build sets from elsewhere rather than by direct uploads). It also corresponds to the behaviour of the primary Ubuntu archive, where e.g. "impish-proposed" is a direct upload target for builds and may skew across architectures in various ways, but "impish" is kept consistent by CI machinery.

There is one wrinkle: if a Launchpad recipe was previously configured to build for a certain set of architectures and then is later reconfigured to build for a smaller set, then I think that should probably cause the removed architectures to be removed from the corresponding multi-arch manifests in registries (at least when later builds happen). This isn't a firm design requirement in the way that the above is, but it seems likely to be what most people want to happen. So, if there are no objections, I'd suggest that we should preserve only those entries in the multi-arch manifest for a given registry/image/tag that correspond to architectures that are still configured in the recipe.
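
The design described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Launchpad's implementation: the names `merge_manifest_list` and `entry`, and the placeholder digests, are hypothetical. Uploading one architecture replaces only that architecture's entry, other architectures are left alone, and entries for architectures no longer configured in the recipe are dropped.

```python
MANIFEST_LIST_TYPE = "application/vnd.docker.distribution.manifest.list.v2+json"


def merge_manifest_list(current, new_entry, configured_archs):
    """Return a new manifest list with new_entry merged into current.

    Entries for other architectures are preserved, unless their
    architecture is no longer in configured_archs.
    """
    new_arch = new_entry["platform"]["architecture"]
    kept = [
        m for m in (current or {}).get("manifests", [])
        if m["platform"]["architecture"] != new_arch
        and m["platform"]["architecture"] in configured_archs
    ]
    return {
        "schemaVersion": 2,
        "mediaType": MANIFEST_LIST_TYPE,
        "manifests": kept + [new_entry],
    }


def entry(arch, digest):
    """Build a minimal manifest list entry (hypothetical helper)."""
    return {"digest": digest, "platform": {"architecture": arch, "os": "linux"}}


# Placeholder digests; s390x was removed from the recipe's configuration.
current = {
    "manifests": [
        entry("amd64", "sha256:old-amd64"),
        entry("arm64", "sha256:old-arm64"),
        entry("s390x", "sha256:old-s390x"),
    ]
}
merged = merge_manifest_list(
    current, entry("amd64", "sha256:new-amd64"), {"amd64", "arm64"}
)
for m in merged["manifests"]:
    print(m["platform"]["architecture"], m["digest"])
```

In this example the arm64 entry keeps pointing at its older build while amd64 moves to the new one, which is the cross-architecture skew that the design accepts for direct upload targets such as "edge".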

Implementation
==============

The way the current implementation works is quite confusing, and as a result I spent a while being unable to see why this bug was happening until I cross-referenced it with some logs.

It looks as though `OCIRegistryClient.makeMultiArchManifest` fetches the current manifest from the registry and mutates it, which would achieve roughly the design I outlined above (with the exception of removing architectures when they've been removed from the recipe's configuration). However, what actually happens on each upload is that we first call `OCIRegistryClient.upload`, which pushes a single-architecture manifest to the registry, and then in a second step we call `OCIRegistryClient.uploadManifestList`, which fetches the current manifest from the registry and augments it (which in p...


Changed in launchpad:
status: New → Triaged
importance: Undecided → High
tags: added: lp-oci
removed: oci
Tom Wardill (twom)
Changed in launchpad:
status: Triaged → In Progress
assignee: nobody → Tom Wardill (twom)
Colin Watson (cjwatson)
Changed in launchpad:
status: In Progress → Fix Released