Code review comment for ~pelpsi/launchpad:duplicate_ci_jobs

Simone Pelosi (pelpsi) wrote :

> I see that you had to make quite extensive changes to account for one of my
> earlier comments asking if the structure of `stages` was right. For the
> record, it's totally fine to push back on me if I ask this sort of question
> and dealing with it seems to have massive fallout. Looking at this, I think
> we need to step back and think about it a bit more.
>
> The problem with what we've ended up with here is that the format of the data
> structure in `CIBuild.stages` (as opposed to whatever intermediate objects we
> use in the process of constructing that) has changed, and I don't think that
> can be the right solution to this problem. That data structure is passed to
> the builder, so changing its format will almost certainly break the ability to
> run CI jobs on the Launchpad build farm.
>
> The structure that ends up in `CIBuild.stages` is more or less a cleaned-up
> version of `pipeline` from
> https://lpci.readthedocs.io/en/latest/configuration.html. It's supposed to be
> a list of stages, each stage being a list of jobs, and each job being a tuple
> of (job_name, job_index). Within a stage, each job is executed (the
> specification allows for this to be in parallel, but in practice right now
> it's in series); all jobs in the stage are run, but the stage fails if any of
> those jobs fail. Each stage is executed in sequence, with subsequent stages
> only executing if previous stages succeeded. We need to preserve something
> compatible with this, as it's part of the build dispatch protocol; and we need
> to make sure that the structure continues to reflect what's in the `pipeline`
> configuration we've read, even if we're filtering it by series.
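The stage semantics described above can be sketched as follows (a hypothetical illustration of the execution model, not actual Launchpad or builder code):

```python
# Sketch of the stage semantics: every job in a stage runs, the stage
# fails if any job fails, and later stages run only if all earlier
# stages succeeded.

def run_stages(stages, run_job):
    """stages: a list of stages, each stage a list of (job_name, job_index).

    run_job is a callable returning True on success, False on failure.
    """
    for stage in stages:
        # All jobs in the stage are executed (in series here, though the
        # specification would also allow running them in parallel).
        results = [run_job(name, index) for name, index in stage]
        if not all(results):
            # A failed stage stops the pipeline; later stages never run.
            return False
    return True


# Example: the second stage only runs because the first succeeded.
stages = [[("build-snap", 0)], [("test-snap", 0), ("test-snap", 1)]]
executed = []
ok = run_stages(stages, lambda name, index: executed.append((name, index)) or True)
# ok is True; executed lists all three jobs in pipeline order.
```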
>
> But there's another problem here. As mentioned above, the intended semantics
> are that jobs in later pipeline stages aren't run if earlier stages fail, and
> it's not really obvious how that works with multi-series pipelines given the
> current arrangements. Consider this hypothetical pipeline to build an
> artifact once and then test it on multiple series (which might make sense for
> artifacts like snaps that are generally supposed to be installable on multiple
> series, but that might have non-trivial interactions with the host system):
>
> pipeline:
>   - build-snap
>   - test-snap
> jobs:
>   build-snap:
>     series: focal
>     architectures: [amd64, arm64]
>     build-snaps: [snapcraft]
>     run: snapcraft
>   test-snap:
>     matrix:
>       - series: bionic
>         architectures: [amd64]
>       - series: focal
>         architectures: [amd64, arm64]
>       - series: jammy
>         architectures: [amd64, arm64]
>     run: ...
>
> This needs to run `build-snap` on focal first, for each of amd64 and arm64;
> and then if and only if that succeeds, it needs to run `test-snap` on each of
> bionic (only on amd64, for the sake of a more interesting example), focal, and
> jammy. If we dispatch entirely separate jobs to the build farm that only
> contain the CI jobs for each of those releases, this isn't going to work as
> intended.
>
> What I think `CIBuild.requestBuildsForRefs` needs to do here is to work out a
> list of build farm jobs it needs to request, each of which will run an
> appropriate subset of the pipeline depending on the architecture. Each build
> farm job needs a compatible-enough `DistroArchSeries` (but note that this
> doesn't have to be equal to the CI jobs it'll be running - it just specifies
> the environment in which `lpci` will be executed, and it's then `lpci`'s job
> to execute the CI jobs themselves in containers of the right series): that
> could reasonably be the latest of the specified series for the given
> architecture. It would then request that build farm job with a list of stages
> filtered by architecture. So the example I gave above would request two build
> farm jobs, with jammy/amd64 and jammy/arm64 as their `DistroArchSeries` (this
> doesn't matter too much, but picking the latest one seems reasonable). The
> amd64 build farm job would have this as its `stages`:
>
> [[("build-snap", 0)], [("test-snap", 0), ("test-snap", 1), ("test-snap", 2)]]
>
> ... while the arm64 build farm job would have this as its `stages`:
>
> [[("build-snap", 0)], [("test-snap", 1), ("test-snap", 2)]]
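The per-architecture filtering described above can be illustrated like this (a minimal sketch, not the actual `requestBuildsForRefs` implementation; the `jobs` dict is a hand-expanded form of the configuration, with the matrix flattened to one entry per index):

```python
# Hand-expanded form of the example configuration: each job name maps to
# its list of configurations, one per matrix entry (index order matters,
# since (job_name, job_index) tuples refer to these indices).
pipeline = ["build-snap", "test-snap"]
jobs = {
    "build-snap": [
        {"series": "focal", "architectures": ["amd64", "arm64"]},
    ],
    "test-snap": [
        {"series": "bionic", "architectures": ["amd64"]},
        {"series": "focal", "architectures": ["amd64", "arm64"]},
        {"series": "jammy", "architectures": ["amd64", "arm64"]},
    ],
}


def stages_for_architecture(pipeline, jobs, architecture):
    """Filter the pipeline down to the (job_name, job_index) pairs that
    apply to one architecture, preserving the stage structure."""
    stages = []
    for job_name in pipeline:
        stage = [
            (job_name, index)
            for index, config in enumerate(jobs[job_name])
            if architecture in config["architectures"]
        ]
        stages.append(stage)
    return stages


print(stages_for_architecture(pipeline, jobs, "amd64"))
# [[('build-snap', 0)], [('test-snap', 0), ('test-snap', 1), ('test-snap', 2)]]
print(stages_for_architecture(pipeline, jobs, "arm64"))
# [[('build-snap', 0)], [('test-snap', 1), ('test-snap', 2)]]
```

The two printed values match the amd64 and arm64 `stages` given in the review.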
>
> Given this model, there are going to be some pipeline specifications that we
> can't execute right now. For example, one that asks for a build to run on
> amd64 and then to be tested on multiple architectures (let's say it's building
> a pure-Python wheel) is conceptually reasonable, but we just don't have a way
> to orchestrate the sequencing of the necessary build farm jobs right now.
> We'll probably need to figure this out at some point, but for the moment I
> think it would be fine (though not ideal) for such cases to log an error and
> not request any builds.
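One possible way to detect such unsupported pipelines (a hypothetical heuristic, not anything currently in Launchpad): after filtering per architecture, an empty earlier stage followed by a non-empty later stage means that architecture's jobs depend on work done in a different build farm job, which cannot be orchestrated yet.

```python
import logging


def check_supported(stages, architecture):
    """stages: the per-architecture filtered list of stages.

    Returns False (after logging an error) if a non-empty stage follows
    an empty one, i.e. a later stage depends on a stage that has no jobs
    for this architecture.
    """
    seen_empty = False
    for stage in stages:
        if not stage:
            seen_empty = True
        elif seen_empty:
            logging.error(
                "Cannot orchestrate pipeline for %s: a later stage depends "
                "on a stage with no jobs for this architecture", architecture)
            return False
    return True


# e.g. build a wheel on amd64 only, then test on amd64 and arm64: the arm64
# job would have an empty first stage, so no builds should be requested.
print(check_supported([[], [("test-wheel", 0)]], "arm64"))              # False
print(check_supported([[("build-snap", 0)], [("test-snap", 1)]], "arm64"))  # True
```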
>
> This is complicated stuff even to try to explain! Feel free to ask me if you
> need me to clarify anything here.

That makes sense. I reverted to the first implementation and added a test case covering the example in this comment, to check that the actual behavior matches the expected one.
