storeBuildInfo is sometimes ineffective

Bug #717969 reported by William Grant
52
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Critical
William Grant

Bug Description

As seen in OOPS-1869BUILDMASTER1, https://launchpad.net/~leenx-leet/+archive/drupal-mercury/+buildjob/2188875 managed to get into MANUALDEPWAIT without any of the other fields set (dependencies, datebuilt, builder, etc.). It looked like this:

{"pocket": "Release", "current_source_publication_link": "https://launchpad.net/api/devel/~leenx-leet/+archive/drupal-mercury/+sourcepub/1451976", "can_be_rescored": false, "build_log_url": null, "datebuilt": null, "datecreated": "2011-01-22T17:15:34.099091+00:00", "title": "i386 build of php-uploadprogress 1.0.1-4 in ubuntu lucid RELEASE", "distribution_link": "https://launchpad.net/api/devel/ubuntu", "changesfile_url": null, "upload_log_url": null, "web_link": "https://launchpad.net/~leenx-leet/+archive/drupal-mercury/+buildjob/2188875", "buildstate": "Dependency wait", "can_be_retried": true, "archive_link": "https://launchpad.net/api/devel/~leenx-leet/+archive/drupal-mercury", "dependencies": null, "date_first_dispatched": "2011-01-22T17:15:52.945725+00:00", "http_etag": "\"38bdc23b64d668d910687c720f6595894f372a05-4475179bff1d73327284e8ad797f57f42a81d7d6\"", "self_link": "https://launchpad.net/api/devel/~leenx-leet/+archive/drupal-mercury/+buildjob/2188875", "arch_tag": "i386", "score": null, "resource_type_link": "https://launchpad.net/api/devel/#build"}

buildd-manager.log-20110123.bz2 shows that it was dispatched and depwaited on kutjera. _handleStatus_DEPFAIL's build_info_stored was clearly run, which very strongly suggests that storeBuildInfo was too. Yet none of the attributes that it sets are set.

Is there a commit missing somewhere after storeBuildInfo's got_log callback?

Related branches

Revision history for this message
Julian Edwards (julian-edwards) wrote :

I concur that it's probably a missing commit(), we need to figure out where. It's also worth working out why the buildd-manager has not recovered the bad build, and fix that too.

tags: added: buildd-manager
removed: soyuz-build
Revision history for this message
Julian Edwards (julian-edwards) wrote :

By recovered, I mean that there's a lot of code in the manager that checks what *we think* is on a builder, versus what *it thinks* is building.

If the builder is IDLE, we reset the local job.
If the builder is building something we don't recognise, it gets reset.

tags: added: boobytrap
Revision history for this message
Colin Watson (cjwatson) wrote :

Julian thinks that a build being simultaneously "Currently building" and "Finished at ..." may be another case of this bug. Screenshot attached. The API representation is:

  {"pocket": "Release", "current_source_publication_link": "https://launchpad.net/api/devel/ubuntu/+archive/primary/+sourcepub/1946170", "can_be_rescored": false, "build_log_url": "https://launchpad.net/ubuntu/+source/clang/2.9-11ubuntu1/+build/2792745/+files/buildlog_ubuntu-oneiric-armel.clang_2.9-11ubuntu1_BUILDING.txt.gz", "datebuilt": "2011-09-19T01:09:46.179013+00:00", "datecreated": "2011-09-18T21:55:41.115144+00:00", "title": "armel build of clang 2.9-11ubuntu1 in ubuntu oneiric RELEASE", "distribution_link": "https://launchpad.net/api/devel/ubuntu", "changesfile_url": null, "upload_log_url": null, "web_link": "https://launchpad.net/ubuntu/+source/clang/2.9-11ubuntu1/+build/2792745", "buildstate": "Currently building", "can_be_retried": false, "archive_link": "https://launchpad.net/api/devel/ubuntu/+archive/primary", "dependencies": null, "date_first_dispatched": "2011-09-18T22:01:46.322739+00:00", "http_etag": "\"d6a91335eaec20fb98c8ad492f814ec0dbf44365-f576d05917bb703dc058781e4e1f27e4b04d63f3\"", "self_link": "https://launchpad.net/api/devel/ubuntu/+source/clang/2.9-11ubuntu1/+build/2792745", "arch_tag": "armel", "score": 1755, "resource_type_link": "https://launchpad.net/api/devel/#build"}

Changed in launchpad:
status: Triaged → In Progress
assignee: nobody → Jeroen T. Vermeulen (jtv)
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

I believe the relevant part of the call tree goes something like this:

  → SlaveScanner.startCycle
    → LoopingCall(self.singleCycle)
    → self.stopping_deferred = self.loop.start()
    → SlaveScanner.singleCycle
      → Add errback: SlaveScanner._scanFailed
      → SlaveScanner.scan
        → Builder.updateStatus or defer.succeed
        → Add callback: status_updated
          → transaction.commit
          → Builder.updateBuild
            → BuildFarmJobBehaviorBase.updateBuild
              → Builder.slaveStatus
                → BuilderSlave.status
                  → /!\ With timeout: call remote: "status"
                → Add callback: got_status
                  → BaseBuildBehavior.updateSlaveStatus
              → Add callback: got_status
                  → BuildFarmJobBehaviorBase.updateBuild_WAITING
                    → PackageBuildDerived.handleStatus
                      → PackageBuildDerived._handleStatus_DEPFAIL
                        → self.status = BuildStatus.MANUALDEPWAIT ← THIS HAPPENS
                        → PackageBuild.storeBuildInfo
                          → PackageBuild.getLogFromSlave
                            → Builder.transferSlaveFileToLibrarian
                              → BuilderSlave.getFile
                                → downloadPage
                              → Add callback: got_file
                                → [Stores file in librarian]
                          → Add callback: got_log
                            → Set dependencies, date_finished etc. ← THIS DOESN'T HAPPEN
                        → Add callback: build_info_stored
                          → BinaryPackageBuild.notify ← THIS HAPPENS
                            → [Send email]
                          → Builder.cleanSlave
                            → BuilderSlave.clean
                              → With timeout: call remote: "clean"
                          → Add callback: BuildQueue.destroySelf
                            → specific_job.cleanUp
                              → [No-op for non-Old jobs]
        → Add callback: build_updated
          → transaction.commit
          → Builder.isAvailable
            → Add callback: got_available

Note how BuilderSlave.status creates its own deferred, and the whole chain inside status_updated seems to hang off that deferred. While that deferred is waiting, won't status_updated simply return with the remainder of its business pending an answer from the slave, passing the baton to build_updated which commits a partially complete state?

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Apparently not: BuilderSlave.status returns the Deferred it creates, and when that happens, the outer Deferred won't continue to execute until the nested one has completed.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Another thing that puzzles me (just iterating the list...) is why singleCycle registers an errback _before_ starting the scan. As I understand twisted, that will only catch errors from the previous cycle.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

It was a mistake in my diagram: the scan does happen before the errback is registered.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

We (Gavin, Jeroen, Julian) just had a call about this. The other thing we're worried about is that there are callback boundaries between where the build is updated and where the changes are committed. Those are where Twisted is supposed to give time to interaction with other builders, using the same database store. Any commit or abort will affect all ongoing threads of execution.

We believe that the existing code is simply wrong in this regard. Model changes made in a callback must be committed in the same callback. It looks like the changes to the build should be committed right after they're made.

Our plan for this is to ensure (through rollbacks, database access policies, barbed wire, anything we can get really) that no step in the buildmaster's twisted code changes database objects without explicitly committing the changes.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

To keep the analysis record going: as far as I can see, getLogFromSlave is asynchronous. (I'd like to say it "blocks," but of course that's not technically correct since it returns control to the reactor while its callback chain waits).

My understanding is that at that point, a different builder could go through build_updated and accidentally commit the pending changes from handleStatus_DEPFAIL as a side effect. Then at some point control would return, getting us to got_log where the build's status is updated. Things continue up to the "clean" call to the slave; at that point another builder could conceivably go through an error and abort the pending changes from got_log.

If that is what happened, I think it would explain why the log shows no error for this build: other builds would have committed one part of this build's changes and rolled back another.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

The branch I just attached puts the buildd-manager in a read-only transaction policy, and wraps the few isolated exceptions to that policy in read-write policies. There's a lot more testing to do though before we can be reasonably confident that it'll work.

Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
tags: added: qa-untestable
removed: qa-needstesting
Changed in launchpad:
status: Fix Committed → In Progress
Revision history for this message
Steve McInerney (spm) wrote :

Not sure where we're at with this bug?
https://bugs.launchpad.net/launchpad/+bug/868917 we trigged a critical nagios alert about a month ago; which is still critical and red.
Is there any need to continue to have this nagios alert if it can be ignored for so long?
Or should it be dropped from nagios to be a scriptactivity like alert that can email dev's?

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

I hadn't realized this was setting off alerts. Pre-landing Q/A for the code fix is blocked, and I've been distracted by “more Critical” bugs (the “most Critical” of which should be out of the way soon). Unfortunately all these bugs have profound underlying architectural causes that are not easy to address.

I'll ask Julian to check again if we can get the resources we need to move past the blockage. Meanwhile, the code fix isn't going to fix the existing broken record in bug 868917; it'll have to be done by hand. Will that at least stop the red light blinking for now? I doubt we'd be able to find a single reliable recipe for identifying or fixing these broken records.

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 717969] Re: storeBuildInfo is sometimes ineffective

On Friday 04 November 2011 03:52:02 you wrote:
> Not sure where we're at with this bug?
> https://bugs.launchpad.net/launchpad/+bug/868917 we trigged a critical
> nagios alert about a month ago; which is still critical and red. Is there
> any need to continue to have this nagios alert if it can be ignored for so
> long? Or should it be dropped from nagios to be a scriptactivity like alert
> that can email dev's?

Steve - we need the alert still. When the build is corrupted you need to run
the build-killing SQL that you have on your crib sheet.

Cheers.

Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
removed: qa-untestable
Changed in launchpad:
status: In Progress → Fix Committed
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Staging build manager is building away happily. Builds are succeeding, there's the occasional legitimate failure, and logs are accessible in the librarian.

tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → In Progress
Revision history for this message
Robert Collins (lifeless) wrote :

@julian what sql from their crib sheet?
12:42 < lifeless> 'Steve - we need the alert still. When the build is corrupted you need to run
12:42 < lifeless> the build-killing SQL that you have on your crib sheet.'
12:42 < lifeless> whats the build killing SQL you have?
12:42 < spm> yes. that's the problem. we don't have build killing sql.

Revision history for this message
Tom Haddon (mthaddon) wrote :
Revision history for this message
Julian Edwards (julian-edwards) wrote :

Correct!

Curtis Hovey (sinzui)
Changed in launchpad:
assignee: Jeroen T. Vermeulen (jtv) → nobody
status: In Progress → Triaged
William Grant (wgrant)
tags: added: soyuz-build
William Grant (wgrant)
Changed in launchpad:
assignee: nobody → William Grant (wgrant)
status: Triaged → In Progress
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
removed: qa-ok
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant)
tags: added: qa-untestable
removed: qa-needstesting
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.