Launchpad itself

storeBuildInfo is sometimes ineffective

Bug #717969 reported by William Grant on 2011-02-13

This bug affects 8 people

Affects		Status	Importance	Assigned to	Milestone
	Launchpad itself	Fix Released	Critical	William Grant

Bug Description

As seen in OOPS-1869BUILDMASTER1, https://launchpad.net/~leenx-leet/+archive/drupal-mercury/+buildjob/2188875 managed to get into MANUALDEPWAIT without any of the other fields set (dependencies, datebuilt, builder, etc.). It looked like this:

{"pocket": "Release", "current_source_publication_link": "https://launchpad.net/api/devel/~leenx-leet/+archive/drupal-mercury/+sourcepub/1451976", "can_be_rescored": false, "build_log_url": null, "datebuilt": null, "datecreated": "2011-01-22T17:15:34.099091+00:00", "title": "i386 build of php-uploadprogress 1.0.1-4 in ubuntu lucid RELEASE", "distribution_link": "https://launchpad.net/api/devel/ubuntu", "changesfile_url": null, "upload_log_url": null, "web_link": "https://launchpad.net/~leenx-leet/+archive/drupal-mercury/+buildjob/2188875", "buildstate": "Dependency wait", "can_be_retried": true, "archive_link": "https://launchpad.net/api/devel/~leenx-leet/+archive/drupal-mercury", "dependencies": null, "date_first_dispatched": "2011-01-22T17:15:52.945725+00:00", "http_etag": "\"38bdc23b64d668d910687c720f6595894f372a05-4475179bff1d73327284e8ad797f57f42a81d7d6\"", "self_link": "https://launchpad.net/api/devel/~leenx-leet/+archive/drupal-mercury/+buildjob/2188875", "arch_tag": "i386", "score": null, "resource_type_link": "https://launchpad.net/api/devel/#build"}

buildd-manager.log-20110123.bz2 shows that it was dispatched and depwaited on kutjera. _handleStatus_DEPFAIL's build_info_stored was clearly run, which very strongly suggests that storeBuildInfo was too. Yet none of the attributes that it sets are set.

Is there a commit missing somewhere after storeBuildInfo's got_log callback?

Tags:

Related branches

lp:~jtv/launchpad/pre-717969

Merged into lp:launchpad at revision 14085

Stuart Bishop (community): Approve on 2011-09-29

lp:~jtv/launchpad/bug-717969

Merged into lp:launchpad at revision 14459

Julian Edwards (community): Approve on 2011-12-08

lp:~wgrant/launchpad/commit-ALL-the-build-changes

Merged into lp:launchpad at revision 16445

Steve Kowalik (community): Approve (code) on 2013-01-23

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2011-03-02:

I concur that it's probably a missing commit(), we need to figure out where. It's also worth working out why the buildd-manager has not recovered the bad build, and fix that too.

tags:

added: buildd-manager
removed: soyuz-build

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2011-03-02:

By recovered, I mean that there's a lot of code in the manager that checks what *we think* is on a builder, versus what *it thinks* is building.

If the builder is IDLE, we reset the local job.
If the builder is building something we don't recognise, it gets reset.

Julian Edwards (julian-edwards) on 2011-06-01

tags:

added: boobytrap

Revision history for this message

Colin Watson (cjwatson) wrote on 2011-09-19:

clang-build.png Edit (245.0 KiB, image/png)

Julian thinks that a build being simultaneously "Currently building" and "Finished at ..." may be another case of this bug. Screenshot attached. The API representation is:

{"pocket": "Release", "current_source_publication_link": "https://launchpad.net/api/devel/ubuntu/+archive/primary/+sourcepub/1946170", "can_be_rescored": false, "build_log_url": "https://launchpad.net/ubuntu/+source/clang/2.9-11ubuntu1/+build/2792745/+files/buildlog_ubuntu-oneiric-armel.clang_2.9-11ubuntu1_BUILDING.txt.gz", "datebuilt": "2011-09-19T01:09:46.179013+00:00", "datecreated": "2011-09-18T21:55:41.115144+00:00", "title": "armel build of clang 2.9-11ubuntu1 in ubuntu oneiric RELEASE", "distribution_link": "https://launchpad.net/api/devel/ubuntu", "changesfile_url": null, "upload_log_url": null, "web_link": "https://launchpad.net/ubuntu/+source/clang/2.9-11ubuntu1/+build/2792745", "buildstate": "Currently building", "can_be_retried": false, "archive_link": "https://launchpad.net/api/devel/ubuntu/+archive/primary", "dependencies": null, "date_first_dispatched": "2011-09-18T22:01:46.322739+00:00", "http_etag": "\"d6a91335eaec20fb98c8ad492f814ec0dbf44365-f576d05917bb703dc058781e4e1f27e4b04d63f3\"", "self_link": "https://launchpad.net/api/devel/ubuntu/+source/clang/2.9-11ubuntu1/+build/2792745", "arch_tag": "armel", "score": 1755, "resource_type_link": "https://launchpad.net/api/devel/#build"}

Jeroen T. Vermeulen (jtv) on 2011-09-21

Changed in launchpad:
status:	Triaged → In Progress
assignee:	nobody → Jeroen T. Vermeulen (jtv)

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-09-22:

I believe the relevant part of the call tree goes something like this:

  → SlaveScanner.startCycle
    → LoopingCall(self.singleCycle)
    → self.stopping_deferred = self.loop.start()
    → SlaveScanner.singleCycle
      → Add errback: SlaveScanner._scanFailed
      → SlaveScanner.scan
        → Builder.updateStatus or defer.succeed
        → Add callback: status_updated
          → transaction.commit
          → Builder.updateBuild
            → BuildFarmJobBehaviorBase.updateBuild
              → Builder.slaveStatus
                → BuilderSlave.status
                  → /!\ With timeout: call remote: "status"
                → Add callback: got_status
                  → BaseBuildBehavior.updateSlaveStatus
              → Add callback: got_status
                  → BuildFarmJobBehaviorBase.updateBuild_WAITING
                    → PackageBuildDerived.handleStatus
                      → PackageBuildDerived._handleStatus_DEPFAIL
                        → self.status = BuildStatus.MANUALDEPWAIT ← THIS HAPPENS
                        → PackageBuild.storeBuildInfo
                          → PackageBuild.getLogFromSlave
                            → Builder.transferSlaveFileToLibrarian
                              → BuilderSlave.getFile
                                → downloadPage
                              → Add callback: got_file
                                → [Stores file in librarian]
                          → Add callback: got_log
                            → Set dependencies, date_finished etc. ← THIS DOESN'T HAPPEN
                        → Add callback: build_info_stored
                          → BinaryPackageBuild.notify ← THIS HAPPENS
                            → [Send email]
                          → Builder.cleanSlave
                            → BuilderSlave.clean
                              → With timeout: call remote: "clean"
                          → Add callback: BuildQueue.destroySelf
                            → specific_job.cleanUp
                              → [No-op for non-Old jobs]
        → Add callback: build_updated
          → transaction.commit
          → Builder.isAvailable
            → Add callback: got_available

Note how BuilderSlave.status creates its own deferred, and the whole chain inside status_updated seems to hang off that deferred. While that deferred is waiting, won't status_updated simply return with the remainder of its business pending an answer from the slave, passing the baton to build_updated which commits a partially complete state?

I believe the relevant part of the call tree goes something like this:

→ SlaveScanner.startCycle
    → LoopingCall(self.singleCycle)
    → self.stopping_deferred = self.loop.start()
    → SlaveScanner.singleCycle
      → Add errback: SlaveScanner._scanFailed
      → SlaveScanner.scan
        → Builder.updateStatus or defer.succeed
        → Add callback: status_updated
          → transaction.commit
          → Builder.updateBuild
            → BuildFarmJobBehaviorBase.updateBuild
              → Builder.slaveStatus
                → BuilderSlave.status
                  → /!\ With timeout: call remote: "status"
                → Add callback: got_status
                  → BaseBuildBehavior.updateSlaveStatus
              → Add callback: got_status
                  → BuildFarmJobBehaviorBase.updateBuild_WAITING
                    → PackageBuildDerived.handleStatus
                      → PackageBuildDerived._handleStatus_DEPFAIL
                        → self.status = BuildStatus.MANUALDEPWAIT ← THIS HAPPENS
                        → PackageBuild.storeBuildInfo
                          → PackageBuild.getLogFromSlave
                            → Builder.transferSlaveFileToLibrarian
                              → BuilderSlave.getFile
                                → downloadPage
                              → Add callback: got_file
                                → [Stores file in librarian]
                          → Add callback: got_log
                            → Set dependencies, date_finished etc.  ← THIS DOESN'T HAPPEN
                        → Add callback: build_info_stored
                          → BinaryPackageBuild.notify ← THIS HAPPENS
                            → [Send email]
                          → Builder.cleanSlave
                            → BuilderSlave.clean
                              → With timeout: call remote: "clean"
                          → Add callback: BuildQueue.destroySelf
                            → specific_job.cleanUp
                              → [No-op for non-Old jobs]
        → Add callback: build_updated
          → transaction.commit
          → Builder.isAvailable
            → Add callback: got_available

Note how BuilderSlave.status creates its own deferred, and the whole chain inside status_updated seems to hang off that deferred.  While that deferred is waiting, won't status_updated simply return with the remainder of its business pending an answer from the slave, passing the baton to build_updated which commits a partially complete state?

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-09-22:

Apparently not: BuilderSlave.status returns the Deferred it creates, and when that happens, the outer Deferred won't continue to execute until the nested one has completed.

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-09-27:

Another thing that puzzles me (just iterating the list...) is why singleCycle registers an errback _before_ starting the scan. As I understand twisted, that will only catch errors from the previous cycle.

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-09-27:

It was a mistake in my diagram: the scan does happen before the errback is registered.

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-09-27:

We (Gavin, Jeroen, Julian) just had a call about this. The other thing we're worried about is that there are callback boundaries between where the build is updated and where the changes are committed. Those are where Twisted is supposed to give time to interaction with other builders, using the same database store. Any commit or abort will affect all ongoing threads of execution.

We believe that the existing code is simply wrong in this regard. Model changes made in a callback must be committed in the same callback. It looks like the changes to the build should be committed right after they're made.

Our plan for this is to ensure (through rollbacks, database access policies, barbed wire, anything we can get really) that no step in the buildmaster's twisted code changes database objects without explicitly committing the changes.

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-09-30:

To keep the analysis record going: as far as I can see, getLogFromSlave is asynchronous. (I'd like to say it "blocks," but of course that's not technically correct since it returns control to the reactor while its callback chain waits).

My understanding is that at that point, a different builder could go through build_updated and accidentally commit the pending changes from handleStatus_DEPFAIL as a side effect. Then at some point control would return, getting us to got_log where the build's status is updated. Things continue up to the "clean" call to the slave; at that point another builder could conceivably go through an error and abort the pending changes from got_log.

If that is what happened, I think it would explain why the log shows no error for this build: other builds would have committed one part of this build's changes and rolled back another.

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-10-01:

#10

The branch I just attached puts the buildd-manager in a read-only transaction policy, and wraps the few isolated exceptions to that policy in read-write policies. There's a lot more testing to do though before we can be reasonably confident that it'll work.

Revision history for this message

Launchpad QA Bot (lpqabot) wrote on 2011-10-01:

#11

Fixed in stable r14085 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/14085>.

tags:	added: qa-needstesting
Changed in launchpad:
status:	In Progress → Fix Committed

Jeroen T. Vermeulen (jtv) on 2011-10-01

tags:

added: qa-untestable
removed: qa-needstesting

Julian Edwards (julian-edwards) on 2011-10-03

Changed in launchpad:
status:	Fix Committed → In Progress

Revision history for this message

Steve McInerney (spm) wrote on 2011-11-04:

#12

Not sure where we're at with this bug?
https://bugs.launchpad.net/launchpad/+bug/868917 we trigged a critical nagios alert about a month ago; which is still critical and red.
Is there any need to continue to have this nagios alert if it can be ignored for so long?
Or should it be dropped from nagios to be a scriptactivity like alert that can email dev's?

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-11-04:

#13

I hadn't realized this was setting off alerts. Pre-landing Q/A for the code fix is blocked, and I've been distracted by “more Critical” bugs (the “most Critical” of which should be out of the way soon). Unfortunately all these bugs have profound underlying architectural causes that are not easy to address.

I'll ask Julian to check again if we can get the resources we need to move past the blockage. Meanwhile, the code fix isn't going to fix the existing broken record in bug 868917; it'll have to be done by hand. Will that at least stop the red light blinking for now? I doubt we'd be able to find a single reliable recipe for identifying or fixing these broken records.

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2011-11-04: Re: [Bug 717969] Re: storeBuildInfo is sometimes ineffective

#14

On Friday 04 November 2011 03:52:02 you wrote:
> Not sure where we're at with this bug?
> https://bugs.launchpad.net/launchpad/+bug/868917 we trigged a critical
> nagios alert about a month ago; which is still critical and red. Is there
> any need to continue to have this nagios alert if it can be ignored for so
> long? Or should it be dropped from nagios to be a scriptactivity like alert
> that can email dev's?

Steve - we need the alert still. When the build is corrupted you need to run
the build-killing SQL that you have on your crib sheet.

Cheers.

Revision history for this message

Launchpad QA Bot (lpqabot) wrote on 2011-12-09:

#15

Fixed in stable r14459 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/14459>.

tags:	added: qa-needstesting removed: qa-untestable
Changed in launchpad:
status:	In Progress → Fix Committed

Revision history for this message

Jeroen T. Vermeulen (jtv) wrote on 2011-12-13:

#16

Staging build manager is building away happily. Builds are succeeding, there's the occasional legitimate failure, and logs are accessible in the librarian.

tags:

added: qa-ok
removed: qa-needstesting

William Grant (wgrant) on 2012-01-03

Changed in launchpad:
status:	Fix Committed → In Progress

Revision history for this message

Robert Collins (lifeless) wrote on 2012-01-03:

#17

@julian what sql from their crib sheet?
12:42 < lifeless> 'Steve - we need the alert still. When the build is corrupted you need to run
12:42 < lifeless> the build-killing SQL that you have on your crib sheet.'
12:42 < lifeless> whats the build killing SQL you have?
12:42 < spm> yes. that's the problem. we don't have build killing sql.

Revision history for this message

Tom Haddon (mthaddon) wrote on 2012-01-04:

#18

I'm assuming he means https://wiki.canonical.com/InformationInfrastructure/IS/BuilddFixing#builds_that_kill_machines here

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2012-01-04:

#19

Correct!

Curtis Hovey (sinzui) on 2012-09-11

Changed in launchpad:
assignee:	Jeroen T. Vermeulen (jtv) → nobody
status:	In Progress → Triaged

William Grant (wgrant) on 2012-10-22

tags:

added: soyuz-build

William Grant (wgrant) on 2013-01-23

Changed in launchpad:
assignee:	nobody → William Grant (wgrant)
status:	Triaged → In Progress

Revision history for this message

Launchpad QA Bot (lpqabot) wrote on 2013-01-23:

#20

Fixed in stable r16445 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/16445>.

tags:	added: qa-needstesting removed: qa-ok
Changed in launchpad:
status:	In Progress → Fix Committed

William Grant (wgrant) on 2013-01-23

tags:

added: qa-untestable
removed: qa-needstesting

William Grant (wgrant) on 2013-01-23

Changed in launchpad:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

clang-build.png Edit

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.