celery hung after long git repository scan

Bug #1783315 reported by Colin Watson
This bug affects 4 people
Affects: Launchpad itself
Status: Triaged
Importance: Critical
Assigned to: Colin Watson
Milestone: (none)

Bug Description

launchpad@ackee's celery hung today. Log excerpts:

[2018-07-24 06:39:12,597: INFO/Worker-2] Running <GitRefScanJob for ~documentfoundation/df-libreoffice/+git/core> (ID 43870243) in status Waiting
[2018-07-24 06:39:12,603: DEBUG3/Worker-2] commit <storm.zope.zstorm.StoreDataManager object at 0x7fc9c57e9650>
[2018-07-24 06:39:12,604: DEBUG3/Worker-2] commit
[2018-07-24 06:39:12,604: DEBUG3/Worker-2] new transaction
[2018-07-24 06:39:12,628: INFO/Worker-2] Starting new HTTP connection (1): git.launchpad.net
[2018-07-24 06:39:18,995: DEBUG/Worker-2] "GET /repo/2115/refs HTTP/1.1" 200 14755548
[2018-07-24 06:40:07,538: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'39fbc8dc-3604-3738-a347-e6f482abf3c9', u'exchange': u'reply.celery.pidbox'} ticket:e8beeca3-cdbe-4daf-8ca2-7855189b4f01]
[2018-07-24 06:40:44,064: INFO/Worker-2] Requesting commit details for [u'f5b8531acc1e944c2e5a2bbee831e562fba89655']
[2018-07-24 06:42:28,485: DEBUG/MainProcess] basic.qos: prefetch_count->7
[2018-07-24 06:43:02,514: DEBUG/MainProcess] basic.qos: prefetch_count->6
[2018-07-24 06:44:12,577: WARNING/MainProcess] Soft time limit (300.0s) exceeded for lp.services.job.celeryjob.CeleryRunJobIgnoreResult[GitRefScanJob_43870243_1c10b906-f8e9-4edc-8b75-921f86234ee9]
[2018-07-24 06:44:58,615: DEBUG/MainProcess] basic.qos: prefetch_count->5
[2018-07-24 06:45:06,699: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'b3a4a1d2-bca7-3d14-ab06-90b4dbe252ba', u'exchange': u'reply.celery.pidbox'} ticket:93eea71c-7d3c-458d-9d5e-474783a05251]
[2018-07-24 06:46:12,681: DEBUG/MainProcess] basic.qos: prefetch_count->4
[2018-07-24 06:48:00,775: DEBUG/MainProcess] basic.qos: prefetch_count->3
[2018-07-24 06:50:06,999: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'73c37ffd-59aa-3fa6-a97c-09a49497ada3', u'exchange': u'reply.celery.pidbox'} ticket:9f356f5d-b925-48ae-912e-6f8402b2f383]
[2018-07-24 06:55:06,498: DEBUG/MainProcess] pidbox received method ping() [reply_to:{u'routing_key': u'9dd52797-6431-3d48-bd0c-2a0d39f8fe18', u'exchange': u'reply.celery.pidbox'} ticket:e0dcec74-33e4-4b5d-84d8-1b4a42fa2bf0]

... and then nothing else for several hours. The libreoffice repository has been fixed, but it's not obvious why the timeout (apparently) stalled celery.

Revision history for this message
Colin Watson (cjwatson) wrote:

This happened again the very next time there was a soft timeout on a repository scan, so there's obviously something repeatably wrong here.

Colin Watson (cjwatson)
tags: added: git lp-code regression
Changed in launchpad:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Colin Watson (cjwatson)
Revision history for this message
Colin Watson (cjwatson) wrote:

I can't be absolutely sure when this regressed, since such soft timeouts aren't very common; there've only been three since the last rollout. One thing I noticed was that a soft timeout on a PlainPackageCopyJob on 2018-07-16 didn't cause celery to hang, but soft timeouts on GitRefScanJob instances did. My current suspicion is that it has something to do with urlfetch's thread handling.
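
For illustration, here's a minimal sketch of the suspected failure mode (not Launchpad's actual urlfetch code; all names below are made up). Celery's prefork pool delivers the soft time limit to the worker process as a signal whose handler raises SoftTimeLimitExceeded in that process's main thread only; if the real network I/O is running in a helper thread, the helper keeps going, and any cleanup that joins it without a timeout leaves the worker wedged:

import os
import signal
import threading
import time

class SoftTimeLimitExceeded(Exception):
    """Stand-in for celery.exceptions.SoftTimeLimitExceeded."""

def on_soft_timeout(signum, frame):
    raise SoftTimeLimitExceeded()

# Model celery's delivery of the soft time limit: a signal whose handler
# raises in the worker's main thread.
signal.signal(signal.SIGUSR1, on_soft_timeout)

def slow_fetch():
    # Stands in for a long network operation (e.g. requesting commit
    # details); an exception raised in the main thread never reaches it.
    time.sleep(3600)

fetcher = threading.Thread(target=slow_fetch)
fetcher.start()

try:
    # Simulate celery firing the soft time limit at this worker process.
    os.kill(os.getpid(), signal.SIGUSR1)
    fetcher.join()
except SoftTimeLimitExceeded:
    # The cleanup path joins the helper thread without a timeout.  Since
    # that thread never saw the exception, this blocks for as long as the
    # fetch does, and the worker never picks up another task.
    fetcher.join()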

Revision history for this message
Launchpad QA Bot (lpqabot) wrote:
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Revision history for this message
Joel Sing (jsing) wrote:

This (or similar) seems to have recurred:

[2018-07-30 21:59:37,919: WARNING/MainProcess] Soft time limit (300.0s) exceeded for lp.services.job.celeryjob.CeleryRunJobIgnoreResult[GitRefScanJob_43982237_f031800a-c77b-463d-831f-d822a9740d7b]

No work appears to have been processed following that point (until celeryd was bounced).

William Grant (wgrant)
tags: added: qa-ok
removed: qa-needstesting
Colin Watson (cjwatson)
Changed in launchpad:
status: Fix Committed → Triaged
tags: removed: qa-ok
Revision history for this message
Colin Watson (cjwatson) wrote:

I can reproduce a somewhat similar hang locally by sending SIGSTOP to a single worker process and then firing off a succession of jobs. It seems that the file descriptor that celery uses to communicate with the worker is still writable, so celery sends a task to it (due to prefetching) and then gets stuck. This is essentially the situation described in https://celery.readthedocs.io/en/latest/userguide/optimizing.html#prefork-pool-prefetch-settings, and indeed running the worker with -Ofair does help.
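
As a concrete illustration of the prefetch/-Ofair interaction (app names and values are illustrative, not Launchpad's real deployment):

# Run the worker with the fair scheduling optimisation from the docs linked
# above, so the parent doesn't hand a task to a child process that is still
# busy with (or stuck on) another.  "lp_celery_app" is a placeholder name;
# --concurrency=3 matches the three worker processes mentioned below.
#
#   celery -A lp_celery_app worker --concurrency=3 -O fair
#
# Related settings (Celery 3.x-style names) that limit how far ahead the
# MainProcess fetches from the broker:
CELERYD_PREFETCH_MULTIPLIER = 1  # reserve at most one message per process
CELERY_ACKS_LATE = True          # acknowledge only after the task completes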

However, that still leaves a task stuck in a reserved state on a worker that's effectively a zombie, and celery never seems to notice this; over time we'll just end up with all three worker processes stuck and then we'll be back to a similar situation, just less frequently. So we really need to notice that the worker process is stuck and replace it; I haven't quite worked out yet how to persuade celery to do that.

All this is independent of why the worker is hanging in the first place, but at least it explains why the entire pool is getting stuck.
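
On the "notice the stuck worker and replace it" point, one generic option, sketched under the assumption that hard-killing the job is acceptable (not necessarily what will be done here): pair the soft limit with a hard one, since on the hard limit celery's parent process terminates the child outright and forks a replacement rather than waiting for the task to honour SoftTimeLimitExceeded.

# Celery 3.x-style names; the 300s soft limit matches the logs above, the
# hard value is arbitrary.
CELERYD_TASK_SOFT_TIME_LIMIT = 300  # raises SoftTimeLimitExceeded in the task
CELERYD_TASK_TIME_LIMIT = 330       # parent kills and replaces the worker process

# A wedged worker can also be spotted from outside, e.g.:
#   celery -A lp_celery_app inspect active    # long-running tasks per worker
#   celery -A lp_celery_app inspect reserved  # tasks parked on each worker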
