Comment 5 for bug 1783315

Colin Watson (cjwatson) wrote :

I can reproduce a somewhat similar hang locally by sending SIGSTOP to a single worker process and then firing off a succession of jobs. It seems that the file descriptor that celery uses to communicate with the worker is still writable, so celery sends a task to it (due to prefetching) and then gets stuck. This is essentially the situation described in https://celery.readthedocs.io/en/latest/userguide/optimizing.html#prefork-pool-prefetch-settings, and indeed running the worker with -Ofair does help.
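
The "still writable" effect can be illustrated without celery at all. The following is a minimal sketch, assuming a Unix-like system; the child process is only a stand-in for a pool worker, and the function name and payload are hypothetical. It shows that a pipe to a SIGSTOPped process still polls as writable, because the kernel buffer has free space even though nothing will read from it.

```python
import os
import select
import signal
import subprocess
import sys


def stopped_child_pipe_is_writable() -> bool:
    """Return True if a pipe to a SIGSTOPped child still polls as writable."""
    # Stand-in for a pool worker: a child that just reads from its stdin.
    child = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdin.buffer.read()"],
        stdin=subprocess.PIPE,
    )
    try:
        # Freeze the "worker", as in the reproduction described above.
        os.kill(child.pid, signal.SIGSTOP)
        fd = child.stdin.fileno()
        # The pipe buffer has free space, so select() reports the fd as
        # writable even though the reader is stopped.
        _, writable, _ = select.select([], [fd], [], 1.0)
        if fd not in writable:
            return False
        # The write is accepted into the kernel buffer and never consumed.
        os.write(fd, b"hypothetical task payload\n")
        return True
    finally:
        child.kill()  # SIGKILL also terminates a stopped process
        child.wait()
        child.stdin.close()


if __name__ == "__main__":
    print(stopped_child_pipe_is_writable())
```

A sender that picks workers purely on writability would therefore keep choosing the frozen process until its buffer fills.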

However, that still leaves a task stuck in a reserved state on a worker that's effectively a zombie, and celery never seems to notice this. Over time we'll just end up with all three worker processes stuck, and then we'll be back to a similar situation, just less frequently. So we really need to notice that the worker process is stuck and replace it; I haven't quite worked out yet how to persuade celery to do that.
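
As a generic illustration of "notice that the worker is stuck and replace it", here is a minimal heartbeat watchdog sketch using only the standard library. It is not celery's mechanism and does not use any celery API; the names spawn_worker, heartbeat_seen and replace_if_stuck are hypothetical, and it assumes a Unix-like system and a worker that emits one byte on its stdout every 0.1 seconds.

```python
import os
import select
import subprocess
import sys

# Hypothetical worker body: emits one heartbeat byte every 0.1 seconds.
WORKER_CODE = (
    "import sys, time\n"
    "while True:\n"
    "    sys.stdout.write('.')\n"
    "    sys.stdout.flush()\n"
    "    time.sleep(0.1)\n"
)


def spawn_worker() -> subprocess.Popen:
    """Start a stand-in worker whose stdout carries heartbeats."""
    return subprocess.Popen(
        [sys.executable, "-c", WORKER_CODE], stdout=subprocess.PIPE
    )


def heartbeat_seen(worker: subprocess.Popen, timeout: float) -> bool:
    """Return True if the worker emits a fresh heartbeat within timeout."""
    fd = worker.stdout.fileno()
    # Discard heartbeats already sitting in the pipe; they may predate a stop.
    while select.select([fd], [], [], 0)[0]:
        if not os.read(fd, 4096):
            return False  # EOF: the worker has exited
    # Now wait for a heartbeat written after the drain.
    readable, _, _ = select.select([fd], [], [], timeout)
    return bool(readable) and bool(os.read(fd, 4096))


def replace_if_stuck(worker: subprocess.Popen, timeout: float = 1.0) -> subprocess.Popen:
    """Return the same worker if healthy, otherwise kill it and start a new one."""
    if heartbeat_seen(worker, timeout):
        return worker
    worker.kill()  # SIGKILL also terminates a SIGSTOPped process
    worker.wait()
    worker.stdout.close()
    return spawn_worker()
```

The point of the design is that liveness is judged by fresh output from the worker rather than by whether its pipe is writable, which is the check that fails to spot a stopped process. Any task reserved by the replaced worker would still need to be requeued separately.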

All this is independent of why the worker is hanging in the first place, but at least it explains why the entire pool is getting stuck.