Comment 51 for bug 1377332

Stéphane Graber (stgraber) wrote :

Just had a chat with Serge and we've got a theory.

cgmanager sends tasks to cgproxy using ucreds. A ucred contains a pid, uid and gid, all of which must be valid. However, if the process is killed between the time the ucred is generated and the time it's sent over the socket, the pid is no longer valid, so the kernel refuses to send the message and returns the odd error we're seeing. cgmanager then gives up, which leaves cgproxy hanging in recvmsg.

The way around this is to have both cgmanager and cgproxy check for errors on sendmsg and recvmsg, then check errno. If it matches the "pid no longer exists" case, just ignore that entry: the process has died, so it shouldn't be reported anyway.
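To make the idea concrete, here is a rough sketch in Python (not the actual cgmanager/cgproxy C code) of sending a ucred over a unix socket and skipping the entry when the pid has vanished. The helper names send_task and recv_task are made up for illustration, and the sketch assumes the kernel reports the dead pid as ESRCH, which would need confirming against the error we're actually seeing.

```python
import errno
import os
import socket
import struct

# Linux struct ucred layout: pid_t pid, uid_t uid, gid_t gid.
UCRED = struct.Struct("iII")


def send_task(sock, pid, uid, gid):
    """Send one task as an SCM_CREDENTIALS message (hypothetical helper).

    Returns True if the message was sent, False if the entry was skipped
    because the pid no longer exists. Any other error is re-raised.
    """
    cred = UCRED.pack(pid, uid, gid)
    ancillary = [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, cred)]
    try:
        sock.sendmsg([b"p"], ancillary)
    except OSError as e:
        # Assumption: a pid that died before the send shows up as ESRCH.
        if e.errno == errno.ESRCH:
            return False
        raise
    return True


def recv_task(sock):
    """Receive one task; returns (pid, uid, gid), or None if no ucred came."""
    _data, ancillary, _flags, _addr = sock.recvmsg(
        1, socket.CMSG_SPACE(UCRED.size))
    for level, ctype, cdata in ancillary:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            return UCRED.unpack(cdata[:UCRED.size])
    return None
```

The receiving socket needs SO_PASSCRED set before the message arrives. Sending a pid other than your own also requires privileges, so trying the dead-pid path by hand means running as root. The important part is that the sender carries on with the next entry instead of giving up, so the receiver is not left waiting in recvmsg.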

There probably are some more similar races here and there in cgmanager/cgproxy when sending processes over ucreds, but getting a patch for that case (assuming the theory is right) shouldn't be terribly difficult.

It should also be reasonably easy to construct a test case that hits this specific problem: spawn around a thousand processes, then kill them all while calling gettasks in an infinite loop (which is a good approximation of what libual does today).
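A possible shape for that test case, as a Python sketch. The stress function and its query argument are hypothetical: query stands in for whatever issues the gettasks request, and the sketch doesn't talk to cgmanager itself. It also assumes the spawned processes end up in the cgroup being queried.

```python
import signal
import subprocess
import threading


def stress(query, n_procs=1000):
    """Spawn n_procs sleepers, kill them all, and call query() meanwhile.

    query is a placeholder for the gettasks call. Returns how many times
    query() was called before every process had been killed and reaped.
    """
    procs = [subprocess.Popen(["sleep", "600"]) for _ in range(n_procs)]

    def kill_all():
        for p in procs:
            p.send_signal(signal.SIGKILL)
        for p in procs:
            p.wait()

    killer = threading.Thread(target=kill_all)
    killer.start()

    calls = 0
    while True:
        # A hang or unexpected error here is the failure being hunted.
        query()
        calls += 1
        if not killer.is_alive():
            break
    killer.join()
    return calls
```

Running this in an outer loop gives the race more chances to trigger, since any single pass may kill everything before a gettasks call lands in the window.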