Comment 19 for bug 688541

Clint Byrum (clint-fewbar) wrote : Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

On Mon, 2010-12-20 at 12:50 +0000, James Hunt wrote:
> After discussion with Scott, the best short-term solution would seem to
> be:
>
> 1) Modify /etc/init.d/umountfs to call the following in do_stop before
> calling umount/swapoff:
>
> "initctl emit unmount-filesystem"
>
> 2) Modify /etc/init.d/umountroot to call the following in do_stop before
> calling umount:
>
> "initctl emit unmount-root-filesystem"
>
>
> 3) Modify all upstart configs for services which are "slow" to stop such that they "stop on unmount-filesystem",
> rather than "stop on runlevel [016]".
>
> 4) Test!
>
> The overall effect of this being that when /etc/init.d/umountfs emits
> the unmount-filesystem event, it will block until any Upstart jobs which
> "stop on" those events have completed. Thus, /etc/init.d/umountfs will
> wait for the mysql Upstart job to finish before unmounting its
> filesystems.
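
For reference, steps 1 and 2 amount to roughly the following in the init
script. This is only a sketch: unmount_everything is a hypothetical
stand-in for whatever umount/swapoff logic do_stop already has, and the
"|| true" is my assumption that a failed emit should not abort the
unmount.

```shell
# Hypothetical stand-in for the existing umount/swapoff logic in umountfs.
unmount_everything() {
    :
}

do_stop() {
    # "initctl emit" waits for the event to be fully handled unless
    # --no-wait is given, so this returns only after every job that
    # "stops on unmount-filesystem" has finished stopping.
    initctl emit unmount-filesystem || true
    unmount_everything
}
```

umountroot would do the same with unmount-root-filesystem before its own
umount call.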

Not much happens between rc-sysinit starting and sendsigs/umountfs. Does
"slow" include even a job that takes 1 second between SIGTERM and
exiting? Shouldn't we just make sure that everything that is 'stop on
runlevel [!2345]' or 'stop on runlevel [016]' has stopped before we
umount? Bug #672177 may very well be caused simply by killing the last
service that had the deleted libc.so.6 open, which forces the fs to
finish the deletion right then; that in turn could be waiting on a sync
and on many other files being flushed on a busy rotational disk. The
result is that something very tiny can take a second to die.

I think we must transition *everything* that stops on runlevel [016] to
'stop on unmounting-filesystems', or get clever and find a way to wait
until upstart is done stopping everything it already wants to stop. I do
think that initctl list is flawed for this task, but it might be the
best chance at catching stragglers that we have.
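
A rough sketch of the "catch stragglers" idea, assuming the usual
initctl list output of "name [(instance)] goal/state[, process PID]".
The function name stragglers is made up for illustration, and the check
is inherently racy since a job can change state right after it is
listed:

```shell
# Print the names of Upstart jobs whose goal is "stop" but which have not
# yet reached the "waiting" state. Reads initctl-list-style lines on stdin.
stragglers() {
    awk '
        {
            # Find the goal/state token, e.g. "stop/killed," or "start/running".
            # It is field 2, or field 3 when an "(instance)" field is present.
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^(start|stop)\/[a-z-]+,?$/) {
                    status = $i
                    sub(/,$/, "", status)
                    split(status, parts, "/")
                    if (parts[1] == "stop" && parts[2] != "waiting")
                        print $1
                    break
                }
            }
        }
    '
}
```

umountfs could then poll "initctl list | stragglers" until it prints
nothing (or a timeout expires) before unmounting.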

In a message to ubuntu-devel I suggested that we have an abstract job,
'network-services', which most normal (non boot-critical) services
should follow.

https://lists.ubuntu.com/archives/ubuntu-devel/2010-December/032254.html

By taking this approach, we can at least amend this fix if it has
unintended consequences.

There's also still the issue (which probably should be its own bug
report) that sendsigs will kill the children of already stopping jobs,
which it shouldn't do, and which it would still do in the suggested fix
since sendsigs runs before umountfs.