continuous re-exec can result in a build-up of inotify fds [SRU]

Bug #1338637 reported by James Hunt
28
This bug affects 5 people
Affects Status Importance Assigned to Milestone
upstart
Fix Released
High
James Hunt
upstart (Ubuntu)
Fix Released
Undecided
Unassigned
Trusty
Confirmed
Undecided
Unassigned

Bug Description

= PROBLEM =

The version of Upstart currently in trusty (1.12.1-0ubuntu4.2) suffers
from a couple of problems which in combination could make upgrading
difficult on systems with high uptimes.

The 2 issues are:

== bug 901038 ==

The issue here is that telinit in trusty is not fully synchronous. This
is not normally a problem. However if Upstart is re-exec'ed using
'telinit u', as is done by the maintainer scripts for the following packages...

  - libc6
  - libdbus-1-3
  - libjson-c2
  - libnih1
  - libnih-dbus1
  - libselinux1
  - libsepol1
  - upstart

... and if that operation is slow (it is normally extremely fast),
subsequent package upgrades (part of the same apt-get run) which *also*
need to restart Upstart may fail since 'telinit u' is unable to connect
to PID 1 as a result of Upstart still being in the process of
re-exec'ing from the last call to 'telinit u'.

== bug 1338637 ==

This bug only really affects server systems which have high uptimes. If
a re-exec is triggered via 'telinit u' as part of a package upgrade (see
above), Upstart will consume 2 additional inotify watches. Although this
doesn't affect the correct behaviour of Upstart, it does have two
repercussions:

a) It wastes inotify watches.
b) It slows down an Upstart re-exec ('telinit u').

Note that the slow-down for a vanilla Trusty server may not be
detectable unless the value of /proc/sys/fs/inotify/max_user_instances
has been raised above the default value of 128.

= FIX DETAILS =

== bug 901038 ==

The fix for this bug now makes 'telinit u' block until the re-exec
operation has completed fully.

Technically, it is not possible to make 'telinit u' synchronous since
because D-Bus connections cannot be serialised and since the 'telinit u'
request is made via a D-Bus connection, when Upstart re-exec's, it has
to sever all D-Bus connections, including the 'telinit u' D-Bus
connection. As such, 'telinit u' now performs the following operations:

  - Requests synchronously that Upstart re-exec itself.
  - Polls Upstart "forever" by attempting to connect to PID 1 and if
    that operation fails, waiting for a period, the retrying.

The code is well commented to explain this less-than-ideal but
nonetheless essential poll operation:

  http://bazaar.launchpad.net/~upstart-devel/upstart/trunk/view/head:/util/telinit.c#L171

== bug 1338637 ==

This was a simple bug fix.

= FIX AVAILABILITY =

TBD.

= IMPACT =

On a high uptime server, the more times that 'telinit u' is called (as
the result of normal apt-get updates to any of the packages listed under
bug 901038 above), the slower the operation will take to complete, and
thus the likelihood of bug 901038 being seen will increase.

= JUSTIFICATION =

System updates need to "just work". However, as outlined above, the
longer a systems uptime, the more likely it is to be affected by this
issue which increases the likelihood of a system update failure.

This issue needs to be fixed as soon as possible to minimise issues for
trusty users and administrators, particularly before any potential
upgrade to Utopic.

= TEST CASE =

== To demonstrate bug 1338637 ==

$ sudo telinit u
$ sudo ls -l /proc/1/fd|grep inotify|wc -l

The correct output from the above 2 commands should be simply "2".

However, on trusty systems, if the commands above are run after a fresh
boot, the value will in all likelihood be "4". Also, every subsequent
'telinit u' will increase the number displayed by 2.

== To demonstrate bug 901038 ==

for i in $(seq 10); do telinit u; done

There should be no output from the command-line above, but on a trusty
system the chances are that one or more lines will display like this:

$ for i in $(seq 5); do sudo telinit u;done
telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused
telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused
telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused

== To prove the overall problem has been fixed ==

1) Download the attached "force-reexec.sh" onto a trusty system.
2) Make executable:

   $ chmod 755 ./force-reexec.sh

3) Run script as root, specifying a number of iterations (100) and
   saving output to file "./typescript.bad":

   $ script -c 'sudo ./force-reexec.sh 100' typescript.bad

4) Upgrade to latest version of Upstart that fixes the 2 bugs.

5) Reboot.

6) Re-run the script

   $ script -c 'sudo ./force-reexec.sh 100' typescript.good

7) Check the results:

   - file "typescript.bad" will show an increasing number of watches
     and *MAY* show a slowly increasing restart time (depending on
     the value of /proc/sys/fs/inotify/max_user_instances (see above)).

  - file "typescript.good" should consistently show 2 watches and a
    restart time of 0 on average.

= TEST RESULTS =

See the 2 attached typescript files showing a run of force-reexec.sh on an affected system, and the other showing the output of force-reexec.sh on a system where the problem has been resolved.

[REGRESSION POTENTIAL]

The only theoretically potential issue is what happens if the continuous
poll performed by 'telinit u' never completes. However, this should
never happen since:

1) Upstart checks to ensure that it can serialise its own state *before*
   it actually performs the re-exec. If it cannot for some reason (the
   only possibility here is critically low memory), it will
   automatically degrade to a stateless re-exec. A stateless re-exec can
   only fail if the on-disk "/sbin/init" binary or associated libraries
   are somehow corrupted.

2) If, after checking that its own state can be serialised, the actual
   re-exec operation fails "mid-flight", again, Upstart will
   automatically revert to performing a simple stateless re-exec which
   can only fail if the on-disk "/sbin/init" binary or associated
   libraries are somehow corrupted.

= OTHER INFORMATION =

== Upgrade Procedure ==

Note that updating a system to the latest version of Upstart that fixes
the two bugs outlined above will stop the number of inotify watches
growing, but will NOT bring the value down to the expected "2".

As such, to correct the problem fully, it is necessary to reboot the
system after successfully upgrading the Upstart package.

== D-Bus Daemon ==

Note that although Upstart makes heavy use of D-Bus, it does not require
a D-Bus daemon to be running. Specifically, 'telinit u' communicates
with PID 1 via a private abstract D-Bus socket, so is immune from issues
with dbus-daemon(1).

== Work Around ==

If your system is affected by this issue, you can perform the following actions to work around the issue in a non-invasive manner by simply disabling all calls to 'telinit u' via package maintainer scripts):

$ sudo mkdir /root/bin
$ sudo ln -s /bin/true /root/bin/telinit
$ sudo su -
# (PATH=/root/bin:$PATH which telinit)
# (PATH=/root/bin:$PATH apt-get update && apt-get upgrade)
# exit
$ sudo /sbin/telinit u

Note the final required call to the real telinit to perform a real Upstart restart *after* apt-get has completed.

= Original Description =

$ sudo ls -al /proc/1/fd|grep anon|wc -l
2

$ i=0; while [ $i -lt 1024 ]; do sudo telinit u; i=$((i+1)); done

$ sudo ls -al /proc/1/fd|grep anon|wc -l
106

Related branches

Revision history for this message
James Hunt (jamesodhunt) wrote :

Using the async code, this can result in a crash and clearly indicates incorrect behaviour.

Changed in upstart:
assignee: nobody → James Hunt (jamesodhunt)
importance: Undecided → High
description: updated
Revision history for this message
James Hunt (jamesodhunt) wrote :

FTR, these fds are not related to D-Bus since the the same problem is seen running a Session Init.

The expected number of 'anon_inode:inotify' fds for the System Init running as PID 1 is 2:

- 1 for /etc/init.conf (even though it doesn't exist by default).
- 1 for /etc/init/.

The expected number of 'anon_inode:inotify' fds for a Session Init is normally 2:

- 1 for $HOME/.config/upstart/
- 1 for /usr/share/upstart/sessions/

However, it's possible to have just a single such fd by specifying --confdir:

  $ init --user --confdir /foo

Revision history for this message
James Hunt (jamesodhunt) wrote :

Problem affects all versions of upstart back to 1.9 (when ConfSource objects were first serialised). The issue is caused by insufficient checks in main on when to create a ConfSource.

Revision history for this message
James Hunt (jamesodhunt) wrote :

Raised bug 1338968 on the separate issue of the crash.

James Hunt (jamesodhunt)
Changed in upstart:
status: New → Fix Released
Revision history for this message
James Hunt (jamesodhunt) wrote :

This bug is fixed in upstart 1.13 but until the fix is backported to Ubuntu trusty, upgrading to ubuntu utopic can cause a delay on the upstart package upgrade. The duration of that delay is a function of the number of times upstart has been re-exec'd since boot.

To estimate the delay on your system, you can try doing this:

$ wget http://people.canonical.com/~jhunt/upstart/scripts/get_state.sh
$ bash -c 'time { bash get_state.sh|json_pp >/dev/null; }'

Note that if the delay is appreciable, you may be hit by bug 901038.

We hope to get both these issues resolved in trusty in the next few days...

James Hunt (jamesodhunt)
summary: - continuous re-exec can result in a build-up of inotify fds
+ continuous re-exec can result in a build-up of inotify fds [SRU]
Revision history for this message
James Hunt (jamesodhunt) wrote :
description: updated
Revision history for this message
James Hunt (jamesodhunt) wrote :
Revision history for this message
James Hunt (jamesodhunt) wrote :
description: updated
Revision history for this message
James Hunt (jamesodhunt) wrote :

The fix for bug 901038 is available upstream but is not yet in Utopic.

Since this is a fundamental system tool, we need to wait until this has had testing in utopic prior to progressing this SRU.

Plan:

1) Get lp:~jamesodhunt/upstart/bug-1360208 reviewed and landed upstream (hopefully this week).
2) Release upstart 1.13.3 (including the fix for bug 901038).
3) Land upstart 1.13.3 in utopic.
4) Wait for a few days.
5) Cherry-pick to trusty to allow this SRU to progress.

James Hunt (jamesodhunt)
description: updated
Revision history for this message
James Hunt (jamesodhunt) wrote :
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in upstart (Ubuntu Trusty):
status: New → Confirmed
Changed in upstart (Ubuntu):
status: New → Confirmed
Revision history for this message
Steve Langasek (vorlon) wrote :

The upstream fix for this was included in release 1.13, so this bug is fixed for utopic and later.

Changed in upstart (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.