Merge ~jasonzio/cloud-init:dhcpLeaseRace into cloud-init:master

Proposed by Jason Zions
Status: Merged
Approved by: Ryan Harper
Approved revision: d8c760069378e9a096b91e655df17767ffb51f0d
Merge reported by: Server Team CI bot
Merged at revision: not available
Proposed branch: ~jasonzio/cloud-init:dhcpLeaseRace
Merge into: cloud-init:master
Diff against target: 233 lines (+84/-19)
6 files modified
cloudinit/net/dhcp.py (+31/-13)
cloudinit/net/tests/test_dhcp.py (+12/-3)
cloudinit/temp_utils.py (+2/-2)
cloudinit/tests/test_temp_utils.py (+17/-1)
cloudinit/util.py (+16/-0)
tests/unittests/test_util.py (+6/-0)
Reviewer Review Type Date Requested Status
Server Team CI bot continuous-integration Approve
Jason Zions (community) Approve
Ryan Harper Approve
Review via email: mp+360905@code.launchpad.net

Commit message

net: Wait for dhclient to daemonize before reading lease file

cloud-init uses dhclient to fetch the DHCP lease so it can extract
DHCP options. dhclient creates the leasefile, then writes to it;
simply waiting for the leasefile to appear creates a race between
dhclient and cloud-init. Instead, wait for dhclient to be parented by
init. At that point, we know it has written to the leasefile, so it's
safe to copy the file and kill the process.

cloud-init creates a temporary directory in which to execute dhclient,
and deletes that directory after it has killed the process. If
cloud-init abandons waiting for dhclient to daemonize, it will still
attempt to delete the temporary directory, but will not report an
exception should that attempt fail.

LP: #1794399

To post a comment you must log in.
Revision history for this message
Ryan Harper (raharper) wrote :

A couple of inline comments.

review: Needs Fixing
Revision history for this message
Jason Zions (jasonzio) wrote :

I addressed your change to temp_utils.tempdir() a little differently than you suggested. To be honest, the code I initially proposed does exactly the same thing (peels the 'suppress_rmtree_errors' argument, if specified, out of the kwargs, capturing it in a separate variable), but it does so by not even putting it into kwargs in the first place. Python supports the syntax I used precisely for this very scenario. The code I replaced it with (popping the value from the kwargs dict if present, supplying a default if not present) is a pretty blunt thing.

All of this is required due to tempfile.mkstemp() complaining if it's supplied with an option (suppress_rmtree_errors) it does not expect. I can't just leave that keyword param in kwargs.

review: Needs Resubmitting
Revision history for this message
Jason Zions (jasonzio) wrote :

> I addressed your change to temp_utils.tempdir() a little differently than you
> suggested. To be honest, the code I initially proposed does exactly the same
> thing (peels the 'suppress_rmtree_errors' argument, if specified, out of the
> kwargs, capturing it in a separate variable), but it does so by not even
> putting it into kwargs in the first place. Python supports the syntax I used
> precisely for this very scenario. The code I replaced it with (popping the
> value from the kwargs dict if present, supplying a default if not present) is
> a pretty blunt thing.
>
> All of this is required due to tempfile.mkstemp() complaining if it's supplied
> with an option (suppress_rmtree_errors) it does not expect. I can't just leave
> that keyword param in kwargs.

(Ignore the "review: resubmit" thing; I'm still learning launchpad.)

Revision history for this message
Jason Zions (jasonzio) :
review: Needs Fixing
Revision history for this message
Ryan Harper (raharper) wrote :

Hi Jason,

I'm sorry. I now see why you had suppress_rmtree_errors separate; it was only for the rmtree command itself. Now that I understand that it does indeed make sense to keep it out of the kwargs.

Could we update the variable to:

def tempdir(rmtree_ignore_errors=False, **kwargs)

I think that helps indicate that the parameter is for the rmtree call.

sorry again for missing this the first time.

Revision history for this message
Jason Zions (jasonzio) wrote :

> Could we update the variable to:
>
> def tempdir(rmtree_ignore_errors=False, **kwargs)
>
>
> I think that helps indicate that the parameter is for the rmtree call.

Done.

Revision history for this message
Ryan Harper (raharper) wrote :
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:d8c760069378e9a096b91e655df17767ffb51f0d
https://jenkins.ubuntu.com/server/job/cloud-init-ci/514/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/514/rebuild

review: Approve (continuous-integration)
Revision history for this message
Ryan Harper (raharper) :
review: Approve
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

Commit message lints:
 - Line #2 has 307 too many characters. Line starts with: "cloud-init uses dhclient"... - Line #4 has 230 too many characters. Line starts with: "cloud-init creates a"...

review: Needs Fixing
Revision history for this message
Jason Zions (jasonzio) :
review: Approve
Revision history for this message
Server Team CI bot (server-team-bot) :
review: Approve (continuous-integration)
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

Commit message lints:
 - Line #8 has 4 too many characters. Line starts with: " "...

review: Needs Fixing
Revision history for this message
Server Team CI bot (server-team-bot) :
review: Approve (continuous-integration)
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

Commit message lints:
 - Line #8 has 4 too many characters. Line starts with: " "...

review: Needs Fixing
Revision history for this message
Server Team CI bot (server-team-bot) :
review: Approve (continuous-integration)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/cloudinit/net/dhcp.py b/cloudinit/net/dhcp.py
2index 0db991d..c98a97c 100644
3--- a/cloudinit/net/dhcp.py
4+++ b/cloudinit/net/dhcp.py
5@@ -9,6 +9,7 @@ import logging
6 import os
7 import re
8 import signal
9+import time
10
11 from cloudinit.net import (
12 EphemeralIPv4Network, find_fallback_nic, get_devicelist,
13@@ -127,7 +128,9 @@ def maybe_perform_dhcp_discovery(nic=None):
14 if not dhclient_path:
15 LOG.debug('Skip dhclient configuration: No dhclient command found.')
16 return []
17- with temp_utils.tempdir(prefix='cloud-init-dhcp-', needs_exe=True) as tdir:
18+ with temp_utils.tempdir(rmtree_ignore_errors=True,
19+ prefix='cloud-init-dhcp-',
20+ needs_exe=True) as tdir:
21 # Use /var/tmp because /run/cloud-init/tmp is mounted noexec
22 return dhcp_discovery(dhclient_path, nic, tdir)
23
24@@ -195,24 +198,39 @@ def dhcp_discovery(dhclient_cmd_path, interface, cleandir):
25 '-pf', pid_file, interface, '-sf', '/bin/true']
26 util.subp(cmd, capture=True)
27
28- # dhclient doesn't write a pid file until after it forks when it gets a
29- # proper lease response. Since cleandir is a temp directory that gets
30- # removed, we need to wait for that pidfile creation before the
31- # cleandir is removed, otherwise we get FileNotFound errors.
32+ # Wait for pid file and lease file to appear, and for the process
33+ # named by the pid file to daemonize (have pid 1 as its parent). If we
34+ # try to read the lease file before daemonization happens, we might try
35+ # to read it before the dhclient has actually written it. We also have
36+ # to wait until the dhclient has become a daemon so we can be sure to
37+ # kill the correct process, thus freeing cleandir to be deleted back
38+ # up the callstack.
39 missing = util.wait_for_files(
40 [pid_file, lease_file], maxwait=5, naplen=0.01)
41 if missing:
42 LOG.warning("dhclient did not produce expected files: %s",
43 ', '.join(os.path.basename(f) for f in missing))
44 return []
45- pid_content = util.load_file(pid_file).strip()
46- try:
47- pid = int(pid_content)
48- except ValueError:
49- LOG.debug(
50- "pid file contains non-integer content '%s'", pid_content)
51- else:
52- os.kill(pid, signal.SIGKILL)
53+
54+ ppid = 'unknown'
55+ for _ in range(0, 1000):
56+ pid_content = util.load_file(pid_file).strip()
57+ try:
58+ pid = int(pid_content)
59+ except ValueError:
60+ pass
61+ else:
62+ ppid = util.get_proc_ppid(pid)
63+ if ppid == 1:
64+ LOG.debug('killing dhclient with pid=%s', pid)
65+ os.kill(pid, signal.SIGKILL)
66+ return parse_dhcp_lease_file(lease_file)
67+ time.sleep(0.01)
68+
69+ LOG.error(
70+ 'dhclient(pid=%s, parentpid=%s) failed to daemonize after %s seconds',
71+ pid_content, ppid, 0.01 * 1000
72+ )
73 return parse_dhcp_lease_file(lease_file)
74
75
76diff --git a/cloudinit/net/tests/test_dhcp.py b/cloudinit/net/tests/test_dhcp.py
77index cd3e732..79e8842 100644
78--- a/cloudinit/net/tests/test_dhcp.py
79+++ b/cloudinit/net/tests/test_dhcp.py
80@@ -145,16 +145,20 @@ class TestDHCPDiscoveryClean(CiTestCase):
81 'subnet-mask': '255.255.255.0', 'routers': '192.168.2.1'}],
82 dhcp_discovery(dhclient_script, 'eth9', tmpdir))
83 self.assertIn(
84- "pid file contains non-integer content ''", self.logs.getvalue())
85+ "dhclient(pid=, parentpid=unknown) failed "
86+ "to daemonize after 10.0 seconds",
87+ self.logs.getvalue())
88 m_kill.assert_not_called()
89
90+ @mock.patch('cloudinit.net.dhcp.util.get_proc_ppid')
91 @mock.patch('cloudinit.net.dhcp.os.kill')
92 @mock.patch('cloudinit.net.dhcp.util.wait_for_files')
93 @mock.patch('cloudinit.net.dhcp.util.subp')
94 def test_dhcp_discovery_run_in_sandbox_waits_on_lease_and_pid(self,
95 m_subp,
96 m_wait,
97- m_kill):
98+ m_kill,
99+ m_getppid):
100 """dhcp_discovery waits for the presence of pidfile and dhcp.leases."""
101 tmpdir = self.tmp_dir()
102 dhclient_script = os.path.join(tmpdir, 'dhclient.orig')
103@@ -164,6 +168,7 @@ class TestDHCPDiscoveryClean(CiTestCase):
104 pidfile = self.tmp_path('dhclient.pid', tmpdir)
105 leasefile = self.tmp_path('dhcp.leases', tmpdir)
106 m_wait.return_value = [pidfile] # Return the missing pidfile wait for
107+ m_getppid.return_value = 1 # Indicate that dhclient has daemonized
108 self.assertEqual([], dhcp_discovery(dhclient_script, 'eth9', tmpdir))
109 self.assertEqual(
110 mock.call([pidfile, leasefile], maxwait=5, naplen=0.01),
111@@ -173,9 +178,10 @@ class TestDHCPDiscoveryClean(CiTestCase):
112 self.logs.getvalue())
113 m_kill.assert_not_called()
114
115+ @mock.patch('cloudinit.net.dhcp.util.get_proc_ppid')
116 @mock.patch('cloudinit.net.dhcp.os.kill')
117 @mock.patch('cloudinit.net.dhcp.util.subp')
118- def test_dhcp_discovery_run_in_sandbox(self, m_subp, m_kill):
119+ def test_dhcp_discovery_run_in_sandbox(self, m_subp, m_kill, m_getppid):
120 """dhcp_discovery brings up the interface and runs dhclient.
121
122 It also returns the parsed dhcp.leases file generated in the sandbox.
123@@ -197,6 +203,7 @@ class TestDHCPDiscoveryClean(CiTestCase):
124 pid_file = os.path.join(tmpdir, 'dhclient.pid')
125 my_pid = 1
126 write_file(pid_file, "%d\n" % my_pid)
127+ m_getppid.return_value = 1 # Indicate that dhclient has daemonized
128
129 self.assertItemsEqual(
130 [{'interface': 'eth9', 'fixed-address': '192.168.2.74',
131@@ -355,3 +362,5 @@ class TestEphemeralDhcpNoNetworkSetup(HttprettyTestCase):
132 self.assertEqual(fake_lease, lease)
133 # Ensure that dhcp discovery occurs
134 m_dhcp.called_once_with()
135+
136+# vi: ts=4 expandtab
137diff --git a/cloudinit/temp_utils.py b/cloudinit/temp_utils.py
138index c98a1b5..346276e 100644
139--- a/cloudinit/temp_utils.py
140+++ b/cloudinit/temp_utils.py
141@@ -81,7 +81,7 @@ def ExtendedTemporaryFile(**kwargs):
142
143
144 @contextlib.contextmanager
145-def tempdir(**kwargs):
146+def tempdir(rmtree_ignore_errors=False, **kwargs):
147 # This seems like it was only added in python 3.2
148 # Make it since its useful...
149 # See: http://bugs.python.org/file12970/tempdir.patch
150@@ -89,7 +89,7 @@ def tempdir(**kwargs):
151 try:
152 yield tdir
153 finally:
154- shutil.rmtree(tdir)
155+ shutil.rmtree(tdir, ignore_errors=rmtree_ignore_errors)
156
157
158 def mkdtemp(**kwargs):
159diff --git a/cloudinit/tests/test_temp_utils.py b/cloudinit/tests/test_temp_utils.py
160index ffbb92c..4a52ef8 100644
161--- a/cloudinit/tests/test_temp_utils.py
162+++ b/cloudinit/tests/test_temp_utils.py
163@@ -2,8 +2,9 @@
164
165 """Tests for cloudinit.temp_utils"""
166
167-from cloudinit.temp_utils import mkdtemp, mkstemp
168+from cloudinit.temp_utils import mkdtemp, mkstemp, tempdir
169 from cloudinit.tests.helpers import CiTestCase, wrap_and_call
170+import os
171
172
173 class TestTempUtils(CiTestCase):
174@@ -98,4 +99,19 @@ class TestTempUtils(CiTestCase):
175 self.assertEqual('/fake/return/path', retval)
176 self.assertEqual([{'dir': '/run/cloud-init/tmp'}], calls)
177
178+ def test_tempdir_error_suppression(self):
179+ """test tempdir suppresses errors during directory removal."""
180+
181+ with self.assertRaises(OSError):
182+ with tempdir(prefix='cloud-init-dhcp-') as tdir:
183+ os.rmdir(tdir)
184+ # As a result, the directory is already gone,
185+ # so shutil.rmtree should raise OSError
186+
187+ with tempdir(rmtree_ignore_errors=True,
188+ prefix='cloud-init-dhcp-') as tdir:
189+ os.rmdir(tdir)
190+ # Since the directory is already gone, shutil.rmtree would raise
191+ # OSError, but we suppress that
192+
193 # vi: ts=4 expandtab
194diff --git a/cloudinit/util.py b/cloudinit/util.py
195index 7800f7b..a8a232b 100644
196--- a/cloudinit/util.py
197+++ b/cloudinit/util.py
198@@ -2876,4 +2876,20 @@ def udevadm_settle(exists=None, timeout=None):
199 return subp(settle_cmd)
200
201
202+def get_proc_ppid(pid):
203+ """
204+ Return the parent pid of a process.
205+ """
206+ ppid = 0
207+ try:
208+ contents = load_file("/proc/%s/stat" % pid, quiet=True)
209+ except IOError as e:
210+ LOG.warning('Failed to load /proc/%s/stat. %s', pid, e)
211+ if contents:
212+ parts = contents.split(" ", 4)
213+ # man proc says
214+ # ppid %d (4) The PID of the parent.
215+ ppid = int(parts[3])
216+ return ppid
217+
218 # vi: ts=4 expandtab
219diff --git a/tests/unittests/test_util.py b/tests/unittests/test_util.py
220index 5a14479..0e71db8 100644
221--- a/tests/unittests/test_util.py
222+++ b/tests/unittests/test_util.py
223@@ -1171,4 +1171,10 @@ class TestGetProcEnv(helpers.TestCase):
224 self.assertEqual({}, util.get_proc_env(1))
225 self.assertEqual(1, m_load_file.call_count)
226
227+ def test_get_proc_ppid(self):
228+ """get_proc_ppid returns correct parent pid value."""
229+ my_pid = os.getpid()
230+ my_ppid = os.getppid()
231+ self.assertEqual(my_ppid, util.get_proc_ppid(my_pid))
232+
233 # vi: ts=4 expandtab

Subscribers

People subscribed via source and target branches