Merge ~dojordan/cloud-init:azuretimeouts into cloud-init:master

Proposed by Douglas Jordan
Status: Merged
Approved by: Chad Smith
Approved revision: dd1023fcc4f2930896deec003939f81f3127ee58
Merge reported by: Chad Smith
Merged at revision: 2d618e27687470a8a3649f44598819bdee8cdb03
Proposed branch: ~dojordan/cloud-init:azuretimeouts
Merge into: cloud-init:master
Diff against target: 172 lines (+24/-42)
3 files modified
cloudinit/sources/DataSourceAzure.py (+9/-22)
cloudinit/url_helper.py (+8/-5)
tests/unittests/test_datasource/test_azure.py (+7/-15)
Reviewer Review Type Date Requested Status
Chad Smith Approve
Server Team CI bot continuous-integration Approve
Review via email: mp+340546@code.launchpad.net

Commit message

Reduce AzurePreprovisioning HTTP timeouts.

Reducing timeout to 1 second as IMDS responds within a handful
of milliseconds. Also get rid of max_retries to prevent exiting
out of polling loop early due to IMDS outage / upgrade.

Reduce Azure PreProvisioning HTTP timeouts during polling to
avoid waiting an extra minute.

LP: #1752977

Description of the change

During Azure preprovisioning, the platform owned VM gets into a polling loop while we are waiting for the reprovsioning data file (the customer's ovf_env.xml). If the VM moves from one vnet to another, we will get a timeout exception on the request to IMDS and thus need to re-dhcp to get the new IP and network configuration. Currently, that timeout is 60seconds, and upon further testing IMDS responds within 5-20 milliseconds so we propose moving the timeout to 1 second in order to speed up the total deployment time for the customer.

In addition, due to IMDS server updates there is a chance it will be unreachable. If it is, then we could be re-dhcping multiple times, and it is possible to hit the current limit (5) especially with a max retry of 5. That would put the VM in an effectively useless state where it would not be possible to recover. We propose removing the max dhcp retry count for preprovisioning.

To post a comment you must log in.
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:04aa0f7174907789163dea204cd8053f72bc1b1e
https://jenkins.ubuntu.com/server/job/cloud-init-ci/810/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    SUCCESS: MAAS Compatability Testing
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/810/rebuild

review: Approve (continuous-integration)
Revision history for this message
Chad Smith (chad.smith) wrote :

Thanks for this Douglas.

   Now that timeouts are 1 second instead of 1 minute, let's drop that logging on exception as this will generate a high-frequency of these logs where the only real value is to see how long we were in the polling stage.

Let's instead add bookend comments for entrance into IMDS polling and exit from that Polling loop.
Then we'll easily see how long we waited in that preprovisioning state once instances do finally provision.

Revision history for this message
Chad Smith (chad.smith) :
Revision history for this message
Douglas Jordan (dojordan) wrote :

Hmm, is there a log level we can use that is disabled by default like verbose? Or is verbose logged too?
> Thanks for this Douglas.
>
> Now that timeouts are 1 second instead of 1 minute, let's drop that logging
> on exception as this will generate a high-frequency of these logs where the
> only real value is to see how long we were in the polling stage.
>
>
> Let's instead add bookend comments for entrance into IMDS polling and exit
> from that Polling loop.
> Then we'll easily see how long we waited in that preprovisioning state once
> instances do finally provision.

~dojordan/cloud-init:azuretimeouts updated
83c2198... by Douglas Jordan

Fix logging

764b1f3... by Douglas Jordan

Fix logging

832f67f... by Douglas Jordan

Fix logging

24c4467... by Douglas Jordan

Fix string formatting

ec66fe8... by Douglas Jordan

Fix string formatting

64f90cc... by Douglas Jordan

Fix string formatting

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:24c4467b491ce71c520247c129a8c12f376e5a8a
https://jenkins.ubuntu.com/server/job/cloud-init-ci/816/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    FAILED: Ubuntu LTS: Integration

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/816/rebuild

review: Needs Fixing (continuous-integration)
~dojordan/cloud-init:azuretimeouts updated
7e1cded... by Douglas Jordan

Removing all debug logs

b2384b5... by Douglas Jordan

Fix merge

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:b2384b57fb998cde2916059b62907c025ed1d9ce
https://jenkins.ubuntu.com/server/job/cloud-init-ci/817/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    FAILED: Ubuntu LTS: Integration

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/817/rebuild

review: Needs Fixing (continuous-integration)
~dojordan/cloud-init:azuretimeouts updated
9cbfb4b... by Douglas Jordan

Fix merge

e875718... by Douglas Jordan

Bouncing C-I with new log message

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:e875718ce26e25d3999faaa3f4176bcce5cfa0a2
https://jenkins.ubuntu.com/server/job/cloud-init-ci/818/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    FAILED: Ubuntu LTS: Integration

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/818/rebuild

review: Needs Fixing (continuous-integration)
~dojordan/cloud-init:azuretimeouts updated
df7c764... by Douglas Jordan

Only catch UrlError and bring back log

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:df7c764d5604b4558e6d11275026d12788416c4c
https://jenkins.ubuntu.com/server/job/cloud-init-ci/821/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    SUCCESS: MAAS Compatability Testing
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/821/rebuild

review: Approve (continuous-integration)
~dojordan/cloud-init:azuretimeouts updated
580798c... by Douglas Jordan

Moving to readurl in preprovisioning case

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:580798c3a65c1e12fb3b75fccc59457483c6ee4b
https://jenkins.ubuntu.com/server/job/cloud-init-ci/843/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    SUCCESS: MAAS Compatability Testing
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/843/rebuild

review: Approve (continuous-integration)
Revision history for this message
Chad Smith (chad.smith) :
~dojordan/cloud-init:azuretimeouts updated
196550e... by Douglas Jordan

Fixing UTs

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:196550eae86c00b025d6a4b464da877e95d8bd0e
https://jenkins.ubuntu.com/server/job/cloud-init-ci/881/
Executed test runs:
    SUCCESS: Checkout
    FAILED: Unit & Style Tests

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/881/rebuild

review: Needs Fixing (continuous-integration)
~dojordan/cloud-init:azuretimeouts updated
3e98773... by Douglas Jordan

Style fix

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:3e9877373627e08fbfce9ee9ad9f5180eb4bcc29
https://jenkins.ubuntu.com/server/job/cloud-init-ci/882/
Executed test runs:
    SUCCESS: Checkout
    FAILED: Unit & Style Tests

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/882/rebuild

review: Needs Fixing (continuous-integration)
~dojordan/cloud-init:azuretimeouts updated
fb27875... by Douglas Jordan

Merge branch 'master' into azuretimeouts

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:fb27875d49526813054a9a5c317f36ad7b415020
https://jenkins.ubuntu.com/server/job/cloud-init-ci/883/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    SUCCESS: MAAS Compatability Testing
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/883/rebuild

review: Approve (continuous-integration)
~dojordan/cloud-init:azuretimeouts updated
ed5a580... by Douglas Jordan

Merge branch 'master' into azuretimeouts

dd1023f... by Douglas Jordan

Fixing exception_cb to match new contract

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:dd1023fcc4f2930896deec003939f81f3127ee58
https://jenkins.ubuntu.com/server/job/cloud-init-ci/914/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    SUCCESS: MAAS Compatability Testing
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/914/rebuild

review: Approve (continuous-integration)
Revision history for this message
Chad Smith (chad.smith) wrote :

Approved this latest round with exception_cb updates. Ran through a battery of test on Azure with _poll_imds when IMDS service is not present, and 404s vs 200s. Looping and retries make sense here.

Thanks!

review: Approve
Revision history for this message
Chad Smith (chad.smith) wrote :

An upstream commit landed for this bug.

To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=2d618e27

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/cloudinit/sources/DataSourceAzure.py b/cloudinit/sources/DataSourceAzure.py
2index 0bb7fad..0ee622e 100644
3--- a/cloudinit/sources/DataSourceAzure.py
4+++ b/cloudinit/sources/DataSourceAzure.py
5@@ -20,7 +20,7 @@ from cloudinit import net
6 from cloudinit.net.dhcp import EphemeralDHCPv4
7 from cloudinit import sources
8 from cloudinit.sources.helpers.azure import get_metadata_from_fabric
9-from cloudinit.url_helper import readurl, wait_for_url, UrlError
10+from cloudinit.url_helper import readurl, UrlError
11 from cloudinit import util
12
13 LOG = logging.getLogger(__name__)
14@@ -49,7 +49,6 @@ DEFAULT_FS = 'ext4'
15 AZURE_CHASSIS_ASSET_TAG = '7783-7084-3265-9085-8269-3286-77'
16 REPROVISION_MARKER_FILE = "/var/lib/cloud/data/poll_imds"
17 IMDS_URL = "http://169.254.169.254/metadata/reprovisiondata"
18-IMDS_RETRIES = 5
19
20
21 def find_storvscid_from_sysctl_pnpinfo(sysctl_out, deviceid):
22@@ -451,36 +450,24 @@ class DataSourceAzure(sources.DataSource):
23 headers = {"Metadata": "true"}
24 LOG.debug("Start polling IMDS")
25
26- def sleep_cb(response, loop_n):
27- return 1
28-
29- def exception_cb(msg, exception):
30+ def exc_cb(msg, exception):
31 if isinstance(exception, UrlError) and exception.code == 404:
32- return
33- LOG.warning("Exception during polling. Will try DHCP.",
34- exc_info=True)
35-
36+ return True
37 # If we get an exception while trying to call IMDS, we
38 # call DHCP and setup the ephemeral network to acquire the new IP.
39- raise exception
40+ return False
41
42 need_report = report_ready
43- for i in range(IMDS_RETRIES):
44+ while True:
45 try:
46 with EphemeralDHCPv4() as lease:
47 if need_report:
48 self._report_ready(lease=lease)
49 need_report = False
50- wait_for_url([url], max_wait=None, timeout=60,
51- status_cb=LOG.info,
52- headers_cb=lambda url: headers, sleep_time=1,
53- exception_cb=exception_cb,
54- sleep_time_cb=sleep_cb)
55- return str(readurl(url, headers=headers))
56- except Exception:
57- LOG.debug("Exception during polling-retrying dhcp" +
58- " %d more time(s).", (IMDS_RETRIES - i),
59- exc_info=True)
60+ return readurl(url, timeout=1, headers=headers,
61+ exception_cb=exc_cb, infinite=True).contents
62+ except UrlError:
63+ pass
64
65 def _report_ready(self, lease):
66 """Tells the fabric provisioning has completed
67diff --git a/cloudinit/url_helper.py b/cloudinit/url_helper.py
68index 36289af..03a573a 100644
69--- a/cloudinit/url_helper.py
70+++ b/cloudinit/url_helper.py
71@@ -16,7 +16,7 @@ import time
72
73 from email.utils import parsedate
74 from functools import partial
75-
76+from itertools import count
77 from requests import exceptions
78
79 from six.moves.urllib.parse import (
80@@ -172,7 +172,7 @@ def _get_ssl_args(url, ssl_details):
81 def readurl(url, data=None, timeout=None, retries=0, sec_between=1,
82 headers=None, headers_cb=None, ssl_details=None,
83 check_status=True, allow_redirects=True, exception_cb=None,
84- session=None):
85+ session=None, infinite=False):
86 url = _cleanurl(url)
87 req_args = {
88 'url': url,
89@@ -220,7 +220,8 @@ def readurl(url, data=None, timeout=None, retries=0, sec_between=1,
90 excps = []
91 # Handle retrying ourselves since the built-in support
92 # doesn't handle sleeping between tries...
93- for i in range(0, manual_tries):
94+ # Infinitely retry if infinite is True
95+ for i in count() if infinite else range(0, manual_tries):
96 req_args['headers'] = headers_cb(url)
97 filtered_req_args = {}
98 for (k, v) in req_args.items():
99@@ -229,7 +230,8 @@ def readurl(url, data=None, timeout=None, retries=0, sec_between=1,
100 filtered_req_args[k] = v
101 try:
102 LOG.debug("[%s/%s] open '%s' with %s configuration", i,
103- manual_tries, url, filtered_req_args)
104+ "infinite" if infinite else manual_tries, url,
105+ filtered_req_args)
106
107 if session is None:
108 session = requests.Session()
109@@ -263,7 +265,8 @@ def readurl(url, data=None, timeout=None, retries=0, sec_between=1,
110 # to continue retrying and False to break and re-raise the
111 # exception
112 break
113- if i + 1 < manual_tries and sec_between > 0:
114+ if (infinite and sec_between > 0) or \
115+ (i + 1 < manual_tries and sec_between > 0):
116 LOG.debug("Please wait %s seconds while we wait to try again",
117 sec_between)
118 time.sleep(sec_between)
119diff --git a/tests/unittests/test_datasource/test_azure.py b/tests/unittests/test_datasource/test_azure.py
120index da7da0c..3e8b791 100644
121--- a/tests/unittests/test_datasource/test_azure.py
122+++ b/tests/unittests/test_datasource/test_azure.py
123@@ -1177,7 +1177,8 @@ class TestAzureDataSourcePreprovisioning(CiTestCase):
124 url = 'http://{0}/metadata/reprovisiondata?api-version=2017-04-02'
125 host = "169.254.169.254"
126 full_url = url.format(host)
127- fake_resp.return_value = mock.MagicMock(status_code=200, text="ovf")
128+ fake_resp.return_value = mock.MagicMock(status_code=200, text="ovf",
129+ content="ovf")
130 dsa = dsaz.DataSourceAzure({}, distro=None, paths=self.paths)
131 self.assertTrue(len(dsa._poll_imds()) > 0)
132 self.assertEqual(fake_resp.call_args_list,
133@@ -1185,13 +1186,8 @@ class TestAzureDataSourcePreprovisioning(CiTestCase):
134 headers={'Metadata': 'true',
135 'User-Agent':
136 'Cloud-Init/%s' % vs()
137- }, method='GET', timeout=60.0,
138- url=full_url),
139- mock.call(allow_redirects=True,
140- headers={'Metadata': 'true',
141- 'User-Agent':
142- 'Cloud-Init/%s' % vs()
143- }, method='GET', url=full_url)])
144+ }, method='GET', timeout=1,
145+ url=full_url)])
146 self.assertEqual(m_dhcp.call_count, 1)
147 m_net.assert_any_call(
148 broadcast='192.168.2.255', interface='eth9', ip='192.168.2.9',
149@@ -1217,7 +1213,8 @@ class TestAzureDataSourcePreprovisioning(CiTestCase):
150 username = "myuser"
151 odata = {'HostName': hostname, 'UserName': username}
152 content = construct_valid_ovf_env(data=odata)
153- fake_resp.return_value = mock.MagicMock(status_code=200, text=content)
154+ fake_resp.return_value = mock.MagicMock(status_code=200, text=content,
155+ content=content)
156 dsa = dsaz.DataSourceAzure({}, distro=None, paths=self.paths)
157 md, ud, cfg, d = dsa._reprovision()
158 self.assertEqual(md['local-hostname'], hostname)
159@@ -1227,12 +1224,7 @@ class TestAzureDataSourcePreprovisioning(CiTestCase):
160 headers={'Metadata': 'true',
161 'User-Agent':
162 'Cloud-Init/%s' % vs()},
163- method='GET', timeout=60.0, url=full_url),
164- mock.call(allow_redirects=True,
165- headers={'Metadata': 'true',
166- 'User-Agent':
167- 'Cloud-Init/%s' % vs()},
168- method='GET', url=full_url)])
169+ method='GET', timeout=1, url=full_url)])
170 self.assertEqual(m_dhcp.call_count, 1)
171 m_net.assert_any_call(
172 broadcast='192.168.2.255', interface='eth9', ip='192.168.2.9',

Subscribers

People subscribed via source and target branches