Merge lp:~hloeung/ubuntu-repository-cache/health-check-retries-on-failure into lp:ubuntu-repository-cache

Proposed by Haw Loeung
Status: Merged
Approved by: Benjamin Allot
Approved revision: 332
Merged at revision: 327
Proposed branch: lp:~hloeung/ubuntu-repository-cache/health-check-retries-on-failure
Merge into: lp:ubuntu-repository-cache
Diff against target: 200 lines (+68/-39)
5 files modified
files/health_check.py (+54/-37)
templates/apache2/archive_ubuntu_com.conf (+2/-1)
tests/unit/test_apache.py (+8/-0)
tests/unit/test_health_check.py (+1/-1)
tests/unit/test_ubuntu_repository_cache.py (+3/-0)
To merge this branch: bzr merge lp:~hloeung/ubuntu-repository-cache/health-check-retries-on-failure
Reviewers: Benjamin Allot (Approve), Barry Price (Approve)
Review via email: mp+399257@code.launchpad.net

Commit message

Make health-check script/endpoint more resilient

u-r-c units are usually deployed in public clouds with the sync-host
set to the main Ubuntu archive servers. We can't control the network
connectivity or conditions there, so update the health-check script to
retry up to 3 times on failure.

Description of the change

Example testing in a local Canonistack test environment:

| ubuntu@juju-87625f-hloeung-93:$ sudo iptables -I OUTPUT -d 91.189.88.142 -j DROP
| ubuntu@juju-87625f-hloeung-93:$ sudo iptables -I OUTPUT -d 91.189.88.152 -j DROP
| ubuntu@juju-87625f-hloeung-93:$ sudo cp ~/health_check.py health-check.py; sudo -u www-data ./health-check.py > /dev/null
| 2021-03-07 23:05:39 - [T2NHOOmc]: OK: #1: /ubuntu/dists/focal/InRelease (0.006s)
| 2021-03-07 23:05:44 - [T2NHOOmc]: ERROR: #1: /ubuntu/pool/main/: HTTPConnectionPool(host='localhost', port=80): Read timed out. (read timeout=5) (5.006s)
| 2021-03-07 23:05:49 - [T2NHOOmc]: ERROR: #2: /ubuntu/pool/main/: HTTPConnectionPool(host='localhost', port=80): Read timed out. (read timeout=5) (5.010s)
| 2021-03-07 23:05:54 - [T2NHOOmc]: ERROR: #3: /ubuntu/pool/main/: HTTPConnectionPool(host='localhost', port=80): Read timed out. (read timeout=5) (5.010s)
| 2021-03-07 23:05:54 - [T2NHOOmc]: 503 Service Unavailable (15.035s)
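For reference, the retry behaviour exercised above lives in the new check_url() helper in files/health_check.py (see the diff below). A trimmed sketch, with the metrics bookkeeping dropped, the per-exception handlers collapsed into one, and the retry count exposed as a parameter purely for illustration (the branch hard-codes 3):

```
import logging
import time
from urllib.parse import urljoin

import requests

CHECK_HOST = 'http://localhost/'
CHECK_HEADERS = {'User-agent': 'ubuntu-repository-cache/healthcheck'}
CHECK_TIMEOUT = 5


def check_url(url, retries=3):
    """HEAD a URL against the local cache, retrying on failure."""
    start_time = time.time()
    status_code = 0
    msg = ''
    for count in range(1, retries + 1):
        try:
            resp = requests.head(urljoin(CHECK_HOST, url), headers=CHECK_HEADERS, timeout=CHECK_TIMEOUT)
            status_code = resp.status_code
            resp.raise_for_status()
            break  # success, stop retrying
        except requests.exceptions.RequestException as err:
            # Covers HTTPError, ConnectionError and the timeout exceptions
            # that the real script handles individually.
            logging.error('ERROR: #%s: %s: %s', count, url, err)
            msg = err
    return status_code, time.time() - start_time, msg
```

With CHECK_TIMEOUT lowered to 5 seconds, the worst case per URL is 3 retries x 5s, which matches the ~15s total in the log above; across the two URLs checked, that is also where the new CGIDScriptTimeout of 30 in the Apache template comes from.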

Revision history for this message
Canonical IS Mergebot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

328. By Haw Loeung

Fixed based on review

Revision history for this message
Barry Price (barryprice) wrote :

LGTM +1

review: Approve
Revision history for this message
Tom Haddon (mthaddon) wrote :

Some comments inline

329. By Haw Loeung

Fixed logging lines to let the logger build/format the strings per review feedback
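For context, the change here is from building log messages eagerly with str.format() to passing the arguments through and letting the logging module do the formatting lazily, e.g.:

```
import logging

url, duration = '/ubuntu/pool/main/', 5.01

# Before: the message string is built on every call, even when the log
# level means the record is discarded.
logging.info('OK: {} ({:.3f}s)'.format(url, duration))

# After: formatting is deferred to the logger and only happens if the
# record is actually emitted.
logging.info('OK: %s (%.3fs)', url, duration)
```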

Revision history for this message
Benjamin Allot (ballot) wrote :

I see a varying number of test failures when running `make test`:

https://pastebin.canonical.com/p/BWrn6vYB46/

I had 1 failure at first, then 4, and I kept getting 4 failures on subsequent runs.

I looked into the first one:
```
tests/unit/test_apache.py::TestCharm::test_configure_health_check FAILED
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> traceback >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

self = <test_apache.TestCharm testMethod=test_configure_health_check>, check_call = <MagicMock name='check_call' id='139841716545280'>, call = <MagicMock name='call' id='139841716291136'>
seteuid = <MagicMock name='seteuid' id='139841716303232'>, setegid = <MagicMock name='setegid' id='139841716311232'>, chown = <MagicMock name='chown' id='139841715627488'>
write_file = <MagicMock name='write_file' id='139841715237360'>

    @mock.patch('charmhelpers.core.host.write_file')
    @mock.patch('os.chown')
    @mock.patch('os.setegid')
    @mock.patch('os.seteuid')
    @mock.patch('subprocess.call')
    @mock.patch('subprocess.check_call')
    def test_configure_health_check(self, check_call, call, seteuid, setegid, chown, write_file):
        config = hookenv.Config({'enable_healthcheck': True})
> self.assertTrue(apache.configure_health_check(config))
E AssertionError: False is not true

tests/unit/test_apache.py:63: AssertionError
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> entering PDB >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB post_mortem >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /usr/lib/python3.8/unittest/case.py(765)assertTrue()
-> raise self.failureException(msg)
(Pdb) u
> /home/ballot/repos/charms/ubuntu-repository-cache/health-check-retries-on-failure/tests/unit/test_apache.py(63)test_configure_health_check()
-> self.assertTrue(apache.configure_health_check(config))
(Pdb) pp config
{'apache2_server_signature': 'On',
 'apache2_server_tokens': 'OS',
 'apache2_trace_enabled': 'Off',
 'display-host': 'archive.ubuntu.com',
 'enable_healthcheck': True,
 'path-base': 'ubuntu',
 'remoteip_logging': True,
 'sync-host': 'us.archive.ubuntu.com'}
```

Apparently the `if config.changed('enable_healthcheck'):` branch is not triggered by `config = hookenv.Config({'enable_healthcheck': True})`.
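As far as I can tell, charmhelpers' `hookenv.Config` saves its values to a `.juju-persistent-config` file in the charm directory, and `config.changed(key)` compares the current value against what that file recorded, so a file left behind by an earlier test makes `changed('enable_healthcheck')` return False and `configure_health_check()` report failure. A minimal sketch of the clean-up that setUp() gains in a later revision (the helper name here is mine):

```
import os


def reset_persisted_config(path='.juju-persistent-config'):
    """Drop config saved by a previous test so config.changed() sees a fresh state."""
    # Hypothetical helper mirroring the setUp() change in the diff below.
    if os.path.exists(path):
        os.unlink(path)
```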

330. By Haw Loeung

Fixed double since and new logging changes

331. By Haw Loeung

Fixed unit tests to reset charm-helpers config.changed states

332. By Haw Loeung

Mock CPU count to ensure unit tests also pass on systems with more or fewer CPU cores

Revision history for this message
Benjamin Allot (ballot) wrote :

LGTM

review: Approve
Revision history for this message
Canonical IS Mergebot (canonical-is-mergebot) wrote :

Change successfully merged at revision 327

Preview Diff

=== modified file 'files/health_check.py'
--- files/health_check.py 2021-03-03 23:27:56 +0000
+++ files/health_check.py 2021-03-08 21:00:28 +0000
@@ -18,7 +18,7 @@
     'User-agent': 'ubuntu-repository-cache/healthcheck',
 }
 CHECK_HOST = 'http://localhost/'
-CHECK_TIMEOUT = 10
+CHECK_TIMEOUT = 5
 CHECK_URLS = ['/ubuntu/dists/focal/InRelease', '/ubuntu/pool/main/']
 
 # By default, check the flag file in the current path as the
@@ -53,12 +53,54 @@
         with s.makefile(mode='w') as dest:
             dest.write(metrics)
     except Exception as err:
-        logging.warning('Failed to send metrics: {}'.format(err))
+        logging.warning('Failed to send metrics: %s', err)
         # We don't care if it fails to send metrics.
         pass
 
 
-def main(disabled_flag=DISABLE_FLAG):  # NOQA: C901
+def check_url(url):
+    start_time = time.time()
+
+    count = 0
+    while count < 3:
+        count += 1
+
+        status_code = 0
+        start_time_url = time.time()
+        msg = ''
+        try:
+            check_url = urljoin(CHECK_HOST, url)
+            resp = requests.head(check_url, headers=CHECK_HEADERS, timeout=CHECK_TIMEOUT)
+            status_code = resp.status_code
+            if resp.ok:
+                duration = time.time() - start_time_url
+                logging.info('OK: #%s: %s (%.3f)', count, url, duration)
+                break
+            else:
+                resp.raise_for_status()
+        except requests.exceptions.HTTPError as err:
+            duration = time.time() - start_time_url
+            logging.error('ERROR: #%s: %s: %s (%.3f)', count, url, err, duration)
+            msg = err
+        except (
+            requests.exceptions.ConnectTimeout,
+            requests.exceptions.ConnectionError,
+            requests.exceptions.Timeout,
+        ) as err:
+            duration = time.time() - start_time_url
+            logging.error('ERROR: #%s: %s: %s (%.3f)', count, url, err, duration)
+            msg = err
+        except Exception as err:
+            duration = time.time() - start_time_url
+            # We don't want to show exactly what the error is, but we
+            # do want to log it.
+            logging.critical('ERROR: #%s: %s: %s (%.3f)', count, url, err, duration)
+            msg = 'Unknown error'
+
+    return status_code, time.time() - start_time, msg
+
+
+def main(disabled_flag=DISABLE_FLAG):
     # logging lines include a unique identifier so it's easier to grep
     # and group all related lines. 'pa' are alternate characters
     # replacing '+' with 'p' and '/' with 'a'.
@@ -76,7 +118,7 @@
         print('Content-type: text/html\n')
         disabled_time = os.stat(disabled_flag).st_mtime
         print('Currently disabled for maintenance since {}'.format(time.ctime(int(disabled_time))), flush=True)
-        logging.info('DISABLED since {:.3f}'.format(disabled_time))
+        logging.info('DISABLED since %.3f', disabled_time)
         # Rather than 0 here, we want to make maintenance/disabled
         # units stand out a little in the grafana graphs, so 5.03s with
         # the average being ~200-300 ms to the backends.
@@ -97,41 +139,16 @@
         send_to_influx(render_influx(METRIC_NAME, 'status=DISABLED', 4.04))
         return
 
-    start_total = time.time()
+    start_time = time.time()
 
     successes = {}
     failures = {}
     for url in CHECK_URLS:
-        status_code = 0
-        start_url = time.time()
-        try:
-            check_url = urljoin(CHECK_HOST, url)
-            resp = requests.head(check_url, headers=CHECK_HEADERS, timeout=CHECK_TIMEOUT)
-            status_code = resp.status_code
-            if resp.ok:
-                duration = time.time() - start_url
-                logging.info('OK: {} ({:.3f}s)'.format(url, duration))
-                successes[url] = [status_code, duration]
-            else:
-                resp.raise_for_status()
-        except requests.exceptions.HTTPError as err:
-            duration = time.time() - start_url
-            logging.error('ERROR: {}: {} ({:.3f}s)'.format(url, err, duration))
-            failures[url] = [status_code, duration, err]
-        except (
-            requests.exceptions.ConnectTimeout,
-            requests.exceptions.ConnectionError,
-            requests.exceptions.Timeout,
-        ) as err:
-            duration = time.time() - start_url
-            logging.error('ERROR: {}: {} ({:.3f}s)'.format(url, err, duration))
-            failures[url] = [0, duration, err]
-        except Exception as err:
-            duration = time.time() - start_url
-            # We don't want to show exactly what the error is, but we
-            # do want to log it.
-            logging.critical('ERROR: {}: {} ({:.3f}s)'.format(url, err, duration))
-            failures[url] = [0, duration, 'Unknown error']
+        res = check_url(url)
+        if res[0] == 200:
+            successes[url] = res
+        else:
+            failures[url] = res
 
     metrics = []
     if failures:
@@ -162,8 +179,8 @@
     # request/connection on sending metrics to influx.
     print('', flush=True)
 
-    duration = time.time() - start_total
-    logging.info('{} ({:.3f}s)'.format(status, duration))
+    duration = time.time() - start_time
+    logging.info('%s (%.3f)', status, duration)
 
     for metric in metrics:
         send_to_influx(metric)

=== modified file 'templates/apache2/archive_ubuntu_com.conf'
--- templates/apache2/archive_ubuntu_com.conf 2021-03-01 22:15:53 +0000
+++ templates/apache2/archive_ubuntu_com.conf 2021-03-08 21:00:28 +0000
@@ -41,7 +41,8 @@
         Header append Cache-Control "no-cache"
         Require all granted
     </Directory>
-    CGIDScriptTimeout 20
+    # health-check script CHECK_TIMEOUT * no. of retries * no. of URLs
+    CGIDScriptTimeout 30
     ProxyPass /_health-check/ !
 </IfModule>
 {% endif %}

=== modified file 'tests/unit/test_apache.py'
--- tests/unit/test_apache.py 2021-03-07 22:00:26 +0000
+++ tests/unit/test_apache.py 2021-03-08 21:00:28 +0000
@@ -48,9 +48,17 @@
         self.addCleanup(patcher.stop)
         self.mock_log.return_value = ''
 
+        patcher = mock.patch('multiprocessing.cpu_count')
+        self.mock_cpu_count = patcher.start()
+        self.addCleanup(patcher.stop)
+        self.mock_cpu_count.return_value = 2
+
         status.active.reset_mock()
         status.blocked.reset_mock()
         status.maintenance.reset_mock()
+        # We also need to reset config.changed states.
+        if os.path.exists('.juju-persistent-config'):
+            os.unlink('.juju-persistent-config')
 
     @mock.patch('charmhelpers.core.host.write_file')
     @mock.patch('os.chown')

=== modified file 'tests/unit/test_health_check.py'
--- tests/unit/test_health_check.py 2021-02-16 23:48:45 +0000
+++ tests/unit/test_health_check.py 2021-03-08 21:00:28 +0000
@@ -63,7 +63,7 @@
         os.mknod(disabled_flag)
         os.utime(disabled_flag, times=(1613518897.123, 1613518897.123))
         health_check.main(disabled_flag)
-        log_info.assert_called_with('DISABLED since 1613518897.123')
+        log_info.assert_called_with('DISABLED since %.3f', 1613518897.123)
         send_to_influx.assert_called_with('ubuntu_repository_cache_health,status=DISABLED value=5.03')
 
     @mock.patch('files.health_check.send_to_influx')

=== modified file 'tests/unit/test_ubuntu_repository_cache.py'
--- tests/unit/test_ubuntu_repository_cache.py 2021-02-04 09:22:09 +0000
+++ tests/unit/test_ubuntu_repository_cache.py 2021-03-08 21:00:28 +0000
@@ -56,6 +56,9 @@
         status.active.reset_mock()
         status.blocked.reset_mock()
         status.maintenance.reset_mock()
+        # We also need to reset config.changed states.
+        if os.path.exists('.juju-persistent-config'):
+            os.unlink('.juju-persistent-config')
 
     @mock.patch('charms.reactive.set_flag')
     def test_set_active(self, set_flag):
