Merge ~xavpaice/charm-hw-health:bug/1945151 into charm-hw-health:master

Proposed by Xav Paice
Status: Merged
Approved by: James Troup
Approved revision: ee25cd55b2082f585b561f7441161ae6b734bcef
Merged at revision: 1835e401fc8723363e818e802972794a3c3d35ac
Proposed branch: ~xavpaice/charm-hw-health:bug/1945151
Merge into: charm-hw-health:master
Diff against target: 47 lines (+14/-8)
1 file modified
src/files/ipmi/cron_ipmi_sensors.py (+14/-8)
Reviewer Review Type Date Requested Status
🤖 prod-jenkaas-bootstack (community) continuous-integration Approve
BootStack Reviewers Pending
BootStack Reviewers Pending
Review via email: mp+409180@code.launchpad.net

Commit message

    Retry check_ipmi_sensor on failure

    Alters the cron task which runs check_ipmi_sensor to sleep for a second
    then retry, to filter out intermittent issues where some BMCs show
    errors then immediately recover.

To post a comment you must log in.
Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :

A CI job is currently in progress. A follow up comment will be added when it completes.

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

This merge proposal is being monitored by mergebot. Change the status to Approved to merge.

Revision history for this message
🤖 prod-jenkaas-bootstack (prod-jenkaas-bootstack) wrote :
review: Approve (continuous-integration)
Revision history for this message
Xav Paice (xavpaice) wrote :

FWIW, I believe the reason we use cron for this task rather than simply running the check from NRPE is that in some cases the task takes longer than NRPE's 10 seconds. This change therefore simply replicates a similar, though faster, retry that would normally be performed by Nagios if there's an error reported.

Revision history for this message
🤖 Canonical IS Merge Bot (canonical-is-mergebot) wrote :

Change successfully merged at revision 1835e401fc8723363e818e802972794a3c3d35ac

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/src/files/ipmi/cron_ipmi_sensors.py b/src/files/ipmi/cron_ipmi_sensors.py
2index 5b53f7d..f270cbb 100644
3--- a/src/files/ipmi/cron_ipmi_sensors.py
4+++ b/src/files/ipmi/cron_ipmi_sensors.py
5@@ -4,6 +4,7 @@
6 import os
7 import subprocess
8 import sys
9+from time import sleep
10
11 CHECK_IPMI_PID = "/var/run/nagios/check_ipmi_sensors.pid"
12 OUTPUT_FILE = "/var/lib/nagios/ipmi_sensors.out"
13@@ -14,6 +15,7 @@ NAGIOS_ERRORS = {
14 2: "CRITICAL",
15 3: "UNKNOWN",
16 }
17+RETRIES = 3
18
19
20 def write_output_file(output):
21@@ -51,14 +53,18 @@ def gather_metrics():
22 cmdline = [CMD]
23 if len(sys.argv) > 1:
24 cmdline.extend(sys.argv[1:])
25- try:
26- output = subprocess.check_output(cmdline).decode("utf8")
27- write_output_file(output)
28- except subprocess.CalledProcessError as error:
29- output = error.stdout.decode(errors="ignore")
30- write_output_file("{}: {}".format(NAGIOS_ERRORS[error.returncode], output))
31- except PermissionError as error:
32- write_output_file("UNKNOWN: {}".format(error))
33+ for _ in range(RETRIES):
34+ try:
35+ output = subprocess.check_output(cmdline).decode("utf8")
36+ break
37+ except subprocess.CalledProcessError as error:
38+ result = error.stdout.decode(errors="ignore")
39+ output = "{}: {}".format(NAGIOS_ERRORS[error.returncode], result)
40+ sleep(1)
41+ except PermissionError as error:
42+ output = "UNKNOWN: {}".format(error)
43+ break
44+ write_output_file(output)
45
46 # remove pid reference
47 os.remove(CHECK_IPMI_PID)

Subscribers

People subscribed via source and target branches

to all changes: