Merge ~andreserl/maas:lp1783889_smartctl into maas:master

Proposed by Andres Rodriguez
Status: Rejected
Rejected by: Blake Rouse
Proposed branch: ~andreserl/maas:lp1783889_smartctl
Merge into: maas:master
Diff against target: 27 lines (+8/-2)
1 file modified
src/metadataserver/builtin_scripts/smartctl.py (+8/-2)
Reviewer Review Type Date Requested Status
Jan Klare (community) Approve
MAAS Lander Approve
Lee Trager (community) Disapprove
Newell Jensen (community) Approve
Review via email: mp+351382@code.launchpad.net

Commit message

LP: #1783889 - Don't handle smartctl code 64 as an error

To post a comment you must log in.
Revision history for this message
Newell Jensen (newell-jensen) wrote :

LGTM, one inline suggestion for your comment.

review: Approve
Revision history for this message
Lee Trager (ltrager) wrote :

I don't think we want to ignore 64. According to the man page a return code of 64 means 'The device error log contains records of errors.' The device log could add an error at any time. For example if a user deploys Ubuntu to be used as a web server and a bad block is encountered it will be added to the log. This will cause an error of 64 when smartctl runs. Effectively that means the smartctl-validate test will not detect errors which are not considered catastrophic.

review: Disapprove
Revision history for this message
MAAS Lander (maas-lander) wrote :

UNIT TESTS
-b lp1783889_smartctl lp:~andreserl/maas/+git/maas into -b master lp:~maas-committers/maas

STATUS: SUCCESS
COMMIT: 93bfa1e28e7f8b806407ba4e84a2b064e1e349d7

review: Approve
Revision history for this message
Adam Beeman (abeeman) wrote :

I think it might still be necessary to ignore the exit status 64. We had some systems where the SSD's failed and even after we replaced them with new drives, we still get an exit status 64 on a healthy drive. Since the smartctl.py file has changed a bit going from 2.4 to 2.5.x, here's a new diff for 2.5.3:

diff --git a/src/metadataserver/builtin_scripts/smartctl.py b/src/metadataserver/builtin_scripts/smartctl.py
index be57fe06d..4ea8c6e7e 100644
--- a/src/metadataserver/builtin_scripts/smartctl.py
+++ b/src/metadataserver/builtin_scripts/smartctl.py
@@ -252,7 +252,9 @@ def check_smartctl(blockdevice, device=None):
     except CalledProcessError as e:
         # A return code of 4 means a smartctl command failed or a checksum
         # error was discovered. This is surprisingly common so ignore it.
- if e.returncode != 4 or not e.output:
+ # A return code of 64 means a smartctl command detect past errors,
+ # but not current ones; this may happen after replacing a drive.
+ if (e.returncode != 4 and e.returncode != 64) or not e.output:
             print(
                 'FAILURE: SMART tests have FAILED for: %s' % device_name)
             print('The test exited with return code %s! See the smarctl '

Revision history for this message
Jan Klare (j-klare) wrote :

lgtm, tested in our setup with maas 2.4.2-7034-g2f5deb8b8-0ubuntu1

review: Approve
Revision history for this message
Blake Rouse (blake-rouse) wrote :

Rejecting due to inactivity.

Revision history for this message
Freddy (fwieffering) wrote :

I'm not sure why this was closed? The bug is still open

Revision history for this message
Jan Klare (j-klare) wrote :

I still think this needs to be merged or addressed otherwise since a return code of 64 might just mean that smartctl found an error a very long time ago and the block involved was already deactivated by the SSD controller itself and replaced by one of the spare blocks. So the disk might be perfectly fine to use.

Revision history for this message
Lee Trager (ltrager) wrote :

smartctl returns 64 whenever an error is found in the SMART logs. We have
no way of knowing whether that error was found during testing or is an old
error. If we ignore return code 64 we may be ignoring early warning signs
of drives dying which the test is designed to catch. We've added the
ability in MAAS to override a failed test so if an administrator determines
a failure is a false positive they can still use the machine.

On Thu, Feb 13, 2020 at 12:43 PM Jan Klare <email address hidden> wrote:

> I still think this needs to be merged or addressed otherwise since a
> return code of 64 might just mean that smartctl found an error a very long
> time ago and the block involved was already deactivated by the SSD
> controller itself and replaced by one of the spare blocks. So the disk
> might be perfectly fine to use.
> --
> https://code.launchpad.net/~andreserl/maas/+git/maas/+merge/351382
> You are reviewing the proposed merge of ~andreserl/maas:lp1783889_smartctl
> into maas:master.
>
> Launchpad-Message-Rationale: Reviewer
> Launchpad-Message-For: ltrager
> Launchpad-Notification-Type: code-review
> Launchpad-Branch: ~andreserl/maas/+git/maas:lp1783889_smartctl
> Launchpad-Project: maas
>

Revision history for this message
Jan Klare (j-klare) wrote :

Thanks for the quick response, i appreciate you taking the time to discuss this. In the current implementation we explicitly ignore return code 4, which means that "Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure". In this scenario we also have no way of knowing in which status the disk is. If we want to be consistent here and give the user a warning if we can not ensure the disk is fine, we should also remove this exception. Since i think removing this would be very inconvenient for all the users that have disks that are not smartctl compatible, i think we should not do this. From my current understanding the return code 64 means that there are errors in the logs, but the current check was fine. If the current check would have failed, the return code would either be 8 or 72 (please correct me if i am wrong here, i have only read the man pages for smartctl). I totally understand that it is important to let the user know when disks are failing, but i think the current implementation is very inconvenient for disks that have had an error a long time ago (which is pretty common during the lifetime of an SSD, but nothing to worry about).

Unmerged commits

93bfa1e... by Andres Rodriguez

LP: #1783889 - Don't handle smartctl code 64 as an error

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/src/metadataserver/builtin_scripts/smartctl.py b/src/metadataserver/builtin_scripts/smartctl.py
2index 6b747f7..c58126a 100644
3--- a/src/metadataserver/builtin_scripts/smartctl.py
4+++ b/src/metadataserver/builtin_scripts/smartctl.py
5@@ -156,7 +156,12 @@ def run_smartctl(storage, test=None):
6 print('INFO: Running `smartctl --xall %s` timed out!' % storage)
7 smartctl_passed = False
8 else:
9- if proc.returncode != 0 and proc.returncode != 4:
10+ # LP: #1783889 - smartctl returns code 64 if old indicators of
11+ # a failure are detected, but don't mean that the test have
12+ # actually failed. As such, if code 64 is returned don't treat
13+ # it as an error.
14+ if (proc.returncode != 0 and proc.returncode != 4 and
15+ proc.returncode != 64):
16 print('FAILURE: SMART tests have FAILED for: %s' % storage)
17 print('The test exited with return code %s! See the smarctl '
18 'manpage for information on the return code meaning. '
19@@ -169,7 +174,8 @@ def run_smartctl(storage, test=None):
20 if output is not None:
21 print('---------------------------------------------------\n')
22 print(output.decode('utf-8'))
23- return 0 if proc.returncode == 4 else proc.returncode
24+ return (0 if proc.returncode == 4 or proc.returncode == 64 else
25+ proc.returncode)
26
27 return 0 if smartctl_passed else 1
28

Subscribers

People subscribed via source and target branches