MAAS

Merge ~andreserl/maas:lp1783889_smartctl into maas:master

Proposed by Andres Rodriguez on 2018-07-26

Status:

Rejected

Rejected by:

Blake Rouse on 2019-09-20

Proposed branch:

~andreserl/maas:lp1783889_smartctl

Merge into:

maas:master

Diff against target:

27 lines (+8/-2)

1 file modified

src/metadataserver/builtin_scripts/smartctl.py (+8/-2)

Related bugs:

Bug #1783889: COMMISSION S.M.A.R.T Tests fail unnecessarily on code 64 (past log entries)

Medium

Triaged

Reviewer	Review Type	Date Requested	Status
Jan Klare (community)			Approve on 2019-07-12
MAAS Lander			Approve on 2018-07-29
Lee Trager (community)			Disapprove on 2018-07-26
Newell Jensen (community)		2018-07-26	Approve on 2018-07-26
Review via email: mp+351382@code.launchpad.net

Commit message

LP: #1783889 - Don't handle smartctl code 64 as an error

Revision history for this message

Newell Jensen (newell-jensen) wrote on 2018-07-26:

#

LGTM, one inline suggestion for your comment.

review: Approve

Revision history for this message

Lee Trager (ltrager) wrote on 2018-07-26:

#

I don't think we want to ignore 64. According to the man page, a return code of 64 means 'The device error log contains records of errors.' The device log could add an error at any time. For example, if a user deploys Ubuntu to be used as a web server and a bad block is encountered, it will be added to the log. This will cause a return code of 64 when smartctl runs. Effectively, that means the smartctl-validate test will not detect errors which are not considered catastrophic.

review: Disapprove

Revision history for this message

MAAS Lander (maas-lander) wrote on 2018-07-29:

#

UNIT TESTS
-b lp1783889_smartctl lp:~andreserl/maas/+git/maas into -b master lp:~maas-committers/maas

STATUS: SUCCESS
COMMIT: 93bfa1e28e7f8b806407ba4e84a2b064e1e349d7

review: Approve

Revision history for this message

Adam Beeman (abeeman) wrote on 2019-05-17:

#

I think it might still be necessary to ignore the exit status 64. We had some systems where the SSDs failed, and even after we replaced them with new drives, we still get an exit status 64 on a healthy drive. Since the smartctl.py file has changed a bit going from 2.4 to 2.5.x, here's a new diff for 2.5.3:

diff --git a/src/metadataserver/builtin_scripts/smartctl.py b/src/metadataserver/builtin_scripts/smartctl.py
index be57fe06d..4ea8c6e7e 100644
--- a/src/metadataserver/builtin_scripts/smartctl.py
+++ b/src/metadataserver/builtin_scripts/smartctl.py
@@ -252,7 +252,9 @@ def check_smartctl(blockdevice, device=None):
     except CalledProcessError as e:
         # A return code of 4 means a smartctl command failed or a checksum
         # error was discovered. This is surprisingly common so ignore it.
-        if e.returncode != 4 or not e.output:
+        # A return code of 64 means a smartctl command detect past errors,
+        # but not current ones; this may happen after replacing a drive.
+        if (e.returncode != 4 and e.returncode != 64) or not e.output:
             print(
                 'FAILURE: SMART tests have FAILED for: %s' % device_name)
             print('The test exited with return code %s! See the smarctl '
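
For readers following the logic, the patched condition can be restated as a small standalone predicate. This is only a sketch: `is_smart_failure` and `IGNORED_RETURN_CODES` are hypothetical names that do not appear in smartctl.py, and the function does nothing more than mirror the `if` in the diff above.

```python
# Hypothetical restatement of the patched condition; not MAAS code.
IGNORED_RETURN_CODES = frozenset({4, 64})


def is_smart_failure(returncode, output):
    """Return True when the patched check would report a FAILURE.

    Mirrors: (returncode != 4 and returncode != 64) or not output
    """
    return returncode not in IGNORED_RETURN_CODES or not output


if __name__ == "__main__":
    # Print how the condition treats a few return codes with non-empty output.
    for code in (4, 64, 68, 8, 72):
        print(code, is_smart_failure(code, b"smartctl output"))
```

Because the comparison is made on the whole return code rather than on individual bits, a combined status such as 68 (4 + 64) would still be reported as a failure under this condition.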

Revision history for this message

Jan Klare (j-klare) wrote on 2019-07-12:

#

lgtm, tested in our setup with maas 2.4.2-7034-g2f5deb8b8-0ubuntu1

review: Approve

Revision history for this message

Blake Rouse (blake-rouse) wrote on 2019-09-20:

#

Rejecting due to inactivity.

Revision history for this message

Freddy (fwieffering) wrote on 2020-01-16:

#

I'm not sure why this was closed. The bug is still open.

Revision history for this message

Jan Klare (j-klare) wrote on 2020-02-13:

#

I still think this needs to be merged or addressed otherwise since a return code of 64 might just mean that smartctl found an error a very long time ago and the block involved was already deactivated by the SSD controller itself and replaced by one of the spare blocks. So the disk might be perfectly fine to use.

Revision history for this message

Lee Trager (ltrager) wrote on 2020-02-13:

#

smartctl returns 64 whenever an error is found in the SMART logs. We have
no way of knowing whether that error was found during testing or is an old
error. If we ignore return code 64 we may be ignoring early warning signs
of drives dying which the test is designed to catch. We've added the
ability in MAAS to override a failed test so if an administrator determines
a failure is a false positive they can still use the machine.

Revision history for this message

Jan Klare (j-klare) wrote on 2020-02-14:

#

Thanks for the quick response, I appreciate you taking the time to discuss this. In the current implementation we explicitly ignore return code 4, which means that "Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure". In this scenario we also have no way of knowing what state the disk is in. If we want to be consistent here and give the user a warning whenever we cannot ensure the disk is fine, we should also remove this exception. Since I think removing it would be very inconvenient for all the users that have disks that are not smartctl compatible, I think we should not do this. From my current understanding, return code 64 means that there are errors in the logs, but the current check was fine. If the current check had failed, the return code would be either 8 or 72 (please correct me if I am wrong here, I have only read the man pages for smartctl). I totally understand that it is important to let the user know when disks are failing, but I think the current implementation is very inconvenient for disks that had an error a long time ago (which is pretty common during the lifetime of an SSD, but nothing to worry about).
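
The return code discussed here is a bitmask, which is why 72 can be read as 64 + 8. As a minimal sketch, assuming the bit meanings listed in the smartctl man page (wording abbreviated here), the set bits can be decoded as follows. `SMARTCTL_BITS` and `decode_smartctl_status` are illustrative names, not MAAS or smartmontools APIs.

```python
# Bit meanings abbreviated from the smartctl man page; names are illustrative.
SMARTCTL_BITS = {
    1: "command line did not parse",
    2: "device open failed",
    4: "a SMART or other ATA command failed, or checksum error",
    8: 'SMART status check returned "DISK FAILING"',
    16: "prefail attributes at or below threshold",
    32: "attributes were at or below threshold in the past",
    64: "device error log contains records of errors",
    128: "device self-test log contains records of errors",
}


def decode_smartctl_status(returncode):
    """Return the descriptions of every bit set in a smartctl return code."""
    return [msg for bit, msg in SMARTCTL_BITS.items() if returncode & bit]


if __name__ == "__main__":
    # Show the decoded meaning of the codes mentioned in this thread.
    for code in (4, 64, 8, 72):
        print(code, decode_smartctl_status(code))
```

Read this way, 64 on its own sets only the error-log bit, while 72 sets both the error-log bit and the "DISK FAILING" bit.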

Unmerged commits

93bfa1e... by Andres Rodriguez on 2018-07-26: LP: #1783889 - Don't handle smartctl code 64 as an error

Subscribers

People subscribed via source and target branches

to all changes:

Andres Rodriguez

MAAS Committers

Matvej Jurbin

Shane Holloman