Merge lp:~jameinel/bzr/2.4-fdatasync-ENOTSUP-1075108 into lp:bzr/2.4

Proposed by John A Meinel
Status: Merged
Approved by: John A Meinel
Approved revision: no longer in the source branch.
Merged at revision: 6075
Proposed branch: lp:~jameinel/bzr/2.4-fdatasync-ENOTSUP-1075108
Merge into: lp:bzr/2.4
Prerequisite: lp:~jameinel/bzr/2.4-overridAttr-non-existant
Diff against target: 120 lines (+66/-1)
3 files modified
bzrlib/osutils.py (+14/-1)
bzrlib/tests/test_osutils.py (+44/-0)
doc/en/release-notes/bzr-2.4.txt (+8/-0)
To merge this branch: bzr merge lp:~jameinel/bzr/2.4-fdatasync-ENOTSUP-1075108
Reviewer Review Type Date Requested Status
Vincent Ladeuil Needs Fixing
Martin Packman (community) Approve
Review via email: mp+132877@code.launchpad.net

Commit message

Bug #1075108, handle when fdatasync returns EOPNOTSUPP, which should be considered a non-fatal error.

Description of the change

This branch is a fix for calling fdatasync() in cases where it would otherwise fail.

We introduced fdatasync() as a way to reduce the window for power failure crashes triggering data loss. However, it appears that if you are on an SSH mounted filesystem, we will get an ENOTSUP error.

This change makes it so that any IOError while calling fdatasync() is just logged quietly rather than failing. Similar to what we do if 'chmod()' fails.

We could be more strict in the type of errors that we suppress, but ENOTSUP doesn't exist on Windows, and it isn't like we would get useful errors. Instead we try fdatasync, but don't worry if it fails, because we used to not call it at all.

This depends on my overrideAttr change, because I wanted to have the tests run even on Windows, where fdatasync doesn't exist. And even just doing lots of 'if getattr()... is None: raise TestNotApplicable' makes the tests hard to read.

I can split it out if we decide not to land the overrideAttr change.

To post a comment you must log in.
Revision history for this message
Martin Packman (gz) wrote :

Seems like the best option. It's unfortunate that this will mask persistent errors such as the one in the bug, ideally we'd want to inform the user that their filesystem doesn't provide data safety rather than just logging repeatedly, but that's hard from a command line tool that is run many times.

review: Approve
Revision history for this message
Vincent Ladeuil (vila) wrote :

I have a bad feeling about ignoring errors there, the bug you're fixing is the first occurrence of fdatasync() failing in a very specific context. I'd rather trap only that specific case and keep failing for the others rather than silently ignoring them (the probability that someone notices the mutter() messages in the log is pretty much zero).

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/6/2012 12:43 PM, Vincent Ladeuil wrote:
> I have a bad feeling about ignoring errors there, the bug you're
> fixing is the first occurrence of fdatasync() failing in a very
> specific context. I'd rather trap only that specific case and keep
> failing for the others rather than silently ignoring them (the
> probability that someone notices the mutter() messages in the log
> is pretty much zero).
>

We can trap whatever we feel like, but note that we've done roughly
the same thing for chmod. If it fails, it just gets in a user's way,
rather than actually doing something useful for them.

I'm willing to be swayed, and I was hesitant as well. But the failure
mode here is that Dimiter can't use bzr *at all* because of ENOTSUP.
If we fail to fdatasync, we still have written the data.

I'd like to turn it around. Can you come up with a specific errno that
clearly indicates we have a problem that we should stop on? (That
wouldn't have triggered during the write, etc).

I like being cautious, but I also like not preventing someone from
actually getting their work done until we manage to add one more error
code into an exception clause.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (Cygwin)
Comment: Using GnuPG with Mozilla - http://www.enigmail.net/

iEYEARECAAYFAlCY0JMACgkQJdeBCYSNAAPdEgCfQzNNtCTVW4vGNVRNE6Sl4A6t
c/IAmQEfqz9NDi+HSojfoOKV1x1r70ec
=4eMi
-----END PGP SIGNATURE-----

Revision history for this message
Vincent Ladeuil (vila) wrote :

> I like being cautious, but I also like not preventing someone from
> actually getting their work done until we manage to add one more error
> code into an exception clause.

My point was more about a valid fdatasync() error that would reveal a data loss and will be ignored with your patch.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/6/2012 1:59 PM, Vincent Ladeuil wrote:
>> I like being cautious, but I also like not preventing someone
>> from actually getting their work done until we manage to add one
>> more error code into an exception clause.
>
> My point was more about a valid fdatasync() error that would reveal
> a data loss and will be ignored with your patch.
>
>

Sure, if you can come up with an fdatasync() error code that is
clearly a data loss then I'm happy to exclude that one, and even work
harder on whitelisting. But if we get EACCESS, or EAGAIN, or EINTR, or
ENOTSUP, or ... we can just ignore the request as we weren't doing it
in the past either.
We log it in the case we need to do a retrospective. I certainly agree
that people won't see it from the beginning.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (Cygwin)
Comment: Using GnuPG with Mozilla - http://www.enigmail.net/

iEYEARECAAYFAlCY4nkACgkQJdeBCYSNAAOEogCgzzY5hQE/jcFKfrbyWUtl0SbS
ynEAn0Fgn3aL0MObohHo+yjl6Jrn4+bq
=yCpc
-----END PGP SIGNATURE-----

Revision history for this message
Vincent Ladeuil (vila) wrote :

> Sure, if you can come up with an fdatasync() error code that is
> clearly a data loss

Better safe than sorry. Neither of us want to inspect fdatasync() implementation nor guess what could come out from it. What I'm saying is that your patch removes a safe guard around a mechanism added to *add* reliability. If you don't care about it, you may as well remove fdatasync() altogether rather than ignoring what it may report.

fdatasync() has been added in 2.4 and we got a single report that it failed and that's with an ssh mounted fs (a rather edge case since having ssh access our users are better served by bzr+ssh...).

It's a bit thin to suddenly switch to a mode where we ignore *all* errors instead of catching the unusual one.

Revision history for this message
Martin Packman (gz) wrote :

The point is if we raise an exception here, the user *is* sorry, not safe. It's similar to the exception on close problem but worse, the best we can really do is log. It would be nice to sensibly report in the UI if fdatasyn is unhappy, but continuing regardless in this case really is the safest option.

Revision history for this message
Vincent Ladeuil (vila) wrote :

> The point is if we raise an exception here, the user *is* sorry, not safe.

My point is that if an exception is raised the user will be sorry to not be informed that its data could not be saved on disk. From 'man fdatasync':

ERRORS
 <...>

       EIO An error occurred during synchronization.
<...>

If data is silently corrupted, the bug reports we'll get won't make sense, making it harder to help the user.

review: Needs Fixing
Revision history for this message
John A Meinel (jameinel) wrote :

sent to pqm by email

Revision history for this message
John A Meinel (jameinel) wrote :

Per Vincent's request I went with a whitelist of errnos that can be considered non-fatal.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'bzrlib/osutils.py'
2--- bzrlib/osutils.py 2011-11-29 20:20:02 +0000
3+++ bzrlib/osutils.py 2013-05-23 08:33:27 +0000
4@@ -2522,6 +2522,10 @@
5 else:
6 is_local_pid_dead = _posix_is_local_pid_dead
7
8+_maybe_ignored = ['EAGAIN', 'EINTR', 'ENOTSUP', 'EOPNOTSUPP', 'EACCES']
9+_fdatasync_ignored = [getattr(errno, name) for name in _maybe_ignored
10+ if getattr(errno, name, None) is not None]
11+
12
13 def fdatasync(fileno):
14 """Flush file contents to disk if possible.
15@@ -2531,7 +2535,16 @@
16 """
17 fn = getattr(os, 'fdatasync', getattr(os, 'fsync', None))
18 if fn is not None:
19- fn(fileno)
20+ try:
21+ fn(fileno)
22+ except IOError, e:
23+ # See bug #1075108, on some platforms fdatasync exists, but can
24+ # raise ENOTSUP. However, we are calling fdatasync to be helpful
25+ # and reduce the chance of corruption-on-powerloss situations. It
26+ # is not a mandatory call, so it is ok to suppress failures.
27+ trace.mutter("ignoring error calling fdatasync: %s" % (e,))
28+ if getattr(e, 'errno', None) not in _fdatasync_ignored:
29+ raise
30
31
32 def ensure_empty_directory_exists(path, exception_class):
33
34=== modified file 'bzrlib/tests/test_osutils.py'
35--- bzrlib/tests/test_osutils.py 2011-10-04 18:43:55 +0000
36+++ bzrlib/tests/test_osutils.py 2013-05-23 08:33:27 +0000
37@@ -22,6 +22,7 @@
38 import re
39 import socket
40 import sys
41+import tempfile
42 import time
43
44 from bzrlib import (
45@@ -426,6 +427,49 @@
46 self.assertTrue(-eighteen_hours < offset < eighteen_hours)
47
48
49+class TestFdatasync(tests.TestCaseInTempDir):
50+
51+ def do_fdatasync(self):
52+ f = tempfile.NamedTemporaryFile()
53+ osutils.fdatasync(f.fileno())
54+ f.close()
55+
56+ @staticmethod
57+ def raise_eopnotsupp(*args, **kwargs):
58+ raise IOError(errno.EOPNOTSUPP, os.strerror(errno.EOPNOTSUPP))
59+
60+ @staticmethod
61+ def raise_enotsup(*args, **kwargs):
62+ raise IOError(errno.ENOTSUP, os.strerror(errno.ENOTSUP))
63+
64+ def test_fdatasync_handles_system_function(self):
65+ self.overrideAttr(os, "fdatasync")
66+ self.do_fdatasync()
67+
68+ def test_fdatasync_handles_no_fdatasync_no_fsync(self):
69+ self.overrideAttr(os, "fdatasync")
70+ self.overrideAttr(os, "fsync")
71+ self.do_fdatasync()
72+
73+ def test_fdatasync_handles_no_EOPNOTSUPP(self):
74+ self.overrideAttr(errno, "EOPNOTSUPP")
75+ self.do_fdatasync()
76+
77+ def test_fdatasync_catches_ENOTSUP(self):
78+ enotsup = getattr(errno, "ENOTSUP", None)
79+ if enotsup is None:
80+ raise tests.TestNotApplicable("No ENOTSUP on this platform")
81+ self.overrideAttr(os, "fdatasync", self.raise_enotsup)
82+ self.do_fdatasync()
83+
84+ def test_fdatasync_catches_EOPNOTSUPP(self):
85+ enotsup = getattr(errno, "EOPNOTSUPP", None)
86+ if enotsup is None:
87+ raise tests.TestNotApplicable("No EOPNOTSUPP on this platform")
88+ self.overrideAttr(os, "fdatasync", self.raise_eopnotsupp)
89+ self.do_fdatasync()
90+
91+
92 class TestLinks(tests.TestCaseInTempDir):
93
94 def test_dereference_path(self):
95
96=== modified file 'doc/en/release-notes/bzr-2.4.txt'
97--- doc/en/release-notes/bzr-2.4.txt 2012-06-08 06:59:50 +0000
98+++ doc/en/release-notes/bzr-2.4.txt 2013-05-23 08:33:27 +0000
99@@ -35,6 +35,10 @@
100 * Cope with Unix filesystems, such as smbfs, where chmod gives 'permission
101 denied'. (Martin Pool, #606537)
102
103+* Fix a traceback when trying to checkout a tree that also has an entry
104+ with file-id `TREE_ROOT` somewhere other than at the root directory.
105+ (John Arbash Meinel, #830947)
106+
107 * When the ``limbo`` or ``pending-deletion`` directories exist, typically
108 because of an interrupted tree update, but are empty, bzr no longer
109 errors out, because there is nothing for the user to clean up. Also,
110@@ -55,6 +59,10 @@
111 * Prevent a traceback being printed to stderr when logging has problems and
112 accept utf-8 byte string without breaking. (Martin Packman, #714449)
113
114+* Some filesystems give ``EOPNOTSUPP`` when trying to call ``fdatasync``.
115+ This shouldn't be treated as a fatal error.
116+ (John Arbash Meinel, #1075108)
117+
118 * Use ``encoding_type='exact'`` for ``bzr testament`` so that on Windows
119 the sha hash of the long testament matches the sha hash in the short
120 form. (John Arbash Meinel, #1010339)

Subscribers

People subscribed via source and target branches