swift $file was corrupted during upload even with retries

Bug #1734144 reported by Remy van Elst
This bug affects 3 people
Affects: Duplicity
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: 0.7.18

Bug Description

Command:

duplicity --verbosity 4 --log-file /var/log/duplicity.log --volsize 25 --tempdir="/tmp" --file-prefix="app1.server8." --name="app1.server8." --exclude-device-files --allow-source-mismatch --num-retries 100 --exclude-filelist=/etc/cloudvps-boss/exclude.conf --full-if-older-than="7D" --no-encryption / swift://cloudvps-boss-backup

The backup halts with this error:

ERROR 44 'app1.server8.duplicity-full.20171122T021711Z.vol741.difftar.gz' 26197773 -1
. File app1.server8.duplicity-full.20171122T021711Z.vol741.difftar.gz was corrupted during upload.

I would expect it to retry (due to --num-retries 100), but that does not happen.

Duplicity version: 0.7.15 (from source)
Ubuntu 16.04 with its default Python (2.7.x)

Revision history for this message
Joris van Eijden (joris-vaneijden) wrote :

There was a discussion in 2011 about this error being fatal and whether or not to retry: https://lists.nongnu.org/archive/html/duplicity-talk/2011-11/msg00037.html

TL;DR: duplicity tells the backend to transfer the file, the backend says it did so, and duplicity then compares the file size to make sure, bailing out if it doesn't match.
A retry happens only when the backend itself reports that the transfer failed.

In your case the file size after a supposedly successful transfer is -1 bytes where 26197773 bytes were expected. For me the difference is usually 3-5 bytes.

As a side note I am using CloudVPS storage too and have noticed an increase of these corrupted transfers in the past two months or so.

Revision history for this message
edso (ed.so) wrote :

can you supply a proper stacktrace to see where the fatal error originates?

..ede/duply.net

Revision history for this message
Relst.nl (relst) wrote : Re: [Bug 1734144] Re: swift $file was corrupted during upload even with retries

That is the actual error; the others are all warnings. Do you want one with
-v 9? That will have to wait, since I won't be able to access that machine
until next week. What about the earlier mail thread where the file size
difference is discussed?

https://raymii.org


Revision history for this message
Joris van Eijden (joris-vaneijden) wrote :

Code path:
 write_multivol(backup_type, tarblock_iter, man_outfp, sig_outfp, backend)
creates a local file tdp and then calls
 put(tdp, dest_filename, vol_num)
which first gets the local file's size
 putsize = tdp.getsize()
then tells the backend to store it
 backend.put(tdp, dest_filename)
then calls this function
 validate_block(putsize, dest_filename)
which gets the remote filesize
 info = backend.query_info([dest_filename])[dest_filename]
 size = info['size']
and compares it with the local one
 if size != orig_size:
and throws a fatal error if they don't match
 code_extra = "%s %d %d" % (util.escape(dest_filename), orig_size, size)
 log.FatalError(_("File %s was corrupted during upload.") % util.ufn(dest_filename),
                log.ErrorCode.volume_wrong_size, code_extra)

It does not retry transferring the file, since the retry mechanism lives inside `backend.put()`.
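
For reference, here is a condensed sketch of that flow (simplified and not verbatim duplicity source; it is just the walkthrough above assembled into one place). The size check in validate_block() has no retry of its own:

 def put(tdp, dest_filename, vol_num):
     putsize = tdp.getsize()                  # local size before the upload
     backend.put(tdp, dest_filename)          # retries only happen inside here
     validate_block(putsize, dest_filename)   # size check, fatal on mismatch

 def validate_block(orig_size, dest_filename):
     info = backend.query_info([dest_filename])[dest_filename]
     size = info['size']                      # remote size, or -1 if the query failed
     if size != orig_size:
         code_extra = "%s %d %d" % (util.escape(dest_filename), orig_size, size)
         log.FatalError(_("File %s was corrupted during upload.") % util.ufn(dest_filename),
                        log.ErrorCode.volume_wrong_size, code_extra)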

The Swift backend uses the 'content-length' header from the response to a HEAD request for the file as the remote size.
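
For illustration, a minimal python-swiftclient lookup of that header might look like the following (the endpoint, credentials and object name are placeholders, and this is not duplicity's actual backend code):

 from swiftclient.client import ClientException, Connection

 conn = Connection(authurl='https://auth.example.com/auth/v1.0',
                   user='account:user', key='secret')
 try:
     # head_object() returns the response headers of a HEAD request;
     # the remote size is whatever 'content-length' claims at that moment
     headers = conn.head_object('cloudvps-boss-backup', 'some-volume.difftar.gz')
     size = int(headers['content-length'])
 except ClientException:
     size = -1  # duplicity reports -1 when this lookup fails (e.g. a 404)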

So there are two possible scenarios: either the object store actually failed to write the file completely while still returning an HTTP status code >= 200 and < 300, or the object store returns an inaccurate 'content-length' header.

Both cases point to something being wrong with the object store. Perhaps adding retries to work around that is not the most prudent way forward.

Revision history for this message
Joris van Eijden (joris-vaneijden) wrote :

It looks like there is a small window of time during which the size reported in the 'content-length' header of the HEAD response is incorrect.
Duplicity says in the log that it expects 262197953 bytes, but the response reports 262197945 bytes.
Manually verifying the file later confirms it is 262197953 bytes.

Note that there also appears to be a Varnish server in front of the object store, potentially influencing the header.

# Latest error reported by duplicity:
ERROR 44 'redacted.server.duplicity-full.20180103T071647Z.vol590.difftar.gpg' 262197953 262197945
. File redacted.server.duplicity-full.20180103T071647Z.vol590.difftar.gpg was corrupted during upload.

# Manually getting stats:
$ swift stat cloudvps-boss-backup redacted.server.duplicity-full.20180103T071647Z.vol590.difftar.gpg
       Account: *redacted*.objectstore.eu
     Container: cloudvps-boss-backup
        Object: redacted.server.duplicity-full.20180103T071647Z.vol590.difftar.gpg
  Content Type: application/octet-stream
Content Length: 262197953
 Last Modified: Thu, 04 Jan 2018 05:28:42 GMT
          ETag: ccde4ec46420555b92231f79c3bf04da
     X-Varnish: 1371704766
           Via: 1.1 varnish
 Accept-Ranges: bytes, bytes
   X-Timestamp: 1515043721.11970
    X-Trans-Id: txcc1d645658b4b1609cbd4-dfea3105a4
           Age: 0

# Manually downloading the file:
$ swift download cloudvps-boss-backup redacted.server.duplicity-full.20180103T071647Z.vol590.difftar.gpg
redacted.server.duplicity-full.20180103T071647Z.vol590.difftar.gpg [auth 2.054s, headers 4.720s, total 16.679s, 17.928 MB/s]

# Checking the size:
$ wc -c < redacted.server.duplicity-full.20180103T071647Z.vol590.difftar.gpg
262197953

Revision history for this message
Remy van Elst (raymii) wrote :

What Joris states may well be correct. The version of Swift that is running is quite vintage, with 'customizations'. Varnish is, however, not used for authenticated requests (boss-backup, and thus duplicity); those go straight to the Swift proxy servers.

I've got no control over Swift or the backend, since that's handled by a different team. I do, however, see around 5% of our 8000+ daily duplicity backups report this issue.

It would make duplicity more resilient, since either a retry of the size check or a retry of the upload (as specified with --num-retries) would solve the issue. This backend might be wonky sometimes, and yes, that should be fixed, but so might other backends. (https://aws.amazon.com/s3/faqs/ Amazon S3 Standard and Standard - IA are designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities.)

Revision history for this message
Joris van Eijden (joris-vaneijden) wrote :

I finally got around to making a patch for this.

Revision history for this message
Joris van Eijden (joris-vaneijden) wrote :

Seems to work. This was in the log after the monthly full backup:

NOTICE 1
. Remote filesize 262191780 for 'redacted.server.duplicity-inc.20180310T023416Z.to.20180323T082227Z.vol412.difftar.gpg' does not match local size 262191778, retrying.

NOTICE 1
. Remote filesize 262193725 for 'redacted.server.duplicity-inc.20180310T023416Z.to.20180323T082227Z.vol413.difftar.gpg' does not match local size 262193722, retrying.

NOTICE 1
. Remote filesize 262185742 for 'redacted.server.duplicity-inc.20180310T023416Z.to.20180323T082227Z.vol415.difftar.gpg' does not match local size 262185743, retrying.

NOTICE 1
. Remote filesize 262189700 for 'redacted.server.duplicity-inc.20180310T023416Z.to.20180323T082227Z.vol435.difftar.gpg' does not match local size 262189699, retrying.

NOTICE 1
. Remote filesize 262199689 for 'redacted.server.duplicity-inc.20180310T023416Z.to.20180323T082227Z.vol443.difftar.gpg' does not match local size 262191490, retrying.

NOTICE 1
. Remote filesize 262201574 for 'redacted.server.duplicity-inc.20180310T023416Z.to.20180323T082227Z.vol444.difftar.gpg' does not match local size 262201575, retrying.

Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote :

Great! Thanks for the verify. Will get it into trunk soon.

Changed in duplicity:
assignee: nobody → Kenneth Loafman (kenneth-loafman)
importance: Undecided → Medium
milestone: none → 0.7.18
status: New → In Progress
Revision history for this message
Doug (caniwi) wrote :

I am just chiming in here, as I have been experiencing these rc = 44 errors with Swift object storage for a while. I applied the patch and still got -1 as the returned length.

Synchronizing remote metadata to local cache...
Deleting local /var/duplicity/.cache/duplicity/<redacted>/duplicity-full-signatures.20180318T163509Z.sigtar (not authoritative at backend).
Last full backup date: Mon Mar 19 05:35:09 2018
Last full backup is too old, forcing full backup
Remote filesize -1 for 'duplicity-full.20180401T173037Z.vol14.difftar' does not match local size 1073735168, retrying.
Remote filesize -1 for 'duplicity-full.20180401T173037Z.vol14.difftar' does not match local size 1073735168, retrying.
Remote filesize -1 for 'duplicity-full.20180401T173037Z.vol14.difftar' does not match local size 1073735168, retrying.
File duplicity-full.20180401T173037Z.vol14.difftar was corrupted during upload.
Backup failed with rc = 44 at 2018-04-02T05:59:56+1200. About to retry automatically

I can see some recoveries from the Remote filesize of -1 in the logs.

I then hard-coded a retry loop of 60 attempts, but eventually it suffered the same fate.

This is on Ubuntu 14.04 with python-swiftclient 3.3.0.

Revision history for this message
Joris van Eijden (joris-vaneijden) wrote :

A size of -1 indicates that the backend server responded with a 404 "file not found" to a HEAD request to get the uploaded file's size.

You might be able to "fix" that by patching the _put method in duplicity/backends/swiftbackend.py (see the sketch below) to:
1. Check the local file size before uploading.
2. Put the 'self.conn.put_object' call in a try/except block.
3. Check the remote file size after uploading by calling the _query method inside another try/except block.
4. Retry the upload if either an exception occurs or the file size doesn't match.
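
A hedged sketch of what such a patched _put() could look like (the retries parameter and helper names are illustrative; self.conn, self.container and _query() are assumed from the 0.7.x Swift backend and may differ, and one try block here guards both the upload and the size query, so treat this as a sketch rather than the committed patch):

 import os

 from duplicity import log
 from duplicity.errors import BackendException

 def _put(self, source_path, remote_filename, retries=5):
     local_size = os.path.getsize(source_path.name)              # 1. local size first
     for attempt in range(1, retries + 1):
         try:
             with open(source_path.name, 'rb') as fh:            # 2. guarded upload
                 self.conn.put_object(self.container, remote_filename, fh)
             remote_size = self._query(remote_filename)['size']  # 3. guarded size query
             if remote_size == local_size:
                 return
             log.Notice("Remote size %s != local size %s, retrying (%d/%d)"
                        % (remote_size, local_size, attempt, retries))
         except Exception as e:                                  # 4. retry on any failure
             log.Notice("Upload attempt %d failed: %s" % (attempt, e))
     raise BackendException("Giving up on %s after %d attempts"
                            % (remote_filename, retries))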

Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote :

I have committed the patch supplied as a partial fix. I really need a
Swift user to code and test the suggested -1 fix.


Revision history for this message
Remy van Elst (raymii) wrote :

@Joris would it be possible to make the retry amount adhere to the --num-retries CLI option instead of specifying a fixed amount?

@Kenneth I'll launch a few tests on our Swift setup where the bug occurs (it is not reproducible on demand, but with 2500+ duplicity backups daily it will show up soon enough) and report the findings.

Revision history for this message
Doug (caniwi) wrote :

@kenneth-loafman - The patch file attached to this ticket has been applied to a couple of servers running duplicity, all on Ubuntu 14.04 with python-swiftclient 3.3.0 and duplicity 0.7.17.

The log snippet I posted in note#10 is an extract from a server running the patch, and my comments relate to that same server with the patch installed.

The Swift object store I am communicating with is running the Mitaka release of Swift.

Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote :

More fixes:
  - Number of retries is now same as globals.num_retries.
  - Added standoff delay of 0.5 sec per attempt.
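
A small illustration of the described behaviour (upload_ok is a hypothetical stand-in for the put-and-verify step, and the globals import reflects the 0.7.x module layout; this is not the committed diff):

 import time

 from duplicity import globals  # globals.num_retries carries --num-retries in 0.7.x

 def put_with_standoff(upload_ok):
     for attempt in range(1, globals.num_retries + 1):
         if upload_ok():
             return True
         time.sleep(0.5 * attempt)  # standoff delay grows by 0.5 s per attempt
     return False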

Changed in duplicity:
status: In Progress → Fix Released
assignee: Kenneth Loafman (kenneth-loafman) → nobody
Changed in duplicity:
status: Fix Released → Fix Committed
Changed in duplicity:
status: Fix Committed → Fix Released