manifest.from_string is too slow for large manifest files (65MB+). With patch.
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Duplicity |
Fix Released
|
Medium
|
Unassigned |
Bug Description
duplicity 0.7.06 (December 07, 2015)
/usr/bin/python2 2.7.11+ (default, Apr 17 2016, 14:00:29)
Ubuntu 16.04 x86_64
Filesystem ext4
I have about 350GB data that is backed up to an external USB disk.
The initial backup took about 4.5 hours which is reasonable enough for me.
However when trying to run an incremental backup afterward I found that duplicity was 100% busy on 1 CPU core while no messages were shown. This was also true when changing the verbosity to 9.
I started adding log.Debug() statements to the code and in the end found the source of the problem to be in manifest.
To demonstrate the problem I created some test data using the following script:
#!/bin/bash
# Long paths create large manifests...
d="data/
# Make 100 subdirectories
mkdir -p "$d/"{0,
# Fill directories with files
for f in "$d/"??/; do dd bs="${BS:-4096}" count=$
And I run duplicity:
annotate-output duplicity --volsize=1 --no-encryption --no-compression --verbosity=5 --include="./data" --exclude="**" --name duptest1 . file://duptest1/ >> log.$(date +%Y%m%d.%H%M%S).txt
The initial full backup takes 10 minutes on my system and it creates a 65MB manifest and 1715 volumes.
$ ls -lrt duptest1/ | tail -4
-rw------- 1 xyz xyz 985600 Jun 21 12:34 duplicity-
-rw------- 1 xyz xyz 614574080 Jun 21 12:34 duplicity-
-rw------- 1 xyz xyz 292864 Jun 21 12:34 duplicity-
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-
Running the same duplicity command again does an incremental backup and it starts reading the manifest of the previous full backup. Without my change the total running time is more than 45 minutes and most of that time is spent in reading the 'same' manifest 3 times. With the change I made to manifest.
Running with annotate-output does add timestamps to the generated output and allows me to see where time is spend.
Incremental run with old code (40m26s):
-------
12:53:23 O: duplicity 0.7.06 (December 07, 2015)
12:53:23 O: Args: /usr/bin/duplicity --volsize=1 --no-encryption --no-compression --verbosity=5 --include=./data --exclude=** --name duptest1 . file://duptest1/
...
12:53:23 O: Local and Remote metadata are synchronized, no sync needed.
12:53:23 O: Processing local manifest /home/xyz/
13:06:54 O: Found 1715 volumes in manifest
13:06:54 O: Last full backup date: Tue Jun 21 12:25:01 2016
13:06:54 O: Deleting /tmp/duplicity-
13:06:54 O: Processing remote manifest duplicity-
13:18:15 O: Found 1715 volumes in manifest
13:18:16 O: Processing local manifest /home/xyz/
13:31:43 O: Found 1715 volumes in manifest
13:31:43 O: Using temporary directory /home/xyz/
13:31:43 O: Using temporary directory /home/xyz/
13:31:43 O: AsyncScheduler: instantiating at concurrency 0
13:31:43 O: A .
13:33:48 O: AsyncScheduler: running task synchronously (asynchronicity disabled)
13:33:48 O: Writing duplicity-
13:33:48 O: Deleting /tmp/duplicity-
13:33:48 O: AsyncScheduler: task completed successfully
13:33:48 O: Processed volume 1
13:33:48 O: --------------[ Backup Statistics ]--------------
13:33:48 O: StartTime 1466508703.79 (Tue Jun 21 13:31:43 2016)
13:33:48 O: EndTime 1466508828.32 (Tue Jun 21 13:33:48 2016)
13:33:48 O: ElapsedTime 124.53 (2 minutes 4.53 seconds)
13:33:48 O: SourceFiles 300111
13:33:48 O: SourceFileSize 1235808256 (1.15 GB)
13:33:48 O: NewFiles 1
13:33:48 O: NewFileSize 4096 (4.00 KB)
13:33:48 O: DeletedFiles 0
13:33:48 O: ChangedFiles 0
13:33:48 O: ChangedFileSize 0 (0 bytes)
13:33:48 O: ChangedDeltaSize 0 (0 bytes)
13:33:48 O: DeltaEntries 1
13:33:48 O: RawDeltaSize 0 (0 bytes)
13:33:48 O: TotalDestinatio
13:33:48 O: Errors 0
13:33:48 O: -------
13:33:48 O:
13:33:48 I: Finished with exitcode 0
$ ls -lrt duptest1/ | tail -4
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-
-rw------- 1 xyz xyz 10240 Jun 21 13:33 duplicity-
-rw------- 1 xyz xyz 512 Jun 21 13:33 duplicity-
-rw------- 1 xyz xyz 148 Jun 21 13:33 duplicity-
$ ls -lrt ~/.cache/
-rw------- 1 xyz xyz 35685146 Jun 21 12:34 duplicity-
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-
-rw------- 1 xyz xyz 119 Jun 21 13:33 duplicity-
-rw------- 1 xyz xyz 148 Jun 21 13:33 duplicity-
Incremental run with new code (2m24s):
-------
$ rm duptest1/
$ rm ~/.cache/
13:56:49 O: duplicity 0.7.06 (December 07, 2015)
13:56:49 O: Args: /usr/bin/duplicity --volsize=1 --no-encryption --no-compression --verbosity=5 --include=./data --exclude=** --name duptest1 . file://duptest1/
...
13:56:49 O: Local and Remote metadata are synchronized, no sync needed.
13:56:49 O: Processing local manifest /home/xyz/
13:56:56 O: Found 1715 volumes in manifest
13:56:56 O: Last full backup date: Tue Jun 21 12:25:01 2016
13:56:57 O: Deleting /tmp/duplicity-
13:56:57 O: Processing remote manifest duplicity-
13:57:03 O: Found 1715 volumes in manifest
13:57:03 O: Processing local manifest /home/xyz/
13:57:10 O: Found 1715 volumes in manifest
13:57:10 O: Using temporary directory /home/xyz/
13:57:10 O: Using temporary directory /home/xyz/
13:57:10 O: AsyncScheduler: instantiating at concurrency 0
13:57:10 O: A .
13:59:13 O: AsyncScheduler: running task synchronously (asynchronicity disabled)
13:59:13 O: Writing duplicity-
13:59:13 O: Deleting /tmp/duplicity-
13:59:13 O: AsyncScheduler: task completed successfully
13:59:13 O: Processed volume 1
13:59:13 O: --------------[ Backup Statistics ]--------------
13:59:13 O: StartTime 1466510230.45 (Tue Jun 21 13:57:10 2016)
13:59:13 O: EndTime 1466510353.28 (Tue Jun 21 13:59:13 2016)
13:59:13 O: ElapsedTime 122.83 (2 minutes 2.83 seconds)
13:59:13 O: SourceFiles 300111
13:59:13 O: SourceFileSize 1235808256 (1.15 GB)
13:59:13 O: NewFiles 1
13:59:13 O: NewFileSize 4096 (4.00 KB)
13:59:13 O: DeletedFiles 0
13:59:13 O: ChangedFiles 0
13:59:13 O: ChangedFileSize 0 (0 bytes)
13:59:13 O: ChangedDeltaSize 0 (0 bytes)
13:59:13 O: DeltaEntries 1
13:59:13 O: RawDeltaSize 0 (0 bytes)
13:59:13 O: TotalDestinatio
13:59:13 O: Errors 0
13:59:13 O: -------
13:59:13 O:
13:59:13 I: Finished with exitcode 0
$ ls -lrt duptest1/ | tail -4
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-
-rw------- 1 xyz xyz 10240 Jun 21 13:59 duplicity-
-rw------- 1 xyz xyz 512 Jun 21 13:59 duplicity-
-rw------- 1 xyz xyz 148 Jun 21 13:59 duplicity-
$ ls -lrt ~/.cache/
-rw------- 1 xyz xyz 35685146 Jun 21 12:34 duplicity-
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-
-rw------- 1 xyz xyz 121 Jun 21 13:59 duplicity-
-rw------- 1 xyz xyz 148 Jun 21 13:59 duplicity-
Changed in duplicity: | |
assignee: | nobody → Kenneth Loafman (kenneth-loafman) |
importance: | Undecided → Medium |
milestone: | none → 0.7.08 |
status: | New → In Progress |
Changed in duplicity: | |
assignee: | Kenneth Loafman (kenneth-loafman) → nobody |
status: | In Progress → Fix Committed |
Changed in duplicity: | |
status: | Fix Committed → Fix Released |
Could you supply a patch for the changes you made?
On Tue, Jun 21, 2016 at 9:05 AM, Kenneth Loafman <email address hidden>
wrote:
> ** Changed in: duplicity /bugs.launchpad .net/bugs/ 1594780 from_string is too slow for large manifest files (65MB+). from_string( self, aaaaaaaaaaaaaaa aaaaa/bbbbbbbbb bbbbbbbbbbb/ ccccccccccccccc ccccc/ddddddddd ddddddddddd/ eeeeeeeeeeeeeee eeeee/fffffffff fffffffffff/ ggggggggggggggg ggggg/hhhhhhhhh hhhhhhhhhhh/ iiiiiiiiiiiiiii iiiii" 1,2,3,4, 5,6,7,8, 9}{0,1, 2,3,4,5, 6,7,8,9} {COUNT: -3000} full.20160621T1 02501Z. vol1714. difftar full-signatures .20160621T10250 1Z.sigtar full.20160621T1 02501Z. vol1715. difftar full.20160621T1 02501Z. manifest
> Importance: Undecided => Medium
>
> ** Changed in: duplicity
> Status: New => In Progress
>
> ** Changed in: duplicity
> Milestone: None => 0.7.08
>
> ** Changed in: duplicity
> Assignee: (unassigned) => Kenneth Loafman (kenneth-loafman)
>
> --
> You received this bug notification because you are a bug assignee.
> https:/
>
> Title:
> manifest.
> With patch.
>
> Status in Duplicity:
> In Progress
>
> Bug description:
> duplicity 0.7.06 (December 07, 2015)
> /usr/bin/python2 2.7.11+ (default, Apr 17 2016, 14:00:29)
> Ubuntu 16.04 x86_64
> Filesystem ext4
>
> I have about 350GB data that is backed up to an external USB disk.
> The initial backup took about 4.5 hours which is reasonable enough for
> me.
>
> However when trying to run an incremental backup afterward I found
> that duplicity was 100% busy on 1 CPU core while no messages were
> shown. This was also true when changing the verbosity to 9.
>
> I started adding log.Debug() statements to the code and in the end
> found the source of the problem to be in manifest.
> s). The regexp loop is not very efficient and if the manifest file is
> large enough then it seems to hit some condition that causes python to
> slow down and use a lot of CPU (garbage collection???). I changed the
> loop to use re.finditer(s) and now it can process a large manifest
> file in 7 seconds instead of 13 minutes. And the manifest is read 3
> times (local, remote, local) meaning I'm hit with the delay 3 times
> over.
>
> To demonstrate the problem I created some test data using the
> following script:
>
> #!/bin/bash
> # Long paths create large manifests...
>
> d="data/
> # Make 100 subdirectories
> mkdir -p "$d/"{0,
> # Fill directories with files
> for f in "$d/"??/; do dd bs="${BS:-4096}" count=$
> </dev/urandom 2>/dev/null | split -b "${BS:-4096}" -a 5 -d - "$f/F_"; done
>
> And I run duplicity:
> annotate-output duplicity --volsize=1 --no-encryption --no-compression
> --verbosity=5 --include="./data" --exclude="**" --name duptest1 .
> file://duptest1/ >> log.$(date +%Y%m%d.%H%M%S).txt
>
> The initial full backup takes 10 minutes on my system and it creates a
> 65MB manifest and 1715 volumes.
> $ ls -lrt duptest1/ | tail -4
> -rw------- 1 xyz xyz 985600 Jun 21 12:34
> duplicity-
> -rw------- 1 xyz xyz 614574080 Jun 21 12:34
> duplicity-
> -rw------- 1 xyz xyz 292864 Jun 21 12:34
> duplicity-
> -rw------- 1 xyz xyz 66290532 Jun 21 12:34
> duplicity-
>
> Running the same dupl...