manifest.from_string is too slow for large manifest files (65MB+). With patch.

Bug #1594780 reported by B. Reitsma
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Duplicity
Fix Released
Medium
Unassigned

Bug Description

duplicity 0.7.06 (December 07, 2015)
/usr/bin/python2 2.7.11+ (default, Apr 17 2016, 14:00:29)
Ubuntu 16.04 x86_64
Filesystem ext4

I have about 350GB data that is backed up to an external USB disk.
The initial backup took about 4.5 hours which is reasonable enough for me.

However when trying to run an incremental backup afterward I found that duplicity was 100% busy on 1 CPU core while no messages were shown. This was also true when changing the verbosity to 9.

I started adding log.Debug() statements to the code and in the end found the source of the problem to be in manifest.from_string(self, s). The regexp loop is not very efficient and if the manifest file is large enough then it seems to hit some condition that causes python to slow down and use a lot of CPU (garbage collection???). I changed the loop to use re.finditer(s) and now it can process a large manifest file in 7 seconds instead of 13 minutes. And the manifest is read 3 times (local, remote, local) meaning I'm hit with the delay 3 times over.

To demonstrate the problem I created some test data using the following script:

#!/bin/bash
# Long paths create large manifests...
d="data/aaaaaaaaaaaaaaaaaaaa/bbbbbbbbbbbbbbbbbbbb/cccccccccccccccccccc/dddddddddddddddddddd/eeeeeeeeeeeeeeeeeeee/ffffffffffffffffffff/gggggggggggggggggggg/hhhhhhhhhhhhhhhhhhhh/iiiiiiiiiiiiiiiiiiii"
# Make 100 subdirectories
mkdir -p "$d/"{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}
# Fill directories with files
for f in "$d/"??/; do dd bs="${BS:-4096}" count=${COUNT:-3000} </dev/urandom 2>/dev/null | split -b "${BS:-4096}" -a 5 -d - "$f/F_"; done

And I run duplicity:
annotate-output duplicity --volsize=1 --no-encryption --no-compression --verbosity=5 --include="./data" --exclude="**" --name duptest1 . file://duptest1/ >> log.$(date +%Y%m%d.%H%M%S).txt

The initial full backup takes 10 minutes on my system and it creates a 65MB manifest and 1715 volumes.
$ ls -lrt duptest1/ | tail -4
-rw------- 1 xyz xyz 985600 Jun 21 12:34 duplicity-full.20160621T102501Z.vol1714.difftar
-rw------- 1 xyz xyz 614574080 Jun 21 12:34 duplicity-full-signatures.20160621T102501Z.sigtar
-rw------- 1 xyz xyz 292864 Jun 21 12:34 duplicity-full.20160621T102501Z.vol1715.difftar
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-full.20160621T102501Z.manifest

Running the same duplicity command again does an incremental backup and it starts reading the manifest of the previous full backup. Without my change the total running time is more than 45 minutes and most of that time is spent in reading the 'same' manifest 3 times. With the change I made to manifest.from_string the command finished in 2.5 minutes. I did add a few log.Info() statements to show what manifest is being processed and how many volumes are found in the manifest.
Running with annotate-output does add timestamps to the generated output and allows me to see where time is spend.

Incremental run with old code (40m26s):
---------------------------------------
12:53:23 O: duplicity 0.7.06 (December 07, 2015)
12:53:23 O: Args: /usr/bin/duplicity --volsize=1 --no-encryption --no-compression --verbosity=5 --include=./data --exclude=** --name duptest1 . file://duptest1/
...
12:53:23 O: Local and Remote metadata are synchronized, no sync needed.
12:53:23 O: Processing local manifest /home/xyz/.cache/duplicity/duptest1/duplicity-full.20160621T102501Z.manifest (66290532)
13:06:54 O: Found 1715 volumes in manifest
13:06:54 O: Last full backup date: Tue Jun 21 12:25:01 2016
13:06:54 O: Deleting /tmp/duplicity-m6Owfj-tempdir/mktemp-VPfxtq-2
13:06:54 O: Processing remote manifest duplicity-full.20160621T102501Z.manifest (66290532)
13:18:15 O: Found 1715 volumes in manifest
13:18:16 O: Processing local manifest /home/xyz/.cache/duplicity/duptest1/duplicity-full.20160621T102501Z.manifest (66290532)
13:31:43 O: Found 1715 volumes in manifest
13:31:43 O: Using temporary directory /home/xyz/.cache/duplicity/duptest1/duplicity-BqI_4M-tempdir
13:31:43 O: Using temporary directory /home/xyz/.cache/duplicity/duptest1/duplicity-7aQnRu-tempdir
13:31:43 O: AsyncScheduler: instantiating at concurrency 0
13:31:43 O: A .
13:33:48 O: AsyncScheduler: running task synchronously (asynchronicity disabled)
13:33:48 O: Writing duplicity-inc.20160621T102501Z.to.20160621T105323Z.vol1.difftar
13:33:48 O: Deleting /tmp/duplicity-m6Owfj-tempdir/mktemp-juKSZq-3
13:33:48 O: AsyncScheduler: task completed successfully
13:33:48 O: Processed volume 1
13:33:48 O: --------------[ Backup Statistics ]--------------
13:33:48 O: StartTime 1466508703.79 (Tue Jun 21 13:31:43 2016)
13:33:48 O: EndTime 1466508828.32 (Tue Jun 21 13:33:48 2016)
13:33:48 O: ElapsedTime 124.53 (2 minutes 4.53 seconds)
13:33:48 O: SourceFiles 300111
13:33:48 O: SourceFileSize 1235808256 (1.15 GB)
13:33:48 O: NewFiles 1
13:33:48 O: NewFileSize 4096 (4.00 KB)
13:33:48 O: DeletedFiles 0
13:33:48 O: ChangedFiles 0
13:33:48 O: ChangedFileSize 0 (0 bytes)
13:33:48 O: ChangedDeltaSize 0 (0 bytes)
13:33:48 O: DeltaEntries 1
13:33:48 O: RawDeltaSize 0 (0 bytes)
13:33:48 O: TotalDestinationSizeChange 512 (512 bytes)
13:33:48 O: Errors 0
13:33:48 O: -------------------------------------------------
13:33:48 O:
13:33:48 I: Finished with exitcode 0

$ ls -lrt duptest1/ | tail -4
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-full.20160621T102501Z.manifest
-rw------- 1 xyz xyz 10240 Jun 21 13:33 duplicity-new-signatures.20160621T102501Z.to.20160621T105323Z.sigtar
-rw------- 1 xyz xyz 512 Jun 21 13:33 duplicity-inc.20160621T102501Z.to.20160621T105323Z.vol1.difftar
-rw------- 1 xyz xyz 148 Jun 21 13:33 duplicity-inc.20160621T102501Z.to.20160621T105323Z.manifest

$ ls -lrt ~/.cache/duplicity/duptest1/ | tail -4
-rw------- 1 xyz xyz 35685146 Jun 21 12:34 duplicity-full-signatures.20160621T102501Z.sigtar.gz
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-full.20160621T102501Z.manifest
-rw------- 1 xyz xyz 119 Jun 21 13:33 duplicity-new-signatures.20160621T102501Z.to.20160621T105323Z.sigtar.gz
-rw------- 1 xyz xyz 148 Jun 21 13:33 duplicity-inc.20160621T102501Z.to.20160621T105323Z.manifest

Incremental run with new code (2m24s):
--------------------------------------
$ rm duptest1/duplicity-inc* duptest1/duplicity-new*
$ rm ~/.cache/duplicity/duptest1/duplicity-inc* ~/.cache/duplicity/duptest1/duplicity-new*
13:56:49 O: duplicity 0.7.06 (December 07, 2015)
13:56:49 O: Args: /usr/bin/duplicity --volsize=1 --no-encryption --no-compression --verbosity=5 --include=./data --exclude=** --name duptest1 . file://duptest1/
...
13:56:49 O: Local and Remote metadata are synchronized, no sync needed.
13:56:49 O: Processing local manifest /home/xyz/.cache/duplicity/duptest1/duplicity-full.20160621T102501Z.manifest (66290532)
13:56:56 O: Found 1715 volumes in manifest
13:56:56 O: Last full backup date: Tue Jun 21 12:25:01 2016
13:56:57 O: Deleting /tmp/duplicity-2QXOjZ-tempdir/mktemp-0fbFiW-2
13:56:57 O: Processing remote manifest duplicity-full.20160621T102501Z.manifest (66290532)
13:57:03 O: Found 1715 volumes in manifest
13:57:03 O: Processing local manifest /home/xyz/.cache/duplicity/duptest1/duplicity-full.20160621T102501Z.manifest (66290532)
13:57:10 O: Found 1715 volumes in manifest
13:57:10 O: Using temporary directory /home/xyz/.cache/duplicity/duptest1/duplicity-hEMuBl-tempdir
13:57:10 O: Using temporary directory /home/xyz/.cache/duplicity/duptest1/duplicity-4Laryd-tempdir
13:57:10 O: AsyncScheduler: instantiating at concurrency 0
13:57:10 O: A .
13:59:13 O: AsyncScheduler: running task synchronously (asynchronicity disabled)
13:59:13 O: Writing duplicity-inc.20160621T102501Z.to.20160621T115649Z.vol1.difftar
13:59:13 O: Deleting /tmp/duplicity-2QXOjZ-tempdir/mktemp-ZYfGWJ-3
13:59:13 O: AsyncScheduler: task completed successfully
13:59:13 O: Processed volume 1
13:59:13 O: --------------[ Backup Statistics ]--------------
13:59:13 O: StartTime 1466510230.45 (Tue Jun 21 13:57:10 2016)
13:59:13 O: EndTime 1466510353.28 (Tue Jun 21 13:59:13 2016)
13:59:13 O: ElapsedTime 122.83 (2 minutes 2.83 seconds)
13:59:13 O: SourceFiles 300111
13:59:13 O: SourceFileSize 1235808256 (1.15 GB)
13:59:13 O: NewFiles 1
13:59:13 O: NewFileSize 4096 (4.00 KB)
13:59:13 O: DeletedFiles 0
13:59:13 O: ChangedFiles 0
13:59:13 O: ChangedFileSize 0 (0 bytes)
13:59:13 O: ChangedDeltaSize 0 (0 bytes)
13:59:13 O: DeltaEntries 1
13:59:13 O: RawDeltaSize 0 (0 bytes)
13:59:13 O: TotalDestinationSizeChange 512 (512 bytes)
13:59:13 O: Errors 0
13:59:13 O: -------------------------------------------------
13:59:13 O:
13:59:13 I: Finished with exitcode 0

$ ls -lrt duptest1/ | tail -4
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-full.20160621T102501Z.manifest
-rw------- 1 xyz xyz 10240 Jun 21 13:59 duplicity-new-signatures.20160621T102501Z.to.20160621T115649Z.sigtar
-rw------- 1 xyz xyz 512 Jun 21 13:59 duplicity-inc.20160621T102501Z.to.20160621T115649Z.vol1.difftar
-rw------- 1 xyz xyz 148 Jun 21 13:59 duplicity-inc.20160621T102501Z.to.20160621T115649Z.manifest

$ ls -lrt ~/.cache/duplicity/duptest1/ | tail -4
-rw------- 1 xyz xyz 35685146 Jun 21 12:34 duplicity-full-signatures.20160621T102501Z.sigtar.gz
-rw------- 1 xyz xyz 66290532 Jun 21 12:34 duplicity-full.20160621T102501Z.manifest
-rw------- 1 xyz xyz 121 Jun 21 13:59 duplicity-new-signatures.20160621T102501Z.to.20160621T115649Z.sigtar.gz
-rw------- 1 xyz xyz 148 Jun 21 13:59 duplicity-inc.20160621T102501Z.to.20160621T115649Z.manifest

Revision history for this message
B. Reitsma (breitsma) wrote :
Changed in duplicity:
assignee: nobody → Kenneth Loafman (kenneth-loafman)
importance: Undecided → Medium
milestone: none → 0.7.08
status: New → In Progress
Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote : Re: [Bug 1594780] Re: manifest.from_string is too slow for large manifest files (65MB+). With patch.
Download full text (10.7 KiB)

Could you supply a patch for the changes you made?

On Tue, Jun 21, 2016 at 9:05 AM, Kenneth Loafman <email address hidden>
wrote:

> ** Changed in: duplicity
> Importance: Undecided => Medium
>
> ** Changed in: duplicity
> Status: New => In Progress
>
> ** Changed in: duplicity
> Milestone: None => 0.7.08
>
> ** Changed in: duplicity
> Assignee: (unassigned) => Kenneth Loafman (kenneth-loafman)
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1594780
>
> Title:
> manifest.from_string is too slow for large manifest files (65MB+).
> With patch.
>
> Status in Duplicity:
> In Progress
>
> Bug description:
> duplicity 0.7.06 (December 07, 2015)
> /usr/bin/python2 2.7.11+ (default, Apr 17 2016, 14:00:29)
> Ubuntu 16.04 x86_64
> Filesystem ext4
>
> I have about 350GB data that is backed up to an external USB disk.
> The initial backup took about 4.5 hours which is reasonable enough for
> me.
>
> However when trying to run an incremental backup afterward I found
> that duplicity was 100% busy on 1 CPU core while no messages were
> shown. This was also true when changing the verbosity to 9.
>
> I started adding log.Debug() statements to the code and in the end
> found the source of the problem to be in manifest.from_string(self,
> s). The regexp loop is not very efficient and if the manifest file is
> large enough then it seems to hit some condition that causes python to
> slow down and use a lot of CPU (garbage collection???). I changed the
> loop to use re.finditer(s) and now it can process a large manifest
> file in 7 seconds instead of 13 minutes. And the manifest is read 3
> times (local, remote, local) meaning I'm hit with the delay 3 times
> over.
>
> To demonstrate the problem I created some test data using the
> following script:
>
> #!/bin/bash
> # Long paths create large manifests...
>
> d="data/aaaaaaaaaaaaaaaaaaaa/bbbbbbbbbbbbbbbbbbbb/cccccccccccccccccccc/dddddddddddddddddddd/eeeeeeeeeeeeeeeeeeee/ffffffffffffffffffff/gggggggggggggggggggg/hhhhhhhhhhhhhhhhhhhh/iiiiiiiiiiiiiiiiiiii"
> # Make 100 subdirectories
> mkdir -p "$d/"{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}
> # Fill directories with files
> for f in "$d/"??/; do dd bs="${BS:-4096}" count=${COUNT:-3000}
> </dev/urandom 2>/dev/null | split -b "${BS:-4096}" -a 5 -d - "$f/F_"; done
>
> And I run duplicity:
> annotate-output duplicity --volsize=1 --no-encryption --no-compression
> --verbosity=5 --include="./data" --exclude="**" --name duptest1 .
> file://duptest1/ >> log.$(date +%Y%m%d.%H%M%S).txt
>
> The initial full backup takes 10 minutes on my system and it creates a
> 65MB manifest and 1715 volumes.
> $ ls -lrt duptest1/ | tail -4
> -rw------- 1 xyz xyz 985600 Jun 21 12:34
> duplicity-full.20160621T102501Z.vol1714.difftar
> -rw------- 1 xyz xyz 614574080 Jun 21 12:34
> duplicity-full-signatures.20160621T102501Z.sigtar
> -rw------- 1 xyz xyz 292864 Jun 21 12:34
> duplicity-full.20160621T102501Z.vol1715.difftar
> -rw------- 1 xyz xyz 66290532 Jun 21 12:34
> duplicity-full.20160621T102501Z.manifest
>
> Running the same dupl...

Revision history for this message
B. Reitsma (breitsma) wrote :
Revision history for this message
Kenneth Loafman (kenneth-loafman) wrote :
Download full text (10.6 KiB)

Sorry, missed it.

On Fri, Jun 24, 2016 at 10:08 AM, B. Reitsma <email address hidden> wrote:

> The patch was already attached when i created the bug:
>
> https://bugs.launchpad.net/duplicity/+bug/1594780/+attachment/4687870/+files/duplicity_manifest.diff
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1594780
>
> Title:
> manifest.from_string is too slow for large manifest files (65MB+).
> With patch.
>
> Status in Duplicity:
> In Progress
>
> Bug description:
> duplicity 0.7.06 (December 07, 2015)
> /usr/bin/python2 2.7.11+ (default, Apr 17 2016, 14:00:29)
> Ubuntu 16.04 x86_64
> Filesystem ext4
>
> I have about 350GB data that is backed up to an external USB disk.
> The initial backup took about 4.5 hours which is reasonable enough for
> me.
>
> However when trying to run an incremental backup afterward I found
> that duplicity was 100% busy on 1 CPU core while no messages were
> shown. This was also true when changing the verbosity to 9.
>
> I started adding log.Debug() statements to the code and in the end
> found the source of the problem to be in manifest.from_string(self,
> s). The regexp loop is not very efficient and if the manifest file is
> large enough then it seems to hit some condition that causes python to
> slow down and use a lot of CPU (garbage collection???). I changed the
> loop to use re.finditer(s) and now it can process a large manifest
> file in 7 seconds instead of 13 minutes. And the manifest is read 3
> times (local, remote, local) meaning I'm hit with the delay 3 times
> over.
>
> To demonstrate the problem I created some test data using the
> following script:
>
> #!/bin/bash
> # Long paths create large manifests...
>
> d="data/aaaaaaaaaaaaaaaaaaaa/bbbbbbbbbbbbbbbbbbbb/cccccccccccccccccccc/dddddddddddddddddddd/eeeeeeeeeeeeeeeeeeee/ffffffffffffffffffff/gggggggggggggggggggg/hhhhhhhhhhhhhhhhhhhh/iiiiiiiiiiiiiiiiiiii"
> # Make 100 subdirectories
> mkdir -p "$d/"{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}
> # Fill directories with files
> for f in "$d/"??/; do dd bs="${BS:-4096}" count=${COUNT:-3000}
> </dev/urandom 2>/dev/null | split -b "${BS:-4096}" -a 5 -d - "$f/F_"; done
>
> And I run duplicity:
> annotate-output duplicity --volsize=1 --no-encryption --no-compression
> --verbosity=5 --include="./data" --exclude="**" --name duptest1 .
> file://duptest1/ >> log.$(date +%Y%m%d.%H%M%S).txt
>
> The initial full backup takes 10 minutes on my system and it creates a
> 65MB manifest and 1715 volumes.
> $ ls -lrt duptest1/ | tail -4
> -rw------- 1 xyz xyz 985600 Jun 21 12:34
> duplicity-full.20160621T102501Z.vol1714.difftar
> -rw------- 1 xyz xyz 614574080 Jun 21 12:34
> duplicity-full-signatures.20160621T102501Z.sigtar
> -rw------- 1 xyz xyz 292864 Jun 21 12:34
> duplicity-full.20160621T102501Z.vol1715.difftar
> -rw------- 1 xyz xyz 66290532 Jun 21 12:34
> duplicity-full.20160621T102501Z.manifest
>
> Running the same duplicity command again does an incremental backup and
> it starts reading the manifest of the previous full backup. Without my
> change the total running time is mo...

Changed in duplicity:
assignee: Kenneth Loafman (kenneth-loafman) → nobody
status: In Progress → Fix Committed
Changed in duplicity:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.