Bazaar

Merge lp:~weyrick/bzr/54624-warn-on-large-files into lp:bzr

54624-warn-on-large-files
Merge into bzr.dev

Proposed by Shannon Weyrick on 2011-08-08

Status: Merged
Approved by: Jonathan Riddell on 2011-08-19
Approved revision: no longer in the source branch.
Merged at revision: 6086
Proposed branch: lp:~weyrick/bzr/54624-warn-on-large-files
Merge into: lp:bzr
Diff against target: 336 lines (+169/-10)

9 files modified

bzrlib/add.py (+44/-0)
bzrlib/builtins.py (+5/-1)
bzrlib/config.py (+40/-0)
bzrlib/help_topics/en/configuration.txt (+8/-0)
bzrlib/mutabletree.py (+13/-6)
bzrlib/osutils.py (+6/-3)
bzrlib/tests/blackbox/test_add.py (+24/-0)
bzrlib/tests/test_config.py (+20/-0)
doc/en/release-notes/bzr-2.5.txt (+9/-0)

To merge this branch:

bzr merge lp:~weyrick/bzr/54624-warn-on-large-files

Medium

Fix Released

Reviewer	Date Requested	Status
martin suchanek (community)		Needs Fixing on 2012-06-02
Jonathan Riddell (community)		Needs Fixing on 2011-08-17
Jelmer Vernooij (community)	2011-08-08	Needs Information on 2011-08-08
Review via email: mp+70691@code.launchpad.net

Commit message

bzr add now skips large files in recursive mode. The default "large"
size is 20MB, and is configurable via the add.maximum_file_size
option. A value of 0 disables skipping. Named items passed to add are
never skipped. (Shannon Weyrick, #54624)

Description of the change

This branch implements the feature requested in bug #54624: skip large files (with a warning) during add, with a configurable threshold for what counts as "large".

Revision history for this message

Robert Collins (lifeless) wrote on 2011-08-08:

1MB might be a little small :)

Shannon Weyrick (weyrick) wrote on 2011-08-08:

I thought about this, and in the end I did 1 MB simply because that's
what the original bug mentioned. I wonder, what's the best way to get
a feel about a reasonable default from the current user base? Mailing
list?

On Sun, Aug 7, 2011 at 8:37 PM, Robert Collins
<email address hidden> wrote:
> 1MB might be a little small :)

Shannon Weyrick (weyrick) wrote on 2011-08-08:

A nice way to figure out the default would be to get a histogram of file sizes in the bzr repos hosted on launchpad. Then we could pick something towards the top end of the decline in the curve. I don't suppose that info is readily available....
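
A rough way to get such a histogram for a local directory tree (rather than for Launchpad-hosted branches) is sketched below. This is only an illustration: the function name `size_histogram` and the choice of decimal order-of-magnitude buckets are assumptions, not anything taken from the patch or from bzrlib.

```python
import os
import stat
from collections import Counter


def size_histogram(root):
    """Count regular files under root, bucketed by decimal order of magnitude.

    Bucket n holds files whose size in bytes has n + 1 digits, so bucket 0
    is 0-9 bytes, bucket 3 is 1000-9999 bytes, bucket 6 is 1-9.99 MB.
    (Hypothetical helper for illustration only.)
    """
    buckets = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; ignore it
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, devices, ...
            buckets[len(str(st.st_size)) - 1] += 1
    return buckets
```

Calling `size_histogram(os.path.expanduser("~"))` and looking at where the counts fall off would give a first impression of what a sensible default might be.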

Robert Collins (lifeless) wrote on 2011-08-08:

On Mon, Aug 8, 2011 at 1:37 PM, Shannon Weyrick <email address hidden> wrote:
> A nice way to figure out the default would be to get a histogram of file sizes in the bzr repos hosted on launchpad. Then we could pick something towards the top end of the decline in the curve. I don't suppose that info is readily available....

Not preprocessed, but the vast majority of branches are public, so you
should be able to use bzrlib's API to read the inventories of those
branches - just the tip revision might be sufficient.

-Rob

Martin Pool (mbp) wrote on 2011-08-08:

Thanks for this Shannon, I'm sure it will save some users some grief.

I think the default 'large' should probably be set at the kind of size
where there's a fair chance people would regret adding it, because of
the impact on disk/memory usage/transfer time. 1MB shouldn't be a
problem for any of them and I think it's fairly likely people will
have assets that big (if not actual source files.) Computers are
bigger than they were when Vesta was new...

I think say 50MB would be nearer the mark.

I don't think it's especially worth scanning Launchpad branches unless
you really want to. You might just as well look in your own home
directory or perhaps ask on the list.

As far as implementation:

+ _DEFAULT_LARGE_FILE_THRESHOLD = 1<<20; # 1 MB

[nit] we don't normally have semicolons at the end of statements

[optional] A lot of _large_file_threshold really could be generic
configuration code for "a non-negative integer with a default". You
could at least hoist this out; eventually it might be good if the
default was declared along with the option name.

[fix] Our convention now is that option names are grouped into the
area they relate to; I think this is only for add so it should
probably be add.large_file_threshold; perhaps we can also make it more
obvious as something like add.maximum_file_size. (What do you think?)

+ os.path.getsize(abspath) > large_threshold):

[optional] This will do another system call to get the size and we
probably already have it and can use the cached value. But it will be
a cheap call, since the inode should already be in cache.

+ "large_file_threshold of %i bytes)",

[tweak] For some reason we always say %d not %i.

[fix] This ought to be described in help_topics/en/configuration.txt

[fix] This probably also needs to be in the user documentation; I'm
not sure off hand where would be most appropriate.

Martin

Shannon Weyrick (weyrick) wrote on 2011-08-08:

> Thanks for this Shannon, I'm sure it will save some users some grief.
>
> I think the default 'large' should probably be set at the kind of size
> where there's a fair chance people would regret adding it, because of
> the impact on disk/memory usage/transfer time. 1MB shouldn't be a
> problem for any of them and I think it's fairly likely people will
> have assets that big (if not actual source files.) Computers are
> bigger than they were when Vesta was new...
>
> I think say 50MB would be nearer the mark.
>
> I don't think it's especially worth scanning Launchpad branches unless
> you really want to. You might just as well look in your own home
> directory or perhaps ask on the list.
>

I agree 1 MB is almost definitely too low, otoh 50 sounds high to me. I was thinking something on the order of 10-20MB - that almost definitely covers all source files and most normal resources from web projects as well. But, I'm not stuck on that number.

What would be real handy here is the first time this limit was reached, it could prompt the user if they wanted to raise/set it. Is there infrastructure in place in the ui for that kind of thing?

> As far as implementation:
>
> + _DEFAULT_LARGE_FILE_THRESHOLD = 1<<20; # 1 MB
>
> [nit] we don't normally have semicolons at the end of statements
>

np. obviously my c-like lang background showing here :)

> [optional] A lot of _large_file_threshold really could be generic
> configuration code for "a non-negative integer with a default". You
> could at least hoist this out; eventually it might be good if the
> default was declared along with the option name.
>

Actually, I think what we really want is to not force the user to enter an integer for this, but rather "20M" or "1G". I could whip something up for this and stick it in Config.

Is there a central place defaults/option names are defined? I didn't see one.

> [fix] Our convention now is that option names are grouped into the
> area they relate to; I think this is only for add so it should be
> probably add.large_file_threshold; perhaps we can also make it more
> obvious as something like add.maximum_file_size. (What do you think?)
>
> + os.path.getsize(abspath) > large_threshold):
>

Yes, maximum_file_size sounds better.

> [optional] This will do another system call to get the size and we
> probably already have it and can use the cached value. But it will be
> a cheap call, since the inode should already be in cache.
>

I thought of this as well, and thought it should be cached on some level as the size is/will be used in the add. But I'll investigate using the bzr level cache - which I think I saw in the tree? I'll check.

> + "large_file_threshold of %i bytes)",
>
> [tweak] For some reason we always say %d not %i.
>
> [fix] This ought to be described in help_topics/en/configuration.txt
>
> [fix] This probably also needs to be in the user documentation; I'm
> not sure off hand where would be most appropriate.
>

No prob on these items.

Thanks for the help! I'll work on these changes as I have time over the next few days.

Shannon

Jelmer Vernooij (jelmer) wrote on 2011-08-08:

Should this be this close to the core? I wonder if it makes more sense as a UI-level thing that's just part of "bzr add".

With the current approach, we'd have to introduce workarounds for things like the UDD importer, but I presume there will also be a fair number of packages for which "bzr merge-upstream" will now fail.

It also means all custom tree implementations will have to explicitly add support.

Jelmer Vernooij (jelmer) on 2011-08-08:

review: Needs Information

Martin Pool (mbp) wrote on 2011-08-09:

@jelmer Of course you're right, I'm glad you caught this. Perhaps the interface by which add reports back to the ui (I think there is a result or something) would be better.

Martin Pool (mbp) wrote on 2011-08-09:

On 9 August 2011 01:01, Shannon Weyrick <email address hidden> wrote:
>> Thanks for this Shannon, I'm sure it will save some users some grief.
>>
>> I think the default 'large' should probably be set at the kind of size
>> where there's a fair chance people would regret adding it, because of
>> the impact on disk/memory usage/transfer time. 1MB shouldn't be a
>> problem for any of them and I think it's fairly likely people will
>> have assets that big (if not actual source files.) Computers are
>> bigger than they were when Vesta was new...
>>
>> I think say 50MB would be nearer the mark.
>>
>> I don't think it's especially worth scanning Launchpad branches unless
>> you really want to. You might just as well look in your own home
>> directory or perhaps ask on the list.
>>
>
> I agree 1 MB is almost definitely too low, otoh 50 sounds high to me. I was thinking something on the order of 10-20MB - that almost definitely covers all source files and most normal resources from web projects as well. But, I'm not stuck on that number.

10-20MB is also fine with me.

> What would be real handy here is the first time this limit was reached, it could prompt the user if they wanted to raise/set it. Is there infrastructure in place in the ui for that kind of thing?

There is some infrastructure towards that - actually this reminds me
of something I should have said before, which is that rather than
trace.warning you should use ui_factory.warning - that takes a
structure identifier of the type of warning, with the idea that
eventually we can give the user an option to not show the warning any
more, or perhaps to change it.

> Actually, I think what we really want is to not force the user to enter an integer for this, but rather "20M" or "1G". I could whip something up for this and stick it in Config.

That would be great.

> Is there a central place defaults/option names are defined? I didn't see one.

There is option_registry in config.py.

Robert Collins (lifeless) wrote on 2011-08-09:

On performance: we've found add to be pretty sensitive to syscall
volumes these days; I would personally want a test on a 50K file tree
before discounting the impact of additional syscalls.

Shannon Weyrick (weyrick) wrote on 2011-08-09:

Ok, thanks for the feedback everyone. To summarize then, here are my goals for completing this:

1) Ensure that skipping a large file doesn't happen directly in MutableTree.smart_add, but rather find a way to have the skip take place in the ui. This part may need more discussion - clearly we don't want to scan files in the ui _and_ in smart_add, but the latter is where all the scanning takes place now. AFAICT, by the time control returns to the ui, the adds have taken place. Only a count of files added is returned right now. Even if it returned the full add list, it would have to make calls to unadd/revert them at that point, I believe. Doesn't seem ideal, but let me know if that's desirable. Otherwise, I'm thinking either an optional callback of some sort (maybe similar to the existing AddAction - i.e. a new SkipAction?), or possibly making some code from _SmartAddHelper more central. Other ideas from those more knowledgeable with the codebase obviously welcome here.

2) Ensure that the stat for filesize is cached. Benchmark on large import.

3) Add a simple parser for specifying values in "human readable" sizes (e.g. "500M") to Config

4) Default limit to 20M

5) Possibly prompt user if the limit is hit and they haven't overridden the default

6) Use config name "add.maximum_file_size"

7) Document in configuration.txt and user docs
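
Goal 3 could be sketched roughly as follows. This is a standalone illustration, not the code that was merged: the name `parse_si_size` is hypothetical, and the regular expression here is deliberately stricter than the one in the diff (it accepts at most one K/M/G suffix and at most one trailing "b"). As in the patch, suffixes are decimal (K = 10**3, M = 10**6, G = 10**9) and case-insensitive.

```python
import re

# Digits, an optional K/M/G suffix, and an optional trailing "b".
_SIZE_RE = re.compile(r"^(\d+)([kmg]?)b?$", re.IGNORECASE)
_MULTIPLIERS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9}


def parse_si_size(text, default=None):
    """Convert a string such as "20MB" or "1G" into a number of bytes.

    Returns default when the text is not a recognisable size.
    (Hypothetical helper; bzrlib's own method is
    get_user_option_as_int_from_SI in the diff.)
    """
    match = _SIZE_RE.match(text.strip())
    if match is None:
        return default
    number, suffix = match.groups()
    return int(number) * _MULTIPLIERS[suffix.lower()]
```

With this sketch, `parse_si_size("20MB")` yields 20000000, which matches the default named in goal 4, and a plain integer string is treated as a byte count.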

Shannon Weyrick (weyrick) wrote on 2011-08-12:

All the goals have been met in the current patch, except 5 and 2. 5 I'm not doing for now. 2 is a larger problem - some informal tests importing large trees show about 23% slowdown with the extra stat. Unfortunately I don't see a simple path towards sharing stat calls - they seem to be spread throughout the code now with no central way to cache. I'm looking into this further now.

John A Meinel (jameinel) wrote on 2011-08-12:

On 8/12/2011 4:55 PM, Shannon Weyrick wrote:
> All the goals have been met in the current patch, except 5 and 2. 5
> I'm not doing for now. 2 is a larger problem - some informal tests
> importing large trees show about 23% slowdown with the extra stat.
> Unfortunately I don't see a simple path towards sharing stat calls -
> they seem to be spread throughout the code now with no central way to
> cache. I'm looking into this further now.
>

Is it possible to put it in as part of get_file_with_stat or something
along those lines? That uses 'fstat' to make sure to stat the object
that is already in memory.

I do realize that add does things slightly differently, however we
already have the stat object around there, because we need to know if
the object is a file or a directory.

John
=:->

Shannon Weyrick (weyrick) wrote on 2011-08-12:

>
> Is it possible to put it in as part of get_file_with_stat or something
> along those lines? That uses 'fstat' to make sure to stat the object
> that is already in memory.
>
> I do realize that add does things slightly differently, however we
> already have the stat object around there, because we need to know if
> the object is a file or a directory.
>

Yes currently there's a stat to determine the file kind, which is using osutils.file_kind. Actually, it only does this if the inventory entry didn't have the kind, but I haven't hit a case where that's true in my tests. But I don't see a trivial way to have the file size check share the stat with the one from file_kind ... So it seems add would have to use something other than osutils.file_kind. If get_file_with_stat will work for both, that may do it. I'll take a look, thanks.

Shannon

Shannon Weyrick (weyrick) wrote on 2011-08-12:

get_file_with_stat did not seem applicable, because it required a file_id which was not yet available since the file isn't added yet. Instead, I added a file_stat function (which may potentially cache results, but doesn't currently), and in turn made file_kind use this. _SmartAddHelper.add now uses file_stat which can pass the results on to the new AddAction. Therefore, there is a single stat call and my tests on a large tree (linux kernel, ~39k files) show these changes to be only about .01 - .03 seconds slower than bzr.dev on my system.

So, I believe all goals have been met, please review.

Shannon
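
The idea of sharing one stat call between the kind check and the size check can be illustrated with the standard library alone. The sketch below is not bzrlib code: `kind_and_size` is a hypothetical name, and the kind strings are simplified stand-ins for what osutils.file_kind_from_stat_mode returns.

```python
import os
import stat


def kind_and_size(path):
    """Return (kind, size) for path using a single lstat call.

    The same stat result answers both "what kind of object is this?"
    and "how big is it?", so no second syscall is needed.
    """
    st = os.lstat(path)  # the only syscall made here
    mode = st.st_mode
    if stat.S_ISREG(mode):
        kind = "file"
    elif stat.S_ISDIR(mode):
        kind = "directory"
    elif stat.S_ISLNK(mode):
        kind = "symlink"
    else:
        kind = "other"
    return kind, st.st_size
```

In the patch itself the equivalent step is osutils.file_stat returning the lstat result, which _SmartAddHelper.add then passes to skip_file as stat_value.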

Jelmer Vernooij (jelmer) wrote on 2011-08-13:

Hi Shannon,

Thanks for your patience with this. I think this is broadly ok, though I have a few more minor nitpicks:

We don't use camel casing for method names, so skipFile should be skip_file.

I'm not sure if it's part of the syntax guide, but we generally don't add parentheses where they aren't necessary, e.g. in if statements.

Please don't use relative imports - i.e. "from bzrlib import ui" rather than "import ui".

except without arguments can easily hide bugs (it also catches e.g. NameError) - or "eat" things like KeyboardInterrupts, making it impossible to use Ctrl+C. Please catch specific exceptions instead.
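
The last point can be shown with a small, generic example. It is only a sketch with a hypothetical function name (`read_size_option`), unrelated to the actual patch code: the handler names exactly the exceptions that int() raises for bad input, so a KeyboardInterrupt or a NameError from a typo would still propagate.

```python
def read_size_option(raw, default):
    """Convert a raw config value to int, falling back to default.

    Only the failures int() is expected to raise are caught:
    TypeError for None (or other non-string, non-number values) and
    ValueError for strings that are not integers.
    """
    try:
        return int(raw)
    except (TypeError, ValueError):
        return default
```

A bare `except:` in the same place would return the default even if the user pressed Ctrl+C during the call, which is the behaviour the review asks to avoid.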

Shannon Weyrick (weyrick) wrote on 2011-08-15:

Jelmer, thanks I've made those changes. Let me know if there's something else.
Shannon

Jonathan Riddell (jr) wrote on 2011-08-17:

Unfortunately this doesn't work if you specify the file name on the bzr add command; in this case the code path seems to go through mutabletree.py _SmartAddHelper._add_one_and_parent().

Also I think there should be a command line way to force adding large files (although that doesn't need to be part of this patch).

review: Needs Fixing

Shannon Weyrick (weyrick) wrote on 2011-08-17:

Hi Jonathan,

I think your second comment is solved by the first. By that I mean, I
believe if you specify the file on the command line, it should never
be skipped for being large, because that's never what you'd want
(actually maybe a tab completion mistake, but that's not likely
enough). Therefore, an effective way to force adding a large file is
simply to specify it on the command line.

So if you had said "bzr add ." and it told you it skipped a few files,
you could follow it up with "bzr add foo bar" to get the large ones
you really wanted. Seems like a decent work flow to me, and solves the
problem of accidentally adding large files during a recursive import,
while still allowing flexibility with both the config option and the
explicit adds.

Shannon
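
The policy described above (skip only during recursive adds, never skip named items, and treat a limit of 0 as "no limit") can be summarised in a few lines. This is a simplified sketch with a hypothetical function name and signature; in the patch the decision lives in AddWithSkipLargeAction.skip_file, which is only called on the recursive path.

```python
def should_skip(size, max_size, explicitly_named):
    """Decide whether a file should be skipped for being too large.

    size and max_size are byte counts. explicitly_named is True when the
    user passed the path on the command line (hypothetical flag used here
    to model the "named items are never skipped" rule).
    """
    if explicitly_named:
        return False  # the user asked for this file by name
    if max_size <= 0:
        return False  # a limit of 0 disables skipping
    return size > max_size
```

So with the 20 MB default, `bzr add .` would skip a 65 MB file found while recursing, and a follow-up `bzr add` naming that file would add it.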

On Wed, Aug 17, 2011 at 7:41 AM, Jonathan Riddell <email address hidden> wrote:
> Review: Needs Fixing
> Unfortunately this doesn't work if you specify the file name on the bzr add command; in this case the code path seems to go through mutabletree.py _SmartAddHelper._add_one_and_parent().
>
> Also I think there should be a command line way to force adding large files (although that doesn't need to be part of this patch).

Jonathan Riddell (jr) wrote on 2011-08-19:

sent to pqm by email

kesten broughton (dathomir) wrote on 2011-11-26:

Hi all,
I got bit by this bug, so thanks a lot for fixing it. A couple of thoughts from a newbie:
On file size, the problem might not be one large 50 MB file, but 100 files of 1 MB each, say in a photo album.

As I thought about it, I wondered whether the large multimedia files that might be assets for a code project belong in a code revision repo. You don't really change photos or movies the way you would code, and none of the merging, etc. works for video. Perhaps a tutorial suggesting different ways to back up non-code components of a project would be useful.

On the other hand, documentation and video tutorials might belong in a code dev repository. I guess I'm wondering if anyone has thought this through and come up with an elegant solution.

thanks again all,

kesten

martin suchanek (martin-suc) wrote on 2012-06-02:

Hi,

I am a newbie. Could anybody please tell me how and where to set add.maximum_file_size? I am getting many warnings like the following:
bzr: warning: skipping /etc/apache2/core.17269 (larger than add.maximum_file_size of 20000000 bytes)
The files are bigger than 65 Mbytes.

thank you,
kind regards,
M.

review: Needs Fixing

Vincent Ladeuil (vila) wrote on 2012-06-02:

'bzr add -Oadd.maximum_file_size=66MB' should do it.

Preview Diff

 === modified file 'bzrlib/add.py'
 --- bzrlib/add.py	2011-06-14 02:21:41 +0000
 +++ bzrlib/add.py	2011-08-19 22:28:34 +0000
@@ -17,9 +17,11 @@
  """Helper functions for adding files to working trees."""
  import sys
++import os
  from bzrlib import (
      osutils,
++    ui,
+     )
@@ -53,6 +55,48 @@
              self._to_file.write('adding %s\n' % _quote(path))
          return None
++    def skip_file(self, tree, path, kind, stat_value = None):
++        """Test whether the given file should be skipped or not.
++
++        The default action never skips. Note this is only called during
++        recursive adds
++
++        :param tree: The tree we are working in
++        :param path: The path being added
++        :param kind: The kind of object being added.
++        :param stat: Stat result for this file, if available already
++        :return bool. True if the file should be skipped (not added)
++        """
++        return False
++
++
++class AddWithSkipLargeAction(AddAction):
++    """A class that can decide to skip a file if it's considered too large"""
++
++    # default 20 MB
++    _DEFAULT_MAX_FILE_SIZE = 20000000
++    _optionName = 'add.maximum_file_size'
++    _maxSize = None
++
++    def skip_file(self, tree, path, kind, stat_value = None):
++        if kind != 'file':
++            return False
++        if self._maxSize is None:
++            config = tree.branch.get_config()
++            self._maxSize = config.get_user_option_as_int_from_SI(
++                self._optionName,
++                self._DEFAULT_MAX_FILE_SIZE)
++        if stat_value is None:
++            file_size = os.path.getsize(path);
++        else:
++            file_size = stat_value.st_size;
++        if self._maxSize > 0 and file_size > self._maxSize:
++            ui.ui_factory.show_warning(
++                "skipping %s (larger than %s of %d bytes)" %
++                (path, self._optionName,  self._maxSize))
++            return True
++        return False
++
  class AddFromBaseAction(AddAction):
      """This class will try to extract file ids from another tree."""
 === modified file 'bzrlib/builtins.py'
 --- bzrlib/builtins.py	2011-08-18 04:23:06 +0000
 +++ bzrlib/builtins.py	2011-08-19 22:28:34 +0000
@@ -674,6 +674,10 @@
      Any files matching patterns in the ignore list will not be added
      unless they are explicitly mentioned.
++
++    In recursive mode, files larger than the configuration option
++    add.maximum_file_size will be skipped. Named items are never skipped due
++    to file size.
      """
      takes_args = ['file*']
      takes_options = [
@@ -706,7 +710,7 @@
              action = bzrlib.add.AddFromBaseAction(base_tree, base_path,
                            to_file=self.outf, should_print=(not is_quiet()))
          else:
--            action = bzrlib.add.AddAction(to_file=self.outf,
++            action = bzrlib.add.AddWithSkipLargeAction(to_file=self.outf,
                  should_print=(not is_quiet()))
          if base_tree:
 === modified file 'bzrlib/config.py'
 --- bzrlib/config.py	2011-08-16 15:12:39 +0000
 +++ bzrlib/config.py	2011-08-19 22:28:34 +0000
@@ -75,6 +75,7 @@
  import os
  import string
  import sys
++import re
  from bzrlib.decorators import needs_write_lock
@@ -413,6 +414,45 @@
              # add) the final ','
              l = [l]
          return l
++
++    def get_user_option_as_int_from_SI(self,  option_name,  default=None):
++        """Get a generic option from a human readable size in SI units, e.g 10MB
++
++        Accepted suffixes are K,M,G. It is case-insensitive and may be followed
++        by a trailing b (i.e. Kb, MB). This is intended to be practical and not
++        pedantic.
++
++        :return Integer, expanded to its base-10 value if a proper SI unit is
++            found. If the option doesn't exist, or isn't a value in
++            SI units, return default (which defaults to None)
++        """
++        val = self.get_user_option(option_name)
++        if isinstance(val, list):
++            val = val[0]
++        if val is None:
++            val = default
++        else:
++            p = re.compile("^(\d+)([kmg])*b*$", re.IGNORECASE)
++            try:
++                m = p.match(val)
++                if m is not None:
++                    val = int(m.group(1))
++                    if m.group(2) is not None:
++                        if m.group(2).lower() == 'k':
++                            val *= 10**3
++                        elif m.group(2).lower() == 'm':
++                            val *= 10**6
++                        elif m.group(2).lower() == 'g':
++                            val *= 10**9
++                else:
++                    ui.ui_factory.show_warning('Invalid config value for "%s" '
++                                               ' value %r is not an SI unit.'
++                                                % (option_name, val))
++                    val = default
++            except TypeError:
++                val = default
++        return val
++
      def gpg_signing_command(self):
          """What program should be used to sign signatures?"""
 === modified file 'bzrlib/help_topics/en/configuration.txt'
 --- bzrlib/help_topics/en/configuration.txt	2011-08-16 15:12:39 +0000
 +++ bzrlib/help_topics/en/configuration.txt	2011-08-19 22:28:34 +0000
@@ -628,6 +628,14 @@
  If present, defines the ``--strict`` option default value for checking
  uncommitted changes before sending a merge directive.
++add.maximum_file_size
++~~~~~~~~~~~~~~~~~~~~~
++
++Defines the maximum file size the command line "add" operation will allow
++in recursive mode, with files larger than this value being skipped. You may
++specify this value as an integer (in which case it is interpreted as bytes),
++or you may specify the value using SI units, i.e. 10KB, 20MB, 1G. A value of 0
++will disable skipping.
  External Merge Tools
  --------------------
 === modified file 'bzrlib/mutabletree.py'
 --- bzrlib/mutabletree.py	2011-07-23 16:33:38 +0000
 +++ bzrlib/mutabletree.py	2011-08-19 22:28:34 +0000
@@ -582,8 +582,9 @@
          :param parent_ie: Parent inventory entry if known, or None.  If
              None, the parent is looked up by name and used if present, otherwise it
              is recursively added.
++        :param path:
          :param kind: Kind of new entry (file, directory, etc)
--        :param action: callback(tree, parent_ie, path, kind); can return file_id
++        :param inv_path:
          :return: Inventory entry for path and a list of paths which have been added.
          """
          # Nothing to do if path is already versioned.
@@ -628,7 +629,7 @@
              if (prev_dir is None or not is_inside([prev_dir], path)):
                  yield (path, inv_path, this_ie, None)
              prev_dir = path
--
++
      def __init__(self, tree, action, conflicts_related=None):
          self.tree = tree
          if action is None:
@@ -695,12 +696,18 @@
              # get the contents of this directory.
--            # find the kind of the path being added.
++            # find the kind of the path being added, and save stat_value
++            # for reuse
++            stat_value = None
              if this_ie is None:
--                kind = osutils.file_kind(abspath)
++                stat_value = osutils.file_stat(abspath)
++                kind = osutils.file_kind_from_stat_mode(stat_value.st_mode)
              else:
                  kind = this_ie.kind
--
++
++            # allow AddAction to skip this file
++            if self.action.skip_file(self.tree,  abspath,  kind,  stat_value):
++                continue
              if not InventoryEntry.versionable_kind(kind):
                  trace.warning("skipping %s (can't add file of kind '%s')",
                                abspath, kind)
@@ -769,7 +776,7 @@
                          # recurse into this already versioned subdir.
                          things_to_add.append((subp, sub_invp, sub_ie, this_ie))
                      else:
--                        # user selection overrides ignoes
++                        # user selection overrides ignores
                          # ignore while selecting files - if we globbed in the
                          # outer loop we would ignore user files.
                          ignore_glob = self.tree.is_ignored(subp)
 === modified file 'bzrlib/osutils.py'
 --- bzrlib/osutils.py	2011-08-12 12:18:34 +0000
 +++ bzrlib/osutils.py	2011-08-19 22:28:34 +0000
@@ -2178,15 +2178,18 @@
      return file_kind_from_stat_mode(mode)
  file_kind_from_stat_mode = file_kind_from_stat_mode_thunk
--
--def file_kind(f, _lstat=os.lstat):
++def file_stat(f, _lstat=os.lstat):
      try:
--        return file_kind_from_stat_mode(_lstat(f).st_mode)
++        # XXX cache?
++        return _lstat(f)
      except OSError, e:
          if getattr(e, 'errno', None) in (errno.ENOENT, errno.ENOTDIR):
              raise errors.NoSuchFile(f)
          raise
++def file_kind(f, _lstat=os.lstat):
++    stat_value = file_stat(f, _lstat)
++    return file_kind_from_stat_mode(stat_value.st_mode)
  def until_no_eintr(f, *a, **kw):
      """Run f(*a, **kw), retrying if an EINTR error occurs.
 === modified file 'bzrlib/tests/blackbox/test_add.py'
 --- bzrlib/tests/blackbox/test_add.py	2011-07-11 06:47:32 +0000
 +++ bzrlib/tests/blackbox/test_add.py	2011-08-19 22:28:34 +0000
@@ -239,3 +239,27 @@
          out, err = self.run_bzr(["add", "a", "b"], working_dir=u"\xA7")
          self.assertEquals(out, "adding a\n" "adding b\n")
          self.assertEquals(err, "")
++
++    def test_add_skip_large_files(self):
++        """Test skipping files larger than add.maximum_file_size"""
++        tree = self.make_branch_and_tree('.')
++        self.build_tree(['small.txt', 'big.txt', 'big2.txt'])
++        self.build_tree_contents([('small.txt', '0\n')])
++        self.build_tree_contents([('big.txt', '01234567890123456789\n')])
++        self.build_tree_contents([('big2.txt', '01234567890123456789\n')])
++        tree.branch.get_config().set_user_option('add.maximum_file_size', 5)
++        out = self.run_bzr('add')[0]
++        results = sorted(out.rstrip('\n').split('\n'))
++        self.assertEquals(['adding small.txt'],
++                          results)
++        # named items never skipped, even if over max
++        out, err = self.run_bzr(["add", "big2.txt"])
++        results = sorted(out.rstrip('\n').split('\n'))
++        self.assertEquals(['adding big2.txt'],
++                          results)
++        self.assertEquals(err, "")
++        tree.branch.get_config().set_user_option('add.maximum_file_size', 30)
++        out = self.run_bzr('add')[0]
++        results = sorted(out.rstrip('\n').split('\n'))
++        self.assertEquals(['adding big.txt'],
++                          results)
 === modified file 'bzrlib/tests/test_config.py'
 --- bzrlib/tests/test_config.py	2011-08-12 14:07:21 +0000
 +++ bzrlib/tests/test_config.py	2011-08-19 22:28:34 +0000
@@ -1035,6 +1035,26 @@
          # automatically cast to list
          self.assertEqual(['x'], get_list('one_item'))
++    def test_get_user_option_as_int_from_SI(self):
++        conf, parser = self.make_config_parser("""
++plain = 100
++si_k = 5k,
++si_kb = 5kb,
++si_m = 5M,
++si_mb = 5MB,
++si_g = 5g,
++si_gb = 5gB,
++""")
++        get_si = conf.get_user_option_as_int_from_SI
++        self.assertEqual(100, get_si('plain'))
++        self.assertEqual(5000, get_si('si_k'))
++        self.assertEqual(5000, get_si('si_kb'))
++        self.assertEqual(5000000, get_si('si_m'))
++        self.assertEqual(5000000, get_si('si_mb'))
++        self.assertEqual(5000000000, get_si('si_g'))
++        self.assertEqual(5000000000, get_si('si_gb'))
++        self.assertEqual(None, get_si('non-exist'))
++        self.assertEqual(42, get_si('non-exist-with-default', 42))
  class TestSupressWarning(TestIniConfig):
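The expectations in the test above (5k and 5kb give 5000, 5M and 5MB give 5000000, 5g and 5gB give 5000000000, matched case-insensitively) can be met by a small parser. The sketch below is an illustration under those assumptions and is not the implementation in bzrlib/config.py. The name `parse_si_int`, the regular expression and the error handling are all hypothetical.

```python
import re

# Hypothetical names; bzrlib's own implementation may look different.
_SI_PATTERN = re.compile(r'^(\d+)([kmg])?b?$', re.IGNORECASE)
_SI_MULTIPLIERS = {'k': 10 ** 3, 'm': 10 ** 6, 'g': 10 ** 9}


def parse_si_int(text, default=None):
    """Expand a value such as '20MB' into its integer equivalent.

    Returns `default` when `text` is None (option not set).
    """
    if text is None:
        return default
    match = _SI_PATTERN.match(text.strip())
    if match is None:
        raise ValueError("not an SI-style integer: %r" % (text,))
    number = int(match.group(1))
    unit = match.group(2)
    if unit is None:
        # Plain integer, optionally followed by a bare 'b'.
        return number
    return number * _SI_MULTIPLIERS[unit.lower()]
```

The multipliers are powers of ten rather than powers of two, matching the values asserted in the test.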
 === modified file 'doc/en/release-notes/bzr-2.5.txt'
 --- doc/en/release-notes/bzr-2.5.txt	2011-08-18 04:23:06 +0000
 +++ doc/en/release-notes/bzr-2.5.txt	2011-08-19 22:28:34 +0000
@@ -78,6 +78,11 @@
  * Relative local paths can now be specified in URL syntax by using the
    "file:" prefix.  (Jelmer Vernooij)
++* bzr add now skips large files in recursive mode. The default "large"
++  size is 20MB, and is configurable via the add.maximum_file_size
++  option. A value of 0 disables skipping. Named items passed to add are
++  never skipped. (Shannon Weyrick, #54624)
++
  Improvements
  ************
@@ -169,6 +174,10 @@
    no longer support it.
    (Martin Pool)
++* New method ``Config.get_user_option_as_int_from_SI`` added for expanding a
++  value in SI format (e.g. "20MB", "1GB") into its integer equivalent.
++  (Shannon Weyrick)
++
  * ``Transport`` now has a ``_parsed_url`` attribute instead of
    separate ``_user``, ``_password``, ``_port``, ``_scheme``, ``_host``
    and ``_path`` attributes. Proxies are provided for the moment but