Bazaar

Code review comment for lp:~jameinel/bzr/1.16-commit-fulltext

1.16-commit-fulltext
Merge into trunk-old

Revision history for this message

John A Meinel (jameinel) wrote on 2009-06-22:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Review: Approve
> I have a couple of comments, but basically:
>
> vote approve
>
> John A Meinel wrote:
> [...]
>> + def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
>> + """See VersionedFiles.add_text()."""
>> + self._index._check_write_ok()
>> + self._check_add(key, None, random_id, check_content=False)
>> + if text.__class__ is not str:
>> + raise errors.BzrBadParameterUnicode("text")
>
> Well... 'text' might not be unicode either ;)
>
> (I'm not really concerned, but ideally perhaps that exception would take the
> param value or type as well as param name...)

Sure. It was all that "_check_add" was doing, but _check_add does it by
iterating over the list of lines.

>
> [...]
>> @@ -1597,7 +1615,7 @@
>> if refs:
>> for ref in refs:
>> if ref:
>> - raise KnitCorrupt(self,
>> + raise errors.KnitCorrupt(self,
>
> Thanks for fixing this. pyflakes reports two other undefined names in this
> file:
>
> bzrlib/groupcompress.py:1543: undefined name 'progress'
> bzrlib/groupcompress.py:1689: undefined name 'RevisionNotPresent'
>
> Now would be a good time to fix those too :)

Done

Though of course it would be better to have tests that provoke the
behavior.... :)

>> === modified file 'bzrlib/knit.py'
> [...]
>> + line_bytes = ''.join(lines)
>> return self._add(key, lines, parents,
>> - parent_texts, left_matching_blocks, nostore_sha, random_id)
>> + parent_texts, left_matching_blocks, nostore_sha, random_id,
>> + line_bytes=line_bytes)
>
> You pass lines *and* line_bytes? That looks odd. I'm pretty sure it's
> intentional, but it's worth pointing this out and explaining the rationale
> in _add's docstring, I think.

Sure. It is because there are places that want lines (Content objects
require them, and our "diff" code requires them as well.) and the commit
code would rather have line_bytes (because writing a single string to
the gzip code is quite a bit faster than a bunch of short ones).

And this way, we have both types accessible when we need them, rather
than "re-casting" from whatever we have into the other type, and then
multiplying memory *again*.

I'll doc it better.

>
>> + def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
>> + """See VersionedFiles.add_text()."""
>> + self._index._check_write_ok()
>> + self._check_add(key, None, random_id, check_content=False)
>> + if text.__class__ is not str:
>> + raise errors.BzrBadParameterUnicode("text")
>> + if parents is None:
>> + # The caller might pass None if there is no graph data, but kndx
>> + # indexes can't directly store that, so we give them
>> + # an empty tuple instead.
>> + parents = ()
>> + return self._add(key, None, parents,
>> + None, None, nostore_sha, random_id,
>> + line_bytes=text)
>
> Hmm. It would be nice if this wasn't largely the same as _add_text in
> groupcompress.py.

All but the final "self._add" versus "record = FulltextContentFactory,
self._insert...".

It didn't seem like quite enough duplication to worry about. If you
really prefer I can refactor.

>
> [...]
>> - if lines:
>> - if lines[-1][-1] != '\n':
>> - # copy the contents of lines.
>> + no_eol = False
>> + # Note: line_bytes is not modified to add a newline, that is tracked
>> + # via the no_eol flag. 'lines' *is* modified, because that is the
>> + # general values needed by the Content code.
>
> Gosh I hate our no_eol flag... :)

Yeah, it really wasn't the best way to go about it. The Content code is
all sorts of confused because of it. (line-deltas are a diff *with* the
final newline always present, etc.)

Robert originally wrote Groupcompress with a better method (*always* add
a final newline and always strip it back off), but then we ended up
going with a binary format that doesn't actually need a trailing newline
anymore.

>
> [...]
>> @@ -1920,21 +1949,16 @@
>> function spends less time resizing the final string.
>> :return: (len, a StringIO instance with the raw data ready to read.)
>> """
>> - # Note: using a string copy here increases memory pressure with e.g.
>> - # ISO's, but it is about 3 seconds faster on a 1.2Ghz intel machine
>> - # when doing the initial commit of a mozilla tree. RBC 20070921
>> - bytes = ''.join(chain(
>
> itertools.chain is now unused in this module, delete its import.
>

Done.

> [...]
>> + for chunk in chunks:
>> + if type(chunk) != str:
>
> “is not”, seeing as you're touching this line anyway...

Done. I cleaned up a few more while I was at it.

>
>> === modified file 'bzrlib/tests/test_versionedfile.py'
> [...]
>
> The conditionals and repetition here are a bit ugly, but it'll do. Ideally
> this would be done with the proper test parameterisation tools, though.
>
> -Andrew.
>
>

So... self.key_length is handled by the parameterization, but I think I
can use 'self.get_simple_key()' which knows about self.key_length and at
least lets us hide the fact that it is doing crazy stuff with key width :)

I don't know a way to handle 'self.graph' better. If the VF supports
graph, then we want to have parents supplied. If the VF *doesn't* then
it generates an exception if you supply parents.

It at least was how other tests (like test_annotate) use the same 'if
self.graph' so I stuck with it.

So it is at least better, even if it isn't perfect.

I suppose we could add a test permutation that tried to turn all calls
to "add_lines" into "_add_text" and run against those. I don't feel it
is quite worth that.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAko/pRAACgkQJdeBCYSNAAPrzACgi5N62aud9OGCpyFmteCYJniD
RN0AniqoWoTk7HdFqsP4lfbgNKuSxXW3
=RHWC
-----END PGP SIGNATURE-----

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> Review: Approve
> I have a couple of comments, but basically:
> 
>  vote approve
> 
> John A Meinel wrote:
> [...]
>> +    def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
>> +        """See VersionedFiles.add_text()."""
>> +        self._index._check_write_ok()
>> +        self._check_add(key, None, random_id, check_content=False)
>> +        if text.__class__ is not str:
>> +            raise errors.BzrBadParameterUnicode("text")
> 
> Well... 'text' might not be unicode either ;)
> 
> (I'm not really concerned, but ideally perhaps that exception would take the
> param value or type as well as param name...)

Sure. It was all that "_check_add" was doing, but _check_add does it by
iterating over the list of lines.

> 
> [...]
>> @@ -1597,7 +1615,7 @@
>>                  if refs:
>>                      for ref in refs:
>>                          if ref:
>> -                            raise KnitCorrupt(self,
>> +                            raise errors.KnitCorrupt(self,
> 
> Thanks for fixing this.  pyflakes reports two other undefined names in this
> file:
> 
> bzrlib/groupcompress.py:1543: undefined name 'progress'
> bzrlib/groupcompress.py:1689: undefined name 'RevisionNotPresent'
> 
> Now would be a good time to fix those too :)

Done

Though of course it would be better to have tests that provoke the
behavior.... :)

>> === modified file 'bzrlib/knit.py'
> [...]
>> +        line_bytes = ''.join(lines)
>>          return self._add(key, lines, parents,
>> -            parent_texts, left_matching_blocks, nostore_sha, random_id)
>> +            parent_texts, left_matching_blocks, nostore_sha, random_id,
>> +            line_bytes=line_bytes)
> 
> You pass lines *and* line_bytes?  That looks odd.  I'm pretty sure it's
> intentional, but it's worth pointing this out and explaining the rationale
> in _add's docstring, I think.

And this way, we have both types accessible when we need them, rather
than "re-casting" from whatever we have into the other type, and then
multiplying memory *again*.

I'll doc it better.

> 
>> +    def _add_text(self, key, parents, text, nostore_sha=None, random_id=False):
>> +        """See VersionedFiles.add_text()."""
>> +        self._index._check_write_ok()
>> +        self._check_add(key, None, random_id, check_content=False)
>> +        if text.__class__ is not str:
>> +            raise errors.BzrBadParameterUnicode("text")
>> +        if parents is None:
>> +            # The caller might pass None if there is no graph data, but kndx
>> +            # indexes can't directly store that, so we give them
>> +            # an empty tuple instead.
>> +            parents = ()
>> +        return self._add(key, None, parents,
>> +            None, None, nostore_sha, random_id,
>> +            line_bytes=text)
> 
> Hmm.  It would be nice if this wasn't largely the same as _add_text in
> groupcompress.py.

All but the final "self._add" versus "record = FulltextContentFactory,
self._insert...".

It didn't seem like quite enough duplication to worry about. If you
really prefer I can refactor.

> 
> [...]
>> -        if lines:
>> -            if lines[-1][-1] != '\n':
>> -                # copy the contents of lines.
>> +        no_eol = False
>> +        # Note: line_bytes is not modified to add a newline, that is tracked
>> +        #       via the no_eol flag. 'lines' *is* modified, because that is the
>> +        #       general values needed by the Content code.
> 
> Gosh I hate our no_eol flag... :)

Yeah, it really wasn't the best way to go about it. The Content code is
all sorts of confused because of it. (line-deltas are a diff *with* the
final newline always present, etc.)

> 
> [...]
>> @@ -1920,21 +1949,16 @@
>>              function spends less time resizing the final string.
>>          :return: (len, a StringIO instance with the raw data ready to read.)
>>          """
>> -        # Note: using a string copy here increases memory pressure with e.g.
>> -        # ISO's, but it is about 3 seconds faster on a 1.2Ghz intel machine
>> -        # when doing the initial commit of a mozilla tree. RBC 20070921
>> -        bytes = ''.join(chain(
> 
> itertools.chain is now unused in this module, delete its import.
>

Done.

> [...]
>> +        for chunk in chunks:
>> +            if type(chunk) != str:
> 
> “is not”, seeing as you're touching this line anyway...

Done. I cleaned up a few more while I was at it.

> 
>> === modified file 'bzrlib/tests/test_versionedfile.py'
> [...]
> 
> The conditionals and repetition here are a bit ugly, but it'll do.  Ideally
> this would be done with the proper test parameterisation tools, though.
> 
> -Andrew.
> 
>

I don't know a way to handle 'self.graph' better. If the VF supports
graph, then we want to have parents supplied. If the VF *doesn't* then
it generates an exception if you supply parents.

It at least was how other tests (like test_annotate) use the same 'if
self.graph' so I stuck with it.

So it is at least better, even if it isn't perfect.

I suppose we could add a test permutation that tried to turn all calls
to "add_lines" into "_add_text" and run against those. I don't feel it
is quite worth that.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAko/pRAACgkQJdeBCYSNAAPrzACgi5N62aud9OGCpyFmteCYJniD
RN0AniqoWoTk7HdFqsP4lfbgNKuSxXW3
=RHWC
-----END PGP SIGNATURE-----

« Back to merge proposal