Merge lp:~garyvdm/bzr/unicode_bom_detect into lp:bzr
| Status: | Work in progress |
|---|---|
| Proposed branch: | lp:~garyvdm/bzr/unicode_bom_detect |
| Merge into: | lp:bzr |
| Diff against target: | 229 lines (+91/-17), 7 files modified: NEWS (+7/-0), bzrlib/diff.py (+2/-2), bzrlib/merge.py (+1/-1), bzrlib/merge3.py (+4/-4), bzrlib/shelf_ui.py (+2/-2), bzrlib/tests/test_textfile.py (+27/-6), bzrlib/textfile.py (+48/-2) |
| To merge this branch: | bzr merge lp:~garyvdm/bzr/unicode_bom_detect |
| Related bugs: | #267296 utf16 file detected as binary file |

| Reviewer | Review Type | Date Requested | Status |
|---|---|---|---|
| Martin Packman (community) | | | Needs Information on 2010-01-11 |
| bzr-core | | 2010-01-09 | Pending |
Gary van der Merwe (garyvdm) wrote:
Martin Packman (gz) wrote:
Is this partly inspired by the current thread on python-dev?
<http://
Can you clearly document what the actual heuristic is? It's not clear to me from reading the diff, and it's not the same as the one Notepad uses. Why, for instance, are you not decoding UTF-8 text?
What happens if a UnicodeDecodeError is raised from one of these functions?
Aaron Bentley (abentley) wrote:
review: resubmit
This branch introduces confusion between Unicode characters and bytes
into our core commands.
Merge writes bytes, not Unicode characters, to disk. Decoding the bytes
into Unicode characters means that characters, rather than bytes, will be
written to disk. This only works for characters in the ASCII range, because
Python automatically converts unicode strings to ASCII. Even though it
appears to work with the characters you tested, the merged version will
not be in UTF-16, and that's not a reasonable result.
Similarly, patches are bytestreams, not Unicode streams. They have no
defined encoding. It should be possible to apply a patch using
/bin/patch. But again, by decoding the bytes to Unicode, this branch
makes that impossible.
A branch like this should, at minimum, test that the operations work
correctly with high-bit characters like u'\u1234', and that the input
encoding is preserved in the output.
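A minimal sketch of the kind of round-trip check described above, assuming UTF-16 input that starts with a BOM. The helpers `decode_utf16_with_bom` and `encode_like_input` are hypothetical names used only for illustration and are not bzrlib API; the `replace` call merely stands in for a merge. The sketch is written to run on Python 3, where `u''` literals are still accepted.

```python
import codecs


def decode_utf16_with_bom(data):
    """Return (text, codec_name, bom) for UTF-16 bytes that start with a BOM.

    Hypothetical helper for illustration only.
    """
    for bom, codec in ((codecs.BOM_UTF16_LE, "utf-16-le"),
                       (codecs.BOM_UTF16_BE, "utf-16-be")):
        if data.startswith(bom):
            return data[len(bom):].decode(codec), codec, bom
    raise ValueError("no UTF-16 BOM found")


def encode_like_input(text, codec, bom):
    """Re-encode text with the codec and BOM that the input used."""
    return bom + text.encode(codec)


# Input containing a non-ASCII character, including u'\u1234'.
original = codecs.BOM_UTF16_LE + u"caf\u00e9 \u1234\n".encode("utf-16-le")

text, codec, bom = decode_utf16_with_bom(original)
merged_text = text.replace(u"caf\u00e9", u"tea")  # stand-in for a merge
merged_bytes = encode_like_input(merged_text, codec, bom)

# The output is still UTF-16 LE with a BOM, and u'\u1234' survives.
assert merged_bytes.startswith(codecs.BOM_UTF16_LE)
assert merged_bytes[len(bom):].decode(codec) == u"tea \u1234\n"

# Encoding the merged text as ASCII instead fails for u'\u1234'.
ascii_failed = False
try:
    merged_text.encode("ascii")
except UnicodeEncodeError:
    ascii_failed = True
assert ascii_failed
```

The point of the sketch is that the codec and BOM detected on input are carried along and reused on output, so the bytes written back are in the same encoding as the bytes read.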
Gary van der Merwe wrote:
> Gary van der Merwe has proposed merging lp:~garyvdm/bzr/unicode_bom_detect into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
> Related bugs:
> #267296 utf16 file detected as binary file
> https:/
>
>
> This makes files with a Unicode BOM correctly decoded and displayed in diff, merge, and shelve.
>
>
>
Vincent Ladeuil (vila) wrote:
Gary, I'm marking this proposal as 'Work in progress'. Keep us informed of your progress and feel free to ask for help.
Alexander Belchenko (bialix) wrote:
> Is this partly inspired by the current thread on python-dev?
> <http://
This is the wrong URL. Do you have a better one?
Martin Packman (gz) wrote:
> > Is this partly inspired by the current thread on python-dev?
> > <http://
>
> This is the wrong URL. Do you have a better one?
Something bad seems to have happened to their Mailman; there's a bunch of junk in January. Try this instead, but if it breaks, just look through the archives for the thread "Improve open() to support reading file starting with an unicode BOM":
<http://
Alexander Belchenko (bialix) wrote:
Martin [gz] wrote:
>>> Is this partly inspired by the current thread on python-dev?
>>> <http://
>> This is the wrong URL. Do you have a better one?
>
> Something bad seems to have happened to their Mailman; there's a bunch of junk in January. Try this instead, but if it breaks, just look through the archives for the thread "Improve open() to support reading file starting with an unicode BOM":
> <http://
Thanks.
Unmerged revisions
- 4952. By Gary van der Merwe on 2010-01-09: Update NEWS, and doc strings.
- 4951. By Gary van der Merwe on 2010-01-09: Use text_lines rather than check_text_lines.
- 4950. By Gary van der Merwe on 2010-01-09: Add text_file, which replaces check_text_lines. If a BOM encoding is detected, the lines are decoded, and returned.
- 4949. By Gary van der Merwe on 2010-01-09: Make text_file handle files with a BOM encoding.
- 4948. By Gary van der Merwe on 2010-01-09: Test text_file for files with a BOM encoding.

This makes files with a Unicode BOM correctly decoded and displayed in diff, merge, and shelve.
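
For readers unfamiliar with BOM sniffing, the general technique can be sketched as follows. This is not the code from the branch: the function names `detect_bom` and `decode_if_bom`, the set of BOMs checked, and the choice to return `None` on a decode failure are all assumptions made for illustration. Only the `codecs.BOM_*` constants are real standard-library names.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
# Which encodings to recognise is an assumption of this sketch.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(data):
    """Return (codec_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec, len(bom)
    return None, 0


def decode_if_bom(data):
    """Return the decoded text, or None if there is no BOM or decoding fails.

    Returning None on UnicodeDecodeError is one possible policy (for
    example, to fall back to treating the file as binary); it is an
    assumption here, not the branch's documented behaviour.
    """
    codec, skip = detect_bom(data)
    if codec is None:
        return None
    try:
        return data[skip:].decode(codec)
    except UnicodeDecodeError:
        return None
```

A caller would pass the raw bytes read from the file, for instance `decode_if_bom(codecs.BOM_UTF16_LE + u"hi".encode("utf-16-le"))`, and treat a `None` result as "not BOM-marked text". One known ambiguity of this ordering is that a UTF-16 LE file whose first character is U+0000 looks like UTF-32 LE.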