Bazaar

Merge lp:~bialix/bzr/diff-header-encoding into lp:bzr

diff-header-encoding
Merge into bzr.dev

Proposed by Alexander Belchenko on 2010-05-26

Status:

Merged

Approved by:

Vincent Ladeuil on 2010-06-17

Approved revision:

no longer in the source branch.

Merged at revision:

5305

Proposed branch:

lp:~bialix/bzr/diff-header-encoding

Merge into:

lp:bzr

Diff against target:

222 lines (+78/-11)

7 files modified

NEWS (+6/-0)
bzrlib/builtins.py (+8/-2)
bzrlib/diff.py (+8/-5)
bzrlib/log.py (+3/-1)
bzrlib/osutils.py (+4/-0)
bzrlib/shelf_ui.py (+5/-2)
bzrlib/tests/test_diff.py (+44/-1)

To merge this branch:

bzr merge lp:~bialix/bzr/diff-header-encoding

Medium

Fix Released

Link a bug report

Reviewer	Date Requested	Status
Martin Pool	2010-05-26	Needs Fixing on 2010-06-15
Martin Packman (community)	2010-05-26	Approve on 2010-05-28
Robert Collins (community)		Needs Information on 2010-05-27
John A Meinel	2010-05-26	Approve on 2010-05-26
Review via email: mp+26067@code.launchpad.net

Commit message

Show unicode filenames in diff headers using terminal encoding.

Description of the change

This branch ensure filenames in diff header on windows always will be in mbcs encoding. My next step is to add new command-line option ``--path-encoding`` to some commands to control that encoding.

I would like to get some review on this work to ensure it's heading the right way. Also I would like to know: should I add --path-encoding option in this patch or it could be done as follow-on patch.

Revision history for this message

John A Meinel (jameinel) wrote on 2010-05-26:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Alexander Belchenko wrote:
> Alexander Belchenko has proposed merging lp:~bialix/bzr/diff-header-encoding into lp:bzr.
>
> Requested reviews:
> John A Meinel (jameinel)
> bzr-core (bzr-core)
> Related bugs:
> #382699 diff headers should contain non-ascii filenames in user_encoding, not in utf-8
> https://bugs.launchpad.net/bugs/382699
>
>
> This branch ensure filenames in diff header on windows always will be in mbcs encoding. My next step is to add new command-line option ``--path-encoding`` to some commands to control that encoding.
>
> I would like to get some review on this work to ensure it's heading the right way. Also I would like to know: should I add --path-encoding option in this patch or it could be done as follow-on patch.
>

Looks good to me.

merge: approve

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkv9gWwACgkQJdeBCYSNAANcCgCgreM1BHLlYxRjw4wl/LikySLA
FVwAoL8H4Dq25YI1tjn9w4KoJCHOFPUW
=P51i
-----END PGP SIGNATURE-----

review: Approve

Revision history for this message

Martin Pool (mbp) wrote on 2010-05-26:

You can't assume paths will always be in utf-8 on unix. You should probably use the output encoding.

So the plan is that in the diff output stream the paths will always be mbcs and the body of the diff may be in some other encoding?

Is 'path_encoding' really the ideal name for this? To me it that name seems confusable with the filesystem path encoding.

review: Needs Fixing

Revision history for this message

John A Meinel (jameinel) wrote on 2010-05-26:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> Review: Needs Fixing
> You can't assume paths will always be in utf-8 on unix. You should probably use the output encoding.
>

Prior to this, we always used utf-8, so I don't feel that this is a blocker.

> So the plan is that in the diff output stream the paths will always be mbcs and the body of the diff may be in some other encoding?
>
> Is 'path_encoding' really the ideal name for this? To me it that name seems confusable with the filesystem path encoding.

Well, for diff, it is the encoding of the path portion, but since we
don't transcode the file content, I don't know that we would want to use
a generic "output" encoding.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkv9i7oACgkQJdeBCYSNAANN7wCggQbU+iq69/41viy2LpvoLigO
P5EAoMEZrPxtzLJdOYpm9voqzBPjzMdz
=m8IN
-----END PGP SIGNATURE-----

Revision history for this message

Martin Pool (mbp) wrote on 2010-05-26:

On 27 May 2010 07:00, John A Meinel <email address hidden> wrote:
>> Is 'path_encoding' really the ideal name for this? To me it that name seems confusable with the filesystem path encoding.
>
> Well, for diff, it is the encoding of the path portion, but since we
> don't transcode the file content, I don't know that we would want to use
> a generic "output" encoding.

Within a diff we will have
- actual file content - not transcoded (but potentially could be in
future if we knew the file encoding and the output encoding)
- diff syntax (like +,-,line numbers) - mostly ascii so it's fairly
easy, but would need to be encoded for eg ucs-2
- filenames - presumably should be in the same encoding as the diff
syntax, but more likely to be non-ascii so more important to encode

Also a diff may occur within the context of a merge directive or log
output, and there too I think you would want all the non-file metadata
in the same encoding, wouldn't you?

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message

Robert Collins (lifeless) wrote on 2010-05-27:

So AIUI the problem has been that we were using utf8 for path metadata, and mcbs on windows for other stuff.

Isn't the appropriate thing to do to pick 1 encoding - e.g. the most capable of the filesystem encoding and the terminal encoding (or perhaps least capable), and use that for:
- paths
- diff syntax
- bzr chatter

And raw content for the file content itself (but perhaps escaped to not mangle terminals?)

review: Needs Information

Revision history for this message

Martin Pool (mbp) wrote on 2010-05-27:

On 27 May 2010 11:41, Robert Collins <email address hidden> wrote:
> Review: Needs Information
> So AIUI the problem has been that we were using utf8 for path metadata, and mcbs on windows for other stuff.
>
> Isn't the appropriate thing to do to pick 1 encoding - e.g. the most capable of the filesystem encoding and the terminal encoding (or perhaps least capable), and use that for:
> - paths
> - diff syntax
> - bzr chatter
>
> And raw content for the file content itself (but perhaps escaped to not mangle terminals?)

I was pointing to having one encoding for all of these too, except
that I don't think that's the right heuristic to pick it. afaict if
it's going to a terminal we should use the terminal encoding,
otherwise we should use the output encoding.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message

Robert Collins (lifeless) wrote on 2010-05-27:

Alexander, will that work for you?

Revision history for this message

Alexander Belchenko (bialix) wrote on 2010-05-27:

Robert Collins пишет:
> Review: Needs Information
> So AIUI the problem has been that we were using utf8 for path metadata, and mcbs on windows for other stuff.

What is "other stuff" here? bzr diff produce "binary output", i.e. it
does not try to encode the diff itself, but only (generated by bzr
itself) header information like

=== modified file "foo.txt"
--- foo.txt
+++ foo.txt

If this part is not called header, then please forgive me.

My patch only touches that header.

I don't think we can transcoding file content until we have some sort of
encoding properties for files.

> Isn't the appropriate thing to do to pick 1 encoding - e.g. the most capable of the filesystem encoding and the terminal encoding (or perhaps least capable), and use that for:
> - paths
> - diff syntax
> - bzr chatter
>
> And raw content for the file content itself (but perhaps escaped to not mangle terminals?)

The problem is there is no "only one right choice" on Windows.

If user want to see diff in the terminal then it will be better to use
terminal encoding.

If user want to save diff into file and use it as patch (with GNU patch)
then the best we can do is to use mbcs (or user_encoding).

The problem is bzr can't detect user intention.

Revision history for this message

Alexander Belchenko (bialix) wrote on 2010-05-27:

Robert Collins пишет:
> Alexander, will that work for you?

I don't know. I'm happy with qdiff which does the right thing. I'm
working on this patch because Russian people often asking me about this
problem.

So I've sent RFC mail to ru_bzr ML to collect opinions from other
people. Because in main bzr ML there is still no feedback from non-ascii
people :-/

--
All the dude wanted was his rug back

Revision history for this message

Martin Pool (mbp) wrote on 2010-05-27:

On 27 May 2010 21:31, Alexander Belchenko <email address hidden> wrote:
> Robert Collins пишет:
>> Review: Needs Information
>> So AIUI the problem has been that we were using utf8 for path metadata, and mcbs on windows for other stuff.
>
> What is "other stuff" here? bzr diff produce "binary output", i.e. it
> does not try to encode the diff itself, but only (generated by bzr
> itself) header information like
>
> === modified file "foo.txt"
> --- foo.txt
> +++ foo.txt
>
> If this part is not called header, then please forgive me.
>
> My patch only touches that header.
>
> I don't think we can transcoding file content until we have some sort of
> encoding properties for files.

'other stuff' includes the '@@' '===' '---' and the line numbers, as I
discussed above. They're not very interesting to encode but they do
have an encoding. I think they should be the same as the paths.

>
>> Isn't the appropriate thing to do to pick 1 encoding - e.g. the most capable of the filesystem encoding and the terminal encoding (or perhaps least capable), and use that for:
>> - paths
>> - diff syntax
>> - bzr chatter
>>
>> And raw content for the file content itself (but perhaps escaped to not mangle terminals?)
>
> The problem is there is no "only one right choice" on Windows.
>
> If user want to see diff in the terminal then it will be better to use
> terminal encoding.
>
> If user want to save diff into file and use it as patch (with GNU patch)
> then the best we can do is to use mbcs (or user_encoding).
>
> The problem is bzr can't detect user intention.

We can't tell if stdout is redirected to a file? On Unix we can tell
with isatty().

I think these headings should be in output_encoding, which already has
special logic to decide whether to use the terminal encoding or
something else.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message

Alexander Belchenko (bialix) wrote on 2010-05-28:

Martin Pool пишет:
> On 27 May 2010 21:31, Alexander Belchenko <email address hidden> wrote:
>> Robert Collins пишет:
>>> Review: Needs Information
>>> So AIUI the problem has been that we were using utf8 for path metadata, and mcbs on windows for other stuff.
>> What is "other stuff" here? bzr diff produce "binary output", i.e. it
>> does not try to encode the diff itself, but only (generated by bzr
>> itself) header information like
>>
>> === modified file "foo.txt"
>> --- foo.txt
>> +++ foo.txt
>>
>> If this part is not called header, then please forgive me.
>>
>> My patch only touches that header.
>>
>> I don't think we can transcoding file content until we have some sort of
>> encoding properties for files.
>
> 'other stuff' includes the '@@' '===' '---' and the line numbers, as I
> discussed above. They're not very interesting to encode but they do
> have an encoding. I think they should be the same as the paths.

As I can see '@@' is always ascii and they're generated by difflib
itself, right? I'd better do not touch them without very important reason.

>>> Isn't the appropriate thing to do to pick 1 encoding - e.g. the most capable of the filesystem encoding and the terminal encoding (or perhaps least capable), and use that for:
>>> - paths
>>> - diff syntax
>>> - bzr chatter
>>>
>>> And raw content for the file content itself (but perhaps escaped to not mangle terminals?)
>> The problem is there is no "only one right choice" on Windows.
>>
>> If user want to see diff in the terminal then it will be better to use
>> terminal encoding.
>>
>> If user want to save diff into file and use it as patch (with GNU patch)
>> then the best we can do is to use mbcs (or user_encoding).
>>
>> The problem is bzr can't detect user intention.
>
> We can't tell if stdout is redirected to a file? On Unix we can tell
> with isatty().

On Windows it's not reliable as I can see:

C:\Temp\8>python -c "import sys; print >>sys.stderr, sys.stdout.isatty()"
True

C:\Temp\8>python -c "import sys; print >>sys.stderr,
sys.stdout.isatty()"|less
False

C:\Temp\8>python -c "import sys; print >>sys.stderr,
sys.stdout.isatty()" > NUL
True

C:\Temp\8>python -c "import sys; print >>sys.stderr,
sys.stdout.isatty()" > temp.txt
False

> I think these headings should be in output_encoding, which already has
> special logic to decide whether to use the terminal encoding or
> something else.

Then, I think I should switch to terminal_encoding which is the right
thing for main use case: see the diff on the screen. And provide
--path-encoding for other cases.

Martin Pool пишет:
> On 27 May 2010 21:31, Alexander Belchenko <bialix@ukr.net> wrote:
>> Robert Collins пишет:
>>> Review: Needs Information
>>> So AIUI the problem has been that we were using utf8 for path metadata, and mcbs on windows for other stuff.
>> What is "other stuff" here? bzr diff produce "binary output", i.e. it
>> does not try to encode the diff itself, but only (generated by bzr
>> itself) header information like
>>
>> === modified file "foo.txt"
>> --- foo.txt
>> +++ foo.txt
>>
>> If this part is not called header, then please forgive me.
>>
>> My patch only touches that header.
>>
>> I don't think we can transcoding file content until we have some sort of
>> encoding properties for files.
> 
> 'other stuff' includes the '@@' '===' '---' and the line numbers, as I
> discussed above.  They're not very interesting to encode but they do
> have an encoding.  I think they should be the same as the paths.

As I can see '@@' is always ascii and they're generated by difflib 
itself, right? I'd better do not touch them without very important reason.

>>> Isn't the appropriate thing to do to pick 1 encoding - e.g. the most capable of the filesystem encoding and the terminal encoding (or perhaps least capable), and use that for:
>>>  - paths
>>>  - diff syntax
>>>  - bzr chatter
>>>
>>> And raw content for the file content itself (but perhaps escaped to not mangle terminals?)
>> The problem is there is no "only one right choice" on Windows.
>>
>> If user want to see diff in the terminal then it will be better to use
>> terminal encoding.
>>
>> If user want to save diff into file and use it as patch (with GNU patch)
>> then the best we can do is to use mbcs (or user_encoding).
>>
>> The problem is bzr can't detect user intention.
> 
> We can't tell if stdout is redirected to a file?  On Unix we can tell
> with isatty().

On Windows it's not reliable as I can see:

C:\Temp\8>python -c "import sys; print >>sys.stderr, sys.stdout.isatty()"
True

C:\Temp\8>python -c "import sys; print >>sys.stderr, 
sys.stdout.isatty()"|less
False

C:\Temp\8>python -c "import sys; print >>sys.stderr, 
sys.stdout.isatty()" > NUL
True

C:\Temp\8>python -c "import sys; print >>sys.stderr, 
sys.stdout.isatty()" > temp.txt
False

> I think these headings should be in output_encoding, which already has
> special logic to decide whether to use the terminal encoding or
> something else.

Then, I think I should switch to terminal_encoding which is the right 
thing for main use case: see the diff on the screen. And provide 
--path-encoding for other cases.

Revision history for this message

John A Meinel (jameinel) wrote on 2010-05-28:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Alexander Belchenko wrote:
> Martin Pool пишет:
>> On 27 May 2010 21:31, Alexander Belchenko <email address hidden> wrote:
>>> Robert Collins пишет:
>>>> Review: Needs Information
>>>> So AIUI the problem has been that we were using utf8 for path metadata, and mcbs on windows for other stuff.
>>> What is "other stuff" here? bzr diff produce "binary output", i.e. it
>>> does not try to encode the diff itself, but only (generated by bzr
>>> itself) header information like
>>>
>>> === modified file "foo.txt"
>>> --- foo.txt
>>> +++ foo.txt
>>>
>>> If this part is not called header, then please forgive me.
>>>
>>> My patch only touches that header.
>>>
>>> I don't think we can transcoding file content until we have some sort of
>>> encoding properties for files.
>> 'other stuff' includes the '@@' '===' '---' and the line numbers, as I
>> discussed above. They're not very interesting to encode but they do
>> have an encoding. I think they should be the same as the paths.
>
> As I can see '@@' is always ascii and they're generated by difflib
> itself, right? I'd better do not touch them without very important reason.
>

Martin's point is that if you selected "--encoding=utf-16" then you
should have "\0@\0@\0 \0m\0o\0d\0i...\0f\0o\0o\0.\0t\0x\0t" in the
output, rather than a mix of 8-bit ascii followed by 16-bit unicode chars.

Yes, it turns out that most (all?) 8-bit encodings use ascii as a
subset, but that certainly isn't true for 16-bit encodings.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkv/34AACgkQJdeBCYSNAAOR8QCgzXHT1MsCrZLopOaQpHGKmJNR
fI8AnRMBiSrbj1wKt8GzRiwau/McmriV
=wzCu
-----END PGP SIGNATURE-----

Revision history for this message

Alexander Belchenko (bialix) wrote on 2010-05-28:

John A Meinel пишет:
> Alexander Belchenko wrote:
>> Martin Pool пишет:
>>> On 27 May 2010 21:31, Alexander Belchenko <email address hidden> wrote:
>>>> Robert Collins пишет:
>>>>> Review: Needs Information
>>>>> So AIUI the problem has been that we were using utf8 for path metadata, and mcbs on windows for other stuff.
>>>> What is "other stuff" here? bzr diff produce "binary output", i.e. it
>>>> does not try to encode the diff itself, but only (generated by bzr
>>>> itself) header information like
>>>>
>>>> === modified file "foo.txt"
>>>> --- foo.txt
>>>> +++ foo.txt
>>>>
>>>> If this part is not called header, then please forgive me.
>>>>
>>>> My patch only touches that header.
>>>>
>>>> I don't think we can transcoding file content until we have some sort of
>>>> encoding properties for files.
>>> 'other stuff' includes the '@@' '===' '---' and the line numbers, as I
>>> discussed above. They're not very interesting to encode but they do
>>> have an encoding. I think they should be the same as the paths.
>> As I can see '@@' is always ascii and they're generated by difflib
>> itself, right? I'd better do not touch them without very important reason.
>>
>
> Martin's point is that if you selected "--encoding=utf-16" then you
> should have "\0@\0@\0 \0m\0o\0d\0i...\0f\0o\0o\0.\0t\0x\0t" in the
> output, rather than a mix of 8-bit ascii followed by 16-bit unicode chars.
>
> Yes, it turns out that most (all?) 8-bit encodings use ascii as a
> subset, but that certainly isn't true for 16-bit encodings.

Hmmm. Good point. But I think it won't fly anyway. What we should do
with file content? It still will be just 8-bit binary stream?

Or this makes sense only if the file content itself is utf-16? Right?

Revision history for this message

Alexander Belchenko (bialix) wrote on 2010-05-28:

Alexander Belchenko пишет:
> John A Meinel пишет:
>> Alexander Belchenko wrote:
>>> Martin Pool пишет:
>>>> On 27 May 2010 21:31, Alexander Belchenko <email address hidden> wrote:
>>>>> Robert Collins пишет:
>>>>>> Review: Needs Information
>>>>>> So AIUI the problem has been that we were using utf8 for path metadata, and mcbs on windows for other stuff.
>>>>> What is "other stuff" here? bzr diff produce "binary output", i.e. it
>>>>> does not try to encode the diff itself, but only (generated by bzr
>>>>> itself) header information like
>>>>>
>>>>> === modified file "foo.txt"
>>>>> --- foo.txt
>>>>> +++ foo.txt
>>>>>
>>>>> If this part is not called header, then please forgive me.
>>>>>
>>>>> My patch only touches that header.
>>>>>
>>>>> I don't think we can transcoding file content until we have some sort of
>>>>> encoding properties for files.
>>>> 'other stuff' includes the '@@' '===' '---' and the line numbers, as I
>>>> discussed above. They're not very interesting to encode but they do
>>>> have an encoding. I think they should be the same as the paths.
>>> As I can see '@@' is always ascii and they're generated by difflib
>>> itself, right? I'd better do not touch them without very important reason.
>>>
>> Martin's point is that if you selected "--encoding=utf-16" then you
>> should have "\0@\0@\0 \0m\0o\0d\0i...\0f\0o\0o\0.\0t\0x\0t" in the
>> output, rather than a mix of 8-bit ascii followed by 16-bit unicode chars.
>>
>> Yes, it turns out that most (all?) 8-bit encodings use ascii as a
>> subset, but that certainly isn't true for 16-bit encodings.
>
> Hmmm. Good point. But I think it won't fly anyway. What we should do
> with file content? It still will be just 8-bit binary stream?
>
> Or this makes sense only if the file content itself is utf-16? Right?

It won't work with external diff. I think supporting utf-16 is separate
issue.

Revision history for this message

John A Meinel (jameinel) wrote on 2010-05-28:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...
>>> Martin's point is that if you selected "--encoding=utf-16" then you
>>> should have "\0@\0@\0 \0m\0o\0d\0i...\0f\0o\0o\0.\0t\0x\0t" in the
>>> output, rather than a mix of 8-bit ascii followed by 16-bit unicode chars.
>>>
>>> Yes, it turns out that most (all?) 8-bit encodings use ascii as a
>>> subset, but that certainly isn't true for 16-bit encodings.
>> Hmmm. Good point. But I think it won't fly anyway. What we should do
>> with file content? It still will be just 8-bit binary stream?
>>
>> Or this makes sense only if the file content itself is utf-16? Right?
>
> It won't work with external diff. I think supporting utf-16 is separate
> issue.

Conceptually if someone had utf-16 content, they would then want utf-16
diff + utf-16 headers.

Martin's point is basically about this. I don't know what (if any) 8-bit
encodings don't use the ascii subset, but that might also be an issue.

Short term, this is probably better than it was, As long as it gives us
a way to transition to even better.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkv/6vwACgkQJdeBCYSNAAOorQCcCllgJ6rZZb8M71F+NAS2xjTU
Ge8AoI2NvGUb0kXpAqliAMm+NzAjDE8N
=s/WC
-----END PGP SIGNATURE-----

Revision history for this message

Martin Packman (gz) wrote on 2010-05-28:

Okay, so, there are no perfect answers here, but this is a clear improvement and should be landed.

The problem with diff is that it's half user-readable text format, and half machine-readable patch input. Using the locale byte-encoding on Windows means throwing away information, but the without storing the encodings of the actual files we're already stuffed on correctness anyway (and as Alexander mentions, that means there's no way of generating a sensible UTF-16 diff file, even if such things were taken as valid input anywhere).

The terminal encoding issue is a worthwhile point, and a future enhancement could be to detect where we have a terminal handle rather than a pipe (which might still end up on the terminal, but again, we're stuffed on that), and use the unicode console apis.

+ else:
+ return 'utf8'

As mentioned by Martin Pool, this should probably instead be something like:

+ return locale.getpreferredencoding()

That would actually be fine on win32 too, but we're happy with the lossy downcoding the Python mbcs codec does, right?

review: Approve

Revision history for this message

Alexander Belchenko (bialix) wrote on 2010-06-01:

OK, I've changed default encoding from mbcs to terminal encoding, thus optimizing my patch to use case: "show me a diff on the screen" rather than provide compatibility with GNU diff/patch tools. If user need to save diff to file then user need to change terminal encoding with `chcp` command.

Also I've added a NEWS.

vila and mgz persuaded me to land this patch in its current form and add --path-encoding option later.

Revision history for this message

Alexander Belchenko (bialix) wrote on 2010-06-07:

PING.

Revision history for this message

Andrew Bennetts (spiv) wrote on 2010-06-08:

Thanks for the ping. I've poked Martin (Pool) in person; he's promised to take another look at this.

Revision history for this message

Alexander Belchenko (bialix) wrote on 2010-06-08:

Andrew Bennetts пишет:
> Thanks for the ping. I've poked Martin (Pool) in person; he's promised to take another look at this.

Thank you, mr. lotzman!

Revision history for this message

Vincent Ladeuil (vila) wrote on 2010-06-14:

@poolie: ping

Revision history for this message

Martin Pool (mbp) wrote on 2010-06-15:

Sorry for the lag.

I agree this is probably a step forward except that you really should not just default to utf8 on non-Windows, as Martin[gz] and I said previously in this review. You could just default to output_encoding. But in fact, why not always default to output_encoding, and then on win32 we should arrange for that to be either the terminal encoding or something else, as appropriate.

review: Needs Fixing

Revision history for this message

Alexander Belchenko (bialix) wrote on 2010-06-15:

What do you mean by "output_encoding"?

Revision history for this message

Martin Pool (mbp) wrote on 2010-06-16:

> What do you mean by "output_encoding"?

I was confused by the fact that the value returned by get_terminal_encoding is often referred to as output_encoding. Possibly we should eventually have that separated in the ui output stream.

In short what I'd recommend for now is that you change

+def get_diff_header_encoding():
123 + if sys.platform == 'win32':
124 + return get_terminal_encoding()
125 + else:
126 + return 'utf8'
127 +
128 +

to simply return get_terminal_encoding: that seems at least as likely to be correct on unix? On my machine it returns 'UTF-8'.

Revision history for this message

Vincent Ladeuil (vila) wrote on 2010-06-17:

@Sacha: I think Martin's request is just a tweak, if you agree with that, please land once it's fixed.
I'll mark this proposal approved, make it wip if needed.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk

Subscribers

People subscribed via source and target branches

to all changes:

Alejandro Cornejo2

Alexander Belchenko

Bazaar Codereview Subscribers

Benoit Pierre

Gmood

Karl Bielefeldt

Mahmoud Hassan

Matt Nordhoff

Mohd Fikri Mohd Amin

MrJOHN

Václav Haisman

bzr PQM

vincenzo

to status/vote changes:

amandla2023

1	=== modified file 'NEWS'
2	--- NEWS 2010-06-17 02:06:20 +0000
3	+++ NEWS 2010-06-17 08:55:41 +0000
4	@@ -66,6 +66,12 @@
5	which previously caused "SyntaxError: No command for line".
6	(Martin Pool)
7
8	+* Show unicode filenames in diff headers using terminal encoding.
9	+ (Alexander Belchenko, Bug #382699)
10	+ NOTE for Windows users: If user need to save diff to file then user need to
11	+ change encoding of the terminal to ANSI encoding with command ``chcp XXX``
12	+ (e.g. ``chcp 1251`` for Russian Windows).
13	+
14	Improvements
15	************
16
17
18	=== modified file 'bzrlib/builtins.py'
19	--- bzrlib/builtins.py 2010-06-07 02:38:04 +0000
20	+++ bzrlib/builtins.py 2010-06-17 08:55:41 +0000
21	@@ -1970,11 +1970,15 @@
22	old_branch, new_branch,
23	specific_files, extra_trees) = get_trees_and_branches_to_diff_locked(
24	file_list, revision, old, new, self.add_cleanup, apply_view=True)
25	+ # GNU diff on Windows uses ANSI encoding for filenames
26	+ path_encoding = osutils.get_diff_header_encoding()
27	return show_diff_trees(old_tree, new_tree, sys.stdout,
28	specific_files=specific_files,
29	external_diff_options=diff_options,
30	old_label=old_label, new_label=new_label,
31	- extra_trees=extra_trees, using=using,
32	+ extra_trees=extra_trees,
33	+ path_encoding=path_encoding,
34	+ using=using,
35	format_cls=format)
36
37
38	@@ -3892,8 +3896,10 @@
39	def _do_preview(self, merger):
40	from bzrlib.diff import show_diff_trees
41	result_tree = self._get_preview(merger)
42	+ path_encoding = osutils.get_diff_header_encoding()
43	show_diff_trees(merger.this_tree, result_tree, self.outf,
44	- old_label='', new_label='')
45	+ old_label='', new_label='',
46	+ path_encoding=path_encoding)
47
48	def _do_merge(self, merger, change_reporter, allow_pending, verified):
49	merger.change_reporter = change_reporter
50
51	=== modified file 'bzrlib/diff.py'
52	--- bzrlib/diff.py 2010-05-25 17:27:52 +0000
53	+++ bzrlib/diff.py 2010-06-17 08:55:41 +0000
54	@@ -99,8 +99,8 @@
55	if sequence_matcher is None:
56	sequence_matcher = patiencediff.PatienceSequenceMatcher
57	ud = patiencediff.unified_diff(oldlines, newlines,
58	- fromfile=old_filename.encode(path_encoding),
59	- tofile=new_filename.encode(path_encoding),
60	+ fromfile=old_filename.encode(path_encoding, 'replace'),
61	+ tofile=new_filename.encode(path_encoding, 'replace'),
62	sequencematcher=sequence_matcher)
63
64	ud = list(ud)
65	@@ -713,11 +713,11 @@
66	from_text = _get_text(self.old_tree, from_file_id, from_path)
67	to_text = _get_text(self.new_tree, to_file_id, to_path)
68	self.text_differ(from_label, from_text, to_label, to_text,
69	- self.to_file)
70	+ self.to_file, path_encoding=self.path_encoding)
71	except errors.BinaryFile:
72	self.to_file.write(
73	("Binary files %s and %s differ\n" %
74	- (from_label, to_label)).encode(self.path_encoding))
75	+ (from_label, to_label)).encode(self.path_encoding,'replace'))
76	return self.CHANGED
77
78
79	@@ -920,7 +920,10 @@
80	extra_factories = []
81	if external_diff_options:
82	opts = external_diff_options.split()
83	- def diff_file(olab, olines, nlab, nlines, to_file):
84	+ def diff_file(olab, olines, nlab, nlines, to_file, path_encoding=None):
85	+ """:param path_encoding: not used but required
86	+ to match the signature of internal_diff.
87	+ """
88	external_diff(olab, olines, nlab, nlines, to_file, opts)
89	else:
90	diff_file = internal_diff
91
92	=== modified file 'bzrlib/log.py'
93	--- bzrlib/log.py 2010-06-08 09:50:27 +0000
94	+++ bzrlib/log.py 2010-06-17 08:55:41 +0000
95	@@ -70,6 +70,7 @@
96	diff,
97	errors,
98	foreign,
99	+ osutils,
100	repository as _mod_repository,
101	revision as _mod_revision,
102	revisionspec,
103	@@ -432,8 +433,9 @@
104	else:
105	specific_files = None
106	s = StringIO()
107	+ path_encoding = osutils.get_diff_header_encoding()
108	diff.show_diff_trees(tree_1, tree_2, s, specific_files, old_label='',
109	- new_label='')
110	+ new_label='', path_encoding=path_encoding)
111	return s.getvalue()
112
113	def _create_log_revision_iterator(self):
114
115	=== modified file 'bzrlib/osutils.py'
116	--- bzrlib/osutils.py 2010-06-07 01:13:05 +0000
117	+++ bzrlib/osutils.py 2010-06-17 08:55:41 +0000
118	@@ -1952,6 +1952,10 @@
119	return user_encoding
120
121
122	+def get_diff_header_encoding():
123	+ return get_terminal_encoding()
124	+
125	+
126	def get_host_name():
127	"""Return the current unicode host name.
128
129
130	=== modified file 'bzrlib/shelf_ui.py'
131	--- bzrlib/shelf_ui.py 2010-01-15 05:33:28 +0000
132	+++ bzrlib/shelf_ui.py 2010-06-17 08:55:41 +0000
133	@@ -241,7 +241,8 @@
134	new_tree = self.work_tree
135	old_path = old_tree.id2path(file_id)
136	new_path = new_tree.id2path(file_id)
137	- text_differ = diff.DiffText(old_tree, new_tree, diff_file)
138	+ text_differ = diff.DiffText(old_tree, new_tree, diff_file,
139	+ path_encoding=osutils.get_terminal_encoding())
140	patch = text_differ.diff(file_id, old_path, new_path, 'file', 'file')
141	diff_file.seek(0)
142	return patches.parse_patch(diff_file)
143	@@ -493,7 +494,9 @@
144	new_tree = tt.get_preview_tree()
145	if self.write_diff_to is None:
146	self.write_diff_to = ui.ui_factory.make_output_stream()
147	- diff.show_diff_trees(merger.this_tree, new_tree, self.write_diff_to)
148	+ path_encoding = osutils.get_diff_header_encoding()
149	+ diff.show_diff_trees(merger.this_tree, new_tree, self.write_diff_to,
150	+ path_encoding=path_encoding)
151	tt.finalize()
152
153	def show_changes(self, merger):
154
155	=== modified file 'bzrlib/tests/test_diff.py'
156	--- bzrlib/tests/test_diff.py 2010-05-20 08:08:20 +0000
157	+++ bzrlib/tests/test_diff.py 2010-06-17 08:55:41 +0000
158	@@ -34,6 +34,7 @@
159	)
160	from bzrlib.symbol_versioning import deprecated_in
161	from bzrlib.tests import features
162	+from bzrlib.tests.blackbox.test_diff import subst_dates
163
164
165	class _AttribFeature(tests.Feature):
166	@@ -521,7 +522,6 @@
167	self.assertNotContainsRe(d, r"file 'e'")
168	self.assertNotContainsRe(d, r"file 'f'")
169
170	-
171	def test_binary_unicode_filenames(self):
172	"""Test that contents of files are not encoded in UTF-8 when there
173	is a binary file in the diff.
174	@@ -580,6 +580,49 @@
175	self.assertContainsRe(d, "=== modified file 'mod_%s'"%autf8)
176	self.assertContainsRe(d, "=== removed file 'del_%s'"%autf8)
177
178	+ def test_unicode_filename_path_encoding(self):
179	+ """Test for bug #382699: unicode filenames on Windows should be shown
180	+ in user encoding.
181	+ """
182	+ self.requireFeature(tests.UnicodeFilenameFeature)
183	+ # The word 'test' in Russian
184	+ _russian_test = u'\u0422\u0435\u0441\u0442'
185	+ directory = _russian_test + u'/'
186	+ test_txt = _russian_test + u'.txt'
187	+ u1234 = u'\u1234.txt'
188	+
189	+ tree = self.make_branch_and_tree('.')
190	+ self.build_tree_contents([
191	+ (test_txt, 'foo\n'),
192	+ (u1234, 'foo\n'),
193	+ (directory, None),
194	+ ])
195	+ tree.add([test_txt, u1234, directory])
196	+
197	+ sio = StringIO()
198	+ diff.show_diff_trees(tree.basis_tree(), tree, sio,
199	+ path_encoding='cp1251')
200	+
201	+ output = subst_dates(sio.getvalue())
202	+ shouldbe = ('''\
203	+=== added directory '%(directory)s'
204	+=== added file '%(test_txt)s'
205	+--- a/%(test_txt)s\tYYYY-MM-DD HH:MM:SS +ZZZZ
206	++++ b/%(test_txt)s\tYYYY-MM-DD HH:MM:SS +ZZZZ
207	+@@ -0,0 +1,1 @@
208	++foo
209	+
210	+=== added file '?.txt'
211	+--- a/?.txt\tYYYY-MM-DD HH:MM:SS +ZZZZ
212	++++ b/?.txt\tYYYY-MM-DD HH:MM:SS +ZZZZ
213	+@@ -0,0 +1,1 @@
214	++foo
215	+
216	+''' % {'directory': _russian_test.encode('cp1251'),
217	+ 'test_txt': test_txt.encode('cp1251'),
218	+ })
219	+ self.assertEqualDiff(output, shouldbe)
220	+
221
222	class DiffWasIs(diff.DiffPath):
223