darcs

Issue 580 list files converts names to utf-8 even if they already are utf-8

Title: list files converts names to utf-8 even if they already are utf-8
Priority: bug
Status: resolved
Milestone:
Resolved in:
Superseder:
Nosy List: darcs-devel, dmitry.kurochkin, kowey, ppessi, thorkilnaur, tommy
Assigned To:
Topics:

Created on 2008-01-09.04:19:29 by ppessi, last changed 2009-08-27.14:08:00 by admin.

Files
File name Uploaded Type
list-files-utf8 ppessi, 2008-01-09.04:19:21 application/octet-stream
Messages
msg2385 Author: ppessi Date: 2008-01-09.04:19:21
When the repo contents are shown with darcs list, the file names that
contain 8-bit chars (UTF-8 or ISO-8859-* or whatever) are converted to
UTF-8 as if they are ISO-8859-1.

For example, a file named "Ääliö älä lyö ööliä läikkyy" in ISO-8859-1 is the
byte string \e4\e4\6c\69\f6 ...
It is shown with, e.g., darcs changes --summary as quoted bytestring
[_\e4_][_\e4_]li[_\f6_] ...

With darcs list files it is shown as
./[_\c3_][_\a4_][_\c3_][_\a4_]li[_\c3_][_\b6_] (in other words, it has
been converted from ISO-8859-1 to UTF-8).

If the file name is encoded in UTF-8, it has the bytestring
\c3\84\c3\a4\6c\69\c3\b6 (each accented char is now encoded in two
bytes). It is shown with, e.g., darcs changes --summary as the quoted
bytestring [_\c3_][_\84_][_\c3_][_\a4_]li[_\c3_][_\b6_]

However, with darcs list files it is shown as

[_\c3_][_\83_][_\c2_][_\84_][_\c3_][_\83_][_\c2_][_\a4_]li[_\c3_][_\83_][_\c2_][_\b6_]

that is, darcs list assumes that the bytestring is an ISO-8859-1 string
and converts it into UTF-8 a second time.
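
For illustration, both conversions described above can be reproduced outside darcs. The following is a minimal Python sketch, not darcs code; `latin1_to_utf8` is a hypothetical helper name, and the byte strings are the ones quoted in this report.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Treat every byte as one ISO-8859-1 character, then encode as UTF-8.

    This mimics the faulty assumption described in the report; it is an
    illustration only, not the code darcs uses.
    """
    return raw.decode("latin-1").encode("utf-8")


latin1_name = b"\xe4\xe4li\xf6"            # ISO-8859-1 bytes from the report
utf8_name = b"\xc3\x84\xc3\xa4li\xc3\xb6"  # UTF-8 bytes from the report

# ISO-8859-1 input: each 8-bit byte becomes a two-byte UTF-8 sequence.
print(latin1_to_utf8(latin1_name).hex())   # c3a4c3a46c69c3b6

# UTF-8 input: every byte of each two-byte sequence is encoded again,
# so each accented character ends up as four bytes.
print(latin1_to_utf8(utf8_name).hex())     # c383c284c383c2a46c69c383c2b6
```

The second result corresponds byte for byte to the doubly encoded name shown by darcs list files.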

Script output from a UTF-8 terminal is attached.
msg2413 Author: droundy Date: 2008-01-10.21:04:06
On Wed, Jan 09, 2008 at 04:19:29AM -0000, Pekka Pessi wrote:
> When the repo contents are shown with darcs list, the file names that
> contain 8-bit chars (UTF-8 or ISO-8859-* or whatever) are converted to
> UTF-8 as if they are ISO-8859-1.

I've just fixed (although it hasn't hit the central repo yet) this bug in
list files.  Thanks for the report!

However, the similar bug in the output of whatsnew, etc, has not yet been
fixed.  Many of these issues date from my very naive assumption (long ago!)
that because Char is a 32 bit value, it is therefore a unicode value. :(

It would be nice to fix this for darcs-2, but it's a bit awkward.  At a
minimum, I think I could hack things together so that darcs' text output
doesn't have these faulty conversions.  Changing the on-disk format is a
bit more awkward.
-- 
David Roundy
Department of Physics
Oregon State University
msg2417 Author: droundy Date: 2008-01-10.22:55:04
I've just pushed code making the --darcs-2 format store and display
filenames as raw bytes, rather than trying to convert to utf-8.  This means
anyone who already has a darcs repository in the experimental --darcs-2
format with non-ascii filenames will run into trouble.  It also means that
I'd love to have some tests for this in the test suite, or at least some
testing by users who use non-ascii filenames.  Thanks!
-- 
David Roundy
Department of Physics
Oregon State University
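
As a rough illustration of what displaying filenames as raw bytes means, the quoted notation seen earlier in this thread can be produced directly from the bytes with no encoding conversion at all. This is a Python sketch under assumptions: `quote_bytes` is a hypothetical helper, and the exact set of bytes darcs chooses to quote is not stated in this thread, so the printable-ASCII rule below is a guess.

```python
def quote_bytes(raw: bytes) -> str:
    """Render a filename byte by byte, quoting non-ASCII bytes as [_\\xx_].

    Assumption: printable ASCII (0x20..0x7e) is shown as-is and everything
    else is quoted. No decode/encode step is involved, so the output does
    not depend on which encoding the filename was written in.
    """
    parts = []
    for b in raw:
        if 0x20 <= b < 0x7F:
            parts.append(chr(b))
        else:
            parts.append(f"[_\\{b:02x}_]")
    return "".join(parts)


# The ISO-8859-1 and UTF-8 names from the original report.
print(quote_bytes(b"\xe4\xe4li\xf6"))
print(quote_bytes(b"\xc3\x84\xc3\xa4li\xc3\xb6"))
```

A test for the test suite could compare output of this kind for the same name in both encodings and check that neither has been re-encoded.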
History
Date                 User     Action  Args
2008-01-09 04:19:30  ppessi   create
2008-01-10 21:04:09  droundy  set     status: unread -> unknown
                                      nosy: + darcs-devel
                                      messages: + msg2413
2008-01-10 22:55:05  droundy  set     messages: + msg2417
2008-01-10 23:45:14  droundy  set     status: unknown -> resolved-in-unstable
2008-09-04 21:31:51  admin    set     status: resolved-in-unstable -> resolved
                                      nosy: + dagit
2009-08-06 17:33:26  admin    set     nosy: + markstos, jast, Serware, dmitry.kurochkin, zooko, mornfall, simon, thorkilnaur, - droundy, ppessi
2009-08-06 20:30:54  admin    set     nosy: - beschmi
2009-08-10 22:10:30  admin    set     nosy: + ppessi, - markstos, zooko, jast, Serware, mornfall
2009-08-11 00:04:23  admin    set     nosy: - dagit
2009-08-25 17:48:10  admin    set     nosy: - simon
2009-08-27 14:08:00  admin    set     nosy: tommy, kowey, darcs-devel, ppessi, thorkilnaur, dmitry.kurochkin