Created on 2008-10-11.06:54:07 by twb, last changed 2011-02-25.11:09:08 by bfr.
msg6305 (view) |
Author: twb |
Date: 2008-10-11.06:54:05 |
|
There's at least one patch in the Darcs darcs repo that is not encoded
in UTF-8. As the output does not specify an encoding, I believe it is
required by the XML standard to be UTF-8 encoded.
$ darcs changes --xml |
xmlstarlet sel -t -m changelog/patches -v @author -n
-:10179: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xFC 0x6E 0x7A 0x6C
<patch author='Daniel B�nzli <daniel.buenzli@epfl.ch>' date='2005112017015
This is in itself a problem, preventing me from working with the darcs
metadata programmatically. But it may be indicative of a very serious
problem -- darcs may not convert patch metadata to a single internal
encoding.
Suppose there's a two-man repo, and the contributors respectively use
Shift JIS and UTF-16. Even if they use -*- coding -*- magic in their
source files to use the same coding there, it might be that "darcs
changes" emits data that's partly encoded as UTF-16, and partly
encoded as JIS!
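
The parser error above can be reproduced without xmlstarlet. Here is a minimal sketch in Python, assuming the output of "darcs changes --xml" has been captured as bytes. The helper name find_invalid_utf8_lines and the sample data are illustrative only, not part of darcs:

```python
def find_invalid_utf8_lines(data: bytes):
    """Return (line_number, raw_line) pairs for lines that are not valid UTF-8."""
    bad = []
    # Splitting on 0x0A is safe: that byte never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Illustrative sample: the second line holds a Latin-1 u-umlaut (byte 0xFC),
# matching the "Bytes: 0xFC 0x6E 0x7A 0x6C" in the parser error.
sample = (
    b"<patch author='twb'>\n"
    b"<patch author='Daniel B\xfcnzli <daniel.buenzli@epfl.ch>'>\n"
)
for number, line in find_invalid_utf8_lines(sample):
    print(number, line)
```

This only reports which lines fail to decode. It cannot say which encoding the offending bytes were originally written in.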
|
msg6315 (view) |
Author: tux_rocker |
Date: 2008-10-11.21:04:11 |
|
Perhaps we can put a declaration in the XML that the encoding is iso-8859-1 (aka
latin1)? There is no such thing as invalid iso-8859-1, and most data in
ASCII-based encodings will look reasonable in iso-8859-1.
I believe I read in a mailing list thread that darcs can't use a consistent
encoding for metadata, because it uses the metadata for hashing precisely
(bit-by-bit) as it got it from the operating system.
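
The claim that there is no invalid iso-8859-1 can be checked quickly. A small sketch using Python's built-in "iso-8859-1" codec (this is an illustration, not darcs code): every possible byte value decodes, and the round trip gives back the identical bytes, so hashes computed over the raw bytes would be unaffected.

```python
# All 256 possible byte values, in order.
every_byte = bytes(range(256))

# Decoding as ISO 8859-1 never raises: each byte maps to exactly one character.
text = every_byte.decode("iso-8859-1")

# The mapping is reversible, so the original bytes can always be recovered.
round_tripped = text.encode("iso-8859-1")

print(len(text), round_tripped == every_byte)
```

Note that "never fails to decode" is not the same as "decodes to what the author typed".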
|
msg6316 (view) |
Author: twb |
Date: 2008-10-12.04:09:28 |
|
On Sat, Oct 11, 2008 at 09:04:14PM -0000, Reinier Lamers wrote:
> I believe I read in a mailing list thread that darcs can't use a
> consistent encoding for metadata, because it uses the metadata for
> hashing precisely (bit-by-bit) as it got it from the operating
> system.
AIUI darcs currently treats everything as byte vectors. This is fine
as long as everyone uses the same character set and encoding.
Unfortunately, while that might be true for small groups, it's not
true for large, international projects like Darcs itself.
Any existing patches recorded by darcs have lost essential
information: the encoding of the metadata. It's impossible to get
this back reliably. So we have two separate issues:
- We need a workaround in order to work with existing multi-encoding
repositories, including Darcs' own repo. As you say below, probably
the best we can do is just throw away any non-ASCII characters :-(
- We need to prevent this from happening in future, by either
0) forcing everyone to use UTF-8. I think we can just dismiss this
as impossible, if only because of Japan.
1) recording the metadata coding as part of the metadata (as done by
MIME for email); or
2) by standardizing on a single coding for internal use (that is,
within the actual patches in _darcs), and converting all user
input to that coding.
The encoding used internally isn't particularly important, but
obvious candidates are UTF-8, UTF-16, Unicode codepoint sequences
and ISO 10646. Since UTF-8 has useful properties for
In both (1) and (2) we need to convert output to the user's
coding, with some kind of sensible behaviour when that's not
possible (e.g. user is using ISO 8859-1 and the patch author's name
contains Greek characters). The iconv(1) tool might be useful as an
example of handling such lossy recoding.
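
A rough sketch of option (2) plus the lossy output step, in Python. It assumes UTF-8 as the internal coding and assumes the input encoding is known at record time; the function names to_internal and for_display are made up for illustration and say nothing about how darcs would actually structure this:

```python
def to_internal(raw: bytes, declared_encoding: str) -> bytes:
    """Convert user input to the single internal coding (UTF-8 assumed here)."""
    return raw.decode(declared_encoding).encode("utf-8")


def for_display(internal: bytes, user_encoding: str) -> bytes:
    """Recode for the user's locale; unrepresentable characters become '?'."""
    return internal.decode("utf-8").encode(user_encoding, errors="replace")


# A Greek author name recorded by someone using ISO 8859-7.
greek_name = "Νίκος"
stored = to_internal(greek_name.encode("iso-8859-7"), "iso-8859-7")

# A UTF-8 user sees the name intact; an ISO 8859-1 user gets placeholders.
print(for_display(stored, "utf-8"))
print(for_display(stored, "iso-8859-1"))
```

Replacing with '?' is only one possible policy. Transliteration (as iconv's //TRANSLIT does) or escaping would be alternatives.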
> Perhaps we can put a declaration in the XML that the encoding is
> iso-8859-1 (aka latin1)? There is no such thing as invalid
> iso-8859-1, and most data in ASCII-based encoding will look
> reasonable in iso-8859-1.
While this might work around the immediate issue, it is not a long
term solution. If you forcibly treat the entire byte vector as some
ASCII-compatible eight-bit encoding (e.g. ISO 8859-1), you will
silently(!) get gibberish for
- any non-ASCII character in all other ASCII-compatible codings,
including UTF-8 and other ISO 8859; and
- *ALL* characters in ASCII-incompatible codings, including the
popular UTF-16 and JIS.
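
To make the two failure cases concrete, here is a small Python illustration (not darcs code) of what forcing ISO 8859-1 onto other encodings yields. No step raises an error, which is exactly the problem:

```python
name = "Bünzli"

# Case 1a: UTF-8 bytes read as ISO 8859-1. ASCII letters survive,
# the two-byte sequence for u-umlaut turns into two wrong characters.
utf8_as_latin1 = name.encode("utf-8").decode("iso-8859-1")
print(utf8_as_latin1)  # BÃ¼nzli

# Case 1b: another ISO 8859 part. Byte 0xB3 is l-with-stroke in
# ISO 8859-2 but superscript three in ISO 8859-1.
latin2_as_latin1 = "ł".encode("iso-8859-2").decode("iso-8859-1")
print(repr(latin2_as_latin1))

# Case 2: UTF-16 is not ASCII-compatible, so even plain ASCII
# letters come out interleaved with NUL characters.
utf16_as_latin1 = "twb".encode("utf-16-le").decode("iso-8859-1")
print(repr(utf16_as_latin1))
```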
----------------------------------------------------------------------
As a real-world case study, I compared the Darcs repo's metadata with
and without invalid UTF-8 characters:
darcs changes --xml >/tmp/x
darcs changes --xml | iconv -c -f utf-8 -t utf-8 >/tmp/y
diff -u /tmp/[xy]
It appears that 'Daniel Bünzli' is using Latin-1 and every other
contributor is using UTF-8, or is using an encoding that happens to
silently convert to gibberish when treated as UTF-8.
If we treat everything as pure ASCII, we can see that there are only
two more cases -- one use of UTF-8 smart quotes, and one UTF-8 ú.
darcs changes --xml >/tmp/x
darcs changes --xml | iconv -c -f ascii -t utf-8 >/tmp/y
diff -u /tmp/[xy]
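
For anyone without iconv to hand, roughly the same comparison can be sketched in Python. The errors="ignore" handler is used here as an approximation of "iconv -c" (both discard what they cannot decode, though I have not checked that they agree byte for byte in every case). The helper names and the sample lines are illustrative:

```python
def drop_invalid(data: bytes, encoding: str) -> bytes:
    """Decode, silently discarding undecodable bytes, and re-encode as UTF-8."""
    return data.decode(encoding, errors="ignore").encode("utf-8")


def changed_lines(data: bytes, encoding: str):
    """Return (before, after) pairs for lines altered by drop_invalid."""
    before = data.split(b"\n")
    after = drop_invalid(data, encoding).split(b"\n")
    # Newlines are ASCII and never dropped, so the line counts match.
    return [(a, b) for a, b in zip(before, after) if a != b]


# Illustrative sample: one Latin-1 line, one UTF-8 line, one ASCII line.
sample = (
    b"<patch author='Daniel B\xfcnzli'>\n"   # Latin-1 u-umlaut
    b"<patch author='x\xc3\xbay'>\n"         # UTF-8 u-acute, placeholder name
    b"<patch author='twb'>\n"
)

print(len(changed_lines(sample, "utf-8")))   # lines that are not valid UTF-8
print(len(changed_lines(sample, "ascii")))   # lines that are not pure ASCII
```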
|
msg6319 (view) |
Author: tux_rocker |
Date: 2008-10-12.11:28:17 |
|
It seems that three years ago, David already stated that he supported converting
metadata to UTF-8 before storing them (your solution #2) in issue33.
|
msg11348 (view) |
Author: kowey |
Date: 2010-06-10.09:00:31 |
|
Reviving this bug and removing superseders.
I think I was wrong to merge this into issue33 (which is sort of an
umbrella bug tracking lots of different XML related issues).
Issue64 has now been resolved by Reinier (yay!), so we have Unicode
patch metadata. In issue33, Trent suggested that raw bytes can be
base64 encoded.
If I understand correctly, the rest of this is just a matter of
implementation?
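
One way the base64 idea could look, as a sketch only. The ("utf-8" / "base64") labels and the function names are hypothetical; nothing here reflects a format actually agreed on in issue33:

```python
import base64


def encode_field(raw: bytes):
    """Return (label, text): plain text if valid UTF-8, else base64 of the raw bytes."""
    try:
        return ("utf-8", raw.decode("utf-8"))
    except UnicodeDecodeError:
        return ("base64", base64.b64encode(raw).decode("ascii"))


def decode_field(label: str, text: str) -> bytes:
    """Invert encode_field, recovering the exact original bytes."""
    if label == "base64":
        return base64.b64decode(text)
    return text.encode("utf-8")


legacy = b"Daniel B\xfcnzli"           # Latin-1 bytes, not valid UTF-8
modern = "Daniel Bünzli".encode("utf-8")

for raw in (legacy, modern):
    label, text = encode_field(raw)
    print(label, text, decode_field(label, text) == raw)
```

The point of the fallback is that old patches stay byte-exact (so hashes still match) while the XML itself remains well-formed.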
|
msg11353 (view) |
Author: kowey |
Date: 2010-06-10.09:28:45 |
|
Err, do we need a good way to deal with issue1863 in order to be able to
fix this properly?
|
msg13758 (view) |
Author: bfr |
Date: 2011-02-25.11:09:06 |
|
My colleague has just been bitten by this one, working on a darcs
plugin to hudson (http://hudson-ci.org/): the Java xml parser is not
able to correctly parse the "darcs changes --xml" output, because
some darcs repositories contain *different* encodings in the metadata
(different patches having been recorded with different locales).
Of the three solutions listed by twb, only (1) and (2) should be
considered ((0) being too unfriendly to users); the decision between
them is purely a matter of implementation (no user-visible difference
AFAICS). Along with a fix to this problem, the --xml output should be
fixed to convert everything to either the user's locale, or else some
standard encoding; in any case the --xml output should be correct xml,
i.e. provide a header specifying the encoding and then make sure the
output is actually valid according to this encoding.
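
As an illustration of "declare the encoding and make the bytes match it", here is a sketch using Python's standard xml.etree.ElementTree (Python 3.8 or later for the xml_declaration argument). The changelog/patch element layout is simplified and the function name changelog_xml is made up; this is not the darcs output code:

```python
import xml.etree.ElementTree as ET


def changelog_xml(authors, encoding="utf-8") -> bytes:
    """Serialize author names with an XML declaration naming the encoding."""
    root = ET.Element("changelog")
    for author in authors:
        ET.SubElement(root, "patch", author=author)
    # The serializer writes the declaration and encodes the body to match it.
    return ET.tostring(root, encoding=encoding, xml_declaration=True)


author = "Daniel Bünzli <daniel.buenzli@epfl.ch>"

for enc in ("utf-8", "iso-8859-1"):
    doc = changelog_xml([author], encoding=enc)
    # A conforming parser reads the declaration and recovers the same text.
    parsed = ET.fromstring(doc).find("patch").get("author")
    print(enc, parsed == author)
```

Either output encoding parses correctly because the header and the bytes agree. The bug in this issue is precisely that they do not.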
Furthermore, darcs should offer an easy way for users to fix/upgrade
existing repositories with mixed encodings to the new scheme. The
fix/upgrade procedure should let the user declare what the default
encoding for patch metadata in the existing repo is, but should also
prompt her (for another encoding) if conversion is impossible.
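
The upgrade step described above might be sketched as follows. The function upgrade_field and the ask_user callback are hypothetical names; a real tool would prompt interactively where this example supplies a canned answer:

```python
def upgrade_field(raw: bytes, default_encoding: str, ask_user) -> str:
    """Decode with the declared default; on failure, ask for another encoding."""
    encoding = default_encoding
    while True:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # ask_user receives the offending bytes and the encoding that
            # failed, and returns the name of the next encoding to try.
            encoding = ask_user(raw, encoding)


def canned_answer(raw, failed_encoding):
    """Stand-in for an interactive prompt: always suggests ISO 8859-1."""
    return "iso-8859-1"


# Repo default declared as UTF-8, but this patch was recorded in Latin-1.
print(upgrade_field(b"Daniel B\xfcnzli", "utf-8", canned_answer))
# A patch that already matches the default is converted without prompting.
print(upgrade_field("Daniel Bünzli".encode("utf-8"), "utf-8", canned_answer))
```

One caveat with this approach: if the user declares an eight-bit default such as ISO 8859-1, decoding never fails, so the prompt never fires and wrongly encoded patches pass through silently.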
|
Date | User | Action | Args
2008-10-11 06:54:07 | twb | create |
2008-10-11 21:04:13 | tux_rocker | set | priority: bug; nosy: + tux_rocker; status: unread -> unknown; messages: + msg6315
2008-10-12 04:09:31 | twb | set | nosy: darcs-users, kowey, dagit, simon, twb, thorkilnaur, tux_rocker, dmitry.kurochkin; messages: + msg6316
2008-10-12 11:28:18 | tux_rocker | set | status: unknown -> duplicate; nosy: darcs-users, kowey, dagit, simon, twb, thorkilnaur, tux_rocker, dmitry.kurochkin; superseder: + wish: improve "darcs --xml"; messages: + msg6319
2009-02-06 02:09:35 | twb | set | nosy: darcs-users, kowey, dagit, simon, twb, thorkilnaur, tux_rocker, dmitry.kurochkin; superseder: + Should store patch metadata in utf-8
2009-08-10 23:48:15 | admin | set | nosy: - dagit
2009-08-25 17:31:01 | admin | set | nosy: + darcs-devel, - simon
2009-08-27 14:15:31 | admin | set | nosy: darcs-users, kowey, darcs-devel, twb, thorkilnaur, tux_rocker, dmitry.kurochkin
2010-06-10 09:00:32 | kowey | set | status: duplicate -> needs-implementation; nosy: - darcs-users, darcs-devel, thorkilnaur; superseder: - wish: improve "darcs --xml", Should store patch metadata in utf-8; messages: + msg11348
2010-06-10 09:07:10 | kowey | link | issue33 superseder
2010-06-10 09:28:46 | kowey | set | status: needs-implementation -> waiting-for; messages: + msg11353; superseder: + unicode filenames
2011-02-25 11:09:08 | bfr | set | messages: + msg13758