darcs

Issue 1143 darcs changes --xml is not consistently encoded

Title darcs changes --xml is not consistently encoded
Priority bug Status waiting-for
Milestone Resolved in
Superseder unicode filenames
View: 1863
Nosy List dmitry.kurochkin, kowey, tux_rocker, twb
Assigned To
Topics

Created on 2008-10-11.06:54:07 by twb, last changed 2011-02-25.11:09:08 by bfr.

Messages
msg6305 (view) Author: twb Date: 2008-10-11.06:54:05
There's at least one patch in the Darcs repo that is not encoded
in UTF-8.  As the output does not specify an encoding, I believe it is
required by the XML standard to be UTF-8 encoded.

  $ darcs changes --xml |
    xmlstarlet sel -t -m changelog/patches -v @author -n
  -:10179: parser error : Input is not proper UTF-8, indicate encoding !
  Bytes: 0xFC 0x6E 0x7A 0x6C
  <patch author='Daniel B�nzli &lt;daniel.buenzli@epfl.ch&gt;' date='2005112017015

This is in itself a problem, preventing me from working with the darcs
metadata programmatically.  But it may be indicative of a very serious
problem -- darcs may not convert patch metadata to a single internal
encoding.
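
The parser error above can be reproduced without xmlstarlet. A minimal sketch in Python, assuming only the four bytes reported in the error message (0xFC 0x6E 0x7A 0x6C); the variable names are illustrative and nothing here is darcs code:

```python
# The bytes reported by the parser: a Latin-1 "ü" followed by "nzl".
raw = bytes([0xFC, 0x6E, 0x7A, 0x6C])

# Decoding as Latin-1 always succeeds and recovers the intended text.
as_latin1 = raw.decode("iso-8859-1")

# Decoding as UTF-8 fails, because 0xFC is never a valid UTF-8 lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), utf8_ok)
```

This suggests the author field was recorded in Latin-1 and written out byte-for-byte, which is what a strict XML parser rejects.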

Suppose there's a two-man repo, and the contributors respectively use
Shift JIS and UTF-16.  Even if they use -*- coding -*- magic in their
source files to use the same coding there, it might be that "darcs
changes" emits data that's partly encoded as UTF-16, and partly
encoded as JIS!
msg6315 (view) Author: tux_rocker Date: 2008-10-11.21:04:11
Perhaps we can put a declaration in the XML that the encoding is iso-8859-1 (aka
latin1)? There is no such thing as invalid iso-8859-1, and most data in
ASCII-based encoding will look reasonable in iso-8859-1.

I believe I read in a mailing list thread that darcs can't use a consistent
encoding for metadata, because it uses the metadata for hashing precisely
(bit-by-bit) as it got it from the operating system.
msg6316 (view) Author: twb Date: 2008-10-12.04:09:28
On Sat, Oct 11, 2008 at 09:04:14PM -0000, Reinier Lamers wrote:
> I believe I read in a mailing list thread that darcs can't use a
> consistent encoding for metadata, because it uses the metadata for
> hashing precisely (bit-by-bit) as it got it from the operating
> system.

AIUI darcs currently treats everything as byte vectors.  This is fine
as long as everyone uses the same character set and encoding.
Unfortunately, while that might be true for small groups, it's not
true for large, international projects like Darcs itself.

Any existing patches recorded by darcs have lost essential
information: the encoding of the metadata.  It's impossible to get
this back reliably.  So we have two separate issues:

- We need a work around in order to work with existing multi-encoding
  repositories, including Darcs' own repo.  As you say below, probably
  the best we can do is just throw away any non-ASCII characters :-(

- We need to prevent this from happening in future, by either

  0) forcing everyone to use UTF-8.  I think we can just dismiss this
     as impossible, if only because of Japan.

  1) recording the metadata coding as part of the metadata (as done by
     MIME for email); or

  2) by standardizing on a single coding for internal use (that is,
     within the actual patches in _darcs), and converting all user
     input to that coding.

     The encoding used internally isn't particularly important, but
     obvious candidates are UTF-8, UTF-16, Unicode codepoint sequences
     and ISO 10646.  Since UTF-8 has useful properties for

  In both (1) and (2) we need to convert output to the user's
  coding, with some kind of sensible behaviour when that's not
  possible (e.g. user is using ISO 8859-1 and the patch author's name
  contains Greek characters).  The iconv(1) tool might be useful as an
  example of handling such lossy recoding.
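
  The lossy recoding case can be sketched in Python, whose codec error
  handlers offer choices comparable to iconv's. The author name below is
  a hypothetical Greek name written with escapes; the handler names are
  standard Python, not anything darcs provides:

```python
# Hypothetical author name in Greek, given as Unicode escapes.
author = "\u039d\u03af\u03ba\u03bf\u03c2"

# Strict conversion to ISO 8859-1 is impossible for Greek letters.
try:
    author.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Lossy alternatives: substitute "?", drop the characters (much like
# iconv -c), or emit XML numeric character references.
replaced = author.encode("iso-8859-1", errors="replace")
dropped = author.encode("iso-8859-1", errors="ignore")
as_refs = author.encode("iso-8859-1", errors="xmlcharrefreplace")

print(strict_ok, replaced, dropped, as_refs)
```

  Which fallback is sensible depends on the output: character
  references only make sense inside XML, while "?" is at least visible
  to a user reading a terminal.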

> Perhaps we can put a declaration in the XML that the encoding is
> iso-8859-1 (aka latin1)? There is no such thing as invalid
> iso-8859-1, and most data in ASCII-based encoding will look
> reasonable in iso-8859-1.

While this might work around the immediate issue, it is not a long
term solution.  If you forcibly treat the entire byte vector as some
ASCII-compatible eight-bit encoding (e.g. ISO 8859-1), you will
silently(!) get gibberish for

- any non-ASCII character in all other ASCII-compatible codings,
  including UTF-8 and other ISO 8859; and

- *ALL* characters in ASCII-incompatible codings, including the
  popular UTF-16 and JIS.
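
Both failure modes can be shown with a short Python sketch. It assumes the name from the parser error as sample data; no error is raised in either case, which is exactly the "silent" part:

```python
name = "B\u00fcnzli"  # "Bünzli"

# UTF-8 bytes read as ISO 8859-1: the ASCII letters survive, but the
# two-byte sequence for "ü" turns into two unrelated characters.
utf8_as_latin1 = name.encode("utf-8").decode("iso-8859-1")

# UTF-16 bytes read as ISO 8859-1: every character, ASCII included,
# is followed by a NUL character.
utf16_as_latin1 = name.encode("utf-16-le").decode("iso-8859-1")

print(repr(utf8_as_latin1))
print(repr(utf16_as_latin1))
```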

----------------------------------------------------------------------

As a real-world case study, I compared the Darcs repo's metadata with
and without invalid UTF-8 characters:

    darcs changes --xml >/tmp/x
    darcs changes --xml | iconv -c -f utf-8 -t utf-8 >/tmp/y
    diff -u /tmp/[xy]

It appears that 'Daniel Bünzli' is using Latin-1 and every other
contributor is using UTF-8, or is using an encoding that happens to
silently convert to gibberish when treated as UTF-8.

If we treat everything as pure ASCII, we can see that there are only
two more cases -- one use of UTF-8 smart quotes, and one UTF-8 ú.

    darcs changes --xml >/tmp/x
    darcs changes --xml | iconv -c -f ascii -t utf-8 >/tmp/y
    diff -u /tmp/[xy]
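
The two iconv/diff comparisons above can also be approximated by checking each line directly. A rough sketch in Python, where the function names and the sample input are hypothetical stand-ins for real "darcs changes --xml" output:

```python
def invalid_utf8_lines(data):
    """Return the lines of data (bytes) that are not valid UTF-8."""
    bad = []
    for line in data.split(b"\n"):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(line)
    return bad


def non_ascii_lines(data):
    """Return the lines of data (bytes) containing any non-ASCII byte."""
    return [line for line in data.split(b"\n") if not line.isascii()]


# Hypothetical sample: one Latin-1 line, one UTF-8 line, one ASCII line.
sample = (
    b"<patch author='Daniel B\xfcnzli'>\n"
    b"<patch author='Daniel B\xc3\xbcnzli'>\n"
    b"<patch author='Daniel Buenzli'>"
)

print(len(invalid_utf8_lines(sample)), len(non_ascii_lines(sample)))
```

Unlike the iconv pipeline, this reports whole lines rather than a diff, but it finds the same offending records.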
msg6319 (view) Author: tux_rocker Date: 2008-10-12.11:28:17
It seems that three years ago, David already stated that he supported converting
metadata to UTF-8 before storing them (your solution #2) in issue33.
msg11348 (view) Author: kowey Date: 2010-06-10.09:00:31
Reviving this bug and removing superseders.  

I think I was wrong to merge this into issue33 (which is sort of an
umbrella bug tracking lots of different XML related issues).

Issue64 has now been resolved by Reinier (yay!), so we have Unicode
patch metadata.  In issue33, Trent suggested that raw bytes can be
base64 encoded.

If I understand correctly, the rest of this is just a matter of
implementation?
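
For reference, the base64 idea amounts to something like the following Python sketch. The sample bytes are taken from the parser error earlier in this issue; how darcs would mark such a field in its XML is not specified here and is left out:

```python
import base64

# Raw author bytes that are not valid UTF-8 (Latin-1 "ü" is 0xFC).
raw_author = b"Daniel B\xfcnzli"

# Base64 turns arbitrary bytes into pure ASCII, which is safe to embed
# in XML under any declared encoding.
encoded = base64.b64encode(raw_author).decode("ascii")

# A consumer can recover the exact original bytes.
decoded = base64.b64decode(encoded)

print(encoded, decoded == raw_author)
```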
msg11353 (view) Author: kowey Date: 2010-06-10.09:28:45
Err, do we need a good way to deal with issue1863 in order to be able to
fix this properly?
msg13758 (view) Author: bfr Date: 2011-02-25.11:09:06
My colleague has just been bitten by this one, working on a darcs
plugin to hudson (http://hudson-ci.org/): the Java xml parser is not
able to correctly parse the "darcs changes --xml" output, because
some darcs repositories contain *different* encodings in the metadata
(different patches having been recorded with different locales).

Of the three solutions listed by twb, only (1) and (2) should be
considered ((0) being too unfriendly to users); the decision between
them is purely a matter of implementation (no user-visible difference
AFAICS). Along with a fix to this problem, the --xml output should be
fixed to convert everything to either the user's locale, or else some
standard encoding; in any case the --xml output should be correct xml,
i.e. provide a header specifying the encoding and then make sure the
output is actually valid according to this encoding.
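
What "correct xml" would look like can be sketched in Python. This is an illustration only, not how darcs is implemented: the element names, the function name, and the policy of trying UTF-8 first and falling back to an assumed ISO 8859-1 are all assumptions. The xml_declaration argument needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def changes_xml(raw_authors, fallback="iso-8859-1"):
    """Build UTF-8 XML with an explicit declaration from raw author bytes."""
    root = ET.Element("changelog")
    for raw in raw_authors:
        try:
            author = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Assumed fallback; the real encoding was never recorded.
            author = raw.decode(fallback)
        ET.SubElement(root, "patch", author=author)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# One author recorded as Latin-1, one recorded as UTF-8.
xml_bytes = changes_xml([b"Daniel B\xfcnzli", b"Daniel B\xc3\xbcnzli"])
print(xml_bytes.decode("utf-8"))
```

The output starts with a declaration naming the encoding, and both patches carry the same author string, so a strict parser such as the Java one accepts it.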

Furthermore, darcs should offer an easy way for users to fix/upgrade
existing repositories with mixed encodings to the new scheme. The
fix/upgrade procedure should let the user declare what the default
encoding for patch metadata in the existing repo is, but should also
prompt her (for another encoding) if conversion is impossible.
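
The proposed upgrade step could look roughly like this Python sketch. The function name and the ask_user callback are hypothetical; a real tool would prompt interactively and then store the result in the new standard encoding:

```python
def upgrade_metadata(raw, default_encoding, ask_user):
    """Decode raw metadata bytes, asking for an encoding only on failure.

    ask_user is a hypothetical callback that receives the raw bytes and
    returns the name of the encoding the user chose for them.
    """
    try:
        return raw.decode(default_encoding)
    except UnicodeDecodeError:
        return raw.decode(ask_user(raw))


# The repository default is declared as UTF-8; one patch is Latin-1.
fixed = upgrade_metadata(b"Daniel B\xfcnzli", "utf-8",
                         lambda raw: "iso-8859-1")
print(fixed)
```

Note that declaring ISO 8859-1 as the default would never trigger the prompt, since every byte sequence decodes in that encoding, so wrong guesses would go unnoticed.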
History
Date User Action Args
2008-10-11 06:54:07 twb create
2008-10-11 21:04:13 tux_rocker set priority: bug
nosy: + tux_rocker
status: unread -> unknown
messages: + msg6315
2008-10-12 04:09:31 twb set nosy: darcs-users, kowey, dagit, simon, twb, thorkilnaur, tux_rocker, dmitry.kurochkin
messages: + msg6316
2008-10-12 11:28:18 tux_rocker set status: unknown -> duplicate
nosy: darcs-users, kowey, dagit, simon, twb, thorkilnaur, tux_rocker, dmitry.kurochkin
superseder: + wish: improve "darcs --xml"
messages: + msg6319
2009-02-06 02:09:35 twb set nosy: darcs-users, kowey, dagit, simon, twb, thorkilnaur, tux_rocker, dmitry.kurochkin
superseder: + Should store patch metadata in utf-8
2009-08-10 23:48:15 admin set nosy: - dagit
2009-08-25 17:31:01 admin set nosy: + darcs-devel, - simon
2009-08-27 14:15:31 admin set nosy: darcs-users, kowey, darcs-devel, twb, thorkilnaur, tux_rocker, dmitry.kurochkin
2010-06-10 09:00:32 kowey set status: duplicate -> needs-implementation
nosy: - darcs-users, darcs-devel, thorkilnaur
superseder: - wish: improve "darcs --xml", Should store patch metadata in utf-8
messages: + msg11348
2010-06-10 09:07:10 kowey link issue33 superseder
2010-06-10 09:28:46 kowey set status: needs-implementation -> waiting-for
superseder: + unicode filenames
messages: + msg11353
2011-02-25 11:09:08 bfr set messages: + msg13758