darcs

Patch 37 Store textual patch metadata encoded in UTF-8

Title Store textual patch metadata encoded in UTF-8
Superseder Nosy List darcs-users, galbolle, kowey, tux_rocker, twb
Related Issues Should store patch metadata in utf-8
View: 64
Status accepted Assigned To galbolle
Milestone

Created on 2009-11-01.17:55:50 by tux_rocker, last changed 2009-12-24.15:49:56 by tux_rocker.

Files
File name Status Uploaded Type Edit Remove
autodetect-utf8.dpatch tux_rocker, 2009-11-15.15:58:47 text/x-darcs-patch
conflict-fix.dpatch galbolle, 2009-12-18.13:55:49 text/x-darcs-patch
utf8-bundle-amended.dpatch tux_rocker, 2009-11-09.22:11:24 text/x-darcs-patch
utf8-metadata-20091122-manypatches.dpatch tux_rocker, 2009-11-22.18:00:50 text/x-darcs-patch
utf8-metadata-20091122-onebigpatch.dpatch tux_rocker, 2009-11-22.18:06:15 text/x-darcs-patch
utf8-metadata.diff tux_rocker, 2009-11-01.17:55:44 text/x-patch
utf8-metadata.dpatch tux_rocker, 2009-11-01.17:55:15 text/x-darcs-patch
utf8-metadata.unified.diff twb, 2009-11-02.04:38:04 text/x-diff
See mailing list archives for discussion on individual patches.
Messages
msg9163 (view) Author: tux_rocker Date: 2009-11-01.17:55:44
Hi all,

Here is the patch that makes darcs store patch name, author and log as UTF-8 encoded strings.

By itself, this has no visible advantage. However, it makes it possible to:

* display non-ASCII patch and author names consistently in different locales
* produce closer-to-valid XML if the --xml-output option is given
* use --match with non-ASCII patch and author names on patches that were
  recorded in another locale

The patches makes darcs differentiate between old-style and UTF-8 patches by
adding an 'Ignore-this: UTF-8' line to the patch log, and looking for that
line when interpreting the patch.

The patches make darcs treat old-style patches as locale-encoded. This may be
controversial. I did this because I suppose that people have already been
using darcs with non-ASCII locales, using its encoding ignorance as a feature.
They may have recorded patches with metadata in UTF-8 or other multibyte
encodings, and it would be sorry for them to start displaying such metadata
interpreted as ASCII or latin-1.

I attached a diff between the starting point and the end result of these
patches, as they contain some hunks that are reverted in later patches.

Regards,
Reinier
Attachments
msg9178 (view) Author: twb Date: 2009-11-02.04:38:04
John Cowan is a Unicode guru, and I asked him to take a quick look at
this patch.  Here are his comments:

twb> Darcs is a VCS implemented in Haskell.  There is a patch to make
twb> it understand (in some places) Unicode, instead of just treating
twb> everything as sequences of bytes.  Can you recommend someone to
twb> review this patch from a "knows about Unicode" perspective?

jcowan> I'll look at it myself.  Though I'm no Haskell maven, so I may
jcowan> have questions.  [...]  I'm looking at the .diff.  I can sort
jcowan> of imagine the original.  Hmm, the XML escaping looks
jcowan> problematic, unless it's something about laziness I don't
jcowan> understand.  You need to convert & to & first before the
jcowan> others are done.  Slavoj Žižek!  Cool! [...]  Except for the
jcowan> XML escaping, all looks fine.

I've also attached a copy of the bundle in unified format, for other
reviewers who aren't familiar with Darcs' internal format (.dpatch)
and don't like context diffs :-)
Attachments
msg9182 (view) Author: tux_rocker Date: 2009-11-02.09:05:56
It looks like John Cowan was reading the '.' operator the wrong way. The replace
of '&' by '&quot' is the last line of the function, but it is the first
operation that takes place.
msg9229 (view) Author: tux_rocker Date: 2009-11-09.22:11:24
Here's the bundle again. This serves two purposes:

(1) Hopefully the web interface will not corrupt the bundle (it contains lots of
funny characters that may get mutilated by eager MUAs and MTAs)
(2) It resolves two conflicts with recent developments
Attachments
msg9307 (view) Author: tux_rocker Date: 2009-11-15.15:57:34
Here is the same patch bundle as utf8-bundle-amended.dpatch, but now with
another patch that adds UTF-8 autodetection.

This can not go in until a new version of haskeline is released, which requires
a new version of utf8-string, that fixes a bug that is triggered by the unit
tests that this bundle does.

I am slightly worried that UTF-8 autodetection *may* give false positives
because we use such short strings to do our autodetection on. OTOH, we assume
that encodings are ASCII-compatible anyway, so sending ASCII through
unpackPSFromUTF8 instead of decodeLocale should do no harm.
msg9308 (view) Author: tux_rocker Date: 2009-11-15.15:58:47
Here is the same patch bundle as utf8-bundle-amended.dpatch, but now with
another patch that adds UTF-8 autodetection.

This can not go in until a new version of haskeline is released, which requires
a new version of utf8-string, that fixes a bug that is triggered by the unit
tests that this bundle does.

I am slightly worried that UTF-8 autodetection *may* give false positives
because we use such short strings to do our autodetection on. OTOH, we assume
that encodings are ASCII-compatible anyway, so sending ASCII through
unpackPSFromUTF8 instead of decodeLocale should do no harm.
Attachments
msg9463 (view) Author: tux_rocker Date: 2009-11-22.18:00:50
Again, a new patch file that can be applied to current head. I'll add another
one that is one big patch that contains the same changes as the 22 patches in
this bundle.

Changes: conflict resolution, remove a compiler warning in Darcs/Utils.hs
Attachments
msg9469 (view) Author: kowey Date: 2009-11-23.15:37:13
This patch was from before darcs-users was automatically made nosy on new bugs,
so people might not have seen Reinier's amended version.

Note also that the new Haskeline has been released (thanks to Eric M and Judah
for their rapid response!) so we can apply this once the reviewer OKs the bundle.

I had a look at the first version of this, so to maximise eyeball application,
I'd like the second reviewer to be somebody else.
msg9550 (view) Author: kowey Date: 2009-12-06.18:04:16
I propose that Florent take on this bundle for review.  Somebody else will have
to push it on his behalf.  I suspect that this is bundle will be enough for us
to get Florent on the Review Team officially.
msg9646 (view) Author: galbolle Date: 2009-12-18.13:42:55
It looks like the decision to go with the intended behaviour of this patch has
already been taken, that is, we should store new meta-data as UTF-8 and
auto-detect encoding of meta-data by looking for incorrect UTF-8 sequences, and
assuming that the encoding is correct if we find none. I cannot comment on these
decisions, and I only reviewed the patch to check that they were implemented as
they should.

I think the code is correct, in the manypatches version. There is a conflict in
unit.hs which can be fixed by the attached patch.
History
Date User Action Args
2009-11-01 17:55:50tux_rockercreate
2009-11-01 19:17:14koweysetissues: + Should store patch metadata in utf-8
2009-11-02 04:38:05twbsetfiles: + utf8-metadata.unified.diff
nosy: + twb
messages: + msg9178
2009-11-02 09:05:57tux_rockersetmessages: + msg9182
2009-11-09 22:11:25tux_rockersetfiles: + utf8-bundle-amended.dpatch
messages: + msg9229
2009-11-15 15:57:36tux_rockersetmessages: + msg9307
2009-11-15 15:58:49tux_rockersetfiles: + autodetect-utf8.dpatch
messages: + msg9308
2009-11-22 18:00:51tux_rockersetfiles: + utf8-metadata-20091122-manypatches.dpatch
messages: + msg9463
2009-11-22 18:06:15tux_rockersetfiles: + utf8-metadata-20091122-onebigpatch.dpatch
2009-11-23 15:37:14koweysetnosy: + kowey, darcs-users
messages: + msg9469
2009-12-06 18:04:19koweysetnosy: + galbolle
messages: + msg9550
assignedto: galbolle
2009-12-11 12:40:35galbollesetstatus: needs-review -> review-in-progress
2009-12-18 13:42:57galbollesetstatus: review-in-progress -> accepted-pending-tests
messages: + msg9646
2009-12-18 13:55:49galbollesetfiles: + conflict-fix.dpatch
2009-12-24 15:49:56tux_rockersetstatus: accepted-pending-tests -> accepted