darcs

Issue 64 Should store patch metadata in utf-8

Title: Should store patch metadata in utf-8
Priority: feature
Status: resolved
Milestone: 2.5.0
Resolved in: 2.5.0
Superseder:
Nosy List: darcs-devel, dmitry.kurochkin, jch, kowey, mornfall, thorkilnaur, tommy, tuomov, tux_rocker
Assigned To: tux_rocker
Topics:

Created on 2005-12-17.12:06:21 by tuomov1, last changed 2010-06-15.22:32:20 by tux_rocker.

Messages
msg237 (view) Author: tuomov Date: 2005-12-17.12:06:21
(This bug doesn't seem to be in the new bug tracker yet, although it
should be in the old one.)

Darcs should store patch metadata (author name/email, patch name and
message) in UTF-8, converting to/from locale encoding for display and
input. There should perhaps also be --utf8 to be used with --pipe so
that ssh pushes and so on work correctly without losing information.
msg546 (view) Author: jch Date: 2006-03-03.16:46:01
This was discussed in detail in the thread starting at

  http://www.darcs.net/pipermail/darcs-users/2005-October/thread.html#8551

and the consensus is that there is no good solution, so the status quo will
remain.  Sorry for that.

As this is a bug of the universe, I'm half tempted to mark it not-our-bug ;-)
msg547 (view) Author: tuomov Date: 2006-03-03.16:55:10
On 2006-03-03 16:46 +0000, Juliusz Chroboczek wrote:
> As this is a bug of the universe, I'm half tempted to mark it not-our-bug ;-)

Ok, seems like time for a fork.. a fork that fixed all the americanisms,
including the time format. If I just had time to write the code...
msg549 (view) Author: igloo Date: 2006-03-03.17:27:14
I think there is some confusion here.

For filenames the problem is indeed tricky, and sticking with 8bit chars seems
reasonable.

However, I can see no reason not to make the author, patch description etc UTF-8
(and I think it should be done).
msg550 (view) Author: jch Date: 2006-03-03.17:29:46
> I think there is some confusion here.

> For filenames the problem is indeed tricky, and sticking with 8bit
> chars seems reasonable.

> However, I can see no reason not to make the author, patch
> description etc UTF-8 (and I think it should be done).

Hmm, you're right.  How do we manage the transition?

                                        Juliusz
msg551 (view) Author: igloo Date: 2006-03-03.17:46:57
[I've set this bug back to chatting as you agree]

The best transition plan OTTOMH is have a suitable update command do:
if (data is utf8) return data else return (<convert latin1 to utf8> data)
and then add something to the repo format description. I think any other scheme
will either be wrong more of the time or will involve user interaction (which is
especially bad if people don't choose the same options for patches in two repos).

While we could have new darcs able to send patches between converted and
non-converted repos, I think this isn't worthwhile. Just give an error telling
the user the conversion command needs to be run and which end to run it.
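
A minimal sketch of the update rule above, assuming metadata is handled as a
plain list of bytes rather than darcs' real string type. The names isUtf8,
latin1ToUtf8 and upgradeMetadata are made up for illustration and are not
darcs functions.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- | Simplified UTF-8 check: verifies lead bytes and continuation counts only.
-- It does not reject every overlong form or encoded surrogate.
isUtf8 :: [Word8] -> Bool
isUtf8 [] = True
isUtf8 (b : rest)
  | b < 0x80 = isUtf8 rest
  | b >= 0xC2 && b <= 0xDF = continued 1
  | b >= 0xE0 && b <= 0xEF = continued 2
  | b >= 0xF0 && b <= 0xF4 = continued 3
  | otherwise = False
  where
    continued n =
      let (cs, more) = splitAt n rest
       in length cs == n && all isCont cs && isUtf8 more
    isCont c = c .&. 0xC0 == 0x80

-- | Reinterpret every byte as a Latin-1 code point and encode it as UTF-8.
latin1ToUtf8 :: [Word8] -> [Word8]
latin1ToUtf8 = concatMap enc
  where
    enc b
      | b < 0x80 = [b]
      | otherwise = [0xC0 .|. (b `shiftR` 6), 0x80 .|. (b .&. 0x3F)]

-- | The rule from the message: keep valid UTF-8, convert everything else.
upgradeMetadata :: [Word8] -> [Word8]
upgradeMetadata bs
  | isUtf8 bs = bs
  | otherwise = latin1ToUtf8 bs
```

With these definitions, upgradeMetadata [0x63, 0x61, 0x66, 0xE9] ("caf" plus a
Latin-1 e-acute) yields the bytes 63 61 66 C3 A9, and running it a second time
changes nothing. The rule is still a guess: Latin-1 text whose bytes happen to
form valid UTF-8 is left untouched.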
msg555 (view) Author: droundy Date: 2006-03-06.13:42:35
On Fri, Mar 03, 2006 at 06:29:35PM +0100, Juliusz Chroboczek wrote:
> > I think there is some confusion here.
> 
> > For filenames the problem is indeed tricky, and sticking with 8bit
> > chars seems reasonable.
> 
> > However, I can see no reason not to make the author, patch
> > description etc UTF-8 (and I think it should be done).
> 
> Hmm, you're right.  How do we manage the transition?

If we don't mind leaving old metadata in whatever format it's already in,
we don't need a transition plan, do we?
-- 
David Roundy
http://www.darcs.net
msg556 (view) Author: igloo Date: 2006-03-06.17:12:08
On Mon, Mar 06, 2006 at 01:42:38PM +0000, David Roundy wrote:
> 
> David Roundy <droundy@darcs.net> added the comment:
> 
> On Fri, Mar 03, 2006 at 06:29:35PM +0100, Juliusz Chroboczek wrote:
> > > I think there is some confusion here.
> > 
> > > For filenames the problem is indeed tricky, and sticking with 8bit
> > > chars seems reasonable.
> > 
> > > However, I can see no reason not to make the author, patch
> > > description etc UTF-8 (and I think it should be done).
> > 
> > Hmm, you're right.  How do we manage the transition?
> 
> If we don't mind leaving old metadata in whatever format it's already in,
> we don't need a transition plan, do we?

That would mean code can never assume that metadata is valid UTF8, which
I'm sure will bite us sooner or later.

Thanks
Ian
msg626 (view) Author: jch Date: 2006-05-04.17:35:07
I'm marking this as wontfix -- I don't believe we'll fix that.  Please reopen
if you disagree.
msg627 (view) Author: igloo Date: 2006-05-04.19:57:07
On Thu, May 04, 2006 at 05:35:09PM +0000, Juliusz Chroboczek wrote:
> 
> I'm marking this as wontfix -- I don't believe we'll fix that.  Please reopen
> if you disagree.

I thought we agreed this should be fixed, even if we hadn't agreed on
the details of how? I certainly still think it should be fixed.

Thanks
Ian
msg628 (view) Author: droundy Date: 2006-05-05.10:48:40
On Thu, May 04, 2006 at 07:57:08PM +0000, Ian Lynagh wrote:
> I thought we agreed this should be fixed, even if we hadn't agreed on
> the details of how? I certainly still think it should be fixed.

Just to be clear (although I think it's obvious):  I'm ambivalent on the
issue.  I try to leave decisions (and coding) involving non-ASCII character
sets to people who use them.
-- 
David Roundy
http://www.darcs.net
msg629 (view) Author: tuomov Date: 2006-05-05.10:53:25
On 2006-05-05 10:48 +0000, David Roundy wrote:
> Just to be clear (although I think it's obvious):  I'm ambivalent on the
> issue.  I try to leave decisions (and coding) involving non-ASCII character
> sets to people who use them.

The problem is that without any locale support, two people who use 
different locales can't cooperate. What looks correct to one, looks
wrong to another. So, darcs should store metadata in such a manner
that the conversion to the user's locale can be done. The obvious
and simple way is to store it in utf-8. (An alternative would be
to indicate the encoding used.)
msg631 (view) Author: jch Date: 2006-05-09.15:33:57
> I try to leave decisions (and coding) involving non-ASCII character
> sets to people who use them.

The following is the point of view of someone using multiple European
languages daily (some encoded in Latin-1/9, some Latin-2).  The point
of view of someone using, say, Japanese would likely be different.

> The problem is that without any locale support, two people who use 
> different locales can't cooperate. What looks correct to one, looks
> wrong to another. So, darcs should store metadata in such a manner
> that the conversion to the user's locale can be done. The obvious
> and simple way is to store it in utf-8. (An alternative would be
> to indicate the encoding used.)

There are three possibilities here:

 (i) consider patch metadata as opaque strings, and leave conversion
     issues to the user; encourage users to standardise on UTF-8.
 (ii) convert everything to UTF-8 on the fly.
 (iii) allow random encodings, tag everything.

The accurate technical term for (iii) is ``ISO 2022 brain-damage''.  I
won't consider it any further.

Note that (i) and (ii) are indistinguishable in a UTF-8 locale, which
is what we've been encouraging users to use.  The main advantage of
(i) is that it's a much, much simpler model and is consistent with the
way Unix applications tend to work, and then when something goes wrong
it's usually easy to fix.  The advantage of (ii) is that it allows
interoperability even when some people use legacy locales and others
don't, which is undesirable; its disadvantage is that it's difficult
to make reliable -- you end up with things like double Latin 1 -> UTF-8
conversions, which tend to be a pain to fix.
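
To make the double-conversion problem concrete, here is a small sketch that
encodes a single Latin-1 e-acute twice and then peels one layer off again. It
assumes byte lists as the string type; latin1ToUtf8 and utf8ToLatin1 are
illustrative helpers, not darcs code.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- | Reinterpret every byte as a Latin-1 code point and encode it as UTF-8.
latin1ToUtf8 :: [Word8] -> [Word8]
latin1ToUtf8 = concatMap enc
  where
    enc b
      | b < 0x80 = [b]
      | otherwise = [0xC0 .|. (b `shiftR` 6), 0x80 .|. (b .&. 0x3F)]

-- | Undo one Latin-1 -> UTF-8 layer. Fails on malformed input and on
-- anything that encodes a code point above U+00FF.
utf8ToLatin1 :: [Word8] -> Maybe [Word8]
utf8ToLatin1 [] = Just []
utf8ToLatin1 (b : rest)
  | b < 0x80 = (b :) <$> utf8ToLatin1 rest
utf8ToLatin1 (lead : c : rest)
  | (lead == 0xC2 || lead == 0xC3) && c .&. 0xC0 == 0x80 =
      (byte :) <$> utf8ToLatin1 rest
  where
    byte = ((lead .&. 0x03) `shiftL` 6) .|. (c .&. 0x3F)
utf8ToLatin1 _ = Nothing

eAcute, encodedOnce, encodedTwice :: [Word8]
eAcute = [0xE9]                            -- Latin-1 e-acute
encodedOnce = latin1ToUtf8 eAcute          -- C3 A9
encodedTwice = latin1ToUtf8 encodedOnce    -- C3 83 C2 A9
```

The twice-encoded bytes are themselves valid UTF-8, so a validity check alone
cannot tell them from correct data. utf8ToLatin1 encodedTwice recovers
encodedOnce, but someone has to know that the extra layer is there.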

I certainly favour staying with (i), but that's because I'm a lazy
bum.  If we're to switch to (ii), we'll need a solution that deals
gracefully with wrongly encoded data (non-UTF-8 data can be detected
fairly reliably).

                                        Juliusz
msg632 (view) Author: tuomov Date: 2006-05-09.17:14:08
On 2006-05-09 15:33 +0000, Juliusz Chroboczek wrote:
> Note that (i) and (ii) are indistinguishable in a UTF-8 locale, which
> is what we've been encouraging users to use.  The main advantage of
> (i) is that it's a much, much simpler model and is consistent with the
> way Unix applications tend to work, and then when something goes wrong
> it's usually easy to fix.  

One should _never_ expect a particular encoding. Expecting UTF-8 is bound
to lead to similar problems that we're _still_ facing thanks to programs
having used to (and still doing so) expect (plain 7-bit) ASCII. All
programs should use locales to be agnostic to the encoding used in the
operating environment. As for storing data in the program's own files,
which have no need to be in the locale encoding, and where it might not even
be sufficient, UTF-8 is the best choice at the moment, if one does
not want to use 32-bit Unicode.

> If we're to switch to (ii), we'll need a solution that deals
> gracefully with wrongly encoded data (non-UTF-8 data can be detected
> fairly reliably).

Just add a marker to new patches that indicates that the metadata is
in UTF-8, and map non-ASCII in old patches to some suitable (free) range
in Unicode. A tool should also be provided to convert old patches to new 
format, asking the user to indicate the encoding.

The non-solution (i) is problematic, because at least in my case, the
incorrect encoding is only discovered after I've pushed the patch to
the public repository, and my changelog-generation tools crash on
incorrect encoding.
msg633 (view) Author: jch Date: 2006-05-09.17:46:25
> One should _never_ expect a particular encoding. Expecting UTF-8 is bound
> to lead to similar problems that we're _still_ facing thanks to programs
> having used to (and still doing so) expect (plain 7-bit) ASCII.

I'm a little reluctant to discuss that, as I've already spent a few
hundred messages' worth of flaming on that subject.  Let me just mention
that unlike ASCII, Unicode is an encoding designed to be universal and
that contains a huge Private Use Area.

(Chroboczek's third law: any thread that contains a mention of ISO 2022
in the context of Unicode degenerates into a flame war.)

> Just add a marker to new patches that indicates that the metadata is
> in UTF-8,

Hmm...  That is not absolutely necessary.  UTF-8 is sufficiently
stylised to be recognisable automatically.

> and map non-ASCII in old patches to some suitable (free) range in
> Unicode.

This is unfortunately not possible.  You cannot automatically
recognise ASCII in an unknown encoding due to the popularity of
seriously brain-damaged encodings (e.g ISO 2022-JP, Shift-JIS or Big5).

                                        Juliusz
msg634 (view) Author: tuomov Date: 2006-05-09.17:53:22
On 2006-05-09 17:46 +0000, Juliusz Chroboczek wrote:
> Unicode is an encoding designed to be universal and
> that contains a huge Private Use Area.

"640 kilobytes is enough for everyone."

> This is unfortunately not possible.  You cannot automatically
> recognise ASCII in an unknown encoding due to the popularity of
> seriously brain-damaged encodings (e.g ISO 2022-JP, Shift-JIS or Big5).

Eh? Who said anything about conversion? Just map the _bytes numbered_ from
128 to 255 that we don't know how to interpret, to some range of Unicode 
that doesn't have any useful meaning to put them out of the way, but 
without breaking the encoding.
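
One way to read this proposal, as a hedged sketch: decode what is valid UTF-8,
and park every byte that is not into a reserved range of code points so that
the original bytes can be restored later. The choice of U+E080..U+E0FF (in the
Private Use Area) is an arbitrary assumption for illustration, as are all the
function names; the thread does not pick a range.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Hypothetical escape base: byte b (>= 0x80) becomes code point 0xE000 + b.
escapeBase :: Int
escapeBase = 0xE000

isEscape :: Char -> Bool
isEscape ch = ord ch >= escapeBase + 0x80 && ord ch <= escapeBase + 0xFF

-- | Decode one well-formed UTF-8 sequence from the front of the input.
decodeOne :: [Word8] -> Maybe (Char, [Word8])
decodeOne [] = Nothing
decodeOne (b : rest)
  | b < 0x80 = Just (chr (fromIntegral b), rest)
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07)
  | otherwise = Nothing
  where
    multi n start =
      let (cs, more) = splitAt n rest
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) start cs
       in if length cs == n && all isCont cs && valid n cp
            then Just (chr cp, more)
            else Nothing
    isCont c = c .&. 0xC0 == 0x80
    -- Reject overlong forms, surrogates and values beyond U+10FFFF.
    valid n cp = cp >= minCp n && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)
    minCp n = case n of { 1 -> 0x80; 2 -> 0x800; _ -> 0x10000 }

-- | Decode, mapping each undecodable byte into the escape range. Text that
-- already uses the escape range is escaped byte by byte to stay reversible.
decodeEscaping :: [Word8] -> String
decodeEscaping [] = []
decodeEscaping bs@(b : rest) =
  case decodeOne bs of
    Just (ch, more) | not (isEscape ch) -> ch : decodeEscaping more
    _ -> chr (escapeBase + fromIntegral b) : decodeEscaping rest

encodeChar :: Char -> [Word8]
encodeChar ch
  | cp < 0x80 = [fromIntegral cp]
  | cp < 0x800 = [0xC0 .|. top 6, cont 0]
  | cp < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    cp = ord ch
    top s = fromIntegral (cp `shiftR` s)
    cont s = 0x80 .|. (fromIntegral (cp `shiftR` s) .&. 0x3F)

-- | Inverse of decodeEscaping: escaped code points turn back into raw bytes.
encodeEscaping :: String -> [Word8]
encodeEscaping = concatMap enc
  where
    enc ch
      | isEscape ch = [fromIntegral (ord ch - escapeBase)]
      | otherwise = encodeChar ch
```

Under these assumptions the decoded string is always representable as valid
UTF-8 (useful for output such as XML), while encodeEscaping . decodeEscaping
returns the original bytes unchanged.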
msg993 (view) Author: droundy Date: 2006-09-18.17:36:24
I just took another look at this, and my leaning would be to add an optional
conversion process for metadata and/or file names.  Just talked with Eric about
this yesterday...

I still don't like the idea of doing automatic locale-determined conversion,
just because it can fail, and I don't like the idea of darcs failing.  I suppose
locale-based conversion for metadata could be the default (definitely not for
filenames, as that could lead to repo corruption and extra excitement), with an
option to disable it.

David
msg994 (view) Author: kowey Date: 2006-09-18.17:56:10
On Mon, Sep 18, 2006 at 17:36:31 +0000, David Roundy wrote:
> I still don't like the idea of doing automatic locale-determined
> conversion, just because it can fail, and I don't like the idea of
> darcs failing.  I suppose locale-based conversion for metadata could
> be the default (definitely not for filenames, as that could lead to
> repo corruption and extra excitement), with an option to disable it.

Some more questions about this.  Just re: metadata:

1) Do we use Haskell putChar/getChar-derived functions to write/read
   the metadata?

2) Is metadata stored using Haskell Char?

3) Does #1 and #2 mean that we know for a fact, having used ghc
   circa 2006-09 and prior, that all "old" patches are encoded
   using latin-1 (*)?  

4) Does latin-1 cover the whole 8 bits?  Does everything in those
   8 bits correspond to some Unicode character?

5) Would #3 and #4 not considerably simplify the question of dealing
   with old patch metadata?  No need to worry about locales because
   there is exactly one way to interpret old patches?

Or am I missing something?

(*) Does the latin-1 encoding exactly correspond to the first 256
    Unicode code points?
msg995 (view) Author: droundy Date: 2006-09-18.18:11:35
On Mon, Sep 18, 2006 at 05:56:15PM +0000, Eric Kow wrote:
> Some more questions about this.  Just re: metadata:
> 
> 1) Do we use Haskell putChar/getChar-derived functions to write/read
>    the metadata?

No, I think we use a variety of functions for dealing with metadata.

> 2) Is metadata stored using Haskell Char?

No, sometimes I think it's in FastPackedString.

> 3) Does #1 and #2 mean that we know for a fact, having used ghc
>    circa 2006-09 and prior, that all "old" patches are encoded
>    using latin-1 (*)?  

No, not really, it really means that old patches are stored as raw
bytes, in whatever encoding the user provided them.

> 4) Does latin-1 cover the whole 8 bits?  Does everything in those
>    8 bits correspond to some Unicode character?

Yeah, latin-1 is an 8-bit encoding, and all 256 possibilities (except
maybe NULL? unless NULL is a unicode character?) correspond to unicode
characters.  But we can't trust that the bytes we were given were
actually latin-1.

> 5) Would #3 and #4 not considerably simplify the question of dealing
>    with old patch metadata?  No need to worry about locales because
>    there is exactly one way to interpret old patches?

No, but what I'd say is that we can just display as raw bytes any old
patches that aren't in utf8, and in that case we'll be doing no worse
than current darcs does.

My current leaning (which has changed in the last half-hour... and as
always could be swayed by someone who has actual data that would be
affected by this) is to try to display metadata as if it were in utf8,
but if the utf8 fails to decode (as will often be the case for old
patches), then just display the raw bytes, and hope they're in the
right encoding.  When encoding metadata, we'd convert into utf8 based
on the current locale, unless the metadata can't be decoded from the
current locale, in which case I presume we'd treat it as raw bytes.
But all this behavior ought to be determined by a switch of some sort,
so that if a project has already standardized on latin-1 (or 2-9) or
something, their project needn't get messed up.  On the other hand, if
I could be convinced that under any plausible usage scenario the
encoding/decoding.

I'm thinking we could have some sort of conversion routines like:

convertFromLocale :: FastPackedString -> FastPackedString
convertToLocale :: FastPackedString -> FastPackedString

with the above functions falling back to the identity if the
conversion fails.  Then we could perhaps write helper functions to use
the above when displaying PatchInfos (but not when writing them to
files).  I think there's a function like "friendly_somethingorother"
which is used to print patch names, and that's perhaps the only place
a change need be made.
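
A rough sketch of the "fall back to the identity" idea, assuming a byte list
stands in for FastPackedString and a fixed Latin-1 target stands in for the
user's locale (a real version would have to consult the locale). The names
orIdentity, utf8ToLatin1 and convertToLatin1 are hypothetical.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- | Stand-in for FastPackedString in this sketch.
type Bytes = [Word8]

-- | Run a conversion that may fail; on failure return the input unchanged.
orIdentity :: (Bytes -> Maybe Bytes) -> Bytes -> Bytes
orIdentity convert bs = fromMaybe bs (convert bs)

-- | UTF-8 to Latin-1. Fails on malformed input and on code points that
-- Latin-1 cannot represent (anything above U+00FF).
utf8ToLatin1 :: Bytes -> Maybe Bytes
utf8ToLatin1 [] = Just []
utf8ToLatin1 (b : rest)
  | b < 0x80 = (b :) <$> utf8ToLatin1 rest
utf8ToLatin1 (lead : c : rest)
  | (lead == 0xC2 || lead == 0xC3) && c .&. 0xC0 == 0x80 =
      (byte :) <$> utf8ToLatin1 rest
  where
    byte = ((lead .&. 0x03) `shiftL` 6) .|. (c .&. 0x3F)
utf8ToLatin1 _ = Nothing

-- | Display-side conversion for a Latin-1 terminal, identity on failure.
convertToLatin1 :: Bytes -> Bytes
convertToLatin1 = orIdentity utf8ToLatin1
```

Here convertToLatin1 [0x63, 0xC3, 0xA9] gives 63 E9, while the UTF-8 bytes of a
euro sign (E2 82 AC) come back unchanged because Latin-1 has no such character.
The caller cannot tell which of the two happened, since both results have the
same type.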

But it'll take someone who wants this feature to code it up.  Tuomo?

> Or am I missing something?
> 
> (*) Does the latin-1 encoding exactly correspond to the first 256
>     Unicode code points?

Yeah.
-- 
David Roundy
msg996 (view) Author: tuomov Date: 2006-09-18.18:34:43
On 2006-09-18 17:36 +0000, David Roundy wrote:
> I still don't like the idea of doing automatic locale-determined
> conversion, just because it can fail, and I don't like the idea of darcs
> failing.  

I don't think there's anything that could "fail" in metadata conversions.
Sure, you may not be able to convert everything into the current locale
encoding, thus having to do substitutions, but that's not darcs' problem, 
and in fact it already by default only displays the ASCII range as-is. And a 
locale not being able to represent all characters is less of a problem, 
than a hodgepodge of different encodings, for different patches. 

> with an option to disable it.

Hmm.. that's not good, if the patches are specified to contain proper
utf-8. Perhaps there could be an --utf8 to make darcs expect utf8, and
check for correctness, but LC_CTYPE=whatever.UTF-8 does the same thing.

As for old patches, perhaps the best would be to have a repository-specific
setting that indicates the encoding used in old-format patches, with a new
slightly modified patch format specified to use a particular encoding
(utf-8), or specifying the encoding itself, which is likely more cumbersome.
(One could also simply use that repository-specific setting for new patches
as well, but that doesn't stop someone from messing with the setting in
their copy of the repository, and thus sending me patches with corrupt
encoding, that I'll only detect when my xsltproc fails in invalid UTF-8, and
the patch is already in the public repository. Thus the information should
be in the patch itself, I think. Using the repository setting for new
patches doesn't also help with upgrading the encoding used.)

Some more clarifications:

http://www.abridgegame.org/pipermail/darcs-devel/2006-September/004777.html
msg997 (view) Author: tuomov Date: 2006-09-18.18:45:57
On 2006-09-18 18:11 +0000, David Roundy wrote:
> I'm thinking we could have some sort of conversion routines like:
> 
> convertFromLocale :: FastPackedString -> FastPackedString
> convertToLocale :: FastPackedString -> FastPackedString
> 
> with the above functions falling back to the identity if the
> conversion fails.  

If we allow the conversion to fail, I think we should then indicate
in the string whether it is utf-8 (or whatever), or raw bytes.
Also, perhaps the same setting should apply to both the author name
and patch description, so that if either fails, both are considered
to be raw bytes.
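
As a sketch of that suggestion: tag each string with how it was interpreted,
and decide for author and patch name together. The Metadata type and
tagPatchInfo are invented for illustration, and the UTF-8 check is a
simplified structural one.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- | Metadata together with the way its bytes are to be interpreted.
data Metadata = Utf8 [Word8] | Raw [Word8]
  deriving (Eq, Show)

-- | Simplified UTF-8 check: lead bytes and continuation counts only.
isUtf8 :: [Word8] -> Bool
isUtf8 [] = True
isUtf8 (b : rest)
  | b < 0x80 = isUtf8 rest
  | b >= 0xC2 && b <= 0xDF = continued 1
  | b >= 0xE0 && b <= 0xEF = continued 2
  | b >= 0xF0 && b <= 0xF4 = continued 3
  | otherwise = False
  where
    continued n =
      let (cs, more) = splitAt n rest
       in length cs == n && all isCont cs && isUtf8 more
    isCont c = c .&. 0xC0 == 0x80

-- | Tag author and patch name with one shared verdict: if either fails
-- the UTF-8 check, both are treated as raw bytes.
tagPatchInfo :: [Word8] -> [Word8] -> (Metadata, Metadata)
tagPatchInfo author name
  | isUtf8 author && isUtf8 name = (Utf8 author, Utf8 name)
  | otherwise = (Raw author, Raw name)
```

Code that displays a patch can then pattern-match on the tag instead of
guessing again from the bytes.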

> But it'll take someone who wants this feature to code it up.  Tuomo?

If I was that familiar with the darcs codebase, I'd have coded it already,
even for just my own use. I guess I could try to find the time to write it
(whatever little time I find when I feel like coding, I'd rather spend it on
finishing Ion3) if someone could point me to the right places in the code,
and it isn't awful lot of work... but then again, someone more familiar with
the code has already done half of the job by explaining it to me, as it
really shouldn't be that much work. (I once, almost a few years ago, IIRC,
looked into locale support or option for ISO-8601 for time output, and that
was going to be a lot of work.)
msg1092 (view) Author: tuomov Date: 2006-10-15.10:20:22
On 2006-09-18 18:46 +0000, Tuomo Valkonen wrote:
> > But it'll take someone who wants this feature to code it up.  Tuomo?
> 
> if someone could point me to the right places in the code,
> and it isn't awful lot of work... but then again, someone more familiar with
> the code has already done half of the job by explaining it to me, as it
> really shouldn't be that much work. 

So?
msg1116 (view) Author: tommy Date: 2006-10-17.15:56:47
On Sun, Oct 15, 2006 at 10:20:29AM +0000, Tuomo Valkonen wrote:
> > if someone could point me to the right places in the code,
> > and it isn't awful lot of work...

Unfortunately it looks like no one can point you to the right
place. :-(

I have only worked on some of the printing code, but here is an
at least somewhat informed guess. The change that David proposes
would require adding a new option in DarcsFlags.lhs /
DarcsArguments.lhs, writing the converting functions (maybe in
DarcsUtils.lhs), inserting calls to these conversion functions
after reading and before writing patchinfos (in
PatchInfo.lhs, I think). If the command line options are not
passed down all the way to the patch info writing and reading
functions, these functions must be extended with an extra
argument, either one that governs just the conversion or the entire
option data structure itself so it can be examined. If "after
reading" and "before writing" of meta data is not confined to
just two functions, things become more complicated.

Note that this simple solution does not store any information in
the meta data about how it is encoded. It would work as: any
repo stored meta data that can successfully be converted from
UTF-8 to the current locale _is_ in UTF-8 (which is why the
override option is needed), anything else is raw data possibly
in the current locale (the current behavior of darcs). Any user
supplied meta data is converted (if necessary and not inhibited
by use of option) to UTF-8.

I suspect that extending the meta data format with extra
information requires use of the new repo format thing (that is
not used for anything yet, I think) so older darcs can tell they
don't support a repo with encoding flagged meta data. That
would be substantially more work, although not very difficult, I
guess.
msg1121 (view) Author: tuomov Date: 2006-10-17.18:12:22
On 2006-10-17 15:57 +0000, Tommy Pettersson wrote:
> 
> Tommy Pettersson <ptp@lysator.liu.se> added the comment:
> 
> On Sun, Oct 15, 2006 at 10:20:29AM +0000, Tuomo Valkonen wrote:
> > > if someone could point me to the right places in the code,
> > > and it isn't awful lot of work...
> 
> Unfortunately it looks like no one can point you to the right
> place. :-(

Bah. The only "simple" places for conversions seem to be
askUser, and the flags, which should amount to the input. 
'Doc', or extension of Printable, seems another quite appropriate
place, but these conversions depend on the IO monad or some other 
parametrisation! But the rendering isn't currently done there. So 
it's once again one of Haskell's biggest problems that make things
laboursome: parametrisation, even at execution time. Of course, one
could do unsafeIO, but that's so utterly and totally ugly. One might
also be able to do the conversions at another point, but finding
if this is possible, may need extensive archeological work
on the mess known as darcs source code, and certainly is a lot
of work.

I haven't bothered looking into repo formats, and probably
won't, if wholesale conversion of Doc is what is needed...
And even there I haven't looked into how much code would
need to be converted to use alternative composition routines..
msg1176 (view) Author: jch Date: 2006-11-05.22:21:01
> I just took another look at this, and my leaning would be to add an optional
> conversion process for metadata and/or file names.

[...]

> I still don't like the idea of doing automatic locale-determined
> conversion, just because it can fail, and I don't like the idea of
> darcs failing.

David, Eric, Tommy,

Locali[sz]ation is something people feel very strongly about, and if
we implement any form of localisation in Darcs, we'll end up at the
wrong end of a series of flamewars.

I strongly recommend that we avoid introducing any form of
locale-sensitivity in Darcs.

The clean solution would be to ask users to use UTF-8 throughout the
system.  If that's not reasonable, it should be enough to allow the
user to specify conversion filters to be used when reading and writing
files:

  darcs record --input-filter='iconv -f latin-1 -t utf-8'
               --output-filter='iconv -f utf-8 -t latin-1'

Of course, if the user specifies filters that are not inverses, he'll
get grief, which is what he deserves for not using UTF-8 in the first
place.

                                        Juliusz
msg1177 (view) Author: tuomov Date: 2006-11-05.23:07:53
On 2006-11-05 23:20 +0100, Juliusz Chroboczek wrote:
> The clean solution would be to ask users to use UTF-8 throughout the
> system.  If that's not reasonable, it should be enough to allow the
> user to specify conversion filters to be used when reading and writing
> files:
> 
>   darcs record --input-filter='iconv -f latin-1 -t utf-8'
>                --output-filter='iconv -f utf-8 -t latin-1'

And how's that supposed to help with the users (contributors) sending
stuff in the wrong encoding? 

> Of course, if the user specifies filters that are not inverses, he'll
> get grief, which is what he deserves for not using UTF-8 in the first
> place.

Fucking monoculturist. Locales, i.e. abstraction, is the right way
to go about encodings, not standardising on a single one, thus
creating a monoculture, an evolutionary dead-end, that will not
allow easily replacing the encoding with something better. UTF-8
(and Unicode in particular) is not perfect, and leaves a lot of
room for improvement. Unfortunately, monoculturists like you
seem to think this is so, that there will be nothing after UTF-8.
(Or else, they want to make shitty software to have jobs once
the next encoding comes along.) 

I think I'll remove the <meta encoding> tag from my web pages,
and the Content-Encoding field from my emails one of these days,
and use a custom encoding modelled after tex \charactername escapes
– and use a lot of special characters.
msg6870 (view) Author: mornfall Date: 2008-12-23.09:08:58
I unilaterally reopen this issue and expect to come up with patches that will 
make darcs:

- convert input strings (patch name, author name) from locale encoding to utf8
- convert output strings from utf8 to locale encoding if possible (assume
  raw otherwise)
- convert non-utf8 8bit chars into a free range of codepoints in unicode
  for --xml, so we finally get utf8-clean --xml output

I won't come up with a conversion tool for now, since I believe the above fixes 
most of our worst sores, namely the inability to produce valid XML and our 
clunky non-encoding of (new) metadata (which makes it unreasonably hard to 
recover the actual metadata from a repository). Moreover, there seems to be 
sort of consensus in the discussion that this is about the right approach.

Later on, an interactive conversion command might be devised. It might be 
possible to create a dump from that, which'll make conversion of related repos 
easier and safer. This conversion part is however likely to be quite laborious 
and I don't expect to need any of it, so in case the first part gets through, 
I'm likely to file a new wish and mark it as needs-volunteer.
msg6874 (view) Author: kowey Date: 2008-12-23.21:56:50
On Tue, Dec 23, 2008 at 09:09:03 -0000, Petr Ročkai wrote:
> I unilaterally reopen this issue and expect to come up with patches that will 
> make darcs:

So I'm very pleased to see this issue re-opened, and I agree that we
should do something to sort this out.

Yes, I want to see some sort of solution that lets us (for example)
guarantee that we've got actual Unicode metadata (while dealing
gracefully with things we can't understand)...

> - convert input strings (patch name, author name) from locale encoding to utf8
> - convert output strings from utf8 to locale encoding if possible (assume
>   raw otherwise)
> - convert non-utf8 8bit chars into a free range of codepoints in unicode
>   for --xml, so we finally get utf8-clean --xml output

But what do you make of Juliusz's cautionary note about locale
sensitivity?

Juliusz, if I understand correctly, has considerable experience in these
matters and I suspect that it may be wise to heed his advice (or find
somebody equally experienced to tell us why this is OK)

Thanks!
msg8012 (view) Author: kowey Date: 2009-08-05.12:55:39
If I understand correctly, GHC 6.12's IO will use the current locale by default.  

http://hackage.haskell.org/trac/ghc/ticket/2811

I think this makes it more urgent that we review the state of Darcs IO and
figure out what is the right way to respond to this (hopefully closing this
ticket along the way).

One option is to resist this change so that Darcs continues to behave in the
same way (in other words, using special code if compiling with GHC 6.12).

Probably a better option is to bend with the wind and have Darcs act in  a
similar fashion.  But are there places where we aren't using System.IO to read
stuff that will need to be updated accordingly?  How are we going to deal with
older GHC?
msg8049 (view) Author: tux_rocker Date: 2009-08-09.11:24:34
I'm assigning to myself because this is currently not high on mornfall's
priority list.

What shall we do regarding older metadata?

I think I'll make a flag in the repo format saying that the metadata is in UTF-8.
If that flag is present, darcs knows that patch metadata in that repo are UTF-8.

If such a flag is not present in the repo, try to decode the patch metadata with
the current locale. If that fails, it should try latin-1. AFAIK, decoding with
latin1 cannot fail.

I can see two problems now with this approach:
 (1) patch metadata in repositories that are not yet UTF-8 encoded may look
different in different environments
 (2) patches in non-UTF-8 converted environments may get converted to UTF-8
differently, depending on what environment they are converted in

Point (1) is already present in the current darcs. Point (2) may become a bit
hairy when a non-UTF-8 patch is converted to an UTF-8 patch in two different
environments, one latin-1 and another latin-2 for example. Then if a fourth
machine pulls both from the latin-1 and the latin-2 environment, he will see two
different UTF-8-encoded patches which are really the same non-encoded patch.
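
Point (2) can be shown with a single byte. In this sketch the same legacy
metadata is converted to UTF-8 once under a Latin-1 reading and once under a
Latin-2 reading. The Latin-2 decoder is deliberately partial (only 0xB1 is
mapped, every other byte falls through to its Latin-1 meaning), and all names
are illustrative.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Latin-1 decoding cannot fail: byte n is code point n.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- | Partial Latin-2 decoder, enough for this example only.
decodeLatin2Partial :: [Word8] -> String
decodeLatin2Partial = map dec
  where
    dec 0xB1 = '\x0105'           -- a with ogonek in ISO 8859-2
    dec b = chr (fromIntegral b)  -- NOT correct for Latin-2 in general

encodeChar :: Char -> [Word8]
encodeChar ch
  | cp < 0x80 = [fromIntegral cp]
  | cp < 0x800 = [0xC0 .|. top 6, cont 0]
  | cp < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    cp = ord ch
    top s = fromIntegral (cp `shiftR` s)
    cont s = 0x80 .|. (fromIntegral (cp `shiftR` s) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

-- | Legacy metadata bytes: 'z', 0xB1, 'b'.
legacy :: [Word8]
legacy = [0x7A, 0xB1, 0x62]

convertedAsLatin1, convertedAsLatin2 :: [Word8]
convertedAsLatin1 = encodeUtf8 (decodeLatin1 legacy)         -- 7A C2 B1 62
convertedAsLatin2 = encodeUtf8 (decodeLatin2Partial legacy)  -- 7A C4 85 62
```

The two results differ in their middle bytes, so if the metadata takes part in
identifying the patch, the two conversions no longer look like the same patch.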

The locale-aware GHC 6.12 will not be such a great problem. You can still get
byte I/O when you open a file as binary, and there are special libraries for
encoded and unencoded IO (text and bytestring come to my mind).
msg8051 (view) Author: kowey Date: 2009-08-09.13:46:52
OK, I get confused easily, so watch out in case I say something
stupid.

On Sun, Aug 09, 2009 at 11:24:38 +0000, Reinier Lamers wrote:
> What shall we do regarding older metadata?
> 
> I think I'll make a flag in the repo format saying that the metadata is in UTF-8.
> If that flag is present, darcs knows that patch metadata in that repo are UTF-8.

If I understand correctly, this means that older darcs cannot push
patches to utf-8 repos (which is deliberate to prevent them from
introducing patches that aren't to be treated as being in UTF-8).

I suppose we could provide some sort of upgrade mechanism, for
example, darcs convert --from-encoding=latin-2, maybe even a really fancy
one that somehow lets you specify different encodings for different
older patches.  Maybe I'm getting way ahead of myself.

Perhaps another option is not to do this at the repo level, but at the
patch level (see issue1906).  In this particular context, we could
maybe re-use the Ignore-this mechanism.  For example, we optimistically
parse the metadata using the UTF-8 encoding, and if either there is a
UTF-8 error or we fail to find 'Ignore-this: UTF-8', we would back off
to the current behaviour of treating it as Latin-1.
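
A sketch of that per-patch fallback, under the assumption that the caller has
already looked for the 'Ignore-this: UTF-8' line and passes the result in as a
Bool. decodeMetadata and its helpers are hypothetical names, and the metadata
is modelled as a list of bytes.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode one well-formed UTF-8 sequence from the front of the input.
decodeOne :: [Word8] -> Maybe (Char, [Word8])
decodeOne [] = Nothing
decodeOne (b : rest)
  | b < 0x80 = Just (chr (fromIntegral b), rest)
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07)
  | otherwise = Nothing
  where
    multi n start =
      let (cs, more) = splitAt n rest
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) start cs
       in if length cs == n && all isCont cs && valid n cp
            then Just (chr cp, more)
            else Nothing
    isCont c = c .&. 0xC0 == 0x80
    -- Reject overlong forms, surrogates and values beyond U+10FFFF.
    valid n cp = cp >= minCp n && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)
    minCp n = case n of { 1 -> 0x80; 2 -> 0x800; _ -> 0x10000 }

-- | Strict UTF-8 decoding of a whole string; Nothing on any error.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 bs = do
  (ch, rest) <- decodeOne bs
  (ch :) <$> decodeUtf8 rest

-- | Latin-1 decoding cannot fail: byte n is code point n.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- | Use UTF-8 only if the patch carries the marker AND the bytes decode
-- cleanly; in every other case back off to Latin-1.
decodeMetadata :: Bool -> [Word8] -> String
decodeMetadata hasUtf8Marker bs =
  case (hasUtf8Marker, decodeUtf8 bs) of
    (True, Just text) -> text
    _ -> decodeLatin1 bs
```

Because the marker travels with the patch, the bytes C3 A9 decode to a single
e-acute in a marked patch and to two Latin-1 characters in an unmarked one,
whatever repository the patch ends up in.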

Note that this suggestion is orthogonal to the question of locale
support in user IO as the patch log is part and parcel of the patch
identifier and is meant only to indicate how Darcs stores the patch
internally.

> If such a flag is not present in the repo, try to decode the patch metadata with
> the current locale. If that fails, it should try latin-1. AFAIK, decoding with
> latin1 cannot fail.

I think that sounds right.
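
The quoted fallback order can be sketched in Python as follows. The helper name `decode_legacy` and its optional `locale_encoding` parameter are hypothetical and exist only to make the example testable; this is not darcs code:

```python
import locale
from typing import Optional

def decode_legacy(raw: bytes, locale_encoding: Optional[str] = None) -> str:
    """Try the current locale's encoding first, then fall back to latin-1."""
    encoding = locale_encoding or locale.getpreferredencoding(False)
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # latin-1 maps each of the 256 byte values to one code point,
        # so this decode cannot fail.
        return raw.decode("latin-1")

# Every possible byte value decodes under latin-1.
all_bytes = bytes(range(256))
print(len(all_bytes.decode("latin-1")))
```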

> I can see two problems now with this approach:
>  (1) patch metadata in repositories that are not yet UTF-8 encoded may look
> different in different environments
>  (2) patches in non-UTF-8 converted environments may get converted to UTF-8
> differently, depending on what environment they are converted in
> 
> Point (1) is already present in the current darcs. Point (2) may become a bit
> hairy when a non-UTF-8 patch is converted to a UTF-8 patch in two different
> environments, one latin-1 and another latin-2 for example. Then if a fourth
> machine pulls both from the latin-1 and the latin-2 environment, he will see two

If we go the flag route, repos that lack the flag could also take the
option of leaving the status quo.
msg8055 (view) Author: tux_rocker Date: 2009-08-09.17:51:07
On Sunday 09 August 2009 15:46:54 Eric Kow wrote:
> Eric Kow <kowey@darcs.net> added the comment:
>
> On Sun, Aug 09, 2009 at 11:24:38 +0000, Reinier Lamers wrote:
> > What shall we do regarding older metadata?
> >
> > I think I'll make a flag in the repo format saying that the metadata is
> > in UTF-8. If that flag is present, darcs knows that patch metadata in
> > that repo are UTF-8.
>
> Perhaps another option is not to do this at the repo level, but at the
> patch level (see issue1906).  In this particular context, we could
> maybe re-use the Ignore-this mechanism.  For example, we optimistically
> parse the metadata using the UTF-8 encoding, and if either there is a
> UTF-8 error or we fail to find 'Ignore-this: UTF-8', we would back off
> to the current behaviour of treating it as Latin-1.

That sounds good because it avoids the incompatibility problems you get when 
you convert patches.

> Note that this suggestion is orthogonal to the question of locale
> support in user IO as the patch log is part and parcel of the patch
> identifier and is meant only to indicate how Darcs stores the patch
> internally.

But when we want to support locale in user IO, we do have to know what to 
convert the patch name from when outputting it to the user. And right now we 
don't.

Reinier
msg8056 (view) Author: kowey Date: 2009-08-09.19:17:45
On Sun, Aug 09, 2009 at 17:51:10 +0000, Reinier Lamers wrote:
> > Note that this suggestion is orthogonal to the question of locale
> > support in user IO as the patch log is part and parcel of the patch
> > identifier and is meant only to indicate how Darcs stores the patch
> > internally.
> 
> But when we want to support locale in user IO, we do have to know what to 
> convert the patch name from when outputting it to the user. And right now we 
> don't.

Right, so I *think* this means we're on the same page.  I thought it was
completely orthogonal, but you've corrected me by pointing out that it's
not quite: if we're dealing with known-UTF-8 patches we have to convert
from that; otherwise we have to convert from ??? (which could mean trying
the current locale and falling back to latin-1, as you said).

And I think that our same-pageness extends to the idea that we can
hopefully solve the patch internal representation stuff first and
then tackle the user IO stuff next...
msg9338 (view) Author: tux_rocker Date: 2009-11-15.17:53:15
The fix for /storing/ the metadata goes into 2.4. Actually being sane according
to Unicode when matching or displaying will be fixed whenever someone
(possibly me) gets around to it. Matching and display were not given a high
priority in the discussion at the Vienna sprint.
msg10918 (view) Author: tux_rocker Date: 2010-05-04.06:50:25
The following patch updated the status of issue64 to be resolved:

* resolve issue64: store metadata as UTF-8, autodetect UTF-8, and don't normalize to NFC 
Ignore-this: ae22511c7679f078412698f866d69255
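
A small Python illustration of what "don't normalize to NFC" affects. The thread does not give the reasoning behind that choice, so the sketch only shows that NFC normalization changes the bytes that would be stored; the sample string is made up:

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT (two code points).
decomposed = "e\u0301"
# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

stored_as_is = decomposed.encode("utf-8")  # b"e\xcc\x81"
stored_nfc = composed.encode("utf-8")      # b"\xc3\xa9"

# The two forms render the same but are different byte strings.
print(stored_as_is != stored_nfc)
```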
msg11442 (view) Author: tux_rocker Date: 2010-06-15.22:32:19
The following patch updated issue64 with status=resolved;resolvedin=2.5.0 (current)

* resolve issue64: store metadata as UTF-8, autodetect UTF-8, and don't normalize to NFC 
Ignore-this: ae22511c7679f078412698f866d69255
History
Date User Action Args
2005-12-17 12:06:21 tuomov1 create
2005-12-17 21:15:47 jch set nosy: + jch
2006-03-03 16:46:03 jch set status: unread -> wont-fix
nosy: droundy, jch, tommy, tuomov1
messages: + msg546
2006-03-03 16:55:15 tuomov1 set nosy: droundy, jch, tommy, tuomov1
messages: + msg547
2006-03-03 17:27:15 igloo set nosy: + igloo
messages: + msg549
2006-03-03 17:29:47 jch set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg550
2006-03-03 17:46:58 igloo set status: wont-fix -> unknown
nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg551
2006-03-06 13:42:38 droundy set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg555
2006-03-06 17:12:10 igloo set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg556
2006-05-04 17:35:09 jch set status: unknown -> wont-fix
nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg626
2006-05-04 19:57:08 igloo set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg627
2006-05-05 10:48:42 droundy set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg628
2006-05-05 10:53:29 tuomov1 set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg629
2006-05-09 15:33:59 jch set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg631
2006-05-09 17:14:10 tuomov1 set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg632
2006-05-09 17:46:27 jch set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg633
2006-05-09 17:53:24 tuomov1 set nosy: droundy, jch, tommy, tuomov1, igloo
messages: + msg634
2006-09-18 17:36:31 droundy set nosy: + kowey
messages: + msg993
2006-09-18 17:56:15 kowey set nosy: droundy, jch, tommy, kowey, tuomov1, igloo
messages: + msg994
2006-09-18 18:11:42 droundy set nosy: droundy, jch, tommy, kowey, tuomov1, igloo
messages: + msg995
2006-09-18 18:34:49 tuomov1 set nosy: droundy, jch, tommy, kowey, tuomov1, igloo
messages: + msg996
2006-09-18 18:46:03 tuomov1 set nosy: droundy, jch, tommy, kowey, tuomov1, igloo
messages: + msg997
2006-10-15 10:20:29 tuomov1 set nosy: droundy, jch, tommy, kowey, tuomov1, igloo
messages: + msg1092
2006-10-17 15:57:10 tommy set nosy: droundy, jch, tommy, kowey, tuomov1, igloo
messages: + msg1116
2006-10-17 18:12:33 tuomov1 set nosy: droundy, jch, tommy, kowey, tuomov1, igloo
messages: + msg1121
2006-11-05 22:21:06 jch set nosy: droundy, jch, tommy, kowey, tuomov1, igloo
messages: + msg1176
2006-11-05 23:08:01 tuomov1 set nosy: droundy, jch, tommy, kowey, tuomov1, igloo
messages: + msg1177
2008-12-23 09:09:03 mornfall set status: wont-fix -> unknown
nosy: + dmitry.kurochkin, simon, mornfall, thorkilnaur
messages: + msg6870
assignedto: mornfall
2008-12-23 18:03:46 droundy set nosy: - droundy
2008-12-23 21:56:54 kowey set nosy: + droundy, tuomov12345
messages: + msg6874
2009-02-06 02:09:34 twb link issue1143 superseder
2009-08-05 12:55:41 kowey set priority: wishlist -> feature
nosy: droundy, jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, dmitry.kurochkin, mornfall
messages: + msg8012
2009-08-05 12:56:47 kowey set nosy: - droundy
2009-08-09 11:24:38 tux_rocker set nosy: + tux_rocker
messages: + msg8049
assignedto: mornfall -> tux_rocker
2009-08-09 13:46:54 kowey set nosy: jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall
messages: + msg8051
2009-08-09 17:51:10 tux_rocker set nosy: jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall
messages: + msg8055
2009-08-09 19:17:48 kowey set nosy: jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall
messages: + msg8056
2009-08-19 10:39:53 kowey set status: unknown -> has-patch
nosy: jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall
topic: + Target-2.4
2009-08-25 17:18:16 admin set nosy: + darcs-devel, - igloo
2009-08-25 17:36:44 admin set nosy: - simon
2009-08-26 17:56:38 kowey link issue33 superseder
2009-08-27 14:33:42 admin set nosy: jch, tommy, kowey, darcs-devel, tuomov1, tuomov12345, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall
2009-09-14 10:52:22 kowey set topic: + Target-2.5, - Target-2.4
nosy: jch, tommy, kowey, darcs-devel, tuomov1, tuomov12345, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall
2009-10-23 22:47:19 admin set nosy: - tuomov1
2009-10-24 00:11:54 admin set nosy: + tuomov1, - tuomov12345
2009-10-24 00:40:53 admin set nosy: + tuomov, - tuomov1
2009-11-01 19:17:14 kowey link patch37 issues
2009-11-15 17:53:20 tux_rocker set topic: + Target-2.4, - Target-2.5
messages: + msg9338
2009-11-15 18:03:44 tux_rocker link issue1692 superseder
2009-11-15 18:08:14 tux_rocker link issue1693 superseder
2010-03-01 13:22:08 kowey set topic: + Target-2.5, - Target-2.4
2010-05-04 06:50:26 tux_rocker set status: has-patch -> resolved
messages: + msg10918
2010-06-10 09:00:32 kowey unlink issue1143 superseder
2010-06-10 09:07:10 kowey unlink issue33 superseder
2010-06-15 20:52:08 admin set milestone: 2.5.0
2010-06-15 20:59:45 admin set topic: - Target-2.5
2010-06-15 22:32:20 tux_rocker set messages: + msg11442
resolvedin: 2.5.0