Issue 64: Should store patch metadata in utf-8

Title	Should store patch metadata in utf-8
Priority	feature	Status	resolved
Milestone	2.5.0	Resolved in	2.5.0
Superseder		Nosy List	darcs-devel, dmitry.kurochkin, jch, kowey, mornfall, thorkilnaur, tommy, tuomov, tux_rocker
Assigned To	tux_rocker	Topics

Created on 2005-12-17.12:06:21 by tuomov1, last changed 2010-06-15.22:32:20 by tux_rocker.

Messages
msg237 (view)	Author: tuomov	Date: 2005-12-17.12:06:21
(This bug doesn't seem to be in the new bug tracker yet, although it should be in the old one.) Darcs should store patch metadata (author name/email, patch name and message) in UTF-8, converting to/from locale encoding for display and input. There should perhaps also be --utf8 to be used with --pipe so that ssh pushes and so on work correctly without losing information.
msg546 (view)	Author: jch	Date: 2006-03-03.16:46:01
This was discussed in detail in the thread starting at http://www.darcs.net/pipermail/darcs-users/2005-October/thread.html#8551 and the consensus is that there is no good solution, so the status quo will remain. Sorry for that. As this is a bug of the universe, I'm half tempted to mark it not-our-bug ;-)
msg547 (view)	Author: tuomov	Date: 2006-03-03.16:55:10
On 2006-03-03 16:46 +0000, Juliusz Chroboczek wrote: > As this is a bug of the universe, I'm half tempted to mark it not-our-bug ;-) Ok, seems like time for a fork.. a fork that fixed all the americanisms, including the time format. If I just had time to write the code...
msg549 (view)	Author: igloo	Date: 2006-03-03.17:27:14
I think there is some confusion here. For filenames the problem is indeed tricky, and sticking with 8bit chars seems reasonable. However, I can see no reason not to make the author, patch description etc UTF-8 (and I think it should be done).
msg550 (view)	Author: jch	Date: 2006-03-03.17:29:46
> I think there is some confusion here. > For filenames the problem is indeed tricky, and sticking with 8bit > chars seems reasonable. > However, I can see no reason not to make the author, patch > description etc UTF-8 (and I think it should be done). Hmm, you're right. How do we manage the transition? Juliusz
msg551 (view)	Author: igloo	Date: 2006-03-03.17:46:57
[I've set this bug back to chatting as you agree] The best transition plan OTTOMH is have a suitable update command do: if (data is utf8) return data else return (<convert latin1 to utf8> data) and then add something to the repo format description. I think any other scheme will either be wrong more of the time or will involve user interaction (which is especially bad if people don't choose the same options for patches in two repos). While we could have new darcs able to send patches between converted and non-converted repos, I think this isn't worthwhile. Just give an error telling the user the conversion command needs to be run and which end to run it.
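A minimal sketch of the update rule described in the message above, written in Python purely for illustration (Darcs itself is Haskell). The function name upgrade_metadata is hypothetical and does not correspond to any existing Darcs command.

```python
def upgrade_metadata(raw: bytes) -> bytes:
    """Leave valid UTF-8 untouched; reinterpret anything else as Latin-1."""
    try:
        raw.decode("utf-8")       # validation only; the decoded text is discarded
        return raw
    except UnicodeDecodeError:
        # Latin-1 decoding never fails, so this branch always succeeds.
        return raw.decode("latin-1").encode("utf-8")


already_utf8 = "Petr Ročkai".encode("utf-8")
legacy = "Jürgen".encode("latin-1")   # b'J\xfcrgen', not valid UTF-8

print(upgrade_metadata(already_utf8) == already_utf8)
print(upgrade_metadata(legacy).decode("utf-8"))
```

The rule is idempotent, so running it again on an already converted repo changes nothing. Its weak spot is legacy text that is not Latin-1 at all, or Latin-1 text that happens to form valid UTF-8; both are converted (or skipped) without any warning.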
msg555 (view)	Author: droundy	Date: 2006-03-06.13:42:35
On Fri, Mar 03, 2006 at 06:29:35PM +0100, Juliusz Chroboczek wrote: > > I think there is some confusion here. > > > For filenames the problem is indeed tricky, and sticking with 8bit > > chars seems reasonable. > > > However, I can see no reason not to make the author, patch > > description etc UTF-8 (and I think it should be done). > > Hmm, you're right. How do we manage the transition? If we don't mind leaving old metadata in whatever format it's already in, we don't need a transition plan, do we? -- David Roundy http://www.darcs.net
msg556 (view)	Author: igloo	Date: 2006-03-06.17:12:08
On Mon, Mar 06, 2006 at 01:42:38PM +0000, David Roundy wrote: > > David Roundy <droundy@darcs.net> added the comment: > > On Fri, Mar 03, 2006 at 06:29:35PM +0100, Juliusz Chroboczek wrote: > > > I think there is some confusion here. > > > > > For filenames the problem is indeed tricky, and sticking with 8bit > > > chars seems reasonable. > > > > > However, I can see no reason not to make the author, patch > > > description etc UTF-8 (and I think it should be done). > > > > Hmm, you're right. How do we manage the transition? > > If we don't mind leaving old metadata in whatever format it's already in, > we don't need a transition plan, do we? That would mean code can never assume that metadata is valid UTF8, which I'm sure will bite us sooner or later. Thanks Ian
msg626 (view)	Author: jch	Date: 2006-05-04.17:35:07
I'm marking this as wontfix -- I don't believe we'll fix that. Please reopen if you disagree.
msg627 (view)	Author: igloo	Date: 2006-05-04.19:57:07
On Thu, May 04, 2006 at 05:35:09PM +0000, Juliusz Chroboczek wrote: > > I'm marking this as wontfix -- I don't believe we'll fix that. Please reopen > if you disagree. I thought we agreed this should be fixed, even if we hadn't agreed on the details of how? I certainly still think it should be fixed. Thanks Ian
msg628 (view)	Author: droundy	Date: 2006-05-05.10:48:40
On Thu, May 04, 2006 at 07:57:08PM +0000, Ian Lynagh wrote: > I thought we agreed this should be fixed, even if we hadn't agreed on > the details of how? I certainly still think it should be fixed. Just to be clear (although I think it's obvious): I'm ambivalent on the issue. I try to leave decisions (and coding) involving non-ASCII character sets to people who use them. -- David Roundy http://www.darcs.net
msg629 (view)	Author: tuomov	Date: 2006-05-05.10:53:25
On 2006-05-05 10:48 +0000, David Roundy wrote: > Just to be clear (although I think it's obvious): I'm ambivalent on the > issue. I try to leave decisions (and coding) involving non-ASCII character > sets to people who use them. The problem is that without any locale support, two people who use different locales can't cooperate. What looks correct to one, looks wrong to another. So, darcs should store metadata in such a manner that the conversion to the user's locale can be done. The obvious and simple way is to store it in utf-8. (An alternative would be to indicate the encoding used.)
msg631 (view)	Author: jch	Date: 2006-05-09.15:33:57
> I try to leave decisions (and coding) involving non-ASCII character > sets to people who use them. The following is the point of view of someone using multiple European languages daily (some encoded in Latin-1/9, some Latin-2). The point of view of someone using, say, Japanese would likely be different. > The problem is that without any locale support, two people who use > different locales can't cooperate. What looks correct to one, looks > wrong to another. So, darcs should store metadata in such a manner > that the conversion to the user's locale can be done. The obvious > and simple way is to store it in utf-8. (An alternative would be > to indicate the encoding used.) There are three possibilities here: (i) consider patch metadata as opaque strings, and leave conversion issues to the user; encourage users to standardise on UTF-8. (ii) convert everything to UTF-8 on the fly. (iii) allow random encodings, tag everything. The accurate technical term for (iii) is ``ISO 2022 brain-damage''. I won't consider it any further. Note that (i) and (ii) are indistinguishable in a UTF-8 locale, which is what we've been encouraging users to use. The main advantage of (i) is that it's a much, much simpler model and is consistent with the way Unix applications tend to work, and then when something goes wrong it's usually easy to fix. The advantage of (ii) is that it allows interoperability even when some people use legacy locales and others don't, which is desirable; its disadvantage is that it's difficult to make reliable -- you end up with things like double Latin 1 -> UTF-8 conversions, which tend to be a pain to fix. I certainly favour staying with (i), but that's because I'm a lazy bum. If we're to switch to (ii), we'll need a solution that deals gracefully with wrongly encoded data (non-UTF-8 data can be detected fairly reliably). Juliusz
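To make the "double Latin 1 -> UTF-8 conversion" failure concrete, here is a small Python illustration. The sample name and the helper is_valid_utf8 are made up for this sketch; nothing here is Darcs code.

```python
name = "Jürgen"
stored = name.encode("utf-8")                     # correct UTF-8: b'J\xc3\xbcrgen'

# A second Latin-1 -> UTF-8 pass applied to data that was already UTF-8.
double = stored.decode("latin-1").encode("utf-8")
garbled = double.decode("utf-8")                  # two junk characters replace the u-umlaut

# The damage can be undone only if you know exactly which conversion was repeated.
repaired = double.decode("utf-8").encode("latin-1").decode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes are a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(garbled, repaired)
print(is_valid_utf8(name.encode("latin-1")), is_valid_utf8(double))
```

Note the asymmetry: raw Latin-1 input is rejected by the validity check, but the doubly converted string is itself well-formed UTF-8, so validation alone cannot catch it.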
msg632 (view)	Author: tuomov	Date: 2006-05-09.17:14:08
On 2006-05-09 15:33 +0000, Juliusz Chroboczek wrote: > Note that (i) and (ii) are indistinguishable in a UTF-8 locale, which > is what we've been encouraging users to use. The main advantage of > (i) is that it's a much, much simpler model and is consistent with the > way Unix applications tend to work, and then when something goes wrong > it's usually easy to fix. One should _never_ expect a particular encoding. Expecting UTF-8 is bound to lead to similar problems that we're _still_ facing thanks to programs having used to (and still doing so) expect (plain 7-bit) ASCII. All programs should use locales to be agnostic to the encoding used in the operating environment. As for storing data in the program's own files, that have no need to be in locale encoding, and where it might not even be sufficient, there UTF-8 is the best choice at the moment, if one does not want to use 32-bit unicode. > If we're to switch to (ii), we're need a solution that deals > gracefully with wrongly encoded data (non-UTF-8 data can be detected > fairly reliably). Just add a marker to new patches that indicates that the metadata is in UTF-8, and map non-ASCII in old patches to some suitable (free) range in Unicode. A tool should also be provided to convert old patches to new format, asking the user to indicate the encoding. The non-solution (i) is problematic, because at least in my case, the incorrect encoding is only discovered after I've pushed the patch to the public repository, and my changelog-generation tools crash on incorrect encoding.
msg633 (view)	Author: jch	Date: 2006-05-09.17:46:25
> One should _never_ expect a particular encoding. Expecting UTF-8 is bound > to lead to similar problems that we're _still_ facing thanks to programs > having used to (and still doing so) expect (plain 7-bit) ASCII. I'm a little reluctant to discuss that, as I've already spent a few hundred message worth of flaming on that subject. Let me just mention that unlike ASCII, Unicode is an encoding designed to be universal and that contains a huge Private Use Area. (Chroboczek's third law: any thread that contains a mention of ISO 2022 in the context of Unicode degenerates into a flame war.) > Just add a marker to new patches that indicates that the metadata is > in UTF-8, Hmm... That is not absolutely necessary. UTF-8 is sufficiently stylised to be recognisable automatically. > and map non-ASCII in old patches to some suitable (free) range in > Unicode. This is unfortunately not possible. You cannot automatically recognise ASCII in an unknown encoding due to the popularity of seriously brain-damaged encodings (e.g ISO 2022-JP, Shift-JIS or Big5). Juliusz
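Both claims in the message above can be checked with Python's standard codecs. The sketch below is only an illustration with made-up sample strings: it shows that a high-byte Latin-1 string fails UTF-8 validation, and that ISO 2022-JP and Shift-JIS reuse byte values below 128 for non-ASCII text.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1 = "Jürgen".encode("latin-1")
jis = "こんにちは".encode("iso2022_jp")   # escape sequences plus 7-bit bytes only
sjis = "表".encode("shift_jis")           # two bytes; the second is 0x5C, ASCII backslash

print(looks_like_utf8(latin1))            # Latin-1 text with a high byte is rejected
print(all(b < 0x80 for b in jis), looks_like_utf8(jis))
print(sjis[1] == 0x5C)
```

The ISO 2022-JP sample consists solely of 7-bit bytes, so it also passes as "valid UTF-8" even though none of its payload bytes mean what ASCII says they mean. That is the sense in which ASCII cannot be recognised in data of unknown encoding.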
msg634 (view)	Author: tuomov	Date: 2006-05-09.17:53:22
On 2006-05-09 17:46 +0000, Juliusz Chroboczek wrote: > Unicode is an encoding designed to be universal and > that contains a huge Private Use Area. "640 kilobytes is enough for everyone." > This is unfortunately not possible. You cannot automatically > recognise ASCII in an unknown encoding due to the popularity of > seriously brain-damaged encodings (e.g ISO 2022-JP, Shift-JIS or Big5). Eh? Who said anything about conversion? Just map the _bytes numbered_ from 128 to 255 that we don't know how to interpret, to some range of Unicode that doesn't have any useful meaning to put them out of the way, but without breaking the encoding.
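A sketch of the byte-mapping idea from the message above, in Python for illustration. The base value 0xF700 is an arbitrary assumption (it places bytes 128..255 at U+F780..U+F7FF, inside the BMP Private Use Area U+E000..U+F8FF); the thread does not name a range, and the function names are hypothetical.

```python
PUA_BASE = 0xF700  # assumed offset; bytes 0x80..0xFF land on U+F780..U+F7FF


def bytes_to_text(raw: bytes) -> str:
    """Keep 7-bit bytes as-is and park every high byte on a private-use code point."""
    return "".join(chr(b) if b < 0x80 else chr(PUA_BASE + b) for b in raw)


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text; rejects anything that mapping could not have produced."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            raise ValueError(f"unexpected code point U+{cp:04X}")
    return bytes(out)


raw = b"J\xfcrgen"                      # unknown legacy encoding
text = bytes_to_text(raw)
utf8 = text.encode("utf-8")             # always succeeds: well-formed UTF-8
print(text_to_bytes(utf8.decode("utf-8")) == raw)
```

Python's built-in "surrogateescape" error handler does something similar, but it maps the bytes to lone surrogates (U+DC80..U+DCFF), which cannot be written out as strict UTF-8; private-use code points avoid that, at the cost of colliding with any genuine private-use characters in the data.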
msg993 (view)	Author: droundy	Date: 2006-09-18.17:36:24
I just took another look at this, and my leaning would be to add an optional conversion process for metadata and/or file names. Just talked with Eric about this yesterday... I still don't like the idea of doing automatic locale-determined conversion, just because it can fail, and I don't like the idea of darcs failing. I suppose locale-based conversion for metadata could be the default (definitely not for filenames, as that could lead to repo corruption and extra excitement), with an option to disable it. David
msg994 (view)	Author: kowey	Date: 2006-09-18.17:56:10
On Mon, Sep 18, 2006 at 17:36:31 +0000, David Roundy wrote: > I still don't like the idea of doing automatic locale-determined > conversion, just because it can fail, and I don't like the idea of > darcs failing. I suppose locale-based conversion for metadata could > be the default (definitely not for filenames, as that could lead to > repo corruption and extra excitement), with an option to disable it. Some more questions about this. Just re: metadata: 1) Do we use Haskell putChar/getChar-derived functions to write/read the metadata? 2) Is metadata stored using Haskell Char? 3) Do #1 and #2 mean that we know for a fact, having used ghc circa 2006-09 and prior, that all "old" patches are encoded using latin-1 (*)? 4) Does latin-1 cover the whole 8 bits? Does everything in those 8 bits correspond to some Unicode character? 5) Would #3 and #4 not considerably simplify the question of dealing with old patch metadata? No need to worry about locales because there is exactly one way to interpret old patches? Or am I missing something? (*) Does the latin-1 encoding exactly correspond to the first 256 Unicode code points?
msg995 (view)	Author: droundy	Date: 2006-09-18.18:11:35
On Mon, Sep 18, 2006 at 05:56:15PM +0000, Eric Kow wrote: > Some more questions about this. Just re: metadata: > > 1) Do we use Haskell putChar/getChar-derived functions to write/read > the metadata? No, I think we use a variety of functions for dealing with metadata. > 2) Is metadata is stored using Haskell Char? No, sometimes I think it's in FastPackedString. > 3) Does #1 and #2 mean that we know for a fact, having used ghc > circa 2006-09 and prior, that all "old" patches are encoded > using latin-1 ()? No, not really, it really means that old patches are stored as raw bytes, in whatever encoding the user provided them. > 4) Does latin-1 cover the whole 8 bits? Does everything in those > 8 bits correspond to some Unicode character? Yeah, latin-1 is an 8-bit encoding, and all 256 possibilities (except maybe NULL? unless NULL is a unicode character?) correspond to unicode characters. But we can't trust that the bytes we were given were actually latin-1. > 5) Would #3 and #4 not considerably simplify the question of dealing > with old patch metadata? No need to worry about locales because > there is exactly one way to interpret old patches? No, but what I'd say is that we can just display as raw bytes any old patches that aren't in utf8, and in that case we'll be doing no worse than current darcs does. My current leaning (which has changed in the last half-hour... and as always could be swayed by someone who has actual data that would be affected by this) is to try to display metadata as if it were in utf8, but if the utf8 fails to decode (as will often be the case for old patches), then just display the raw bytes, and hope they're in the right encoding. When encoding metadata, we'd convert into utf8 based on the current locale, unless the metadata can't be decoded from the current locale, in which case I presume we'd treat it as raw bytes. 
But all this behavior ought to be determined by a switch of some sort, so that if a project has already standardized on latin-1 (or 2-9) or something, their project needn't get messed up. On the other hand, if I could be convinced that under any plausible usage scenario the encoding/decoding. I'm thinking we could have some sort of a conversion routines like: convertFromLocale :: FastPackedString -> FastPackedString convertToLocale :: FastPackedString -> FastPackedString with the above functions falling back to the identity if the conversion fails. Then we could perhaps write helper functions to use the above when displaying PatchInfos (but not when writing them to files). I think there's a function like "friendly_somethingorother" which is used to print patch names, and that's perhaps the only place a change need be made. But it'll take someone who wants this feature to code it up. Tuomo? > Or am I missing something? > > () Does the latin-1 encoding exactly correspond to the first 256 > Unicode code points? Yeah. -- David Roundy
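The two conversion routines proposed above, with their fall-back-to-identity behaviour, might look roughly like this in Python (Darcs would of course use Haskell and FastPackedString). Passing the locale encoding as an explicit argument is an assumption made to keep the sketch testable; in real use it could come from locale.getpreferredencoding().

```python
def convert_from_locale(raw: bytes, locale_encoding: str) -> bytes:
    """User input in the locale encoding -> UTF-8 for storage; identity on failure."""
    try:
        return raw.decode(locale_encoding).encode("utf-8")
    except (UnicodeError, LookupError):   # undecodable bytes or unknown encoding name
        return raw


def convert_to_locale(stored: bytes, locale_encoding: str) -> bytes:
    """Stored metadata -> locale encoding for display; identity on failure."""
    try:
        return stored.decode("utf-8").encode(locale_encoding)
    except (UnicodeError, LookupError):   # old non-UTF-8 patch, or unrepresentable text
        return stored


new_patch = convert_from_locale(b"J\xfcrgen", "latin-1")
print(new_patch)                                    # UTF-8 bytes
print(convert_to_locale(new_patch, "latin-1"))      # back to the user's bytes
print(convert_to_locale(b"J\xfcrgen", "latin-1"))   # old raw patch: shown unchanged
```

Because failure silently returns the input, a caller cannot tell converted data from raw data afterwards; that ambiguity is the price of never failing.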
msg996 (view)	Author: tuomov	Date: 2006-09-18.18:34:43
On 2006-09-18 17:36 +0000, David Roundy wrote: > I still don't like the idea of doing automatic locale-determined > conversion, just because it can fail, and I don't like the idea of darcs > failing. I don't think there's anything that could "fail" in metadata conversions. Sure, you may not be able to convert everything into the current locale encoding, thus having to do substitutions, but that's not darcs' problem, and in fact it already by default only displays the ASCII range as-is. And a locale not being able to represent all characters is less of a problem than a hodgepodge of different encodings for different patches. > with an option to disable it. Hmm.. that's not good, if the patches are specified to contain proper utf-8. Perhaps there could be an --utf8 to make darcs expect utf8, and check for correctness, but LC_CTYPE=whatever.UTF-8 does the same thing. As for old patches, perhaps the best would be to have a repository-specific setting that indicates the encoding used in old-format patches, with a new slightly modified patch format specified to use a particular encoding (utf-8), or specifying the encoding itself, which is likely more cumbersome. (One could also simply use that repository-specific setting for new patches as well, but that doesn't stop someone from messing with the setting in their copy of the repository, and thus sending me patches with corrupt encoding, that I'll only detect when my xsltproc fails on invalid UTF-8, and the patch is already in the public repository. Thus the information should be in the patch itself, I think. Using the repository setting for new patches also doesn't help with upgrading the encoding used.) Some more clarifications: http://www.abridgegame.org/pipermail/darcs-devel/2006-September/004777.html
msg997 (view)	Author: tuomov	Date: 2006-09-18.18:45:57
On 2006-09-18 18:11 +0000, David Roundy wrote: > I'm thinking we could have some sort of a conversion routines like: > > convertFromLocale :: FastPackedString -> FastPackedString > convertToLocale :: FastPackedString -> FastPackedString > > with the above functions falling back to the identity if the > conversion fails. If we allow the conversion to fail, I think we should then indicate in the string whether it is utf-8 (or whatever), or raw bytes. Also, perhaps the same setting should apply to both the author name and patch description, so that if either fails, both are considered to be raw bytes. > But it'll take someone who wants this feature to code it up. Tuomo? If I was that familiar with the darcs codebase, I'd have coded it already, even for just my own use. I guess I could try to find the time to write it (whatever little time I find when I feel like coding, I'd rather spend it on finishing Ion3) if someone could point me to the right places in the code, and it isn't awful lot of work... but then again, someone more familiar with the code has already done half of the job by explaining it to me, as it really shouldn't be that much work. (I once, almost a few years ago, IIRC, looked into locale support or option for ISO-8601 for time output, and that was going to be a lot of work.)
msg1092 (view)	Author: tuomov	Date: 2006-10-15.10:20:22
On 2006-09-18 18:46 +0000, Tuomo Valkonen wrote: > > But it'll take someone who wants this feature to code it up. Tuomo? > > if someone could point me to the right places in the code, > and it isn't awful lot of work... but then again, someone more familiar with > the code has already done half of the job by explaining it to me, as it > really shouldn't be that much work. So?
msg1116 (view)	Author: tommy	Date: 2006-10-17.15:56:47
On Sun, Oct 15, 2006 at 10:20:29AM +0000, Tuomo Valkonen wrote: > > if someone could point me to the right places in the code, > > and it isn't awful lot of work... Unfortunately it looks like no one can point you to the right place. :-( I have only worked on some of the printing code, but here is an at least somewhat informed guess. The change that David proposes would require adding a new option in DarcsFlags.lhs / DarcsArguments.lhs, writing the converting functions (maybe in DarcsUtils.lhs), inserting calls to these conversion functions after reading and before writing patchinfos (in PatchInfo.lhs, I think). If the command line options are not passed down all the way to the patch info writing and reading functions, these functions must be extended with an extra argument, either one that governs just the conversion, or the entire option data structure itself so it can be examined. If "after reading" and "before writing" of meta data is not confined to just two functions, things become more complicated. Note that this simple solution does not store any information in the meta data about how it is encoded. It would work as: any repo stored meta data that can successfully be converted from UTF-8 to the current locale _is_ in UTF-8 (which is why the override option is needed), anything else is raw data possibly in the current locale (the current behavior of darcs). Any user supplied meta data is converted (if necessary and not inhibited by use of option) to UTF-8. I suspect that extending the meta data format with extra information requires use of the new repo format thing (that is not used for anything yet, I think) so older darcs can tell they don't support a repo with encoding flagged meta data. That would be substantially more work, although not very difficult, I guess.
msg1121 (view)	Author: tuomov	Date: 2006-10-17.18:12:22
On 2006-10-17 15:57 +0000, Tommy Pettersson wrote: > > Tommy Pettersson <ptp@lysator.liu.se> added the comment: > > On Sun, Oct 15, 2006 at 10:20:29AM +0000, Tuomo Valkonen wrote: > > > if someone could point me to the right places in the code, > > > and it isn't awful lot of work... > > Unfortunately it looks like no one can point you to the right > place. :-( Bah. The only "simple" places for conversions seem to be askUser, and the flags, which should amount to most of the input. 'Doc', or extension of Printable, seems another quite appropriate place, but these conversions depend on the IO monad or some other parametrisation! But the rendering isn't currently done there. So it's once again one of Haskell's biggest problems that make things laboursome: parametrisation, even at execution time. Of course, one could do unsafeIO, but that's so utterly and totally ugly. One might also be able to do the conversions at another point, but finding if this is possible, may need extensive archeological work on the mess known as darcs source code, and certainly is a lot of work. I haven't bothered looking into repo formats, and probably won't, if wholesale conversion of Doc is what is needed... And even there I haven't looked into how much code would need to be converted to use alternative composition routines..
msg1176 (view)	Author: jch	Date: 2006-11-05.22:21:01
> I just took another look at this, and my leaning would be to add an optional > conversion process for metadata and/or file names. [...] > I still don't like the idea of doing automatic locale-determined > conversion, just because it can fail, and I don't like the idea of > darcs failing. David, Eric, Tommy, Locali[sz]ation is something people feel very strongly about, and if we implement any form of localisation in Darcs, we'll end up at the wrong end of a series of flamewars. I strongly recommend that we avoid introducing any form of locale-sensitivity in Darcs. The clean solution would be to ask users to use UTF-8 throughout the system. If that's not reasonable, it should be enough to allow the user to specify conversion filters to be used when reading and writing files: darcs record --input-filter='iconv -f latin-1 -t utf-8' --output-filter='iconv -f utf-8 -t latin-1' Of course, if the user specifies filters that are not inverses, he'll get grief, which is what he deserves for not using UTF-8 in the first place. Juliusz
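The remark about filters that are not inverses can be illustrated with a small round-trip check. This is a Python sketch only: recode roughly models `iconv -f src -t dst` on a byte string, and --input-filter / --output-filter are the options proposed in the message above, not flags that Darcs is known to have.

```python
from typing import Callable

Filter = Callable[[bytes], bytes]


def recode(src: str, dst: str) -> Filter:
    """Build a filter that re-encodes a byte string from src to dst."""
    return lambda data: data.decode(src).encode(dst)


def round_trips(to_repo: Filter, from_repo: Filter, sample: bytes) -> bool:
    """True if writing and then reading the sample gives back the same bytes."""
    try:
        return from_repo(to_repo(sample)) == sample
    except UnicodeError:
        return False   # one of the filters could not represent the text at all


# Matching pair: latin-1 -> utf-8 on input, utf-8 -> latin-1 on output.
good = round_trips(recode("latin-1", "utf-8"), recode("utf-8", "latin-1"),
                   "Jürgen".encode("latin-1"))

# Mismatched pair: input is treated as latin-2 but output is forced into latin-1.
bad = round_trips(recode("iso8859-2", "utf-8"), recode("utf-8", "latin-1"),
                  "Kołodziej".encode("iso8859-2"))

print(good, bad)
```

A check like this only proves the pair is consistent for the samples tried; it says nothing about whether the input filter matches what contributors actually typed.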
msg1177 (view)	Author: tuomov	Date: 2006-11-05.23:07:53
On 2006-11-05 23:20 +0100, Juliusz Chroboczek wrote: > The clean solution would be to ask users to use UTF-8 throughout the > system. If that's not reasonable, it should be enough to allow the > user to specify conversion filters to be used when reading and writing > files: > > darcs record --input-filter='iconv -f latin-1 -t utf-8' > --output-filter='iconv -f utf-8 -t latin-1' And how's that supposed to help with the users (contributors) sending stuff in the wrong encoding? > Of course, if the user specifies filters that are not inverses, he'll > get grief, which is what he deserves for not using UTF-8 in the first > place. Fucking monoculturist. Locales, i.e. abstraction, is the right way to go about encodings, not standardising on a single one, thus creating a monoculture, an evolutionary dead-end, that will not allow easily replacing the encoding with something better. UTF-8 (and Unicode in particular) is not perfect, and leaves a lot of room for improvement. Unfortunately, monoculturists like you seem to think this is so, that there will be nothing after UTF-8. (Or else, they want to make shitty software to have jobs once the next encoding comes along.) I think I'll remove the <meta encoding> tag from my web pages, and the Content-Encoding field from my emails one of these days, and use a custom encoding modelled after tex \charactername escapes – and use a lot of special characters.
msg6870 (view)	Author: mornfall	Date: 2008-12-23.09:08:58
I unilaterally reopen this issue and expect to come up with patches that will make darcs: - convert input strings (patch name, author name) from locale encoding to utf8 - convert output strings from utf8 to locale encoding if possible (assume raw otherwise) - convert non-utf8 8bit chars into a free range of codepoints in unicode for --xml, so we finally get utf8-clean --xml output I won't come up with a conversion tool for now, since I believe the above fixes most of our worst sores, namely the inability to produce valid XML and our clunky non-encoding of (new) metadata (which makes it unreasonably hard to recover the actual metadata from a repository). Moreover, there seems to be sort of consensus in the discussion that this is about the right approach. Later on, an interactive conversion command might be devised. It might be possible to create a dump from that, which'll make conversion of related repos easier and safer. This conversion part is however likely to be quite laborious and I don't expect to need any of it, so in case the first part gets through, I'm likely to file a new wish and mark it as needs-volunteer.
msg6874 (view)	Author: kowey	Date: 2008-12-23.21:56:50
On Tue, Dec 23, 2008 at 09:09:03 -0000, Petr Ročkai wrote: > I unilaterally reopen this issue and expect to come up with patches that will > make darcs: So I'm very pleased to see this issue re-opened, and I agree that we should do something to sort this out. Yes, I want to see some sort of solution that lets us (for example) guarantee that we've got actual Unicode metadata (while dealing gracefully with things we can't understand)... > - convert input strings (patch name, author name) from locale encoding to utf8 > - convert output strings from utf8 to locale encoding if possible (assume > raw otherwise) > - convert non-utf8 8bit chars into a free range of codepoints in unicode > for --xml, so we finally get utf8-clean --xml output But what do you make of Juliusz's cautionary note about locale sensitivity? Juliusz, if I understand correctly, has considerable experience in these matters and I suspect that it may be wise to heed his advice (or find somebody equally experienced to tell us why this is OK). Thanks!
msg8012 (view)	Author: kowey	Date: 2009-08-05.12:55:39
If I understand correctly, GHC 6.12's IO will use the current locale by default. http://hackage.haskell.org/trac/ghc/ticket/2811 I think this makes it more urgent that we review the state of Darcs IO and figure out what is the right way to respond to this (hopefully closing this ticket along the way). One option is to resist this change so that Darcs continues to behave in the same way (in other words, using special code if compiling with GHC 6.12). Probably a better option is to bend with the wind and have Darcs act in a similar fashion. But are there places where we aren't using System.IO to read stuff that will need to be updated accordingly? How are we going to deal with older GHC?
msg8049 (view)	Author: tux_rocker	Date: 2009-08-09.11:24:34
I'm assigning to myself because this is currently not high on mornfall's priority list. What shall we do regarding older metadata? I think I'll make a flag in the repo format saying that the metadata in UTF-8. If that flag is present, darcs knows that patch metadata in that repo are UTF-8. If such a flag is not present in the repo, try to decode the patch metadata with the current locale. If that fails, it should try latin-1. AFAIK, decoding with latin1 cannot fail. I can see two problems now with this approach: (1) patch metadata in repositories that are not yet UTF-8 encoded may look different in different environments (2) patches in non-UTF-8 converted environments may get converted to UTF-8 differently, depending on what environment they are converted in Point (1) is already present in the current darcs. Point (2) may become a bit hairy when a non-UTF-8 patch is converted to an UTF-8 patch in two different environments, one latin-1 and another latin-2 for example. Then if a fourth machine pulls both from the latin-1 and the latin-2 environment, he will see two different UTF-8-encoded patches which are really the same non-encoded patch. The locale-aware GHC 6.12 will not be such a great problem. You can still get byte I/O when you open a file as binary, and there are special libraries for encoded and unencoded IO (text and bytestring come to my mind).
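A sketch of the decoding order described above (repo flag, then current locale, then latin-1), together with a demonstration of problem (2). This is illustrative Python; the function name, the boolean flag parameter and the sample bytes are all assumptions, not Darcs internals.

```python
def decode_metadata(raw: bytes, repo_is_utf8: bool, locale_encoding: str) -> str:
    """Decode patch metadata following the flag / locale / latin-1 order."""
    if repo_is_utf8:
        return raw.decode("utf-8")        # flagged repo: metadata is known UTF-8
    try:
        return raw.decode(locale_encoding)
    except UnicodeDecodeError:
        return raw.decode("latin-1")      # cannot fail: every byte is a code point


raw = b"Ko\xb3odziej"                      # the same unflagged patch bytes everywhere

in_latin2 = decode_metadata(raw, False, "iso8859-2")   # byte 0xB3 read as l-with-stroke
in_latin1 = decode_metadata(raw, False, "latin-1")     # byte 0xB3 read as superscript 3
in_utf8 = decode_metadata(raw, False, "utf-8")         # invalid UTF-8, latin-1 fallback

print(in_latin2, in_latin1, in_utf8)
print(in_latin2.encode("utf-8") == in_latin1.encode("utf-8"))
```

The final comparison is false: converting the same patch to UTF-8 in a latin-1 and a latin-2 environment yields two different byte strings, which is how one original patch can end up looking like two distinct patches.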
msg8051 (view)	Author: kowey	Date: 2009-08-09.13:46:52
OK, I get confused easily, so watch out in case I say something stupid. On Sun, Aug 09, 2009 at 11:24:38 +0000, Reinier Lamers wrote: > What shall we do regarding older metadata? > > I think I'll make a flag in the repo format saying that the metadata in UTF-8. > If that flag is present, darcs knows that patch metadata in that repo are UTF-8. If I understand correctly, this means that older darcs cannot push patches to utf-8 repos (which is deliberate to prevent them from introducing patches that aren't to be treated as being in UTF-8). I suppose we'd could provide some sort of upgrade mechanism, eg. for example, darcs convert --from-encoding=latin-2 maybe even a really fancy one that somehow lets you specify different encodings for different older patches. Maybe I'm getting way ahead of myself. Perhaps another option is not to do this at the repo level, but at the patch level (see issue1906). In this particular context, we could maybe re-use the Ignore-this mechanism. For example, we optimistically parse the metadata using the UTF-8 encoding, and if either there is a UTF-8 error or we fail to find 'Ignore-this: UTF-8', we would back off to the current behaviour of treating it as Latin-1. Note that this suggestion is orthogonal to the question of locale support in user IO as the patch log is part and parcel of the patch identifier and is meant only to indicate how Darcs stores the patch internally. > If such a flag is not present in the repo, try to decode the patch metadata with > the current locale. If that fails, it should try latin-1. AFAIK, decoding with > latin1 cannot fail. I think that sounds right. > I can see two problems now with this approach: > (1) patch metadata in repositories that are not yet UTF-8 encoded may look > different in different environments > (2) patches in non-UTF-8 converted environments may get converted to UTF-8 > differently, depending on what environment they are converted in > > Point (1) is already present in the current darcs. 
Point (2) may become a bit > hairy when a non-UTF-8 patch is converted to an UTF-8 patch in two different > environments, one latin-1 and another latin-2 for example. Then if a fourth > machine pulls both from the latin-1 and the latin-2 environment, he will see two If we go the flag route, repos that lack the flag could also take the option of leaving the status quo
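The per-patch alternative suggested in this message (decode optimistically as UTF-8, and back off to Latin-1 if decoding fails or the marker is missing) could be sketched as follows. This is Python for illustration only; the marker text is taken from the message, while the line-based log layout and the function name are assumptions rather than the real Darcs patch format.

```python
UTF8_MARKER = b"Ignore-this: UTF-8"   # marker as proposed in the thread; layout is assumed


def decode_patch_log(log: bytes) -> str:
    """Treat the log as UTF-8 only if it carries the marker and decodes cleanly."""
    has_marker = any(line.strip() == UTF8_MARKER for line in log.splitlines())
    if has_marker:
        try:
            return log.decode("utf-8")
        except UnicodeDecodeError:
            pass                       # marked but malformed: fall through
    return log.decode("latin-1")       # current behaviour: bytes as Latin-1


marked = UTF8_MARKER + b"\n" + "fix from Petr Ročkai".encode("utf-8")
old = "Jürgen".encode("latin-1")
unmarked_utf8 = "Jürgen".encode("utf-8")

print(decode_patch_log(marked))
print(decode_patch_log(old))
print(decode_patch_log(unmarked_utf8))   # no marker, so it is misread as Latin-1
```

The last case shows why the marker matters under this scheme: well-formed UTF-8 written without it is still interpreted the old way.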
msg8055 (view)	Author: tux_rocker	Date: 2009-08-09.17:51:07
On Sunday 09 August 2009 15:46:54 Eric Kow wrote: > Eric Kow <kowey@darcs.net> added the comment: > > On Sun, Aug 09, 2009 at 11:24:38 +0000, Reinier Lamers wrote: > > What shall we do regarding older metadata? > > > > I think I'll make a flag in the repo format saying that the metadata in > > UTF-8. If that flag is present, darcs knows that patch metadata in that > > repo are UTF-8. > > Perhaps another option is not to do this at the repo level, but at the > patch level (see issue1906). In this particular context, we could > maybe re-use the Ignore-this mechanism. For example, we optimistically > parse the metadata using the UTF-8 encoding, and if either there is a > UTF-8 error or we fail to find 'Ignore-this: UTF-8', we would back off > to the current behaviour of treating it as Latin-1. That sounds good because it avoids the incompatibility problems you get when you convert patches. > Note that this suggestion is orthogonal to the question of locale > support in user IO as the patch log is part and parcel of the patch > identifier and is meant only to indicate how Darcs stores the patch > internally. But when we want to support locale in user IO, we do have to know what to convert the patch name from when outputting it to the user. And right now we don't. Reinier
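To make the conversion incompatibility from problem (2) concrete, here is a small Python illustration. The author bytes and the helper `convert_to_utf8` are hypothetical; the point is only that the same stored bytes re-encode to different UTF-8 depending on which legacy encoding the converting machine assumes.

```python
def convert_to_utf8(raw: bytes, assumed_encoding: str) -> bytes:
    """Re-encode legacy metadata as UTF-8, trusting the converter's assumed encoding."""
    return raw.decode(assumed_encoding).encode("utf-8")


# Hypothetical author field written in a Latin-2 environment: 0xB3 is "l with stroke".
legacy = b"Micha\xb3"

from_latin1 = convert_to_utf8(legacy, "latin-1")     # 0xB3 read as SUPERSCRIPT THREE
from_latin2 = convert_to_utf8(legacy, "iso-8859-2")  # 0xB3 read as U+0142

print(from_latin1)
print(from_latin2)
print(from_latin1 == from_latin2)
```

A machine pulling from both converted repositories would see two byte-wise different versions of what was originally one patch, which is what leaving old patches unconverted avoids.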
msg8056 (view)	Author: kowey	Date: 2009-08-09.19:17:45
On Sun, Aug 09, 2009 at 17:51:10 +0000, Reinier Lamers wrote: > > Note that this suggestion is orthogonal to the question of locale > > support in user IO as the patch log is part and parcel of the patch > > identifier and is meant only to indicate how Darcs stores the patch > > internally. > > But when we want to support locale in user IO, we do have to know what to > convert the patch name from when outputting it to the user. And right now we > don't. Right, so I think this means we're on the same page. I thought it was completely orthogonal, but you've corrected me by pointing out that it's not quite: if we're dealing with known-utf-8 patches we have to convert from that, otherwise we have to convert from ??? (which could mean trying the current locale and falling back to latin-1, as you said). And I think that our same-pageness extends to the idea that we can hopefully solve the patch internal representation stuff first and then tackle the user IO stuff next...
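The output side agreed on here could look roughly like the following Python sketch. The names `stored_to_text` and `for_display`, the `known_utf8` flag, and the use of replacement characters for unrepresentable output are all assumptions for illustration; the thread only fixes the order of the decoding attempts.

```python
def stored_to_text(raw: bytes, known_utf8: bool, locale_encoding: str) -> str:
    """Turn stored metadata bytes into text, following the order discussed above."""
    if known_utf8:
        return raw.decode("utf-8")
    try:
        return raw.decode(locale_encoding)  # first guess: the user's locale
    except UnicodeDecodeError:
        return raw.decode("latin-1")        # last resort: cannot fail


def for_display(raw: bytes, known_utf8: bool, locale_encoding: str) -> bytes:
    """Re-encode the text for the terminal; unrepresentable characters become '?'."""
    text = stored_to_text(raw, known_utf8, locale_encoding)
    return text.encode(locale_encoding, errors="replace")


# A known-UTF-8 patch name shown in a Latin-2 terminal.
print(for_display("\u0142".encode("utf-8"), True, "iso-8859-2"))
# An unmarked legacy patch name shown in a UTF-8 terminal.
print(for_display(b"\xb3", False, "utf-8"))
```

The locale encoding is passed in explicitly so that the behaviour does not depend on the machine running the example.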
msg9338 (view)	Author: tux_rocker	Date: 2009-11-15.17:53:15
The fix for /storing/ the metadata goes into 2.4. Actually being sane according to Unicode when matching or displaying will be fixed whenever someone (possibly me) gets around to it. Those two were not given a high priority in the discussion at the Vienna sprint.
msg10918 (view)	Author: tux_rocker	Date: 2010-05-04.06:50:25
The following patch updated the status of issue64 to be resolved: * resolve issue64: store metadata as UTF-8, autodetect UTF-8, and don't normalize to NFC Ignore-this: ae22511c7679f078412698f866d69255
msg11442 (view)	Author: tux_rocker	Date: 2010-06-15.22:32:19
The following patch updated issue issue64 with status=resolved;resolvedin=2.5.0 (current) * resolve issue64: store metadata as UTF-8, autodetect UTF-8, and don't normalize to NFC Ignore-this: ae22511c7679f078412698f866d69255
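The thread does not spell out why the resolving patch avoids NFC normalization. One consequence worth noting, given the earlier remark that the patch log is part of the patch identifier, is that normalization changes the stored bytes. A short Python illustration (the example character is arbitrary):

```python
import unicodedata

# The same visible character in two Unicode normalization forms.
nfc = unicodedata.normalize("NFC", "e\u0301")  # one code point, U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")   # "e" followed by U+0301

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes)
print(nfd_bytes)
print(nfc_bytes == nfd_bytes)
```

Metadata that is stored exactly as entered keeps the same bytes everywhere, whereas normalizing on storage would alter anything a user typed in decomposed form.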

History
Date	User	Action	Args
2005-12-17 12:06:21	tuomov1	create
2005-12-17 21:15:47	jch	set	nosy: + jch
2006-03-03 16:46:03	jch	set	status: unread -> wont-fix nosy: droundy, jch, tommy, tuomov1 messages: + msg546
2006-03-03 16:55:15	tuomov1	set	nosy: droundy, jch, tommy, tuomov1 messages: + msg547
2006-03-03 17:27:15	igloo	set	nosy: + igloo messages: + msg549
2006-03-03 17:29:47	jch	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg550
2006-03-03 17:46:58	igloo	set	status: wont-fix -> unknown nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg551
2006-03-06 13:42:38	droundy	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg555
2006-03-06 17:12:10	igloo	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg556
2006-05-04 17:35:09	jch	set	status: unknown -> wont-fix nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg626
2006-05-04 19:57:08	igloo	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg627
2006-05-05 10:48:42	droundy	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg628
2006-05-05 10:53:29	tuomov1	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg629
2006-05-09 15:33:59	jch	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg631
2006-05-09 17:14:10	tuomov1	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg632
2006-05-09 17:46:27	jch	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg633
2006-05-09 17:53:24	tuomov1	set	nosy: droundy, jch, tommy, tuomov1, igloo messages: + msg634
2006-09-18 17:36:31	droundy	set	nosy: + kowey messages: + msg993
2006-09-18 17:56:15	kowey	set	nosy: droundy, jch, tommy, kowey, tuomov1, igloo messages: + msg994
2006-09-18 18:11:42	droundy	set	nosy: droundy, jch, tommy, kowey, tuomov1, igloo messages: + msg995
2006-09-18 18:34:49	tuomov1	set	nosy: droundy, jch, tommy, kowey, tuomov1, igloo messages: + msg996
2006-09-18 18:46:03	tuomov1	set	nosy: droundy, jch, tommy, kowey, tuomov1, igloo messages: + msg997
2006-10-15 10:20:29	tuomov1	set	nosy: droundy, jch, tommy, kowey, tuomov1, igloo messages: + msg1092
2006-10-17 15:57:10	tommy	set	nosy: droundy, jch, tommy, kowey, tuomov1, igloo messages: + msg1116
2006-10-17 18:12:33	tuomov1	set	nosy: droundy, jch, tommy, kowey, tuomov1, igloo messages: + msg1121
2006-11-05 22:21:06	jch	set	nosy: droundy, jch, tommy, kowey, tuomov1, igloo messages: + msg1176
2006-11-05 23:08:01	tuomov1	set	nosy: droundy, jch, tommy, kowey, tuomov1, igloo messages: + msg1177
2008-12-23 09:09:03	mornfall	set	status: wont-fix -> unknown nosy: + dmitry.kurochkin, simon, mornfall, thorkilnaur messages: + msg6870 assignedto: mornfall
2008-12-23 18:03:46	droundy	set	nosy: - droundy
2008-12-23 21:56:54	kowey	set	nosy: + droundy, tuomov12345 messages: + msg6874
2009-02-06 02:09:34	twb	link	issue1143 superseder
2009-08-05 12:55:41	kowey	set	priority: wishlist -> feature nosy: droundy, jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, dmitry.kurochkin, mornfall messages: + msg8012
2009-08-05 12:56:47	kowey	set	nosy: - droundy
2009-08-09 11:24:38	tux_rocker	set	nosy: + tux_rocker messages: + msg8049 assignedto: mornfall -> tux_rocker
2009-08-09 13:46:54	kowey	set	nosy: jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall messages: + msg8051
2009-08-09 17:51:10	tux_rocker	set	nosy: jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall messages: + msg8055
2009-08-09 19:17:48	kowey	set	nosy: jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall messages: + msg8056
2009-08-19 10:39:53	kowey	set	status: unknown -> has-patch nosy: jch, tommy, kowey, tuomov1, igloo, tuomov12345, simon, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall topic: + Target-2.4
2009-08-25 17:18:16	admin	set	nosy: + darcs-devel, - igloo
2009-08-25 17:36:44	admin	set	nosy: - simon
2009-08-26 17:56:38	kowey	link	issue33 superseder
2009-08-27 14:33:42	admin	set	nosy: jch, tommy, kowey, darcs-devel, tuomov1, tuomov12345, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall
2009-09-14 10:52:22	kowey	set	topic: + Target-2.5, - Target-2.4 nosy: jch, tommy, kowey, darcs-devel, tuomov1, tuomov12345, thorkilnaur, tux_rocker, dmitry.kurochkin, mornfall
2009-10-23 22:47:19	admin	set	nosy: - tuomov1
2009-10-24 00:11:54	admin	set	nosy: + tuomov1, - tuomov12345
2009-10-24 00:40:53	admin	set	nosy: + tuomov, - tuomov1
2009-11-01 19:17:14	kowey	link	patch37 issues
2009-11-15 17:53:20	tux_rocker	set	topic: + Target-2.4, - Target-2.5 messages: + msg9338
2009-11-15 18:03:44	tux_rocker	link	issue1692 superseder
2009-11-15 18:08:14	tux_rocker	link	issue1693 superseder
2010-03-01 13:22:08	kowey	set	topic: + Target-2.5, - Target-2.4
2010-05-04 06:50:26	tux_rocker	set	status: has-patch -> resolved messages: + msg10918
2010-06-10 09:00:32	kowey	unlink	issue1143 superseder
2010-06-10 09:07:10	kowey	unlink	issue33 superseder
2010-06-15 20:52:08	admin	set	milestone: 2.5.0
2010-06-15 20:59:45	admin	set	topic: - Target-2.5
2010-06-15 22:32:20	tux_rocker	set	messages: + msg11442 resolvedin: 2.5.0
2014-11-26 06:34:21	ganesh	unlink	issue1693 superseder
