Issue 1863: unicode filenames - Darcs bug tracker

Title	unicode filenames
Priority	feature	Status	given-up
Milestone	3.0.0	Resolved in
Superseder		Nosy List	dmitry.kurochkin, kowey, mornfall, stephen, tux_rocker
Assigned To		Topics

Created on 2010-06-06.12:28:55 by kowey, last changed 2017-07-30.23:59:34 by gh.

Messages
msg11131	Author: kowey	Date: 2010-05-28.09:04:30
On Thu, May 27, 2010 at 12:15:56 +0000, Petr Ročkai wrote:
> Thu May 27 14:10:19 CEST 2010 Petr Rockai <me@mornfall.net>
> * Resolve issue1763: use correct filename encoding in conflictors.

OK, we've had two people (Reinier, Eric) look at this and OK it, so I guess it makes sense for me to push it now with some thoughts about future work.

Resolve issue1763: use correct filename encoding in conflictors.
----------------------------------------------------------------

> hunk ./src/Darcs/Patch/Real.hs 716
>      blueText "conflictor" <+> showNons i <+> blueText "[]" $$ showNon p
>  showPatch (Conflictor i cs p) =
>      blueText "conflictor" <+> showNons i <+> blueText "[" $$
> -    showPatch cs $$
> +    showPrimFL NewFormat cs $$
>      blueText "]" $$
>      showNon p
>  showPatch (InvConflictor i NilFL p) =

I'm still concerned that we're not being systematic enough about really fixing this (e.g. should we worry about rotcifnoc? showNon? etc.)

[The mental image I have is those old cartoons where you have the character on a boat and a leak forms, so he plugs it with a finger, and then another leak, and another finger, and another leak...]

I also notice this:

  instance ReadPatch Prim where
    readPatch' _ = readPrim OldFormat

  -- this and other darcs-2 format patches use readPrim NewFormat
  readNons :: (ReadPatch p, ParserM m) => m [Non p C(x)]
  readNons = peekfor "{{" rns (return [])
    where rns = peekfor "}}" (return []) $
                do Just (Sealed ps) <- readPatch' False
                   lexChar ':'
                   Just (Sealed p) <- readPrim NewFormat
                   (Non ps p :) `liftM` rns

and in the read code for Non and RealPatch (I think these are darcs-2 style patches), readPatch eventually uses readPrim NewFormat.

So that makes sense: the double-encoding comes from reading UTF-8 bytes [this is where Petr's assertion that "the filepath is never decoded" makes sense] as code-points, and then trying to encode those code-points into UTF-8 bytes.

Plan for future work? (Prim FileNameFormat)
-------------------------------------------

How does this plan sound: introduce two new wrapper types OldFormatPrim and NewFormatPrim whose read/show instances use OldFormat/NewFormat respectively, thus ensuring that readPatch and showPatch automagically do the right thing? (Or even one parametrisable type, although I imagine that involves turning on some extension for instances.)

Plan for future work? (two kinds of read/show)
----------------------------------------------

Complementary plan: we should distinguish between decoding/encoding filepaths from the operating system, and decoding/encoding filepaths to patch files and patch bundles.

Basically the picture looks like this:

  OS <--> darcs <---> patch files

The reason why I initially thought that NewFormat was a step backwards was that I was thinking about the darcs <--> patch files part. IMHO, what you want is for darcs <--> patch files to always use UTF-8. On the other hand, the OS <--> darcs part needs some more thought.

This is a little half-baked right now, but maybe somebody else can run with the idea?

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
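
The double-encoding described in this message can be reproduced outside Darcs. The following is only an illustrative sketch in Python, not Darcs code: the file name is a made-up example, the function names are hypothetical, and Latin-1 is used as a stand-in for "treat each byte as the code point of the same value".

```python
def double_encode(name: str) -> bytes:
    """Mimic the bug: UTF-8 bytes are misread as code points, then re-encoded."""
    utf8_bytes = name.encode("utf-8")
    # Latin-1 maps byte value N to code point N, which models
    # "reading UTF-8 bytes as code points" without any real decoding.
    misread = utf8_bytes.decode("latin-1")
    return misread.encode("utf-8")


def undo_double_encode(mangled: bytes) -> bytes:
    """Reverse the damage by applying the two steps backwards."""
    return mangled.decode("utf-8").encode("latin-1")


name = "árvíz"  # hypothetical file name with non-ASCII letters
correct = name.encode("utf-8")
mangled = double_encode(name)

print(correct)
print(mangled)
print(undo_double_encode(mangled) == correct)
```

Pure-ASCII names pass through unchanged, which is one reason this kind of bug can go unnoticed for a long time.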
msg11182	Author: tux_rocker	Date: 2010-06-02.06:33:06
Hi all,

On Friday 28 May 2010 at 11:05, Eric Kow wrote:
> Plan for future work? (two kinds of read/show)
> ----------------------------------------------
> Complementary plan: we should distinguish between decoding/encoding
> filepaths from the operating system, and decoding/encoding filepaths
> to patch files and patch bundles.
>
> Basically the picture looks like this:
>
>   OS <--> darcs <---> patch files
>
> The reason why I initially thought that NewFormat was a step backwards
> was that I was thinking about the darcs <--> patch files part. IMHO,
> what you want is for darcs <--> patch files to always use UTF-8. On the
> other hand, the OS <--> darcs part needs some more thought.
>
> This is a little half-baked right now, but maybe somebody else can run
> with the idea?

As a Unix guy I never think of filenames as text. If our patch files are UTF-8, how do we represent patches to the issue1763 repo with its Hungarian characters in a single-byte encoding? Note that if I copy these files to my machine that has a UTF-8 locale, the file names will still be valid single-byte Hungarian and invalid UTF-8!

The fact that even enterprisey Java does not have a good solution to this problem makes me feel hopeless about finding one for darcs.

We could of course say that for darcs, filenames are Unicode text. Then we should check upon darcs add that a filename is valid according to the current locale, and encode the file name using the locale encoding whenever darcs does filesystem operations on Unix. But refusing to 'darcs add' a file because its name does not fit our model may anger some users. And I haven't even thought about backward compatibility.

Reinier
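
To make the "valid single-byte Hungarian and invalid UTF-8" point concrete, here is a small Python sketch of the kind of validity check a 'darcs add' could perform. It is an illustration only: ISO-8859-2 is assumed as the single-byte encoding (the message does not name one), and the file name and the `is_valid_in` helper are hypothetical.

```python
def is_valid_in(raw: bytes, encoding: str) -> bool:
    """Return True if the raw file name bytes decode cleanly in `encoding`."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical file name as stored by a machine using ISO-8859-2 (assumed).
raw_name = "árvíztűrő".encode("iso-8859-2")

# The same bytes, judged on the original machine and on a UTF-8 machine.
print(is_valid_in(raw_name, "iso-8859-2"))
print(is_valid_in(raw_name, "utf-8"))
```

A check like this only says whether the bytes fit the current locale; it cannot tell which encoding the name was originally written in.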
msg11183	Author: stephen	Date: 2010-06-02.07:47:39
Reinier Lamers writes:
> As a Unix guy I never think of filenames as text. If our patch files are
> UTF-8, how do we represent patches to the issue1763 repo with its Hungarian
> characters in a single-byte encoding?

PEP 383 is the best solution I've seen yet.
http://www.python.org/dev/peps/pep-0383/
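
For readers who have not seen PEP 383 in action: Python exposes it as the `surrogateescape` error handler, which maps each undecodable byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF and back. The sketch below shows the round trip; the file name, the assumed ISO-8859-2 source encoding, and the two helper names are illustrative, and nothing here is Darcs code.

```python
def bytes_to_text(raw: bytes) -> str:
    """Decode as UTF-8, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    """Encode as UTF-8, turning escaped surrogates back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical legacy file name in a single-byte encoding (ISO-8859-2 assumed).
legacy = "árvíztűrő".encode("iso-8859-2")

text = bytes_to_text(legacy)
print(ascii(text))                     # ASCII letters survive, other bytes are escaped
print(text_to_bytes(text) == legacy)   # the original bytes come back unchanged
```

Names that are already valid UTF-8 decode to ordinary text, so the escape mechanism only kicks in for the problematic bytes.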
msg11258	Author: kowey	Date: 2010-06-06.12:35:27
This issue was split from patch252 (hmm! sharing msg objects between patches and issues).

I still think it'd be nice to think about having text filenames in Darcs 3. If I understand correctly, PEP 383, which Stephen pointed to, would make it possible to cope relatively gracefully with various encoding troubles that could crop up.

Now if it turns out I'm wrong and text filenames really are a bad idea, we can always wont-fix this ticket. But repo portability is a big feature to have IMHO.
msg11345	Author: tux_rocker	Date: 2010-06-10.06:18:31
Hi,

On Sunday 06 June 2010 at 14:35, you wrote:
> I still think it'd be nice to think about having text filenames in Darcs
> 3. If I understand correctly, PEP 383, which Stephen pointed to, would
> make it possible to cope relatively gracefully with various encoding
> troubles that could crop up.

About this PEP 383, I wonder whether we can also use it to embed these encoding-wise ill-formed filenames in XML, for instance. If we can't, we might be better off using some private use area of Unicode instead of lone surrogates.

Reinier
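
The private-use alternative can be sketched in Python as follows. This is purely illustrative: the choice of U+F700 as the base of the private-use block, and both function names, are assumptions made for the example and are not part of PEP 383 or Darcs. The last lines show one consequence of such a scheme: a name that genuinely contains one of the reserved private-use code points no longer round-trips.

```python
PUA_BASE = 0xF700  # hypothetical choice: byte b is stored as U+F700 + b


def decode_with_pua(raw: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte to a private-use code point."""
    # Let surrogateescape find the bad bytes, then move them into the PUA.
    text = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(PUA_BASE + (ord(c) - 0xDC00)) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


def encode_with_pua(text: str) -> bytes:
    """Inverse mapping: reserved private-use code points become raw bytes again."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out.extend(c.encode("utf-8"))
    return bytes(out)


legacy = b"\xe1rv\xedz"  # hypothetical name that is not valid UTF-8
print(encode_with_pua(decode_with_pua(legacy)) == legacy)

# A well-formed name that happens to use the reserved range is corrupted.
legit = "\uf7f5".encode("utf-8")
print(encode_with_pua(decode_with_pua(legit)) == legit)
```

Unlike lone surrogates, the decoded strings here are always well-formed Unicode and can be written out with a strict UTF-8 encoder.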
msg11347	Author: stephen	Date: 2010-06-10.08:46:11
Reinier Lamers writes:
>
> Reinier Lamers <tux_rocker@reinier.de> added the comment:
>
> Hi,
>
> On Sunday 06 June 2010 at 14:35, you wrote:
> > I still think it'd be nice to think about having text filenames in Darcs
> > 3. If I understand correctly, PEP 383, which Stephen pointed to, would
> > make it possible to cope relatively gracefully with various encoding
> > troubles that could crop up.
>
> About this PEP 383, I wonder whether we can also use it to embed
> these encoding-wise ill-formed filenames in XML, for instance.

No. By definition of UTF-16, the lone surrogates are also malformed, so they cannot appear in UTF-16-encoded XML. Surrogates cannot appear at all in UTF-8. Most Unicode applications will yell at you if you try this, so it's a bad idea. But they won't DTRT (do the right thing) with a private use area either, because that requires a non-standard private agreement on encoding. So basically these encodings will only be useful in darcs.

I'm not an XML expert, but IIRC there is some way to embed binary in XML, and that's what you would want to use rather than private use characters or surrogates.

> If we can't, we might be better off using some private use area of
> Unicode instead of lone surrogates.

This is equivalent, except that (a) all internal strings will be well-formed (good) but (b) you now need to worry about collisions with well-formed Unicode that happens to use the same private space (ugly).
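
Two of the points above can be checked with a short Python sketch: a strict UTF-8 encoder rejects lone surrogates, and raw file name bytes can instead be carried in XML as base64 text. The element name `filename`, the `encoding="base64"` attribute, and all function names are hypothetical choices for this example; the thread does not specify any XML layout, and base64 is only one possible way to embed binary data.

```python
import base64
import xml.etree.ElementTree as ET


def strict_utf8_rejects(text: str) -> bool:
    """Return True if a strict UTF-8 encoder refuses to encode `text`."""
    try:
        text.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True


def name_to_xml(raw: bytes) -> str:
    """Wrap raw file name bytes in a (hypothetical) base64-carrying element."""
    elem = ET.Element("filename", {"encoding": "base64"})
    elem.text = base64.b64encode(raw).decode("ascii")
    return ET.tostring(elem, encoding="unicode")


def name_from_xml(xml_text: str) -> bytes:
    """Recover the raw bytes from the element produced by name_to_xml."""
    elem = ET.fromstring(xml_text)
    return base64.b64decode(elem.text or "")


raw = b"\xe1rv\xedz"  # hypothetical name that is not valid UTF-8

print(strict_utf8_rejects("\udce1rv\udcedz"))  # the surrogate-escaped form of raw
print(name_to_xml(raw))
print(name_from_xml(name_to_xml(raw)) == raw)
```

The cost of this approach is that the file name is no longer human-readable in the XML, even when it is ordinary ASCII.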

History
Date	User	Action	Args
2010-06-06 12:28:55	kowey	create
2010-06-06 12:32:01	admin	set	nosy: + kowey, mornfall, stephen, tux_rocker messages: + msg11131, msg11182, msg11183
2010-06-06 12:35:27	kowey	set	messages: + msg11258
2010-06-10 06:18:32	tux_rocker	set	messages: + msg11345
2010-06-10 08:46:12	stephen	set	messages: + msg11347
2010-06-10 09:28:46	kowey	link	issue1143 superseder
2010-06-15 21:10:49	admin	set	topic: - Target-3.0
2010-06-15 21:10:49	admin	set	milestone: 3.0.0
2017-07-30 23:59:34	gh	set	status: deferred -> given-up