Patch 252: Resolve issue1763: use correct filename encoding in co...

Title	Resolve issue1763: use correct filename encoding in co...
Superseder		Nosy List	kowey, mornfall, stephen, tux_rocker
Related Issues	encoding of non-ascii file names into patches is broken; triggers assert later View: 1763
Status	accepted	Assigned To
Milestone

Created on 2010-05-27.12:15:56 by mornfall, last changed 2011-05-10.18:36:41 by darcswatch. Tracked on DarcsWatch.

Files
File name	Status	Uploaded	Type	Edit	Remove
resolve-issue1763_-use-correct-filename-encoding-in-conflictors_.dpatch		mornfall, 2010-05-27.12:15:56	text/x-darcs-patch
unnamed		mornfall, 2010-05-27.12:15:56

See mailing list archives for discussion on individual patches.

Messages
msg11119 (view)	Author: mornfall	Date: 2010-05-27.12:15:56
Hi, the patch itself should be self-explanatory. The catch was, that conflictors were written out using showPatch which uses the legacy OldFormat encoding. In reading them, this is not reflected and the result is bogus conflictors. Ideally, we'd just ban showPatch use on Prims, but this would require further changes. For now, we should hope that this does not happen anywhere else in darcs-2 patch core. Yours, Petr. 1 patch for repository darcs-unstable@darcs.net:darcs: Thu May 27 14:10:19 CEST 2010 Petr Rockai <me@mornfall.net> * Resolve issue1763: use correct filename encoding in conflictors. Attachments resolve-issue1763_-use-correct-filename-encoding-in-conflictors_.dpatch unnamed
msg11125 (view)	Author: kowey	Date: 2010-05-27.17:04:07
On Thu, May 27, 2010 at 12:15:56 +0000, Petr Ročkai wrote: > the patch itself should be self-explanatory. The catch was, that conflictors > were written out using showPatch which uses the legacy OldFormat encoding. In > reading them, this is not reflected and the result is bogus conflictors. Nice work. I hesitate to apply this because I'm not paying enough this patch the attention it deserves, but in case my comments help Petr or another reviewer... The context behind this patch is that we have two styles of filename, the second style being introduced in Darcs 2 (Petr claims that that the new format patches are only used for darcs-2 patches and not say darcs-1 patches. I hope that's right, but in any case it's unlikely that darcs-2 patches would use OldFormat!). (Code is slightly paraphrased for hopefully clarity) data FileNameFormat = OldFormat \| NewFormat readFileName :: FileNameFormat -> B.ByteString -> FileName readFileName OldFormat = fp2fn . decodeWhite . unpackPSFromUTF8 readFileName NewFormat = fp2fn . decodeWhite . BC.unpack formatFileName :: FileNameFormat -> FileName -> Doc formatFileName OldFormat = packedString . packStringToUTF8 . encodeWhite . fn2fp formatFileName NewFormat = text . encodeWhite . fn2fp (packedString and text are just wrappers around the Doc constructors) So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode filenames encoded in UTF-8. Does this mean that in the NewFormat, we just treat filenames as just sequences of bytes? If so that (very superficially and unthinkingly said) sounds like a step backwards. I wonder what exactly the thinking behind this was... One thing that worries me though is if printing filenames will ever locale-dependent in GHC 6.12. I seem to remember that the put/get audit said that we're always outputting in binary mode where it matters (and so treating the Strings as padded octets a la GHC 6.10) > Ideally, we'd just ban showPatch use on Prims, but this would require further > changes. For now, we should hope that this does not happen anywhere else in > darcs-2 patch core. Alternatively, we would have patches that know if they are NewFormat or OldFormat (and then safely use showPatch?) > +showPrimFL :: FileNameFormat -> FL Prim C(a b) -> Doc > +showPrimFL f xs = vcat (mapFL (showPrim f) xs) Presumably this comes from the instance of ShowPatch for FL (which makes sense). > - showPatch cs $$ > + showPrimFL NewFormat cs $$ What about showNon and the rotcilfnoc case? Are there other cases where we'd need to pay attention to this? -- Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow> PGP Key ID: 08AC04F9
msg11126 (view)	Author: kowey	Date: 2010-05-27.17:10:58
On Thu, May 27, 2010 at 18:04:36 +0100, Eric Kow wrote: > readFileName :: FileNameFormat -> B.ByteString -> FileName > readFileName OldFormat = fp2fn . decodeWhite . unpackPSFromUTF8 > readFileName NewFormat = fp2fn . decodeWhite . BC.unpack > > formatFileName :: FileNameFormat -> FileName -> Doc > formatFileName OldFormat = packedString . packStringToUTF8 . encodeWhite . fn2fp > formatFileName NewFormat = text . encodeWhite . fn2fp > So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode > filenames encoded in UTF-8. Does this mean that in the NewFormat, we just > treat filenames as just sequences of bytes? Argh, sorry I'm being super super sloppy in my thinking here. Darcs.Patch.FileName is just a newtype around String so as far as Darcs is concerned, FileNames are Unicode Strings. The question is just what happens when you encode/decode. In the old-style, whenever we have a bytestring that corresponds to a filename, we assume it was UTF-8 encoded (which seems wrong, which may be the motivation behind the switch). In the new-style, we seem to assume it's a superset of ISO 8859-1 (ie. each octet = Unicode code point, unless Data.ByteString.Char really is checking to see if the code point corresponds to holes in the code page, which I). But in any case, Unicode inside. It's the conversion to/from that worries me... Anyway, I'm going to shut up before I add any misinformation to the mix. Hopefully Reinier is paying attention and will yell at any stupid things I've said, -- Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow> PGP Key ID: 08AC04F9
msg11127 (view)	Author: mornfall	Date: 2010-05-27.17:48:28
Hi, Eric Kow <kowey@darcs.net> writes: > So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode > filenames encoded in UTF-8. Does this mean that in the NewFormat, we just > treat filenames as just sequences of bytes? If so that (very superficially > and unthinkingly said) sounds like a step backwards. I wonder what exactly > the thinking behind this was... just to put this line of reasoning back on the right track... The story is that UNIX stores a sequence of bytes for a filepath. This could be basically anything. We never decode anything we read from the filesystem, either. So in OldFormat, what happens is that a sequence of bytes (the filepath) is taken and encoded as a sequence of codepoints < 256 and stored as UTF8. With new format, the "store as UTF8" bit is skipped, and we just store bytes. It needs to be emphasised, that the UTF8 step in OldFormat is completely superfluous, as the filepath is never decoded, so the codepoints are completely bogus. Whether we encode them as UTF8 or as raw bytes doesn't matter, other than that there is an encoding difference for those filepath bytes that fall between 127 and 256 -- with OldFormat, each takes 2 storage bytes while with NewFormat they just use a single storage byte. The net result is that in the common case (i.e. UTF8 filenames), OldFormat is storing double-encoded UTF8. Hope this makes things clearer, and it shows why NewFormat is actually a step forward. Yours, Petr.
msg11128 (view)	Author: kowey	Date: 2010-05-27.18:27:06
On Thu, May 27, 2010 at 17:48:28 +0000, Petr Ročkai wrote: > It needs to be emphasised, that the UTF8 step in OldFormat is completely > superfluous, as the filepath is never decoded, so the codepoints are > completely bogus. Oh! Hmm, did I just draw the wrong conclusion from the below? readFileName :: FileNameFormat -> B.ByteString -> FileName readFileName OldFormat = ps2fn readFileName NewFormat = fp2fn . decodeWhite . BC.unpack ps2fn :: B.ByteString -> FileName ps2fn ps = FN $ decodeWhite $ unpackPSFromUTF8 ps I assume (without checking) that this function readFileName is used when reading patches from disk. (is that right?) Also what do you mean when you say Unix? I hope that's not a dumb question. I just hear that on MacOS X you have this thing going on where filenames are automagically normalised into NFD? So presumably somewhere along the way you have some kind of low-level Unicode awareness as opposed to treating filenames as just sequences of bytes. Also should we be worrying about what happens with Windows? -- Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow> PGP Key ID: 08AC04F9
msg11129 (view)	Author: tux_rocker	Date: 2010-05-27.21:22:27
Hi all, Op donderdag 27 mei 2010 20:27 schreef Eric Kow: > Also what do you mean when you say Unix? I hope that's not a dumb > question. I just hear that on MacOS X you have this thing going on > where filenames are automagically normalised into NFD? So presumably > somewhere along the way you have some kind of low-level Unicode > awareness as opposed to treating filenames as just sequences of bytes. I don't know about OS X. I have the same question you have here. > Also should we be worrying about what happens with Windows? On Windows, filenames are Unicode strings and not sequences of bytes as on Linux. This discrepancy can bring down Java actually. Try creating a file with an invalid UTF-8 name and feeding it to a Java program via the command line on a modern Linux distro. Java won't be able to find it. I talked about this with Duncan Coutts on #ghc and he said that this is why command line arguments are always interpreted as latin1 by GHC. So even a file has a name with multibyte characters, you'll get the individual bytes as Char's in the String's that you get from getArgs. So these Char's should be < 256 and should work. But I admit that sounds a bit flaky. Petr's solution looks alright to me BTW. Reinier > > -- > Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow> > PGP Key ID: 08AC04F9 > > __________________________________ > Darcs bug tracker <bugs@darcs.net> > <http://bugs.darcs.net/patch252> > __________________________________ > _______________________________________________ > darcs-users mailing list > darcs-users@darcs.net > http://lists.osuosl.org/mailman/listinfo/darcs-users >
msg11131 (view)	Author: kowey	Date: 2010-05-28.09:04:30
On Thu, May 27, 2010 at 12:15:56 +0000, Petr Ročkai wrote: > Thu May 27 14:10:19 CEST 2010 Petr Rockai <me@mornfall.net> > * Resolve issue1763: use correct filename encoding in conflictors. OK, we've had two people (Reinier, Eric) look at this and OK it, so I guess it makes sense for me to push it now with some thoughts about future work. Resolve issue1763: use correct filename encoding in conflictors. ---------------------------------------------------------------- > hunk ./src/Darcs/Patch/Real.hs 716 > blueText "conflictor" <+> showNons i <+> blueText "[]" $$ showNon p > showPatch (Conflictor i cs p) = > blueText "conflictor" <+> showNons i <+> blueText "[" $$ > - showPatch cs $$ > + showPrimFL NewFormat cs $$ > blueText "]" $$ > showNon p > showPatch (InvConflictor i NilFL p) = > I'm still concerned that we're not being systematic enough about really fixing this (eg. show we worry about rotcifnoc? showNon? etc) [The mental image I have is those old cartoons where you have the character on a boat and a leak forms, so he plugs it with a finger, and then another leak, and another finger, and another leak...] I also notice this: instance ReadPatch Prim where readPatch' _ = readPrim OldFormat -- this and other darcs-2 format patches use readPrim NewFormat readNons :: (ReadPatch p, ParserM m) => m [Non p C(x)] readNons = peekfor "{{" rns (return []) where rns = peekfor "}}" (return []) $ do Just (Sealed ps) <- readPatch' False lexChar ':' Just (Sealed p) <- readPrim NewFormat (Non ps p :) `liftM` rns and in the read code for Non and RealPatch (I think these are darcs-2 style patches), readPatch eventually uses readPrim NewFormat. So that makes sense: the double-encoding comes from reading UTF-8 bytes [this is where Petr's assertion that "the filepath is never decoded" makes sense] as code-points, and then trying to encode those code-points into UTF-8 bytes. Plan for future work? (Prim FileNameFormat) ------------------------------------------- How does this plan sound: introduce two new wrapper types OldFormatPrim and NewFormatPrim whose read/show instances use OldFormat/NewFormat respectively, thus ensuring that readPatch and showPatch automagically do the right thing? (or even one parametrisable type (although I imagine that involves turning on some extension for instances)) Plan for future work? (two kinds of read/show) ---------------------------------------------- Complementary plan: we should distinguish between decoding/encoding filepaths from the operating system, and decoding/encoding filepaths to patch files and patch bundles. Basically the picture looks like this: OS <--> darcs <---> patch files The reason why I initially thought that NewFormat was a step backwards was that I was thinking about the darcs <--> patch files part. IMHO, what you want is for darcs <--> patch files to always use UTF-8. On the other hand, the OS <--> darcs part needs some more thought. This is a little half-baked right, but maybe somebody else can run with the idea? -- Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow> PGP Key ID: 08AC04F9
msg11132 (view)	Author: darcswatch	Date: 2010-05-28.09:04:34
This patch bundle (with 1 patches) was just applied to the repository http://darcs.net/. This message was brought to you by DarcsWatch http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-12e2dfba7e5fb635ffde5856e22f2dfc6f0fe34b
msg11182 (view)	Author: tux_rocker	Date: 2010-06-02.06:33:06
Hi all, Op vrijdag 28 mei 2010 11:05 schreef Eric Kow: > Plan for future work? (two kinds of read/show) > ---------------------------------------------- > Complementary plan: we should distinguish between decoding/encoding > filepaths from the operating system, and decoding/encoding filepaths > to patch files and patch bundles. > > Basically the picture looks like this: > > OS <--> darcs <---> patch files > > The reason why I initially thought that NewFormat was a step backwards > was that I was thinking about the darcs <--> patch files part. IMHO, > what you want is for darcs <--> patch files to always use UTF-8. On the > other hand, the OS <--> darcs part needs some more thought. > > This is a little half-baked right, but maybe somebody else can run with > the idea? As a Unix guy I never think of filenames as text. If our patch files are UTF-8, how do we represent patches to the issue1763 repo with its Hungarian characters in a single-byte encoding? Note that if I copy these files to my machine that has a UTF-8 locale, the file names will still be valid single- byte-hungarian and invalid UTF-8! Given that even enterprisey Java does not have a good solution to this problem makes me feel hopeless about finding one for darcs. We could of course say that for darcs, filenames are Unicode text. Then we should check upon darcs add that a filename is valid according to the current locale, and encode the file name using the locale encoding whenever darcs does filesystem operations on Unix. But refusing to 'darcs add' a file because its name does not fit our model may anger some users. And I haven't even thought about backward compatibility. Reinier
msg11183 (view)	Author: stephen	Date: 2010-06-02.07:47:39
Reinier Lamers writes: > As a Unix guy I never think of filenames as text. If our patch files are > UTF-8, how do we represent patches to the issue1763 repo with its Hungarian > characters in a single-byte encoding? PEP 383 is the best solution I've seen yet. http://www.python.org/dev/peps/pep-0383/
msg14132 (view)	Author: darcswatch	Date: 2011-05-10.18:36:41
This patch bundle (with 1 patches) was just applied to the repository http://darcs.net/reviewed. This message was brought to you by DarcsWatch http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-12e2dfba7e5fb635ffde5856e22f2dfc6f0fe34b

History
Date	User	Action	Args
2010-05-27 12:15:56	mornfall	create
2010-05-27 12:42:34	darcswatch	set	darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-12e2dfba7e5fb635ffde5856e22f2dfc6f0fe34b
2010-05-27 17:04:08	kowey	set	nosy: + kowey messages: + msg11125
2010-05-27 17:10:58	kowey	set	messages: + msg11126
2010-05-27 17:48:28	mornfall	set	messages: + msg11127
2010-05-27 18:27:06	kowey	set	messages: + msg11128
2010-05-27 21:22:27	tux_rocker	set	nosy: + tux_rocker messages: + msg11129
2010-05-28 09:04:31	kowey	set	messages: + msg11131
2010-05-28 09:04:34	darcswatch	set	status: needs-review -> accepted messages: + msg11132
2010-06-02 06:33:06	tux_rocker	set	messages: + msg11182
2010-06-02 07:47:39	stephen	set	nosy: + stephen messages: + msg11183
2010-06-22 07:28:45	ganesh	set	issues: + encoding of non-ascii file names into patches is broken; triggers assert later
2011-05-10 18:36:41	darcswatch	set	messages: + msg14132

Patch 252 Resolve issue1763: use correct filename encoding in co...

Attachments