darcs

Patch 252 Resolve issue1763: use correct filename encoding in co...

Title Resolve issue1763: use correct filename encoding in co...
Superseder Nosy List kowey, mornfall, stephen, tux_rocker
Related Issues encoding of non-ascii file names into patches is broken; triggers assert later
View: 1763
Status accepted Assigned To
Milestone

Created on 2010-05-27.12:15:56 by mornfall, last changed 2011-05-10.18:36:41 by darcswatch. Tracked on DarcsWatch.

Files
File name Status Uploaded Type Edit Remove
resolve-issue1763_-use-correct-filename-encoding-in-conflictors_.dpatch mornfall, 2010-05-27.12:15:56 text/x-darcs-patch
unnamed mornfall, 2010-05-27.12:15:56
See mailing list archives for discussion on individual patches.
Messages
msg11119 (view) Author: mornfall Date: 2010-05-27.12:15:56
Hi,

the patch itself should be self-explanatory. The catch was, that conflictors
were written out using showPatch which uses the legacy OldFormat encoding. In
reading them, this is not reflected and the result is bogus conflictors.

Ideally, we'd just ban showPatch use on Prims, but this would require further
changes. For now, we should hope that this does not happen anywhere else in
darcs-2 patch core.

Yours,
   Petr.

1 patch for repository darcs-unstable@darcs.net:darcs:

Thu May 27 14:10:19 CEST 2010  Petr Rockai <me@mornfall.net>
  * Resolve issue1763: use correct filename encoding in conflictors.
Attachments
msg11125 (view) Author: kowey Date: 2010-05-27.17:04:07
On Thu, May 27, 2010 at 12:15:56 +0000, Petr Ročkai wrote:
> the patch itself should be self-explanatory. The catch was, that conflictors
> were written out using showPatch which uses the legacy OldFormat encoding. In
> reading them, this is not reflected and the result is bogus conflictors.

Nice work.

I hesitate to apply this because I'm not paying enough this patch the attention
it deserves, but in case my comments help Petr or another reviewer...

The context behind this patch is that we have two styles of filename, the
second style being introduced in Darcs 2 (Petr claims that that the new format
patches are only used for darcs-2 patches and not say darcs-1 patches.  I hope
that's right, but in any case it's unlikely that darcs-2 patches would use
OldFormat!).

(Code is slightly paraphrased for hopefully clarity)

data FileNameFormat = OldFormat | NewFormat

readFileName :: FileNameFormat -> B.ByteString -> FileName
readFileName OldFormat = fp2fn . decodeWhite . unpackPSFromUTF8
readFileName NewFormat = fp2fn . decodeWhite . BC.unpack

formatFileName :: FileNameFormat -> FileName -> Doc
formatFileName OldFormat = packedString . packStringToUTF8 . encodeWhite . fn2fp
formatFileName NewFormat = text                            . encodeWhite . fn2fp

(packedString and text are just wrappers around the Doc constructors)

So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode
filenames encoded in UTF-8.  Does this mean that in the NewFormat, we just
treat filenames as just sequences of bytes?  If so that (very superficially
and unthinkingly said) sounds like a step backwards.  I wonder what exactly
the thinking behind this was...  One thing that worries me though is if
printing filenames will ever locale-dependent in GHC 6.12.  I seem to remember
that the put/get audit said that we're always outputting in binary mode where
it matters (and so treating the Strings as padded octets a la GHC 6.10)

> Ideally, we'd just ban showPatch use on Prims, but this would require further
> changes. For now, we should hope that this does not happen anywhere else in
> darcs-2 patch core.

Alternatively, we would have patches that know if they are NewFormat or
OldFormat (and then safely use showPatch?)

> +showPrimFL :: FileNameFormat -> FL Prim C(a b) -> Doc
> +showPrimFL f xs = vcat (mapFL (showPrim f) xs)

Presumably this comes from the instance of ShowPatch for FL (which makes
sense).

> -        showPatch cs $$
> +        showPrimFL NewFormat cs $$

What about showNon and the rotcilfnoc case?  Are there other cases where
we'd need to pay attention to this?

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
msg11126 (view) Author: kowey Date: 2010-05-27.17:10:58
On Thu, May 27, 2010 at 18:04:36 +0100, Eric Kow wrote:
> readFileName :: FileNameFormat -> B.ByteString -> FileName
> readFileName OldFormat = fp2fn . decodeWhite . unpackPSFromUTF8
> readFileName NewFormat = fp2fn . decodeWhite . BC.unpack
> 
> formatFileName :: FileNameFormat -> FileName -> Doc
> formatFileName OldFormat = packedString . packStringToUTF8 . encodeWhite . fn2fp
> formatFileName NewFormat = text                            . encodeWhite . fn2fp

> So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode
> filenames encoded in UTF-8.  Does this mean that in the NewFormat, we just
> treat filenames as just sequences of bytes? 

Argh, sorry I'm being super super sloppy in my thinking here.

Darcs.Patch.FileName is just a newtype around String so as far as Darcs
is concerned, FileNames are Unicode Strings.  The question is just what
happens when you encode/decode.

In the old-style, whenever we have a bytestring that corresponds to a
filename, we assume it was UTF-8 encoded (which seems wrong, which may
be the motivation behind the switch).  In the new-style, we *seem* to
assume it's a superset of ISO 8859-1 (ie. each octet = Unicode code
point, unless Data.ByteString.Char really is checking to see if the code
point corresponds to holes in the code page, which I).

But in any case, Unicode inside.
It's the conversion to/from that worries me...

Anyway, I'm going to shut up before I add any misinformation to the mix.
Hopefully Reinier is paying attention and will yell at any stupid things
I've said,

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
msg11127 (view) Author: mornfall Date: 2010-05-27.17:48:28
Hi,

Eric Kow <kowey@darcs.net> writes:
> So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode
> filenames encoded in UTF-8.  Does this mean that in the NewFormat, we just
> treat filenames as just sequences of bytes?  If so that (very superficially
> and unthinkingly said) sounds like a step backwards.  I wonder what exactly
> the thinking behind this was...
just to put this line of reasoning back on the right track... The story
is that UNIX stores a sequence of bytes for a filepath. This could be
basically anything. We never decode anything we read from the
filesystem, either. So in OldFormat, what happens is that a sequence of
bytes (the filepath) is taken and encoded as a sequence of codepoints <
256 and stored as UTF8.

With new format, the "store as UTF8" bit is skipped, and we just store
bytes.

It needs to be emphasised, that the UTF8 step in OldFormat is completely
superfluous, as the filepath *is never decoded*, so the codepoints are
completely bogus. Whether we encode them as UTF8 or as raw bytes doesn't
matter, other than that there is an encoding difference for those
filepath bytes that fall between 127 and 256 -- with OldFormat, each
takes 2 storage bytes while with NewFormat they just use a single
storage byte.

The net result is that in the common case (i.e. UTF8 filenames),
OldFormat is storing double-encoded UTF8.

Hope this makes things clearer, and it shows why NewFormat is actually a
step forward.

Yours,
   Petr.
msg11128 (view) Author: kowey Date: 2010-05-27.18:27:06
On Thu, May 27, 2010 at 17:48:28 +0000, Petr Ročkai wrote:
> It needs to be emphasised, that the UTF8 step in OldFormat is completely
> superfluous, as the filepath *is never decoded*, so the codepoints are
> completely bogus.

Oh!  Hmm, did I just draw the wrong conclusion from the below?

readFileName :: FileNameFormat -> B.ByteString -> FileName
readFileName OldFormat = ps2fn
readFileName NewFormat = fp2fn . decodeWhite . BC.unpack

ps2fn :: B.ByteString -> FileName
ps2fn ps = FN $ decodeWhite $ unpackPSFromUTF8 ps

I assume (without checking) that this function readFileName is used
when reading patches from disk.  (is that right?)

Also what do you mean when you say Unix?  I hope that's not a dumb
question.  I just hear that on MacOS X you have this thing going on
where filenames are automagically normalised into NFD?  So presumably
somewhere along the way you have some kind of low-level Unicode
awareness as opposed to treating filenames as just sequences of bytes.

Also should we be worrying about what happens with Windows?

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
msg11129 (view) Author: tux_rocker Date: 2010-05-27.21:22:27
Hi all,

Op donderdag 27 mei 2010 20:27 schreef Eric Kow:
> Also what do you mean when you say Unix?  I hope that's not a dumb
> question.  I just hear that on MacOS X you have this thing going on
> where filenames are automagically normalised into NFD?  So presumably
> somewhere along the way you have some kind of low-level Unicode
> awareness as opposed to treating filenames as just sequences of bytes.

I don't know about OS X. I have the same question you have here.

> Also should we be worrying about what happens with Windows?

On Windows, filenames are Unicode strings and not sequences of bytes as on 
Linux. This discrepancy can bring down Java actually. Try creating a file with 
an invalid UTF-8 name and feeding it to a Java program via the command line on 
a modern Linux distro. Java won't be able to find it.

I talked about this with Duncan Coutts on #ghc and he said that this is why 
command line arguments are always interpreted as latin1 by GHC. So even a file 
has a name with multibyte characters, you'll get the individual bytes as 
Char's in the String's that you get from getArgs. So these Char's should be < 
256 and should work. But I admit that sounds a bit flaky.

Petr's solution looks alright to me BTW.

Reinier





> 
> -- 
> Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
> PGP Key ID: 08AC04F9
> 
> __________________________________
> Darcs bug tracker <bugs@darcs.net>
> <http://bugs.darcs.net/patch252>
> __________________________________
> _______________________________________________
> darcs-users mailing list
> darcs-users@darcs.net
> http://lists.osuosl.org/mailman/listinfo/darcs-users
>
msg11131 (view) Author: kowey Date: 2010-05-28.09:04:30
On Thu, May 27, 2010 at 12:15:56 +0000, Petr Ročkai wrote:
> Thu May 27 14:10:19 CEST 2010  Petr Rockai <me@mornfall.net>
>   * Resolve issue1763: use correct filename encoding in conflictors.

OK, we've had two people (Reinier, Eric) look at this and OK it, so I guess it
makes sense for me to push it now with some thoughts about future work.

Resolve issue1763: use correct filename encoding in conflictors.
----------------------------------------------------------------
> hunk ./src/Darcs/Patch/Real.hs 716
>          blueText "conflictor" <+> showNons i <+> blueText "[]" $$ showNon p
>      showPatch (Conflictor i cs p) =
>          blueText "conflictor" <+> showNons i <+> blueText "[" $$
> -        showPatch cs $$
> +        showPrimFL NewFormat cs $$
>          blueText "]" $$
>          showNon p
>      showPatch (InvConflictor i NilFL p) =
> 

I'm still concerned that we're not being systematic enough about really
fixing this (eg. show we worry about rotcifnoc? showNon? etc)

[The mental image I have is those old cartoons where you have the
 character on a boat and a leak forms, so he plugs it with a finger,
 and then another leak, and another finger, and another leak...]

I also notice this:

  instance ReadPatch Prim where
    readPatch' _ = readPrim OldFormat

  -- this and other darcs-2 format patches use readPrim NewFormat
  readNons :: (ReadPatch p, ParserM m) => m [Non p C(x)]
  readNons = peekfor "{{" rns (return [])
      where rns = peekfor "}}" (return []) $
                  do Just (Sealed ps) <- readPatch' False
                     lexChar ':'
                     Just (Sealed p) <- readPrim NewFormat
                     (Non ps p :) `liftM` rns
     

and in the read code for Non and RealPatch (I think these are darcs-2 style
patches), readPatch eventually uses readPrim NewFormat.  So that makes sense:
the double-encoding comes from reading UTF-8 bytes [this is where Petr's
assertion that "the filepath *is never decoded*" makes sense] as code-points,
and then trying to encode those code-points into UTF-8 bytes.

Plan for future work? (Prim FileNameFormat)
-------------------------------------------
How does this plan sound: introduce two new wrapper types OldFormatPrim and
NewFormatPrim whose read/show instances use OldFormat/NewFormat
respectively, thus ensuring that readPatch and showPatch automagically
do the right thing?

(or even one parametrisable type (although I imagine that involves turning
on some extension for instances))

Plan for future work? (two kinds of read/show)
----------------------------------------------
Complementary plan: we should distinguish between decoding/encoding
filepaths from the operating system, and decoding/encoding filepaths
to patch files and patch bundles.

Basically the picture looks like this:

    OS <--> darcs <---> patch files

The reason why I initially thought that NewFormat was a step backwards
was that I was thinking about the darcs <--> patch files part.  IMHO,
what you want is for darcs <--> patch files to always use UTF-8.  On the
other hand, the OS <--> darcs part needs some more thought.

This is a little half-baked right, but maybe somebody else can run with
the idea?

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
msg11132 (view) Author: darcswatch Date: 2010-05-28.09:04:34
This patch bundle (with 1 patches) was just applied to the repository http://darcs.net/.
This message was brought to you by DarcsWatch
http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-12e2dfba7e5fb635ffde5856e22f2dfc6f0fe34b
msg11182 (view) Author: tux_rocker Date: 2010-06-02.06:33:06
Hi all,

Op vrijdag 28 mei 2010 11:05 schreef Eric Kow:
> Plan for future work? (two kinds of read/show)
> ----------------------------------------------
> Complementary plan: we should distinguish between decoding/encoding
> filepaths from the operating system, and decoding/encoding filepaths
> to patch files and patch bundles.
> 
> Basically the picture looks like this:
> 
>     OS <--> darcs <---> patch files
> 
> The reason why I initially thought that NewFormat was a step backwards
> was that I was thinking about the darcs <--> patch files part.  IMHO,
> what you want is for darcs <--> patch files to always use UTF-8.  On the
> other hand, the OS <--> darcs part needs some more thought.
> 
> This is a little half-baked right, but maybe somebody else can run with
> the idea?

As a Unix guy I never think of filenames as text. If our patch files are 
UTF-8, how do we represent patches to the issue1763 repo with its Hungarian 
characters in a single-byte encoding? Note that if I copy these files to my 
machine that has a UTF-8 locale, the file names will still be valid single-
byte-hungarian and invalid UTF-8!
Given that even enterprisey Java does not have a good solution to this problem 
makes me feel hopeless about finding one for darcs.

We could of course say that for darcs, filenames are Unicode text. Then we 
should check upon darcs add that a filename is valid according to the current 
locale, and encode the file name using the locale encoding whenever darcs does 
filesystem operations on Unix. But refusing to 'darcs add' a file because its 
name does not fit our model may anger some users. And I haven't even thought 
about backward compatibility.

Reinier
msg11183 (view) Author: stephen Date: 2010-06-02.07:47:39
Reinier Lamers writes:

 > As a Unix guy I never think of filenames as text. If our patch files are 
 > UTF-8, how do we represent patches to the issue1763 repo with its Hungarian 
 > characters in a single-byte encoding?

PEP 383 is the best solution I've seen yet.

http://www.python.org/dev/peps/pep-0383/
msg14132 (view) Author: darcswatch Date: 2011-05-10.18:36:41
This patch bundle (with 1 patches) was just applied to the repository http://darcs.net/reviewed.
This message was brought to you by DarcsWatch
http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-12e2dfba7e5fb635ffde5856e22f2dfc6f0fe34b
History
Date User Action Args
2010-05-27 12:15:56mornfallcreate
2010-05-27 12:42:34darcswatchsetdarcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-12e2dfba7e5fb635ffde5856e22f2dfc6f0fe34b
2010-05-27 17:04:08koweysetnosy: + kowey
messages: + msg11125
2010-05-27 17:10:58koweysetmessages: + msg11126
2010-05-27 17:48:28mornfallsetmessages: + msg11127
2010-05-27 18:27:06koweysetmessages: + msg11128
2010-05-27 21:22:27tux_rockersetnosy: + tux_rocker
messages: + msg11129
2010-05-28 09:04:31koweysetmessages: + msg11131
2010-05-28 09:04:34darcswatchsetstatus: needs-review -> accepted
messages: + msg11132
2010-06-02 06:33:06tux_rockersetmessages: + msg11182
2010-06-02 07:47:39stephensetnosy: + stephen
messages: + msg11183
2010-06-22 07:28:45ganeshsetissues: + encoding of non-ascii file names into patches is broken; triggers assert later
2011-05-10 18:36:41darcswatchsetmessages: + msg14132