Created on 2010-05-27.12:15:56 by mornfall, last changed 2011-05-10.18:36:41 by darcswatch. Tracked on DarcsWatch.
See mailing list archives
for discussion on individual patches.
msg11119 (view) |
Author: mornfall |
Date: 2010-05-27.12:15:56 |
|
Hi,
the patch itself should be self-explanatory. The catch was, that conflictors
were written out using showPatch which uses the legacy OldFormat encoding. In
reading them, this is not reflected and the result is bogus conflictors.
Ideally, we'd just ban showPatch use on Prims, but this would require further
changes. For now, we should hope that this does not happen anywhere else in
darcs-2 patch core.
Yours,
Petr.
1 patch for repository darcs-unstable@darcs.net:darcs:
Thu May 27 14:10:19 CEST 2010 Petr Rockai <me@mornfall.net>
* Resolve issue1763: use correct filename encoding in conflictors.
Attachments
|
msg11125 (view) |
Author: kowey |
Date: 2010-05-27.17:04:07 |
|
On Thu, May 27, 2010 at 12:15:56 +0000, Petr Ročkai wrote:
> the patch itself should be self-explanatory. The catch was, that conflictors
> were written out using showPatch which uses the legacy OldFormat encoding. In
> reading them, this is not reflected and the result is bogus conflictors.
Nice work.
I hesitate to apply this because I'm not paying enough this patch the attention
it deserves, but in case my comments help Petr or another reviewer...
The context behind this patch is that we have two styles of filename, the
second style being introduced in Darcs 2 (Petr claims that that the new format
patches are only used for darcs-2 patches and not say darcs-1 patches. I hope
that's right, but in any case it's unlikely that darcs-2 patches would use
OldFormat!).
(Code is slightly paraphrased for hopefully clarity)
data FileNameFormat = OldFormat | NewFormat
readFileName :: FileNameFormat -> B.ByteString -> FileName
readFileName OldFormat = fp2fn . decodeWhite . unpackPSFromUTF8
readFileName NewFormat = fp2fn . decodeWhite . BC.unpack
formatFileName :: FileNameFormat -> FileName -> Doc
formatFileName OldFormat = packedString . packStringToUTF8 . encodeWhite . fn2fp
formatFileName NewFormat = text . encodeWhite . fn2fp
(packedString and text are just wrappers around the Doc constructors)
So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode
filenames encoded in UTF-8. Does this mean that in the NewFormat, we just
treat filenames as just sequences of bytes? If so that (very superficially
and unthinkingly said) sounds like a step backwards. I wonder what exactly
the thinking behind this was... One thing that worries me though is if
printing filenames will ever locale-dependent in GHC 6.12. I seem to remember
that the put/get audit said that we're always outputting in binary mode where
it matters (and so treating the Strings as padded octets a la GHC 6.10)
> Ideally, we'd just ban showPatch use on Prims, but this would require further
> changes. For now, we should hope that this does not happen anywhere else in
> darcs-2 patch core.
Alternatively, we would have patches that know if they are NewFormat or
OldFormat (and then safely use showPatch?)
> +showPrimFL :: FileNameFormat -> FL Prim C(a b) -> Doc
> +showPrimFL f xs = vcat (mapFL (showPrim f) xs)
Presumably this comes from the instance of ShowPatch for FL (which makes
sense).
> - showPatch cs $$
> + showPrimFL NewFormat cs $$
What about showNon and the rotcilfnoc case? Are there other cases where
we'd need to pay attention to this?
--
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
|
msg11126 (view) |
Author: kowey |
Date: 2010-05-27.17:10:58 |
|
On Thu, May 27, 2010 at 18:04:36 +0100, Eric Kow wrote:
> readFileName :: FileNameFormat -> B.ByteString -> FileName
> readFileName OldFormat = fp2fn . decodeWhite . unpackPSFromUTF8
> readFileName NewFormat = fp2fn . decodeWhite . BC.unpack
>
> formatFileName :: FileNameFormat -> FileName -> Doc
> formatFileName OldFormat = packedString . packStringToUTF8 . encodeWhite . fn2fp
> formatFileName NewFormat = text . encodeWhite . fn2fp
> So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode
> filenames encoded in UTF-8. Does this mean that in the NewFormat, we just
> treat filenames as just sequences of bytes?
Argh, sorry I'm being super super sloppy in my thinking here.
Darcs.Patch.FileName is just a newtype around String so as far as Darcs
is concerned, FileNames are Unicode Strings. The question is just what
happens when you encode/decode.
In the old-style, whenever we have a bytestring that corresponds to a
filename, we assume it was UTF-8 encoded (which seems wrong, which may
be the motivation behind the switch). In the new-style, we *seem* to
assume it's a superset of ISO 8859-1 (ie. each octet = Unicode code
point, unless Data.ByteString.Char really is checking to see if the code
point corresponds to holes in the code page, which I).
But in any case, Unicode inside.
It's the conversion to/from that worries me...
Anyway, I'm going to shut up before I add any misinformation to the mix.
Hopefully Reinier is paying attention and will yell at any stupid things
I've said,
--
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
|
msg11127 (view) |
Author: mornfall |
Date: 2010-05-27.17:48:28 |
|
Hi,
Eric Kow <kowey@darcs.net> writes:
> So in the OldFormat we seem to assume that Darcs.Patch.FileName uses Unicode
> filenames encoded in UTF-8. Does this mean that in the NewFormat, we just
> treat filenames as just sequences of bytes? If so that (very superficially
> and unthinkingly said) sounds like a step backwards. I wonder what exactly
> the thinking behind this was...
just to put this line of reasoning back on the right track... The story
is that UNIX stores a sequence of bytes for a filepath. This could be
basically anything. We never decode anything we read from the
filesystem, either. So in OldFormat, what happens is that a sequence of
bytes (the filepath) is taken and encoded as a sequence of codepoints <
256 and stored as UTF8.
With new format, the "store as UTF8" bit is skipped, and we just store
bytes.
It needs to be emphasised, that the UTF8 step in OldFormat is completely
superfluous, as the filepath *is never decoded*, so the codepoints are
completely bogus. Whether we encode them as UTF8 or as raw bytes doesn't
matter, other than that there is an encoding difference for those
filepath bytes that fall between 127 and 256 -- with OldFormat, each
takes 2 storage bytes while with NewFormat they just use a single
storage byte.
The net result is that in the common case (i.e. UTF8 filenames),
OldFormat is storing double-encoded UTF8.
Hope this makes things clearer, and it shows why NewFormat is actually a
step forward.
Yours,
Petr.
|
msg11128 (view) |
Author: kowey |
Date: 2010-05-27.18:27:06 |
|
On Thu, May 27, 2010 at 17:48:28 +0000, Petr Ročkai wrote:
> It needs to be emphasised, that the UTF8 step in OldFormat is completely
> superfluous, as the filepath *is never decoded*, so the codepoints are
> completely bogus.
Oh! Hmm, did I just draw the wrong conclusion from the below?
readFileName :: FileNameFormat -> B.ByteString -> FileName
readFileName OldFormat = ps2fn
readFileName NewFormat = fp2fn . decodeWhite . BC.unpack
ps2fn :: B.ByteString -> FileName
ps2fn ps = FN $ decodeWhite $ unpackPSFromUTF8 ps
I assume (without checking) that this function readFileName is used
when reading patches from disk. (is that right?)
Also what do you mean when you say Unix? I hope that's not a dumb
question. I just hear that on MacOS X you have this thing going on
where filenames are automagically normalised into NFD? So presumably
somewhere along the way you have some kind of low-level Unicode
awareness as opposed to treating filenames as just sequences of bytes.
Also should we be worrying about what happens with Windows?
--
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
|
msg11129 (view) |
Author: tux_rocker |
Date: 2010-05-27.21:22:27 |
|
Hi all,
Op donderdag 27 mei 2010 20:27 schreef Eric Kow:
> Also what do you mean when you say Unix? I hope that's not a dumb
> question. I just hear that on MacOS X you have this thing going on
> where filenames are automagically normalised into NFD? So presumably
> somewhere along the way you have some kind of low-level Unicode
> awareness as opposed to treating filenames as just sequences of bytes.
I don't know about OS X. I have the same question you have here.
> Also should we be worrying about what happens with Windows?
On Windows, filenames are Unicode strings and not sequences of bytes as on
Linux. This discrepancy can bring down Java actually. Try creating a file with
an invalid UTF-8 name and feeding it to a Java program via the command line on
a modern Linux distro. Java won't be able to find it.
I talked about this with Duncan Coutts on #ghc and he said that this is why
command line arguments are always interpreted as latin1 by GHC. So even a file
has a name with multibyte characters, you'll get the individual bytes as
Char's in the String's that you get from getArgs. So these Char's should be <
256 and should work. But I admit that sounds a bit flaky.
Petr's solution looks alright to me BTW.
Reinier
>
> --
> Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
> PGP Key ID: 08AC04F9
>
> __________________________________
> Darcs bug tracker <bugs@darcs.net>
> <http://bugs.darcs.net/patch252>
> __________________________________
> _______________________________________________
> darcs-users mailing list
> darcs-users@darcs.net
> http://lists.osuosl.org/mailman/listinfo/darcs-users
>
|
msg11131 (view) |
Author: kowey |
Date: 2010-05-28.09:04:30 |
|
On Thu, May 27, 2010 at 12:15:56 +0000, Petr Ročkai wrote:
> Thu May 27 14:10:19 CEST 2010 Petr Rockai <me@mornfall.net>
> * Resolve issue1763: use correct filename encoding in conflictors.
OK, we've had two people (Reinier, Eric) look at this and OK it, so I guess it
makes sense for me to push it now with some thoughts about future work.
Resolve issue1763: use correct filename encoding in conflictors.
----------------------------------------------------------------
> hunk ./src/Darcs/Patch/Real.hs 716
> blueText "conflictor" <+> showNons i <+> blueText "[]" $$ showNon p
> showPatch (Conflictor i cs p) =
> blueText "conflictor" <+> showNons i <+> blueText "[" $$
> - showPatch cs $$
> + showPrimFL NewFormat cs $$
> blueText "]" $$
> showNon p
> showPatch (InvConflictor i NilFL p) =
>
I'm still concerned that we're not being systematic enough about really
fixing this (eg. show we worry about rotcifnoc? showNon? etc)
[The mental image I have is those old cartoons where you have the
character on a boat and a leak forms, so he plugs it with a finger,
and then another leak, and another finger, and another leak...]
I also notice this:
instance ReadPatch Prim where
readPatch' _ = readPrim OldFormat
-- this and other darcs-2 format patches use readPrim NewFormat
readNons :: (ReadPatch p, ParserM m) => m [Non p C(x)]
readNons = peekfor "{{" rns (return [])
where rns = peekfor "}}" (return []) $
do Just (Sealed ps) <- readPatch' False
lexChar ':'
Just (Sealed p) <- readPrim NewFormat
(Non ps p :) `liftM` rns
and in the read code for Non and RealPatch (I think these are darcs-2 style
patches), readPatch eventually uses readPrim NewFormat. So that makes sense:
the double-encoding comes from reading UTF-8 bytes [this is where Petr's
assertion that "the filepath *is never decoded*" makes sense] as code-points,
and then trying to encode those code-points into UTF-8 bytes.
Plan for future work? (Prim FileNameFormat)
-------------------------------------------
How does this plan sound: introduce two new wrapper types OldFormatPrim and
NewFormatPrim whose read/show instances use OldFormat/NewFormat
respectively, thus ensuring that readPatch and showPatch automagically
do the right thing?
(or even one parametrisable type (although I imagine that involves turning
on some extension for instances))
Plan for future work? (two kinds of read/show)
----------------------------------------------
Complementary plan: we should distinguish between decoding/encoding
filepaths from the operating system, and decoding/encoding filepaths
to patch files and patch bundles.
Basically the picture looks like this:
OS <--> darcs <---> patch files
The reason why I initially thought that NewFormat was a step backwards
was that I was thinking about the darcs <--> patch files part. IMHO,
what you want is for darcs <--> patch files to always use UTF-8. On the
other hand, the OS <--> darcs part needs some more thought.
This is a little half-baked right, but maybe somebody else can run with
the idea?
--
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
|
msg11132 (view) |
Author: darcswatch |
Date: 2010-05-28.09:04:34 |
|
This patch bundle (with 1 patches) was just applied to the repository http://darcs.net/.
This message was brought to you by DarcsWatch
http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-12e2dfba7e5fb635ffde5856e22f2dfc6f0fe34b
|
msg11182 (view) |
Author: tux_rocker |
Date: 2010-06-02.06:33:06 |
|
Hi all,
Op vrijdag 28 mei 2010 11:05 schreef Eric Kow:
> Plan for future work? (two kinds of read/show)
> ----------------------------------------------
> Complementary plan: we should distinguish between decoding/encoding
> filepaths from the operating system, and decoding/encoding filepaths
> to patch files and patch bundles.
>
> Basically the picture looks like this:
>
> OS <--> darcs <---> patch files
>
> The reason why I initially thought that NewFormat was a step backwards
> was that I was thinking about the darcs <--> patch files part. IMHO,
> what you want is for darcs <--> patch files to always use UTF-8. On the
> other hand, the OS <--> darcs part needs some more thought.
>
> This is a little half-baked right, but maybe somebody else can run with
> the idea?
As a Unix guy I never think of filenames as text. If our patch files are
UTF-8, how do we represent patches to the issue1763 repo with its Hungarian
characters in a single-byte encoding? Note that if I copy these files to my
machine that has a UTF-8 locale, the file names will still be valid single-
byte-hungarian and invalid UTF-8!
Given that even enterprisey Java does not have a good solution to this problem
makes me feel hopeless about finding one for darcs.
We could of course say that for darcs, filenames are Unicode text. Then we
should check upon darcs add that a filename is valid according to the current
locale, and encode the file name using the locale encoding whenever darcs does
filesystem operations on Unix. But refusing to 'darcs add' a file because its
name does not fit our model may anger some users. And I haven't even thought
about backward compatibility.
Reinier
|
msg11183 (view) |
Author: stephen |
Date: 2010-06-02.07:47:39 |
|
Reinier Lamers writes:
> As a Unix guy I never think of filenames as text. If our patch files are
> UTF-8, how do we represent patches to the issue1763 repo with its Hungarian
> characters in a single-byte encoding?
PEP 383 is the best solution I've seen yet.
http://www.python.org/dev/peps/pep-0383/
|
msg14132 (view) |
Author: darcswatch |
Date: 2011-05-10.18:36:41 |
|
This patch bundle (with 1 patches) was just applied to the repository http://darcs.net/reviewed.
This message was brought to you by DarcsWatch
http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-12e2dfba7e5fb635ffde5856e22f2dfc6f0fe34b
|
|
Date |
User |
Action |
Args |
2010-05-27 12:15:56 | mornfall | create | |
2010-05-27 12:42:34 | darcswatch | set | darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-12e2dfba7e5fb635ffde5856e22f2dfc6f0fe34b |
2010-05-27 17:04:08 | kowey | set | nosy:
+ kowey messages:
+ msg11125 |
2010-05-27 17:10:58 | kowey | set | messages:
+ msg11126 |
2010-05-27 17:48:28 | mornfall | set | messages:
+ msg11127 |
2010-05-27 18:27:06 | kowey | set | messages:
+ msg11128 |
2010-05-27 21:22:27 | tux_rocker | set | nosy:
+ tux_rocker messages:
+ msg11129 |
2010-05-28 09:04:31 | kowey | set | messages:
+ msg11131 |
2010-05-28 09:04:34 | darcswatch | set | status: needs-review -> accepted messages:
+ msg11132 |
2010-06-02 06:33:06 | tux_rocker | set | messages:
+ msg11182 |
2010-06-02 07:47:39 | stephen | set | nosy:
+ stephen messages:
+ msg11183 |
2010-06-22 07:28:45 | ganesh | set | issues:
+ encoding of non-ascii file names into patches is broken; triggers assert later |
2011-05-10 18:36:41 | darcswatch | set | messages:
+ msg14132 |
|