Created on 2010-06-06.12:28:55 by kowey, last changed 2017-07-30.23:59:34 by gh.
msg11131
Author: kowey
Date: 2010-05-28.09:04:30
|
On Thu, May 27, 2010 at 12:15:56 +0000, Petr Ročkai wrote:
> Thu May 27 14:10:19 CEST 2010 Petr Rockai <me@mornfall.net>
> * Resolve issue1763: use correct filename encoding in conflictors.
OK, we've had two people (Reinier, Eric) look at this and OK it, so I guess it
makes sense for me to push it now with some thoughts about future work.
Resolve issue1763: use correct filename encoding in conflictors.
----------------------------------------------------------------
> hunk ./src/Darcs/Patch/Real.hs 716
> blueText "conflictor" <+> showNons i <+> blueText "[]" $$ showNon p
> showPatch (Conflictor i cs p) =
> blueText "conflictor" <+> showNons i <+> blueText "[" $$
> - showPatch cs $$
> + showPrimFL NewFormat cs $$
> blueText "]" $$
> showNon p
> showPatch (InvConflictor i NilFL p) =
>
I'm still concerned that we're not being systematic enough about really
fixing this (e.g. should we worry about rotcifnoc? showNon? etc.)
[The mental image I have is those old cartoons where you have the
character on a boat and a leak forms, so he plugs it with a finger,
and then another leak, and another finger, and another leak...]
I also notice this:
instance ReadPatch Prim where
    readPatch' _ = readPrim OldFormat
-- this and other darcs-2 format patches use readPrim NewFormat
readNons :: (ReadPatch p, ParserM m) => m [Non p C(x)]
readNons = peekfor "{{" rns (return [])
  where rns = peekfor "}}" (return []) $
              do Just (Sealed ps) <- readPatch' False
                 lexChar ':'
                 Just (Sealed p) <- readPrim NewFormat
                 (Non ps p :) `liftM` rns
and in the read code for Non and RealPatch (I think these are darcs-2 style
patches), readPatch eventually uses readPrim NewFormat. So that makes sense:
the double-encoding comes from reading UTF-8 bytes [this is where Petr's
assertion that "the filepath *is never decoded*" makes sense] as code-points,
and then trying to encode those code-points into UTF-8 bytes.
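[Illustration of the double-encoding described above. This is a minimal sketch, not darcs code: encodeUtf8, bytesAsCodePoints, onDisk and doubleEncoded are hypothetical names, and the encoder is hand-written only so that the example depends on nothing beyond base.]

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Encode a string of Unicode code points as UTF-8 bytes.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (map fromIntegral . encodeChar . ord)
  where
    encodeChar c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = 0x80 .|. (x .&. 0x3F)

-- | The suspected mistake: treat every raw byte as if it were a code point.
bytesAsCodePoints :: [Word8] -> String
bytesAsCodePoints = map (chr . fromIntegral)

-- | A filename as stored on a UTF-8 system: U+00E1 followed by ".txt".
onDisk :: [Word8]
onDisk = encodeUtf8 "\225.txt"

-- | Reading the bytes as code points and encoding again corrupts the name.
doubleEncoded :: [Word8]
doubleEncoded = encodeUtf8 (bytesAsCodePoints onDisk)
```

[Here onDisk starts with the two bytes 0xC3 0xA1, while doubleEncoded starts with the four bytes 0xC3 0x83 0xC2 0xA1: each original byte has been encoded a second time.]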
Plan for future work? (Prim FileNameFormat)
-------------------------------------------
How does this plan sound: introduce two new wrapper types OldFormatPrim and
NewFormatPrim whose read/show instances use OldFormat/NewFormat
respectively, thus ensuring that readPatch and showPatch automagically
do the right thing?
(or even one parametrisable type (although I imagine that involves turning
on some extension for instances))
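[A rough sketch of the two-wrapper variant, under loud assumptions: every definition below is a toy stand-in rather than darcs's real Prim, showPrim or ShowPatch, and the strings produced by showPrim are placeholders that only make the two formats distinguishable.]

```haskell
-- Toy stand-ins; none of these are the real darcs definitions.
data FileNameFormat = OldFormat | NewFormat

newtype Prim = AddFile FilePath

-- Placeholder: the real function would choose a filename encoding here.
-- This sketch only tags the output with the format that was requested.
showPrim :: FileNameFormat -> Prim -> String
showPrim OldFormat (AddFile f) = "addfile[old] " ++ f
showPrim NewFormat (AddFile f) = "addfile[new] " ++ f

class ShowPatch p where
  showPatch :: p -> String

-- The wrappers pin the format in the type, so callers cannot forget it.
newtype OldFormatPrim = OldFormatPrim Prim
newtype NewFormatPrim = NewFormatPrim Prim

instance ShowPatch OldFormatPrim where
  showPatch (OldFormatPrim p) = showPrim OldFormat p

instance ShowPatch NewFormatPrim where
  showPatch (NewFormatPrim p) = showPrim NewFormat p

-- A conflictor-like container that can only hold new-format prims.
newtype Conflictor = Conflictor [NewFormatPrim]

instance ShowPatch Conflictor where
  showPatch (Conflictor ps) = unlines (map showPatch ps)
```

[With types like these, a plain showPatch on the contents of a conflictor can no longer pick the wrong format by accident, because the choice is made once, when the prim is wrapped.]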
Plan for future work? (two kinds of read/show)
----------------------------------------------
Complementary plan: we should distinguish between decoding/encoding
filepaths from the operating system, and decoding/encoding filepaths
to patch files and patch bundles.
Basically the picture looks like this:
OS <--> darcs <---> patch files
The reason why I initially thought that NewFormat was a step backwards
was that I was thinking about the darcs <--> patch files part. IMHO,
what you want is for darcs <--> patch files to always use UTF-8. On the
other hand, the OS <--> darcs part needs some more thought.
This is a little half-baked right now, but maybe somebody else can run with
the idea?
--
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
|
msg11182
Author: tux_rocker
Date: 2010-06-02.06:33:06
|
Hi all,
On Friday 28 May 2010 11:05, Eric Kow wrote:
> Plan for future work? (two kinds of read/show)
> ----------------------------------------------
> Complementary plan: we should distinguish between decoding/encoding
> filepaths from the operating system, and decoding/encoding filepaths
> to patch files and patch bundles.
>
> Basically the picture looks like this:
>
> OS <--> darcs <---> patch files
>
> The reason why I initially thought that NewFormat was a step backwards
> was that I was thinking about the darcs <--> patch files part. IMHO,
> what you want is for darcs <--> patch files to always use UTF-8. On the
> other hand, the OS <--> darcs part needs some more thought.
>
> This is a little half-baked right now, but maybe somebody else can run with
> the idea?
As a Unix guy I never think of filenames as text. If our patch files are
UTF-8, how do we represent patches to the issue1763 repo with its Hungarian
characters in a single-byte encoding? Note that if I copy these files to my
machine that has a UTF-8 locale, the file names will still be valid
single-byte Hungarian and invalid UTF-8!
The fact that even enterprisey Java does not have a good solution to this
problem makes me feel hopeless about finding one for darcs.
We could of course say that for darcs, filenames are Unicode text. Then we
should check upon darcs add that a filename is valid according to the current
locale, and encode the file name using the locale encoding whenever darcs does
filesystem operations on Unix. But refusing to 'darcs add' a file because its
name does not fit our model may anger some users. And I haven't even thought
about backward compatibility.
Reinier
|
msg11183
Author: stephen
Date: 2010-06-02.07:47:39
|
Reinier Lamers writes:
> As a Unix guy I never think of filenames as text. If our patch files are
> UTF-8, how do we represent patches to the issue1763 repo with its Hungarian
> characters in a single-byte encoding?
PEP 383 is the best solution I've seen yet.
http://www.python.org/dev/peps/pep-0383/
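[For readers who do not know the PEP, the core idea can be sketched in a few lines. Assumptions: ASCII stands in for the locale codec to keep the example short (a real implementation would escape only the bytes that the actual locale codec rejects), and decodeEscaped, encodeEscaped and latin2Name are hypothetical names.]

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Decode bytes in the PEP 383 style. ASCII stands in for the locale
-- codec: bytes below 0x80 decode normally, and every other byte b counts
-- as undecodable and becomes the lone surrogate U+DC00 + b.
decodeEscaped :: [Word8] -> String
decodeEscaped = map step
  where
    step b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = chr (0xDC00 + fromIntegral b)

-- | Inverse of 'decodeEscaped': escape surrogates turn back into the
-- original bytes. Returns Nothing for characters this toy codec cannot
-- represent.
encodeEscaped :: String -> Maybe [Word8]
encodeEscaped = mapM step
  where
    step c
      | n < 0x80                   = Just (fromIntegral n)
      | n >= 0xDC80 && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
      | otherwise                  = Nothing
      where n = ord c

-- | A name in a single-byte encoding: 0xF5 is o-double-acute in
-- ISO-8859-2 and can never occur in well-formed UTF-8. The rest is ".txt".
latin2Name :: [Word8]
latin2Name = [0xF5, 0x2E, 0x74, 0x78, 0x74]
```

[decodeEscaped latin2Name yields a String whose first character is the lone surrogate U+DCF5, and encodeEscaped turns that String back into exactly the original bytes, so the name survives a round trip through text even though it was never valid in the assumed encoding.]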
|
msg11258
Author: kowey
Date: 2010-06-06.12:35:27
|
This issue was split from patch252 (hmm! sharing msg objects between
patches and issues).
I still think it'd be nice to think about having text filenames in Darcs
3. If I understand correctly, PEP 383 that Stephen pointed to would make it
possible to cope relatively gracefully with various encoding troubles
that could creep up.
Now if it turns out I'm wrong and text filenames really are a bad idea,
we can always wont-fix this ticket. But repo portability is a big
feature to have IMHO.
|
msg11345
Author: tux_rocker
Date: 2010-06-10.06:18:31
|
Hi,
On Sunday 06 June 2010 14:35, you wrote:
> I still think it'd be nice to think about having text filenames in Darcs
> 3. If I understand correctly, PEP 383 that Stephen pointed to would make it
> possible to cope relatively gracefully with various encoding troubles
> that could creep up.
About this PEP 383, I wonder whether we can also use it to embed these
encoding-wise ill-formed filenames in XML, for instance. If we can't, we might
be better off using some private use area of Unicode instead of lone
surrogates.
Reinier
|
msg11347
Author: stephen
Date: 2010-06-10.08:46:11
|
Reinier Lamers writes:
>
> Reinier Lamers <tux_rocker@reinier.de> added the comment:
>
> Hi,
>
> On Sunday 06 June 2010 14:35, you wrote:
> > I still think it'd be nice to think about having text filenames in Darcs
> > 3. If I understand correctly, PEP 383 that Stephen pointed to would make it
> > possible to cope relatively gracefully with various encoding troubles
> > that could creep up.
>
> About this PEP 383, I wonder whether we can also use it to embed
> these encoding-wise ill-formed filenames in XML, for instance.
No. By definition of UTF-16, the lone surrogates are also malformed,
so they cannot appear in UTF-16-encoded XML. Surrogates cannot appear
at all in UTF-8. Most Unicode applications will yell at you if you
try this, so it's a bad idea. But they won't DTRT with a private use
area, either, because that requires a non-standard private agreement
on encoding. So basically these encodings will only be useful in
darcs.
I'm not an XML expert, but IIRC there is some way to embed binary in
XML, and that's what you would want to use rather than private use
characters or surrogates.
> If we can't, we might be better off using some private use area of
> Unicode instead of lone surrogates.
This is equivalent, except that (a) all internal strings will be
well-formed (good) but (b) you now need to worry about collisions with
well-formed Unicode that happens to use the same private space (ugly).
|
Date | User | Action | Args
2010-06-06 12:28:55 | kowey | create |
2010-06-06 12:32:01 | admin | set | nosy: + kowey, mornfall, stephen, tux_rocker; messages: + msg11131, msg11182, msg11183
2010-06-06 12:35:27 | kowey | set | messages: + msg11258
2010-06-10 06:18:32 | tux_rocker | set | messages: + msg11345
2010-06-10 08:46:12 | stephen | set | messages: + msg11347
2010-06-10 09:28:46 | kowey | link | issue1143 superseder
2010-06-15 21:10:49 | admin | set | topic: - Target-3.0
2010-06-15 21:10:49 | admin | set | milestone: 3.0.0
2017-07-30 23:59:34 | gh | set | status: deferred -> given-up