darcs

Issue 1863 unicode filenames

Title: unicode filenames
Priority: feature
Status: given-up
Milestone: 3.0.0
Resolved in:
Superseder:
Nosy List: dmitry.kurochkin, kowey, mornfall, stephen, tux_rocker
Assigned To:
Topics:

Created on 2010-06-06.12:28:55 by kowey, last changed 2017-07-30.23:59:34 by gh.

Messages
msg11131 Author: kowey Date: 2010-05-28.09:04:30
On Thu, May 27, 2010 at 12:15:56 +0000, Petr Ročkai wrote:
> Thu May 27 14:10:19 CEST 2010  Petr Rockai <me@mornfall.net>
>   * Resolve issue1763: use correct filename encoding in conflictors.

OK, we've had two people (Reinier, Eric) look at this and OK it, so I guess it
makes sense for me to push it now with some thoughts about future work.

Resolve issue1763: use correct filename encoding in conflictors.
----------------------------------------------------------------
> hunk ./src/Darcs/Patch/Real.hs 716
>          blueText "conflictor" <+> showNons i <+> blueText "[]" $$ showNon p
>      showPatch (Conflictor i cs p) =
>          blueText "conflictor" <+> showNons i <+> blueText "[" $$
> -        showPatch cs $$
> +        showPrimFL NewFormat cs $$
>          blueText "]" $$
>          showNon p
>      showPatch (InvConflictor i NilFL p) =
> 

I'm still concerned that we're not being systematic enough about really
fixing this (e.g. should we worry about rotcifnoc? showNon? etc.)

[The mental image I have is those old cartoons where you have the
 character on a boat and a leak forms, so he plugs it with a finger,
 and then another leak, and another finger, and another leak...]

I also notice this:

  instance ReadPatch Prim where
    readPatch' _ = readPrim OldFormat

  -- this and other darcs-2 format patches use readPrim NewFormat
  readNons :: (ReadPatch p, ParserM m) => m [Non p C(x)]
  readNons = peekfor "{{" rns (return [])
      where rns = peekfor "}}" (return []) $
                  do Just (Sealed ps) <- readPatch' False
                     lexChar ':'
                     Just (Sealed p) <- readPrim NewFormat
                     (Non ps p :) `liftM` rns
     

and in the read code for Non and RealPatch (I think these are darcs-2 style
patches), readPatch eventually uses readPrim NewFormat.  So that makes sense:
the double-encoding comes from reading UTF-8 bytes [this is where Petr's
assertion that "the filepath *is never decoded*" makes sense] as code-points,
and then trying to encode those code-points into UTF-8 bytes.
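
To make the double-encoding concrete, here is a minimal sketch of the mistake
described above. It is not darcs code: the names (encodeChar, encodeUtf8,
bytesAsCodePoints, doubleEncode) are hypothetical, and the UTF-8 encoder is
hand-rolled only so that the example needs nothing beyond base.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Encode one code point as UTF-8 (hand-rolled for this sketch).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    top s  = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

-- | The mistake: read each byte as if it were a code point of its own.
bytesAsCodePoints :: [Word8] -> String
bytesAsCodePoints = map (chr . fromIntegral)

-- | Encode, misread the bytes as code points, then encode again.
doubleEncode :: String -> [Word8]
doubleEncode = encodeUtf8 . bytesAsCodePoints . encodeUtf8
```

For U+00E9 (e with acute accent), `encodeUtf8 "\233"` gives `[195,169]`,
while `doubleEncode "\233"` gives `[195,131,194,169]`: each of the two
original bytes has been re-encoded as a two-byte sequence.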

Plan for future work? (Prim FileNameFormat)
-------------------------------------------
How does this plan sound: introduce two new wrapper types OldFormatPrim and
NewFormatPrim whose read/show instances use OldFormat/NewFormat
respectively, thus ensuring that readPatch and showPatch automagically
do the right thing?

(or even one parametrisable type (although I imagine that involves turning
on some extension for instances))
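
A rough sketch of both variants, assuming heavily simplified stand-ins for
the darcs types: Prim, showPrim and ShowPatch below are hypothetical
miniatures, not the real definitions, and showPrim merely tags its output so
that the chosen format is visible. For the parametrised variant, the
extension in question would be FlexibleInstances.

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- Stand-ins for the darcs types named above; the real ones differ.
data FileNameFormat = OldFormat | NewFormat

newtype Prim = AddFile FilePath

-- Hypothetical printer: tags the output with the format it was given.
showPrim :: FileNameFormat -> Prim -> String
showPrim OldFormat (AddFile f) = "addfile[old] " ++ f
showPrim NewFormat (AddFile f) = "addfile[new] " ++ f

class ShowPatch p where
  showPatch :: p -> String

-- Variant 1: two wrapper types, each pinned to one format.
newtype OldFormatPrim = OldFormatPrim Prim
newtype NewFormatPrim = NewFormatPrim Prim

instance ShowPatch OldFormatPrim where
  showPatch (OldFormatPrim p) = showPrim OldFormat p

instance ShowPatch NewFormatPrim where
  showPatch (NewFormatPrim p) = showPrim NewFormat p

-- Variant 2: one wrapper with a phantom format parameter.
data Old
data New

newtype FormattedPrim fmt = FormattedPrim Prim

-- Instances at concrete type arguments need FlexibleInstances.
instance ShowPatch (FormattedPrim Old) where
  showPatch (FormattedPrim p) = showPrim OldFormat p

instance ShowPatch (FormattedPrim New) where
  showPatch (FormattedPrim p) = showPrim NewFormat p
```

Either way the format is fixed by the type, so a caller of showPatch cannot
pick the wrong one by accident.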

Plan for future work? (two kinds of read/show)
----------------------------------------------
Complementary plan: we should distinguish between decoding/encoding
filepaths from the operating system, and decoding/encoding filepaths
to patch files and patch bundles.

Basically the picture looks like this:

    OS <--> darcs <---> patch files

The reason why I initially thought that NewFormat was a step backwards
was that I was thinking about the darcs <--> patch files part.  IMHO,
what you want is for darcs <--> patch files to always use UTF-8.  On the
other hand, the OS <--> darcs part needs some more thought.

This is a little half-baked right now, but maybe somebody else can run with
the idea?

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
msg11182 Author: tux_rocker Date: 2010-06-02.06:33:06
Hi all,

On Friday 28 May 2010 11:05, Eric Kow wrote:
> Plan for future work? (two kinds of read/show)
> ----------------------------------------------
> Complementary plan: we should distinguish between decoding/encoding
> filepaths from the operating system, and decoding/encoding filepaths
> to patch files and patch bundles.
> 
> Basically the picture looks like this:
> 
>     OS <--> darcs <---> patch files
> 
> The reason why I initially thought that NewFormat was a step backwards
> was that I was thinking about the darcs <--> patch files part.  IMHO,
> what you want is for darcs <--> patch files to always use UTF-8.  On the
> other hand, the OS <--> darcs part needs some more thought.
> 
> This is a little half-baked right now, but maybe somebody else can run with
> the idea?

As a Unix guy I never think of filenames as text. If our patch files are 
UTF-8, how do we represent patches to the issue1763 repo with its Hungarian 
characters in a single-byte encoding? Note that if I copy these files to my
machine that has a UTF-8 locale, the file names will still be valid
single-byte Hungarian and invalid UTF-8!
The fact that even enterprisey Java has no good solution to this problem
makes me feel hopeless about finding one for darcs.

We could of course say that for darcs, filenames are Unicode text. Then we 
should check upon darcs add that a filename is valid according to the current 
locale, and encode the file name using the locale encoding whenever darcs does 
filesystem operations on Unix. But refusing to 'darcs add' a file because its 
name does not fit our model may anger some users. And I haven't even thought 
about backward compatibility.

Reinier
msg11183 Author: stephen Date: 2010-06-02.07:47:39
Reinier Lamers writes:

 > As a Unix guy I never think of filenames as text. If our patch files are 
 > UTF-8, how do we represent patches to the issue1763 repo with its Hungarian 
 > characters in a single-byte encoding?

PEP 383 is the best solution I've seen yet.

http://www.python.org/dev/peps/pep-0383/
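
For readers who have not met PEP 383: it specifies a Python error handler
(surrogateescape) that decodes each undecodable byte 0xNN as the lone
surrogate U+DCNN and reverses that on encoding, so arbitrary byte strings
survive a round trip through text. Below is a minimal Haskell sketch of the
same idea for UTF-8. The names decodeEscaping and encodeEscaping are
hypothetical, the codec is hand-rolled on top of base, and none of this is
darcs code.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Decode UTF-8; each undecodable byte 0xNN becomes U+DCNN.
decodeEscaping :: [Word8] -> String
decodeEscaping [] = []
decodeEscaping (b : bs)
  | b < 0x80 = chr (fromIntegral b) : decodeEscaping bs
  | otherwise = case multiByte of
      Just (c, rest) -> c : decodeEscaping rest
      Nothing        -> chr (0xDC00 + fromIntegral b) : decodeEscaping bs
  where
    multiByte
      | b .&. 0xE0 == 0xC0 = sequenceOf 1 (b .&. 0x1F) 0x80
      | b .&. 0xF0 == 0xE0 = sequenceOf 2 (b .&. 0x0F) 0x800
      | b .&. 0xF8 == 0xF0 = sequenceOf 3 (b .&. 0x07) 0x10000
      | otherwise          = Nothing
    -- k continuation bytes must follow; reject overlong forms,
    -- encoded surrogates and values past U+10FFFF.
    sequenceOf k leadBits minCp =
      let (conts, rest) = splitAt k bs
          step acc x    = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)
          cp            = foldl step (fromIntegral leadBits) conts :: Int
          ok            = length conts == k
                          && all (\x -> x .&. 0xC0 == 0x80) conts
                          && cp >= minCp && cp <= 0x10FFFF
                          && (cp < 0xD800 || cp > 0xDFFF)
      in if ok then Just (chr cp, rest) else Nothing

-- | Inverse of decodeEscaping: U+DC80..U+DCFF turn back into raw bytes.
-- Other lone surrogates are not treated specially in this sketch.
encodeEscaping :: String -> [Word8]
encodeEscaping = concatMap enc
  where
    enc c
      | n >= 0xDC80 && n <= 0xDCFF = [fromIntegral (n - 0xDC00)]
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. top 6, cont 0]
      | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
      where
        n      = ord c
        top s  = fromIntegral (n `shiftR` s)
        cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)
```

With this, the bytes `[0x61, 0xE9, 0x62]` (an "a", a single-byte accented
letter, a "b") decode to 'a', U+DCE9, 'b', and encodeEscaping turns that
string back into the same three bytes, while valid UTF-8 input decodes as
usual.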
msg11258 Author: kowey Date: 2010-06-06.12:35:27
This issue was split from patch252 (hmm! sharing msg objects between
patches and issues).

I still think it'd be nice to think about having text filenames in Darcs
3. If I understand correctly, PEP 383, which Stephen pointed to, would make
it possible to cope relatively gracefully with various encoding troubles
that could crop up.

Now if it turns out I'm wrong and text filenames really are a bad idea,
we can always wont-fix this ticket.  But repo portability is a big
feature to have IMHO.
msg11345 Author: tux_rocker Date: 2010-06-10.06:18:31
Hi,

On Sunday 06 June 2010 14:35, you wrote:
> I still think it'd be nice to think about having text filenames in Darcs
> 3. If I understand correctly, PEP 383, which Stephen pointed to, would make
> it possible to cope relatively gracefully with various encoding troubles
> that could crop up.

About this PEP 383, I wonder whether we can also use it to embed these
filenames with ill-formed encodings in XML, for instance. If we can't, we
might be better off using some private use area of Unicode instead of lone
surrogates.

Reinier
msg11347 Author: stephen Date: 2010-06-10.08:46:11
Reinier Lamers writes:
 > 
 > Reinier Lamers <tux_rocker@reinier.de> added the comment:
 > 
 > Hi,
 > 
 > On Sunday 06 June 2010 14:35, you wrote:
 > > I still think it'd be nice to think about having text filenames in Darcs
 > > 3. If I understand correctly, PEP 383, which Stephen pointed to, would make
 > > it possible to cope relatively gracefully with various encoding troubles
 > > that could crop up.
 > 
 > About this PEP 383, I wonder whether we can also use it to embed
 > these filenames with ill-formed encodings in XML, for instance.

No.  By definition of UTF-16, the lone surrogates are also malformed,
so they cannot appear in UTF-16-encoded XML.  Surrogates cannot appear
at all in UTF-8.  Most Unicode applications will yell at you if you
try this, so it's a bad idea.  But they won't DTRT with a private use
area, either, because that requires a non-standard private agreement
on encoding.  So basically these encodings will only be useful in
darcs.

I'm not an XML expert, but IIRC there is some way to embed binary in
XML, and that's what you would want to use rather than private use
characters or surrogates.

 > If we can't, we might be better off using some private use area of
 > Unicode instead of lone surrogates.

This is equivalent, except that (a) all internal strings will be
well-formed (good) but (b) you now need to worry about collisions with
well-formed Unicode that happens to use the same private space (ugly).
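
The trade-off between (a) and (b) can be illustrated with a small sketch.
Everything in it is an assumption made for illustration: the function names
are hypothetical, and the private-use base U+F700 is an arbitrary pick, not
something proposed in this thread.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | PEP 383 style: byte 0xNN (>= 0x80) becomes the lone surrogate U+DCNN.
escapeSurrogate :: Word8 -> Char
escapeSurrogate b = chr (0xDC00 + fromIntegral b)

-- | Private-use variant; the base U+F700 is arbitrary for this sketch.
escapePrivateUse :: Word8 -> Char
escapePrivateUse b = chr (0xF700 + fromIntegral b)

-- | True for Unicode scalar values, i.e. anything but a surrogate.
isScalarValue :: Char -> Bool
isScalarValue c = n < 0xD800 || n > 0xDFFF
  where n = ord c

-- | Would the private-use scheme read this character as an escaped byte?
looksEscapedPrivateUse :: Char -> Bool
looksEscapedPrivateUse c = n >= 0xF780 && n <= 0xF7FF
  where n = ord c
```

Here `isScalarValue (escapeSurrogate 0xE9)` is False, so the surrogate
scheme yields strings that are not well-formed, whereas
`isScalarValue (escapePrivateUse 0xE9)` is True. The price is that a
genuine filename containing U+F7E9 also satisfies looksEscapedPrivateUse and
cannot be told apart from an escaped 0xE9 byte.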

History
Date                 User        Action  Args
2010-06-06 12:28:55  kowey       create
2010-06-06 12:32:01  admin       set     nosy: + kowey, mornfall, stephen, tux_rocker
                                         messages: + msg11131, msg11182, msg11183
2010-06-06 12:35:27  kowey       set     messages: + msg11258
2010-06-10 06:18:32  tux_rocker  set     messages: + msg11345
2010-06-10 08:46:12  stephen     set     messages: + msg11347
2010-06-10 09:28:46  kowey       link    issue1143 superseder
2010-06-15 21:10:49  admin       set     topic: - Target-3.0
2010-06-15 21:10:49  admin       set     milestone: 3.0.0
2017-07-30 23:59:34  gh          set     status: deferred -> given-up