darcs

Issue 1442 does the darcs encoding/decoding of filenames round-trip?

Title does the darcs encoding/decoding of filenames round-trip?
Priority bug Status needs-reproduction
Milestone Resolved in
Superseder Nosy List darcs-devel, dmitry.kurochkin, kowey, thorkilnaur, tux_rocker, twb
Assigned To
Topics UI

Created on 2009-04-16.11:04:52 by kowey, last changed 2020-07-29.17:04:47 by bfrk.

Messages
msg7709 (view) Author: kowey Date: 2009-04-16.11:04:48
% darcs changes 首頁.page -v 
Changes to 首頁.page:

Wed Apr 15 19:35:06 BST 2009  Eric Kow <E.Y.Kow@brighton.ac.uk>
  * Convert \e9\a6\96\e9\a0\81.page from MoinMoin syntax to RST.
    hunk ./\e9\a6\96\e9\160\\81.page 1
    -#language zh-tw
    -#redirect FrontPage

Sun Dec 31 14:13:58 GMT 2006  Eric Kow <q15kxe002@sneakemail.com>
  * \e9\a6\96\e9\a0\81: redirect to FrontPage
    addfile ./\e9\a6\96\e9\160\\81.page
    hunk ./\e9\a6\96\e9\160\\81.page 1
    +#language zh-tw
    +#redirect FrontPage

And also

% DARCS_DONT_ESCAPE_ANYTHING=1 darcs changes 首頁.page -v 
Changes to 首頁.page:

Wed Apr 15 19:35:06 BST 2009  Eric Kow <E.Y.Kow@brighton.ac.uk>
  * Convert 首頁.page from MoinMoin syntax to RST.
    hunk ./首?\160\?.page 1
    -#language zh-tw
    -#redirect FrontPage

Sun Dec 31 14:13:58 GMT 2006  Eric Kow <q15kxe002@sneakemail.com>
  * 首頁: redirect to FrontPage
    addfile ./首?\160\?.page
    hunk ./首?\160\?.page 1
    +#language zh-tw
    +#redirect FrontPage

What's going on with those filenames?  What's with the \160 and the double-slash?

Nevermind the actual encoding of the filenames for now.  Once we've got the
filename as a sequence of bytes, do we preserve that sequence?

Hopefully we'll find out that this is just a call for more testing and that
everything is OK.

I claim that the next action required is to study the darcs source code (maybe
Darcs.Patch.FileName) and provide an explanation for \160 and the double slash
that follows (see encode_white).

After that, if all is well, we should probably extend our test suite or make
sure that our test suite covers this stuff.

Marking this as a bug in the absence of full knowledge.
msg7711 (view) Author: twb Date: 2009-04-16.11:11:37
I will write some tests for this issue RSN.
msg7712 (view) Author: kowey Date: 2009-04-16.12:14:17
This made me curious so I spent some time on
  http://irclog.perlgeek.de/darcs/2009-04-16#i_1068098
trying to track down the issue

> Wed Apr 15 19:35:06 BST 2009  Eric Kow <E.Y.Kow@brighton.ac.uk>
>   * Convert \e9\a6\96\e9\a0\81.page from MoinMoin syntax to RST.

These escape sequences are generated by Darcs.ColorPrinter.
Notice \a0 in particular...

>     hunk ./\e9\a6\96\e9\160\\81.page 1

We use a function encode_white in Darcs.Patch.FileName, which encodes
all whitespace characters like this: \xxx\

Filenames are treated as Unicode Chars but they are read and written in
the usual Haskell way (no encoding, makes it as if it were in latin-1).
So it's as if we were treating filenames as bytestrings except for the
occasions where we wanted to look for slashes, dots and spaces.  In this
particular case, the encode_white function uses Data.Char.isSpace to
work out that the character corresponding to \a0 is a space character,
so it encodes it as \160\.

>   * 首頁: redirect to FrontPage
>     addfile ./首?\160\?.page
>     hunk ./首?\160\?.page 1
>     +#language zh-tw
>     +#redirect FrontPage

So doesn't this display correctly?

Here we are telling Darcs.ColorPrinter not to escape characters, so this
is why it leaves the bytes corresponding to "首" as they are.  But
before passing the filename to colorprinter, our patch display code
does this whitespace-escaping thing.

This means that we end up printing the byte
  e9 (first question mark)
then the string
  \160\ (escaped whitespace due to Darcs.Patch.FileName)
then the byte
  81 (second question mark)

So phew!
The weird gobbledygook on your screen does not constitute a bug.

I'm not sure what to do about this, though.  I guess one interpretation
is that we can say that when the user passes in DONT_ESCAPE_ANYTHING,
this includes darcs's own lower-level escapings and not just the higher
level ColorPrinter ones.  But that sounds like potentially messy code.
Should we somehow extend the printer code to say something like "I want
to escape this thing X as Y"?  Should we just treat this as a corner
case and document it on the wiki somewhere?

> After that, if all is well, we should probably extend our test suite or make
> sure that our test suite covers this stuff.
msg8194 (view) Author: kowey Date: 2009-08-17.06:15:31
Trent has kindly written up a test for this.

To be clear: there is no hanky-panky going on here.  It's just a problem of us
using the same code to render filenames for patch bundles as we do for putting
them on the screen.

The darcs-specific convention of representing whitespace as \xxx\ is
inappropriate for printing on screen.

This just needs somebody to study the darcs code and work out a solution of
doing two kinds of printing without us having too much code duplication. 
Perhaps an alternative show class?
msg8201 (view) Author: twb Date: 2009-08-17.08:15:45
On Mon, Aug 17, 2009 at 06:15:35AM +0000, Eric Kow wrote:
> This just needs somebody to study the darcs code and work out a
> solution of doing two kinds of printing without us having too much
> code duplication.  Perhaps an alternative show class?

There was discussion on-list about replacing our in-house Printer
module with the HughesPJ one.  I dunno if that's relevant.
msg8202 (view) Author: kowey Date: 2009-08-17.08:42:59
On Mon, Aug 17, 2009 at 08:15:47 +0000, Trent W. Buck wrote:
> On Mon, Aug 17, 2009 at 06:15:35AM +0000, Eric Kow wrote:
> > This just needs somebody to study the darcs code and work out a
> > solution of doing two kinds of printing without us having too much
> > code duplication.  Perhaps an alternative show class?
> 
> There was discussion on-list about replacing our in-house Printer
> module with the HughesPJ one.  I dunno if that's relevant.

I don't think so.

The printer controls all these fancy layout functions like blueText
and ($$).  Here the issue is really Darcs-specific, as in the function
that is at 'fault' is Darcs.Patch.FileName.encode_filename
msg10513 (view) Author: kowey Date: 2010-03-25.11:51:11
Reinier, I'm making you nosy on this bug because of your experience with
issue1739 (which is currently failing on buildbots BTW).  Maybe you'd be
interested.
msg10523 (view) Author: kowey Date: 2010-03-26.00:48:39
On Thu, Mar 25, 2010 at 21:02:37 +0100, Reinier Lamers wrote:
> This /may/ also have something to do with the non-roundtrip-safe filepath 
> munging that I presume causes http://bugs.darcs.net/issue1763 . I may look 
> into this one after I have fixed 1763 and 1599, and also released some 3-ish 
> darcs versions or so.

http://bugs.darcs.net/msg7712 has a more optimistic analysis (but maybe
it's a wrong one!)

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
msg21050 (view) Author: bfrk Date: 2019-08-08.12:43:07
Just noting that, according to Ganesh, this test still fails on Windows, while it does not on Linux.
msg21060 (view) Author: bfrk Date: 2019-08-09.09:59:21
> Just noting that, according to Ganesh, this test still fails on
> Windows, while it does not on Linux.

Or rather, if I understood correctly, the test script needs to be
re-written to be more portable, so at the moment it is unclear whether
there is an actual problem with this on Windows.
msg21062 (view) Author: ganesh Date: 2019-08-09.10:51:53
I think it was actually failing, but I can't remember for sure.

Every so often I get up the energy to really dig into an
encoding problem on Windows, but now is not one of those times.
msg22283 (view) Author: bfrk Date: 2020-07-29.17:04:45
Ganesh, I think this should work on Windows, in principle. It may be 
simply a problem with the test script. Please attach the full output 
of the test.
History
Date User Action Args
2009-04-16 11:04:52koweycreate
2009-04-16 11:11:39twbsetstatus: unread -> unknown
assignedto: twb
messages: + msg7711
nosy: + twb
2009-04-16 12:14:20koweysetnosy: kowey, simon, twb, thorkilnaur, dmitry.kurochkin
messages: + msg7712
2009-04-23 00:56:25twbsetassignedto: twb ->
nosy: kowey, simon, twb, thorkilnaur, dmitry.kurochkin
2009-08-17 06:15:35koweysetstatus: unknown -> needs-reproduction
nosy: kowey, simon, twb, thorkilnaur, dmitry.kurochkin
topic: + UI
messages: + msg8194
title: does the darcs encoding/decoding of filenames round-trip? -> filenames printed to screen with junk in place of whitespace
2009-08-17 08:15:47twbsetnosy: kowey, simon, twb, thorkilnaur, dmitry.kurochkin
messages: + msg8201
2009-08-17 08:43:01koweysetnosy: kowey, simon, twb, thorkilnaur, dmitry.kurochkin
messages: + msg8202
2009-08-25 17:43:43adminsetnosy: + darcs-devel, - simon
2009-08-27 14:25:18adminsetnosy: kowey, darcs-devel, twb, thorkilnaur, dmitry.kurochkin
2010-03-25 11:51:13koweysetnosy: + tux_rocker
messages: + msg10513
2010-03-26 00:48:40koweysetmessages: + msg10523
title: filenames printed to screen with junk in place of whitespace -> does the darcs encoding/decoding of filenames round-trip?
2019-08-08 12:43:09bfrksetmessages: + msg21050
2019-08-09 09:59:23bfrksetmessages: + msg21060
2019-08-09 10:51:55ganeshsetmessages: + msg21062
2020-07-29 17:04:47bfrksetmessages: + msg22283