darcs

Issue 2095 Filename handling has changed with GHC 7.2

Title: Filename handling has changed with GHC 7.2
Priority:
Status: resolved
Milestone: 2.8.1
Resolved in: 2.10.0 HEAD
Superseder:
Nosy List: atsampson, ganesh, iain, mndrix, nomeata
Assigned To:
Topics:

Created on 2011-08-14.11:14:46 by atsampson, last changed 2012-05-14.13:26:02 by noreply.

Messages
msg14656 Author: atsampson Date: 2011-08-14.11:14:44
I've just been trying out Darcs (head) with GHC 7.2, and spent a while
tracking down why some of the tests fail: I thought it worth documenting
it here to save others the effort...

The tests that fail are those that use non-ASCII filenames. For example,
the issue1763 test creates a file with a filename that contains
non-ASCII characters, then tries to use "darcs rec -l" to add it -- but
darcs compiled with GHC 7.2 ignores it (or, rather, treats it as boring).

It's caused by a behaviour change in the GHC standard libraries. In GHC
< 7.2, System.Directory.getDirectoryContents returned filenames as
Strings where each character represents a byte returned by the operating
system. In GHC 7.2, getDirectoryContents attempts to decode the filename
using the current locale, and returns the decoded version. This sort of
does the right thing when the current locale's encoding matches the
filesystem encoding (although it cannot possibly know whether that's the
case, and in any case it returns different results from previous
versions, so Darcs will need to be adapted to not do any decoding
itself). However, when the locale *doesn't* match the filesystem encoding --
e.g. when LANG=C in the tests -- it maps characters it doesn't recognise
to private-use code points.

For example (trimmed a bit):

$ ls
_darcs  foo  xá
$ LANG=en_GB.UTF-8 ghci-7.0.4
GHCi, version 7.0.4: http://www.haskell.org/ghc/  :? for help
Prelude System.Directory> getDirectoryContents "."
["foo","x\195\161","..","_darcs","."]
$ LANG=C ghci-7.0.4
GHCi, version 7.0.4: http://www.haskell.org/ghc/  :? for help
Prelude System.Directory> getDirectoryContents "."
["foo","x\195\161","..","_darcs","."]
$ LANG=en_GB.UTF-8 ghci-7.2.1
GHCi, version 7.2.1: http://www.haskell.org/ghc/  :? for help
Prelude System.Directory> getDirectoryContents "."
["foo","x\225","..","_darcs","."]
$ LANG=C ghci-7.2.1
GHCi, version 7.2.1: http://www.haskell.org/ghc/  :? for help
Prelude System.Directory> getDirectoryContents "."
["foo","x\61379\61345","..","_darcs","."]

It seems to me that this makes it awkward to write Haskell programs that
manipulate filenames in a predictable way; any thoughts?
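
The LANG=C result above suggests that GHC 7.2 escapes each byte it cannot decode as the code point 0xEF00 plus the byte value (61379 = 0xEF00 + 195, 61345 = 0xEF00 + 161). Below is a minimal sketch of undoing that escaping. The offset is an assumption inferred from the ghci output above rather than taken from the GHC sources, and escapeBase, escapeByte, unescapeChar and recoverBytes are hypothetical names used only for illustration.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Assumed escape base, inferred from the ghci output above.
escapeBase :: Int
escapeBase = 0xEF00

-- Map an undecodable byte to its (assumed) private-use code point.
escapeByte :: Word8 -> Char
escapeByte b = chr (escapeBase + fromIntegral b)

-- Recover the original byte from an escaped character, if it is one.
unescapeChar :: Char -> Maybe Word8
unescapeChar c
  | n >= escapeBase + 0x80 && n <= escapeBase + 0xFF =
      Just (fromIntegral (n - escapeBase))
  | otherwise = Nothing
  where
    n = ord c

-- Turn a filename String back into bytes: ASCII characters map to
-- themselves, escaped characters are unescaped, anything else fails.
recoverBytes :: String -> Maybe [Word8]
recoverBytes = mapM step
  where
    step c
      | ord c < 0x80 = Just (fromIntegral (ord c))
      | otherwise = unescapeChar c
```

With these definitions, recoverBytes "x\61379\61345" evaluates to Just [120,195,161], that is, the byte for x followed by the two bytes of the UTF-8 encoding of á. A name that was decoded successfully, such as "x\225", yields Nothing, which shows that the String alone does not tell you which bytes are on disk.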
msg14658 Author: ganesh Date: 2011-08-14.11:36:48
Thanks very much for this detective work.

Can you just confirm this is on Linux? I thought I'd run all the tests on 
Windows (mingw) and they passed, but I now realise that the issue1763 test 
is just skipped on Windows!
msg14659 Author: atsampson Date: 2011-08-14.11:57:01
Yep, this is on Linux. I'd expect to see the same problem (just with
different combinations of encodings!) on Windows; it may be that fixing
this also fixes whatever used to break the issue1763 test...
msg14749 Author: markstos Date: 2011-10-13.12:46:26
atsampson, thanks for the report. 

What mix of documentation and logic changes in darcs do you recommend to 
address this?

Can it be solved by setting the right locale in the correct places?

Is it possible that people will experience breakage if they upgrade to a 
darcs compiled with GHC 7.2 and we don't make other changes?

Thanks for your further input on this. We are trying to assess which parts 
of this, if any, need to be addressed for the already-delayed 2.8 release.
msg14773 Author: atsampson Date: 2011-10-13.13:27:30
I'm not sure what the best way to work around this in Darcs would be;
I'm not really familiar with Darcs' internals. I'd suggest talking to
whoever made the change in GHC, and asking what they intended it to do.

Forcing the locale in Darcs when doing operations that use filenames
would work, I suppose: find a locale that passes through all 8-bit
values directly -- e.g. ISO-8859-1 -- and that'll give you the same
results that earlier versions of GHC gave. That'd definitely be a hack,
though, not a permanent solution!
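
As a rough illustration of the "pass every 8-bit value through" idea, here is a sketch that assumes base 4.5 or later (GHC 7.4 and up), where GHC.IO.Encoding exports setFileSystemEncoding and char8. That is newer than the GHC 7.2 discussed here, and it is not necessarily how Darcs itself handles the problem. listDirBytewise is a hypothetical name.

```haskell
import GHC.IO.Encoding (char8, setFileSystemEncoding)
import System.Directory (getDirectoryContents)

-- List a directory after forcing the file system encoding to char8,
-- which maps each byte to the character with the same code (0-255).
-- Note: setFileSystemEncoding changes a process-wide setting, so every
-- later FilePath operation in the program is affected too.
listDirBytewise :: FilePath -> IO [FilePath]
listDirBytewise dir = do
  setFileSystemEncoding char8
  getDirectoryContents dir
```

Under this assumption the names should come back one character per byte, as in the GHC 7.0.4 transcripts earlier in this ticket, whatever LANG is set to. Because the setting is global, a real program would more likely apply it once at startup than per call.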

Yes, people upgrading to GHC 7.2 will experience breakage if they ever
use filenames with non-ASCII characters in them. The test suite will
certainly fail, unless either GHC or Darcs have changed since I opened
this ticket...
msg14806 Author: atsampson Date: 2011-11-12.14:08:11
Simon Marlow has just posted a patch to the unix package which fixes
this problem by providing a raw API alongside the filename-mangling one:

http://thread.gmane.org/gmane.comp.lang.haskell.libraries/16556

So it'll be possible to unbreak this with GHC 7.4.1, by making it use
System.Posix.ByteString. I don't know what makes sense in terms of
backwards compatibility, though.
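
For reference, a minimal sketch of listing a directory through the raw API, assuming a version of the unix package that ships System.Posix.Directory.ByteString. listDirRaw is a hypothetical helper, not something Darcs or the unix package provides.

```haskell
import Control.Exception (bracket)
import qualified Data.ByteString as B
import System.Posix.Directory.ByteString
  (closeDirStream, openDirStream, readDirStream)

-- Read all entries of a directory as raw bytes, with no locale decoding.
-- readDirStream returns an empty name once the stream is exhausted.
listDirRaw :: B.ByteString -> IO [B.ByteString]
listDirRaw dir = bracket (openDirStream dir) closeDirStream loop
  where
    loop stream = do
      name <- readDirStream stream
      if B.null name
        then return []
        else fmap (name :) (loop stream)
```

The names are ByteStrings containing exactly what the operating system returned, so the result does not depend on LANG. The cost is that everything built on top of FilePath (which is a String) has to be adapted or wrapped.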
msg14807 Author: ganesh Date: 2011-11-12.14:21:23
My current inclination is to not support GHC 7.2 at all (it's described as 
a "technology preview" anyway). I'm still not quite clear about the Win32 
situation though, or more generally whether we need/want ByteString-ified 
versions of the modules in the whole stack, including directory as well as 
unix.
msg14928 Author: ganesh Date: 2011-12-31.22:54:19
I tried building darcs with GHC 7.4 against a locally hacked up version 
of the directory and unix packages that use the ByteString variants of 
Posix functions and convert to/from Strings using 
Data.ByteString.Char8.pack/unpack, but the tests still fail. The hacked 
up versions do exhibit the "correct" behaviour (i.e. like GHC 7.0) with 
getDirectoryContents, so I'm not quite sure what's going on yet.
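
The String conversion described here can be sketched as follows. rawToString and stringToRaw are hypothetical names, and this only illustrates the Data.ByteString.Char8 round trip, not the hacked-up packages themselves.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Bytes to String: each byte becomes the character with the same code,
-- which matches what getDirectoryContents returned under GHC 7.0.
rawToString :: B.ByteString -> String
rawToString = BC.unpack

-- String to bytes: each character is truncated to its low 8 bits, so
-- this is only lossless for characters in the range 0-255.
stringToRaw :: String -> B.ByteString
stringToRaw = BC.pack
```

For example, rawToString (B.pack [0x78, 0xC3, 0xA1]) gives "x\195\161", the same String that the GHC 7.0.4 transcript earlier in this ticket shows for the file xá.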

For now I've committed a patch to darcs that prevents building with base 
4.4 or above (i.e. GHC 7.2+) unless explicitly requested.
msg15371 Author: ganesh Date: 2012-03-21.13:24:45
Unfortunately, as far as I can tell there's quite a bit more to do than 
just switch to System.Posix.ByteString.

That's just the low-level library. There's a tower of code on top of 
that (in the directory package) that we use directly, and we need to 
make our own variants of that code (or get standard libraries 
added/changed).

I did some work on this, but it didn't work out so far; anyone else 
should feel free to pick it up (there's probably nothing much reusable 
from what I've already done). I do plan to work on this further as soon 
as I have time - e.g. in next week's sprint.
msg15468 Author: noreply Date: 2012-04-01.10:01:48
The following patch sent by Ganesh Sittampalam <ganesh@earth.li> updated issue issue2095 with
status=has-patch

* resolve issue2095: avoid new GHC encoding behaviour using a global setting 
Ignore-this: 5df28e26ef3dd51dad0d0a006dcddf5f
msg15653 Author: jack.bargain Date: 2012-05-06.03:29:52
Speaking as an everyday user in a non-ASCII environment:

After the change in GHC 7.2, many path errors have gone away, and not just
in darcs. Having each project work around it and handle raw bytes itself is
error-prone.

Handling the Unicode that FilePath now returns could put an end to the poor
non-ASCII support in darcs.
msg15687 Author: noreply Date: 2012-05-14.13:26:01
The following patch sent by Ganesh Sittampalam <ganesh@earth.li> updated issue issue2095 with
status=resolved;resolvedin=2.10.0 HEAD

* resolve issue2095: avoid new GHC encoding behaviour using a global setting 
Ignore-this: 5df28e26ef3dd51dad0d0a006dcddf5f
History
Date                 User          Action  Args
2011-08-14 11:14:46  atsampson     create
2011-08-14 11:36:49  ganesh        set     messages: + msg14658
2011-08-14 11:36:59  ganesh        set     nosy: + ganesh
                                           milestone: 2.8.0
2011-08-14 11:57:02  atsampson     set     messages: + msg14659
2011-10-13 12:46:27  markstos      set     status: unknown -> waiting-for
                                           messages: + msg14749
                                           title: Filename handling with GHC 7.2 -> Filename handling has changed with GHC 7.2
2011-10-13 13:27:31  atsampson     set     messages: + msg14773
2011-11-12 14:08:13  atsampson     set     messages: + msg14806
2011-11-12 14:21:24  ganesh        set     messages: + msg14807
2011-12-31 22:54:20  ganesh        set     messages: + msg14928
2011-12-31 22:54:25  ganesh        set     milestone: 2.8.0 -> 2.10.0 HEAD
2011-12-31 23:00:14  ganesh        set     milestone: 2.10.0 HEAD -> 2.8.1
2012-03-21 12:29:26  nomeata       set     nosy: + nomeata
                                           messages: + msg15370
2012-03-21 13:24:46  ganesh        set     messages: + msg15371
2012-03-21 14:05:57  iain          set     nosy: + iain
2012-03-22 14:06:22  mndrix        set     nosy: + mndrix
2012-04-01 10:01:49  noreply       set     status: waiting-for -> has-patch
                                           messages: + msg15468
2012-04-20 12:00:32  bf            set     messages: - msg15370
2012-05-06 03:29:53  jack.bargain  set     messages: + msg15653
2012-05-14 13:26:02  noreply       set     status: has-patch -> resolved
                                           messages: + msg15687
                                           resolvedin: 2.10.0 HEAD