darcs

Patch 1226 resolve issue2416: removed obsolete Darcs.Util.ByteStr...

Title resolve issue2416: removed obsolete Darcs.Util.ByteStr...
Superseder Nosy List bfrk, ganesh
Related Issues
Status obsoleted Assigned To
Milestone

Created on 2014-11-15.22:32:29 by bfrk, last changed 2014-11-17.22:26:47 by bfrk. Tracked on DarcsWatch.

Files
File name Status Uploaded Type Edit Remove
make-the-options-type-used-by-a-command-into-a-type-parameter_and_4_others.dpatch bfrk, 2014-11-16.20:11:28 application/x-darcs-patch
patch-preview.txt bfrk, 2014-11-15.22:32:29 text/x-darcs-patch
resolve-issue2416_-removed-obsolete-darcs_util_bytestring_decodestring.dpatch bfrk, 2014-11-15.22:32:29 application/x-darcs-patch
unnamed bfrk, 2014-11-15.22:32:29
See mailing list archives for discussion on individual patches.
Messages
msg17812 (view) Author: bfrk Date: 2014-11-15.22:32:29
Note this only resolves the bug, it doesn't add any unit tests. I am now
convinced that decodeString is actually harmful with modern (i.e. > 7.0)
ghc.

1 patch for repository http://darcs.net/screened:

patch 5f4b9bb00fe09b25ee6b15e5171c765230f8db76
Author: Ben Franksen <benjamin.franksen@helmholtz-berlin.de>
Date:   Sat Nov 15 23:27:38 CET 2014
  * resolve issue2416: removed obsolete Darcs.Util.ByteString.decodeString
Attachments
msg17814 (view) Author: ganesh Date: 2014-11-15.23:00:48
Unfortunately this seems to break the utf8.sh shell test :-( I agree we 
really should be getting rid of things like this though.
msg17815 (view) Author: bfrk Date: 2014-11-15.23:24:44
Hm, it works fine when I run the test on my machine. I am running ubuntu
14.04 with ghc-7.6.3 (ubuntu package). My locale is de_DE.UTF-8.

Can you say where exactly the test fails?
msg17816 (view) Author: ganesh Date: 2014-11-15.23:37:19
Looks like it's at "patch name from command line not UTF-8-encoded!" - 

I'm running NixOS 14.04, locale en_GB.UTF-8.
msg17817 (view) Author: bfrk Date: 2014-11-16.00:39:26
I ran the test as

> ./dist/build/darcs-test/darcs-test --tests=utf

which gave me no indication that I have to generate an iso885915
encoding for the test to do something ;)

When I add this locale is fails for me, too. I begin to understand why
this is the case. What I had trouble to understand: how does
decodeString not destroy entering utf-8 encoded strings? I think it does
do that -- but only for characters with a code point above 255 and I
happen to be part of the lucky minority of people whose native language
encoding involves only code points below.

So the next question is: do we still want to support the 8-bit-only
non-unicode encodings at the cost of alienating those whose native
encoding needs code points above 255? That would entail scrapping the so
called utf8.sh test (which should actually be called iso88915.sh).

I am pretty sure, if we asked around how other projects handle this, it
would turn out we are the only ones sticking to latin encodings.

Note this whole issue is purely about strings we get from the command line.
msg17818 (view) Author: bfrk Date: 2014-11-16.00:42:37
I wrote:
> That would entail scrapping the so
> called utf8.sh test (which should actually be called iso88915.sh).

The "that" refers to the implicit proposal (framed as a question) to no
longer support latin encodings for strings we get from the command line.
msg17819 (view) Author: ganesh Date: 2014-11-16.02:02:43
I'm quite happy to make a principled change to darcs' target behaviour, but I can't pretend to 
understand the issues fully. Unfortunately encoding issues tend to be quite complex :-( Some 
general thoughts, in a somewhat random order:

I did do some work on the encoding code a while ago to manage the transition to GHC 7.2's 
behaviour, but I've mostly forgotten what I learnt then. From what I remember, I mostly 
tinkered around the edges to change the way we dealt with incoming data, rather than really 
understanding exactly what was going on.

I think there are also various open bugs relating to our internationalization, so I'm sure the 
current behaviour is not perfect.

Is it worth writing a test to demonstrate your hypothesis about destroying incoming utf-8? I 
also agree that the code implies this would happen but it would be good to be sure.

I would also like to understand why we can't support both scenarios, depending on the user's 
current locale. Is it logically impossible or just difficult because of the way darcs's code 
is written, for example.
msg17820 (view) Author: bfrk Date: 2014-11-16.13:44:34
You raise a number of good questions.

* Writing a test to demonstrate my claim: yes, good idea. I'll try to do
that.

* Support both scenarios: That would be ideal, of course. Now that you ask,
  I think this should be possible. However, digging into the base
library sources
  and searching around in the available libraries I found no simple way
to do that.
  I have sent a message to haskell-cafe, perhaps someone has an idea.
msg17822 (view) Author: ganesh Date: 2014-11-16.15:32:05
I (belatedly) now remembering a bit more of the history behind the 
current situation. Despite being responsible for the recent changes  
it's rather a blank in my memory :-)

 - we actually force the encoding we read from the command-line to 
'char8': see patch777.

 - some of the current encoding code was imported from haskeline 
when that stopped exporting the helper functionality we were 
previously using: see patch1013. Not sure if that's directly 
relevant to the current situation.

Generally I'm not suggesting we should investigate this issue to 
death and come up with a perfect solution. But we should be a bit 
careful not just to flip-flop based on one specific problem at hand, 
and we should document what we expect to support with tests (and of 
course remove the tests we no longer wish to support).
msg17823 (view) Author: bfrk Date: 2014-11-16.17:08:04
OMG. This is all so messed up I am getting headaches just from thinking
about it.

So (since patch777) we do 'setFileSystemEncoding char8'. And since we do
that before 'System.IO.initForeignEncoding' has been called for the
first time, it permanently overrides the users locale, as seen by all
Haskell libraries. IIUC, the main reason it's been done that way is so
that things like file names can be round-tripped no matter what the
encoding is.

On the other hand, we want to store patch meta data in utf-8. But stuff
from the command line is (unfortunately?) also subject to encoding,
using initForeignEncoding, which we globally set to char8. Therefore the
decodeString function.

We can solve the original problem (issue2416) by adding a reverse
function to Darcs.Util.ByteString and using that in the Option
definitions when defining ounparse.

The larger problem (the reason we enforce char8 as the default) should
perhaps be discussed elsewhere.
msg17824 (view) Author: ganesh Date: 2014-11-16.17:46:25
> IIUC, the main reason it's been done that way
> is so that things like file names can be round-tripped
> no matter what the encoding is.

Yes, I think that's the motivation. I spent a while investigating 
alternatives (see issue2095) but this ended up being the simplest 
"solution".

I had some vague thoughts in the long-term of using newtypes around 
String and ByteString to identify where in our code what encoding is 
in use (the obvious ones being Char8, UTF-8 or user-locale), but 
that's not going to be for 2.10. As you say that's probably best 
discussed elsewhere.

> We can solve the original problem (issue2416) by
> adding a reverse function to Darcs.Util.ByteString
> and using that in the Option definitions when
> defining ounparse.

Isn't the problem here that 'decodeString' isn't invertible? The 
example I gave in issue2416 was derived from running the utf8.sh 
with patch1222, and it reflects a tag name being passed on the 
command-line.
msg17825 (view) Author: bfrk Date: 2014-11-16.18:54:44
> > We can solve the original problem (issue2416) by
> > adding a reverse function to Darcs.Util.ByteString
> > and using that in the Option definitions when
> > defining ounparse.
> Isn't the problem here that 'decodeString' isn't invertible? The 
> example I gave in issue2416 was derived from running the utf8.sh 
> with patch1222, and it reflects a tag name being passed on the 
> command-line.

You are probably right. What I am certain of is that applying
decodeString when creating the DarcsFlag but not applying an inverse
when going back from the DarcsFlag to a String is wrong. You say there
is no inverse, so I think the only viable solution is to either apply
decodeString before options are parsed or afterwards when they are used.

I you have an idea how to get out of this mess, please go ahead.
msg17826 (view) Author: bfrk Date: 2014-11-16.20:11:28
Okay, here is another patch bundle that implements a solution according
to what Ganesh proposed. With this patch I no longer get failures for
the utf-8.sh test. It also includes some small bugfix patches. However,
workingdir.sh is broken now.
Attachments
msg17830 (view) Author: ganesh Date: 2014-11-17.06:56:19
A quick update on test failures following our IRC conversation.

The send failures are entirely caused by my "make the options type into a type parameter" 
patches. I don't know why I didn't see them before, but now they happen without any of your 
patches from this bundle. Your patches from this bundle commute with the options type changes, 
so we can sort them out separately unless you expect some semantic dependency.

The workingdir.sh failure is actually caused by "fix: the toPatch option was broken, probably 
due to some cut/paste". I haven't figured out exactly how we go to this state yet; the fix is 
clearly correct, but it means the flag isn't recognised by the call to havePatchsetMatch in 
Darcs.Repository.copyRepoAndGoToChosenVersion.

I'll update the bug if I figure out more.
msg17831 (view) Author: ganesh Date: 2014-11-17.07:19:14
OK, I figured out the workingdir problem. Before the options rewrite, --to-patch was 
actually parsed in two different ways depending on command:

matchOneContext =
    DarcsMultipleChoiceOption
    [DarcsArgOption [] ["to-match"] OnePattern "PATTERN"
     "select changes up to a patch matching PATTERN",
     DarcsArgOption [] ["to-patch"] OnePatch "REGEXP"
     "select changes up to a patch matching REGEXP",
     __tag,
     DarcsAbsPathOption [] ["context"] Context "FILENAME"
     "version specified by the context in FILENAME"
    ]

and

matchOneContext =
    DarcsMultipleChoiceOption
    [DarcsArgOption [] ["to-match"] OnePattern "PATTERN"
     "select changes up to a patch matching PATTERN",
     DarcsArgOption [] ["to-patch"] OnePatch "REGEXP"
     "select changes up to a patch matching REGEXP",
     __tag,
     DarcsAbsPathOption [] ["context"] Context "FILENAME"
     "version specified by the context in FILENAME"
    ]

It looks like matchOneContext was only used by clone.

I can sort of see why this approach made sense for clone because you can't get a 
proper range of patches, you really have to start at the beginning of the repo, but 
the user would expect --to-patch to behave as it says. On the other hand having two 
different parsings inside Darcs is pretty confusing. I guess we should just change 
the clone code to understand 'UpToPatch' etc properly.
msg17838 (view) Author: bfrk Date: 2014-11-17.22:26:47
Thanks for the analysis Ganesh. I thinks this explains a lot of the
strange things I have seen. I agree that the better approach is to deal
with this in the clone command.
History
Date User Action Args
2014-11-15 22:32:29bfrkcreate
2014-11-15 23:00:49ganeshsetstatus: needs-screening -> in-discussion
nosy: + ganesh
messages: + msg17814
2014-11-15 23:24:44bfrksetmessages: + msg17815
2014-11-15 23:37:19ganeshsetmessages: + msg17816
2014-11-16 00:39:27bfrksetmessages: + msg17817
2014-11-16 00:42:37bfrksetmessages: + msg17818
2014-11-16 02:02:44ganeshsetmessages: + msg17819
2014-11-16 13:44:34bfrksetmessages: + msg17820
2014-11-16 15:32:05ganeshsetmessages: + msg17822
2014-11-16 17:08:04bfrksetmessages: + msg17823
2014-11-16 17:46:25ganeshsetmessages: + msg17824
2014-11-16 18:54:44bfrksetmessages: + msg17825
2014-11-16 20:11:28bfrksetfiles: + make-the-options-type-used-by-a-command-into-a-type-parameter_and_4_others.dpatch
messages: + msg17826
2014-11-16 20:12:05darcswatchsetdarcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-58bd717d4811a1331ddb8569c1806f15dbc3e175
2014-11-17 06:56:20ganeshsetmessages: + msg17830
2014-11-17 07:19:14ganeshsetmessages: + msg17831
2014-11-17 22:26:47bfrksetstatus: in-discussion -> obsoleted
messages: + msg17838