See mailing list archives
for discussion on individual patches.
msg17812 |
Author: bfrk |
Date: 2014-11-15.22:32:29 |
|
Note this only resolves the bug; it doesn't add any unit tests. I am now
convinced that decodeString is actually harmful with modern (i.e. > 7.0)
ghc.
1 patch for repository http://darcs.net/screened:
patch 5f4b9bb00fe09b25ee6b15e5171c765230f8db76
Author: Ben Franksen <benjamin.franksen@helmholtz-berlin.de>
Date: Sat Nov 15 23:27:38 CET 2014
* resolve issue2416: removed obsolete Darcs.Util.ByteString.decodeString
Attachments
|
msg17814 |
Author: ganesh |
Date: 2014-11-15.23:00:48 |
|
Unfortunately this seems to break the utf8.sh shell test :-( I agree we
really should be getting rid of things like this though.
|
msg17815 |
Author: bfrk |
Date: 2014-11-15.23:24:44 |
|
Hm, it works fine when I run the test on my machine. I am running ubuntu
14.04 with ghc-7.6.3 (ubuntu package). My locale is de_DE.UTF-8.
Can you say where exactly the test fails?
|
msg17816 |
Author: ganesh |
Date: 2014-11-15.23:37:19 |
|
Looks like it's at "patch name from command line not UTF-8-encoded!" -
I'm running NixOS 14.04, locale en_GB.UTF-8.
|
msg17817 |
Author: bfrk |
Date: 2014-11-16.00:39:26 |
|
I ran the test as
> ./dist/build/darcs-test/darcs-test --tests=utf
which gave me no indication that I have to generate an iso885915
locale for the test to do anything ;)
When I add this locale, it fails for me, too. I begin to understand why
this is the case. What I had trouble understanding: how does
decodeString not destroy incoming utf-8 encoded strings? I think it does
do that -- but only for characters with a code point above 255, and I
happen to be part of the lucky minority of people whose native language
encoding involves only code points below 256.
So the next question is: do we still want to support the 8-bit-only
non-unicode encodings at the cost of alienating those whose native
encoding needs code points above 255? That would entail scrapping the
so-called utf8.sh test (which should actually be called iso885915.sh).
I am pretty sure that if we asked around about how other projects handle
this, it would turn out we are the only ones still sticking to Latin
encodings.
Note this whole issue is purely about strings we get from the command line.
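The failure mode described here can be sketched outside of darcs. The helper names below are hypothetical (this is not the actual Darcs.Util.ByteString code), but they show why a char8-decoded command line can be repaired for Latin-ish input yet not in general:

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- With the foreign encoding forced to char8, every byte of the command
-- line turns into one Char whose code point equals the byte value.
char8Decode :: [Word8] -> String
char8Decode = map (chr . fromIntegral)

-- The reverse direction is only total for code points below 256; a
-- String containing real Unicode characters has no char8 image.
char8Encode :: String -> Maybe [Word8]
char8Encode s
  | all ((< 256) . ord) s = Just (map (fromIntegral . ord) s)
  | otherwise             = Nothing

main :: IO ()
main = do
  -- The UTF-8 bytes for 'e-acute' (0xC3 0xA9) survive a char8 round
  -- trip, which is why 8-bit locales appear to work...
  let utf8Bytes = [0xC3, 0xA9] :: [Word8]
  print (char8Encode (char8Decode utf8Bytes) == Just utf8Bytes)
  -- ...but a String holding a code point above 255 (the euro sign)
  -- cannot be taken back to char8 at all.
  print (char8Encode "\x20AC")
```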
|
msg17818 |
Author: bfrk |
Date: 2014-11-16.00:42:37 |
|
I wrote:
> That would entail scrapping the so-called
> utf8.sh test (which should actually be called iso885915.sh).
The "that" refers to the implicit proposal (framed as a question) to no
longer support latin encodings for strings we get from the command line.
|
msg17819 |
Author: ganesh |
Date: 2014-11-16.02:02:43 |
|
I'm quite happy to make a principled change to darcs' target behaviour, but I can't pretend to
understand the issues fully. Unfortunately encoding issues tend to be quite complex :-( Some
general thoughts, in a somewhat random order:
I did do some work on the encoding code a while ago to manage the transition to GHC 7.2's
behaviour, but I've mostly forgotten what I learnt then. From what I remember, I mostly
tinkered around the edges to change the way we dealt with incoming data, rather than really
understanding exactly what was going on.
I think there are also various open bugs relating to our internationalization, so I'm sure the
current behaviour is not perfect.
Is it worth writing a test to demonstrate your hypothesis about destroying incoming utf-8? I
also agree that the code implies this would happen but it would be good to be sure.
I would also like to understand why we can't support both scenarios, depending on the user's
current locale. Is it logically impossible, or just difficult because of the way darcs's code
is written, for example?
|
msg17820 |
Author: bfrk |
Date: 2014-11-16.13:44:34 |
|
You raise a number of good questions.
* Writing a test to demonstrate my claim: yes, good idea. I'll try to
  do that.
* Support both scenarios: that would be ideal, of course. Now that you
  ask, I think this should be possible. However, digging into the base
  library sources and searching around in the available libraries, I
  found no simple way to do that. I have sent a message to
  haskell-cafe; perhaps someone has an idea.
|
msg17822 |
Author: ganesh |
Date: 2014-11-16.15:32:05 |
|
I'm (belatedly) now remembering a bit more of the history behind the
current situation. Despite being responsible for the recent changes,
it's rather a blank in my memory :-)
- we actually force the encoding we read from the command-line to
'char8': see patch777.
- some of the current encoding code was imported from haskeline
when that stopped exporting the helper functionality we were
previously using: see patch1013. Not sure if that's directly
relevant to the current situation.
Generally I'm not suggesting we should investigate this issue to
death and come up with a perfect solution. But we should be a bit
careful not just to flip-flop based on one specific problem at hand,
and we should document what we expect to support with tests (and of
course remove the tests we no longer wish to support).
|
msg17823 |
Author: bfrk |
Date: 2014-11-16.17:08:04 |
|
OMG. This is all so messed up I am getting headaches just from thinking
about it.
So (since patch777) we do 'setFileSystemEncoding char8'. And since we do
that before 'System.IO.initForeignEncoding' has been called for the
first time, it permanently overrides the user's locale, as seen by all
Haskell libraries. IIUC, the main reason it's been done that way is so
that things like file names can be round-tripped no matter what the
encoding is.
On the other hand, we want to store patch meta data in utf-8. But stuff
from the command line is (unfortunately?) also subject to decoding via
initForeignEncoding, which we have globally set to char8. Hence the
decodeString function.
We can solve the original problem (issue2416) by adding a reverse
function to Darcs.Util.ByteString and using that in the Option
definitions when defining ounparse.
The larger problem (the reason we enforce char8 as the default) should
perhaps be discussed elsewhere.
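For reference, the global override described in this message boils down to something like the following at program startup. This is a minimal sketch using GHC.IO.Encoding, not the actual darcs initialisation code:

```haskell
import GHC.IO.Encoding (char8, setFileSystemEncoding, setForeignEncoding)

main :: IO ()
main = do
  -- Forcing char8 before any I/O means every FilePath/argv byte maps
  -- to exactly one Char, so arbitrary bytes round-trip regardless of
  -- the user's locale -- at the price that genuinely UTF-8 input
  -- arrives as raw bytes and has to be re-decoded by hand (hence the
  -- decodeString function discussed in this thread).
  setFileSystemEncoding char8
  setForeignEncoding char8
  putStrLn "foreign and filesystem encodings forced to char8"
```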
|
msg17824 |
Author: ganesh |
Date: 2014-11-16.17:46:25 |
|
> IIUC, the main reason it's been done that way
> is so that things like file names can be round-tripped
> no matter what the encoding is.
Yes, I think that's the motivation. I spent a while investigating
alternatives (see issue2095) but this ended up being the simplest
"solution".
I had some vague thoughts in the long-term of using newtypes around
String and ByteString to identify where in our code what encoding is
in use (the obvious ones being Char8, UTF-8 or user-locale), but
that's not going to be for 2.10. As you say that's probably best
discussed elsewhere.
> We can solve the original problem (issue2416) by
> adding a reverse function to Darcs.Util.ByteString
> and using that in the Option definitions when
> defining ounparse.
Isn't the problem here that 'decodeString' isn't invertible? The
example I gave in issue2416 was derived from running the utf8.sh
with patch1222, and it reflects a tag name being passed on the
command-line.
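The newtype idea mentioned above might look roughly like this. All names here are hypothetical, and the toy decoder handles only 1- and 2-byte UTF-8 sequences to keep the sketch short:

```haskell
import Data.Char (chr, ord)

-- Tag strings with their encoding so char8 data and decoded text
-- cannot be mixed up silently.
newtype Char8 = Char8 String deriving (Eq, Show) -- one byte per Char
newtype Utf8  = Utf8  String deriving (Eq, Show) -- real Unicode text

-- Decoding becomes an explicit, checked conversion instead of a side
-- effect of the global foreign encoding.
decodeChar8 :: Char8 -> Maybe Utf8
decodeChar8 (Char8 s) = Utf8 <$> go (map ord s)
  where
    go [] = Just []
    go (b : bs)
      | b < 0x80 = (chr b :) <$> go bs                 -- ASCII
      | b >= 0xC2 && b < 0xE0 = case bs of             -- 2-byte sequence
          c : rest | c >= 0x80 && c < 0xC0 ->
            (chr ((b - 0xC0) * 64 + (c - 0x80)) :) <$> go rest
          _ -> Nothing                                 -- truncated/invalid
      | otherwise = Nothing                            -- not handled here

main :: IO ()
main = do
  -- "\195\169" is the char8 view of the UTF-8 bytes for 'e-acute'.
  print (decodeChar8 (Char8 "\195\169") == Just (Utf8 "\233"))
  -- A lone continuation byte is rejected, not silently mangled.
  print (decodeChar8 (Char8 "\169"))
```

The point of the design is that a function taking `Utf8` simply cannot be handed raw command-line bytes without an explicit, failable decode in between.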
|
msg17825 |
Author: bfrk |
Date: 2014-11-16.18:54:44 |
|
> > We can solve the original problem (issue2416) by
> > adding a reverse function to Darcs.Util.ByteString
> > and using that in the Option definitions when
> > defining ounparse.
> Isn't the problem here that 'decodeString' isn't invertible? The
> example I gave in issue2416 was derived from running the utf8.sh
> with patch1222, and it reflects a tag name being passed on the
> command-line.
You are probably right. What I am certain of is that applying
decodeString when creating the DarcsFlag but not applying an inverse
when going back from the DarcsFlag to a String is wrong. You say there
is no inverse, so I think the only viable solution is to apply
decodeString either before options are parsed or afterwards, when they
are used. If you have an idea how to get out of this mess, please go
ahead.
|
msg17826 |
Author: bfrk |
Date: 2014-11-16.20:11:28 |
|
Okay, here is another patch bundle that implements a solution along the
lines of what Ganesh proposed. With this patch I no longer get failures
for the utf8.sh test. It also includes some small bugfix patches.
However, workingdir.sh is broken now.
Attachments
|
msg17830 |
Author: ganesh |
Date: 2014-11-17.06:56:19 |
|
A quick update on test failures following our IRC conversation.
The send failures are entirely caused by my "make the options type into a type parameter"
patches. I don't know why I didn't see them before, but now they happen without any of your
patches from this bundle. Your patches from this bundle commute with the options type changes,
so we can sort them out separately unless you expect some semantic dependency.
The workingdir.sh failure is actually caused by "fix: the toPatch option was broken, probably
due to some cut/paste". I haven't figured out exactly how we got into this state yet; the fix is
clearly correct, but it means the flag isn't recognised by the call to havePatchsetMatch in
Darcs.Repository.copyRepoAndGoToChosenVersion.
I'll update the bug if I figure out more.
|
msg17831 |
Author: ganesh |
Date: 2014-11-17.07:19:14 |
|
OK, I figured out the workingdir problem. Before the options rewrite, --to-patch was
actually parsed in two different ways depending on command:
matchOneContext =
DarcsMultipleChoiceOption
[DarcsArgOption [] ["to-match"] OnePattern "PATTERN"
"select changes up to a patch matching PATTERN",
DarcsArgOption [] ["to-patch"] OnePatch "REGEXP"
"select changes up to a patch matching REGEXP",
__tag,
DarcsAbsPathOption [] ["context"] Context "FILENAME"
"version specified by the context in FILENAME"
]
and
matchOneContext =
DarcsMultipleChoiceOption
[DarcsArgOption [] ["to-match"] OnePattern "PATTERN"
"select changes up to a patch matching PATTERN",
DarcsArgOption [] ["to-patch"] OnePatch "REGEXP"
"select changes up to a patch matching REGEXP",
__tag,
DarcsAbsPathOption [] ["context"] Context "FILENAME"
"version specified by the context in FILENAME"
]
It looks like matchOneContext was only used by clone.
I can sort of see why this approach made sense for clone: you can't get a
proper range of patches, you really have to start at the beginning of the
repo, but the user would expect --to-patch to behave as it says. On the
other hand, having two different parsings inside darcs is pretty
confusing. I guess we should just change the clone code to understand
'UpToPatch' etc. properly.
|
msg17838 |
Author: bfrk |
Date: 2014-11-17.22:26:47 |
|
Thanks for the analysis, Ganesh. I think this explains a lot of the
strange things I have seen. I agree that the better approach is to deal
with this in the clone command.
|
Date                | User       | Action | Args
--------------------+------------+--------+------------------------------------------
2014-11-15 22:32:29 | bfrk       | create |
2014-11-15 23:00:49 | ganesh     | set    | status: needs-screening -> in-discussion; nosy: + ganesh; messages: + msg17814
2014-11-15 23:24:44 | bfrk       | set    | messages: + msg17815
2014-11-15 23:37:19 | ganesh     | set    | messages: + msg17816
2014-11-16 00:39:27 | bfrk       | set    | messages: + msg17817
2014-11-16 00:42:37 | bfrk       | set    | messages: + msg17818
2014-11-16 02:02:44 | ganesh     | set    | messages: + msg17819
2014-11-16 13:44:34 | bfrk       | set    | messages: + msg17820
2014-11-16 15:32:05 | ganesh     | set    | messages: + msg17822
2014-11-16 17:08:04 | bfrk       | set    | messages: + msg17823
2014-11-16 17:46:25 | ganesh     | set    | messages: + msg17824
2014-11-16 18:54:44 | bfrk       | set    | messages: + msg17825
2014-11-16 20:11:28 | bfrk       | set    | files: + make-the-options-type-used-by-a-command-into-a-type-parameter_and_4_others.dpatch; messages: + msg17826
2014-11-16 20:12:05 | darcswatch | set    | darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-58bd717d4811a1331ddb8569c1806f15dbc3e175
2014-11-17 06:56:20 | ganesh     | set    | messages: + msg17830
2014-11-17 07:19:14 | ganesh     | set    | messages: + msg17831
2014-11-17 22:26:47 | bfrk       | set    | status: in-discussion -> obsoleted; messages: + msg17838