darcs

Issue 1693 Check displaying of Unicode patch metadata

Title Check displaying of Unicode patch metadata
Priority bug Status needs-diagnosis/design
Milestone 3.0.0 Resolved in
Superseder Nosy List asr, darcs-devel, dmitry.kurochkin, ganesh, nad, rpglover64, tux_rocker, xancorreu
Assigned To ganesh
Topics

Created on 2009-11-15.18:08:14 by tux_rocker, last changed 2015-08-11.13:54:08 by xancorreu.

Messages
msg9352 (view) Author: tux_rocker Date: 2009-11-15.18:08:11
When issue64 has been solved, darcs stores patch metadata as Unicode strings.
However, the code that displays the metadata was not written with non-ASCII
characters in mind. I believe I saw that it replaces non-ASCII characters with
numerical escapes, but I have not been able to find code for that in the Printer
module.

There is also the conceptual question of how to make Printer.renderPS treat
non-ASCII characters.
msg13112 (view) Author: kowey Date: 2010-11-19.10:13:45
Is this worthy of a Darcs 2.5.1?
Lele just experienced an issue similar to issue1990
msg13156 (view) Author: tux_rocker Date: 2010-11-21.16:24:24
If we get a fix, yes.

I looked at it for a bit and ultimately what happens is that hPutStr
writes only the lowest 8 bits of a character with a code point > 255. To
solve this, we must either explicitly encode the text with the locale
encoding before hPutStr'ing it, or we require GHC >= 6.12 and don't
switch file handles to binary mode.

The latter solution seems better because it is what the IO library is
designed for. But I don't know which parts of darcs require the file
handles to be in binary mode, and why.

Note that for the Windows line ending behavior that the "binary mode"
flag in C is for, you don't have to put the handle in binary mode in
Haskell. 'hSetNewlineMode noNewlineTranslation' will do that for you and
still write characters outside of latin1 correctly.
msg13160 (view) Author: mornfall Date: 2010-11-21.17:17:51
Reinier Lamers <bugs@darcs.net> writes:
> I looked at it for a bit and ultimately what happens is that hPutStr
> writes only the lowest 8 bits of a character with a code point > 255. To
> solve this, we must either explicitly encode the text with the locale
> encoding before hPutStr'ing it, or we require GHC >= 6.12 and don't
> switch file handles to binary mode.

We can't do the latter, since it breaks windows. Non-binary handles only
work with ghc <= 6.10 on that platform, encountering a "bad" character
with 6.12 or later will crash the program. Happens to darcs a lot for
various reasons. You really need to use binary handles for everything.

Yours,
  Petr.
msg13356 (view) Author: kowey Date: 2010-12-16.16:44:14
We made an impromptu attempt at working out what was going on in
http://irclog.perlgeek.de/darcs/2010-12-16#i_3094602

It probably duplicates a lot of stuff that Reinier already knows, but 
never hurts to repeat the what-the-heck-is-darcs-doing exercise across 
different members of the team
msg13373 (view) Author: kowey Date: 2010-12-17.10:11:20
These two emails may be relevant:

 - http://lists.osuosl.org/pipermail/darcs-users/2010-
December/025932.html
 - http://lists.osuosl.org/pipermail/darcs-users/2010-May/024023.html
msg13466 (view) Author: ganesh Date: 2011-01-05.23:16:15
Bumping to 2.5.2 - if a fix appears soon it might still make 2.5.1.
msg14549 (view) Author: nad Date: 2011-06-23.16:25:03
I just want to note that, due to this bug (in particular, the kind of
output described in issue 1990), I am still using darcs 2.4.4.
msg17050 (view) Author: rpglover64 Date: 2013-09-25.18:47:30
This bug's status is "needs-reproduction". Does this mean that the
maintainers have not been able to replicate it?

If that's the case, here's a quick way I can reproduce the bug:

> darcs init foo
> cd foo
> touch foo
> darcs add foo
> darcs record
As the patch name, use "берегам" (or any other unicode-containing string)
> darcs log

This displays

  * <U+0431><U+0435><U+0440><U+0435><U+0433><U+0430><U+043C>

if DARCS_DONT_ESCAPE_8BIT=0 and

  * 15@530<
if DARCS_DONT_ESCAPE_8BIT=1
msg17859 (view) Author: ganesh Date: 2014-11-26.06:34:19
We shouldn't have left this hanging so long, it's pretty fundamental to 
anyone in a non-ASCII locale. I'm taking a look at it now.
msg17884 (view) Author: ganesh Date: 2014-12-08.19:20:21
I've spent some time looking at this and sent draft patch1239

I don't really like it as it's a layer of sticking plaster on top of an already messy 
situation. It also doesn't make the situation any better on Windows or in non-UTF8 
locales, though I don't think it makes it any worse.

The patch basically adds a new parameter to all the printing code to decide whether to 
locale-encode Strings as they are output or not. Bytestrings are left alone as they 
typically seem to be sourced from patch metadata which is already UTF8-encoded.

I don't think we can do this translation unconditionally because the same code is used 
to print out patch files on disk etc. It seems to me that the right long-term solution 
would be to use phantom types to tag both Strings and Bytestrings with their encoding 
(e.g. locale, UTF8, 7bit-only). This would make it clear where in the code we should 
convert things. We should also cleanly split out the "print for user consumption" and 
"print for persistence" use cases; they're currently tangled together all the way up 
to classes like ShowPatch.

Users will also need to set the environment variable DARCS_DONT_ESCAPE_8BIT=1 to get 
UTF8 output. It feels like we should make that the default, but I don't fully 
understand the implications and why it isn't already the default.
msg18016 (view) Author: bf Date: 2015-02-05.14:49:53
Patch1239 was a first attempt at bringing this closer to a conclusion.
It makes it apparent (and easily changeable) when and where a possible
encoding step gets inserted.

However, the question remains, how do we know in general when to pass
Encode and when to pass Standard? Deciding this requires non-local
knowledge about what kind of data we are processing which is error prone.

My gut feeling is that the *producer* should put that information into
the Doc, rather than the consumer guessing it. If I am right with that,
the distinction can and should be made apparent in the type of the data,
for instance by adding a type parameter

data Doc (enc :: RenderMode) = ...

Here, RenderMode is promoted to a kind. This needs the DataKinds
extension which is available since ghc-7.6.x.

I have moved the milestone to 3.0.0 because I do not think we have a
definite solution yet.
msg18722 (view) Author: xancorreu Date: 2015-08-11.13:54:06
It works with `export DARCS_DONT_ESCAPE_8BIT=1`.

Perhaps we could display *always* UTF-8 as UTF-8 except otherwise is
noted (perhaps putting some file in _darcs directory for explicit use of
encodings other than UTF-8).
History
Date User Action Args
2009-11-15 18:08:14tux_rockercreate
2010-11-07 16:01:12tux_rockerlinkissue1990 superseder
2010-11-07 16:06:42tux_rockersetassignedto: tux_rocker
2010-11-19 10:13:46koweysetmessages: + msg13112
milestone: 2.5.0
2010-11-21 16:24:25tux_rockersetmessages: + msg13156
2010-11-21 17:17:52mornfallsetmessages: + msg13160
2010-12-10 16:32:16koweysetmilestone: 2.5.0 -> 2.5.1
2010-12-16 16:44:15koweysetmessages: + msg13356
2010-12-17 10:11:21koweysetmessages: + msg13373
2011-01-05 23:16:16ganeshsetmessages: + msg13466
milestone: 2.5.1 -> 2.5.2
2011-01-12 14:05:58asrsetnosy: + asr
2011-06-23 16:25:03nadsetnosy: + nad
messages: + msg14549
2013-06-21 13:34:11rpglover64setnosy: + rpglover64
2013-09-25 18:47:32rpglover64setmessages: + msg17050
2014-11-26 06:34:21ganeshsetstatus: needs-reproduction -> needs-diagnosis/design
nosy: + ganesh
messages: + msg17859
assignedto: tux_rocker -> ganesh
milestone: 2.5.2 -> 2.10.0
superseder: - Should store patch metadata in utf-8
2014-12-08 19:14:59ganeshlinkpatch1239 issues
2014-12-08 19:20:23ganeshsetmessages: + msg17884
2015-02-05 14:49:56bfsetmessages: + msg18016
milestone: 2.10.0 -> 3.0.0
2015-08-10 06:07:18xancorreusetnosy: + xancorreu
2015-08-11 13:54:08xancorreusetmessages: + msg18722