Issue 1693: Check displaying of Unicode patch metadata

Title	Check displaying of Unicode patch metadata
Priority	bug	Status	needs-diagnosis/design
Milestone	3.0.0	Resolved in
Superseder		Nosy List	asr, darcs-devel, dmitry.kurochkin, ganesh, nad, rpglover64, tux_rocker, xancorreu
Assigned To	ganesh	Topics

Created on 2009-11-15.18:08:14 by tux_rocker, last changed 2015-08-11.13:54:08 by xancorreu.

Messages
msg9352 (view)	Author: tux_rocker	Date: 2009-11-15.18:08:11
When issue64 has been solved, darcs stores patch metadata as Unicode strings. However, the code that displays the metadata was not written with non-ASCII characters in mind. I believe I saw that it replaces non-ASCII characters with numerical escapes, but I have not been able to find code for that in the Printer module. There is also the conceptual question of how to make Printer.renderPS treat non-ASCII characters.
msg13112 (view)	Author: kowey	Date: 2010-11-19.10:13:45
Is this worthy of a Darcs 2.5.1? Lele just experienced an issue similar to issue1990
msg13156 (view)	Author: tux_rocker	Date: 2010-11-21.16:24:24
If we get a fix, yes. I looked at it for a bit and ultimately what happens is that hPutStr writes only the lowest 8 bits of a character with a code point > 255. To solve this, we must either explicitly encode the text with the locale encoding before hPutStr'ing it, or we require GHC >= 6.12 and don't switch file handles to binary mode. The latter solution seems better because it is what the IO library is designed for. But I don't know which parts of darcs require the file handles to be in binary mode, and why. Note that for the Windows line ending behavior that the "binary mode" flag in C is for, you don't have to put the handle in binary mode in Haskell. 'hSetNewlineMode noNewlineTranslation' will do that for you and still write characters outside of latin1 correctly.
msg13160 (view)	Author: mornfall	Date: 2010-11-21.17:17:51
Reinier Lamers <bugs@darcs.net> writes: > I looked at it for a bit and ultimately what happens is that hPutStr > writes only the lowest 8 bits of a character with a code point > 255. To > solve this, we must either explicitly encode the text with the locale > encoding before hPutStr'ing it, or we require GHC >= 6.12 and don't > switch file handles to binary mode. We can't do the latter, since it breaks windows. Non-binary handles only work with ghc <= 6.10 on that platform, encountering a "bad" character with 6.12 or later will crash the program. Happens to darcs a lot for various reasons. You really need to use binary handles for everything. Yours, Petr.
msg13356 (view)	Author: kowey	Date: 2010-12-16.16:44:14
We made an impromptu attempt at working out what was going on in http://irclog.perlgeek.de/darcs/2010-12-16#i_3094602 It probably duplicates a lot of stuff that Reinier already knows, but never hurts to repeat the what-the-heck-is-darcs-doing exercise across different members of the team
msg13373 (view)	Author: kowey	Date: 2010-12-17.10:11:20
These two emails may be relevant: - http://lists.osuosl.org/pipermail/darcs-users/2010- December/025932.html - http://lists.osuosl.org/pipermail/darcs-users/2010-May/024023.html
msg13466 (view)	Author: ganesh	Date: 2011-01-05.23:16:15
Bumping to 2.5.2 - if a fix appears soon it might still make 2.5.1.
msg14549 (view)	Author: nad	Date: 2011-06-23.16:25:03
I just want to note that, due to this bug (in particular, the kind of output described in issue 1990), I am still using darcs 2.4.4.
msg17050 (view)	Author: rpglover64	Date: 2013-09-25.18:47:30
This bug's status is "needs-reproduction". Does this mean that the maintainers have not been able to replicate it? If that's the case, here's a quick way I can reproduce the bug: > darcs init foo > cd foo > touch foo > darcs add foo > darcs record As the patch name, use "берегам" (or any other unicode-containing string) > darcs log This displays * <U+0431><U+0435><U+0440><U+0435><U+0433><U+0430><U+043C> if DARCS_DONT_ESCAPE_8BIT=0 and * 15@530< if DARCS_DONT_ESCAPE_8BIT=1
msg17859 (view)	Author: ganesh	Date: 2014-11-26.06:34:19
We shouldn't have left this hanging so long, it's pretty fundamental to anyone in a non-ASCII locale. I'm taking a look at it now.
msg17884 (view)	Author: ganesh	Date: 2014-12-08.19:20:21
I've spent some time looking at this and sent draft patch1239 I don't really like it as it's a layer of sticking plaster on top of an already messy situation. It also doesn't make the situation any better on Windows or in non-UTF8 locales, though I don't think it makes it any worse. The patch basically adds a new parameter to all the printing code to decide whether to locale-encode Strings as they are output or not. Bytestrings are left alone as they typically seem to be sourced from patch metadata which is already UTF8-encoded. I don't think we can do this translation unconditionally because the same code is used to print out patch files on disk etc. It seems to me that the right long-term solution would be to use phantom types to tag both Strings and Bytestrings with their encoding (e.g. locale, UTF8, 7bit-only). This would make it clear where in the code we should convert things. We should also cleanly split out the "print for user consumption" and "print for persistence" use cases; they're currently tangled together all the way up to classes like ShowPatch. Users will also need to set the environment variable DARCS_DONT_ESCAPE_8BIT=1 to get UTF8 output. It feels like we should make that the default, but I don't fully understand the implications and why it isn't already the default.
msg18016 (view)	Author: bfrk	Date: 2015-02-05.14:49:53
Patch1239 was a first attempt at bringing this closer to a conclusion. It makes it apparent (and easily changeable) when and where a possible encoding step gets inserted. However, the question remains, how do we know in general when to pass Encode and when to pass Standard? Deciding this requires non-local knowledge about what kind of data we are processing which is error prone. My gut feeling is that the producer should put that information into the Doc, rather than the consumer guessing it. If I am right with that, the distinction can and should be made apparent in the type of the data, for instance by adding a type parameter data Doc (enc :: RenderMode) = ... Here, RenderMode is promoted to a kind. This needs the DataKinds extension which is available since ghc-7.6.x. I have moved the milestone to 3.0.0 because I do not think we have a definite solution yet.
msg18722 (view)	Author: xancorreu	Date: 2015-08-11.13:54:06
It works with `export DARCS_DONT_ESCAPE_8BIT=1`. Perhaps we could display always UTF-8 as UTF-8 except otherwise is noted (perhaps putting some file in _darcs directory for explicit use of encodings other than UTF-8).

History
Date	User	Action	Args
2009-11-15 18:08:14	tux_rocker	create
2010-11-07 16:01:12	tux_rocker	link	issue1990 superseder
2010-11-07 16:06:42	tux_rocker	set	assignedto: tux_rocker
2010-11-19 10:13:46	kowey	set	messages: + msg13112 milestone: 2.5.0
2010-11-21 16:24:25	tux_rocker	set	messages: + msg13156
2010-11-21 17:17:52	mornfall	set	messages: + msg13160
2010-12-10 16:32:16	kowey	set	milestone: 2.5.0 -> 2.5.1
2010-12-16 16:44:15	kowey	set	messages: + msg13356
2010-12-17 10:11:21	kowey	set	messages: + msg13373
2011-01-05 23:16:16	ganesh	set	messages: + msg13466 milestone: 2.5.1 -> 2.5.2
2011-01-12 14:05:58	asr	set	nosy: + asr
2011-06-23 16:25:03	nad	set	nosy: + nad messages: + msg14549
2013-06-21 13:34:11	rpglover64	set	nosy: + rpglover64
2013-09-25 18:47:32	rpglover64	set	messages: + msg17050
2014-11-26 06:34:21	ganesh	set	status: needs-reproduction -> needs-diagnosis/design nosy: + ganesh messages: + msg17859 assignedto: tux_rocker -> ganesh milestone: 2.5.2 -> 2.10.0 superseder: - Should store patch metadata in utf-8
2014-12-08 19:14:59	ganesh	link	patch1239 issues
2014-12-08 19:20:23	ganesh	set	messages: + msg17884
2015-02-05 14:49:56	bfrk	set	messages: + msg18016 milestone: 2.10.0 -> 3.0.0
2015-08-10 06:07:18	xancorreu	set	nosy: + xancorreu
2015-08-11 13:54:08	xancorreu	set	messages: + msg18722

Issue 1693 Check displaying of Unicode patch metadata