darcs

Issue 2389 non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew (and record)

Title: non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew (and record)
Priority: bug
Status: needs-diagnosis/design
Nosy List: Serware, imz, jaredj
Topics: Regression, UI

Created on 2014-05-19.00:48:29 by imz, last changed 2014-05-21.09:00:36 by imz.

Messages
msg17460 (view) Author: imz Date: 2014-05-19.00:48:23
1. Summarise the issue (what were you doing, what went wrong?)

In the output of "darcs whatsnew", Cyrillic letters are not printed as is, but
rather as escape codes, so they are not readable (although the locale is
Cyrillic).

$ darcs init
$ echo Здравствуйте > hello.txt
$ darcs add -r .
$ darcs wh 
addfile ./hello.txt
hunk ./hello.txt 1
+<U+00D0><U+0097><U+00D0><U+00B4><U+00D1><U+0080><U+00D0><U+00B0><U+00D0><U+00B2><U+00D1><U+0081><U+00D1><U+0082><U+00D0><U+00B2><U+00D1><U+0083><U+00D0><U+00B9><U+00D1><U+0082><U+00D0><U+00B5>
$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=
$ 

As far as I remember, this did not happen in some earlier version of darcs,
probably 2.1.2-alt2:
http://packages.altlinux.org/en/Platform5/srpms/darcs .

(There may be other commands where this happens; I'm not sure yet, and
I'll report them later if so.)

2. What behaviour were you expecting instead?

Letters that can be printed in a readable form according to the locale
should be printed unescaped.

Workaround: use "darcs diff", which uses the external diff tool (and
hence is slower):

$ darcs diff
diff -rN -u old-darcs-Cyr/hello.txt new-darcs-Cyr/hello.txt
--- old-darcs-Cyr/hello.txt    1970-01-01 03:00:00.000000000 +0300
+++ new-darcs-Cyr/hello.txt    2014-05-19 04:11:15.678289159 +0400
@@ -0,0 +1 @@
+Здравствуйте
$ 


3. What darcs version are you using? (Try: darcs --exact-version)

ghc7.6.1-darcs-2.8.4-alt2

$ darcs --exact-version
darcs compiled on Feb 26 2013, at 18:09:42

Context:

[TAG 2.8.4
Ganesh Sittampalam <ganesh@earth.li>**20130127231845
 Ignore-this: d032f69540341ecfd5858fce7aee1457
] 

[Resolve issue2155: Expurgate the non-functional annotate --xml-output
option
Dave Love <fx@gnu.org>**20130127231835
 Ignore-this: eb03207031e75687968091d56fb008f8
 backported from HEAD by Ganesh Sittampalam <ganesh@earth.li>
] 

[Resolve issue2155: Expurgate the non-functional annotate --xml-output
option
Dave Love <fx@gnu.org>**20130120121739
 Ignore-this: 8a9ce6409a50b71cd0d2fdabbc181b1a
 backported from HEAD by Ganesh Sittampalam <ganesh@earth.li>
] 

[note dependency bumps in NEWS
Ganesh Sittampalam <ganesh@earth.li>**20130120170310
 Ignore-this: 48cf181c89ec1b69fc6e9e701734ff19
] 

[bump version to 2.8.4
Ganesh Sittampalam <ganesh@earth.li>**20130120154856
 Ignore-this: 2f2542e9825b66cda3a0a17275b5e311
] 

[resolve issue2199: getMatchingTag needs to commute for dirty tags
Ganesh Sittampalam <ganesh@earth.li>**20121218191024
 Ignore-this: 947252cd8e084b793044aff564f0462d
 backported from HEAD
] 

[accept issue2199: darcs get --tag gets too much
Ganesh Sittampalam <ganesh@earth.li>**20120528164525
 Ignore-this: 8c138a80c294e6181a3ef9250593fa31
] 

[constrain haskeline version on old GHC
Ganesh Sittampalam <ganesh@earth.li>**20130119234432
 Ignore-this: 8af00cc3d3c1ad223a8b35712c06bae
] 

[Add option -a to darcs changes in Setup.lhs
Ben Franksen <ben.franksen@online.de>**20120811195807
 Ignore-this: f23d2e558f7248fec8d07b0391d9a7e8
 
 Some (potential) contributors (like me) have 'changes interactive'
 in their ~/.darcs/defaults and then wonder why their build hangs.
] 

[add copyright notices for the imported haskeline code
Ganesh Sittampalam <ganesh@earth.li>**20130119163610
 Ignore-this: afcdc8048f8b3233fa17d3ab0c9c311f
 licence/copyright taken from haskeline 0.6.4.7:
 BSD3, copyright Judah Jacobson
] 

[import encoding code from haskeline: switch over
Ganesh Sittampalam <ganesh@earth.li>**20130118231907
 Ignore-this: b423a92ba93e74520d0578ac21aceab3
] 

[import encoding code from haskeline: source files
Ganesh Sittampalam <ganesh@earth.li>**20130118225947
 Ignore-this: c2d1e228fa4cce3e66e90a14fa2f3200
] 

[import encoding code from haskeline: cabal file changes
Ganesh Sittampalam <ganesh@earth.li>**20130118070642
 Ignore-this: d2ed13887d0c547cb7498bd5a2aef46f
] 

[import encoding code from haskeline: Setup.lhs changes
Ganesh Sittampalam <ganesh@earth.li>**20130115181040
 Ignore-this: 31ccdca76001bff769464fb7a8e574e9
] 

[ROLLBACK: conditionally use bytestring-handle
Ganesh Sittampalam <ganesh@earth.li>**20130111213829
 Ignore-this: d3c18b61f765bdfcb574b4977185197b
 It doesn't skip invalid byte sequences when decoding so breaks on 
 repositories with non-UTF8 encoded metadata.
] 

[bump network dependency
Ganesh Sittampalam <ganesh@earth.li>**20130111184607
 Ignore-this: 54e55fa09793008d55572b6acad1a7b8
] 

[add some comments about "nearby" darcs, and print out the one that was
found
Ganesh Sittampalam <ganesh@earth.li>**20130111211817
 Ignore-this: 57385f3248fc539435ed9de069a40bd5
 backported from HEAD
] 

[make darcs-test look "nearby" for a darcs exe to use
Ganesh Sittampalam <ganesh@earth.li>**20130111211445
 Ignore-this: d952cf330c0d9510c5c973dd41b191e0
 backported from HEAD
] 

[need different path separator on Windows
Ganesh Sittampalam <ganesh@earth.li>**20121219191221
 Ignore-this: 42c7ba1e46f5b6b600d23838e5a162cb
] 

[test for issue2286: make sure we can read repos with non-UTF8 metadata
Ganesh Sittampalam <ganesh@earth.li>**20130102222735
 Ignore-this: adc6165d5d5d991383ebf0e6547f7bf4
] 

[We can use chcp to switch encodings on Windows
Ganesh Sittampalam <ganesh@earth.li>**20130101122254
 Ignore-this: bc115467e31e144694a33e43dca3fb6c
 This means that the tests that require different encodings can run.
] 

[Find latin9 locale on OS X too
Michael Hendricks <michael@ndrix.org>**20120420202408
 Ignore-this: c87db3b97312234ed2380d2ca11a8ca0
 
 Most Linux systems describe latin9 as "iso885915".  OS X
 describes it with "ISO8859-15".  The new regex catches both.
] 

[windows test fix: replace shell script with a Haskell program
Ganesh Sittampalam <ganesh@earth.li>**20121231224332
 Ignore-this: de01ab8647e7d62c18d8c266d514b054
] 

[unsetting DARCS_TEST_PREFS_DIR in utf8 test doesn't seem to be necessary
Ganesh Sittampalam <ganesh@earth.li>**20130101120246
 Ignore-this: ed74710d8b358b920d742e86a7f008d8
 
 It was causing problems on Windows because getAppUserDataDirectory
 still returns the normal user directory.
 
 It also means that the repository type choice isn't picked up.
] 

[improve diagnostics when utf8 test fails
Ganesh Sittampalam <ganesh@earth.li>**20121231224600
 Ignore-this: 63db587bc36f8826c66dc6913a4fdb2d
] 

[need to do case-insensitive comparison on Windows
Ganesh Sittampalam <ganesh@earth.li>**20121228214823
 Ignore-this: ef309d4aef22e87d5c3da3222926af0e
] 

[update NEWS
Ganesh Sittampalam <ganesh@earth.li>**20121216202654
 Ignore-this: aef3e47204a4157504f90554ffc3a327
] 

[conditionally use bytestring-handle instead of haskeline for encoding
Ganesh Sittampalam <ganesh@earth.li>**20121216201745
 Ignore-this: fd758796b689d090d01e003e660e405
 This is transitional because we need to support GHC 6.10: we can switch
 over to bytestring-handle unconditionally on HEAD.
] 

[bump deps for GHC 7.6/latest hackage
Ganesh Sittampalam <ganesh@earth.li>**20121216201543
 Ignore-this: 77829b074a4bff635a421879bdd04be0
] 

[conditionally support tar 0.4
Ganesh Sittampalam <ganesh@earth.li>**20121216201014
 Ignore-this: 8eff0330e6af196727bdd736ef31db25
] 

[recent test-framework seems to require Typeable
Ganesh Sittampalam <ganesh@earth.li>**20121216175601
 Ignore-this: a8cce6b69984bfc2335b5c19688950b3
] 

[stop using Prelude.catch
Ganesh Sittampalam <ganesh@earth.li>**20121216163240
 Ignore-this: b4bfc48775b3337f8f7ebe275be1a058
 
 backported from HEAD
] 

[import constructors of C types to deal with a GHC change
Ganesh Sittampalam <ganesh@earth.li>**20120401132500
 Ignore-this: ab7cf2fb5e9a2494c14fe7394200da9b
] 

[TAG 2.8.3
Ganesh Sittampalam <ganesh@earth.li>**20121104174910
 Ignore-this: 3198f6deecf3d1b44df6e05a2657d9ca
] 

Compiled with:

HTTP-4000.2.6
array-0.4.0.1
base-4.6.0.0
bytestring-0.10.0.0
containers-0.5.0.0
directory-1.2.0.0
extensible-exceptions-0.1.1.4
filepath-1.3.0.1
hashed-storage-0.5.10
haskeline-0.7.0.3
html-1.0.1.2
mmap-0.5.8
mtl-2.1.2
network-2.3.2.0
old-time-1.1.0.1
parsec-3.1.3
process-1.1.0.2
random-1.0.1.1
regex-compat-0.95.1
tar-0.4.0.1
terminfo-0.3.2.5
text-0.11.2.3
unix-2.6.0.0
utf8-string-0.3.7
vector-0.10.0.1
zlib-0.5.4.0
$ 

4. What operating system are you running?

ALT Linux
msg17461 (view) Author: imz Date: 2014-05-19.00:50:14
I originally reported this at
https://bugzilla.altlinux.org/show_bug.cgi?id=30083 .

More bad behavior:

The Cyrillic letters in the output of "darcs whatsnew" also produce some
strange garbage around them on dumb terminals:

$ echo world > world.txt
$ darcs add world.txt 
$ darcs wh | cat
addfile ./hello.txt
hunk ./hello.txt 1
+[_<U+00D0>_][_<U+0097>_][_<U+00D0>_][_<U+00B4>_][_<U+00D1>_][_<U+0080>_][_<U+00D0>_][_<U+00B0>_][_<U+00D0>_][_<U+00B2>_][_<U+00D1>_][_<U+0081>_][_<U+00D1>_][_<U+0082>_][_<U+00D0>_][_<U+00B2>_][_<U+00D1>_][_<U+0083>_][_<U+00D0>_][_<U+00B9>_][_<U+00D1>_][_<U+0082>_][_<U+00D0>_][_<U+00B5>_]
addfile ./world.txt
hunk ./world.txt 1
+world
$ 

Perhaps this is the colored highlighting for these codes (they are printed
in red), but the garbage doesn't appear around other highlighted text; for
example, the words "addfile" and "hunk" above are printed in blue in a
terminal.
msg17463 (view) Author: imz Date: 2014-05-19.01:41:03
Actually, the printed codes seem a bit strange to me -- they can hardly
correspond to the individual letters; for example, the beginning has two
repeating codes:

<U+00D0><U+0097><U+00D0>

but in the word Здравствуйте there are no 2 repeating letters in the
beginning.
msg17464 (view) Author: stephen Date: 2014-05-19.02:53:04
Ivan Zakharyaschev writes:
 > 
 > Ivan Zakharyaschev <imz@altlinux.org> added the comment:
 > 
 > Actually, the printed codes seem a bit strange to me -- they can hardly
 > correspond to the individual letters; for example, the beginning has two
 > repeating codes:
 > 
 > <U+00D0><U+0097><U+00D0>
 > 
 > but in the word Здравствуйте there are no 2 repeating letters in the
 > beginning.

I can't speak to the Darcs issue, but what's happening here is that
for some reason each octet is being treated as a Unicode character.
0xD0 0x97 is indeed the UTF-8 encoding of "З".
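
The octet-by-octet reading described above can be reproduced outside Darcs. The following is a minimal Python sketch, not Darcs code: the helper name escape_bytes_as_code_points is hypothetical, and the "<U+XXXX>" format is simply copied from the output in the report. It encodes the word as UTF-8 and then formats every byte as if it were its own code point.

```python
# Illustration only: mimics the escaped output from the report.
# Nothing here is taken from the darcs source.
word = "Здравствуйте"
utf8_bytes = word.encode("utf-8")


def escape_bytes_as_code_points(data: bytes) -> str:
    """Format each byte as if it were a separate Unicode code point."""
    return "".join(f"<U+{b:04X}>" for b in data)


escaped = escape_bytes_as_code_points(utf8_bytes)

# 12 letters, but 24 bytes: each of these Cyrillic letters takes 2 octets.
print(len(word), len(utf8_bytes))  # 12 24

# The first letter alone accounts for the first two escapes.
print(escape_bytes_as_code_points("З".encode("utf-8")))  # <U+00D0><U+0097>

# Grouping the bytes in pairs and decoding them recovers the letters.
pairs = [utf8_bytes[i:i + 2].decode("utf-8") for i in range(0, len(utf8_bytes), 2)]
print("".join(pairs) == word)  # True
```

The full escaped string matches the "+<U+00D0><U+0097>..." line in the original report, which is consistent with each octet being escaped on its own.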
msg17466 (view) Author: kowey Date: 2014-05-19.15:16:28
I notice there are a couple of other (otherwise seemingly unrelated)
encoding-related issues on the tracker (regarding patch names and
filenames).

Thanks to Stephen for noticing that the sequence of 3 Unicode code points 
actually corresponds to what would be a single char encoded in UTF-8 (3 
octets for that char).

I don't have a full diagnosis myself, but I'll note that Darcs mostly
treats text files as bytestrings (with the exception that it assumes some
sort of 8-bit encoding when looking for the "\n" char). So it's not
entirely surprising that you see individual bytes in the output
representation.

Of course, it's quite wrong on Darcs' part to be treating these as
individual Unicode code points, so something isn't quite going right on
the way from its internal representation of the file contents (bytes) to
the display on the screen (to text and back to bytes again).
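
One way the bytes-to-display step could behave is sketched below in Python. This is an assumption about a possible approach, not a description of what Darcs does or should do: the function render_for_display and its encoding parameter are hypothetical, and a real tool would presumably take the encoding from the locale. The idea is to decode the bytes in the expected encoding first and escape only what cannot be decoded or printed.

```python
# Hypothetical sketch: decode first, escape only what cannot be shown.
def render_for_display(data: bytes, encoding: str = "utf-8") -> str:
    # "surrogateescape" maps each undecodable byte 0xNN to U+DC00 + 0xNN,
    # so such bytes can be told apart from properly decoded characters.
    text = data.decode(encoding, errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Undecodable byte: show its value in the report's notation.
            out.append(f"<U+{code - 0xDC00:04X}>")
        elif ch.isprintable() or ch == "\t":
            # Readable in this encoding: print as is.
            out.append(ch)
        else:
            # Control characters and the like stay escaped.
            out.append(f"<U+{code:04X}>")
    return "".join(out)


print(render_for_display("Здравствуйте".encode("utf-8")))  # Здравствуйте
print(render_for_display(b"abc\xff"))  # abc<U+00FF>
```

With this approach valid UTF-8 text passes through unchanged, while a stray byte such as 0xFF is still shown in escaped form.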
msg17467 (view) Author: imz Date: 2014-05-19.16:38:32
2014-05-19 19:16 UTC+04:00, Eric Kow <bugs@darcs.net>:

> Thanks to Stephen for noticing that the sequence of 3 Unicode code points
> actually corresponds to what would be a single char encoded in UTF-8 (3
> octets for that char).

Just to be correct: it's a character encoded by 2 octets in UTF-8, and
then there goes another one encoded by 2 octets. But I wasn't smart
enough to notice this, so I quoted only the first three printed chars
(which correspond to one and a half real UTF-8 encodings of the
Cyrillic letters).

Also note that there is another problem: there is some garbage on dumb
terminals around these symbols. Something is wrong with the function
that prints these colored codes, because it wants to put some garbage
on a dumb terminal (it should not attempt to color things on dumb
terminals or in pipes).

Regards,
-- 
Ivan
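
The check suggested above (no coloring on dumb terminals or in pipes) can be sketched as follows. This is a hypothetical Python helper, not the Darcs implementation; the name should_use_color and the choice to treat a missing TERM as "dumb" are assumptions made for the example.

```python
import os
import sys


def should_use_color(stream, environ=None) -> bool:
    """Return True only for an interactive terminal that is not 'dumb'."""
    env = os.environ if environ is None else environ
    # Assumption: an unset TERM is treated the same as TERM=dumb.
    if env.get("TERM", "dumb") == "dumb":
        return False
    # Pipes and regular files report isatty() == False.
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


if __name__ == "__main__":
    # The result depends on how the script is run, e.g. directly
    # in a terminal versus piped through "cat".
    print(should_use_color(sys.stdout))
```

Under these assumptions, piping the output through "cat" or setting TERM=dumb both disable coloring.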
msg17468 (view) Author: kowey Date: 2014-05-19.16:43:46
On 19 May 2014 12:38, Ivan Zakharyaschev <bugs@darcs.net> wrote:
> Just to be correct: it's a character encoded by 2 octets in UTF-8, and
> then there goes another one encoded by 2 octets. But I wasn't smart
> enough to notice this, so I quoted only the first three printed chars
> (which correspond to one and a half real UTF-8 encodings of the
> Cyrillic letters).

Ah, I see, sorry for my lack of attention.
I think I must have tripped up because “З” looks like “3”.

> Also note that there is another problem: there is some garbage on dumb
> terminals around these symbols. Something is wrong with the function
> that prints these colored codes, because it wants to put some garbage
> on a dumb terminal (it should not attempt to color things on dumb
> terminals or in pipes).

If it's not too much trouble, I'd appreciate it if you could file a
separate bug on this matter.
issue918 looks like it's related (and the good news is that we seem to
have had some work on it)

-- 
Eric Kow <http://erickow.com>
msg17472 (view) Author: imz Date: 2014-05-20.08:25:44
http://bugs.darcs.net/issue2391 is about the other mentioned problem
with dumb terminals.
msg17476 (view) Author: imz Date: 2014-05-21.09:00:35
Well, as for a workaround for "darcs wh", there is "darcs diff", which
gives readable output even with non-Latin content.

But this also happens when running "darcs record", which asks you about
the changes interactively. I do not know of a workaround for that
case.
History
Date                 User     Action  Args
2014-05-19 00:48:29  imz      create
2014-05-19 00:50:15  imz      set     messages: + msg17461
2014-05-19 01:41:04  imz      set     messages: + msg17463
2014-05-19 02:53:05  stephen  set     messages: + msg17464
                                      title: non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew -> non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew
2014-05-19 15:16:39  kowey    set     priority: bug
                                      status: unknown -> needs-diagnosis/design
                                      topic: - ProbablyEasy, Darcs2
                                      messages: + msg17466
                                      title: non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew -> non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew
2014-05-19 16:38:33  imz      set     messages: + msg17467
2014-05-19 16:43:48  kowey    set     messages: + msg17468
2014-05-20 08:25:45  imz      set     messages: + msg17472
2014-05-21 09:00:36  imz      set     messages: + msg17476
                                      title: non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew -> non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew (and record)