Created on 2014-05-19.00:48:29 by imz, last changed 2014-05-21.09:00:36 by imz.
msg17460 (view) |
Author: imz |
Date: 2014-05-19.00:48:23 |
|
1. Summarise the issue (what were you doing, what went wrong?)
In the output of "darcs whatsnew", Cyrillic letters are not printed as is,
but rather as escape codes, so they are not readable (even though the
locale is Cyrillic).
$ darcs init
$ echo Здравствуйте > hello.txt
$ darcs add -r .
$ darcs wh
addfile ./hello.txt
hunk ./hello.txt 1
+<U+00D0><U+0097><U+00D0><U+00B4><U+00D1><U+0080><U+00D0><U+00B0><U+00D0><U+00B2><U+00D1><U+0081><U+00D1><U+0082><U+00D0><U+00B2><U+00D1><U+0083><U+00D0><U+00B9><U+00D1><U+0082><U+00D0><U+00B5>
$ locale
LANG=ru_RU.UTF-8
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=
$
As far as I remember, this did not happen in some previous version of
darcs, probably 2.1.2-alt2 --
http://packages.altlinux.org/en/Platform5/srpms/darcs .
(There may be other commands where this happens; I am not sure yet, and
I'll report them later if so.)
2. What behaviour were you expecting instead?
The letters that can be printed in a readable form according to the
locale are printed unescaped.
Workaround: use "darcs diff", which uses the external diff tool (and
hence is slower):
$ darcs diff
diff -rN -u old-darcs-Cyr/hello.txt new-darcs-Cyr/hello.txt
--- old-darcs-Cyr/hello.txt 1970-01-01 03:00:00.000000000 +0300
+++ new-darcs-Cyr/hello.txt 2014-05-19 04:11:15.678289159 +0400
@@ -0,0 +1 @@
+Здравствуйте
$
3. What darcs version are you using? (Try: darcs --exact-version)
ghc7.6.1-darcs-2.8.4-alt2
$ darcs --exact-version
darcs compiled on Feb 26 2013, at 18:09:42
Context:
[TAG 2.8.4
Ganesh Sittampalam <ganesh@earth.li>**20130127231845
Ignore-this: d032f69540341ecfd5858fce7aee1457
]
[Resolve issue2155: Expurgate the non-functional annotate --xml-output
option
Dave Love <fx@gnu.org>**20130127231835
Ignore-this: eb03207031e75687968091d56fb008f8
backported from HEAD by Ganesh Sittampalam <ganesh@earth.li>
]
[Resolve issue2155: Expurgate the non-functional annotate --xml-output
option
Dave Love <fx@gnu.org>**20130120121739
Ignore-this: 8a9ce6409a50b71cd0d2fdabbc181b1a
backported from HEAD by Ganesh Sittampalam <ganesh@earth.li>
]
[note dependency bumps in NEWS
Ganesh Sittampalam <ganesh@earth.li>**20130120170310
Ignore-this: 48cf181c89ec1b69fc6e9e701734ff19
]
[bump version to 2.8.4
Ganesh Sittampalam <ganesh@earth.li>**20130120154856
Ignore-this: 2f2542e9825b66cda3a0a17275b5e311
]
[resolve issue2199: getMatchingTag needs to commute for dirty tags
Ganesh Sittampalam <ganesh@earth.li>**20121218191024
Ignore-this: 947252cd8e084b793044aff564f0462d
backported from HEAD
]
[accept issue2199: darcs get --tag gets too much
Ganesh Sittampalam <ganesh@earth.li>**20120528164525
Ignore-this: 8c138a80c294e6181a3ef9250593fa31
]
[constrain haskeline version on old GHC
Ganesh Sittampalam <ganesh@earth.li>**20130119234432
Ignore-this: 8af00cc3d3c1ad223a8b35712c06bae
]
[Add option -a to darcs changes in Setup.lhs
Ben Franksen <ben.franksen@online.de>**20120811195807
Ignore-this: f23d2e558f7248fec8d07b0391d9a7e8
Some (potential) contributors (like me) have 'changes interactive'
in their ~/.darcs/defaults and then wonder why their build hangs.
]
[add copyright notices for the imported haskeline code
Ganesh Sittampalam <ganesh@earth.li>**20130119163610
Ignore-this: afcdc8048f8b3233fa17d3ab0c9c311f
licence/copyright taken from haskeline 0.6.4.7:
BSD3, copyright Judah Jacobson
]
[import encoding code from haskeline: switch over
Ganesh Sittampalam <ganesh@earth.li>**20130118231907
Ignore-this: b423a92ba93e74520d0578ac21aceab3
]
[import encoding code from haskeline: source files
Ganesh Sittampalam <ganesh@earth.li>**20130118225947
Ignore-this: c2d1e228fa4cce3e66e90a14fa2f3200
]
[import encoding code from haskeline: cabal file changes
Ganesh Sittampalam <ganesh@earth.li>**20130118070642
Ignore-this: d2ed13887d0c547cb7498bd5a2aef46f
]
[import encoding code from haskeline: Setup.lhs changes
Ganesh Sittampalam <ganesh@earth.li>**20130115181040
Ignore-this: 31ccdca76001bff769464fb7a8e574e9
]
[ROLLBACK: conditionally use bytestring-handle
Ganesh Sittampalam <ganesh@earth.li>**20130111213829
Ignore-this: d3c18b61f765bdfcb574b4977185197b
It doesn't skip invalid byte sequences when decoding so breaks on
repositories with non-UTF8 encoded metadata.
]
[bump network dependency
Ganesh Sittampalam <ganesh@earth.li>**20130111184607
Ignore-this: 54e55fa09793008d55572b6acad1a7b8
]
[add some comments about "nearby" darcs, and print out the one that was
found
Ganesh Sittampalam <ganesh@earth.li>**20130111211817
Ignore-this: 57385f3248fc539435ed9de069a40bd5
backported from HEAD
]
[make darcs-test look "nearby" for a darcs exe to use
Ganesh Sittampalam <ganesh@earth.li>**20130111211445
Ignore-this: d952cf330c0d9510c5c973dd41b191e0
backported from HEAD
]
[need different path separator on Windows
Ganesh Sittampalam <ganesh@earth.li>**20121219191221
Ignore-this: 42c7ba1e46f5b6b600d23838e5a162cb
]
[test for issue2286: make sure we can read repos with non-UTF8 metadata
Ganesh Sittampalam <ganesh@earth.li>**20130102222735
Ignore-this: adc6165d5d5d991383ebf0e6547f7bf4
]
[We can use chcp to switch encodings on Windows
Ganesh Sittampalam <ganesh@earth.li>**20130101122254
Ignore-this: bc115467e31e144694a33e43dca3fb6c
This means that the tests that require different encodings can run.
]
[Find latin9 locale on OS X too
Michael Hendricks <michael@ndrix.org>**20120420202408
Ignore-this: c87db3b97312234ed2380d2ca11a8ca0
Most Linux systems describe latin9 as "iso885915". OS X
describes it with "ISO8859-15". The new regex catches both.
]
[windows test fix: replace shell script with a Haskell program
Ganesh Sittampalam <ganesh@earth.li>**20121231224332
Ignore-this: de01ab8647e7d62c18d8c266d514b054
]
[unsetting DARCS_TEST_PREFS_DIR in utf8 test doesn't seem to be necessary
Ganesh Sittampalam <ganesh@earth.li>**20130101120246
Ignore-this: ed74710d8b358b920d742e86a7f008d8
It was causing problems on Windows because getAppUserDataDirectory
still returns the normal user directory.
It also means that the repository type choice isn't picked up.
]
[improve diagnostics when utf8 test fails
Ganesh Sittampalam <ganesh@earth.li>**20121231224600
Ignore-this: 63db587bc36f8826c66dc6913a4fdb2d
]
[need to do case-insensitive comparison on Windows
Ganesh Sittampalam <ganesh@earth.li>**20121228214823
Ignore-this: ef309d4aef22e87d5c3da3222926af0e
]
[update NEWS
Ganesh Sittampalam <ganesh@earth.li>**20121216202654
Ignore-this: aef3e47204a4157504f90554ffc3a327
]
[conditionally use bytestring-handle instead of haskeline for encoding
Ganesh Sittampalam <ganesh@earth.li>**20121216201745
Ignore-this: fd758796b689d090d01e003e660e405
This is transitional because we need to support GHC 6.10: we can switch
over to bytestring-handle unconditionally on HEAD.
]
[bump deps for GHC 7.6/latest hackage
Ganesh Sittampalam <ganesh@earth.li>**20121216201543
Ignore-this: 77829b074a4bff635a421879bdd04be0
]
[conditionally support tar 0.4
Ganesh Sittampalam <ganesh@earth.li>**20121216201014
Ignore-this: 8eff0330e6af196727bdd736ef31db25
]
[recent test-framework seems to require Typeable
Ganesh Sittampalam <ganesh@earth.li>**20121216175601
Ignore-this: a8cce6b69984bfc2335b5c19688950b3
]
[stop using Prelude.catch
Ganesh Sittampalam <ganesh@earth.li>**20121216163240
Ignore-this: b4bfc48775b3337f8f7ebe275be1a058
backported from HEAD
]
[import constructors of C types to deal with a GHC change
Ganesh Sittampalam <ganesh@earth.li>**20120401132500
Ignore-this: ab7cf2fb5e9a2494c14fe7394200da9b
]
[TAG 2.8.3
Ganesh Sittampalam <ganesh@earth.li>**20121104174910
Ignore-this: 3198f6deecf3d1b44df6e05a2657d9ca
]
Compiled with:
HTTP-4000.2.6
array-0.4.0.1
base-4.6.0.0
bytestring-0.10.0.0
containers-0.5.0.0
directory-1.2.0.0
extensible-exceptions-0.1.1.4
filepath-1.3.0.1
hashed-storage-0.5.10
haskeline-0.7.0.3
html-1.0.1.2
mmap-0.5.8
mtl-2.1.2
network-2.3.2.0
old-time-1.1.0.1
parsec-3.1.3
process-1.1.0.2
random-1.0.1.1
regex-compat-0.95.1
tar-0.4.0.1
terminfo-0.3.2.5
text-0.11.2.3
unix-2.6.0.0
utf8-string-0.3.7
vector-0.10.0.1
zlib-0.5.4.0
$
4. What operating system are you running?
ALT Linux
|
msg17461 (view) |
Author: imz |
Date: 2014-05-19.00:50:14 |
|
This was initially written down by me at
https://bugzilla.altlinux.org/show_bug.cgi?id=30083 .
More bad behavior:
Also, the Cyrillic letters in the output of "darcs whatsnew" generate some
strange garbage around them on dumb terminals:
$ echo world > world.txt
$ darcs add world.txt
$ darcs wh | cat
addfile ./hello.txt
hunk ./hello.txt 1
+[_<U+00D0>_][_<U+0097>_][_<U+00D0>_][_<U+00B4>_][_<U+00D1>_][_<U+0080>_][_<U+00D0>_][_<U+00B0>_][_<U+00D0>_][_<U+00B2>_][_<U+00D1>_][_<U+0081>_][_<U+00D1>_][_<U+0082>_][_<U+00D0>_][_<U+00B2>_][_<U+00D1>_][_<U+0083>_][_<U+00D0>_][_<U+00B9>_][_<U+00D1>_][_<U+0082>_][_<U+00D0>_][_<U+00B5>_]
addfile ./world.txt
hunk ./world.txt 1
+world
$
Perhaps this is the colored highlighting for these codes (they are printed
in red)... but this garbage doesn't appear around other highlighted text; for
example, the words "addfile" and "hunk" above are printed in blue in a
terminal.
|
msg17463 (view) |
Author: imz |
Date: 2014-05-19.01:41:03 |
|
Actually, the printed codes seem a bit strange to me -- they can hardly
correspond to the individual letters; for example, the beginning has two
repeating codes:
<U+00D0><U+0097><U+00D0>
but in the word Здравствуйте there are no 2 repeating letters in the
beginning.
|
msg17464 (view) |
Author: stephen |
Date: 2014-05-19.02:53:04 |
|
Ivan Zakharyaschev writes:
>
> Ivan Zakharyaschev <imz@altlinux.org> added the comment:
>
> Actually, the printed codes seem a bit strange to me -- they can hardly
> correspond to the individual letters; for example, the beginning has two
> repeating codes:
>
> <U+00D0><U+0097><U+00D0>
>
> but in the word Здравствуйте there are no 2 repeating letters in the
> beginning.
I can't speak to the Darcs issue, but what's happening here is that
for some reason each octet is being treated as a Unicode character.
0xD0 0x97 is indeed the UTF-8 encoding of "З".
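To make that concrete, here is a minimal Python sketch that reproduces the reported output by escaping every UTF-8 octet as though it were a code point of its own. The function name escape_octets is made up for illustration; this is not Darcs code, only a model of the symptom.

```python
def escape_octets(text: str) -> str:
    """Render each UTF-8 octet of `text` as if it were a separate code point."""
    return "".join(f"<U+{octet:04X}>" for octet in text.encode("utf-8"))


# The UTF-8 octets of the first letter, "З" (U+0417).
print([f"0x{octet:02X}" for octet in "З".encode("utf-8")])

# Same shape as the line printed by "darcs wh" in the report.
print(escape_octets("Здравствуйте"))
```

The word has 12 letters, each taking 2 octets in UTF-8, which matches the 24 codes in the report.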
|
msg17466 (view) |
Author: kowey |
Date: 2014-05-19.15:16:28 |
|
I notice there are a couple of other (otherwise seemingly unrelated)
encodings-related issues on the trackers (regarding patch names and
filenames).
Thanks to Stephen for noticing that the sequence of 3 Unicode code points
actually corresponds to what would be a single char encoded in UTF-8 (3
octets for that char).
I don't have a full diagnosis myself, but I'll note that Darcs mostly
treats text files as bytestrings (with the exception that it assumes some
sort of 8-bit encoding when looking for the "\n" char). So it's not
entirely surprising that you see individual bytes in the output
representation.
Of course, it's quite wrong on Darcs' part to be treating these as
individual Unicode code points, so something isn't quite going right on
the way from its internal representation of the file contents (bytes) to
the display on the screen (to text and back to bytes again).
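A rough Python sketch of that bytes-to-text step, purely as an illustration: reading the octets with an 8-bit encoding such as Latin-1 gives one code point per octet, while reading them as UTF-8 gives the letters back. The display function and its fallback policy are assumptions made up for this example, not a description of what Darcs does or should do.

```python
raw = "Здравствуйте".encode("utf-8")  # file contents as stored: 24 octets

as_latin1 = raw.decode("latin-1")  # one code point per octet (the misreading)
as_utf8 = raw.decode("utf-8")      # what a UTF-8 locale calls for: 12 letters


def display(data: bytes, encoding: str) -> str:
    """Hypothetical display step: decode with the locale's encoding and
    escape non-ASCII octets only when the data cannot be decoded."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return "".join(chr(b) if b < 0x80 else f"<U+{b:04X}>" for b in data)


print(len(as_latin1), len(as_utf8))
print(display(raw, "utf-8"))
```

With a policy like this, escaping would only show up for content that really is not valid in the user's encoding.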
|
msg17467 (view) |
Author: imz |
Date: 2014-05-19.16:38:32 |
|
2014-05-19 19:16 UTC+04:00, Eric Kow <bugs@darcs.net>:
> Thanks to Stephen for noticing that the sequence of 3 Unicode code points
> actually corresponds to what would be a single char encoded in UTF-8 (3
> octets for that char).
Just to be correct: it's a character encoded by 2 octets in UTF-8, and
then there goes another one encoded by 2 octets. But I wasn't smart
enough to notice this, so I quoted only the first three printed chars
(which correspond to one and a half real UTF-8 encodings of the
Cyrillic letters).
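A small Python sketch of this pairing, going in the reverse direction: it collects the octets from the printed codes and decodes them as UTF-8. The helper name unescape_octets is invented for the example.

```python
import re


def unescape_octets(escaped: str) -> str:
    """Collect the octets from <U+00XX> codes and decode them as UTF-8."""
    hex_pairs = re.findall(r"<U\+00([0-9A-F]{2})>", escaped)
    return bytes(int(pair, 16) for pair in hex_pairs).decode("utf-8")


# Four codes are two complete letters.
print(unescape_octets("<U+00D0><U+0097><U+00D0><U+00B4>"))

# The three codes quoted earlier are one and a half letters, so strict
# decoding fails on the dangling lead octet.
try:
    unescape_octets("<U+00D0><U+0097><U+00D0>")
except UnicodeDecodeError as err:
    print("incomplete sequence:", err.reason)
```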
Also note that there is another problem: there is some garbage on dumb
terminals around these symbols. Something is wrong with the function
that prints these colored codes, because it wants to put some garbage
on a dumb terminal (it should not attempt to color things on dumb
terminals or in pipes).
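What I would expect, sketched in Python: decide on coloring from whether the output stream is a terminal and from the TERM variable. The function should_color and the exact rule are my assumptions, not the logic Darcs uses.

```python
import os
import sys


def should_color(stream, environ) -> bool:
    """Hypothetical policy: color only on an interactive, capable terminal."""
    if not stream.isatty():  # output goes to a pipe or a file
        return False
    # Treat a missing TERM the same as a dumb terminal.
    return environ.get("TERM", "dumb") != "dumb"


if __name__ == "__main__":
    print(should_color(sys.stdout, os.environ))
```

Under such a rule, "darcs wh | cat" would never emit highlighting markers, since the output stream is a pipe.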
Regards,
--
Ivan
|
msg17468 (view) |
Author: kowey |
Date: 2014-05-19.16:43:46 |
|
On 19 May 2014 12:38, Ivan Zakharyaschev <bugs@darcs.net> wrote:
> Just to be correct: it's a character encoded by 2 octets in UTF-8, and
> then there goes another one encoded by 2 octets. But I wasn't smart
> enough to notice this, so I quoted only the first three printed chars
> (which correspond to one and a half real UTF-8 encodings of the
> Cyrillic letters).
Ah, I see, sorry for my lack of attention.
I think I must have tripped up because “З” looks like “3”.
> Also note that there is another problem: there is some garbage on dumb
> terminals around these symbols. Something is wrong with the function
> that prints these colored codes, because it wants to put some garbage
> on a dumb terminal (it should not attempt to color things on dumb
> terminals or in pipes).
If it's not too much trouble, I'd appreciate it if you could file a
separate bug on this matter.
issue918 looks like it's related (and the good news is that we seem to
have had some work on it)
--
Eric Kow <http://erickow.com>
|
msg17472 (view) |
Author: imz |
Date: 2014-05-20.08:25:44 |
|
http://bugs.darcs.net/issue2391 is about the other mentioned problem
with dumb terminals.
|
msg17476 (view) |
Author: imz |
Date: 2014-05-21.09:00:35 |
|
Well, as a workaround for "darcs wh", there is "darcs diff", which gives
readable output even with non-Latin content.
But this also happens when running "darcs record", which asks you about
the changes interactively. I do not know of a workaround for that
case.
|
|
Date | User | Action | Args |
2014-05-19 00:48:29 | imz | create | |
2014-05-19 00:50:15 | imz | set | messages: + msg17461 |
2014-05-19 01:41:04 | imz | set | messages: + msg17463 |
2014-05-19 02:53:05 | stephen | set | messages: + msg17464 title: non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew -> non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew |
2014-05-19 15:16:39 | kowey | set | priority: bug status: unknown -> needs-diagnosis/design topic: - ProbablyEasy, Darcs2 messages: + msg17466 title: non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew -> non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew |
2014-05-19 16:38:33 | imz | set | messages: + msg17467 |
2014-05-19 16:43:48 | kowey | set | messages: + msg17468 |
2014-05-20 08:25:45 | imz | set | messages: + msg17472 |
2014-05-21 09:00:36 | imz | set | messages: + msg17476 title: non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew -> non-Latin (e.g. Cyrillic) letters not printed in the output of whatsnew (and record) |
|