darcs

Issue 2648 `darcs convert import` double-encodes cyrillic characters in UTF-8 input stream

Title `darcs convert import` double-encodes cyrillic characters in UTF-8 input stream
Priority bug Status resolved
Milestone Resolved in 2.15.0
Superseder Nosy List alster
Assigned To
Topics

Created on 2020-07-13.22:34:08 by alster, last changed 2020-07-25.23:40:12 by noreply.

Files
File name Uploaded Type Edit Remove
convert-import_-fix-meta-data-and-filepath-encoding-.dpatch bfrk, 2020-07-14.16:33:54 text/plain
Messages
msg22144 (view) Author: alster Date: 2020-07-13.22:34:04
When converting git repo containing cyrillic characters in filenames                                                                   
and/or patch names into darcs repo these characters looks like double                                                                  
encoded.                                                                                                                               
                                                                                                                                       
Tested with default version from ports tree:                                                                                           
% darcs --exact-version                                                                                                                
darcs compiled on May 15 2020, at 04:54:49                                                                                             
                                                                                                                                       
Weak Hash: not available                                                                                                               
                                                                                                                                       
Context:                                                                                                                               
                                                                                                                                       
[TAG 2.14.2                                                                                                                            
Ganesh Sittampalam <ganesh@earth.li>**20190126112346                                                                                   
 Ignore-this: 5c94e042fbcfdf311008d6e980d9cf1d                                                                                         
]                                                                                                                                      
                                                                                                                                       
And also with custom repo:https://urchin.earth.li/darcs/ganesh/temp/darcs-import-utf8-hack

patch a089edf27367f5698dad15ed20d28dd803fb156a                                                                                         
Author: Ganesh Sittampalam <ganesh@earth.li>                                                                                           
Date:   Wed Jun 10 01:58:21 +04 2020                                                                                                   
  * TEST: assume imported patch metadata is UTF8 not Char8                                                                             
                                                                                                                                       
                                                                                                                                       
Running on FreeBSD 12.1
msg22145 (view) Author: bfrk Date: 2020-07-14.08:20:28
Thanks for the bug report. The fix you propose assumes meta data in the
input stream is UTF8 encoded. Is that guaranteed by the git fast-import
file format? (It may be the safest bet to assume it is, even if git does
not guarantee it.)

I guess converting non-ASCII file paths will also fail for similar
reasons, so we need a similar fix for those.

BTW, your repo at

> https://urchin.earth.li/darcs/ganesh/temp/darcs-import-utf8-hack

has this patch:

patch fe2dd87f789dfa0f6f4bbbc65cc9953a68c1e552
Author: Ben Franksen <ben.franksen@online.de>
Date:   Mon Jan 27 20:23:54 CET 2020
  * fix file path encoding in convert export command

which has been obliterated from screened because it does not work on
Windows.
msg22146 (view) Author: alster Date: 2020-07-14.09:16:33
I've verified and guarantee, that provided cyrillic_input_stream file contains common `git fast-export` UTF-8 encoded stream from valid git repo, containing exactly one patch named 'Южноэфиопский грач увёл мышь за хобот на съезд ящериц' with exactly one file with cyrillic filename 'Панграмма.txt' with exactly one line of text 'Широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства' and is valid for recreating exactly the same git repo by piping it on `git fast-import`, which I twice tested manually (on 2020-06-29, when creating the patch, and now).

I'm also forgot to mention, that issue only exists for `darcs convert import`, while `darcs convert export` was and is totally OK both with common Darcs 2.14.2 and custom repo with patch from Ganesh.

I was explaining him all the situation at #darcs on IRC on 2020-06-09 [0], so he made this patch especially for me :) for testing.
Other people on #darcs are also very nice, friendly and helpful, and even guided me with building Darcs from source! :)
Sad, but even this didn't help, so he adviced me to open an issue and provide a test case for it, which I wasn't able to do until now (was ill).

Do you think it would help for me to provide similar input stream sample from valid fresh Darcs repo, created with the same UTF-8 cyrillic patch/file name/contents?

[0] https://freenode.logbot.info/darcs/20200609
msg22147 (view) Author: bfrk Date: 2020-07-14.13:22:29
> I've verified and guarantee, that provided cyrillic_input_stream file
> contains common `git fast-export` UTF-8 encoded stream from valid git
> repo, containing exactly one patch named 'Южноэфиопский грач увёл
> мышь за хобот на съезд ящериц' with exactly one file with cyrillic
> filename 'Панграмма.txt' with exactly one line of text 'Широкая
> электрификация южных губерний даст мощный толчок подъёму сельского
> хозяйства' and is valid for recreating exactly the same git repo by
> piping it on `git fast-import`, which I twice tested manually (on
> 2020-06-29, when creating the patch, and now).

Thanks. I think Ganesh's fix for the meta data is okay. However.

Have you checked that the file names in the darcs repo are also the same
as in the git repo? As I see it, the code that does the file name
conversion on import is utterly broken. It uses (floatPath . BC.unpack),
that is, we take the raw bytes of the input stream. Assuming that is
UTF8 encoded, this cast the bytes of the encoded file names each to Char
and then re-encodes them (floatPath calls encodeLocale for each path
component). This simply cannot work.

Indeed I have tested this just now: importing from a git repo with a
file named "müßig" (german Umlaute and sharp s) converts to darcs as:

ben@home[1]:.../scratch/gtest>ll git
total 4
-rw-rw-r-- 1 ben ben 8 Jul 14 13:59 müßig
ben@home[1]:.../scratch/gtest>ll darcs
total 8
drwxrwxr-x 6 ben ben 4096 Jul 14 14:02 _darcs
-rw-rw-r-- 1 ben ben    8 Jul 14 14:02 m303274303237ig

However, decoding plain byte sequences properly is not enough. Looking
at the output of git fast-export I see:

add file müßig
M 100644 :1 "m\303\274\303\237ig"

The first line is encoded in UTF8. But the second line uses the quoted
file name convention. The git manual says "A path can use C-style string
quoting." and apparently git interprets that as "escape any non-ASCII
byte". So when the file name is in quoted form we also have to parse
octal escaped bytes, not only the required '\r', '\n' etc. Sigh.
msg22148 (view) Author: bfrk Date: 2020-07-14.16:33:55
Attached is a patch that hopefully fixes all the encoding issues in
'darcs convert import', including parsing of quoted file paths with
bytes encoded using C's backslash-octal-number notation. (It replaces
and extends Ganeshs' quick fix, so please obliterate that before applying.)

I tested this by creating my own git repo with lots of funny characters
in file paths, meta data and file content, and then manually inspecting
the resulting darcs repo (using darcs log) as well as checking roundtrip
via 'darcs convert export' and comparing the output of 'git log'.

A test case from your side would still be appreciated, ideally as a
tests script.
Attachments
msg22150 (view) Author: alster Date: 2020-07-14.18:00:02
Thank you for clarification. I indeed wasn't precise enough in my words about UTF-8: I thought and talked about only patch data and metadata, filenames are different story. I observe the same Darcs behaviour you mentioned above:

Patch metadata: double encoded (BUG)
Patch data: left as is (OK)
Filenames: escaped somehow (Darcs can't understand it now, FIXME)
msg22151 (view) Author: alster Date: 2020-07-14.18:02:34
Test script is already here: [0]
Should I attach it to this issue?

[0] http://bugs.darcs.net/patch2046
msg22157 (view) Author: bfrk Date: 2020-07-14.18:13:44
Sorry, didn't see your patch right away. I have pushed it together 
with a small fix and my patches that resolve the issue. All should 
work now.
msg22158 (view) Author: alster Date: 2020-07-14.18:17:51
Thank you very much! I'll try to use it with my old git repos…

P.S. When do you plan to release nearest minor Darcs version with this fix included?
msg22260 (view) Author: noreply Date: 2020-07-25.23:40:10
The following patch sent by Ben Franksen <ben.franksen@online.de> updated issue issue2648 with
status=resolved;resolvedin=2.15.0 HEAD

* resolve issue2648: convert import with non-ASCII meta data and filepaths 
Ignore-this: a7366b5498a0c98e1cf47872134cbf159cc620fc977de357eeaf2f49d6886463f23c4947bbf2c480
History
Date User Action Args
2020-07-13 22:34:08alstercreate
2020-07-13 22:50:28alsterlinkpatch2046 issues
2020-07-13 22:51:09alstersetstatus: needs-testcase -> needs-diagnosis/design
2020-07-14 05:41:14alstersetstatus: needs-diagnosis/design -> needs-reproduction
2020-07-14 08:20:31bfrksetmessages: + msg22145
2020-07-14 09:16:36alstersetmessages: + msg22146
2020-07-14 13:22:32bfrksetmessages: + msg22147
2020-07-14 16:33:58bfrksetfiles: + convert-import_-fix-meta-data-and-filepath-encoding-.dpatch
messages: + msg22148
2020-07-14 18:00:06alstersetmessages: + msg22150
2020-07-14 18:02:37alstersetmessages: + msg22151
2020-07-14 18:13:47bfrksetmessages: + msg22157
2020-07-14 18:17:55alstersetmessages: + msg22158
2020-07-22 14:43:18bfrksetstatus: needs-reproduction -> has-patch
2020-07-25 23:40:12noreplysetstatus: has-patch -> resolved
messages: + msg22260
resolvedin: 2.15.0
2020-08-08 09:01:28ganeshlinkissue2652 superseder