darcs

Issue 1644 darcs send doesn't play well with utf8

Title: darcs send doesn't play well with utf8
Priority: urgent
Status: duplicate
Milestone:
Resolved in: 2.8.0
Superseder:
Nosy List: Jan.Vornberger, darcs-devel, dmitry.kurochkin, era, kowey
Assigned To:
Topics:

Created on 2009-10-08.21:00:00 by Jan.Vornberger, last changed 2011-07-05.13:41:15 by kerneis.

Messages
msg8933 Author: Jan.Vornberger Date: 2009-10-08.20:59:58
Hi there!

When using 'darcs send' to send patches to the xmonad mailing list, I
experience problems with patch bundles that contain utf8 strings (for
example in the author names). The mail that darcs prepares for my SMTP
client (msmtp) contains the utf8 string in 'quoted-printable' encoding,
but does not mention a charset. Some mail clients - at least mutt -
seem to default to iso-8859-1 in those cases and the patch bundle is
decoded wrongly on the receiving side, resulting in ?? where a single
utf8 character used to be.

Here is an example file that darcs prepared for the SMTP client:
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=10419#a10419
(You can see a utf8 sequence in: "Marco T=C3=BAlio Gontijo")
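
The mis-decoding can be reproduced outside darcs. The following is a minimal Python sketch, not darcs code; it assumes only the quoted-printable string quoted above and shows what a receiver gets when it interprets the decoded bytes as iso-8859-1 instead of utf8.

```python
import quopri

# Quoted-printable text as it appears in the prepared mail.
encoded = b"Marco T=C3=BAlio Gontijo"

# Undo the quoted-printable transfer encoding to get the raw bytes.
raw = quopri.decodestring(encoded)

# Interpreting the bytes as UTF-8 gives the intended name.
as_utf8 = raw.decode("utf-8")

# Interpreting the same bytes as ISO-8859-1 turns the single
# character "ú" into two unrelated characters.
as_latin1 = raw.decode("iso-8859-1")

print(as_utf8)
print(as_latin1)
```

The second string is one character longer than the first, which matches the "two characters where one used to be" symptom.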

The bug was also discussed on the IRC channel - starting timestamp is
2009-10-04, 10:43 ( http://irclog.perlgeek.de/darcs/2009-10-04#i_1567242 ).

To summarize some of the discussion: It was noted that darcs should look
at the locale to determine the encoding of metadata (author name etc.),
but doesn't do it currently. This should and can be fixed, from what I
understood.

However, that still leaves the file contents, the hunks, which also get
encoded. It was noted that darcs has no way of knowing what the
encoding of the *contents* of the files is and thus has no way of
specifying a charset. Therefore it was discussed how to prevent e-mail
clients from re-coding the attachment to a different charset and messing
it up in the process. One idea was to use 'application/x-darcs-patch'
instead of 'text/x-darcs-patch', which should prevent such re-coding.
However, it would also make patches harder to review, because such an
attachment would not be included when 'replying'. The idea of sending
both versions and using multipart/alternative was also mentioned and
someone pointed to issue1350 and issue1427, which also deal with how to
mail patch bundles more reliably.
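
The "send both versions" idea could look roughly like the sketch below. It uses Python's standard email package purely for illustration; it is not how darcs builds its mail, and the function name `build_alternative`, the part order, and the stand-in bundle are assumptions.

```python
from email.message import EmailMessage


def build_alternative(bundle: bytes) -> EmailMessage:
    """Wrap one patch bundle as two alternative MIME parts."""
    container = EmailMessage()
    # Readable copy: a text/* part that mail clients can quote in replies.
    container.add_alternative(
        bundle, maintype="text", subtype="x-darcs-patch",
        cte="quoted-printable",
    )
    # Opaque copy: an application/* part that clients should not re-code.
    container.add_alternative(
        bundle, maintype="application", subtype="x-darcs-patch",
        cte="base64",
    )
    return container


# Stand-in for a real patch bundle containing a non-ASCII author name.
bundle = "Marco Túlio Gontijo\n".encode("utf-8")
mail = build_alternative(bundle)
parts = list(mail.iter_parts())
print(mail.get_content_type())
print([p.get_content_type() for p in parts])
```

Both parts carry the same bytes; only the declared type and the transfer encoding differ, so a receiver can review the text copy and apply the application copy.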

In my case the metadata-only fix would probably be enough, but it would
certainly be preferable if utf8 file contents were also dealt with
correctly.

In the meantime, I use the following workaround: In ~/.darcs/defaults I
set force-darcs-send-utf8.sh as my mail command:

send sendmail-command /home/jan/bin/force-darcs-send-utf8.sh %<

The script looks like this:

cat /dev/stdin | sed "s/text\/x-darcs-patch;/text\/x-darcs-patch;
charset=utf-8;/" | /usr/sbin/sendmail -i -t

Using sed it adds 'charset=utf-8' to the mail before passing it on to
sendmail. This results in mails with patch bundles that are saved
correctly by mutt on the receiving side.
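
For readers who prefer not to rely on sed, the same substitution can be sketched in Python. This is only an illustration of the workaround: the function name `force_utf8_charset` and the sample header line are hypothetical, and unlike the sed command it rewrites every occurrence rather than the first one on each line.

```python
def force_utf8_charset(mail_text: str) -> str:
    """Add charset=utf-8 after every 'text/x-darcs-patch;' in the mail."""
    return mail_text.replace(
        "text/x-darcs-patch;",
        "text/x-darcs-patch; charset=utf-8;",
    )


# Hypothetical header line; the real one produced by darcs may differ.
sample = 'Content-Type: text/x-darcs-patch; name="patch.dpatch"\n'
fixed = force_utf8_charset(sample)
print(fixed, end="")
```

Like the shell version, this simply asserts utf8 and is only correct when the bundle really is utf8.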

Regards,

Jan
msg8937 Author: kowey Date: 2009-10-09.06:54:07
Thanks very much for the report!  

I've made it the first suggestion in
http://wiki.darcs.net/Troubleshooting#bug%20in%20get_extra%20commuting%20patch

It just bit us recently in
http://lists.osuosl.org/pipermail/darcs-users/2009-October/021809.html

It sounds like we just need to implement the attach-it-two-ways plan, so I'll
leave this for somebody to patch.

PS. I also wonder if we could have a simple workaround by systematically placing
a non ISO8859-1 representable character, say 暝, in the description.
msg9047 Author: era Date: 2009-10-26.21:18:54
For the record, any sane version of sed should read its standard input by
default, so you have a case of Useless Use of Cat.  The following should suffice.

  sed "s%text/x-darcs-patch;%text/x-darcs-patch;
    charset=utf-8;%" | /usr/sbin/sendmail -i -t

As far as the MIME encoding is concerned, there really is no need to put the
charset= parameter on a separate line; if you do that anyway, do take care that
there is at least one space before "charset=".

(Also note the use of alternate delimiters in sed's s%%% command in order to
avoid having to backslash-escape the literal slashes.  Furthermore, note that
options to sendmail may vary from one system to another.)
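
The point about the leading space can be checked with any MIME parser. The sketch below uses Python's standard email parser as one example of a receiver; the two header blocks are made up for the illustration and other parsers may react differently to the malformed one.

```python
from email import message_from_string

# Continuation line starts with a space: a valid folded header.
folded = "Content-Type: text/x-darcs-patch;\n charset=utf-8\n\nbody\n"

# Continuation line starts in column 0: no longer part of the header.
unfolded = "Content-Type: text/x-darcs-patch;\ncharset=utf-8\n\nbody\n"

ok = message_from_string(folded).get_content_charset()
broken = message_from_string(unfolded).get_content_charset()

print(ok)
print(broken)
```

With the space the parameter is picked up; without it this parser stops reading headers at that line and reports no charset at all.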

> PS. I also wonder if we could have a simple workaround by systematically 
> placing a non ISO8859-1 representable character, say 暝, in the description.

I don't see how that is supposed to help.  Could you please elaborate on this idea?
msg9048 Author: era Date: 2009-10-26.21:40:17
One minor clarification / amplification: in the absence of a charset=
identifier, the default character set is US-ASCII (and having quoted-printable
characters which are not US-ASCII, i.e. 7-bit only, is an error).  There is no
"guessing" involved, at least according to the MIME and SMTP standards, to the
extent I am familiar with them.
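
As a small illustration of that rule, the sketch below parses a mail without a charset parameter, applies the US-ASCII default, and checks whether the decoded body fits. The mail text is a made-up minimal example and Python's email package is used only as a convenient parser.

```python
from email import message_from_string

mail = (
    "Content-Type: text/x-darcs-patch\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Marco T=C3=BAlio Gontijo\n"
)
msg = message_from_string(mail)

# No charset parameter is present, so fall back to the MIME default.
declared = msg.get_content_charset("us-ascii")

# Reverse the quoted-printable encoding to get the raw body bytes.
raw = msg.get_payload(decode=True)

try:
    raw.decode(declared)
    fits = True
except UnicodeDecodeError:
    # The body contains bytes outside the 7-bit range.
    fits = False

print(declared, fits)
```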

I've seen something like charset=unknown (or was it x-unknown?) but that hardly
helps the poor recipient who would like to know how to interpret the file.
msg9049 Author: kowey Date: 2009-10-26.22:37:19
On Mon, Oct 26, 2009 at 21:18:56 +0000, era eriksson wrote:
> > PS. I also wonder if we could have a simple workaround by systematically 
> > placing a non ISO8859-1 representable character, say 暝, in the description.
> 
> I don't see how that is supposed to help.  Could you please elaborate on this idea?

I vaguely remember having the experience of having my mail client (mutt)
guess the encoding it should use to send messages based on the lowest
common denominator.  If it thought that there were only ISO8859-1
encodable characters in the message, it would encode it as such;
otherwise as UTF-8.
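
The "lowest common denominator" behaviour described here can be sketched as follows. Whether mutt really works this way is not confirmed in this thread; the function name `pick_charset` and the candidate list are assumptions made for the illustration.

```python
def pick_charset(text, candidates=("us-ascii", "iso-8859-1", "utf-8")):
    """Return the first candidate charset that can encode the whole text."""
    for charset in candidates:
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    raise ValueError("no candidate charset can encode the text")


print(pick_charset("plain ascii"))
print(pick_charset("Marco Túlio Gontijo"))
print(pick_charset("description with 暝"))
```

Under this scheme, a character such as 暝 would force the choice of UTF-8 because neither of the earlier candidates can represent it.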

I don't actually remember if this was true, and in any case, since darcs
send does not normally go through my mail client, this didn't really
make any sense, but I guess some part of me was wondering if this kind
of magic was happening under the hood somewhere.

It was a silly thought, sorry :-)

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
msg9063 Author: era Date: 2009-10-27.16:25:05
> I've seen something like charset=unknown (or was it x-unknown?) but that
> hardly helps the poor recipient who would like to know how to interpret
> the file.

Actually, using charset=X-UNKNOWN is the best approximation of the truth at the
moment.  This does not help improve the user experience at all, but should at
least allow verbatim saving of attachments from MIME-compliant clients (if there
are any ...).  Unknown characters will still display as ? or something like
that, but since Darcs does not know which character encoding was used, guessing
they are UTF-8 (or whatever) will produce wrong results for anyone who is not
working with UTF-8 data.
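
To make "verbatim saving" concrete: a receiver only has to undo the transfer encoding and write the resulting bytes, without interpreting them in any charset. The sketch below shows this with Python's email package; the mail text and the helper name `saved_bytes` are made up for the illustration.

```python
from email import message_from_string


def saved_bytes(body: str) -> bytes:
    """Return the attachment bytes a client would write to disk."""
    mail = (
        "Content-Type: text/x-darcs-patch; charset=x-unknown\n"
        "Content-Transfer-Encoding: quoted-printable\n"
        "\n"
        f"{body}\n"
    )
    msg = message_from_string(mail)
    # decode=True only reverses quoted-printable; no charset is applied.
    return msg.get_payload(decode=True)


utf8_copy = saved_bytes("Marco T=C3=BAlio Gontijo")
latin1_copy = saved_bytes("Marco T=FAlio Gontijo")
print(utf8_copy)
print(latin1_copy)
```

Both the UTF-8 and the ISO-8859-1 spelling come out byte for byte as they went in, which is the property a patch bundle needs.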

As another amplification, Darcs should ideally establish at commit time (by
inferring from the committer's locale, or by asking, or whatever) which encoding
the committed data uses, and save this metainformation as part of each commit. 
Asking the user (another user than the committer) what something is just pushes
the burden to people who in the normal case cannot know for sure, and assuming
the user's locale matches the committer's seems quite misdirected.
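
A rough sketch of the "record the encoding at commit time" idea is given below. It is not darcs code: the function name `commit_metadata` and the dictionary layout are hypothetical, and the committer's locale is only one possible way to infer the encoding.

```python
import codecs
import locale


def commit_metadata(author: str, message: str) -> dict:
    """Bundle commit metadata with the encoding inferred from the locale."""
    # Normalise the locale's encoding name, e.g. "UTF-8" becomes "utf-8".
    encoding = codecs.lookup(locale.getpreferredencoding(False)).name
    return {"author": author, "message": message, "encoding": encoding}


meta = commit_metadata("Marco Túlio Gontijo", "example patch")
print(meta["encoding"])
```

A receiver could then use the stored value to fill in the charset parameter instead of guessing from its own locale.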

There are probably scenarios where the committed changes could use a different
encoding than the commit message, for example, but at this point I suppose the
focus should just be on making stuff not break too badly when non-ASCII data is
communicated over email.
msg9077 Author: kowey Date: 2009-10-29.17:56:15
Hi Era,

era eriksson <era+darcs@iki.fi> added the comment:
> Actually, using charset=X-UNKNOWN is the best approximation of the truth at the
> moment.  This does not help improve the user experience at all, but should at
> least allow verbatim saving of attachments from MIME-compliant clients (if there
> are any ...).

Could I perhaps persuade you to submit a patch, and perhaps make some
suggestions for testing?

Is this just a straightforward tweak of src/Darcs/Email.hs?

Thanks!

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9
msg14571 Author: kerneis Date: 2011-07-05.13:41:14
This is a duplicate of issue1350, which has been fixed in the latest release.
History
Date                 User            Action  Args
2009-10-08 21:00:00  Jan.Vornberger  create
2009-10-09 06:54:09  kowey           set     priority: urgent
                                             status: unknown -> needs-implementation
                                             messages: + msg8937
                                             nosy: + kowey
2009-10-26 21:18:56  era             set     nosy: + era
                                             messages: + msg9047
2009-10-26 21:40:19  era             set     messages: + msg9048
2009-10-26 22:37:21  kowey           set     messages: + msg9049
2009-10-27 16:25:10  era             set     messages: + msg9063
2009-10-29 17:56:19  kowey           set     messages: + msg9077
2011-07-05 13:41:15  kerneis         set     status: needs-implementation -> duplicate
                                             messages: + msg14571
                                             resolvedin: 2.8.0