darcs

Patch 639 Utf-8 encoding for darcs send

Title Utf-8 encoding for darcs send
Superseder Nosy List ganesh, kerneis
Related Issues
Status accepted Assigned To ganesh
Milestone

Created on 2011-07-05.17:49:58 by kerneis, last changed 2011-12-30.10:11:09 by kerneis.

Files
File name Status Uploaded Type Edit Remove
fix-missing-langinfo_h-on-windows.dpatch kerneis, 2011-12-01.09:01:38 application/x-darcs-patch
fix-test-suite-for-new-makeemail-signature.dpatch kerneis, 2011-11-29.15:01:01 application/x-darcs-patch
patch-preview.txt kerneis, 2011-11-29.15:01:01 text/x-darcs-patch
patch-preview.txt ganesh, 2011-11-30.19:40:07 text/x-darcs-patch
patch-preview.txt kerneis, 2011-12-01.09:01:38 text/x-darcs-patch
unnamed kerneis, 2011-11-29.15:01:01
unnamed ganesh, 2011-11-30.19:40:07
unnamed kerneis, 2011-12-01.09:01:38
update-text-dep-to-make-sure-we-have-decodeutf8_.dpatch ganesh, 2011-11-30.19:40:07 application/x-darcs-patch
use-utf_8-charset-for-darcs-send-in-case-of-non_ascii-characters.dpatch dead kerneis, 2011-07-17.19:09:32 application/x-darcs-patch
utility-functions-to-check-locale-encoding.dpatch kerneis, 2011-08-29.15:44:35 application/x-darcs-patch
See mailing list archives for discussion on individual patches.
Messages
msg14574 (view) Author: kerneis Date: 2011-07-05.17:49:58
Hi,

this very simple patch fixes darcs send in order to handle non-ascii
characters in the message.  Note this has nothing to do with the patch
bundle, nor with the patch preview.  It's only about the message (like
the one you are reading currently).

Here is a demo: é è ề Ψ ޡ ߐ ह ჴ Ᏻ ‱ ⁂ ∰ ✈ ⢅ 𝀵 

If the text uses non-ascii characters, I assume it is valid utf-8.  It
will work on any reasonable modern system and it can’t be more broken
than the current behaviour anyway.

And please, don’t ask me to take the locale into account.

Best,
Gabriel

PS: I could write a similar patch for handling patch preview (I am of
course not talking about the actual bundle, only the text/x-darcs-patch
part).  Since I’m not sure it the best option, I’ll first wait for this
one to be accepted.

1 patch for repository http://darcs.net/screened:

Tue Jul  5 15:43:31 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Use utf-8 charset for darcs send in case of non-ascii characters
Attachments
msg14575 (view) Author: kerneis Date: 2011-07-08.15:50:42
Hi, 

this follow-up patch should fix the main concern about warning non-UTF-8
users.

It might definitely be improved; in particular, I check twice if the
message contains ASCII characters, once in Send.hs and once in Email.hs.
I also wonder if locales are useful at all --- see comment in the fourth
patch about that, maybe detecting invalid UTF-8 encoding is enough (and
certainly safer).  But I'd like to get some feedback before I hack yet
another version of this patch.

Also, would it be okay to depend on the utf8-string package, to use the
Codec.Binary.UTF8.String.isUTF8Encoded function?
http://hackage.haskell.org/packages/archive/utf8-string/

Please let me know what you think about it.

Best,
Gabriel

4 patches for repository http://darcs.net/screened:

Tue Jul  5 15:43:31 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Use utf-8 charset for darcs send in case of non-ascii characters

Fri Jul  8 17:06:44 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Add functions to get and check locale encoding in Darcs.Utils

Fri Jul  8 17:11:49 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Avoid useless re-rendering of contents message

Fri Jul  8 17:32:13 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Propose to abort before sending mail if current locale is not UTF-8
Attachments
msg14576 (view) Author: kowey Date: 2011-07-08.16:45:24
Hi,

On Fri, Jul 08, 2011 at 15:50:43 +0000, Gabriel Kerneis wrote:
> It might definitely be improved; in particular, I check twice if the
> message contains ASCII characters, once in Send.hs and once in Email.hs.

Checking does seem to be one good way to be considerate in the sense
that we only bother the user if we really have to.

But I was wondering: for patch metadata IO, Darcs already has the
auto-detection habit of trying UTF-8 first and falling back to
locale-specific decoding (see Darcs.Patch.Info).

Could we not do the same for reading in the messages for darcs send
emails?  In other words, try to emit utf-8 systematically; but do not
necessarily require that the user be in a UTF-8 locale.

As for the code, I think we already use Data.Text.Encoding from the text
package in Darcs (see ByteStringUtils)

Thanks,

Eric

PS. I don't think I've mentioned this on list yet, but I've been on a
    sort of Darcs holiday while I attend to some personal matters.
    Hoping to be back in September.

-- 
Eric Kow <http://erickow.com>
msg14589 (view) Author: kerneis Date: 2011-07-13.13:21:21
It is probably clear from the discussion on darcs-users, but just in
case: please do NOT apply this patch to screened, I'll rework it and
send an improved version for 2.10 (it won't make it into 2.8 anyway).
msg14593 (view) Author: kerneis Date: 2011-07-17.19:09:32
Hi,

this is a third attempt at using utf-8 whenever possible in darcs send.

Here is how it works:
- if mail is made of ascii characters, send with content-type charset
  set to ascii and ignore locale completely,
- if mail contains invalid utf-8 characters, propose to either abort
  (and save mail content in a file) or ignore the error and send with
  content-type charset set to utf-8 anyway (no support for other
  charset when sending),
- if mail is valid utf-8, send with content-type charset set to utf-8;
  additionnally print a warning if current locale is not recognised as
  utf-8 (but do not propose to abort, assuming what looks like utf-8 is
  utf-8).

The warnings are respectively:
**********
Warning: darcs expects an UTF-8 encoding when sending emails.
         Your text is not valid UTF-8.  Send anyway? [yn]
*** and ***
Warning: sending email as UTF-8 regardless of your locale (since it seems well-formed).
**********

Please, do not hesitate to rephrase them (Darcs.Commands.Send), I am not
a native speaker.  Apart from this wording issue, I tested the patch
rather extensively and I think it is ready to get into screened.

Best,
Gabriel

7 patches for repository http://darcs.net/screened:

Tue Jul  5 15:43:31 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Use utf-8 charset for darcs send in case of non-ascii characters

Fri Jul  8 17:06:44 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Add functions to get and check locale encoding in Darcs.Utils

Fri Jul  8 17:11:49 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Avoid useless re-rendering of contents message

Fri Jul  8 17:32:13 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Propose to abort before sending mail if current locale is not UTF-8

Sun Jul 17 20:35:03 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Add ByteString.isAscii

Sun Jul 17 20:35:52 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Use ByteString.isAscii in Darcs.Email

Sun Jul 17 20:36:33 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Improve UI for non-ascii messages in darcs send
Attachments
msg14595 (view) Author: jch Date: 2011-07-19.21:14:00
> - if mail contains invalid utf-8 characters, propose to either abort
>   (and save mail content in a file) or ignore the error and send with
>   content-type charset set to utf-8 anyway (no support for other
>   charset when sending),

Shouldn't that be ``text/plain; charset=x-unknown''?

-- Juliusz
msg14596 (view) Author: kerneis Date: 2011-07-20.05:32:18
On Tue, Jul 19, 2011 at 11:03:19PM +0200, Juliusz Chroboczek wrote:
> > - if mail contains invalid utf-8 characters, propose to either abort
> >   (and save mail content in a file) or ignore the error and send with
> >   content-type charset set to utf-8 anyway (no support for other
> >   charset when sending),
> 
> Shouldn't that be ``text/plain; charset=x-unknown''?

Yeah.  Or charset=unknown-8bit.  Google says both are accepted by most clients.

-- 
Gabriel
msg14630 (view) Author: ganesh Date: 2011-08-13.11:37:25
did you forget to darcs add system_encoding.c?

Also the existing convention for .c files seems to be to be to put them 
in src, not src/Darcs - since none of the ones we have is really Darcs 
specific and I guess that'll also be true of system_encoding.c
msg14698 (view) Author: kerneis Date: 2011-08-29.15:34:27
On Sat, Aug 13, 2011 at 11:37:26AM +0000, Ganesh Sittampalam wrote:
> did you forget to darcs add system_encoding.c?
> 
> Also the existing convention for .c files seems to be to be to put them 
> in src, not src/Darcs - since none of the ones we have is really Darcs 
> specific and I guess that'll also be true of system_encoding.c

Agreed.  I'll send an updated patch.
-- 
Gabriel
msg14699 (view) Author: kerneis Date: 2011-08-29.15:44:35
Hi,

here is the latest (and hopefully the last) version of my patch to fix
charset in darcs send.

It adds a --charset option to let the user specify any charset.
Additionnally, it tries to guess ASCII (easy) and UTF-8 (and bit harder)
charsets.  It also checks locale in case of UTF-8 to make sure there is
no discrepency, and asks before making bold assumptions.

I have tested it and it works for me but making unit tests seems a bit
hard (you would have to intercept the mail somehow and check headers).

Note that this is a fresh set of patches, the previous ones should be
discarded.

Best,
Gabriel

4 patches for repository http://darcs.net/screened:

Mon Aug 29 16:48:18 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Utility functions to check locale encoding

Mon Aug 29 16:50:46 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Add ByteStringUtils.isAscii

Mon Aug 29 17:09:29 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Add a --charset flag for darcs send

Mon Aug 29 17:16:04 CEST 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Guess ascii and utf-8 charsets
  
  In case we cannot guess the charset, or when it is utf-8 but the locale
  differs, ask the user before sending.
Attachments
msg14827 (view) Author: kerneis Date: 2011-11-29.15:01:01
Hi,

one trivial follow-up patch to fix test suite building.

-- Gabriel

1 patch for repository http://darcs.net/screened:

Tue Nov 29 15:43:04 CET 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Fix test suite for new makeEmail signature
Attachments
msg14828 (view) Author: ganesh Date: 2011-11-30.19:40:07
1 patch for repository http://darcs.net/screened:

Wed Nov 30 18:06:17 GMT 2011  ganesh@earth.li
  * update text dep to make sure we have decodeUtf8'
Attachments
msg14830 (view) Author: kerneis Date: 2011-12-01.09:01:38
1 patch for repository http://darcs.net/screened:

Thu Dec  1 09:55:39 CET 2011  Gabriel Kerneis <kerneis@pps.jussieu.fr>
  * Fix missing langinfo.h on Windows
  
  get_system_encoding will always return "utf8" on Windows, because darcs
  only uses it for a sanity check which is useless in most cases.
Attachments
msg14909 (view) Author: ganesh Date: 2011-12-29.20:46:45
Looks fine to me, though unicode always scares me!

I tried sending an patch with the example text with and without these 
patches and the difference was clear in my email client.

I also wrote a basic test for the utf-8 header and will push that along 
with these patches.
msg14914 (view) Author: kerneis Date: 2011-12-30.10:11:08
On Thu, Dec 29, 2011 at 08:46:46PM +0000, Ganesh Sittampalam wrote:
> Looks fine to me, though unicode always scares me!

In fact, I think the locale check could be stricter (issuing a warning if the
locale is ASCII instead of UTF-8).  I'll send a follow-up patch for that
purpose next week, but it is safe to apply as-is in the meantime this case
should almost never happen anyway.

Best regards,
-- 
Gabriel
History
Date User Action Args
2011-07-05 17:49:58kerneiscreate
2011-07-08 15:50:43kerneissetfiles: + unnamed, use-utf_8-charset-for-darcs-send-in-case-of-non_ascii-characters.dpatch, unnamed
messages: + msg14575
2011-07-08 16:45:25koweysetmessages: + msg14576
2011-07-13 13:21:21kerneissetstatus: needs-screening -> in-discussion
messages: + msg14589
2011-07-17 19:08:37kerneissetstatus: in-discussion -> needs-screening
2011-07-17 19:09:33kerneissetfiles: + unnamed, use-utf_8-charset-for-darcs-send-in-case-of-non_ascii-characters.dpatch, unnamed
messages: + msg14593
title: Use utf-8 charset for darcs send in case of non-ascii ... -> Utf-8 encoding for darcs send (reloaded)
2011-07-17 19:13:03kerneissetfiles: - unnamed
2011-07-17 19:13:05kerneissetfiles: - unnamed
2011-07-17 19:13:07kerneissetfiles: - unnamed
2011-07-17 19:13:09kerneissetfiles: - unnamed
2011-07-17 19:13:11kerneissetfiles: - unnamed
2011-07-17 19:13:13kerneissetfiles: - unnamed
2011-07-17 19:13:23kerneissetfiles: - use-utf_8-charset-for-darcs-send-in-case-of-non_ascii-characters.dpatch
2011-07-17 19:13:26kerneissetfiles: - use-utf_8-charset-for-darcs-send-in-case-of-non_ascii-characters.dpatch
2011-07-19 21:14:00jchsetmessages: + msg14595
2011-07-20 05:32:19kerneissetmessages: + msg14596
2011-08-13 11:37:26ganeshsetstatus: needs-screening -> followup-requested
assignedto: kerneis
messages: + msg14630
2011-08-29 15:34:28kerneissetmessages: + msg14698
2011-08-29 15:44:36kerneissetfiles: + unnamed, utility-functions-to-check-locale-encoding.dpatch, unnamed
messages: + msg14699
title: Utf-8 encoding for darcs send (reloaded) -> Correct charset in darcs send
2011-08-29 15:45:04kerneissetstatus: followup-requested -> needs-screening
2011-08-29 15:45:57kerneissetfiles: - unnamed
2011-08-29 15:46:16kerneissetfiles: - unnamed
2011-11-29 15:01:01kerneissetfiles: + patch-preview.txt, fix-test-suite-for-new-makeemail-signature.dpatch, unnamed
messages: + msg14827
title: Correct charset in darcs send -> Follow-up
2011-11-30 19:40:07ganeshsetfiles: + patch-preview.txt, update-text-dep-to-make-sure-we-have-decodeutf8_.dpatch, unnamed
messages: + msg14828
2011-12-01 09:01:38kerneissetfiles: + patch-preview.txt, fix-missing-langinfo_h-on-windows.dpatch, unnamed
messages: + msg14830
title: Follow-up -> Follow-up (fix windows build)
2011-12-16 09:34:06ganeshsetstatus: needs-screening -> needs-review
2011-12-29 20:46:45ganeshsetstatus: needs-review -> accepted-pending-tests
assignedto: kerneis -> ganesh
title: Follow-up (fix windows build) -> Utf-8 encoding for darcs send
messages: + msg14909
nosy: + ganesh
2011-12-29 21:36:22ganeshsetstatus: accepted-pending-tests -> accepted
2011-12-30 10:11:09kerneissetmessages: + msg14914