darcs

Patch 1832 do white space en/decoding directly on the ByteStrings

Title do white space en/decoding directly on the ByteStrings
Superseder Nosy List bf
Related Issues
Status followup-in-progress Assigned To
Milestone

Created on 2019-06-15.10:29:42 by bf, last changed 2019-08-06.09:45:04 by ganesh.

Files
File name Status Uploaded Type Edit Remove
do-white-space-en_decoding-directly-on-the-bytestrings.dpatch bf, 2019-06-15.10:29:42 application/x-darcs-patch
patch-preview.txt bf, 2019-06-15.10:29:42 text/x-darcs-patch
patch-preview.txt bf, 2019-07-28.13:47:51 text/x-darcs-patch
rollback-_do-white-space-en_decoding-directly-on-the-bytestrings_.dpatch bf, 2019-07-28.13:47:51 application/x-darcs-patch
unnamed bf, 2019-06-15.10:29:42 text/plain
unnamed bf, 2019-07-28.13:47:51 text/plain
See mailing list archives for discussion on individual patches.
Messages
msg20828 (view) Author: bf Date: 2019-06-15.10:29:42
One more optimization prompted by profiling the darcs-3 format, but actually
independent of the patch format.

1 patch for repository /home/ben/src/darcs/without-v3:

patch d04e12c10444dea24960406be14129da05c73e22
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun Feb 17 21:50:01 CET 2019
  * do white space en/decoding directly on the ByteStrings
Attachments
msg20974 (view) Author: ganesh Date: 2019-07-27.19:44:11
Partial review of what I've looked at so far:

> Author: Ben Franksen <ben.franksen@online.de>
> Date:   Sun Feb 17 21:50:01 CET 2019
>   * do white space en/decoding directly on the ByteStrings

encodeWhitePS/decodeWhitePS seem like obvious candidates for 
QuickCheck testing against encodeWhite/decodeWhite, particularly if 
we want to optimise them further.

> hunk ./src/Darcs/Util/Path.hs 157
> +encodeWhitePS :: B.ByteString -> B.ByteString
> +encodeWhitePS x =
> +  case B.findIndex (\c -> isSpace (w2c c) || c == backslash) x of
> +    Nothing -> x
> +    Just i -> B.concat
> +      [ B.take i x
> +      , B.singleton backslash
> +      , BC.pack (show (B.index x i))
> +      , B.singleton backslash
> +      , encodeWhitePS (B.drop (i+1) x)
> +      ]
> +  where
> +    backslash = 0x5C
> +
> hunk ./src/Darcs/Util/Path.hs 604
> -encodeWhiteName = encodeLocale . encodeWhite . decodeLocale . 
unName
> +encodeWhiteName = encodeWhitePS . unName

[existing definition of encodeWhite]
> encodeWhite :: FilePath -> String
> encodeWhite (c:cs) | isSpace c || c == '\\' =
>     '\\' : show (ord c) ++ "\\" ++ encodeWhite cs
> encodeWhite (c:cs) = c : encodeWhite cs
> encodeWhite [] = []

Is the new code actually equivalent to the old code? In the old case 
we properly translate from locale code into Unicode characters and 
then look for any Unicode space. In the new code we just look for 
any byte that looks like an ASCII space in the encoded sequence of 
bytes.

The repeated concats in this code strike me as quite inefficient, 
I'd have thought it'd at least make sense to make a single list 
[B.ByteString] then concatenate at the end.
msg20984 (view) Author: bf Date: 2019-07-28.13:47:38
>> * do white space en/decoding directly on the ByteStrings
> 
> Is the new code actually equivalent to the old code? In the old case
> we properly translate from locale code into Unicode characters and 
> then look for any Unicode space. In the new code we just look for any
> byte that looks like an ASCII space in the encoded sequence of 
> bytes.

Good catch. My change was based on the erroneous assumption that we used
the isSpace function from Darcs.Util.ByteString, when in fact we used
Data.Char.isSpace. So this change is not backward compatible. I have
tested this with an example where I record a file named 'foo\u2004bar':
with darcs-2.12.5 the encoding is

addfile ./foo<U+00C3><U+00A2><U+00C2><U+0080><U+00C2><U+0084>bar

but with screened it is

addfile ./foo^Dbar

This causes lots of inconsistencies when I use the two darcs versions in
an interleaved fashion in the same repo. It is quite unfortunate that we
have to maintain this exact encoding. The only reason the whitespace
encoding was introduced is to allow word-based parsing. For this it
would have been enough to escape ASCII space chars.

Will send a rollback.

> The repeated concats in this code strike me as quite inefficient, I'd
> have thought it'd at least make sense to make a single list 
> [B.ByteString] then concatenate at the end.

Also a good point, but obsolete since I have to rollback this change.
msg20985 (view) Author: bf Date: 2019-07-28.13:47:51
1 patch for repository http://darcs.net/screened:

patch a424dfd10acc3792e21a3752555f493179527a3a
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun Jul 28 15:48:33 CEST 2019
  * rollback 'do white space en/decoding directly on the ByteStrings'
  
  This change was incompatible with the existing whitespace encoding for file
  paths in that the latter encodes /all/ unicode characters that
  Data.Char.isSpace identifies as space, not only those that are ASCII
  whitespace.
Attachments
msg20993 (view) Author: ganesh Date: 2019-07-28.14:25:02
Thanks - would you mind also adding the example you found as a
test case? (Or I can if you prefer)

I agree that it would have been much better to just look for ASCII
spaces. I guess it would be difficult to change the behaviour even
for V3 without seriously complicating the Prim.V1 code.
msg20997 (view) Author: bf Date: 2019-07-28.20:03:30
> Thanks - would you mind also adding the example you found as a
> test case?

Yes, definitely. I had planned to do that and will. The only problem
here is how to test for compatibility. I guess it means adding yet
another tar ball with a repo recorded with the current whitespace
encoding...

> I agree that it would have been much better to just look for ASCII
> spaces. I guess it would be difficult to change the behaviour even
> for V3 without seriously complicating the Prim.V1 code.

I don't think so. The only place where we make the distinction is in
Darcs.Patch.Read.readFileName and Darcs.Patch.Show.formatFileName. We
just need to add a new FileNameFormat; I think we should rename the
existing ones and define:

data FileNameFormat
    = FileNameFormatV1      -- ^ on-disk format for V1 patches
    | FileNameFormatV2      -- ^ on-disk format for V2 patches
    | FileNameFormatV3      -- ^ on-disk format for V3 patches
    | FileNameFormatDisplay -- ^ display format
    deriving (Eq, Show)

The only complication is that we now need a Darcs.Patch.V3.Prim module /
newtype wrapper for Prim.V1. This is copy and paste and replace work.
Ugly but really quite simple.

And if we do that we could as well throw out the whitespace encoding
completely for V3, and instead prefix the file name with its length in
bytes, decimal encoded.
History
Date User Action Args
2019-06-15 10:29:42bfcreate
2019-06-15 10:30:06bfsetstatus: needs-screening -> needs-review
2019-07-27 19:44:12ganeshsetstatus: needs-review -> review-in-progress
messages: + msg20974
2019-07-28 13:47:39bfsetmessages: + msg20984
2019-07-28 13:47:51bfsetfiles: + patch-preview.txt, rollback-_do-white-space-en_decoding-directly-on-the-bytestrings_.dpatch, unnamed
messages: + msg20985
2019-07-28 14:25:02ganeshsetmessages: + msg20993
2019-07-28 20:03:30bfsetmessages: + msg20997
2019-08-06 09:45:04ganeshsetstatus: review-in-progress -> followup-in-progress