Patch 1832: do white space en/decoding directly on the ByteStrings - Darcs bug tracker

Title	do white space en/decoding directly on the ByteStrings
Superseder		Nosy List	bfrk
Related Issues
Status	followup-in-progress	Assigned To
Milestone

Created on 2019-06-15.10:29:42 by bfrk, last changed 2019-08-06.09:45:04 by ganesh.

Files
File name	Status	Uploaded	Type	Edit	Remove
do-white-space-en_decoding-directly-on-the-bytestrings.dpatch		bfrk, 2019-06-15.10:29:42	application/x-darcs-patch
patch-preview.txt		bfrk, 2019-06-15.10:29:42	text/x-darcs-patch
patch-preview.txt		bfrk, 2019-07-28.13:47:51	text/x-darcs-patch
rollback-_do-white-space-en_decoding-directly-on-the-bytestrings_.dpatch		bfrk, 2019-07-28.13:47:51	application/x-darcs-patch
unnamed		bfrk, 2019-06-15.10:29:42	text/plain
unnamed		bfrk, 2019-07-28.13:47:51	text/plain

See mailing list archives for discussion on individual patches.

Messages
msg20828 (view)	Author: bfrk	Date: 2019-06-15.10:29:42
One more optimization prompted by profiling the darcs-3 format, but actually independent of the patch format. 1 patch for repository /home/ben/src/darcs/without-v3: patch d04e12c10444dea24960406be14129da05c73e22 Author: Ben Franksen <ben.franksen@online.de> Date: Sun Feb 17 21:50:01 CET 2019 * do white space en/decoding directly on the ByteStrings Attachments do-white-space-en_decoding-directly-on-the-bytestrings.dpatch patch-preview.txt unnamed
msg20974 (view)	Author: ganesh	Date: 2019-07-27.19:44:11
Partial review of what I've looked at so far: > Author: Ben Franksen <ben.franksen@online.de> > Date: Sun Feb 17 21:50:01 CET 2019 > * do white space en/decoding directly on the ByteStrings encodeWhitePS/decodeWhitePS seem like obvious candidates for QuickCheck testing against encodeWhite/decodeWhite, particularly if we want to optimise them further. > hunk ./src/Darcs/Util/Path.hs 157 > +encodeWhitePS :: B.ByteString -> B.ByteString > +encodeWhitePS x = > + case B.findIndex (\c -> isSpace (w2c c) \|\| c == backslash) x of > + Nothing -> x > + Just i -> B.concat > + [ B.take i x > + , B.singleton backslash > + , BC.pack (show (B.index x i)) > + , B.singleton backslash > + , encodeWhitePS (B.drop (i+1) x) > + ] > + where > + backslash = 0x5C > + > hunk ./src/Darcs/Util/Path.hs 604 > -encodeWhiteName = encodeLocale . encodeWhite . decodeLocale . unName > +encodeWhiteName = encodeWhitePS . unName [existing definition of encodeWhite] > encodeWhite :: FilePath -> String > encodeWhite (c:cs) \| isSpace c \|\| c == '\\' = > '\\' : show (ord c) ++ "\\" ++ encodeWhite cs > encodeWhite (c:cs) = c : encodeWhite cs > encodeWhite [] = [] Is the new code actually equivalent to the old code? In the old case we properly translate from locale code into Unicode characters and then look for any Unicode space. In the new code we just look for any byte that looks like an ASCII space in the encoded sequence of bytes. The repeated concats in this code strike me as quite inefficient, I'd have thought it'd at least make sense to make a single list [B.ByteString] then concatenate at the end.
msg20984 (view)	Author: bfrk	Date: 2019-07-28.13:47:38
>> * do white space en/decoding directly on the ByteStrings > > Is the new code actually equivalent to the old code? In the old case > we properly translate from locale code into Unicode characters and > then look for any Unicode space. In the new code we just look for any > byte that looks like an ASCII space in the encoded sequence of > bytes. Good catch. My change was based on the erroneous assumption that we used the isSpace function from Darcs.Util.ByteString, when in fact we used Data.Char.isSpace. So this change is not backward compatible. I have tested this with an example where I record a file named 'foo\u2004bar': with darcs-2.12.5 the encoding is addfile ./foo<U+00C3><U+00A2><U+00C2><U+0080><U+00C2><U+0084>bar but with screened it is addfile ./foo^Dbar This causes lots of inconsistencies when I use the two darcs versions in an interleaved fashion in the same repo. It is quite unfortunate that we have to maintain this exact encoding. The only reason the whitespace encoding was introduced is to allow word-based parsing. For this it would have been enough to escape ASCII space chars. Will send a rollback. > The repeated concats in this code strike me as quite inefficient, I'd > have thought it'd at least make sense to make a single list > [B.ByteString] then concatenate at the end. Also a good point, but obsolete since I have to rollback this change.
msg20985 (view)	Author: bfrk	Date: 2019-07-28.13:47:51
1 patch for repository http://darcs.net/screened: patch a424dfd10acc3792e21a3752555f493179527a3a Author: Ben Franksen <ben.franksen@online.de> Date: Sun Jul 28 15:48:33 CEST 2019 * rollback 'do white space en/decoding directly on the ByteStrings' This change was incompatible with the existing whitespace encoding for file paths in that the latter encodes /all/ unicode characters that Data.Char.isSpace identifies as space, not only those that are ASCII whitespace. Attachments patch-preview.txt rollback-_do-white-space-en_decoding-directly-on-the-bytestrings_.dpatch unnamed
msg20993 (view)	Author: ganesh	Date: 2019-07-28.14:25:02
Thanks - would you mind also adding the example you found as a test case? (Or I can if you prefer) I agree that it would have been much better to just look for ASCII spaces. I guess it would be difficult to change the behaviour even for V3 without seriously complicating the Prim.V1 code.
msg20997 (view)	Author: bfrk	Date: 2019-07-28.20:03:30
> Thanks - would you mind also adding the example you found as a > test case? Yes, definitely. I had planned to do that and will. The only problem here is how to test for compatibility. I guess it means adding yet another tar ball with a repo recorded with the current whitespace encoding... > I agree that it would have been much better to just look for ASCII > spaces. I guess it would be difficult to change the behaviour even > for V3 without seriously complicating the Prim.V1 code. I don't think so. The only place where we make the distinction is in Darcs.Patch.Read.readFileName and Darcs.Patch.Show.formatFileName. We just need to add a new FileNameFormat; I think we should rename the existing ones and define: data FileNameFormat = FileNameFormatV1 -- ^ on-disk format for V1 patches \| FileNameFormatV2 -- ^ on-disk format for V2 patches \| FileNameFormatV3 -- ^ on-disk format for V3 patches \| FileNameFormatDisplay -- ^ display format deriving (Eq, Show) The only complication is that we now need a Darcs.Patch.V3.Prim module / newtype wrapper for Prim.V1. This is copy and paste and replace work. Ugly but really quite simple. And if we do that we could as well throw out the whitespace encoding completely for V3, and instead prefix the file name with its length in bytes, decimal encoded.

History
Date	User	Action	Args
2019-06-15 10:29:42	bfrk	create
2019-06-15 10:30:06	bfrk	set	status: needs-screening -> needs-review
2019-07-27 19:44:12	ganesh	set	status: needs-review -> review-in-progress messages: + msg20974
2019-07-28 13:47:39	bfrk	set	messages: + msg20984
2019-07-28 13:47:51	bfrk	set	files: + patch-preview.txt, rollback-_do-white-space-en_decoding-directly-on-the-bytestrings_.dpatch, unnamed messages: + msg20985
2019-07-28 14:25:02	ganesh	set	messages: + msg20993
2019-07-28 20:03:30	bfrk	set	messages: + msg20997
2019-08-06 09:45:04	ganesh	set	status: review-in-progress -> followup-in-progress