darcs

Patch 1823 v3: basic implementation and tests

Title v3: basic implementation and tests
Superseder Nosy List bf, ganesh
Related Issues
Status accepted-pending-tests Assigned To ganesh
Milestone

Created on 2019-06-14.19:05:28 by bf, last changed 2019-07-14.18:36:14 by ganesh.

Files
File name Status Uploaded Type Edit Remove
add-some-functions-we-need-for-v3-to-darcs_patch_ident.dpatch bf, 2019-06-14.19:05:28 application/x-darcs-patch
document-and-add-property-for-primpatchid-invariant.dpatch bf, 2019-07-08.16:17:48 application/x-darcs-patch
patch-preview.txt bf, 2019-06-14.19:05:28 text/x-darcs-patch
patch-preview.txt bf, 2019-07-08.16:17:48 text/x-darcs-patch
unnamed bf, 2019-06-14.19:05:28 text/plain
unnamed bf, 2019-07-08.16:17:48 text/plain
See mailing list archives for discussion on individual patches.
Messages
msg20809 (view) Author: bf Date: 2019-06-14.19:05:28
4 patches for repository /home/ben/src/darcs/without-v3:

patch 201588ce705f87b17bcb800c113b91d75d3a450e
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Feb  8 16:50:41 CET 2019
  * add some functions we need for V3 to Darcs.Patch.Ident

patch 2ab2d6ea7d3b4299dee37bc81e36fa35bc9c035d
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Feb  8 17:45:13 CET 2019
  * add RepoPatchV3 aka camp conflictors

patch 41d9f81cb0bb0c89d3589440cd2959d0b5df7361
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Feb  8 17:45:38 CET 2019
  * add V3 stuff to the test suite (quickcheck tests only)

patch b8386798be44a17c5e33cfaa130da7f94fc60941
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Feb 22 15:02:39 CET 2019
  * add a comment to commutePast
Attachments
msg20812 (view) Author: ganesh Date: 2019-06-14.23:09:47
Continuing the discussion from patch1804:

>>> -- The 'Int' gets inverted together with the patch.
>>> data PrimPatchId = PrimPatchId !Int !SHA1
>> 
>> Is it illegal for the Int to be 0?
> 
> Yes.
> 
>> If so this should be documented
>> and ideally tested. (I haven't read a lot of the rest of the 
>> relevant code yet, so this might already be tested.)
> 
> Good point. No, i don't think this is tested nor documented. Will do.

Actually, a followup thought is do we really need this slightly
opaque encoding, or could we use a Bool like PatchInfo has?

In the same vein, I wonder a bit about the SHA1. I wonder what
the trade-offs between that and using the PatchInfo directly are (if
the latter is even possible). I'll need to get to grips with the rest
of the code to have a good feeling for that though.
msg20816 (view) Author: bf Date: 2019-06-15.08:08:03
> Actually, a followup thought is do we really need this slightly
> opaque encoding, or could we use a Bool like PatchInfo has?

We certainly could. But we'd still have to maintain an invariant (>=0).
I tried to minimize the memory footprint here. Note this gets attached
to /every/ single prim patch (if it is part of a RepoPatchV3).

> In the same vein, I wonder a bit about the SHA1. I wonder what
> the trade-offs between that and using the PatchInfo directly are (if
> the latter is even possible). I'll need to get to grips with the rest
> of the code to have a good feeling for that though.

Again, certainly possible. But remember that the PrimPatchId is part of
the on-disk representation. Using the full meta-data would make patches
quite a bit larger, slowing down read and write of patches and also
equality tests. And since we read PrimPatchIds from disk, we cannot (at
least not automatically) share equal PrimPatchIds in memory, so memory
requirements would also go up significantly.
msg20831 (view) Author: ganesh Date: 2019-06-15.10:55:50
> Again, certainly possible. But remember that the PrimPatchId is
> part of the on-disk representation. Using the full meta-data
> would make patches quite a bit larger, slowing down read and
> write of patches and also equality tests. And since we read 
> PrimPatchIds from disk, we cannot (at least not automatically) 
> share equal PrimPatchIds in memory, so memory requirements
> would also go up significantly.

Yeah. I was thinking that the on-disk representation could elide the 
PrimPatchId when it's the same as that of the containing patch, so
it'd only be needed for conflicts.

When reading patches, this would also implicitly mean we get back 
most of the sharing.

A pointer is smaller than a SHA1, so this would be a win for the
non-conflicting case.

We could also choose to use hash-consing to get back sharing for
the conflicting case, at the cost of some speed when reading, and
extra code complexity in reading patches.
msg20833 (view) Author: ganesh Date: 2019-06-15.11:07:39
>> Actually, a followup thought is do we really need this slightly
>> opaque encoding, or could we use a Bool like PatchInfo has?
> 
> We certainly could. But we'd still have to maintain an invariant 
> (>=0). I tried to minimize the memory footprint here. Note this
> gets attached to /every/ single prim patch (if it is part of
> a RepoPatchV3).

Isn't it just that the numbers need to be unique amongst all the 
primitives with the same named patch, even if in practice we
allocate them as positive numbers?

One alternative if keeping the efficiency is important would be to 
protect the representation by not exporting the constructor of 
NamedPrim.
msg20834 (view) Author: bf Date: 2019-06-15.14:01:03
>> Again, certainly possible. But remember that the PrimPatchId is
>> part of the on-disk representation. Using the full meta-data
>> would make patches quite a bit larger, slowing down read and
>> write of patches and also equality tests. And since we read 
>> PrimPatchIds from disk, we cannot (at least not automatically) 
>> share equal PrimPatchIds in memory, so memory requirements
>> would also go up significantly.
> 
> Yeah. I was thinking that the on-disk representation could elide the 
> PrimPatchId when it's the same as that of the containing patch, so
> it'd only be needed for conflicts.
> 
> When reading patches, this would also implicitly mean we get back 
> most of the sharing.
> 
> A pointer is smaller than a SHA1, so this would be a win for the
> non-conflicting case.

Okay, this would solve the space issue (on disk and in memory), assuming
the repo doesn't contain too many conflicted patches.

> We could also choose to use hash-consing to get back sharing for
> the conflicting case, at the cost of some speed when reading, and
> extra code complexity in reading patches.

Let me flesh that out a bit and please correct me if I misunderstood:

So the idea here is to use a global (per repo) lookup table from SHA1
hashes to meta-data. That would allow us to use the optimized on-disk
representation for Conflictors, too, i.e. the current one base on
PrimPatchId, but use the real meta-data in a shared way in memory.
Access to this table is of course impure, we would need to hide the
necessary unsafePerformIO stuff behind some hopefully semantically pure
abstraction.

I still cannot see how this would give us back cheap equality tests for
prim patches in the implementation of RepoPatchV3.

BTW, one optimization of the on-disk representation that would be
minimally invasive and even serve as a possible first step toward the
more far-reaching changes discussed above, is to eliminate the
PrimPatchIds for unconflicted patches and just re-calculate them when
reading the patch.
msg20886 (view) Author: ganesh Date: 2019-07-07.11:37:04
(Sorry for the delay in following up on this)

On 15/06/2019 15:01, Ben Franksen wrote:
> 
> Ben Franksen <ben.franksen@online.de> added the comment:
> 
>>> Again, certainly possible. But remember that the PrimPatchId is
>>> part of the on-disk representation. Using the full meta-data
>>> would make patches quite a bit larger, slowing down read and
>>> write of patches and also equality tests. And since we read 
>>> PrimPatchIds from disk, we cannot (at least not automatically) 
>>> share equal PrimPatchIds in memory, so memory requirements
>>> would also go up significantly.
>>
>> Yeah. I was thinking that the on-disk representation could elide the 
>> PrimPatchId when it's the same as that of the containing patch, so
>> it'd only be needed for conflicts.
>>
>> When reading patches, this would also implicitly mean we get back 
>> most of the sharing.
>>
>> A pointer is smaller than a SHA1, so this would be a win for the
>> non-conflicting case.
> 
> Okay, this would solve the space issue (on disk and in memory), assuming
> the repo doesn't contain too many conflicted patches.
> 
>> We could also choose to use hash-consing to get back sharing for
>> the conflicting case, at the cost of some speed when reading, and
>> extra code complexity in reading patches.
> 
> Let me flesh that out a bit and please correct me if I misunderstood:
> 
> So the idea here is to use a global (per repo) lookup table from SHA1
> hashes to meta-data. That would allow us to use the optimized on-disk
> representation for Conflictors, too, i.e. the current one base on
> PrimPatchId, but use the real meta-data in a shared way in memory.
> Access to this table is of course impure, we would need to hide the
> necessary unsafePerformIO stuff behind some hopefully semantically pure
> abstraction.

Yes. It doesn't necessarily need to be a global with unsafePerformIO, we
could just thread it around when parsing a repo if we don't want to have
perfect sharing all the time.

> I still cannot see how this would give us back cheap equality tests for
> prim patches in the implementation of RepoPatchV3.

You're right, it's not as simple as I had thought.

To get cheap equality we'd also need to use 'reallyUnsafePtrEquality'.
Clearly the name should give us some pause :-) It sounds like it can't
give false positives (false reports that


> BTW, one optimization of the on-disk representation that would be
> minimally invasive and even serve as a possible first step toward the
> more far-reaching changes discussed above, is to eliminate the
> PrimPatchIds for unconflicted patches and just re-calculate them when
> reading the patch.

Yep. I think that's also what I meant by eliding it when it's the same
as the containing patch.

One option for now would be to keep the full patch names in the on-disk
format, but keep the hashes in the in-memory representation for now.
That would be easy on reading (just hash), but harder when writing as
we'd need a table of hash -> patch name somewhere. But then the hashing
would just be an implementation detail we could change at any time.
msg20889 (view) Author: bf Date: 2019-07-07.15:45:35
So it looks like getting all the benefits from hash-consing would
require us to swallow a heavy dose of unsafeWhatever. And the added code
complexity is considerable. I won't say that it is unreasonable to
consider such an optimization in the long run, but I'd say for now it is
out of scope.

>> BTW, one optimization of the on-disk representation that would be
>> minimally invasive and even serve as a possible first step toward the
>> more far-reaching changes discussed above, is to eliminate the
>> PrimPatchIds for unconflicted patches and just re-calculate them when
>> reading the patch.
> 
> Yep. I think that's also what I meant by eliding it when it's the same
> as the containing patch.
> 
> One option for now would be to keep the full patch names in the on-disk
> format, but keep the hashes in the in-memory representation for now.
> That would be easy on reading (just hash), but harder when writing as
> we'd need a table of hash -> patch name somewhere. But then the hashing
> would just be an implementation detail we could change at any time.

I am happy to consider optimizations of all sorts, but I really think
that the on-disk patch format should not contain the full meta data more
than once, ever. This is just wrong.

I am also, albeit less strongly, opposed to penalizing conflictors even
if that improves things in repos that contain few conflicts. One of the
reasons I have invested so much work on the v3 format is that I want
darcs to finally get rid of the "avoid conflicts at all costs" stigma.

Eliding hashes when they are the same as for the containing patch is a
possibility, but not quite as uninvasive as I thought. The problem is
that when we read a Named patch we'd have to pass down a hash and a
serial number to the parsing function for the ingredients. There is
currently no API to do that, as up to now all patch types have a fully
self-contained on-disk format. I have been through such an internal API
change with the FromPrim class (there was no way to avoid that one), so
believe me when i say this is doable but nothing to enjoy.

Can we postpone these considerations for the moment and move on? There
is no pressing need to nail down the on-disk format for v3 at this time.
The current implementation is simple and yet reasonably efficient. More
importantly, it required no deep thinking to get right, which can't be
said for the commute and merge code... ;-) Let us optimize things when
we are certain we have got everything correct. I urge you to open a
ticket gathering the possibilities we discussed so we don't forget.
msg20895 (view) Author: bf Date: 2019-07-08.16:17:48
Following up on patch review.

2 patches for repository http://darcs.net/screened:

patch f5d79f24bab01bad038670f17c2e56c2d7a45b49
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Jun 15 11:20:33 CEST 2019
  * document and add property for PrimPatchId invariant

patch 51096b5cd49a10d71aec2b7242caaf344d72e53d
Author: Ben Franksen <ben.franksen@online.de>
Date:   Mon Jul  8 18:10:10 CEST 2019
  * make PrimPatchId opaque and check invariant when constructing it
Attachments
msg20904 (view) Author: ganesh Date: 2019-07-11.10:07:21
> Can we postpone these considerations for the moment and move on? 
> There is no pressing need to nail down the on-disk format for v3 
> at this time. The current implementation is simple and yet 
> reasonably efficient. More importantly, it required no deep 
> thinking to get right, which can't be said for the commute and 
> merge code... ;-) Let us optimize things when we are certain we 
> have got everything correct. I urge you to open a ticket gathering 
> the possibilities we discussed so we don't forget.

Agreed. Raised issue2628.
msg20905 (view) Author: ganesh Date: 2019-07-11.12:36:29
In src/Darcs/Patch/V3/Contexted.hs:

> +-- Comparing the contexts is inefficient and unnecessary
> +-- if the patches have identities, see 'prop_CtxEq'.

It took me a little while to understand that this is true because
the context is minimal. It is implied from the existing description
and prop_ctxInvariants makes it unambiguous, but it still took me a 
little while to work it out.

I'll send a patch to emphasise this in the haddocks for Contexted.
msg20906 (view) Author: ganesh Date: 2019-07-11.12:38:05
Forget my previous comment, I must have completely missed the
final line of the haddocks that says this: :-)

"The definition of context above is chosen so that this sequence is 
minimal."
msg20907 (view) Author: ganesh Date: 2019-07-11.13:08:21
> patch 51096b5cd49a10d71aec2b7242caaf344d72e53d
> Author: Ben Franksen <ben.franksen@online.de>
> Date:   Mon Jul  8 18:10:10 CEST 2019
>   * make PrimPatchId opaque and check invariant when constructing 
>     it

I'd also like the "inverses are negation" aspect to be opaque.
I think this probably means that the API should only expose
positive numbers and have separate methods for checking for an 
inverse and inverting.

I'm happy to make that change myself if you prefer.

I'm happy with the rest of this bundle, though I haven't read the
actual conflict handling algorithm in detail or compared it to the
camp paper, nor have I looked closely at the exact set of
properties. If you think it's important to have a second set
of eyes on the algorithm, I can try to do it after I've caught up
with the rest of my reviewing backlog.
msg20911 (view) Author: bf Date: 2019-07-12.13:44:40
>>   * make PrimPatchId opaque and check invariant when constructing 
>>     it
> 
> I'd also like the "inverses are negation" aspect to be opaque.
> I think this probably means that the API should only expose
> positive numbers and have separate methods for checking for an 
> inverse and inverting.
> 
> I'm happy to make that change myself if you prefer.

I would appreciate that. (And there is no better way to deeply
understand things than to make changes to it...)

> I'm happy with the rest of this bundle, though I haven't read the
> actual conflict handling algorithm in detail or compared it to the
> camp paper, nor have I looked closely at the exact set of
> properties. If you think it's important to have a second set
> of eyes on the algorithm, I can try to do it after I've caught up
> with the rest of my reviewing backlog.

Sounds like a good plan. When you get to it: a special thing to look out
for would be cases where I didn't clearly distinguish between
side-conditions (which may rightfully fail) and invariants which must
not. Also there are some details where my implementation differs from
the paper on purpose; in some cases this is because there was a bug in
the algorithm as described in the paper. It may be useful to comment
these differences (if I did not already do that).
msg20938 (view) Author: ganesh Date: 2019-07-14.18:36:14
Accepting this now we have a viable candidate for improving the
abstraction of the sequence numbers.
History
Date User Action Args
2019-06-14 19:05:28bfcreate
2019-06-14 23:09:47ganeshsetstatus: needs-screening -> review-in-progress
assignedto: ganesh
messages: + msg20812
nosy: + ganesh
2019-06-15 08:08:03bfsetmessages: + msg20816
2019-06-15 10:55:50ganeshsetmessages: + msg20831
2019-06-15 11:07:39ganeshsetmessages: + msg20833
2019-06-15 14:01:03bfsetmessages: + msg20834
2019-07-07 11:37:04ganeshsetmessages: + msg20886
2019-07-07 15:45:36bfsetmessages: + msg20889
2019-07-08 16:17:48bfsetfiles: + patch-preview.txt, document-and-add-property-for-primpatchid-invariant.dpatch, unnamed
messages: + msg20895
2019-07-11 10:07:21ganeshsetmessages: + msg20904
2019-07-11 12:36:29ganeshsetmessages: + msg20905
2019-07-11 12:38:06ganeshsetmessages: + msg20906
2019-07-11 13:08:21ganeshsetmessages: + msg20907
2019-07-12 13:44:41bfsetmessages: + msg20911
2019-07-14 18:36:14ganeshsetstatus: review-in-progress -> accepted-pending-tests
messages: + msg20938