Issue 2578: less weak and more useful hash for repo state

Title:       less weak and more useful hash for repo state
Priority:    feature
Status:      unknown
Milestone:
Resolved in:
Superseder:
Nosy List:   bf
Assigned To:

Created on 2018-03-20.23:06:28 by bf, last changed 2018-03-21.10:19:51 by bf.

msg19993 (view) Author: bf Date: 2018-03-20.23:06:26
A recent discussion with Steve <Stephen J. Turnbull> on @darcs-user gave
me the following idea:

We could make a more secure and more useful identifier for repository
states than our current 'weak hash'.

The idea is to take the inventory, strip off the meta data to arrive at
a simple list of patch (representation) hashes, plus the hash of the
parent inventory, if one exists. Then we compress the list, hash it
again, and store it in some subdirectory under _darcs using the hash as
the file name as usual.

This hash can then not only be used to identify a repo state but also to
restore the state.

As an experiment I hacked this into two shell commands and timed it for
the current darcs repo, with a bit less than 500 patches in the head
inventory:

> grep -Eo '[0-9]{10}-[a-f0-9]*' _darcs/hashed_inventory > hi_hashes
> cat hi_hashes | gzip > $(sha256sum hi_hashes | cut -d' ' -f1)

This takes about 30 ms to complete. The size of the gzipped file is
about 20kB. So this seems manageable, resource-wise.
msg19994 (view) Author: gh Date: 2018-03-21.00:37:46
Yes, this is a long-standing issue and discussion in Darcs, see:

* http://bugs.darcs.net/issue992
* http://darcs.net/Ideas/ShortSecureId

What you implemented seems like the "naive inventory hash" from the wiki
page. The challenge of creating such an identifier is to be able to use
it as a reliable way to get some set of patches from someone else's
repository.

In your implementation, I am not sure what the compression step is
useful for, if you end up hashing anyway afterwards?
msg19995 (view) Author: bf Date: 2018-03-21.03:16:19
Thanks Guillaume, the links are helpful and interesting reading. Yes,
what I proposed is basically the naive inventory version. And
indeed, it occurred to me independently that it is not
"commutation-friendly" as the wiki page formulates it. But perhaps this
is not a big problem in practice. It limits the number of repos from
which Bob can get a positive reply, though.

Re compression: I made a mistake in the text (but not in the script). We
cannot first gzip and then take the hash; this does not give
reproducible hashes. Anyway, compression doesn't buy us much.

My idea was that we store the state every time we mutate the repo, so we
have a handle for every past state. If we want to be able to restore a
state given its ID, we need to store the list of hashes in the inventory
(in the same order).

As David points out in the discussion of the ticket, this requires
O(N^2) space in the "length" (number of patches) of the repo, but this
is a bit too pessimistic. More precisely, for an average length N
(number of patches) of the head inventory and M repo-mutating operations
we need N*M*hashsize bytes to store the states, where hashsize=75. (This
assumes we store only the hashes, not the meta data.)

For the current screened repo we have 467 patches since the last tag.
For a public repo we can approximate M with the number of patches in the
repo (no amends, no reordering, only patch addition), which is currently
~12000 for us. If we had stored every state, assuming an average head
inventory length of 500, we'd need about 450MB of disk space (500 *
12000 * 75 bytes). That's a lot.
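
As a quick sanity check on these estimates (all numbers are the ones
assumed in the text):

```python
# Space estimates for storing every past repo state.
hashsize = 75             # bytes per stored hash (size prefix + sha256)
avg_inventory_len = 500   # assumed average length N of the head inventory
mutations = 12_000        # M, approximated by the number of patches

# Naive scheme: store the full list of N hashes for each of M mutations.
naive_bytes = avg_inventory_len * mutations * hashsize   # 450 MB

# With maximal sharing: roughly two hashes per mutation
# (one for the patch, one for its parent version).
shared_bytes = 2 * mutations * hashsize                  # 1.8 MB
```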

To save space, compression is one tool, but it gives us only a constant
factor. A more effective idea is to store the states so that we get
maximal sharing. I think this would reduce the space requirements to
2*M*hashsize for a linear repo (one hash for each patch and one hash for
its parent). We could also limit the length of the head inventory to
some constant, i.e. store a new inventory not only when tagging but more
often. This would require that we mark clean tags specially.

The "minimal context hash" sounds interesting, but I am missing an
explanation of how exactly it would work. AFAIU, contexts are always
understood to be relative to a specific change, and minimising one means
commuting as many patches as possible past this specific patch (but
only up to the next clean tag).
msg19996 (view) Author: bf Date: 2018-03-21.10:19:49
Let me introduce the term "version" to denote a repo state that we can
store efficiently. Each version contains the hash of its latest patch
(representation) and the hash of its parent version. This is like lists
in Haskell with shared tails. As an optimization, we can (and should)
add an optional inventory hash, see below.

Here is an example scenario. Suppose we commute the last two patches of
a repo. So the inventory has some patches (a, b) at its end, and we have
the current version vb=(b,va), va=(a,vx). We drop the last two patches
from the inventory, simultaneously going back to the parent of the
parent of the current version vb which is vx. The patches are commuted
and we get new representations (b', a') which we append, one by one, to
the inventory. For the first new patch b' we make a new version
vb'=(b',vx) and for a' we make va'=(a',vb'). Then va' is our new current
version. Suppose we now re-order (b',a') back to (a,b). We drop (b',a')
as before and start with vx. Now, when we append a, we make va''=(a,vx),
but this has the same hash as va, so we share the version file; and
similarly for b.

This suggests that the storage requirements are roughly two additional
hashes per patch representation. This is quite reasonable IMO.
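
The commute/re-order scenario above can be modelled with a toy
content-addressed version store. This is a sketch of the idea, not the
proposed on-disk format; the hash layout and names are assumptions:

```python
import hashlib

def vhash(patch_hash, parent_hash):
    """Content-address a version by its latest patch hash plus its
    parent version hash."""
    return hashlib.sha256(f"{patch_hash}\n{parent_hash}".encode()).hexdigest()

store = {}  # version hash -> (patch hash, parent version hash)

def add_version(patch_hash, parent_hash):
    h = vhash(patch_hash, parent_hash)
    store[h] = (patch_hash, parent_hash)  # identical states share one entry
    return h

# Append patches a then b on top of some existing version vx.
vx = "vx"
va = add_version("a", vx)
vb = add_version("b", va)

# Commute the last two patches: drop back to vx, append b' then a'.
vb_ = add_version("b'", vx)
va_ = add_version("a'", vb_)

# Re-order back to (a, b): appending a on top of vx reproduces va
# exactly, so the version file is shared rather than duplicated.
va2 = add_version("a", vx)
assert va2 == va
```

Because the version hash is a pure function of (patch hash, parent
hash), returning to a previously seen state costs no extra storage.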

Reconstructing a repo state from a version hash requires reading as many
version files as there are patches in the head inventory. The version
for the first patch of the head inventory contains the optional
reference to the parent inventory. When we hit such an inventory hash,
we can proceed by restoring the repo from the chain of inventories as we
do now.
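
A sketch of that reconstruction walk, with a hand-built toy store
(the version and patch names here are purely illustrative):

```python
# A toy version store: each version records its latest patch hash and
# either a parent *version* hash or a parent *inventory* hash (the
# optimization described above).
versions = {
    "v3": ("patch-c", ("version", "v2")),
    "v2": ("patch-b", ("version", "v1")),
    "v1": ("patch-a", ("inventory", "inv0")),
}

def restore(version_hash):
    """Walk parent links, collecting patch hashes newest-first, until an
    inventory hash is reached; from there the existing chain of
    inventories takes over."""
    patches = []
    cur = version_hash
    while True:
        patch, (kind, parent) = versions[cur]
        patches.append(patch)
        if kind == "inventory":
            # Oldest-first order, plus the inventory to continue from.
            return list(reversed(patches)), parent
        cur = parent

head, parent_inventory = restore("v3")
# head is the head inventory's patch list in order; parent_inventory
# is where the normal inventory-chain restore picks up.
```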

(BTW, an independent optimization we can and should add to Darcs is to
remember the pristine hash in /each/ inventory instead of only the head
inventory. This has the potential to speed up many operations that work
on the distant past of a repo, like cloning with --tag. I am not ready
to suggest we add the pristine hash to each version, but /if/ we did
that then we would arrive at something very similar to git. I have no
idea how git manages to optimize away the enormous storage requirements
that a naive implementation of its underlying model has. I think I will
ask Steve how they do that.)

Since we do not have branches (multiple head inventories under different
names) yet, the only reasonably safe way to restore a version is to make
a clone. We could consider adding a dangerous command to restore a
version in-place, with appropriately dire warnings attached.
History:
Date                 User  Action  Args
2018-03-20 23:06:28  bf    create
2018-03-21 00:36:28  gh    link    issue992 superseder
2018-03-21 00:37:48  gh    set     messages: + msg19994
2018-03-21 03:16:21  bf    set     messages: + msg19995
2018-03-21 10:19:51  bf    set     messages: + msg19996