Issue 2578: less weak and more useful hash for repo state - Darcs bug tracker

Title	less weak and more useful hash for repo state
Priority	feature	Status	unknown
Milestone		Resolved in
Superseder		Nosy List	bfrk
Assigned To		Topics

Created on 2018-03-20.23:06:28 by bfrk, last changed 2018-03-21.10:19:51 by bfrk.

Messages
msg19993 (view)	Author: bfrk	Date: 2018-03-20.23:06:26
A recent discussion with Steve <Stephen J. Turnbull> on @darcs-user gave me the following idea: We could make a more secure and more useful identifier for repository states than our current 'weak hash'. The idea is to take the inventory, strip off the meta data to arrive at a simple list of patch (representation) hashes, plus the hash of the parent inventory, if one exists. Then we compress the list, hash it again, and store it in some subdirectory under _darcs using the hash as the file name as usual. This hash can then not only be used to identify a repo state but also to restore the state. As an experiment I hacked this into two shell commands and timed it for the current darcs repo with a bit less than 500 patches in the head inventory: > grep -Eo '[0-9]{10}-[a-f0-9]*' _darcs/hashed_inventory > hi_hashes > cat hi_hashes\| gzip > $(sha256sum hi_hashes\|cut -d' ' -f1)) This takes about 30 ms to complete. The size of the gzipped file is about 20kB. So this seems manageable, resource-wise.
msg19994 (view)	Author: gh	Date: 2018-03-21.00:37:46
Yes, this is a long-standing issue and discussion in Darcs, see: * http://bugs.darcs.net/issue992 * http://darcs.net/Ideas/ShortSecureId What you implemented seems like the "naive inventory hash" from the wiki page. The challenge of creating such identifier is to be able to use it as a reliable way to get some set of patches from someone else's repository. In your implementation I am not sure what the compression step is useful for if after that you end up hashing anyway?
msg19995 (view)	Author: bfrk	Date: 2018-03-21.03:16:19
Thanks Guillaume, the links are helpful and interesting reading. Yes what I proposed seems to be the naive inventory version, basically. And indeed, it occurred to me independently that it is not "commutation-friendly" as the wiki page formulates it. But perhaps this is not a big problem in practice. It limits the number of repos from which Bob can get a positive reply, though. Re compression: I made a mistake in the text (but not in the script). We cannot first gzip then take the hash, this does not give reproducable hashes. Anyway, compression doesn't buy us much. My idea was that we store the state every time we mutate the repo, so we have a handle for every past state. If we want to be able to restore a state given its ID, we need to store the list of hashes in the inventory (in the same order). As David points out in the discussion of the ticket, this requires O(N^2) space in the "length" (number of patches) of the repo, but this is a bit too pessimistic. More precisely, for an average length N (number of patches) of the head inventory and M repo-mutating operations we need NMhashsize bytes to store the states, where hashsize=75. (This assumes we store only the hashes, not the meta data.) For the current screened we have 467 patches since the last tag. For a public repo we can approximate M with the number of patches in the repo (no amends, no reordering, only patch addition) which is currently ~12000 for us. If we had stored every state and assume an average length of the head inventory of 500 we'd need 500MB disk space. That's a lot. To save space, compression is one tool, but it gives us only a constant factor. A more effective idea is to store the states so that we get maximal sharing. I think this would reduce the space requirements to 2Mhashsize for a linear repo (one hash for each patch and one hash for its parent). We could also limit the length of the head inventory to some constant. So we store an inventory not only when tagging but more often. This would require that we mark clean tags specially. The "minimal context hash" sounds interesting but I am missing an explanation how exactly it would work. AFAIU contexts are always understood to be relative to a specific change and minimising it means to commute as many patches as possible past this specific patch (but only up to the next clean tag).
msg19996 (view)	Author: bfrk	Date: 2018-03-21.10:19:49
Let me introduce the term "version" to denote a repo state that we can store efficiently. Each version contains the hash of its latest patch (representation) and the hash of its parent version. This is like lists in Haskell with shared tails. As an optimization, we can add (and should) an optional inventory hash, see below. Here is an example scenario. Suppose we commute the last two patches of a repo. So the inventory has some patches (a, b) at its end, and we have the current version vb=(b,va), va=(a,vx). We drop the last two patches from the inventory, simultaneously going back to the parent of the parent of the current version vb which is vx. The patches are commuted and we get new representations (b', a') which we append, one by one, to the inventory. For the first new patch b' we make a new version vb'=(b',vx) and for a' we make va'=(a',vb'). Then va' is our new current version. Suppose we now re-order (b',a') back to (a,b). We drop (b',a') as before and start with vx. Now, when we append a, we make va''=(a,vx) but now this has the same hash as that of va, so we share the version file; and similar with b. This suggests that the storage requirements are roughly two additional hashes per patch representation. This is quite reasonable IMO. Reconstructing a repo state from a version hash requires reading as many version files as are patches in the head inventory. The first patch of the head inventory contains the optional reference to the parent inventory. When we hit such an inventory hash, we can proceed by restoring the repo from the chain of inventories as we do now. (BTW, an independent optimization we can and should add to Darcs is to remember the pristine hash in /each/ inventory instead of only the head inventory. This has the potential to speed up many operations that work on the distant past of a repo, like cloning with --tag. I am not ready to suggest we add the pristine hash to each version, but /if/ we did that then we would arrive at something very similar to git. I have no idea how git manages to optimize away the enormous storage requirements that a naive implementation of its underlying model has. I think I will ask Steve how they do that.) Since we do not have branches (multiple head inventories under different names) yet, the only reasonably safe way to restore a version is to make a clone. We could consider adding a dangerous command to restore a version in-place, with appropriately dire warnings attached.

History
Date	User	Action	Args
2018-03-20 23:06:28	bfrk	create
2018-03-21 00:36:28	gh	link	issue992 superseder
2018-03-21 00:37:48	gh	set	messages: + msg19994
2018-03-21 03:16:21	bfrk	set	messages: + msg19995
2018-03-21 10:19:51	bfrk	set	messages: + msg19996