darcs

Issue 992 short, secure, fast version identifiers

Title short, secure, fast version identifiers
Priority feature Status deferred
Milestone Resolved in
Superseder context files with file hashes, task: safety refactor to ensure that hashed_inventory is only read once
View: 1505, 1556
Nosy List darcs-devel, dmitry.kurochkin, kowey, nwf, thorkilnaur, zooko
Assigned To
Topics Git

Created on 2008-08-12.21:01:25 by zooko, last changed 2011-04-20.12:28:22 by kowey.

Files
File name Uploaded Type Edit Remove
ssvc-take-1 nwf, 2008-08-15.16:16:58 application/octet-stream
ssvc-take-2 nwf, 2008-08-16.07:06:46 application/octet-stream
Messages
msg5439 (view) Author: zooko Date: 2008-08-12.21:01:22
My Primary Canary is Brian Warner.  He is the most important person for me to
work with, and if he is not happy with darcs, then I'm not happy.

Last November he wrote this letter:

http://allmydata.org/pipermail/tahoe-dev/2007-November/000228.html

The parts about performance are still an issue (he recently said, when I asked
him, "I would like it to be faster."), but this ticket is about the other part:
short, secure identifiers for versions.  I recently asked him to clarify exactly
what he wanted:

<zooko> By the way, I am very interested in satisfying your desires for darcs 
        functionality, and I think many of the other darcs developers would 
        be, too. 
<zooko> But so far I don't understand what you want precisely enough to write 
        it down in a darcs issue ticket.
<zooko> I certainly understand parts of it, but not all of it. 
<warner> 'darcs get -r SHORTSTRING http://repo new-tree' 
<warner> darcs status -> SHORTSTRING + diffs
<zooko> Thanks.  :-) 
<zooko> I'll write that down now. 
<warner> also, SHORTSTRING should be a cryptographically secure description of 
         a specific source tree 
<zooko> More precisely, there should be at most one source tree matching 
        SHORTSTRING, right?
<warner> yes 
<warner> oh, and computing SHORTSTRING from a given darcs repo should be O(1), 
         and 'get -r SHORTSTRING' should be O(not too bad)
<zooko> Hm.  I don't think that you mean that it has to beaconstant time 
        regardless of repo size and shape
<warner> it is in hg :) 
<zooko> as that would be impossible to conform with the unique repo 
        requirement... 
<zooko> Well, I'm sure it is possible if you amortize away the cost of 
        computing it, and then you take only O(1) time to look up the stored 
        answer.  ;-) 
<warner> I expect SHORTSTRING to be computed during the process of 'darcs get' 
         or 'darcs record' 
<zooko> I see.  ;-) 
<zooko> Ok. 
<zooko> I'll write that down too.
<warner> (basically I'm trying to discourage an implementation that, say, 
         sorts all the patches by name and then hashes the result) 
<zooko> Wait, 
<warner> well, not discourage it, but make it clear that 'what version am I 
         using' is supposed to be a cheap question, so that we could use it in 
         a 'make version' step
<zooko> your requirements are about the functionality that you need, not about 
        the implementation. 
<zooko> "cheap" is definitely a requirement. 
<zooko> O(1) I'm willing to accept, although I really don't think that's a 
        requirement of yours. 
<zooko> "Nos sorting and hashing" is definitely not a requirement you havfe. 
<zooko> I'll write down whatever you say.
<warner> you're right 
<zooko> But I might feel free to append my own opinion that if it is fast 
        enouugh then you won't know the difference.  ;-) 
<warner> darcs has some operations which are O(n!), and 'get version' must not 
         be one of them 
<zooko> The darcs people are all fired up about performance nowadays... 
<zooko> I can definitely write that down as a requirement. 
<warner> and I'd consider it a bug if it were O(Npatches), since that will be 
         usable for small repos but become less-so for large ones
<warner> so "very fast even for large/complex repos" is a better way to 
         express my requirement 
<zooko> Yes, okay. 
<warner> I don't want tahoe to be discouraged from using 'make version' on 
         every build step, for example 
<zooko> That is surely a fair requirement on your part. 
* zooko nods. 
<warner> as we are right now :)
msg5441 (view) Author: zooko Date: 2008-08-12.21:52:12
How short is short enough to satisfy Brian while long enough to be secure?

Well, there is some range.  We can look at what git, hg, and bzr do along these
lines.  Personally (having thought quite a lot about these issues recently, with
Brian's help, with regard to Tahoe, as well as to other security architectures),
I would say that 192 bits of hash output is a good sweet spot to aim for.  Using
base-62 encoding, that would be 33 chars, like this:
0VNS8XxDzaf4kcWth1vrQYORf3D4QnfwZ
msg5545 (view) Author: nwf Date: 2008-08-15.16:16:58
I've hacked up a demo this kind of functionality (following my ideas in
http://lists.osuosl.org/pipermail/darcs-users/2008-August/013053.html) and would
appreciate feedback from concerned parties.

I've attached the full changes for your experimental pleasure, but here's an
example session:

% cd ~/tmp/dtest/a
% ls
_darcs  test  test2
% darcs show version
Version (SSVC): 00000000-c568a81b6aaa78bd4b31
% darcs changes --context > ../ctx1
% diff -u ../ctx1 _darcs/ssvcs/00000000-c568a81b6aaa78bd4b31
% cd ..
% darcs get --ssvc=00000000-c568a81b6aaa78bd4b31 a b
Copying patches, to get lazy repository hit ctrl-C...
Unapplying 0 patches.
Finished getting.
% darcs get --context=ctx1 a c
Copying patches, to get lazy repository hit ctrl-C...
Unapplying 0 patches.
Finished getting.
% diff --recursive b c
diff --recursive b/_darcs/prefs/sources c/_darcs/prefs/sources
2c2
< thisrepo:/home/nwf/tmp/dtest/b
---
> thisrepo:/home/nwf/tmp/dtest/c

Note that "show version" also understands --dry-run (which doesn't write the
file into _darcs/ssvcs/) and --xml-output (in case you want the version
surrounded by <ssvc> tags. ;))

Incidentally, lacking a better name, SSVC stands for "Short, Secure Version of
Context"; perhaps "darcs show ssvc" would be less potentially confusing than
"darcs show version".  IIRC hg uses "identity".
Attachments
msg5547 (view) Author: zooko Date: 2008-08-15.16:30:47
Very cool!
msg5550 (view) Author: nwf Date: 2008-08-16.07:06:46
Just a quick amend-record'd version (take-2) of the previous patch which fixes a
one-line bug in the test case which went undetected.  Sorry 'bout that.
Attachments
msg5776 (view) Author: droundy Date: 2008-08-28.17:40:54
Sorry Nathaniel,

I wouldn't say that this version deserves the name "secure", as (so far as I can
tell) there is nothing done to ensure that security is maintained.  It's also
not really very robust, since you can't generate a ssvc identifier on anything
but a public repository, unless you've got somewhere to upload ssvcs/xxx file. 
And for the simple case of only describing versions in a public repository, one
doesn't really need short version identifiers, since the existing --to-match
'hash = xxx' approach works fine.

In the future, if you want me to take a look at a code contribution, you should
send it with darcs send.

David
msg5797 (view) Author: droundy Date: 2008-08-29.14:00:30
I've realized that there is in fact a relatively simple and totally secure
approach that we could use for this problem, which works whenever hashed
repositories are in use (which is all we care about, moving forward).  In hashed
repositories, we've already got secure identifiers for certain repository
states, stored in _darcs/patches/inventories/.  These currently only exist for
tagged states, but there's no reason we couldn't add more inventories in there,
similar to Nathaniel's idea with the _darcs/ssvn directory.  Then the only
requirement would be to enable accessing those stored states by name, which is a
feature that Zooko has already requested, in the form of an extended URL that
includes the inventory hash.  I think it makes most sense to first enable this
latter feature (which will be immediately useful), and only later to enable a
darcs mode that dumps many (all?) states into _darcs/inventories/.  The latter
needs to be optional, as it uses O(N^2) space in the number of patches between tags.

Anyhow, this should be combined with a safety refactor which would ensure that
the _darcs/hashed_inventory is only read once:  we should store its contents in
the Repository data structure, so we can't accidentally mix two views of a
remote repository during one command.  I don't think we currently make this
mistake, but it's troubling that we could.  Once this refactor is done (which
means that we'd read _darcs/hashed_inventory when first identifying the
Repository), we can easily make darcs read _darcs/inventories/xxx instead, if
the URL has some fancy format that includes a hash value.  Or if a file with
that hash isn't present in _darcs/inventories/ we'd look at
_darcs/hashed_inventory to see if that has the right hash.  This feature will
enable self-authenticating URLs, albeit URLs that only describe a specific version.

Note that this approach isn't much more robust than the approach using hashes of
contexts.  It's a bit more robust, since it can allow access to patches that
have been obliterated (as long as the relevant files haven't been deleted from
the server).  But it is completely secure, even in the presence of malicious
users attempting to corrupt a repository.
msg7924 (view) Author: zooko Date: 2009-06-21.18:09:26
http://allmydata.org/trac/darcsver/ticket/3 (store the hash of "darcs changes
--context" and make it available as "--hashed-version")

This is the ticket showing how I intend to implement a kludgy approximation of
this feature in a Python tool by invoking "darcs changes --context | sha256sum".
msg7927 (view) Author: zooko Date: 2009-06-23.20:04:48
By the way, Brian Warner just mentioned to me that he is experimenting with
converting the Tahoe repo from darcs to git.  I think the end is nearing!  I
consider Tahoe to be the second-biggest user of darcs (after ghc, which is also
on the verge of jumping ship).  If anyone knows of other similarly "large and
active" projects using darcs, I'd be interested to know about it.
msg7928 (view) Author: zooko Date: 2009-06-23.20:52:09
See also new discussion on mailing list:
http://lists.osuosl.org/pipermail/darcs-users/2009-June/020262.html
msg7931 (view) Author: kowey Date: 2009-06-24.07:41:51
On Tue, Jun 23, 2009 at 20:04:52 -0000, Zooko wrote:
> I consider Tahoe to be the second-biggest user of darcs (after ghc,
> which is also on the verge of jumping ship).

GHC have indefinitely postponed their switch to git.  In any case, I
wish the Tahoe hackers well in whatever decision they end up making.
msg8052 (view) Author: kowey Date: 2009-08-09.15:46:49
I'm still trying to understand this issue.  One thing I realised is that I'm
having trouble because I'm simultaneously trying to understand the proposed
solutions AND the requirements.

So baby steps.  Maybe if I try to very slowly step through the requirements, my
head won't hurt so much.

version identifier (or version for short): something string/number such that
given two repos, iff the version matches, we say that the two repos are
equivalent (what I'm going after here is that two repos with the same patches in
different orders have the same version).

short : zooko suggests 192 bits of hash output is good for tradeoff between
short and secure

fast: given a repository, obtaining the version identifier should be absurdly
cheap at all times (Brian Warner wants O(1))

secure: not only should the version identifier tell us if we think the two
repositories are equivalent in theory, they should tell us that the repos are
equivalent in practice.  Whereas a version identifier would uniquely identify
repos to the patch-id level, a *secure* version identifier would identify them
to the source tree level.  (By source tree I'm thinking the pristine cache).

Have I at least gotten that bit right?
msg8053 (view) Author: kowey Date: 2009-08-09.15:58:21
Adding one more requirement.

I think one thing I was confused about was the "fast" bit, because along with
talking about 'short', 'secure', 'fast' and 'version identifier' a lot of this
discussion is also about being able to go from version identifier to repo
(whereas in message8052 I've been exclusively concerned with going from two
repos to their versions and then doing a string equality on their versions).  So
I guess there's this extra property we want, what, "invertible"?

short, secure, fast, invertible version identifiers?
msg8054 (view) Author: zooko Date: 2009-08-09.16:42:29
Eric: good idea to spell out the requirements so that everyone can understand them.

What you wrote is close, but I'm not sure the requirements you stated are
exactly the same as Brian's requirements.  Let's step through a use case.  Alice
has a repository.  She generates an identifier.  It less than 40 characters
long.  She sends it to Bob.  Bob has read-access to a different repo, owned by
Charles.  He then tells darcs "Using Charles's repo, get the version identified
by this identifier.  Darcs then (very very quickly) does one of two things:
tells him it can't because Charles's repo doesn't have all the patches that it
would need, or does so, resulting in a local repo (on Bob's computer) which has
exactly the set of patches in it that Alice intended when she generated the
identifier.

The "secure" part simply means that *if* darcs does the second case (fetches the
patches) instead of the first case (says that it can't), then Bob doesn't have
to consider Charles as part of the equation of what set of patches that he got.
 Bob knows that he got the set of patches Alice intended, regardless of
Charles's intention.  (Charles could of course force darcs into the "I can't do
that" branch, for example by deleting his repository or turning off his network
connection.)

It might help to think that what Brian is asking for is basically "Just give me
what git gives me.".  Brian loves git nowadays and uses it every chance he gets.
 The use case story outlines above is, I think, exactly the standard git use case.

> secure: not only should the version identifier tell us if we think the two
> repositories are equivalent in theory, they should tell us that the repos are
> equivalent in practice.  Whereas a version identifier would uniquely identify
> repos to the patch-id level, a *secure* version identifier would identify them
> to the source tree level.  (By source tree I'm thinking the pristine cache).

Hm...  Oh, I get it, yes the problem is if the version-identifier only gives Bob
assurance that he got a set of patches whose *patch identifiers* are the same as
the *patch identifiers* that Alice meant, then Charles could provide a different
patch with a matching patch identifier, thus giving Bob a repository which was
*not* what Alice meant.
msg8414 (view) Author: kowey Date: 2009-08-23.18:00:37
OK, so I'm still working on wrapping my head around this issue.  

I think Zooko's msg8054 (for which thanks) confirms my attempt in spelling out
the requirements in msg8052.  I'll also note that my extra 'invertible' property
from msg8053 should probably have been 'reversible'.  That is being able to
reproduce a repository from an identifier, not just confirm it matches.

All that said, I think the reason David (msg5576) said Nathaniel's ssvc approach
in msg5545 was insecure was because context files are missing file hashes for
patches which means it's easy to trick darcs by fiddling with the contents of a
patch.  His plan of just using hashed inventories would fix this (as would the
use of hashed context files, see issue1505).

So David's plan in msg5797 would be:
* version identifier? - hash of hashed inventory
* short? - not particularly, eg
0000001130-780dbc51d54cfd4934f16bffcfff38474b5dae0fccc94a4002dae3762df65109
* secure? - yes due to hashing of patch file contents
* fast? - as fast as generating a context file
* reversible? - if we provide a mechanism to store the inventories and look them up

Also, I don't fully understand David's safety refactor but I've opened a new
dependency on it as issue1556.  Unless somebody comes up with a nice plan, I'm
assuming that's how we're to be moving forward on this.

I'm deferring this until at least the safety refactor is done (issue1556) and
perhaps after we have hashed contexts (issue1505) so that we can implement the
read part of this before working on the write part.
msg13952 (view) Author: kowey Date: 2011-04-20.12:28:21
http://wiki.darcs.net/Ideas/ShortSecureId attempts to capture, compare and 
contrast the proposals for short, secure, fast, version identifiers.
History
Date User Action Args
2008-08-12 21:01:25zookocreate
2008-08-12 21:05:11zookosetnosy: beschmi, zooko, dagit
title: short secure version identifiers -> short, secure, fast version identifiers
2008-08-12 21:52:16zookosetstatus: unread -> unknown
nosy: beschmi, zooko, dagit
messages: + msg5441
2008-08-15 16:17:01nwfsetfiles: + ssvc-take-1
nosy: + nwf
messages: + msg5545
2008-08-15 16:30:49zookosetnosy: beschmi, zooko, dagit, nwf
messages: + msg5547
2008-08-16 07:06:47nwfsetfiles: + ssvc-take-2
nosy: beschmi, zooko, dagit, nwf
messages: + msg5550
2008-08-28 17:40:56droundysetnosy: + droundy, kowey
messages: + msg5776
2008-08-28 17:41:08droundysetnosy: - droundy
2008-08-29 14:00:33droundysetnosy: + droundy
messages: + msg5797
2009-06-21 18:09:29zookosetnosy: + dmitry.kurochkin, simon, thorkilnaur
messages: + msg7924
2009-06-22 13:46:56droundysetnosy: - droundy
2009-06-23 20:04:52zookosetnosy: beschmi, kowey, zooko, dagit, simon, thorkilnaur, dmitry.kurochkin, nwf
messages: + msg7927
2009-06-23 20:52:12zookosetnosy: beschmi, kowey, zooko, dagit, simon, thorkilnaur, dmitry.kurochkin, nwf
messages: + msg7928
2009-06-24 07:41:56koweysetnosy: beschmi, kowey, zooko, dagit, simon, thorkilnaur, dmitry.kurochkin, nwf
messages: + msg7931
2009-06-24 07:48:22koweysetpriority: bug -> feature
nosy: beschmi, kowey, zooko, dagit, simon, thorkilnaur, dmitry.kurochkin, nwf
2009-08-06 15:34:50koweylinkissue1484 superseder
2009-08-06 21:10:45adminsetnosy: - beschmi
2009-08-09 15:46:52koweysetnosy: kowey, zooko, dagit, simon, thorkilnaur, dmitry.kurochkin, nwf
messages: + msg8052
2009-08-09 15:58:22koweysetnosy: kowey, zooko, dagit, simon, thorkilnaur, dmitry.kurochkin, nwf
messages: + msg8053
2009-08-09 16:42:32zookosetnosy: kowey, zooko, dagit, simon, thorkilnaur, dmitry.kurochkin, nwf
messages: + msg8054
2009-08-11 00:20:13adminsetnosy: - dagit
2009-08-23 18:00:41koweysetstatus: unknown -> deferred
nosy: kowey, zooko, simon, thorkilnaur, dmitry.kurochkin, nwf
topic: + Git
superseder: + context files with file hashes, task: safety refactor to ensure that hashed_inventory is only read once
messages: + msg8414
2009-08-25 17:45:47adminsetnosy: + darcs-devel, - simon
2009-08-27 14:30:51adminsetnosy: kowey, darcs-devel, zooko, thorkilnaur, dmitry.kurochkin, nwf
2011-04-20 12:28:22koweysetmessages: + msg13952