darcs

Issue 27 patch ids are not collision-free

Title patch ids are not collision-free
Priority bug Status resolved
Milestone Resolved in
Superseder Nosy List Serware, darcs-devel, dmitry.kurochkin, gwern, kowey, markstos, nwf, thorkilnaur, tommy, zooko
Assigned To droundy
Topics

Created on 2005-11-29.02:44:14 by zooko, last changed 2009-10-23.23:27:51 by admin.

Messages
msg99 (view) Author: zooko Date: 2005-11-29.02:44:13
I just got bitten by this bug again.  The bug is that darcs relies on
gettimeofday to make patch names collision-free, and this is not always
sufficient.  In particular, when one is playing with automated usage of darcs,
such as tailor, deps.pl, and such like, multiple darcs records might be issued
within the same fraction of a second on the same machine by the same user.

When patches like that get created, then later darcs generates internal errors
like this one when it finds two such patches in the same repo:

"Fail: bug in get_extra commuting patch:"

As has been discussed more than once on the mailing list, adding a sufficient
number of random bytes would be sufficient to make patch ids collision free. 
Every major platform today offers the equivalent of /dev/urandom.
msg100 (view) Author: zooko Date: 2005-11-29.02:48:25
Alternatively, instead of making patch ids be collision free by randomness, we
could hash in the entire body of the patch, so that patch ids would collide if
and only if the author, timestamp, log, and body were identical.  That would be
perfect for tailor, and probably wouldn't cause a problem for anyone else.
msg101 (view) Author: beschmi Date: 2005-11-29.03:39:13
Using the body (patch representation) is problematic, because the same patch
can have different bodies [1]. You could use the initial representation when
the patch is recorded. Then you would have to store it somewhere because you
can't simply recalculate the id without the old body when the body has changed.

The only problem with the random ID i see is when two people want to
record an identical patch (eg converting the same upstream repo with
tailor using the CVS commit date) and get different random bytes. Then
darcs can't recognize that the two patches are intended to be identical
and you get conflicts. But you can always add a flag to darcs that lets
tailor supply a string that is somehow derived from the contents of the
commit instead of a random string in this special case.

[1] http://www.abridgegame.org/pipermail/darcs-users/2005-November/008962.html
msg102 (view) Author: zooko Date: 2005-11-29.11:47:12
We could store just the patch id rather than the original body, as the patch id
doesn't need to be recomputed.  However, I no longer like this approach because
it would allow for collisions among patches that ought to be distinct.

So I urge darcs to take the simpler solution of adding randomness to the patch
metadata.
msg104 (view) Author: droundy Date: 2005-11-29.12:44:04
On Tue, Nov 29, 2005 at 03:39:13AM +0000, Benedikt Schmidt wrote:
> The only problem with the random ID i see is when two people want to
> record an identical patch (eg converting the same upstream repo with
> tailor using the CVS commit date) and get different random bytes. Then
> darcs can't recognize that the two patches are intended to be identical
> and you get conflicts. But you can always add a flag to darcs that lets
> tailor supply a string that is somehow derived from the contents of the
> commit instead of a random string in this special case.

I think that this is enough of a problem that using a hash of content
rather than random data makes a bit more sense.  As Zooko points out, we
can store it the same as we'd store the random data.  We certainly could
have a different option (for people who often record identical patches with
the same patch name in the same second).

The old idea was to stick it in the least significant digits of the date
(and ignore it for date use purposes), so that older darcs would still be
able to handle it.  I still think this is a reasonable approach.

Another approach would be to just check the latest patch to see if it has
the same date and the same ID.  Then keep looking back until you hit one
with a different date.  It's a lot cheaper than reading *all* the patches
to make sure your new patch is unique, and gives a stronger guarantee of
uniqueness (provided the same author isn't simultaneously recording patches
with the same name in another repository--and at that stage I'll assert
that the author is being silly), and doesn't require that we muck with the
patch ID format.
-- 
David Roundy
http://www.darcs.net
msg108 (view) Author: zooko Date: 2005-11-29.14:29:54
> I think that this is enough of a problem that using a hash of content
> rather than random data makes a bit more sense.  As Zooko points out, we
> can store it the same as we'd store the random data.  We certainly could
> have a different option (for people who often record identical patches with
> the same patch name in the same second).

My objection to this is somewhat weak, but there could be two patches with
different identical bodies but different meaning due to different contexts
which would get identical patch ids.

However, I wouldn't object too strongly to the "hash the body" approach,
because it would allow a certain kind of limited accidental convergence,
especially for uses like tailor's.

> The old idea was to stick it in the least significant digits of the date
> (and ignore it for date use purposes), so that older darcs would still be
> able to handle it.  I still think this is a reasonable approach.

Yes!  Backwards compatibility makes this easier.  Let's do it!

> Another approach would be to just check the latest patch to see if it has
> the same date and the same ID.  Then keep looking back until you hit one
> with a different date.  It's a lot cheaper than reading *all* the patches
> to make sure your new patch is unique, and gives a stronger guarantee of
> uniqueness (provided the same author isn't simultaneously recording patches
> with the same name in another repository--and at that stage I'll assert
> that the author is being silly), and doesn't require that we muck with the
> patch ID format.

The three colliding patches that I generated overnight last night were
generated in three separate repositories.

Regards,

Zooko
msg616 (view) Author: zooko Date: 2006-04-11.18:38:41
This bug bit me again today.  I have a script (based on tailor) that
automatically syncs between darcs and svn repositories.  It generated two
identical patchnames within the same second and gave me a corrupted repository
that cannot be repaired with "darcs repair", but will instead require me to do
manual surgery on the repository:

zooko@hanford:/home/darcs/trunk-tailorizer$ time darcs repair -v -v -v
Checking that patch names are unique...
Error! Duplicate patch name:
Mon Apr  3 12:13:50 PDT 2006  zooko
  * [4346]: deleted a random line to test automated svn->darcs sync
msg2573 (view) Author: markstos Date: 2008-01-19.02:43:13
David,

Do the the hashed or darcs-2 repo formats resolve this issue?

  Mark
msg2597 (view) Author: droundy Date: 2008-01-19.13:32:24
No, the new formats don't help this, as darcs still relies on the path IDs being
unique.
msg2833 (view) Author: zooko Date: 2008-01-28.14:33:17
I thought that the darcs-2 format defined the patch id as being the hash of the
patch contents:

http://lists.osuosl.org/pipermail/darcs-devel/2008-January/007021.html

So therefore the ids of different patches are necessarily different.

What am I missing?
msg2835 (view) Author: droundy Date: 2008-01-28.15:17:15
On Mon, Jan 28, 2008 at 02:33:19PM -0000, Zooko wrote:
> I thought that the darcs-2 format defined the patch id as being the hash of the
> patch contents:
> 
> http://lists.osuosl.org/pipermail/darcs-devel/2008-January/007021.html
> 
> So therefore the ids of different patches are necessarily different.
> 
> What am I missing?

No, that is just the filename of the patch.  The patch id is unchanged.
-- 
David Roundy
Department of Physics
Oregon State University
msg2846 (view) Author: zooko Date: 2008-01-28.15:56:07
On Jan 28, 2008, at 8:17 AM, David Roundy wrote:

> On Mon, Jan 28, 2008 at 02:33:19PM -0000, Zooko wrote:
>> I thought that the darcs-2 format defined the patch id as being  
>> the hash of the
>> patch contents:
>>
>> http://lists.osuosl.org/pipermail/darcs-devel/2008-January/ 
>> 007021.html

> No, that is just the filename of the patch.  The patch id is  
> unchanged.

I'm sorry if I'm being dense, but how can this be consistent with  
your earlier statement that "All data in the hashed format is  
hashed." (in the message referenced above), and the consequent idea  
that using a secure hash function could therefore facilitate  
"signing" of repositories?

Is there a document that I can read to educate myself about this design?

Regards,

Zooko
msg3779 (view) Author: zooko Date: 2008-03-06.13:56:01
I just tried to convert the history of (the trunk of) twistedmatrix.com from SVN
into darcs, and this bug -- patch ids are not collision-free -- prevents twisted
from being importable into darcs.

There are two patches in twisted's history with the same name, timestamp (to the
second, at least), and author, but different contents.
msg3789 (view) Author: markstos Date: 2008-03-06.22:45:30
Zooko,

I'm not surprised the issue persists, since it was never fixed. 

It can be worked around with an import tool that notices this situation and
tweaks the time stamp by a second.
msg3881 (view) Author: zooko Date: 2008-03-13.13:28:48
Another person just got on IRC and said he had this problem, converting a
project from CVS.

Personally, I think that this is a bug in darcs, and I'm loathe to build
work-around support for it in other tools.  On the other hand, I think that
David Roundy's current plan to add SHA-256 hashes of patches contents and
context will fix this problem...
msg3899 (view) Author: droundy Date: 2008-03-17.22:05:21
On Thu, Mar 13, 2008 at 9:28 AM, Zooko <bugs@darcs.net> wrote:
>  Another person just got on IRC and said he had this problem, converting a
>  project from CVS.
>
>  Personally, I think that this is a bug in darcs, and I'm loathe to build
>  work-around support for it in other tools.  On the other hand, I think that
>  David Roundy's current plan to add SHA-256 hashes of patches contents and
>  context will fix this problem...

Given that this is a bug which is only triggered by tailor (or perhaps
*almost* only?), I think it's reasonable to put the workaround in
there rather than in darcs.  The potential darcs feature is really
only a workaround for the behavior of tailor, in my opinion.  A
correct fix would be to crash with an error message when you try to
record the second patch with the same date, author and name.

David
msg3901 (view) Author: zooko Date: 2008-03-17.22:21:57
There have been several sightings of this problem, which fall into  
two categories:

1.  Tailor, or another automated tool which creates darcs patches,  
created two with the same date, author, and name.

2.  Some revision control history already exists in a different  
revision control format which has patches like this.  Tailor then  
tries to create the equivalent darcs patches.

These are two different cases, but I believe that both of them are  
valid.  There is no a priori reason why a revision control system can  
require of its users that each "date, author, description" tuple gets  
used only once in history.

Anyway, won't the planned feature of including secure hashes of patch  
contents (plus the planned feature of darcs-2-format converging equal  
patches) fix this?

Regards,

Zooko
msg3904 (view) Author: markstos Date: 2008-03-18.13:27:48
While it would be OK with me if this was addressed with darcs, I agree with
David that it is appropriate that it be addressed in tailor, and similar tools. 

Tailor needs to understand each revision control system to convert between them.
In this case, tailor needs to understand this facet of darcs and behave a little
better.

For the moment I see it as lower priority than the 2.0 release, unless it's
relatively easy *and* involves a repo format change which would better to do now .
msg5749 (view) Author: kowey Date: 2008-08-28.08:37:17
So the issue that Zooko reported was that you could generate two different
patches in a single repository with identical patch ids (which can confuse darcs
terribly).

I think we should consider http://bugs.darcs.net/issue1026 as another
manifestation of the some fundamental problem (patch ids are not collision
free).  In issue1026, you have two unrelated repositories that share a common
comment history (e.g. two different directories in a single CVS repositories). 
If you automatically convert these repositories to darcs using something like
tailor, you can produce two different patches with the same patchinfo.

Summarising  the discussion below, both issues can apparently be resolved by
computing a hash of the patch (and its context?) upon creation and storing it in
the least significant digits of the date field.  (Ian also suggests that within
the same repository we should also count the uses of a patch name, to be extra
safe). 

If I understand correctly, this does not entail a repo format change, and will
decrease the probability of a patch id collision to be small enough that we can
close this ticket.

I'm marking it this as critical for darcs 2.1
msg5755 (view) Author: nwf Date: 2008-08-28.10:18:36
Is it impossible to use the hashed repository format for a patch info hash?  I
realize it couldn't be stored inside the file, but that seems OK, since you can
grab it from the HashedIO layer itself.  For non-hashed repositories, this might
be a performance issue (though for non-hashed repositories, you could presumably
append an optional field to the patch header containing the hash (and strip it
back off on darcs convert) to mitigate this particular problem... or just
convert to hashed).

That is: I don't understand the "hide it in the date field" proposal's appeal,
and the above seems backwards compatible... though I am still not terribly
comfortable with this part of darcs's internals.
msg5756 (view) Author: kowey Date: 2008-08-28.10:37:40
> Is it impossible to use the hashed repository format for a patch info hash?  I
> realize it couldn't be stored inside the file, but that seems OK, since you can
> grab it from the HashedIO layer itself.

As I understand it, this needs to be something that is stored in the
patch file itself and not recomputed on the fly.  If I pull or push a
patch from you, the contents and consequently its file hash may change
through commutation.

Does that make sense?
msg5758 (view) Author: nwf Date: 2008-08-28.11:25:47
On Thu, Aug 28, 2008 at 10:37:43AM -0000, Eric Kow wrote:
> 
> Eric Kow <eric.kow@gmail.com> added the comment:
> 
> > Is it impossible to use the hashed repository format for a patch info hash?  I
> > realize it couldn't be stored inside the file, but that seems OK, since you can
> > grab it from the HashedIO layer itself.
> 
> As I understand it, this needs to be something that is stored in the
> patch file itself and not recomputed on the fly.  If I pull or push a
> patch from you, the contents and consequently its file hash may change
> through commutation.

If it really has to be stored in the file, why can't we just add another
field to the PatchInfo structure at the top of each patch?  Flag the change
in _darcs/format so old versions don't puke unhelpfully... this seems
preferable (though I obviously don't care so much about flag days) to
changing the semantics of the date field.

> Does that make sense?

No; why would a patch object (serialized PatchInfoAnd, essentially) ever
change during communication?  If it could, it would seem that the global
cache would be substantially less useful as several forms of the patch may
all be in cache needlessly...

Oh... I have a sneaking suspicion that the answer is that darcs rewrites named
patches in the presence of conflictors... in this case, though, it would
seem that the rewritten version _should_ have a different unique-ifying hash
than the original?

Thanks for your answers.
--nwf;
msg5759 (view) Author: kowey Date: 2008-08-28.12:14:51
Hi,

2008/8/28 Nathaniel Filardo <bugs@darcs.net>:
>> As I understand it, this needs to be something that is stored in the
>> patch file itself and not recomputed on the fly.  If I pull or push a
>> patch from you, the contents and consequently its file hash may change
>> through commutation.
>
> If it really has to be stored in the file, why can't we just add another
> field to the PatchInfo structure at the top of each patch?  Flag the change
> in _darcs/format so old versions don't puke unhelpfully... this seems
> preferable (though I obviously don't care so much about flag days) to
> changing the semantics of the date field.

Oh we could do this, but the appeal of packing this information into
the date field is that it would be completely backwards compatible --
for purposes of getting the patch date, I think darcs only pays
attention to the first 14 digits, eg 20080828124937.  In fact, we
could certainly change the PatchInfo code itself to add this field,
and only do the actual on-file storage (and parsing) in the date
field.

It's not always possible to retain backwards compatibility, which is
why it's nice to have _darcs/format as a means for older darcs to
detect forward-compatibility issues; but I suppose it would be good if
we can avoid resorting to it.

>> Does that make sense?
>
> No; why would a patch object (serialized PatchInfoAnd, essentially) ever
> change during communication?

So here I'm talking about the patch itself and not the patchinfo.
Well, it's really more basic than what you are suspecting.

Say Arjan and Ganesh are starting from the same repository and editing
a file "old".  They make simultaneous changes which must be merged.
Arjan darcs mv's "old" to "new" and Ganesh adds some hunks to "old".
We'll call Ganesh's patch G.  When Arjan pulls G so that it applies on
top of his mv patch, G must be adjusted, i.e. replacing instances of
"old" with "new".  Do you agree with these core bits of patch theory?

In practise, darcs stores Arjan's version of G (Ga) on disk.  In
principle it could have stored a canonical version of G instead, based
on some minimal context, but that would mean that it would have to do
extra work each time bringing G back up to his current context.
Alternatively, we could store both the canonical version AND Arjan's
current version (and hey, maybe this sort of redundancy would be
useful in other ways).

> If it could, it would seem that the global
> cache would be substantially less useful as several forms of the patch may
> all be in cache needlessly...

Well, in practice you don't actually commute all that many patches, or
the fact that you have commuted them doesn't really matter.

Say we start from the repository with the three patches Prior X Y.  If
for some reason you have to commute the latter patches, Prior Y' X',
yes it is true that you may get extra entries in the cache for Y' and
X'.  The good news is that if we both add the same patch on top of
this (Prior X Y Z and Prior Y' X' Z), the new patch Z has the same
exact representation (commuting X Y to Y' X' has no effect on anything
outside of X and Y) and so it gets usefully cached.

As an example, consider the darcs darcs repository.  We can at least
guarantee that all patches up to the 2.0.2 tag are cached.  By itself
the caching of that stuff (Prior) should be quite useful.  Your
repository may diverge from darcs.net if you made some local patches
and then pulled some of our stuff, so you may lose some caching there.
 Furthermore, if your patches /do/ eventually make it into darcs.net
(our repositories reconverge) any patches on top of the converged
point will be identical and so caching becomes useful again.

Also, in practice, commutation does not necessarily affect the patch
contents.  For example, patches that touch completely unrelated files
commute without either patch having to change representation.

> Oh... I have a sneaking suspicion that the answer is that darcs rewrites named
> patches in the presence of conflictors... in this case, though, it would
> seem that the rewritten version _should_ have a different unique-ifying hash
> than the original?

Well, the answer is that darcs almost /always/ rewrites patches when
it has to commute something :-)

Anyway, there is a chance that I am getting in over my head here, so I
may have to defer to more knowledgeable darcs hackers from now on,
msg5764 (view) Author: droundy Date: 2008-08-28.13:34:26
On Thu, Aug 28, 2008 at 12:14:54PM -0000, Eric Kow wrote:
> 2008/8/28 Nathaniel Filardo <bugs@darcs.net>:
> >> As I understand it, this needs to be something that is stored in the
> >> patch file itself and not recomputed on the fly.  If I pull or push a
> >> patch from you, the contents and consequently its file hash may change
> >> through commutation.
> >
> > If it really has to be stored in the file, why can't we just add another
> > field to the PatchInfo structure at the top of each patch?  Flag the change
> > in _darcs/format so old versions don't puke unhelpfully... this seems
> > preferable (though I obviously don't care so much about flag days) to
> > changing the semantics of the date field.
>
> Oh we could do this, but the appeal of packing this information into
> the date field is that it would be completely backwards compatible --
> for purposes of getting the patch date, I think darcs only pays
> attention to the first 14 digits, eg 20080828124937.  In fact, we
> could certainly change the PatchInfo code itself to add this field,
> and only do the actual on-file storage (and parsing) in the date
> field.
>
> It's not always possible to retain backwards compatibility, which is
> why it's nice to have _darcs/format as a means for older darcs to
> detect forward-compatibility issues; but I suppose it would be good if
> we can avoid resorting to it.

For what it's worth, I wouldn't choose to pack it into the date, I'd
just stick it into the long comment (which would retain backwards
compatibility), and then we could make newer darcs hide it.  It
doesn't matter what junk we stick in there, it could just as well be a
large random number.  But I don't really like the idea, as I think I
may have mentioned--although perhaps that was on a duplicate of this
bug report?

The thing is that tailor can do this itself, and can make the junk
something meaningful, and this issue cannot bite people who are using
darcs in a reasonable non-automated manner, since you'd have to have a
human record two patches within a second with the same patch name.  Or
two people recording two patches at the same time using the same patch
name and pretending to be the same person.

David
msg5767 (view) Author: kowey Date: 2008-08-28.14:42:45
> For what it's worth, I wouldn't choose to pack it into the date, I'd
> just stick it into the long comment (which would retain backwards
> compatibility), and then we could make newer darcs hide it.

That could work too.  I just figured that the date approach would
involve the least hackery.

> It
> doesn't matter what junk we stick in there, it could just as well be a
> large random number.

Yep! I heartily agree.

> But I don't really like the idea, as I think I
> may have mentioned--although perhaps that was on a duplicate of this
> bug report?

Ah yes, you did say this in http://bugs.darcs.net/msg3899

> The thing is that tailor can do this itself, and can make the junk
> something meaningful, and this issue cannot bite people who are using
> darcs in a reasonable non-automated manner, since you'd have to have a
> human record two patches within a second with the same patch name.  Or
> two people recording two patches at the same time using the same patch
> name and pretending to be the same person.

I think it's useful to practise defensive programming here.  Who knows
what other tools people might try to write (for example, Tortoise
darcs)?

Having the same patch name is actually very probable in some uses of
darcs.  Some people are fond of just calling their minor patches
'Wibble' -- hmm maybe we can call this the "Wibble" problem -- so you
could imagine somebody doing something like:
 darcs record -am Wibble foo
 darcs record -am Wibble bar

Well, that seems rather silly, but it isn't entirely far-fetched.
msg5768 (view) Author: kowey Date: 2008-08-28.14:53:30
>> It
>> doesn't matter what junk we stick in there, it could just as well be a
>> large random number.
>
> Yep! I heartily agree.

Oh wait, I don't agree quite so heartily.  I guess we'd still want
something reasonably deterministic, but otherwise, it doesn't really
matter.
msg5769 (view) Author: droundy Date: 2008-08-28.15:08:47
On Thu, Aug 28, 2008 at 02:42:48PM -0000, Eric Kow wrote:
> > For what it's worth, I wouldn't choose to pack it into the date, I'd
> > just stick it into the long comment (which would retain backwards
> > compatibility), and then we could make newer darcs hide it.
> 
> That could work too.  I just figured that the date approach would
> involve the least hackery.

I think that the opposite is true.  It's much nicer to put the data in
the current free-form section of the patch info, rather than in a
section that older darcs tries to parse.

> > But I don't really like the idea, as I think I may have
> > mentioned--although perhaps that was on a duplicate of this bug
> > report?
> 
> Ah yes, you did say this in http://bugs.darcs.net/msg3899
> 
> > The thing is that tailor can do this itself, and can make the junk
> > something meaningful, and this issue cannot bite people who are
> > using darcs in a reasonable non-automated manner, since you'd have
> > to have a human record two patches within a second with the same
> > patch name.  Or two people recording two patches at the same time
> > using the same patch name and pretending to be the same person.
> 
> I think it's useful to practise defensive programming here.  Who knows
> what other tools people might try to write (for example, Tortoise
> darcs)?
>
> Having the same patch name is actually very probable in some uses of
> darcs.  Some people are fond of just calling their minor patches
> 'Wibble' -- hmm maybe we can call this the "Wibble" problem -- so you
> could imagine somebody doing something like:
>  darcs record -am Wibble foo
>  darcs record -am Wibble bar
> 
> Well, that seems rather silly, but it isn't entirely far-fetched.

But darcs record has a one-second delay built into it (to keep the
timestamps of the working directory from matching those in the
pristine cache), so unless they're doing something very odd (or their
computer's clock is broken somehow?), this won't cause any trouble.
The problem really is (in practice) only with programs that use the
--pipe interface to manually set the date.

David
msg5770 (view) Author: droundy Date: 2008-08-28.15:09:45
Setting this as not ReleaseCritical, since I can't see how something can both be
release-critical and almost wont-fix.
msg5771 (view) Author: zooko Date: 2008-08-28.15:31:57
> In practise, darcs stores Arjan's version of G (Ga) on disk.  In
> principle it could have stored a canonical version of G instead, based
> on some minimal context, but that would mean that it would have to do
> extra work each time bringing G back up to his current context.
> Alternatively, we could store both the canonical version AND Arjan's
> current version (and hey, maybe this sort of redundancy would be
> useful in other ways).

There are three different properties that patch ids might have.   
Let's examine each in turn.

property 1: collision-resistance

It is necessary for correctness (in my view) that the mapping from  
patch to patchid is injective -- that there is at most one patch  
which can produce a given patchid.  A.k.a. patch ids are "collision- 
free" or "collision-resistant".

property 2: convergence

It is not necessary, but it might turn out to be useful, if the  
mapping is "convergent" in the sense that separate computations which  
start with the same patch result in the same patchid.  This might  
turn out to be useful if we wanted convergence as a feature -- two  
people doing "darcs record" on the same textual changes both get the  
"same" darcs patch.  It might also be useful in the future for  
efficiency, as Nathaniel Filardo observed, by reducing the number of  
patch ids that darcs has to keep track of.

property 3: self-authenticating

In the future I want, given a patchid that I got from a trusted  
source and a patch that I got from an untrusted source, to be able to  
verify that the patch matches the patch id, so that I know that the  
trusted source who gave me the patchid would accept this patch as an  
acceptable patch for *that* patchid.

This 3rd property is potentially useful in issue992 -- short, secure,  
fast version identifiers.

http://bugs.darcs.net/issue992

Here are a few ways to do it which have different combinations of  
these properties.  Please see the end of this note, which argues that  
throwing in timestamps, "count of patches by this name" and other  
stuff helps with none of these properties.

option 1: patchids are randomly generated 192-bit numbers

This is definitely collision-resistant (up to the negligible chance  
of an accidental collision), and it has no convergence -- any time  
anyone computes a patchid from a patch, they get a new patchid.   
Observe that this is not a problem in current darcs because the  
patchid is included with the patch whenever the patch is  
transmitted.  It has no self-authentication.

option 2: patchids are collision-resistant hashes of the patch and  
the entire current context

This is still collision-resistant (at least as much as the hash  
function we use), and it has convergence for the case that two people  
are computing patchids of the same patch in the same context.  It is  
self-authenticating, but only if the person doing the checking has  
access to the entire context, and can figure out exactly *which*  
context was used.  (See Nathaniel Filardo's patch to implement short,  
secure, fast version identifiers.)

option 3: patchids are collision-resistant hashes of the patch and  
the minimal context

This is still collision-resistant, and it has more convergence: now  
two people who are computing a patch id for the same patch, even in  
the presence of extraneous other patches which are not part of the  
minimal context.  It is self-authenticating, and it might turn out to  
be easier to actually check in practice, since the checker knows how  
to find the minimal context of the patch in order to use that minimal  
context in the check.

option 4: patchids are collision-resistant hashes of just the text of  
the patch, devoid of context

This seems problematic to me, because then darcs might get confused  
about two textually identical patches which were created in disjoint  
contexts.  Let us speak no more of it in this note.

Now, observe that once you've chosen one of these three basic  
strategies, then throwing in the timestamp, the count of the patches  
by this name, and so forth can't help and can hurt:  it does not  
increase collision-resistance (since all of these proposals except  
the weird 4th one already have collision-resistance), and it does not  
help with self-authentication, but it might hurt with self- 
authentication (if the added data that gets thrown in is not  
available at check-time or if it isn't clear which data elements need  
to be used in the check).  Finally, it can't help and can hurt with  
convergence, since it is never the case that we want two patches  
which are otherwise identical to be treated as distinct patches due  
only to their timestamp or count.

The same argument can be made about author name and patch  
description.  Perhaps it is the case that we want two patches which  
are identical in all ways except for their patch comment to be  
treated as identical patches.  Or perhaps we want them to be treated  
as different.  I'm not sure.  But I am sure that we should not throw  
data into the secure hash function *unless* we are sure that we want  
to cause divergence based on that data.

By the way, I'm currently in favor of strategy #2, because #3  
requires more computational work when recording a patch, and because  
strategy #1 doesn't seem to facilitate issue992 as well.  However, I  
think any of the three strategies are okay.

Regards,

Zooko
msg5772 (view) Author: kowey Date: 2008-08-28.15:47:18
> But darcs record has a one-second delay built into it (to keep the
> timestamps of the working directory from matching those in the
> pristine cache)

Hmm.  I just did the Wibble test (no automation except for doing two
records at once with a semi-colon) and got a Duplicate patch error.

darcs init
touch a b
darcs add a b
echo a > a
echo b > b
darcs record -am Wibble a; darcs record -am Wibble b
darcs check

Also, I didn't find any mentions of sleep or wait_a_moment in Record
(on the other hand, I do see that Apply, Obliterate, Rollback and
Unrecord use it before applying patches to the working directory).

Is it the case that record should have a one-second delay and doesn't?
msg5773 (view) Author: droundy Date: 2008-08-28.15:55:58
On Thu, Aug 28, 2008 at 11:31 AM, zooko <zooko@zooko.com> wrote:
> There are three different properties that patch ids might have.  Let's
> examine each in turn.
>
> property 1: collision-resistance
>
> It is necessary for correctness (in my view) that the mapping from patch to
> patchid is injective -- that there is at most one patch which can produce a
> given patchid.  A.k.a. patch ids are "collision-free" or
> "collision-resistant".

The question becomes under what circumstances we need to be
collision-resistant.  In the presence of malicious users we cannot be
collision-resistant, unless we also have the self-authenticating
property.  I would call users who intentionally manipulate the date in
such a way as to create two identical patches with the same name
something very close to malicious.

> property 2: convergence
>
> It is not necessary, but it might turn out to be useful, if the mapping is
> "convergent" in the sense that separate computations which start with the
> same patch result in the same patchid.  This might turn out to be useful if
> we wanted convergence as a feature -- two people doing "darcs record" on the
> same textual changes both get the "same" darcs patch.  It might also be
> useful in the future for efficiency, as Nathaniel Filardo observed, by
> reducing the number of patch ids that darcs has to keep track of.

I don't think this is a good property to have.

> property 3: self-authenticating
>
> In the future I want, given a patchid that I got from a trusted source and a
> patch that I got from an untrusted source, to be able to verify that the
> patch matches the patch id, so that I know that the trusted source who gave
> me the patchid would accept this patch as an acceptable patch for *that*
> patchid.

I can see that this could be useful, but do not see a practical path
to achieve this.  I don't want record to be O(N) in the length of the
repository history, let alone O(N^2).

> Here are a few ways to do it which have different combinations of these
> properties.  Please see the end of this note, which argues that throwing in
> timestamps, "count of patches by this name" and other stuff helps with none
> of these properties.
>
>
> option 1: patchids are randomly generated 192-bit numbers
>
> This is definitely collision-resistant (up to the negligible chance of an
> accidental collision), and it has no convergence -- any time anyone computes
> a patchid from a patch, they get a new patchid.  Observe that this is not a
> problem in current darcs because the patchid is included with the patch
> whenever the patch is transmitted.  It has no self-authentication.

Agreed.  It's not a bad approach, if we really don't trust our users
to be reasonable.

> option 2: patchids are collision-resistant hashes of the patch and the
> entire current context
>
> This is still collision-resistant (at least as much as the hash function we
> use), and it has convergence for the case that two people are computing
> patchids of the same patch in the same context.  It is self-authenticating,
> but only if the person doing the checking has access to the entire context,
> and can figure out exactly *which* context was used.  (See Nathaniel
> Filardo's patch to implement short, secure, fast version identifiers.)

Also not a bad approach, if we don't trust our users.  A little more
tailor-friendly.  As you point out, it's not robustly
self-authenicating, so I'd call that a red herring.

> option 3: patchids are collision-resistant hashes of the patch and the
> minimal context
>
> This is still collision-resistant, and it has more convergence: now two
> people who are computing a patch id for the same patch, even in the presence
> of extraneous other patches which are not part of the minimal context.  It
> is self-authenticating, and it might turn out to be easier to actually check
> in practice, since the checker knows how to find the minimal context of the
> patch in order to use that minimal context in the check.

This is only convergent if the two people also have the same name and
give the patch the same name, and also the same date.  In other words,
the convergence benefit here is a red herring.  The
self-authentication properties are real, but the cost is that this is
O(N^2) in the size and duration of the repository.  Note that in
darcs-2 the minimal context is not a set of named patches, but rather
a set of primitive patches.  Also note that we'd have to introduce a
canonical primitive-patch sorting function for this hash to be useful
for self-authentication, which is another unsolved problem.

> option 4: patchids are collision-resistant hashes of just the text of the
> patch, devoid of context
>
> This seems problematic to me, because then darcs might get confused about
> two textually identical patches which were created in disjoint contexts.
>  Let us speak no more of it in this note.

Agreed.  This is has no real benefits.

> Now, observe that once you've chosen one of these three basic strategies,
> then throwing in the timestamp, the count of the patches by this name, and
> so forth can't help and can hurt:  it does not increase collision-resistance
> (since all of these proposals except the weird 4th one already have
> collision-resistance), and it does not help with self-authentication, but it
> might hurt with self-authentication (if the added data that gets thrown in
> is not available at check-time or if it isn't clear which data elements need
> to be used in the check).  Finally, it can't help and can hurt with
> convergence, since it is never the case that we want two patches which are
> otherwise identical to be treated as distinct patches due only to their
> timestamp or count.

The timestamp is necessary to be included (not in a hash, but in the
ID), because our users want to know when changes were made.

> The same argument can be made about author name and patch description.
>  Perhaps it is the case that we want two patches which are identical in all
> ways except for their patch comment to be treated as identical patches.  Or
> perhaps we want them to be treated as different.  I'm not sure.  But I am
> sure that we should not throw data into the secure hash function *unless* we
> are sure that we want to cause divergence based on that data.

Again, these need to be included, because our users want to know who
made the changes.  They don't need to go into any sort of hash, but we
need to track them, and if two people make the same change, we need to
track *that*.  So yes, we do need to leave the name of the patch as
part of its identifier.

> By the way, I'm currently in favor of strategy #2, because #3 requires more
> computational work when recording a patch, and because strategy #1 doesn't
> seem to facilitate issue992 as well.  However, I think any of the three
> strategies are okay.

I just am not convinced that this is a real issue.  As I mention
above, we can't protect against malicious users using anything less
than #3, which really is an O(N^2) solution that also requires the
creation of a new primitive operation on patches (sorting).  It might
not be too hard to figure out the sorting problem, but any bugs or
changes in that function would break repositories.

I suppose there's something to be said for protecting against stupid
users, but I don't at all like the idea of making record O(N) in order
to do so.  In a hashed repository, we #2 can be done in O(1) time,
since we already store most of the hashes in the context, and I
suppose on darcs-1 repositories we could just hash the inventory file.
 If we chose #1, I believe tailor users would be up in arms.

David
msg5774 (view) Author: droundy Date: 2008-08-28.15:59:46
On Thu, Aug 28, 2008 at 11:47 AM, Eric Kow <bugs@darcs.net> wrote:
>> But darcs record has a one-second delay built into it (to keep the
>> timestamps of the working directory from matching those in the
>> pristine cache)
>
> Hmm.  I just did the Wibble test (no automation except for doing two
> records at once with a semi-colon) and got a Duplicate patch error.
>
> darcs init
> touch a b
> darcs add a b
> echo a > a
> echo b > b
> darcs record -am Wibble a; darcs record -am Wibble b
> darcs check
>
> Also, I didn't find any mentions of sleep or wait_a_moment in Record
> (on the other hand, I do see that Apply, Obliterate, Rollback and
> Unrecord use it before applying patches to the working directory).
>
> Is it the case that record should have a one-second delay and doesn't?

You're right, I was being stupid.  Record doesn't need the delay
because it doesn't touch the working directory.  But of course, this
kind of issue can be most readily fixed by implementing a very simple
check as to whether there is a patch with an identical name among the
most recently-recorded patches in the repository.  In fact, this check
ought to be there, because any junk data (i.e. hash or random number)
isn't going to be human-friendly, and it's not nice to our users to
allow them to create two distinct patches with indistinguishable (to
them) names.  At least the date should be different.  Otherwise
something like darcs changes will show two identical patches, which is
downright unfriendly.

David
msg5775 (view) Author: zooko Date: 2008-08-28.16:14:10
On Aug 28, 2008, at 9:56 AM, David Roundy wrote:
>
> The timestamp is necessary to be included (not in a hash, but in the
> ID), because our users want to know when changes were made.

I agree that the timestamp, author, patch description are important  
to have around for human use, and that they don't have to be included  
in the hash or otherwise included in the patch identifier.

I think you and I have differing intuitions about the value of  
facilitating the possible future development of secure identifiers.   
I think that such identifiers, like those currently used in git etc.,  
could be implemented in the future by someone, either in darcs or in  
an extension tool that uses darcs, or in a successor to darcs, and I  
hope that we don't make design decisions which hinder such a  
development.  But maybe it will never happen.  Making predictions is  
hard, especially about the future.  See Nathaniel Filardo's  
contribution to issue992 for an example of a development along those  
lines which is already implemented.

>  If we chose #1, I believe tailor users would be up in arms.

Why is that?  Option 1 is that the identifier of a patch is nothing  
more than a randomly chosen 192-bit number.  I am one of the world's  
heaviest tailor users -- I routinely convert the entire history of  
open source software projects into darcs -- multiple years, tens of  
thousands of patches -- just my own personal use.

Regards,

Zooko
---
http://allmydata.org -- Tahoe, the Least-Authority Filesystem
http://allmydata.com -- back up all your files for $5/month
msg5777 (view) Author: droundy Date: 2008-08-28.17:46:25
On Thu, Aug 28, 2008 at 12:13 PM, zooko <zooko@zooko.com> wrote:
> On Aug 28, 2008, at 9:56 AM, David Roundy wrote:
>> The timestamp is necessary to be included (not in a hash, but in the
>> ID), because our users want to know when changes were made.
>
> I agree that the timestamp, author, patch description are important to have
> around for human use, and that they don't have to be included in the hash or
> otherwise included in the patch identifier.

You have some unusual idea that we're talking about completely
rewriting darcs to introduce some new patch identifier concept.  That
doesn't exist in darcs right now, and I don't expect to add it any
time soon.  What you're describing cannot happen before darcs 3, since
it's way too large a change in darcs' semantics.  The patch identifier
*is* the metadata describing the patch.

> I think you and I have differing intuitions about the value of facilitating
> the possible future development of secure identifiers.  I think that such
> identifiers, like those currently used in git etc., could be implemented in
> the future by someone, either in darcs or in an extension tool that uses
> darcs, or in a successor to darcs, and I hope that we don't make design
> decisions which hinder such a development.  But maybe it will never happen.
>  Making predictions is hard, especially about the future.  See Nathaniel
> Filardo's contribution to issue992 for an example of a development along
> those lines which is already implemented.

I think that adding complexity without adding value in the interest of
possible future development is almost always wrong.  And noone has yet
suggested a plausible mechanism for creating such a scheme.  (And yes,
I did just take a look at issue992, thanks for pointing it out.)

>>  If we chose #1, I believe tailor users would be up in arms.
>
> Why is that?  Option 1 is that the identifier of a patch is nothing more
> than a randomly chosen 192-bit number.  I am one of the world's heaviest
> tailor users -- I routinely convert the entire history of open source
> software projects into darcs -- multiple years, tens of thousands of patches
> -- just my own personal use.

Because I know there are tailor users who apply tailor twice and
expect to get two identical  repositories (apart from added patches).

David
msg5779 (view) Author: nwf Date: 2008-08-28.17:50:38
On Thu, Aug 28, 2008 at 12:14:54PM -0000, Eric Kow wrote:
> Say Arjan and Ganesh are starting from the same repository and editing
> a file "old".  They make simultaneous changes which must be merged.
> Arjan darcs mv's "old" to "new" and Ganesh adds some hunks to "old".
> We'll call Ganesh's patch G.  When Arjan pulls G so that it applies on
> top of his mv patch, G must be adjusted, i.e. replacing instances of
> "old" with "new".  Do you agree with these core bits of patch theory?

I agree that the patch must be adjusted logically, but I disagree that we
should be mangling the patch that came from upstream; see below.

I am not very clear on how this works as patches "complete the loop" ... if
a patch from Ganesh's repo, G, is stored in Arjan's as Ga, when Ganesh next
pulls from Arjan, is G replaced by Ga, are G and Ga both now in Ganesh's
repo, or is just Ga in Ganesh's repo?

> In practise, darcs stores Arjan's version of G (Ga) on disk.  In
> principle it could have stored a canonical version of G instead, based
> on some minimal context, but that would mean that it would have to do
> extra work each time bringing G back up to his current context.
> Alternatively, we could store both the canonical version AND Arjan's
> current version (and hey, maybe this sort of redundancy would be
> useful in other ways).

Right; having both the patch as pulled from the remote repository and a
local superseder seems like it would make life easier when trying to figure
out "what happened" internally (I've had to go trolling _darcs/patches
before to try to figure out what broke), would make comparisons between
repositories easier (if only by hand; perhaps Darcs wouldn't have a need for
such functionality), would allow us to use HashedIO hashes as patch
identifiers across the board, and would seem to be a good start on issue891.

Be well.
--nwf;
msg5782 (view) Author: zooko Date: 2008-08-28.19:00:27
On Aug 28, 2008, at 11:46 AM, David Roundy wrote:
>
>> I agree that the timestamp, author, patch description are  
>> important to have
>> around for human use, and that they don't have to be included in  
>> the hash or
>> otherwise included in the patch identifier.
>
> You have some unusual idea that we're talking about completely
> rewriting darcs to introduce some new patch identifier concept.

I'm sorry if my terminology makes it sound like I have incorrect  
assumptions about darcs development.  What I intended to say was that  
those bits of information don't *have* to be included in the patch  
identifier in order to achieve any of the three properties of patch  
identifiers that we were discussing.  If I understand correctly, this  
is in agreement with what you said in your previous note.

Let me make sure I understand your other points:  you believe that  
the convergence property is necessary, at least for the use case of  
duplicate runs of tailor producing identical repositories, you  
continue to be skeptical that it will ever be possible to implement  
self-authenticating identifiers in a way compatible with patch  
theory, and you think that some of the strategies under discussion  
are unnecessarily complex.  I guess you mean strategy #3 (hash  
including minimal context), which is more complex than either  
strategy #1 (random id) or strategy #2 (hash including current context).

Is that right?

Regards,

Zooko
---
http://allmydata.org -- Tahoe, the Least-Authority Filesystem
http://allmydata.com -- back up all your files for $5/month
msg5784 (view) Author: kowey Date: 2008-08-28.19:38:27
> I am not very clear on how this works as patches "complete the loop" ... if
> a patch from Ganesh's repo, G, is stored in Arjan's as Ga, when Ganesh next
> pulls from Arjan, is G replaced by Ga, are G and Ga both now in Ganesh's
> repo, or is just Ga in Ganesh's repo?

It remains G in Ganesh's repository and Ga in Arjan's one.  Darcs
knows that both patches are the same because they have the same
patchinfo.  This seems weird at first, but it helps to think that all
patches are inherently mangled.  There is nothing about the patch as
it is created upstream that makes it the "correct" representation,
unless you wish to consider the minimal context version.

> Right; having both the patch as pulled from the remote repository and a
> local superseder seems like it would make life easier when trying to figure
> out "what happened" internally (I've had to go trolling _darcs/patches
> before to try to figure out what broke),

I understand that pain.  If you are interested in this, it might be
worthwhile for you to try and implement a minimal_context function.

Even if we don't wind up using it for this issue, it could still be
handy for other purposes, or at the very least, a good way to get a
handle on how darcs works

Thanks! ( especially for braving darcs' guts and having waded in so
far! I'm still safely treading water at the UI level :-D )
msg5789 (view) Author: droundy Date: 2008-08-28.23:13:09
On Thu, Aug 28, 2008 at 2:59 PM, zooko <zooko@zooko.com> wrote:
> On Aug 28, 2008, at 11:46 AM, David Roundy wrote:
>>> I agree that the timestamp, author, patch description are important to
>>> have
>>> around for human use, and that they don't have to be included in the hash
>>> or
>>> otherwise included in the patch identifier.
>>
>> You have some unusual idea that we're talking about completely
>> rewriting darcs to introduce some new patch identifier concept.
>
> I'm sorry if my terminology makes it sound like I have incorrect assumptions
> about darcs development.  What I intended to say was that those bits of
> information don't *have* to be included in the patch identifier in order to
> achieve any of the three properties of patch identifiers that we were
> discussing.  If I understand correctly, this is in agreement with what you
> said in your previous note.

I'm still not sure what you mean by patch identifier.  I think you
mean the extra junk that you'd like to have added to the patch
identifiers? The thing that currently doesn't exist?

> Let me make sure I understand your other points:  you believe that the
> convergence property is necessary, at least for the use case of duplicate
> runs of tailor producing identical repositories, you continue to be
> skeptical that it will ever be possible to implement self-authenticating
> identifiers in a way compatible with patch theory, and you think that some
> of the strategies under discussion are unnecessarily complex.  I guess you
> mean strategy #3 (hash including minimal context), which is more complex
> than either strategy #1 (random id) or strategy #2 (hash including current
> context).
>
> Is that right?

Right.  I don't particularly care about convergence, but I believe
other people do care, and I don't see any reason to break current
functionality in order to fix an loophole that has no real detrimental
effect in practice.

#2 is marginally better, but if it's done in any way that could
provide hope of self-authentication, it would be too slow, and #3 is
guaranteed to be too slow.  So we're left with a #2 that only solves
the problem of collision-resistance.  (which is, of course, the title
of this bug...)  But is that a worthwhile tradeoff, including useless
data in every patch identifier, just so we won't have collisions, when
there is a friendlier way to eliminate the probability of collisions
(i.e. to check for collisions).

David
msg5790 (view) Author: dagit Date: 2008-08-29.00:42:02
After re-reading much of this thread it seems the decision is, "Won't fix
because it's not darcs' problem, it's tailor's problem".

If I understand correctly, the reasoning is darcs can/could make the patchinfo
unique per branch at creation time.  I think, this is to be done by checking if
the name is already used and either complaining or using a new name.  So instead
of making patch ids collision free, we'll ensure that each branch is free of
duplicate patchinfos.

Unfortunately, this doesn't address the other problem that's emerged in this
discussion, which is, what to do about repositories that have identical
patchinfos but different patches?  For example see issue1026.

Because cases like issue1026 can cause a crash in darcs that is semantically
equivalent to a segfault or buffer overflow, we really should fix this!

I've just made a big claim.  Here is how it works:

1) Darcs trusts patchinfos to be consistent across patchsets.

2) When you do a pull, darcs looks at the set of patches in both repositories
and tries to figure out which patches they have in common and which patches
exist only in each repository (get_common_and_uncommon).  There is also a bias
to work with patch contents from local copies of the patches.  This is intended
as an optimization, but it could be meaningful in this context.

3) To do get_common_and_uncommon, darcs grabs the the patchinfos for every patch
in both sequences and takes the intersection and calls these the common patchinfos.

4) Darcs takes the common patchinfos and uses them to decide which patches in
each patchset can be commuted out of the patchset.  Darcs assumes if the patch
is not a common patch but it's sitting between common patches then it was
commuted there previously.  And this assumption is true when there is an
injection from patchinfos to patches (up to transformations generated by commute).

5) In the case of "bug in get_extra" you're getting lucky in a sense.  That's
the case where darcs is doing the equivalent of an unsafe C program trusting
user input and dying with a segfault.  Darcs catches the error and aborts.

6) The unlucky case.  Now imagine if we had two different patches with the same
patchinfos but they just happen to commute with all the right patches so that no
bug in get_extra is triggered.  Darcs has now corrupted your repository and you
can expect it is now in some ill-defined state.  This is akin to a buffer
overflow exploit.

Now, some people might argue that user's shouldn't be "stupid" and create
different patches in different branches that share the same patchinfo.  I think
that's blaming the victim.  As a respectable piece of software Darcs shouldn't
let itself do unsafe things with user input.

I'm not saying we have handle this situation by doing a successful
get_common_and_uncommon inspite of duplicate patchinfos. I am saying that we
need to detect #6 and at a minimum respond like we do in #5.

It would be nice if someone could construct an example of #6 leading to
corruption so we can include it in the buggy tests.

Jason
msg5791 (view) Author: zooko Date: 2008-08-29.02:08:22
>> ... those bits of
>> information don't *have* to be included in the patch identifier in  
>> order to
>> achieve any of the three properties of patch identifiers that we were
>> discussing.  If I understand correctly, this is in agreement with  
>> what you
>> said in your previous note.
>
> I'm still not sure what you mean by patch identifier.  I think you
> mean the extra junk that you'd like to have added to the patch
> identifiers? The thing that currently doesn't exist?

No, I mean the same thing that you mean by "patch  
identifier" (although the statement is also true of other possible  
identifiers of patches).

> #2 is marginally better, but if it's done in any way that could
> provide hope of self-authentication, it would be too slow, and #3 is
> guaranteed to be too slow.
> ...
> So we're left with a #2 that only solves
> the problem of collision-resistance.

Could you tell me what you mean by this alternate form of #2 -- the  
one which could provide hope of self-authentication but is too slow  
-- as opposed to the form of #2 which provides only collision- 
resistance?

The strategy #2 that I am thinking of is to take the secure hash of  
the patch contents and the current context, and include the result  
value in the patch id.  This would provide collision-resistance and  
would be fast (I think).  You think (if I understand correctly) that  
it would not provide hope of self-authentication in the future.  I  
think that it might, although I haven't worked out the details enough  
to be sure that it would be practical.

> But is that a worthwhile tradeoff, including useless
> data in every patch identifier, just so we won't have collisions, when
> there is a friendlier way to eliminate the probability of collisions
> (i.e. to check for collisions).

Can you tell me how this strategy -- let's call it strategy #5:  
"check for collisions" -- is friendlier than #2?  Possibly because  
you are envisioning that the secure hash value would appear in the  
user interface in a way that would make that user interface less  
friendly?

Any of these strategies would also mean that we don't need to have  
the "sleep for one second" feature in order to guarantee unique  
timestamps, although apparently it turns out that we don't use that  
feature in darcs record -- I'm a bit confused about that.

Regards,

Zooko
---
http://allmydata.org -- Tahoe, the Least-Authority Filesystem
http://allmydata.com -- back up all your files for $5/month
msg6033 (view) Author: droundy Date: 2008-09-17.15:22:13
I'm pushing a fix for this, using the random number approach.  We'll see if
tailor users rise up in arms about this... if so, it's easy enough to change.

Zooko, I'm sorry for being rude about this, and thank you for being stubborn.

David
msg6036 (view) Author: droundy Date: 2008-09-17.18:49:38
The following patch updated the status of issue27 to be resolved:

* resolve issue27: add junk to patch identifiers. 
Ignore-this: b91ab6f6e05e0fda25488fa51653b741
msg7167 (view) Author: gwern Date: 2009-01-23.17:51:57
> Having the same patch name is actually very probable in some uses of
> darcs.  Some people are fond of just calling their minor patches
> 'Wibble' -- hmm maybe we can call this the "Wibble" problem -- so you
> could imagine somebody doing something like:
> darcs record -am Wibble foo
> darcs record -am Wibble bar
> 
> Well, that seems rather silly, but it isn't entirely far-fetched.

Actually, it isn't silly at all. Suppose one is using Darcs for storing a wiki
(like Gitit or Orchid do). 

One of the fundamental justifications of a wiki is allowing small minor edits.
Obviously if one is correcting the spelling of a single word, it makes no sense
to spend a full sentence summarizing one's change. One uses the shorthand- 'sp',
or 'fmt' or 'cpedit' or 'rv' or 'rvv' or 'ln' or '+See also' etc. This jargon is
very useful for speeding up review of edits, particularly subtle vandalism (a
reversion without a summary like 'rvv' can easily look like an editing dispute).
I daresay on the English Wikipedia alone there must be millions of edits which
have the same summary as another edit. (With 279,793,198 edits in total on en, I
suspect there are many very long/improbable edit summaries which are duplicates...)

Now, this might seem to be a problem only for large wikis, but it's very easy to
run into. When I was playing with the first release of Orchid, I made a test
edit messing with some of the markup, which I summarized, of course, as 'fmt'.
Later, I was adding some content and I saw I made a syntax error. My correction
had the summary 'fmt'... 

(Orchid didn't allow this to be saved because it wasn't a unique summary and
Orchid worked on unique summaries. Now that Orchid has moved to use filestore -
which uses darcs's hashs as IDs and specifically avoids the summaries - this
isn't a problem anymore. But it was.)
msg7172 (view) Author: kowey Date: 2009-01-24.08:58:56
On Fri, Jan 23, 2009 at 17:52:02 -0000, gwern wrote:
> (Orchid didn't allow this to be saved because it wasn't a unique summary and
> Orchid worked on unique summaries. Now that Orchid has moved to use filestore -
> which uses darcs's hashs as IDs and specifically avoids the summaries - this
> isn't a problem anymore. But it was.)

So, darcs 2.2.0 generates some random data in the patch log now (using
an Ignore-this) field... so by rights it should work even if Orchid is
relying on the logs (note logs and not just summary though)
History
Date User Action Args
2005-11-29 02:44:14zookocreate
2005-11-29 02:48:25zookosetstatus: unread -> unknown
nosy: droundy, tommy, zooko
messages: + msg100
2005-11-29 03:39:13beschmisetnosy: + beschmi
messages: + msg101
2005-11-29 11:47:12zookosetnosy: droundy, tommy, beschmi, zooko
messages: + msg102
2005-11-29 12:44:04droundysetnosy: droundy, tommy, beschmi, zooko
messages: + msg104
2005-11-29 14:29:55zookosetnosy: droundy, tommy, beschmi, zooko
messages: + msg108
2006-04-11 18:38:43zookosetnosy: droundy, tommy, beschmi, zooko
messages: + msg616
2008-01-19 02:43:14markstossetnosy: + markstos, kowey
messages: + msg2573
2008-01-19 13:32:25droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg2597
2008-01-19 15:59:16markstossettopic: + Darcs2
nosy: droundy, tommy, beschmi, kowey, markstos, zooko
2008-01-28 14:33:19zookosetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg2833
2008-01-28 15:17:17droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg2835
2008-01-28 15:56:08zookosetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg2846
2008-02-13 22:40:56markstossetstatus: unknown -> deferred
nosy: droundy, tommy, beschmi, kowey, markstos, zooko
2008-03-06 13:56:02zookosetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg3779
2008-03-06 22:45:31markstossetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg3789
2008-03-13 13:28:50zookosetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg3881
2008-03-17 22:05:24droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg3899
2008-03-17 22:22:00zookosetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg3901
2008-03-18 13:27:49markstossetnosy: droundy, tommy, beschmi, kowey, markstos, zooko
messages: + msg3904
2008-03-31 15:38:51droundylinkissue772 superseder
2008-08-28 08:12:34koweylinkissue1026 superseder
2008-08-28 08:37:21koweysetstatus: deferred -> unknown
nosy: + Serware, dagit
topic: + Target-2.0, - Darcs2
messages: + msg5749
2008-08-28 09:06:44koweysetstatus: unknown -> needs-reproduction
nosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, Serware
2008-08-28 10:18:38nwfsetnosy: + nwf
messages: + msg5755
2008-08-28 10:37:43koweysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, Serware, nwf
messages: + msg5756
2008-08-28 11:25:50nwfsetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, Serware, nwf
messages: + msg5758
2008-08-28 12:14:54koweysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, Serware, nwf
messages: + msg5759
2008-08-28 13:34:30droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, Serware, nwf
messages: + msg5764
2008-08-28 14:42:48koweysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, Serware, nwf
messages: + msg5767
2008-08-28 14:53:32koweysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, Serware, nwf
messages: + msg5768
2008-08-28 15:08:49droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, Serware, nwf
messages: + msg5769
2008-08-28 15:09:48droundysettopic: - Target-2.0
nosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, Serware, nwf
messages: + msg5770
2008-08-28 15:32:00zookosetnosy: + serware
messages: + msg5771
2008-08-28 15:47:20koweysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5772
2008-08-28 15:56:00droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5773
2008-08-28 15:59:48droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5774
2008-08-28 16:14:13zookosetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5775
2008-08-28 17:46:28droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5777
2008-08-28 17:50:41nwfsetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5779
2008-08-28 19:00:30zookosetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5782
2008-08-28 19:38:29koweysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5784
2008-08-28 23:13:12droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5789
2008-08-29 00:42:05dagitsetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5790
2008-08-29 02:08:30zookosetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, serware, Serware, nwf
messages: + msg5791
2008-09-17 12:06:46dmitry.kurochkinsetnosy: + dmitry.kurochkin
2008-09-17 15:22:15droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, dmitry.kurochkin, serware, Serware, nwf
messages: + msg6033
2008-09-17 15:22:30droundysetstatus: needs-reproduction -> has-patch
nosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, dmitry.kurochkin, serware, Serware, nwf
2008-09-17 15:22:57droundysetnosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, dmitry.kurochkin, serware, Serware, nwf
assignedto: droundy
2008-09-17 18:49:40droundysetstatus: has-patch -> resolved-in-unstable
nosy: droundy, tommy, beschmi, kowey, markstos, zooko, dagit, dmitry.kurochkin, serware, Serware, nwf
messages: + msg6036
2009-01-23 17:52:01gwernsetnosy: + gwern, simon, thorkilnaur
messages: + msg7167
2009-01-23 22:41:41droundysetnosy: - droundy
2009-01-24 08:58:58koweysetnosy: tommy, beschmi, kowey, markstos, zooko, dagit, simon, thorkilnaur, gwern, dmitry.kurochkin, serware, Serware, nwf
messages: + msg7172
2009-04-22 03:26:15twbsetstatus: resolved-in-unstable -> resolved
nosy: tommy, beschmi, kowey, markstos, zooko, dagit, simon, thorkilnaur, gwern, dmitry.kurochkin, serware, Serware, nwf
2009-08-06 20:29:31adminsetnosy: - beschmi
2009-08-10 23:56:28adminsetnosy: - dagit
2009-08-25 17:39:45adminsetnosy: + darcs-devel, - simon
2009-08-27 14:17:51adminsetnosy: tommy, kowey, markstos, darcs-devel, zooko, thorkilnaur, gwern, dmitry.kurochkin, serware, Serware, nwf
2009-10-23 22:44:26adminsetnosy: - Serware
2009-10-23 23:27:51adminsetnosy: + Serware, - serware