Issue 2644: detect "invalid" patches on pull and apply

Created on 2020-05-04.09:14:39 by bfrk, last changed 2020-05-04.09:14:39 by bfrk.
msg22010
Author: bfrk
Date: 2020-05-04.09:14:32
See also #1879 and discussion starting here:

https://lists.osuosl.org/pipermail/darcs-devel/2020-February/020920.html

To recap the problem, patches can be "spoofed" (accidentally or by
malicious intent) in such a way that when we pull them into an otherwise
valid repository, darcs crashes or (worse) happily applies an invalid
patch resulting in an invalid repo in which internal invariants no
longer hold.

This is not about patches being syntactically invalid in the sense of
darcs not being able to read them: that immediately leads to failure,
so such a patch cannot infect an otherwise valid repo. The kind of
problems we want to prevent are caused by patch semantics. From now on I
will assume that all patches are syntactically valid.

It is important to realize that, in general, patches cannot be judged
valid or invalid by themselves. Primitive patches are an extreme case:
for each syntactically valid prim patch there is a repository state in
which this patch can be applied and is semantically valid. While this is
not necessarily the case for the more complex patch types built on top
of prim patches, such as conflictors and named patches, it is still true
that full validation requires access to the /context/ (all patches
preceding the patch). It is also important to realize that being able to
apply a patch is not the same as validating it, for the simple reason
that applying a patch always reduces it to its effect, which loses
information.

To validate conflictors and named patches when we add them to our repo
we need an extended patch API that allows us to pass the context as an
extra argument. Note that for RepoPatchV3 the necessary checks (in
context) are already implemented as properties.
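
As an illustration only, here is a minimal sketch of what a validation
call that takes the context as an extra argument could look like. It is
written in Python rather than darcs' Haskell, and everything in it is a
hypothetical toy: the Named class, its fields, and validate_in_context
are not darcs API, and the two checks shown (no duplicate name, no
missing explicit dependency) are only examples of checks that merely
applying the patch would not perform.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Named:
    """Toy named patch (hypothetical, not the darcs type)."""
    info: str             # stands in for darcs' PatchInfo
    deps: FrozenSet[str]  # infos of patches this one explicitly depends on
    content: str          # stands in for the contained changes


def validate_in_context(context: List[Named], patch: Named) -> List[str]:
    """Return the problems found; an empty list means the checks passed."""
    present = {p.info for p in context}
    problems = []
    if patch.info in present:
        problems.append(f"{patch.info}: already present in context")
    for dep in sorted(patch.deps - present):
        problems.append(f"{patch.info}: missing dependency {dep}")
    return problems


context = [Named("A", frozenset(), "add file f"),
           Named("B", frozenset({"A"}), "edit f")]
ok = Named("C", frozenset({"A", "B"}), "edit f again")
bad = Named("D", frozenset({"X"}), "edit f once more")

print(validate_in_context(context, ok))   # []
print(validate_in_context(context, bad))  # ['D: missing dependency X']
```

The point of the sketch is the signature: the same patch can pass or
fail depending on the context it is checked against.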

Another major part of the problem is the reliance on global uniqueness
invariants. Specifically, we rely on a global invariant that says: two
patches with the same name (PatchInfo) in the same context (set of
preceding patches) have equal content. The reason we rely on this
invariant is so we can efficiently exchange patches between repos. When
we merge two repos, we first commute all the common patches to a common
"trunk", using PatchInfo equality. Only the remaining patches are
actually transferred and merged.
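
The following toy sketch (Python, not darcs code) shows why this
reliance matters. Patches are modelled as hypothetical (info, content
hash) pairs and commutation is ignored entirely; the only thing shown
is that a split based on patch info alone treats a patch as common even
if its content differs between the two repos.

```python
from typing import List, Set, Tuple

Patch = Tuple[str, str]  # (patch info, content hash); both are toy stand-ins


def split_common(ours: List[Patch],
                 theirs: List[Patch]) -> Tuple[Set[str], List[Patch], List[Patch]]:
    """Partition two patch sequences using patch info equality only."""
    common = {info for info, _ in ours} & {info for info, _ in theirs}
    only_ours = [p for p in ours if p[0] not in common]
    only_theirs = [p for p in theirs if p[0] not in common]
    return common, only_ours, only_theirs


ours = [("A", "h1"), ("B", "h2"), ("C", "h3")]
theirs = [("B", "hX"), ("A", "h1"), ("D", "h4")]  # "B" has different content

common, only_ours, only_theirs = split_common(ours, theirs)
print(sorted(common))  # ['A', 'B']
print(only_ours)       # [('C', 'h3')]
print(only_theirs)     # [('D', 'h4')]
```

"B" ends up in the common set although the two hashes disagree, so
nothing would be transferred or compared for it.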

Again, violations of global uniqueness cannot, in general, be judged
valid or invalid objectively. All we can do is detect if there is an
inconsistency between two repos.

In the following I'll concentrate on the problem of detecting
consistency violations. I'll assume that the local repo is valid in the
sense that all its patches pass internal validity checks (in context).
This could be ensured by extending/improving the existing 'darcs check'
command.

The algorithm for checking consistency between a local repo and a remote
one is similar to what we do when we merge patches.

We first download only inventories, starting with the latest one and
stopping as soon as a parent inventory coincides with one of our own.
This can be cheaply tested by comparing their inventory hashes. This
gives us a guaranteed common starting point: we know that up to that
point the histories are completely identical (modulo hash collisions for
SHA256).
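
A minimal sketch of this step, in Python and under simplifying
assumptions: each remote inventory is modelled as a hypothetical pair
(parent inventory hash, list of patch infos), and fetch stands in for
whatever actually downloads an inventory by hash. None of these names
come from darcs.

```python
from typing import Callable, List, Optional, Set, Tuple

Inventory = Tuple[Optional[str], List[str]]  # (parent hash, patch infos)


def fetch_until_common(head: str,
                       fetch: Callable[[str], Inventory],
                       our_hashes: Set[str]) -> Tuple[Optional[str], List[str]]:
    """Walk the remote inventory chain from `head` toward the root.

    Stops at the first inventory hash we also have locally. Returns that
    hash (or None if the chains share nothing) and the remote patch infos
    after that point, oldest first.
    """
    segments: List[List[str]] = []
    current: Optional[str] = head
    while current is not None and current not in our_hashes:
        parent, infos = fetch(current)
        segments.append(infos)
        current = parent
    remote_tail = [info for seg in reversed(segments) for info in seg]
    return current, remote_tail


remote = {"r3": ("r2", ["E", "F"]),
          "r2": ("i1", ["C", "D"]),
          "i1": (None, ["A", "B"])}
fetched = []


def fetch(inv_hash: str) -> Inventory:
    fetched.append(inv_hash)  # record what had to be downloaded
    return remote[inv_hash]


shared, tail = fetch_until_common("r3", fetch, {"i1", "o2"})
print(shared, tail)  # i1 ['C', 'D', 'E', 'F']
print(fetched)       # ['r3', 'r2']
```

Only the two inventories newer than the shared one are fetched; the
shared inventory itself is recognised by its hash alone.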

We are now left with two inventory lists, in which common patches may be
listed in a different order. We can now compare them patch by patch,
starting with the earliest one. Whenever we see a difference in the
order of patches (i.e. the patch infos don't compare equal), we locate
our own corresponding patch and commute it backwards to the same
position. Patches in the same position have the same context and thus
can be compared (using their patch hashes) and we can report failure if
we see a difference. This stops when we are done or when we hit a remote
patch that isn't in our repo. At this point we have to actually download
remote patches. We continue the process in a similar manner, commuting
either our own or the remote patches until we have compared all common
patches.
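
The comparison loop can be sketched roughly as follows. This is a toy
model in Python, not the darcs algorithm: patches are hypothetical
(info, content hash) pairs, "commuting" is modelled as plain reordering
that always succeeds and never changes a patch, and remote-only patches
are simply set aside as if they had been commuted out of the way. Real
commutation can fail and can change patch representations, which the
sketch ignores.

```python
from typing import List, Tuple

Patch = Tuple[str, str]  # (patch info, content hash); toy stand-ins


def compare_common(ours: List[Patch], theirs: List[Patch]):
    """Align our patches with the remote order and compare common ones.

    Returns (infos whose hashes differ, remote-only patches,
    our patches in their reordered form).
    """
    ours = list(ours)  # work on a copy; reordering models commutation
    mismatches: List[str] = []
    remote_only: List[Patch] = []
    pos = 0  # next position in `ours` that still has to be aligned
    for info, their_hash in theirs:
        idx = next((i for i in range(pos, len(ours)) if ours[i][0] == info),
                   None)
        if idx is None:
            remote_only.append((info, their_hash))  # not in our repo
            continue
        ours.insert(pos, ours.pop(idx))  # "commute" our patch back to pos
        if ours[pos][1] != their_hash:   # same position, so same context
            mismatches.append(info)
        pos += 1
    return mismatches, remote_only, ours


ours = [("A", "h1"), ("B", "h2"), ("C", "h3")]
theirs = [("B", "hX"), ("D", "h4"), ("A", "h1")]

mismatches, remote_only, reordered = compare_common(ours, theirs)
print(mismatches)   # ['B']
print(remote_only)  # [('D', 'h4')]
print(reordered)    # [('B', 'h2'), ('A', 'h1'), ('C', 'h3')]
```

The reordered list is what could be retained locally if the comparison
succeeds; here it would not be, because "B" is reported as inconsistent.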

The last thing to do is to check that all remaining (uncommon) remote
patches are semantically valid in the now established common
context. (Remember that merely checking if a patch can be applied is not
enough.) If we merely want to guard against inconsistencies in our own
repo, we could limit this check to the patches that are actually added
to our repo.

It may well turn out that this whole procedure is cheap enough that we
can do it regularly whenever we apply, pull, or push.

To cache the results of this validation, we could offer the user the
possibility of retaining the changed order of patches (after a
successful comparison) in the local repo so it becomes identical to the
remote one. This means the next time we do the check against the same
remote repo, we only have to compare the latest inventory.

Another idea is to cache validated remote inventories in a special
subdirectory, without committing to the order of patches therein. This
could be useful in case we regularly pull from different remote repos.
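
One possible shape for such a cache, sketched in Python purely as an
illustration: a directory holding one empty marker file per validated
remote inventory hash. The class name, the directory name, and the
marker-file layout are all assumptions made for this sketch and do not
describe any existing darcs repository format.

```python
import tempfile
from pathlib import Path


class ValidatedInventoryCache:
    """Remembers which remote inventory hashes have already been validated."""

    def __init__(self, directory: Path):
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def mark_validated(self, inventory_hash: str) -> None:
        # An empty file named after the hash serves as the marker.
        (self.directory / inventory_hash).touch()

    def is_validated(self, inventory_hash: str) -> bool:
        return (self.directory / inventory_hash).is_file()


with tempfile.TemporaryDirectory() as tmp:
    # "validated_inventories" is a made-up directory name.
    cache = ValidatedInventoryCache(Path(tmp) / "validated_inventories")
    cache.mark_validated("r3")
    results = (cache.is_validated("r3"), cache.is_validated("r9"))

print(results)  # (True, False)
```

Because only hashes are recorded, the cache says nothing about the
order of patches in the local repo, which matches the idea of not
committing to the remote order.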