darcs

Issue 2379 only clone repositories with packs when they are up-to-date

Title: only clone repositories with packs when they are up-to-date
Priority:
Status: has-patch
Milestone:
Resolved in: 2.10.0
Superseder:
Nosy List: darcs-devel, gh, simon
Assigned To:
Topics:

Created on 2014-04-15.20:47:32 by gh, last changed 2023-04-01.12:55:26 by bfrk.

Messages
msg17352 Author: gh Date: 2014-04-15.20:47:30
Packs aim at making repo cloning via HTTP faster. To create packs, the
user must run "darcs optimize --http", which creates packs corresponding
to the current state of the repository.

When packs get outdated (because of new patches), "darcs get" gets the
packs anyway, and applies the missing patches. The problem is that
outdated packs make cloning *slower* than cloning without packs, since
patch application can be costly.

So I suggest a little change of format and behaviour:

* when creating packs, copy pristine hash to _darcs/packs/pristine
* when getting, compare remote _darcs/packs/pristine to the pristine
hash of _darcs/hashed_inventory
* if _darcs/packs/pristine does not exist, or hash is different, get
normally, otherwise get with packs (function copyPackedRepository2)

Basically this makes darcs clone a repository with packs only when they
are up-to-date (modulo pristine hash collisions, which can happen, mostly
if the missing patches are tags).
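
A minimal sketch of the check proposed above, in Python for illustration only (darcs itself is written in Haskell, and this is not its code). It assumes that _darcs/hashed_inventory carries the pristine hash on a line starting with "pristine:" and that _darcs/packs/pristine holds either the bare hash or the same "pristine:" line. The names `pristine_hash` and `pack_is_current` are hypothetical.

```python
from typing import Optional

PREFIX = "pristine:"  # assumed marker of the pristine hash line


def pristine_hash(text: str) -> Optional[str]:
    """Return the hash from the first 'pristine:' line, or None if absent."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            return line[len(PREFIX):].strip() or None
    return None


def pack_is_current(packs_pristine: Optional[str], hashed_inventory: str) -> bool:
    """Decide whether the packs match the current state of the repository.

    packs_pristine is the content of the remote _darcs/packs/pristine,
    or None when that file does not exist.
    """
    if packs_pristine is None:
        return False  # no recorded hash: get normally
    # Accept both "pristine:<hash>" and a bare hash (assumption).
    recorded = pristine_hash(packs_pristine) or packs_pristine.strip()
    current = pristine_hash(hashed_inventory)
    return bool(recorded) and recorded == current
```

A caller would get with packs only when `pack_is_current` returns True, and fall back to a normal get otherwise.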

As a bonus, this is backward compatible with darcs 2.8; in any case, packs
were not enabled by default, so I guess we can change them as we wish.

Related:

* <http://darcs.net/Internals/OptimizeHTTP>
* <http://irclog.perlgeek.de/darcs/2014-04-15#i_8592088>
msg17353 Author: gh Date: 2014-04-15.20:49:02
Sorry for the lack of proofreading; here's a correction:

Outdated packs do *not* make cloning *systematically* slower, but they
can with time.
msg17354 Author: kowey Date: 2014-04-16.09:00:09
Wasn't the idea behind packs supposed to be that we would fetch from
both sides and meet in the middle?



-- 
Eric Kow <http://erickow.com>
msg17356 Author: gh Date: 2014-04-17.18:00:08
Yes that was the idea, but in the case of getting the last pristine
state, it does not work well in all cases, since outdated packs require
downloading and applying extra patches, which unfortunately is slow in
some real-world cases.

One toy case I made for the sake of the argument is this repo:
<http://www.cs.famaf.unc.edu.ar/~hoffmann/badpacks/>  It has 2 patches,
one that introduces a big binary file, and another that replaces its
contents with only a few bytes. Cloning it without packs is much faster
than with.

And I can't think of any way of predicting whether it's worth using
packs+new patches versus pristine downloading.

Now for getting the whole history... actually yes, the "meeting in the
middle" idea works, since we just want to download all patches. So for
the patches pack, I think that we should use it in all cases.

That is, my proposal is now:

* when creating packs, copy pristine hash to _darcs/packs/pristine
* when getting, compare remote _darcs/packs/pristine to the pristine
hash of _darcs/hashed_inventory
* if _darcs/packs/pristine does not exist, or hash is different, get
the pristine cache normally, otherwise get it with packs (beginning of
function copyPackedRepository2)
* if _darcs/packs/patches.tar.gz exists, grab this pack and patches in
parallel (end of function copyPackedRepository2)
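
The revised proposal makes two independent decisions: one for the pristine state and one for the patches. A minimal sketch in Python, for illustration only and not the code of copyPackedRepository2; `ClonePlan` and `plan_clone` are hypothetical names, and the hashes are assumed to have been read from the remote repository already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClonePlan:
    pristine_from_pack: bool  # get pristine with the basic pack
    patches_from_pack: bool   # grab patches.tar.gz in parallel with patches


def plan_clone(pack_pristine: Optional[str],
               current_pristine: str,
               has_patches_pack: bool) -> ClonePlan:
    """Choose the source for pristine and for patches separately.

    pack_pristine is the hash stored in _darcs/packs/pristine (None if the
    file is missing); current_pristine is the pristine hash recorded in
    _darcs/hashed_inventory.
    """
    # The basic pack is used only when it matches the current pristine state.
    use_basic = pack_pristine is not None and pack_pristine == current_pristine
    # The patches pack is used whenever it exists, even if outdated.
    return ClonePlan(pristine_from_pack=use_basic,
                     patches_from_pack=has_patches_pack)
```

With an outdated basic pack but an existing patches pack, `plan_clone("old", "new", True)` yields a plan that gets pristine normally and still uses the patches pack.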
msg17429 Author: noreply Date: 2014-05-04.20:17:25
The following patch sent by Guillaume Hoffmann <guillaumh@gmail.com> updated issue issue2379 with
status=resolved;resolvedin=2.10.0 HEAD

* resolve issue2379: only use packs to copy pristine when up-to-date 
Ignore-this: 76acb197a8a681ef92c496819b08add5

When creating packs, save pristine hash to _darcs/packs/pristine
If basic pack is outdated, do not fetch it, but fetch patches pack
anyway.
In Darcs.Repository, separate functions between the ones that fetch
basic repository and complete repository (packed or not), and
separate function that clones old-fashioned repositories.
msg23243 Author: bfrk Date: 2023-04-01.12:55:25
Re-opening for discussion.

> When packs get outdated (because of new patches), "darcs get" gets the
> packs anyway, and applies the missing patches. The problem is that [...]
> patch application can be costly.

Yes, patch application is costly, but there is no reason to do that. We simply 
download the missing pristine files after getting the basic pack. (In case you 
have trouble seeing how and where this is done: it is happening as a side-effect 
of createPristineDirectoryTree.)

It is still true that there are cases where this is slower than only downloading 
the current pristine files. The test case mentioned by gh (from the description, 
unfortunately the repo is no longer online) has two special properties: (1) the 
basic pack is much larger than the sum of current pristine file sizes; and (2) 
the number of (current) pristine files is small.

On the other hand, it is easy to see that there are cases where using an outdated 
basic pack is much cheaper, namely when there are many pristine files, the pack 
is only slightly outdated, and the latest (unpacked) patches only touch a small 
percentage of existing files.

So this is a trade-off and we have to decide which case is more likely to occur 
in practice. And the answer to that is quite obviously that the latter case is 
much more typical than the former.
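
The trade-off can be made concrete with a rough cost model, sketched here in Python. Everything in it is an assumption made for illustration: the formula (one round trip per HTTP request plus bytes over bandwidth), the default latency and bandwidth, and the function names. It is not how darcs measures or decides anything.

```python
def transfer_time(requests: int, total_bytes: int,
                  latency_s: float, bandwidth_bps: float) -> float:
    """Assumed model: one round trip per request plus raw transfer time."""
    return requests * latency_s + total_bytes / bandwidth_bps


def clone_costs(pack_bytes: int, changed_files: int, changed_bytes: int,
                all_files: int, all_bytes: int,
                latency_s: float = 0.05,
                bandwidth_bps: float = 1_000_000.0) -> tuple:
    """Return (seconds with an outdated basic pack, seconds without packs)."""
    # With the pack: one request for the pack, then one per pristine file
    # that changed since the pack was created.
    with_pack = transfer_time(1 + changed_files, pack_bytes + changed_bytes,
                              latency_s, bandwidth_bps)
    # Without packs: one request per current pristine file.
    without_pack = transfer_time(all_files, all_bytes,
                                 latency_s, bandwidth_bps)
    return with_pack, without_pack


# Hypothetical grown project: many files, slightly outdated pack.
typical = clone_costs(pack_bytes=5_000_000, changed_files=50,
                      changed_bytes=150_000, all_files=2000,
                      all_bytes=6_000_000)

# Hypothetical toy repo: huge pack, a single tiny current pristine file.
toy = clone_costs(pack_bytes=100_000_000, changed_files=1,
                  changed_bytes=10, all_files=1, all_bytes=10)
```

Under these made-up numbers the outdated pack wins for the grown project and loses for the toy repo, which matches the two cases described above.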

Projects tend to grow over time, adding more and more files (documentation, test 
cases, features). The number of files to download for pristine is the main reason 
why (even a lazy) unpacked clone can become unbearably slow unless you have most 
of the files already in your global cache. While it may happen occasionally that 
you remove a very large file or make it (much) smaller, the vast majority of 
patches make small to medium sized changes that leave the contents of most files 
untouched.

Here is a real world example for which I have made a repo available via HTTP as 
http://darcs.net/test. This is screened (at the time of writing this) with packs 
outdated by 121 patches. The times are the best out of three successive runs to 
account for server side caching. The command was:

> time darcs clone http://darcs.net/test -v --no-cache --lazy

With current darcs (use basic pack only if current): 2:54 minutes
With a patched darcs (always use basic pack if available): 0:42 minutes

The improvement factor is roughly 4; it gets better the less outdated the packs are, 
e.g. with only 30 patches I get something on the order of 20 seconds for the 
patched darcs, but of course this depends on the number of files touched by the 
patches not included in the pack. The optimum (packs are current) is about 8 
seconds for both variants. If you run `darcs optimize http` regularly from a cron 
job, then packs will always be at least "almost up to date" and you get a huge 
improvement from using basic packs.
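
The improvement factor follows directly from the two timings reported above. A small Python check (the helper name `to_seconds` is made up for this example):

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'M:SS' duration such as '2:54' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


current_darcs = to_seconds("2:54")  # use basic pack only if current
patched_darcs = to_seconds("0:42")  # always use basic pack if available

speedup = current_darcs / patched_darcs
print(f"{current_darcs} s vs {patched_darcs} s, factor {speedup:.2f}")
# prints "174 s vs 42 s, factor 4.14"
```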

I won't be able to send the patches that make this change for some time because 
they depend on a number of unrelated improvements.
History
Date                 User     Action  Args
2014-04-15 20:47:32  gh       create
2014-04-15 20:49:03  gh       set     messages: + msg17353
2014-04-15 21:08:06  gh       set     title: only clone repositories with packs when they are up-to-date -> only use with packs when up-to-date
2014-04-15 21:14:55  gh       set     title: only use with packs when up-to-date -> only use packs when up-to-date
2014-04-16 09:00:11  kowey    set     messages: + msg17354
                                      title: only use packs when up-to-date -> only clone repositories with packs when they are up-to-date
2014-04-16 13:35:57  gh       set     nosy: + darcs-devel
2014-04-16 13:36:18  gh       set     nosy: + simon
2014-04-17 18:00:10  gh       set     messages: + msg17356
2014-05-04 20:17:26  noreply  set     status: unknown -> resolved
                                      messages: + msg17429
                                      resolvedin: 2.10.0
2023-04-01 12:55:26  bfrk     set     status: resolved -> has-patch
                                      messages: + msg23243