darcs

Issue 1477 darcs show contents --match 'hash X' is slow

Title darcs show contents --match 'hash X' is slow
Priority bug Status given-up
Milestone Resolved in
Superseder wish: performance of darcs changes <file>
View: 124
Nosy List darcs-devel, dmitry.kurochkin, gwern, kowey, lele, mornfall, thorkilnaur, zooko
Assigned To mornfall
Topics Matchers, Performance

Created on 2009-06-03.22:39:55 by zooko, last changed 2017-07-30.23:42:02 by gh.

Messages
msg7873 (view) Author: zooko Date: 2009-06-03.22:39:45
There follows a message that I wrote to the darcs-users mailing  
list.  Eric Kow asked me to open a ticket for it, so here it is.  He  
also asked what parameters cause this to be slow.  I think, but am  
not sure, that the main issue is a deep history.  Projects which have  
many thousands of patches seem more likely to be slow.

Regards,

Zooko

Begin forwarded message:

> From: Zooko Wilcox-O'Hearn <zooko@zooko.com>
> Date: May 27, 2009 10:26:07 AM MDT
> To: Eric Kow <kowey@darcs.net>
> Cc: Petr Rockai <me@mornfall.net>, Darcs Mailing list <darcs- 
> users@darcs.net>
> Subject: Re: [darcs-users] Hashed-storage & darcs 2.3 (Was; Re: 2.3  
> release	schedule)
>
> I have a request:  speed up the operations which users have to wait  
> on when they browse a darcs repository through trac, and deploy it  
> as soon as possible!
>
> For example, the front page of http://darcs.net has a link to this  
> page:
>
> http://allmydata.org/trac/darcs-2/browser
>
> If you load that page, trac issues the darcs commands like these:
>
> 2009-05-27 09:15:14,095 Trac[command] DEBUG: cd /home/source/darcs/ 
> darcs-2/trunk-hashedformat; TZ=UTC darcs --version
> 2009-05-27 09:15:14,244 Trac[command] DEBUG: cd /home/source/darcs/ 
> darcs-2/trunk-hashedformat; TZ=UTC darcs query contents --quiet -- 
> match "hash 20090513075931- 
> c31ef-94cc8b3d6a88ae270963d60d6d5426bd0bda7245.gz" "GNUmakefile"
> 2009-05-27 09:15:14,638 Trac[command] DEBUG: cd /home/source/darcs/ 
> darcs-2/trunk-hashedformat; TZ=UTC darcs query contents --quiet -- 
> match "hash 20090513075931- 
> c31ef-94cc8b3d6a88ae270963d60d6d5426bd0bda7245.gz" "configure.ac"
>
> If there are enough files which have been changed since the last  
> time trac queried darcs and cached the result, then this can take  
> many seconds.  While it is processing, other users of the same trac  
> will get a "database is locked" error if they try to view the same  
> repository.  On some repositories that I use, it takes more than 30  
> seconds and trac times-out (believing that the database or revision  
> control tool must have locked up) and gives an error message.
>
> Also if you view the contents of a file like this:
>
> http://allmydata.org/trac/darcs-2/browser/src/darcs.tex
>
> Or the history of a file or directory, like this:
>
> http://allmydata.org/trac/darcs-2/log/src/darcs.tex
>
> Then trac will issue a 'darcs query contents --match "hash ..."'  
> command.
>
> So, it would be great for me if 'darcs query contents --match  
> "hash ..."' were as fast as possible, and if this speed-up were  
> released in a stable and completely bug-free new version of darcs  
> as soon as possible.  ;-)
>
> Thanks!
>
> Regards,
>
> Zooko
> _______________________________________________
> darcs-users mailing list
> darcs-users@darcs.net
> http://lists.osuosl.org/mailman/listinfo/darcs-users
msg7881 (view) Author: kowey Date: 2009-06-04.06:17:51
Marking 'Performance' and updating to reflect the new command name (we've got to
have a common vocabulary here)
msg7888 (view) Author: zooko Date: 2009-06-04.15:52:58
I just set up a new trac showing this repository:
http://allmydata.org/source/cryptopp/trunk .  Here is the trac:
http://allmydata.org/trac/cryptopp .  If you click on "Browse Source" and then
click on the directory named "c5" then it will time out and give an error. 
That's because there are 290 files in "c5", and for each file trac issues a
command like

darcs query contents --quiet --match "hash
20021004173141-0a92f-14ec3ac46e0a267c7eaaa7342de0f999da20f406.gz" "tailor.config"

That one command takes 44 seconds on my server.  That's about 100 times as long
as would be usable.  To reproduce this, run "darcs get
http://allmydata.org/source/cryptopp/trunk" and then cd into it and run:

darcs query contents --quiet --match "hash
20021004173141-0a92f-14ec3ac46e0a267c7eaaa7342de0f999da20f406.gz" "tailor.config"

Please let me know how long that command takes on your machine and with your
version of darcs.
msg7889 (view) Author: zooko Date: 2009-06-04.15:59:58
Hm, no a speedup by a factor of 100 would not make this usable if trac does the
same command for each of the 290 files and it takes as long for each one.  Let's
say for this to be usable you need to get a listing of the directory which
contains 290 files in, let's say, less than 10 seconds.  So darcs has to provide
whatever information it is that trac needs in about 300 milliseconds per file.

Maybe trac or Lele's trac-darcs plugin could generate the appropriate web page
listing the directory *without* needing to query darcs about each file?
msg7890 (view) Author: kowey Date: 2009-06-04.16:05:03
Adding Gwern because he may be interested for filestore, and Benedikt as well
because of his filecache work (hope you don't mind)

It'd be good to see if the issue here is match hash (and not the show contents
command in itself).

Also, another thing to consider is if it's the same thing that annotate and
changes are affected by, just having to comb through all patches, when we could
just be grabbing the patches that affect a specific file instead.  This is where
Benedikt's work comes in.
msg7891 (view) Author: kowey Date: 2009-06-04.16:05:46
As for the multiple file use case, darcs diff may be useful.  It also accepts
matchers.
msg7900 (view) Author: lele Date: 2009-06-04.22:32:24
On Thu, 04 Jun 2009 16:00:02 -0000
Zooko <bugs@darcs.net> wrote:

> Maybe trac or Lele's trac-darcs plugin could generate the appropriate
> web page listing the directory *without* needing to query darcs about
> each file?

It may be worth remembering that trac-darcs heavily caches files's
content, and it should be calling darcs just once per file per
revision.

Maybe a simple optimization could be handling "darcs query contents
fileA fileB fileC" at least with --xml-output: I bet the time it takes
darcs to compute all the files is a fraction of the N calls to compute
them one at a time.

ciao, lele.
-- 
nickname: Lele Gaifax    | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas    | comincerò ad aver paura di chi mi copia.
lele@nautilus.homeip.net |                 -- Fortunato Depero, 1929.
msg8317 (view) Author: kowey Date: 2009-08-21.17:01:14
On Zooko's cryptopp repository.

Here we search for (a) darcs changes --match 'hash' with something that's far
back in the history (b) darcs show contents with the same as a, (c) darcs show
contents matching on the most recent patch (d) darcs show contents on HEAD
without actually using any matchers.

Petr says that he was able to improve matcher efficiency in his branch of darcs,
which sounds very promising, but I'm not sure what the implications of these
improvements are exactly.

As we can see in (a) just matching and (c) just loading the repository into
memory are not particularly costly on their own [although there is *some*
overhead, as we see in (d)].  What seems to be hurting us most is when we have
to go unapply a bunch of patches to our in-memory representation of the tree. 
But does that mean this bit in particular is slow, or could it be something
ridiculous like us holding on patches way after we're done looking at them?

I'd like to see timings with old darcs and his darcs, which is why I'm assigning
this to Petr.

Afterwards, I suggest:

- adding this sort of test to the benchmarks
- more profiling (and general thinking about things)

I'm also throwing in issue124 (for the filecache work).  As a good way on
cutting down on variables however, it may be a good idea to work with some demo
repositories where there is only ever one file in the repo, just a whole lot of
patches that affect that one file.  This might give us a better understanding of
what's going on.

$ time darcs changes --quiet --match "hash
20021004173141-0a92f-14ec3ac46e0a267c7eaaa7342de0f999da20f406.gz"
"tailor.config" > /dev/null

real	0m0.117s
user	0m0.020s
sys	0m0.008s

$ time darcs show contents --quiet --match "hash
20021004173141-0a92f-14ec3ac46e0a267c7eaaa7342de0f999da20f406.gz"
"tailor.config" > /dev/null

real	0m30.343s
user	0m8.417s
sys	0m12.757s
$ time darcs show contents --quiet --match "hash
20090711225018-e3536-21ecdecdcefba3c64f3407fd049acc83558ecd97.gz"
"tailor.config" > /dev/null

real	0m3.401s
user	0m0.288s
sys	0m0.136s

$ time darcs show contents --quiet "tailor.config" > /dev/null

real	0m0.043s
user	0m0.004s
sys	0m0.012s
msg8997 (view) Author: kowey Date: 2009-10-22.09:39:06
I think I mis-filed this.  Marking need-action because we need somebody to
generate timings for this using darcs-hs (comparing to current darcs)
msg9192 (view) Author: kowey Date: 2009-11-03.17:35:46
I have some updates to this.  The main user affected is trac-darcs (and
presumably other web interfaces or guis around darcs repositories).  I'll speak
from the point of view of trac-darcs

source browser [workaround found, trac-darcs 0.7]
-------------------------------------------------
On 2009-07, Zooko pointed reminded us about this problem for trac-darcs.  Petr
then pointed out that trac-darcs could actually avoid using matchers for getting
the HEAD version of the file.  This according to Zooko's report made trac-darcs
much more usable.

See http://lists.osuosl.org/pipermail/darcs-users/2009-July/020387.html and also
a recent chat about this http://irclog.perlgeek.de/darcs/2009-11-03#i_1681206

So fixing this would only really matter for the case where users specifically
decide to run the source browser on an older version of a file or directory.

changeset viewing [still a problem]
-----------------------------------
Zooko pointed out another case where this affects us,
http://lists.osuosl.org/pipermail/darcs-users/2009-October/021906.html

Petr has suggested some possible ideas on how to make this better in practice
(see the thread for details)

One workaround I've suggested to Lele was to use darcs diff, as you'd only have
to run it once instead of repeating the query for each individual file

this ticket
-----------
Now that darcs-hs has been merged into the mainline, it would still be useful to
have some timings against 2.2.0 of this darcs show contents --match 'hash X'
operation.
msg9962 (view) Author: zooko Date: 2010-02-07.16:25:28
> So fixing this would only really matter for the case where users
specifically decide to run the source browser on an older version of a
file or directory.

Just to be clear this is a common case. If you want to browse the most
recent version of the source you can just use your text editor on your
local repo, but if you want to browse an older version you need more
tool support. We often browse older versions in the Tahoe-LAFS
development process.

Another problem is that trac locks the entire database while invoking
trac-darcs. So if any user clicks on "annotate" then all users get a
timeout from the entire web site http://allmydata.org/trac/tahoe-lafs
until the "darcs annotate" completes.

Now that's clearly something that could be improved by trac itself -- if
it knew which queries to the revision control plugin were side-effect
free then it could leave the database unlocked when issuing those
commands. Then other users would not be blocked and as long as the
revision control tool was concurrency-safe for multiple simultaneous
read queries and at most one write command nothing bad would happen.

However, it would be really nice if darcs were so fast that it was
usuable for this use case even when trac locks the database for all queries.

There was some more discussion on the mailing list:
http://lists.osuosl.org/pipermail/darcs-users/2010-February/023034.html
History
Date User Action Args
2009-06-03 22:39:55zookocreate
2009-06-04 06:17:56koweysetstatus: unread -> unknown
topic: + Performance
title: please speed up "darcs query contents" -> please speed up "darcs show contents"
nosy: kowey, zooko, simon, thorkilnaur, dmitry.kurochkin
messages: + msg7881
priority: feature
2009-06-04 15:53:01zookosetnosy: kowey, zooko, simon, thorkilnaur, dmitry.kurochkin
messages: + msg7888
2009-06-04 16:00:02zookosetnosy: + lele
messages: + msg7889
2009-06-04 16:05:06koweysetnosy: kowey, zooko, lele, simon, thorkilnaur, dmitry.kurochkin
messages: + msg7890
title: please speed up "darcs show contents" -> speed up: darcs show contents --match "hash foo"
2009-06-04 16:05:49koweysetnosy: kowey, zooko, lele, simon, thorkilnaur, dmitry.kurochkin
messages: + msg7891
2009-06-04 16:07:05koweysetnosy: + gwern, beschmi
2009-06-04 22:32:29lelesetnosy: beschmi, kowey, zooko, lele, simon, thorkilnaur, gwern, dmitry.kurochkin
messages: + msg7900
title: speed up: darcs show contents --match "hash foo" -> please speed up "darcs show contents"
2009-08-06 21:14:06adminsetnosy: - beschmi
2009-08-21 17:01:21koweysetstatus: unknown -> waiting-for
topic: + Matchers
messages: + msg8317
assignedto: mornfall
nosy: + mornfall
title: please speed up "darcs show contents" -> darcs show contents --match 'hash X' is slow
superseder: + wish: performance of darcs changes <file>
priority: feature -> bug
2009-08-25 17:45:35adminsetnosy: + darcs-devel, - simon
2009-08-27 14:28:46adminsetnosy: kowey, darcs-devel, zooko, lele, thorkilnaur, gwern, dmitry.kurochkin, mornfall
2009-10-22 09:36:00koweysetstatus: waiting-for -> unknown
nosy: kowey, darcs-devel, zooko, lele, thorkilnaur, gwern, dmitry.kurochkin, mornfall
2009-10-22 09:39:09koweysetstatus: unknown -> needs-reproduction
nosy: kowey, darcs-devel, zooko, lele, thorkilnaur, gwern, dmitry.kurochkin, mornfall
messages: + msg8997
2009-11-03 17:35:49koweysetmessages: + msg9192
2010-02-07 16:25:32zookosetmessages: + msg9962
2017-07-30 23:42:02ghsetstatus: needs-reproduction -> given-up