darcs

Issue 1746 whatsnew -l reading files it should not (2.4)

Title: whatsnew -l reading files it should not (2.4)
Priority: bug
Status: needs-implementation
Milestone:
Resolved in:
Superseder:
Nosy List: dmitry.kurochkin, hoijarvi, kowey, mornfall, quick, tux_rocker
Assigned To:
Topics: Performance, Regression

Created on 2010-02-18.19:01:37 by quick, last changed 2020-07-31.21:46:56 by bfrk.

Messages
msg10019 Author: quick Date: 2010-02-18.19:01:34
I caught newer darcs wasting time reading files it should be ignoring.  
In worst-case scenarios this causes it to fail by running out of 
memory.  I discovered this because I happened to copy a couple of very 
large datafiles into my working tree.

$ cd [some-large-darcs-repo]
$ darcs-2.0.2 -v
2.0.2 (release)
$ darcs.net -v
2.3.1 (+ 468 patches)
$ darcs show repo
          Type: darcs
        Format: hashed
      Pristine: HashedPristine
   Num Patches: 3491
$ darcs show files | wc -l
5053
$

The darcs.net build is up to date as of today, 18 Feb 2010.

$ time darcs-2.0.2 w -l
M filea
M fileb

real    1m10.187s
user    0m31.342s
sys     0m6.154s
$ time darcs.net w -l
M filea
M fileb

real    0m53.732s
user    0m33.392s
sys     0m1.899s

And now to create the problem:

$ dd if=/dev/zero of=datafile1 bs=1024 count=1750000
[creates a 1.7GB file]
$ dd if=/dev/zero of=datafile2 bs=1024 count=1750000

$ time darcs-2.0.2 w -l
M filea
M fileb
a ./datafile1                                                        
a ./datafile2
                                                                     
real    1m4.777s
user    0m31.536s
sys     0m5.669s


Old darcs noticed the new files but other than that, it clearly didn't 
waste any time on them.  Not so with new darcs:

$ time darcs.net w -l
darcs: out of memory (requested 1808793600 bytes)

real    1m42.485s
user    0m8.956s
sys     0m28.924s

Tail of darcs.net --exact-version:

Compiled with:

HTTP-4000.0.9
array-0.2.0.0
base-4.1.0.0
bytestring-0.9.1.4
containers-0.2.0.1
directory-1.0.0.3
extensible-exceptions-0.1.1.0
filepath-1.1.0.2
hashed-storage-0.4.7
haskeline-0.6.2.2
html-1.0.1.2
mmap-0.4.1
mtl-1.1.0.2
network-2.2.1.7
old-time-1.0.0.2
parsec-3.0.1
process-1.0.1.1
random-1.0.0.1
regex-compat-0.92
terminfo-0.3.1.1
text-0.7.1.0
unix-2.3.2.0
zlib-0.5.2.0
msg10020 Author: quick Date: 2010-02-18.19:11:22
Even though the original example used extremely large files to 
demonstrate the problem, I wanted to point out that there is a negative 
aspect to this even if you don't have this extreme case.

On my system, the size of the darcs executable itself is 12MB.  This 
means that if I'm working on darcs itself, every "$ darcs w -l" spends 
time reading this 12MB file (and object files, etc.).

Following on from this, I think we should extend Haskell's intrinsic 
laziness to this area as well, for darcs summary-mode whatsnew 
operations:

  * darcs should note the existence of new files, but do nothing more 
    with those files.
  * for an existing file, finding the *first* difference in the file 
    should be sufficient to report it as modified: no need to scan the 
    rest of the file.
  * for a removed directory, everything beneath that directory must 
    have been removed as well: no need to actually check that.
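
To make the second point concrete, here is a minimal sketch in Python (not darcs code; the function name and chunk size are made up for illustration). It compares two files in fixed-size chunks and stops at the first mismatch, so memory use stays bounded however large the files are:

```python
import os

CHUNK = 64 * 1024  # hypothetical chunk size, chosen arbitrarily


def files_differ(old_path, new_path, chunk_size=CHUNK):
    """Return True as soon as the first difference is found."""
    # A size mismatch already proves a modification; no read needed.
    if os.path.getsize(old_path) != os.path.getsize(new_path):
        return True
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        while True:
            a = old.read(chunk_size)
            b = new.read(chunk_size)
            if a != b:
                return True   # first differing chunk: stop here
            if not a:
                return False  # both files ended with no difference
```

At most two chunks are held in memory at once, which is the property that matters for the 1.7GB datafiles above.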

Just some general thoughts; whatsnew is probably one of the most 
frequently issued commands so performance is key for this one.
msg10026 Author: kowey Date: 2010-02-19.10:49:09
Comments, Petr?  I know we all want to see 2.4 go out the door, but is
there any chance this is a blocker?

[Sorry Kevin if the answer to that is self-evident; I'm in skim and
delegate mode, here]
msg10034 Author: mornfall Date: 2010-02-19.15:39:50
Well, the summary mode(s) are implemented by taking the full patch and 
summarising it. There was probably some non-obvious laziness hack 
involved in older versions that avoided constructing the full patch. 
Currently, the code is probably more strict and therefore computes more. 
Of course this is something that could be improved, but is not top 
priority. Keeping large non-boring files around in a repository is very 
rare, and boring files are not read so this is a non-issue.

I think that a proper fix here would be to implement a simple summary 
mode that wouldn't rely on generating a patch sequence instead of the 
current one. Let's aim for 2.5.
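
A rough sketch of what such a patch-free summary mode could look like, written in Python rather than Haskell and not taken from darcs. It assumes, purely for illustration, that the pristine tree is available as a plain directory (darcs actually stores it hashed), and the name `summarize` and the output format are hypothetical:

```python
import filecmp
import os


def summarize(pristine, working):
    """Return summary lines by comparing two directory trees directly."""
    lines = []

    def walk(rel):
        old_names = set(os.listdir(os.path.join(pristine, rel)))
        new_names = set(os.listdir(os.path.join(working, rel)))
        for name in sorted(old_names | new_names):
            path = os.path.join(rel, name)
            shown = "./" + path.replace(os.sep, "/")
            old = os.path.join(pristine, path)
            new = os.path.join(working, path)
            if name not in old_names:
                # New entry: only the directory listing is consulted,
                # the file itself is never opened.
                lines.append("a " + shown)
            elif name not in new_names:
                # Removed entry: no need to descend into it.
                lines.append("R " + shown)
            elif os.path.isdir(old) and os.path.isdir(new):
                walk(path)
            elif not filecmp.cmp(old, new, shallow=False):
                # filecmp reads in small blocks and stops at the
                # first mismatch.
                lines.append("M " + shown)

    walk("")
    return lines
```

This is a simplification: it ignores the boring file and does not list the contents of newly added directories. The point is only that a summary needs one letter per path, so nothing forces the whole content of a new or changed file into memory.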
msg11112 Author: hoijarvi Date: 2010-05-25.15:08:13
I just ran into this and created issue 1851, which I have now marked as a dupe.

This can silently kill TortoiseDarcs; the workaround is to mark those
files as boring.
msg11297 Author: kowey Date: 2010-06-07.08:21:07
Marking needs-implementation since it looks like we have an idea how to
tackle this if I'm reading Petr correctly when he says

> I think that a proper fix here would be to implement a simple summary 
> mode that wouldn't rely on generating a patch sequence instead of the 
> current one. Let's aim for 2.5.
msg11406 Author: tux_rocker Date: 2010-06-14.06:48:37
I may make an effort to have this fixed before 2.5. While it may be a
darcser's second nature to avoid large files in a repo, it's not for
others. It makes darcs look stupid if the mere presence of a few
1.7-gigabyte files makes it crash.
msg11608 Author: tux_rocker Date: 2010-06-27.18:33:27
I was sounding very brave before, but I'm afraid I'm not able to live up
to it. If anyone else would be so kind as to take a stab at it, we'd be
very grateful.
msg14757 Author: markstos Date: 2011-10-13.13:03:24
It's not a regression since 2.5, so bumping to 2.10.
History
Date                 User        Action  Args
2010-02-18 19:01:37  quick       create
2010-02-18 19:11:25  quick       set     messages: + msg10020
2010-02-19 10:49:15  kowey       set     topic: + Performance, Regression
                                         nosy: + tux_rocker, kowey, mornfall
                                         messages: + msg10026
                                         title: reading files it should not -> whatsnew -l reading files it should not (2.4)
2010-02-19 10:49:25  kowey       set     status: unknown -> needs-reproduction
2010-02-19 10:49:48  kowey       set     topic: + Target-2.4
2010-02-19 15:40:01  mornfall    set     topic: + Target-2.5, - Target-2.4
                                         messages: + msg10034
2010-03-22 12:25:48  kowey       link    issue1762 superseder
2010-05-25 15:05:17  hoijarvi    link    issue1851 superseder
2010-05-25 15:08:14  hoijarvi    set     nosy: + hoijarvi
                                         messages: + msg11112
2010-06-07 08:21:08  kowey       set     status: needs-reproduction -> needs-implementation
                                         nosy: - darcs-devel
                                         messages: + msg11297
2010-06-14 06:48:37  tux_rocker  set     assignedto: tux_rocker
                                         messages: + msg11406
2010-06-15 20:52:06  admin       set     milestone: 2.5.0
2010-06-15 20:59:39  admin       set     topic: - Target-2.5
2010-06-27 18:33:28  tux_rocker  set     assignedto: tux_rocker ->
                                         messages: + msg11608
2010-07-25 14:29:33  tux_rocker  set     milestone: 2.5.0 -> 2.8.0
2011-10-13 13:03:25  markstos    set     messages: + msg14757
                                         milestone: 2.8.0 -> 2.10.0
2015-04-18 17:39:41  gh          set     milestone: 2.10.0 -> 2.12.0
2020-07-31 21:46:56  bfrk        set     milestone: 2.12.0 ->