I've been playing around with this a bit, informed by some of the things Guillaume
discovered.
Broadly, it seems that the problem is that the entire [(PatchId, [PatchMod
FileName])] list is constructed up front and only then used to build the PatchIndex,
rather than being produced and consumed incrementally.
I think that the list of patches from the repository is itself parsed lazily (otherwise
all darcs commands that operate on a PatchSet would read the entire set into memory),
so the problem must lie in the patch index code itself.
I can see a few potential reasons:
- the set of FileNames that patches2PatchMods passes around and the related use of
mapAccumL, which may be introducing an unintentional dependency on calculating the
entire output list before producing any results
- the two uses of pmods in createPatchIndexFrom (pids and mods), which may mean the
whole list is retained in memory for the second use while the first traversal runs
- the nubSeq in applyPatchMods and the way the state monad is used - these *should* be
incremental in themselves, but they make the logic a bit hard to untangle
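To illustrate the mapAccumL point with a toy example (this uses a plain Set of Strings as a stand-in for the FileName set, not the actual patches2PatchMods code): the output list of mapAccumL can be consumed lazily, but the final accumulator is only available after traversing the whole input, so any consumer that demands it defeats incremental processing.

```haskell
import Data.List (mapAccumL)
import qualified Data.Set as S

-- Hypothetical simplification of threading a set of seen names
-- through the list, as patches2PatchMods does with its FileNames.
annotate :: [String] -> (S.Set String, [(String, Bool)])
annotate = mapAccumL step S.empty
  where
    step seen name = (S.insert name seen, (name, name `S.member` seen))

-- The output list is produced lazily: taking a prefix only walks
-- that prefix of the input, so this terminates despite the
-- undefined tail ...
prefix :: [(String, Bool)]
prefix = take 2 (snd (annotate ("a" : "b" : "a" : undefined)))

-- ... but demanding the final accumulator (fst of the result) would
-- force the entire input, which is where streaming can break down.
```

So whether mapAccumL itself is the problem depends on whether anything downstream touches the final accumulator.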
Logically, it should always be possible to build the patch index up one patch at a time
- otherwise updatePatchIndex wouldn't work properly. However the code isn't quite
structured that way, so I'd like to simplify it to bring us closer to that point.
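As a rough sketch of the shape I have in mind (with made-up stand-ins for PatchId, PatchMod and the index type, not the real darcs definitions), building the index with a single strict left fold consumes the list one patch at a time, and computing everything needed in that one pass avoids retaining the list for a second traversal:

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Hypothetical, simplified stand-ins for the real darcs types.
type PatchId  = String
type FileName = String
data PatchMod a = Touch a deriving (Eq, Show)  -- real type has more cases
type PatchIndex = M.Map FileName [PatchId]     -- file -> touching patches

-- Consume the (PatchId, [PatchMod FileName]) list with a strict left
-- fold, one patch at a time, so earlier cells can be collected as we go.
buildIndex :: [(PatchId, [PatchMod FileName])] -> PatchIndex
buildIndex = foldl' addPatch M.empty

-- The per-patch step has the same shape that updatePatchIndex needs.
addPatch :: PatchIndex -> (PatchId, [PatchMod FileName]) -> PatchIndex
addPatch idx (pid, mods) = foldl' record idx mods
  where
    record m (Touch f) = M.insertWith (++) f [pid] m
```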
Another way of improving memory usage substantially might be to make sure that the
strings and bytestrings in the list are all shared with each other via hash-
consing/interning. For example, lots of filenames must be repeated a lot. However this
seems a bit fiddly to implement (you need to manage a cache) and might interfere
with laziness.
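For what it's worth, a minimal sketch of such an interning pass might look like the following (hypothetical helper names; the real version would presumably work on ByteStrings rather than Strings):

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical interning step: the cache maps each value to the first
-- copy of it we saw, so equal filenames in the output share one heap
-- object instead of being duplicated.
intern :: Ord a => M.Map a a -> a -> (M.Map a a, a)
intern cache x = case M.lookup x cache of
  Just shared -> (cache, shared)          -- reuse the cached copy
  Nothing     -> (M.insert x x cache, x)  -- first sighting: cache it

-- Thread the cache down the list. Note the downside: the cache pins
-- every distinct value for the whole traversal, which is part of the
-- management cost mentioned above.
internAll :: Ord a => [a] -> [a]
internAll = go M.empty
  where
    go _ []         = []
    go cache (x:xs) = let (cache', x') = intern cache x
                      in x' : go cache' xs
```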