darcs

Patch 1134 resolve issue1987: Garbage collection for inventories ...

Title resolve issue1987: Garbage collection for inventories ...
Superseder Nosy List mdiaz, simon
Related Issues
Status accepted Assigned To
Milestone

Created on 2014-04-06.17:56:05 by mdiaz, last changed 2014-04-10.14:03:08 by gh.

Files
File name Status Uploaded Type Edit Remove
accept-issue1987_-regression-tests-for-garbage-collection-for.dpatch mdiaz, 2014-04-09.01:27:37 application/x-darcs-patch
accept-issue1987_-regression-tests-for-garbage-collection-for.dpatch mdiaz, 2014-04-10.01:53:27 application/x-darcs-patch
patch-preview.txt mdiaz, 2014-04-06.17:56:05 text/x-darcs-patch
patch-preview.txt mdiaz, 2014-04-07.19:30:01 text/x-darcs-patch
patch-preview.txt mdiaz, 2014-04-08.10:40:49 text/x-darcs-patch
patch-preview.txt mdiaz, 2014-04-09.01:27:37 text/x-darcs-patch
patch-preview.txt mdiaz, 2014-04-10.01:53:27 text/x-darcs-patch
resolve-issue1987_-garbage-collection-for-inventories-and-patches_.dpatch mdiaz, 2014-04-06.17:56:05 application/x-darcs-patch
resolve-issue1987_-garbage-collection-for-inventories-and-patches_.dpatch mdiaz, 2014-04-07.19:30:01 application/x-darcs-patch
resolve-issue1987_-garbage-collection-for-inventories-and-patches_.dpatch mdiaz, 2014-04-08.10:40:49 application/x-darcs-patch
unnamed mdiaz, 2014-04-06.17:56:05
unnamed mdiaz, 2014-04-07.19:30:01
unnamed mdiaz, 2014-04-08.10:40:49
unnamed mdiaz, 2014-04-09.01:27:37
unnamed mdiaz, 2014-04-10.01:53:27
See mailing list archives for discussion on individual patches.
Messages
msg17269 (view) Author: mdiaz Date: 2014-04-06.17:56:05
Updated [patch1128]: now I check that the inventory file exist 
before trying to read it (function readSafeInventoryPrivate).
1 patch for repository http://darcs.net/screened:

Sun Apr  6 14:48:28 ART 2014  marcio.diaz@gmail.com
  * resolve issue1987: Garbage collection for inventories and patches.
Attachments
msg17278 (view) Author: gh Date: 2014-04-07.18:21:20
Hi Marcio,

first, I recommend you to set your author string with the pattern
"Full Name <email@aaaa.aa>" since we have some scripts that catch that
format to generate the news bulletin. And it's more standard.

Also when sending an amended version of a patch, it is better to send
it with the same ticket number, in the present case it would have
been:

     darcs send --mail --subject '[patch1128]'

Let's continue with that ticket now (1134).

Code review (I'm doing it in an order that makes more sense to me):

> hunk ./src/Darcs/UI/Commands/Optimize.hs 167
> - "\n" ++
> - "There is one more optimization which CAN NOT be performed by this\n" ++
> - "command.  Every time your record a patch, a new inventory file is\n" ++
> - "written to `_darcs/inventories/`, and old inventories are never reaped.\n" ++
> - "\n" ++
> - "If `_darcs/inventories/` is consuming a relatively large amount of\n" ++
> - "space, you can safely reclaim it by using `darcs get` to make a\n" ++
> - "complete copy of the repo.  When doing so, don't forget to copy over\n" ++
> -  "any unsaved changes you have made to the working tree or to\n" ++
> - "unversioned files in `_darcs/prefs/` (such as `_darcs/prefs/author`).\n"
> + "\n"
> hunk ./src/Darcs/UI/Commands/Optimize.hs 198
> -    cleanRepository repository -- garbage collect pristine.hashed directory
> +    cleanRepository repository -- garbage collect pristine.hashed, inventories and patches directories

OK.

> hunk ./src/Darcs/Repository/Internal.hs 182
> +                            , cleanInventories
> +                            , cleanPatches
> hunk ./src/Darcs/Repository/Internal.hs 895
> -    HvsO { hashed = HashedRepo.cleanPristine repository,
> +    HvsO { hashed = cleanHashedRepo repository,
> hunk ./src/Darcs/Repository/Internal.hs 897
> +           where
> +            cleanHashedRepo r = do
> +              HashedRepo.cleanPristine r
> +              HashedRepo.cleanInventories r
> +              HashedRepo.cleanPatches r

OK, I'm fine with cleaning everything at once.

> hunk ./src/Darcs/Repository/HashedRepo.hs 22
> +    , cleanInventories
> +    , cleanPatches
> hunk ./src/Darcs/Repository/HashedRepo.hs 57
> +import qualified Data.Set as Set
> hunk ./src/Darcs/Repository/HashedRepo.hs 64
> -import System.Directory ( createDirectoryIfMissing )
> +import System.Directory ( createDirectoryIfMissing, getDirectoryContents, doesFileExist )
> hunk ./src/Darcs/Repository/HashedRepo.hs 85
> +    , removeFileMayNotExist
> hunk ./src/Darcs/Repository/HashedRepo.hs 140
> +patchesDir, patchesDirPath :: String
> +patchesDir = "patches"
> +patchesDirPath = makeDarcsdirPath patchesDir
> +

OK.

> hunk ./src/Darcs/Repository/HashedRepo.hs 241
> +-- |cleanInventories removes any obsolete (unreferenced) files in the
> +-- inventories directory
> +cleanInventories :: Repository p wR wU wT -> IO ()
> +cleanInventories _ = do
> +    debugMessage "Cleaning out inventories..."
> +    hs <- listInventories

This calls the function that returns you a list of inventory hashes,
jumping from one "Starting With:" to another, starting with the hash
given in hashed_inventory. This will give us the good inventory files,
the ones we want to keep.

> +    fs <- filter okayHash <$> getDirectoryContents inventoriesDirPath

This lists the files of _darcs/inventories , only keeping those that
have the name length of a hashed file.

A side-effect of that is that we're not going to clean any random file
that could have been added manually (_darcs/inventories/SOME_FILE).
Maybe that's not a problem.

> +    mapM_ (removeFileMayNotExist . (inventoriesDirPath </>))
> +        (unset $ (set fs) `Set.difference` (set hs))
> +    where   set = Set.fromList . map BC.pack
> +            unset = map BC.unpack . Set.toList
> +

And we're removing those files. Seems ok.


> +-- |cleanPatches removes any obsolete (unreferenced) files in the
> +-- patches directory
> +cleanPatches :: Repository p wR wU wT -> IO ()
> +cleanPatches _ = do
> +    debugMessage "Cleaning out patches..."
> +    hs <- listPatches
> +    fs <- filter okayHash <$> getDirectoryContents patchesDirPath
> +    mapM_ (removeFileMayNotExist . (patchesDirPath </>))
> +        (unset $ (set fs) `Set.difference` (set hs))
> +    where   set = Set.fromList . map BC.pack
> +            unset = map BC.unpack . Set.toList
> +

OK so what happens is that we only delete files that have the
"shape" of a hashed file (since we filter with `okayHash`).

Hence files like pending, pending.tentative, unrevert are not deleted, good.
The alternative would be to have a special "white list" of files we
don't want to delete,
and not use `okayHash`.

> hunk ./src/Darcs/Repository/HashedRepo.hs 457
> -    inv <- skipPristine <$> gzFetchFilePS (dir </> invName) Uncachable
> -    readInventoryFromContent (toPath dir ++ "/" ++ darcsdir ++ invName) inv
> +  inv <- skipPristine <$> gzFetchFilePS (dir </> invName) Uncachable
> +  readInventoryFromContent (toPath dir ++ "/" ++ darcsdir ++ invName) inv
> +

You can remove this hunk with `darcs amend --unrecord`.



> hunk ./src/Darcs/Repository/HashedRepo.hs 601
> -        fst <$> readInventoryPrivate  invDir inv
> +        fst <$> readSafeInventoryPrivate  invDir inv
> hunk ./src/Darcs/Repository/HashedRepo.hs 608
> +    readSafeInventoryPrivate dir invName = do
> +        b <- doesFileExist (dir </> invName)
> +        if b then readInventoryPrivate dir invName
> +            else return (Nothing, [])
> +
> +

You should define another function `listInventories`
that call `readSafeInventoryPrivate` because there are other
functions that call `listInventories` and you precisely want them
to retrieve abstent inventories, otherwise darcs believes that your
repository is almost empty... (this happens when
you're working in a lazy repository, the directory `_darcs/inventories`
is actually empty right after cloning).

You could call that separate function `listInventoriesLocal`,
and don't forget to update the Haddock comment.

Also I'm not sure about the name `readSafeInventoryPrivate` since
it's not a matter of safety really; you could call it
`readInventoryLocalPrivate`.

> +listPatches :: IO [String]
> +listPatches = do
> +    inventory <- readSafeInventoryPrivate darcsdir hashedInventory

I'd also name it `listPatchesLocal` and put it a comment as for
`listInventories`.

Same comment as above about `readSafeInventoryPrivate`.

> +    followStartingWiths (fst inventory) (getPatches inventory)

You're getting the hashes of the patches of hashed_inventory and
continuing with the next inventory in the "Starting With:" chain; good.

> +    where
> +        followStartingWiths Nothing patches = return patches

ok.

> +        followStartingWiths (Just startingWith) patches = do
> +            inv <- readSafeInventoryPrivate inventoriesDirPath startingWith
> +            (patches++) <$> followStartingWiths (fst inv) (getPatches inv)
> +
> +        getPatches inv = map snd (snd inv)

OK.

> +
> +        readSafeInventoryPrivate dir invName = do
> +            b <- doesFileExist (dir </> invName)
> +            if b then readInventoryPrivate dir invName
> +                else return (Nothing, [])
> +

Same remark as above.
msg17282 (view) Author: mdiaz Date: 2014-04-07.19:30:01
Thanks Guillaume, I updated the patch according to your comments.

1 patch for repository http://darcs.net/screened:

Mon Apr  7 16:14:05 ART 2014  marcio.diaz@gmail.com
  * resolve issue1987: Garbage collection for inventories and patches.
Attachments
msg17283 (view) Author: gh Date: 2014-04-07.19:58:57
Good!

You could add to the haddock comments of functions
`listInventoriesLocal` and `listPatchesLocal`, an extra sentence saying
"This function does not attempt to download missing inventory files."

Now can you send, in a separate patch, a shell script that tests that
unreferenced files in directories
`_darcs/{pristine.hashed|inventories|patches}` are properly cleaned?
msg17284 (view) Author: gh Date: 2014-04-07.19:59:20
(In a separate patch, but in the same bundle, if that wasn't clear.)
msg17285 (view) Author: simon Date: 2014-04-07.22:33:31
I think this is needed quite badly by darcs hub, where multiple repos
are accumulating duplicate patch files excessively.

Can this patch help clean up existing repos, as well as prevent the
build-up in new ones ? How should I test it ?
msg17286 (view) Author: gh Date: 2014-04-07.22:44:30
Simon, the current behaviour of darcs is to clean reposirories only when
the command “optimize” is run, see <http://bugs.darcs.net/issue687>,
which ended with the following patch by David Roundy:

" resolve issue687: don't clean out pristine cache except when optimizing. 
The trouble is that we can never tell when someone might be waiting to grab
one of these older files.  Atomicity doesn't gain us much if we drop the
old files immediately.  The downside is that now repositories will
monotonically grow until an optimize is performed.  :("
msg17288 (view) Author: simon Date: 2014-04-07.23:00:13
Nice link.

I don't see any reduction in the number of files in patches/ after
running darcs optimize or darcs optimize --pristine.
msg17289 (view) Author: simon Date: 2014-04-07.23:35:44
Correction - after applying this patch, darcs optimize *does* clean
patches/ . Hurrah!

for d in _darcs/{inventories,patch_index,patches}; do find $d | wc -l; done
darcs2.9.8+1134 optimize
for d in _darcs/{inventories,patch_index,patches}; do find $d | wc -l; done
262
5
35837
Done optimizing!
13
5
2454

Here are the overall repo sizes:

cd ..; du -s *
305532  repo-original
33024   repo-optimized-with-patch1134
15540   repo-copied-with-get

This is most helpful, even though the optimized copy is still reported
as twice as large as a freshly got copy. Looking more closely with du
shows only a 5M difference between their respective _darcs directories,
and both have the same number of files. I assume this must be some kind
of normal filesystem/du reporting issue.

So.. how safe is it to run darcs optimize with this patch frequently on
all darcs hub repos ?
msg17290 (view) Author: mdiaz Date: 2014-04-08.10:40:49
Here goes the updated patch and the corresponding regression test.

2 patches for repository http://darcs.net/screened:

Mon Apr  7 18:16:43 ART 2014  Marcio Diaz <marcio.diaz@gmail.com>
  * resolve issue1987: Garbage collection for inventories and patches.
  

Tue Apr  8 07:38:38 ART 2014  Marcio Diaz <marcio.diaz@gmail.com>
  * Accept issue1987: regression tests for garbage collection for
  inventories and patches.
Attachments
msg17292 (view) Author: simon Date: 2014-04-08.13:23:25
+1. Very clear and informative. 

-Simon
msg17293 (view) Author: owst Date: 2014-04-08.14:24:41
Nice patch! A few comments:

I think there is a large amount of duplication between
cleanPatches/cleanInventories - can you abstract over the differences
and reduce duplication?

Why do you convert the [FilePath] to [ByteString] before converting to a
set, only to convert back after you've taken the intersection?

I think using a white list of files rather than okayHash would make the
intent more obvious.
msg17296 (view) Author: gh Date: 2014-04-08.15:46:54
> I think using a white list of files rather than okayHash would make the
> intent more obvious.

I just realized that Darcs.UI.Commands.Optimize.doOptimizeHTTP
actually did that :-)

So Marcio,  I suggest that this list of exceptions gets defined in
Darcs.Repository.HashedRepo (as, say, 'specialPatches') and used in
both doOptimizeHTTP and cleanPatches. Also, what happens if you also
stop using okayHash for cleaning _darcs/inventories/ ?
msg17298 (view) Author: gh Date: 2014-04-08.17:40:07
> I just realized that Darcs.UI.Commands.Optimize.doOptimizeHTTP
> actually did that :-)

Sorry, that was true at the time doOptimizeHTTP was introduced but
currently no more.
The code was:

      ps <- dirContents' "patches" $ \x -> all (x /=) ["unrevert", "pending",
    "pending.tentative"]
msg17306 (view) Author: mdiaz Date: 2014-04-09.01:27:37
New patches!, I hope they are better than the previous.

2 patches for repository http://darcs.net/screened:

Tue Apr  8 22:12:18 ART 2014  Marcio Diaz <marcio.diaz@gmail.com>
  * Accept issue1987: regression tests for garbage collection for
  inventories and patches.
  

Tue Apr  8 22:12:56 ART 2014  Marcio Diaz <marcio.diaz@gmail.com>
  * resolve issue1987: Garbage collection for inventories and patches.
Attachments
msg17314 (view) Author: gh Date: 2014-04-09.18:24:47
I think we're almost done:

* In `listPatchesLocal`, I'd rather you used `readInventoryPrivate`
instead of `readInventoryLocalPrivate`. The reason is that
`_darcs/hashed_inventory` should *always* be present in a repository.
In the contrary case, I'd rather have everything blow up sooner than
later.

*  Owen's question: "Why do you convert the [FilePath] to [ByteString]
before converting to a
set, only to convert back after you've taken the intersection?"

* Shell test: good!
msg17315 (view) Author: owst Date: 2014-04-09.18:53:20
Further to that: both cleanInventories and cleanPatches define
where-bound set/unset expressions. Make them top level if they're really
needed.

Also, you use patchesDir and patchesDirPath, but inside
filterDirContents, you pass (darcsdir </> d), which presumably does the
same thing as makeDarcsDirPath, which is used to make patchesDirPath
different to patchesDir?
msg17321 (view) Author: mdiaz Date: 2014-04-10.01:53:27
Changes:

* makeDarcsdirPath now uses '</>' instead of  '++ "/" ++',
  I don't know if this is correct.
* I replaced 'darcsdir </> d' by 'makeDarcsdirPath d'.
* The functions set/unset were moved to  the top level.
  They may be needed because byteStrings makes optimize faster
  (at least a fraction of second over my darcs repo with 
  70 inventories and 10k patches).
  Also, Set.difference is a lot faster than \\ (a few seconds).
  See Petr patch: 
  http://article.gmane.org/gmane.comp.version-control.darcs.user/21206/.


2 patches for repository http://darcs.net/screened:

Tue Apr  8 22:12:18 ART 2014  Marcio Diaz <marcio.diaz@gmail.com>
  * Accept issue1987: regression tests for garbage collection for
  inventories and patches.
  

Wed Apr  9 19:30:09 ART 2014  Marcio Diaz <marcio.diaz@gmail.com>
  * resolve issue1987: Garbage collection for inventories and patches.
Attachments
msg17323 (view) Author: gh Date: 2014-04-10.14:00:08
> * makeDarcsdirPath now uses '</>' instead of  '++ "/" ++',
>   I don't know if this is correct.

Yes it is, see <http://hackage.haskell.org/package/filepath-1.3.0.2/docs/System-FilePath-Posix.html#v:combine>

Thanks for the changes, bundle accepted!
History
Date User Action Args
2014-04-06 17:56:05mdiazcreate
2014-04-07 18:21:21ghsetmessages: + msg17278
2014-04-07 19:30:01mdiazsetfiles: + patch-preview.txt, resolve-issue1987_-garbage-collection-for-inventories-and-patches_.dpatch, unnamed
messages: + msg17282
2014-04-07 19:58:57ghsetmessages: + msg17283
2014-04-07 19:59:20ghsetmessages: + msg17284
2014-04-07 21:11:16simonsetnosy: + simon
2014-04-07 22:33:31simonsetmessages: + msg17285
2014-04-07 22:44:30ghsetmessages: + msg17286
2014-04-07 23:00:13simonsetmessages: + msg17288
2014-04-07 23:35:45simonsetmessages: + msg17289
2014-04-08 10:40:50mdiazsetfiles: + patch-preview.txt, resolve-issue1987_-garbage-collection-for-inventories-and-patches_.dpatch, unnamed
messages: + msg17290
2014-04-08 13:23:26simonsetmessages: + msg17292
2014-04-08 14:24:42owstsetmessages: + msg17293
2014-04-08 15:46:54ghsetmessages: + msg17296
2014-04-08 17:40:08ghsetmessages: + msg17298
2014-04-09 01:27:37mdiazsetfiles: + patch-preview.txt, accept-issue1987_-regression-tests-for-garbage-collection-for.dpatch, unnamed
messages: + msg17306
2014-04-09 18:24:47ghsetmessages: + msg17314
2014-04-09 18:53:20owstsetmessages: + msg17315
2014-04-10 01:53:28mdiazsetfiles: + patch-preview.txt, accept-issue1987_-regression-tests-for-garbage-collection-for.dpatch, unnamed
messages: + msg17321
2014-04-10 14:00:09ghsetmessages: + msg17323
2014-04-10 14:03:08ghsetstatus: needs-screening -> accepted