Created on 2021-11-04.23:32:13 by bfrk, last changed 2023-03-27.08:04:56 by ganesh.
See mailing list archives
for discussion on individual patches.
msg22919 (view) |
Author: bfrk |
Date: 2021-11-04.23:32:12 |
|
Refactors related to patch/inventory/pristine hashes.
17 patches for repository http://darcs.net/screened:
patch 7984dfa59af4ab318bbc1208bd54baa4e290b750
Author: Ben Franksen <ben.franksen@online.de>
Date: Sun Mar 7 17:25:25 CET 2021
* use ShortByteString for Hash content
This should bring down memory use and decrease fragmentation.
patch d56b7945d20bd2e29a0a2131d255394c17a63d03
Author: Ben Franksen <ben.franksen@online.de>
Date: Sun Mar 7 11:14:10 CET 2021
* cache: refactor cacheHash and remove export
patch 091c755d6bde86d0e0c1dd7b0a60aacb9fc3fab2
Author: Ben Franksen <ben.franksen@online.de>
Date: Mon Mar 8 10:36:08 CET 2021
* avoid re-validation of already validated patch hashes
patch 64ebee50e76201dc651da2b632a905d89f0db193
Author: Ben Franksen <ben.franksen@online.de>
Date: Sun Mar 14 09:34:56 CET 2021
* validate hashes in inventories on the ByteString side
For reasons I haven't been able to figure out this drastically reduces
memory consumption.
patch fbd443da22fdb6ec3012b95f2ba70059f2af5007
Author: Ben Franksen <ben.franksen@online.de>
Date: Sat Mar 27 08:53:24 CET 2021
* move hash validation from D.R.Inventory to D.Util.Cache
patch 1a649931c95cf11bde605b324738cf64fe01602f
Author: Ben Franksen <ben.franksen@online.de>
Date: Mon Apr 12 07:54:17 CEST 2021
* make possible non-existence of hashes explicit
This removes the NoHash constructor and (more or less mechanically) replaces
Hash with Maybe Hash. This exposes lots of situations where we missed out on
more precise typing i.e. where we know we have a hash but still work with a
Maybe Hash. This patch doesn't clean these up, it just allows and encourages
us to do so.
patch aa120e78415d2f17b0df30b196ea7d6450f0a46c
Author: Ben Franksen <ben.franksen@online.de>
Date: Fri Apr 16 11:53:52 CEST 2021
* use decoding to validate hashes
This also delegates the implementation of okayHash to okayHashB.
patch 08d3835579f633bf1b03250a00a54b9c79f91cbb
Author: Ben Franksen <ben.franksen@online.de>
Date: Wed Mar 24 03:32:16 CET 2021
* fix generator for hashes in D.T.R.Inventory
patch a5e7aac823fbb0e216473d93e714dfc9b477e428
Author: Ben Franksen <ben.franksen@online.de>
Date: Sat Apr 17 12:16:23 CEST 2021
* cleanup parsing and unparsing of hashed directories
The parser now uses Darcs.Util.Parser. The function decodeWhiteName which is
used by the parser is now explicit about decoding errors. In contrast, the
unparsing is actually *not* supposed to fail: it has existing hashes for the
subitems as a precondition; indeed we call hash update functions in various
places before calling darcsFormatDir. As it stands this is quite brittle and
should be improved but this has to wait for another patch.
patch 703d61283b18b26fe7b1a72a123180706a625608
Author: Ben Franksen <ben.franksen@online.de>
Date: Sun Mar 7 11:13:27 CET 2021
* cache: inline copyFilesUsingCache
patch cc644a31e458ba99c533b789ecf3e50664e16712
Author: Ben Franksen <ben.franksen@online.de>
Date: Sat Mar 27 06:48:14 CET 2021
* move D.R.Inventory to D.R.Inventory.Format
This is so that we can move the code conerned with reading and writing
inventories to D.R.Inventory w/o mixing the inventory format with its
interpretation.
patch 0d657b2f0ad029650cf514d3d62af20c835082c5
Author: Ben Franksen <ben.franksen@online.de>
Date: Fri Apr 16 13:24:03 CEST 2021
* major refactor: internally store valid hashes in parsed form
The main theme here is the "parse, don't validate" mantra. Storing hashes
(optionally including the content size) in parsed form is memory efficient
and gives us much better typing. The code for this has been moved into its
own module Darcs.Util.ValidHash with a safe API. PatchInfoAnd now stores its
hash as a PatchHash and the Tagged sections of a PatchSet store their hash
as an InventoryHash. The HashedDir is now inferred from the type of the
hash, which means we no longer have to pass it to the function exported by
Darcs.Util.Cache, which simplifies the API and makes it more type safe (yet
note that not a single 'forall' had to be added to type signatures).
This refactor exposed a strange HashedDir mismatch in the handling of packs
that I temporarily marked as FIXME. I suspect that some files are not placed
in the right directories, resulting in a loss of efficiency when cloning
packed repos. This needs further investigation.
patch b45e0e2afe32fc6ea1fc9933b9fb5ee552a14c61
Author: Ben Franksen <ben.franksen@online.de>
Date: Fri Jul 16 17:06:29 CEST 2021
* fix a FIXME: HashedDir mismatch in the handling of packs
It turned out that the HashedDir parameter wasn't used anywhere, so this
patch removes it. This makes sense since filepaths in the pack files already
contain the subdirectory of all files.
patch 7a591395b0b66b6754ee145489772bbedf5c09d2
Author: Ben Franksen <ben.franksen@online.de>
Date: Sat Mar 27 07:18:05 CET 2021
* move reading and writing of inventories from D.R.Hashed to D.R.Inventory
This is a pure code move, except that I also cleaned up the layout of the
import lists in D.R.Hashed.
patch 057ac977b25543eb7bfaf0e86b2965cd43e553f1
Author: Ben Franksen <ben.franksen@online.de>
Date: Wed Apr 21 10:31:22 CEST 2021
* refactor updateHashes in TreeMonad
The central change here is in the type of updateHash from
TreeItem m -> m (Maybe Hash)
to
Maybe (TreeItem m -> m Hash).
Indeed, for a concrete instantiation we either have a total hash function or
else no hash function at all. In the latter case this change avoids calling
a procedure that always returns Nothing, potentially recursively over a
large tree. It also gives us more precise typing.
Everything else follows from that, with the exception of 'flushItem' which
was rewritten to make it clearer what happens: we first update the hash,
then update the item on disk and then replace the item in the tree we are
tracking.
patch 120ad503225a42ef147135dee5ef16877a2c4035
Author: Ben Franksen <ben.franksen@online.de>
Date: Sun May 30 10:35:42 CEST 2021
* cache: eliminate hashedFilePathReadOnly
Since we pass the CacheLoc, we can make the distinction between bucketed
(writable) and non-bucketed (not writable) cache inside hashedFilePath.
patch 1fc26cf24773f455adb4776ff2525b59e56f4456
Author: Ben Franksen <ben.franksen@online.de>
Date: Sun Jul 18 16:15:37 CEST 2021
* fix in Darcs.Patch.PatchInfoAnd.fmapH: must invalidate hash
The function we map over the contained patch may modify it, so any hash
we may have had is now invalid.
Attachments
|
msg23178 (view) |
Author: ganesh |
Date: 2023-03-24.23:35:34 |
|
Incrementally reviewing
> * use ShortByteString for Hash content
OK
> * cache: refactor cacheHash and remove export
OK
> * avoid re-validation of already validated patch hashes
OK
> * validate hashes in inventories on the ByteString side
OK (I also can't see why it reduces memory consumption)
|
msg23179 (view) |
Author: ganesh |
Date: 2023-03-25.09:34:38 |
|
> * move hash validation from D.R.Inventory to D.Util.Cache
OK
> * make possible non-existence of hashes explicit
Nice. I vaguely remember wanting to do this for ages, can't remember
why I didn't.
> * use decoding to validate hashes
OK
> * use decoding to validate hashes
OK
|
msg23182 (view) |
Author: ganesh |
Date: 2023-03-25.21:14:53 |
|
> * cleanup parsing and unparsing of hashed directories
OK - I didn't check the new parser in detail but if it was broken
we'd find out pretty quickly I think.
> * cache: inline copyFilesUsingCache
OK
> * move D.R.Inventory to D.R.Inventory.Format
OK
|
msg23185 (view) |
Author: ganesh |
Date: 2023-03-26.16:41:11 |
|
> * major refactor: internally store valid hashes in parsed form
Nice! Looks good, though I haven't checked every line in detail.
Two questions/thoughts:
> hunk ./harness/Darcs/Test/HashedStorage.hs 472
> - - let darcs = dir </> "_darcs"
> - - h_inventory = darcs </> "hashed_inventory"
> - - repo <- doesDirectoryExist darcs
> - - unless repo $ fail $ "Not a darcs repository: " ++ dir
> - - isHashed <- doesFileExist h_inventory
> - - if isHashed
I guess support for non-hashed repos here was obsolete?
> +class (Eq h, IsSizeHash h) => ValidHash h where
> + encodeValidHash :: h -> String
> + decodeValidHash :: String -> Maybe h
[..]
> + -- default definitions
> + encodeValidHash = encodeSizeHash . getSizeHash
> + decodeValidHash = fmap fromSizeHash . decodeSizeHash
These don't ever seem to be overridden, maybe remove them from
the class?
|
msg23186 (view) |
Author: ganesh |
Date: 2023-03-26.17:31:27 |
|
> * fix a FIXME: HashedDir mismatch in the handling of packs
OK
> * move reading and writing of inventories from D.R.Hashed to
D.R.Inventory
OK (not checked)
|
msg23187 (view) |
Author: ganesh |
Date: 2023-03-26.17:33:40 |
|
> * refactor updateHashes in TreeMonad
> -darcsHash :: Monad m => TreeItem m -> m (Maybe Hash)
> -darcsHash (SubTree t) = return (Just (darcsTreeHash t))
> -darcsHash (File blob) = Just . sha256 <$> readBlob blob
> -darcsHash _ = return Nothing
> +darcsHash :: Monad m => TreeItem m -> m Hash
> +darcsHash (SubTree t) = return (darcsTreeHash t)
> +darcsHash (File blob) = sha256 <$> readBlob blob
> +darcsHash (Stub unstub _) = darcsTreeHash <$> unstub
I like your new type signature much better but there is actually a
theoretical behaviour change here since before we would
always return `Nothing` for a stub. Does it matter?
|
msg23188 (view) |
Author: ganesh |
Date: 2023-03-26.21:48:28 |
|
> * cache: eliminate hashedFilePathReadOnly
OK
> * fix in Darcs.Patch.PatchInfoAnd.fmapH: must invalidate hash
This is an interesting question. The general idea of `fmap`, that it
leaves structure untouched, clashes with the type invariants here.
I also struggle a bit to see what kind of legitimate function could
leave the witnesses untouched but alter the contents that would get
hashed, at least in a way that would later matter.
There are two actual uses of this function:
1. Reading the legacy rebase format (extractOldStyleRebase), where
the work is definitely structural in nature. Losing the hashes here
might be a big deal if this wasn't for compatibility only.
2. Filtering patches for log, which would change the content to be
hashed, so trashing the hashes probably is safer.
Maybe we should just get rid of these helpers and force the caller
to decide? It is a bunch of nasty boilerplate though and would also
force us to export more data constructors.
|
msg23192 (view) |
Author: bfrk |
Date: 2023-03-26.23:34:35 |
|
Partial response ;-)
>> hunk ./harness/Darcs/Test/HashedStorage.hs 472
>
> I guess support for non-hashed repos here was obsolete?
Yes, that was the idea.
>> +class (Eq h, IsSizeHash h) => ValidHash h where
>> + encodeValidHash :: h -> String
>> + decodeValidHash :: String -> Maybe h
> [..]
>> + -- default definitions
>> + encodeValidHash = encodeSizeHash . getSizeHash
>> + decodeValidHash = fmap fromSizeHash . decodeSizeHash
>
> These don't ever seem to be overridden, maybe remove them from
> the class?
Yes. Unfortunately this involves a bit of churn because it requires
additional imports in all client modules; I suspect that was the reason I
did not do this at the time.
Anyway, will send a patch, but it will depend on another two (closely
related) patches, one of them a bug fix.
>> * refactor updateHashes in TreeMonad
>
>> -darcsHash :: Monad m => TreeItem m -> m (Maybe Hash)
>> -darcsHash (SubTree t) = return (Just (darcsTreeHash t))
>> -darcsHash (File blob) = Just . sha256 <$> readBlob blob
>> -darcsHash _ = return Nothing
>> +darcsHash :: Monad m => TreeItem m -> m Hash
>> +darcsHash (SubTree t) = return (darcsTreeHash t)
>> +darcsHash (File blob) = sha256 <$> readBlob blob
>> +darcsHash (Stub unstub _) = darcsTreeHash <$> unstub
>
> I like your new type signature much better but there is actually a
> theoretical behaviour change here since before we would
> always return `Nothing` for a stub. Does it matter?
The only place this is used is as parameter to addMissingHashes, which
already expands stubs before applying it. I could have turned it into a
partial function instead, not sure if that's better though.
|
msg23193 (view) |
Author: bfrk |
Date: 2023-03-27.00:22:01 |
|
>> * fix in Darcs.Patch.PatchInfoAnd.fmapH: must invalidate hash
>
> This is an interesting question. The general idea of `fmap`, that it
> leaves structure untouched, clashes with the type invariants here.
>
> I also struggle a bit to see what kind of legitimate function could
> leave the witnesses untouched but alter the contents that would get
> hashed, at least in a way that would later matter.
I will not dwell much on the question of witnesses, which is interesting but leads us
too far astray here. You have a point but sticking to that principle in a strict way
would be impractical IMO, think of e.g. coalescing or canonizing.
Now, hashes are a different matter. They are supposed to correspond to the on-disk
representation. If you change content and don't invalidate the hash, then the hash is
*wrong*. This may lead to the new content not being written to disk, which is
certainly fatal. If "losing" a hash that no longer corresponds to the patch's content
leads to a problem, then the code that is using that hash is doing something seriously
wrong.
|
msg23194 (view) |
Author: bfrk |
Date: 2023-03-27.00:36:27 |
|
Here is the promised follow-up, together with two dependencies that are
closely enough related that adding them here makes more sense than to rebase
and send them separately.
3 patches for repository http://darcs.net/screened:
patch c293dbf55210b151de3266773f3f53fb3bc5e5fa
Author: Ben Franksen <ben.franksen@online.de>
Date: Mon Mar 27 00:26:12 CEST 2023
* bugfix and cleanup for maximum size in hashes
The previous size limit was too small by a factor of 10. To avoid such
stupid mistakes, we now calculate the limit from the number of digits and
use a constant for both.
patch 9af072b5a77fef1ca8b8b3cce1ee00dc22d2c31b
Author: Ben Franksen <ben.franksen@online.de>
Date: Sat Dec 17 10:19:34 CET 2022
* add proper parsing for (possibly size-prefixed) hashes
patch 01f043d25e45ee15c1fc58e5579b45936e02e288
Author: Ben Franksen <ben.franksen@online.de>
Date: Sun Mar 26 23:15:27 CEST 2023
* turn some methods of ValidHash into plain functions
Attachments
|
msg23195 (view) |
Author: bfrk |
Date: 2023-03-27.01:27:16 |
|
Okay, I see what you meant with "structural" in the case of
extractOldStyleRebase: here the content really doesn't change, the
mapped function W.fromRebasing merely undoes the wrapping. So in
this case, not invalidating the hash would be correct and slightly
more efficient. This is a rather special case, though; in general
the mapped function can (and in case of log does) change the
content. I prefer to be on the safe side here.
|
msg23196 (view) |
Author: ganesh |
Date: 2023-03-27.07:00:44 |
|
> The only place this is used is as parameter to addMissingHashes,
> which already expands stubs before applying it. I could have turned
> it into a partial function instead, not sure if that's better though.
OK, if it's not a behaviour change I think the new code is fine.
> I will not dwell much on the question of witnesses, which is
> interesting but leads us too far astray here. You have a point
> but sticking to that principle in a strict way would be impractical
> IMO, think of e.g. coalescing or canonizing.
Actually those are good examples of keeping witnesses but changing
content that I had missed. (If we think of the witnesses as
representing tree states.)
I think my main concern was the implications of the `fmap` name.
But thinking about it some more I'm not even sure that concern was
correct, the job of fmap is to preserve semantics not syntax.
|
|
Date |
User |
Action |
Args |
2021-11-04 23:32:13 | bfrk | create | |
2021-11-05 07:54:09 | bfrk | set | status: needs-screening -> needs-review |
2023-03-24 23:35:36 | ganesh | set | messages:
+ msg23178 |
2023-03-25 09:34:40 | ganesh | set | messages:
+ msg23179 |
2023-03-25 21:14:54 | ganesh | set | status: needs-review -> review-in-progress messages:
+ msg23182 |
2023-03-26 16:41:12 | ganesh | set | messages:
+ msg23185 |
2023-03-26 17:31:27 | ganesh | set | messages:
+ msg23186 |
2023-03-26 17:33:40 | ganesh | set | messages:
+ msg23187 |
2023-03-26 21:48:28 | ganesh | set | messages:
+ msg23188 |
2023-03-26 23:34:35 | bfrk | set | messages:
+ msg23192 |
2023-03-27 00:22:01 | bfrk | set | messages:
+ msg23193 |
2023-03-27 00:36:28 | bfrk | set | files:
+ patch-preview.txt, bugfix-and-cleanup-for-maximum-size-in-hashes.dpatch messages:
+ msg23194 |
2023-03-27 01:27:16 | bfrk | set | messages:
+ msg23195 |
2023-03-27 07:00:47 | ganesh | set | status: review-in-progress -> accepted-pending-tests messages:
+ msg23196 |
2023-03-27 08:04:56 | ganesh | set | status: accepted-pending-tests -> accepted |
|