darcs

Patch 2216 use ShortByteString for Hash content (and 16 more)

Title use ShortByteString for Hash content (and 16 more)
Superseder Nosy List bfrk
Related Issues
Status accepted Assigned To
Milestone

Created on 2021-11-04.23:32:13 by bfrk, last changed 2023-03-27.08:04:56 by ganesh.

Files
File name Status Uploaded Type Edit Remove
bugfix-and-cleanup-for-maximum-size-in-hashes.dpatch bfrk, 2023-03-27.00:36:27 application/x-darcs-patch
patch-preview.txt bfrk, 2021-11-04.23:32:12 text/x-darcs-patch
patch-preview.txt bfrk, 2023-03-27.00:36:27 text/x-darcs-patch
use-shortbytestring-for-hash-content.dpatch bfrk, 2021-11-04.23:32:12 application/x-darcs-patch
See mailing list archives for discussion on individual patches.
Messages
msg22919 (view) Author: bfrk Date: 2021-11-04.23:32:12
Refactors related to patch/inventory/pristine hashes.

17 patches for repository http://darcs.net/screened:

patch 7984dfa59af4ab318bbc1208bd54baa4e290b750
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun Mar  7 17:25:25 CET 2021
  * use ShortByteString for Hash content

  This should bring down memory use and decrease fragmentation.

patch d56b7945d20bd2e29a0a2131d255394c17a63d03
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun Mar  7 11:14:10 CET 2021
  * cache: refactor cacheHash and remove export

patch 091c755d6bde86d0e0c1dd7b0a60aacb9fc3fab2
Author: Ben Franksen <ben.franksen@online.de>
Date:   Mon Mar  8 10:36:08 CET 2021
  * avoid re-validation of already validated patch hashes

patch 64ebee50e76201dc651da2b632a905d89f0db193
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun Mar 14 09:34:56 CET 2021
  * validate hashes in inventories on the ByteString side

  For reasons I haven't been able to figure out this drastically reduces
  memory consumption.

patch fbd443da22fdb6ec3012b95f2ba70059f2af5007
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Mar 27 08:53:24 CET 2021
  * move hash validation from D.R.Inventory to D.Util.Cache

patch 1a649931c95cf11bde605b324738cf64fe01602f
Author: Ben Franksen <ben.franksen@online.de>
Date:   Mon Apr 12 07:54:17 CEST 2021
  * make possible non-existence of hashes explicit

  This removes the NoHash constructor and (more or less mechanically) replaces
  Hash with Maybe Hash. This exposes lots of situations where we missed out on
  more precise typing i.e. where we know we have a hash but still work with a
  Maybe Hash. This patch doesn't clean these up, it just allows and encourages
  us to do so.

patch aa120e78415d2f17b0df30b196ea7d6450f0a46c
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Apr 16 11:53:52 CEST 2021
  * use decoding to validate hashes

  This also delegates the implementation of okayHash to okayHashB.

patch 08d3835579f633bf1b03250a00a54b9c79f91cbb
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Mar 24 03:32:16 CET 2021
  * fix generator for hashes in D.T.R.Inventory

patch a5e7aac823fbb0e216473d93e714dfc9b477e428
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Apr 17 12:16:23 CEST 2021
  * cleanup parsing and unparsing of hashed directories

  The parser now uses Darcs.Util.Parser. The function decodeWhiteName which is
  used by the parser is now explicit about decoding errors. In contrast, the
  unparsing is actually *not* supposed to fail: it has existing hashes for the
  subitems as a precondition; indeed we call hash update functions in various
  places before calling darcsFormatDir. As it stands this is quite brittle and
  should be improved but this has to wait for another patch.

patch 703d61283b18b26fe7b1a72a123180706a625608
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun Mar  7 11:13:27 CET 2021
  * cache: inline copyFilesUsingCache

patch cc644a31e458ba99c533b789ecf3e50664e16712
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Mar 27 06:48:14 CET 2021
  * move D.R.Inventory to D.R.Inventory.Format

  This is so that we can move the code conerned with reading and writing
  inventories to D.R.Inventory w/o mixing the inventory format with its
  interpretation.

patch 0d657b2f0ad029650cf514d3d62af20c835082c5
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Apr 16 13:24:03 CEST 2021
  * major refactor: internally store valid hashes in parsed form

  The main theme here is the "parse, don't validate" mantra. Storing hashes
  (optionally including the content size) in parsed form is memory efficient
  and gives us much better typing. The code for this has been moved into its
  own module Darcs.Util.ValidHash with a safe API. PatchInfoAnd now stores its
  hash as a PatchHash and the Tagged sections of a PatchSet store their hash
  as an InventoryHash. The HashedDir is now inferred from the type of the
  hash, which means we no longer have to pass it to the function exported by
  Darcs.Util.Cache, which simplifies the API and makes it more type safe (yet
  note that not a single 'forall' had to be added to type signatures).

  This refactor exposed a strange HashedDir mismatch in the handling of packs
  that I temporarily marked as FIXME. I suspect that some files are not placed
  in the right directories, resulting in a loss of efficiency when cloning
  packed repos. This needs further investigation.

patch b45e0e2afe32fc6ea1fc9933b9fb5ee552a14c61
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Jul 16 17:06:29 CEST 2021
  * fix a FIXME: HashedDir mismatch in the handling of packs

  It turned out that the HashedDir parameter wasn't used anywhere, so this
  patch removes it. This makes sense since filepaths in the pack files already
  contain the subdirectory of all files.

patch 7a591395b0b66b6754ee145489772bbedf5c09d2
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Mar 27 07:18:05 CET 2021
  * move reading and writing of inventories from D.R.Hashed to D.R.Inventory

  This is a pure code move, except that I also cleaned up the layout of the
  import lists in D.R.Hashed.

patch 057ac977b25543eb7bfaf0e86b2965cd43e553f1
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Apr 21 10:31:22 CEST 2021
  * refactor updateHashes in TreeMonad

  The central change here is in the type of updateHash from

    TreeItem m -> m (Maybe Hash)

  to

    Maybe (TreeItem m -> m Hash).

  Indeed, for a concrete instantiation we either have a total hash function or
  else no hash function at all. In the latter case this change avoids calling
  a procedure that always returns Nothing, potentially recursively over a
  large tree. It also gives us more precise typing.

  Everything else follows from that, with the exception of 'flushItem' which
  was rewritten to make it clearer what happens: we first update the hash,
  then update the item on disk and then replace the item in the tree we are
  tracking.

patch 120ad503225a42ef147135dee5ef16877a2c4035
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun May 30 10:35:42 CEST 2021
  * cache: eliminate hashedFilePathReadOnly

  Since we pass the CacheLoc, we can make the distinction between bucketed
  (writable) and non-bucketed (not writable) cache inside hashedFilePath.

patch 1fc26cf24773f455adb4776ff2525b59e56f4456
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun Jul 18 16:15:37 CEST 2021
  * fix in Darcs.Patch.PatchInfoAnd.fmapH: must invalidate hash

  The function we map over the contained patch may modify it, so any hash
  we may have had is now invalid.
Attachments
msg23178 (view) Author: ganesh Date: 2023-03-24.23:35:34
Incrementally reviewing

>   * use ShortByteString for Hash content

OK

>   * cache: refactor cacheHash and remove export

OK

>   * avoid re-validation of already validated patch hashes

OK

>   * validate hashes in inventories on the ByteString side

OK (I also can't see why it reduces memory consumption)
msg23179 (view) Author: ganesh Date: 2023-03-25.09:34:38
>   * move hash validation from D.R.Inventory to D.Util.Cache

OK

>   * make possible non-existence of hashes explicit

Nice. I vaguely remember wanting to do this for ages, can't remember
why I didn't.

>   * use decoding to validate hashes

OK

>   * use decoding to validate hashes

OK
msg23182 (view) Author: ganesh Date: 2023-03-25.21:14:53
>   * cleanup parsing and unparsing of hashed directories

OK - I didn't check the new parser in detail but if it was broken
we'd find out pretty quickly I think.

>   * cache: inline copyFilesUsingCache

OK

>  * move D.R.Inventory to D.R.Inventory.Format

OK
msg23185 (view) Author: ganesh Date: 2023-03-26.16:41:11
>   * major refactor: internally store valid hashes in parsed form

Nice! Looks good, though I haven't checked every line in detail.

Two questions/thoughts:

> hunk ./harness/Darcs/Test/HashedStorage.hs 472
> - -  let darcs = dir </> "_darcs"
> - -      h_inventory = darcs </> "hashed_inventory"
> - -  repo <- doesDirectoryExist darcs
> - -  unless repo $ fail $ "Not a darcs repository: " ++ dir
> - -  isHashed <- doesFileExist h_inventory
> - -  if isHashed

I guess support for non-hashed repos here was obsolete?

> +class (Eq h, IsSizeHash h) => ValidHash h where
> +  encodeValidHash :: h -> String
> +  decodeValidHash :: String -> Maybe h
[..]
> +  -- default definitions
> +  encodeValidHash = encodeSizeHash . getSizeHash
> +  decodeValidHash = fmap fromSizeHash . decodeSizeHash

These don't ever seem to be overridden, maybe remove them from
the class?
msg23186 (view) Author: ganesh Date: 2023-03-26.17:31:27
>   * fix a FIXME: HashedDir mismatch in the handling of packs

OK

>   * move reading and writing of inventories from D.R.Hashed to 
D.R.Inventory

OK (not checked)
msg23187 (view) Author: ganesh Date: 2023-03-26.17:33:40
>   * refactor updateHashes in TreeMonad

> -darcsHash :: Monad m => TreeItem m -> m (Maybe Hash)
> -darcsHash (SubTree t) = return (Just (darcsTreeHash t))
> -darcsHash (File blob) = Just . sha256 <$> readBlob blob
> -darcsHash _ = return Nothing
> +darcsHash :: Monad m => TreeItem m -> m Hash
> +darcsHash (SubTree t) = return (darcsTreeHash t)
> +darcsHash (File blob) = sha256 <$> readBlob blob
> +darcsHash (Stub unstub _) = darcsTreeHash <$> unstub

I like your new type signature much better but there is actually a
theoretical behaviour change here since before we would
always return `Nothing` for a stub. Does it matter?
msg23188 (view) Author: ganesh Date: 2023-03-26.21:48:28
>   * cache: eliminate hashedFilePathReadOnly

OK

>   * fix in Darcs.Patch.PatchInfoAnd.fmapH: must invalidate hash

This is an interesting question. The general idea of `fmap`, that it
leaves structure untouched, clashes with the type invariants here.

I also struggle a bit to see what kind of legitimate function could
leave the witnesses untouched but alter the contents that would get
hashed, at least in a way that would later matter.

There are two actual uses of this function:

1. Reading the legacy rebase format (extractOldStyleRebase), where
the work is definitely structural in nature. Losing the hashes here
might be a big deal if this wasn't for compatibility only.

2. Filtering patches for log, which would change the content to be
hashed, so trashing the hashes probably is safer.

Maybe we should just get rid of these helpers and force the caller
to decide? It is a bunch of nasty boilerplate though and would also
force us to export more data constructors.
msg23192 (view) Author: bfrk Date: 2023-03-26.23:34:35
Partial response ;-)

>> hunk ./harness/Darcs/Test/HashedStorage.hs 472
> 
> I guess support for non-hashed repos here was obsolete?

Yes, that was the idea.

>> +class (Eq h, IsSizeHash h) => ValidHash h where
>> +  encodeValidHash :: h -> String
>> +  decodeValidHash :: String -> Maybe h
> [..]
>> +  -- default definitions
>> +  encodeValidHash = encodeSizeHash . getSizeHash
>> +  decodeValidHash = fmap fromSizeHash . decodeSizeHash
> 
> These don't ever seem to be overridden, maybe remove them from
> the class?

Yes. Unfortunately this involves a bit of churn because it requires 
additional imports in all client modules; I suspect that was the reason I 
did not do this at the time.

Anyway, will send a patch, but it will depend on another two (closely 
related) patches, one of them a bug fix.

>>    * refactor updateHashes in TreeMonad
> 
>> -darcsHash :: Monad m => TreeItem m -> m (Maybe Hash)
>> -darcsHash (SubTree t) = return (Just (darcsTreeHash t))
>> -darcsHash (File blob) = Just . sha256 <$> readBlob blob
>> -darcsHash _ = return Nothing
>> +darcsHash :: Monad m => TreeItem m -> m Hash
>> +darcsHash (SubTree t) = return (darcsTreeHash t)
>> +darcsHash (File blob) = sha256 <$> readBlob blob
>> +darcsHash (Stub unstub _) = darcsTreeHash <$> unstub
> 
> I like your new type signature much better but there is actually a
> theoretical behaviour change here since before we would
> always return `Nothing` for a stub. Does it matter?

The only place this is used is as parameter to addMissingHashes, which 
already expands stubs before applying it. I could have turned it into a 
partial function instead, not sure if that's better though.
msg23193 (view) Author: bfrk Date: 2023-03-27.00:22:01
>>    * fix in Darcs.Patch.PatchInfoAnd.fmapH: must invalidate hash
> 
> This is an interesting question. The general idea of `fmap`, that it
> leaves structure untouched, clashes with the type invariants here.
> 
> I also struggle a bit to see what kind of legitimate function could
> leave the witnesses untouched but alter the contents that would get
> hashed, at least in a way that would later matter.

I will not dwell much on the question of witnesses, which is interesting but leads us 
too far astray here. You have a point but sticking to that principle in a strict way 
would be impractical IMO, think of e.g. coalescing or canonizing.

Now, hashes are a different matter. They are supposed to correspond to the on-disk 
representation. If you change content and don't invalidate the hash, then the hash is 
*wrong*. This may lead to the new content not being written to disk, which is 
certainly fatal. If "losing" a hash that no longer corresponds to the patch's content 
leads to a problem, then the code that is using that hash is doing something seriously 
wrong.
msg23194 (view) Author: bfrk Date: 2023-03-27.00:36:27
Here is the promised follow-up, together with two dependencies that are
closely enough related that adding them here makes more sense than to rebase
and send them separately.

3 patches for repository http://darcs.net/screened:

patch c293dbf55210b151de3266773f3f53fb3bc5e5fa
Author: Ben Franksen <ben.franksen@online.de>
Date:   Mon Mar 27 00:26:12 CEST 2023
  * bugfix and cleanup for maximum size in hashes

  The previous size limit was too small by a factor of 10. To avoid such
  stupid mistakes, we now calculate the limit from the number of digits and
  use a constant for both.

patch 9af072b5a77fef1ca8b8b3cce1ee00dc22d2c31b
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Dec 17 10:19:34 CET 2022
  * add proper parsing for (possibly size-prefixed) hashes

patch 01f043d25e45ee15c1fc58e5579b45936e02e288
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun Mar 26 23:15:27 CEST 2023
  * turn some methods of ValidHash into plain functions
Attachments
msg23195 (view) Author: bfrk Date: 2023-03-27.01:27:16
Okay, I see what you meant with "structural" in the case of 
extractOldStyleRebase: here the content really doesn't change, the 
mapped function W.fromRebasing merely undoes the wrapping. So in 
this case, not invalidating the hash would be correct and slightly 
more efficient. This is a rather special case, though; in general 
the mapped function can (and in case of log does) change the 
content. I prefer to be on the safe side here.
msg23196 (view) Author: ganesh Date: 2023-03-27.07:00:44
> The only place this is used is as parameter to addMissingHashes,
> which already expands stubs before applying it. I could have turned
> it into a partial function instead, not sure if that's better though.

OK, if it's not a behaviour change I think the new code is fine.

> I will not dwell much on the question of witnesses, which is 
> interesting but leads us too far astray here. You have a point
> but sticking to that principle in a strict way would be impractical 
> IMO, think of e.g. coalescing or canonizing.

Actually those are good examples of keeping witnesses but changing
content that I had missed. (If we think of the witnesses as
representing tree states.)

I think my main concern was the implications of the `fmap` name.
But thinking about it some more I'm not even sure that concern was 
correct, the job of fmap is to preserve semantics not syntax.
History
Date User Action Args
2021-11-04 23:32:13bfrkcreate
2021-11-05 07:54:09bfrksetstatus: needs-screening -> needs-review
2023-03-24 23:35:36ganeshsetmessages: + msg23178
2023-03-25 09:34:40ganeshsetmessages: + msg23179
2023-03-25 21:14:54ganeshsetstatus: needs-review -> review-in-progress
messages: + msg23182
2023-03-26 16:41:12ganeshsetmessages: + msg23185
2023-03-26 17:31:27ganeshsetmessages: + msg23186
2023-03-26 17:33:40ganeshsetmessages: + msg23187
2023-03-26 21:48:28ganeshsetmessages: + msg23188
2023-03-26 23:34:35bfrksetmessages: + msg23192
2023-03-27 00:22:01bfrksetmessages: + msg23193
2023-03-27 00:36:28bfrksetfiles: + patch-preview.txt, bugfix-and-cleanup-for-maximum-size-in-hashes.dpatch
messages: + msg23194
2023-03-27 01:27:16bfrksetmessages: + msg23195
2023-03-27 07:00:47ganeshsetstatus: review-in-progress -> accepted-pending-tests
messages: + msg23196
2023-03-27 08:04:56ganeshsetstatus: accepted-pending-tests -> accepted