darcs

Patch 1887 remove the size prefix when writing hash... (and 14 more)

Title remove the size prefix when writing hash... (and 14 more)
Superseder Nosy List bf
Related Issues
Status in-discussion Assigned To
Milestone

Created on 2019-08-20.15:49:22 by bf, last changed 2019-09-21.10:30:22 by bf.

Files
File name Status Uploaded Type Edit Remove
patch-preview.txt bf, 2019-08-20.15:49:21 text/x-darcs-patch
patch-preview.txt bf, 2019-08-20.18:43:45 text/x-darcs-patch
patch-preview.txt bf, 2019-08-21.07:15:14 text/x-darcs-patch
remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch bf, 2019-08-20.15:49:21 application/x-darcs-patch
remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch bf, 2019-08-20.18:43:45 application/x-darcs-patch
remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch bf, 2019-08-21.07:15:14 application/x-darcs-patch
unnamed bf, 2019-08-20.15:49:21 text/plain
unnamed bf, 2019-08-20.18:43:45 text/plain
unnamed bf, 2019-08-21.07:15:14 text/plain
See mailing list archives for discussion on individual patches.
Messages
msg21149 (view) Author: bf Date: 2019-08-20.15:49:21
As promised, a cleaned up version of my latest refactorings for the
Repository subsystem. I won't screen this immediately as there are some
details that might warrant a bit of discussion or a better explanation.

15 patches for repository http://darcs.net/screened:

patch 6f51164b4d6f4f9b025ef6bd9008ac46ab8c74c2
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Aug 17 11:30:37 CEST 2019
  * remove the size prefix when writing hashed inventories and patches
  
  I guess the motivation for prefixing the (uncompressed) file size to the
  file name was to allow allocation of a ByteString of the exact size we need.
  It doesn't seem like this was ever realized. Note that for pristine.hashed
  we don't support size prefixes except for the top-level entry. Given that
  Trees use lazy ByteStrings for file content it is clear that the size
  information is useless for pristine files.
  Our test scripts extensively cover reading from existing repositories, some
  of them are even saved as tar balls, so I think this change is fully
  backward compatible.

patch 4e4f497e3f651e6e887d56fb4c6a69015d5400f6
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Aug 17 13:17:05 CEST 2019
  * fix doc comment for filterDirContents
  
  We changed the behavior slightly a while ago.

patch c788c68a4936b6643d1c8c870c83ada5e145e618
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Aug 17 18:49:53 CEST 2019
  * add a compatibility comment to readRecorded

patch 1841803eea6976c8e7d32b661f21ece55d1537ea
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 10:00:45 CEST 2019
  * simplify the type of readTentativeRepo
  
  Reading the tentative patches from a repo makes sense only for the current
  repo and we already require that this is located at ".".

patch f674822894c2e40548ef1e5d8eb294b1f91f338a
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 11:11:02 CEST 2019
  * tests/clone.sh: fix removing of repo dirs

patch 9661ce4c1e5e7ee5bb0d7cf6f94350d0c6769edc
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 13:39:03 CEST 2019
  * fix some haddocks in Darcs.Repository.Hashed

patch 8f47d28a39aec5b33ab32289f841ba21fb270343
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:39:07 CEST 2019
  * make changes to pending follow our standard transaction protocol
  
  Like for other changes to the repository, any change to pending is now first
  made to the tentative copy of the pending patch, then finalized by renaming
  the tentative copy back to the "real" pending. Commands that modify only on
  the pending patch (and/or the working tree: add, mark-conflicts, move,
  remove, replace, revert, rollback, setpref, and unrevert) now have to call
  finalizePending, except rollback and unrevert which already perform full
  finalization. The readPending we export from D.R.State now reads the
  tentative pending, falling back to the real pending in case we are not in a
  transaction. This is safe since the tentative pending is now guaranteed to
  be gone when the transaction is finalized.
  
  A bit of complication arises since the pending patch is intricately
  interwoven with reading Trees and many functions in D.R.State assumed it
  starts at the recorded state. This means we have to go all the way and make
  them refer to the tentative state instead. This is highly desirable anyway:
  previously these functions could only be used before any changes were made
  to the repository, a restriction that is now lifted. This percolates through
  to a few other modules under Darcs.Repository. In particular,
  tentativelyMergePatches can now be used after (tentatively) adding or
  removing patches, which allows us to remove the intermediate finalize/revert
  calls in applyPatchesForRebaseCmd.

patch 41663730aeb89c8bdf827d81648fe2174cb3f232
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:40:40 CEST 2019
  * require wR ~ wT for readRecorded
  
  Note that readRecorded does exactly what it claims: it reads the recorded
  state, ignoring the tentative state. But we call it everywhere, regardless
  of whether we are in a transaction or not. Since we don't tag states with
  witnesses, it seems safer to insist that no changes have been made to the
  repo, rather than risk that it gets called erroneously when in fact reading
  the current tentative state was intended.

patch f817a945c46f02b9fb37e2fc327e156fdd983a93
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:46:46 CEST 2019
  * make restrictSubpaths filters pure

patch 5a877ebf6d69cae59de12e300b12de5142ab7ed5
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:51:20 CEST 2019
  * diff command: use restrictSubpaths, not restrictSubpathsAfter
  
  The way this filter is used in the diff command translating paths through
  pending really make no sense.

patch c664900a0bea0d5482af1deb33fcef4780f9eb33
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:51:31 CEST 2019
  * extend comment about unsafety of addPendingDiffToPending

patch fb3cae46ee99523c54b042d870e204a9777fb4da
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:51:39 CEST 2019
  * rename setTentativePending to tentativelySetPending
  
  This is more in line with the convention for other changes to the tentative
  state.

patch 6d0cf92107250cadbcc205abc9f7443f26897f93
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:53:06 CEST 2019
  * cleanup imports in Darcs.Patch.PatchInfoAnd

patch 23653c3fe926e7af11ddd5d0a0fe559e518fa840
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:53:38 CEST 2019
  * cleanup messages and filepath/url concatentation in D.R.Cache

patch f0c57a97373a9174b01322991d2fb3be6beeac8b
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:53:45 CEST 2019
  * refactor representation of hashes and caching of hashed files
  
  This moves the hash related definitions (ValidHash, InventoryHash,
  PatchHash, PristineHash, etc) to Darcs.Repository.Cache and changes the
  internal representation of these hashes from String to ByteString (still b16
  encoded). The same change is made for PatchSets and PatchInfoAnd. Conversion
  to String happens only at the boundary: when we need a FilePath for IO or to
  display messages. Since validation of hashes is now done on ByteStrings, we
  can afford to make it a bit more thorough.
  
  The type distinction between hashes for different file types allows the type
  checker to infer the name of the cache directory from the type. We
  therefore remove the HashedDir argument for most functions exported from
  Darcs.Repository.Cache.
Attachments
msg21152 (view) Author: bf Date: 2019-08-20.18:43:45
The only new patch is the last one that fully removes the RepoType cruft
(sorry Ganesh ;-)

16 patches for repository http://darcs.net/screened:

patch 6f51164b4d6f4f9b025ef6bd9008ac46ab8c74c2
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Aug 17 11:30:37 CEST 2019
  * remove the size prefix when writing hashed inventories and patches
  
  I guess the motivation for prefixing the (uncompressed) file size to the
  file name was to allow allocation of a ByteString of the exact size we need.
  It doesn't seem like this was ever realized. Note that for pristine.hashed
  we don't support size prefixes except for the top-level entry. Given that
  Trees use lazy ByteStrings for file content it is clear that the size
  information is useless for pristine files.
  Our test scripts extensively cover reading from existing repositories, some
  of them are even saved as tar balls, so I think this change is fully
  backward compatible.

patch 4e4f497e3f651e6e887d56fb4c6a69015d5400f6
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Aug 17 13:17:05 CEST 2019
  * fix doc comment for filterDirContents
  
  We changed the behavior slightly a while ago.

patch c788c68a4936b6643d1c8c870c83ada5e145e618
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Aug 17 18:49:53 CEST 2019
  * add a compatibility comment to readRecorded

patch 1841803eea6976c8e7d32b661f21ece55d1537ea
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 10:00:45 CEST 2019
  * simplify the type of readTentativeRepo
  
  Reading the tentative patches from a repo makes sense only for the current
  repo and we already require that this is located at ".".

patch f674822894c2e40548ef1e5d8eb294b1f91f338a
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 11:11:02 CEST 2019
  * tests/clone.sh: fix removing of repo dirs

patch 9661ce4c1e5e7ee5bb0d7cf6f94350d0c6769edc
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 13:39:03 CEST 2019
  * fix some haddocks in Darcs.Repository.Hashed

patch 8f47d28a39aec5b33ab32289f841ba21fb270343
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:39:07 CEST 2019
  * make changes to pending follow our standard transaction protocol
  
  Like for other changes to the repository, any change to pending is now first
  made to the tentative copy of the pending patch, then finalized by renaming
  the tentative copy back to the "real" pending. Commands that modify only on
  the pending patch (and/or the working tree: add, mark-conflicts, move,
  remove, replace, revert, rollback, setpref, and unrevert) now have to call
  finalizePending, except rollback and unrevert which already perform full
  finalization. The readPending we export from D.R.State now reads the
  tentative pending, falling back to the real pending in case we are not in a
  transaction. This is safe since the tentative pending is now guaranteed to
  be gone when the transaction is finalized.
  
  A bit of complication arises since the pending patch is intricately
  interwoven with reading Trees and many functions in D.R.State assumed it
  starts at the recorded state. This means we have to go all the way and make
  them refer to the tentative state instead. This is highly desirable anyway:
  previously these functions could only be used before any changes were made
  to the repository, a restriction that is now lifted. This percolates through
  to a few other modules under Darcs.Repository. In particular,
  tentativelyMergePatches can now be used after (tentatively) adding or
  removing patches, which allows us to remove the intermediate finalize/revert
  calls in applyPatchesForRebaseCmd.

patch 41663730aeb89c8bdf827d81648fe2174cb3f232
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:40:40 CEST 2019
  * require wR ~ wT for readRecorded
  
  Note that readRecorded does exactly what it claims: it reads the recorded
  state, ignoring the tentative state. But we call it everywhere, regardless
  of whether we are in a transaction or not. Since we don't tag states with
  witnesses, it seems safer to insist that no changes have been made to the
  repo, rather than risk that it gets called erroneously when in fact reading
  the current tentative state was intended.

patch f817a945c46f02b9fb37e2fc327e156fdd983a93
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:46:46 CEST 2019
  * make restrictSubpaths filters pure

patch 5a877ebf6d69cae59de12e300b12de5142ab7ed5
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:51:20 CEST 2019
  * diff command: use restrictSubpaths, not restrictSubpathsAfter
  
  The way this filter is used in the diff command translating paths through
  pending really make no sense.

patch c664900a0bea0d5482af1deb33fcef4780f9eb33
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:51:31 CEST 2019
  * extend comment about unsafety of addPendingDiffToPending

patch fb3cae46ee99523c54b042d870e204a9777fb4da
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:51:39 CEST 2019
  * rename setTentativePending to tentativelySetPending
  
  This is more in line with the convention for other changes to the tentative
  state.

patch 6d0cf92107250cadbcc205abc9f7443f26897f93
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:53:06 CEST 2019
  * cleanup imports in Darcs.Patch.PatchInfoAnd

patch 23653c3fe926e7af11ddd5d0a0fe559e518fa840
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:53:38 CEST 2019
  * cleanup messages and filepath/url concatentation in D.R.Cache

patch f0c57a97373a9174b01322991d2fb3be6beeac8b
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:53:45 CEST 2019
  * refactor representation of hashes and caching of hashed files
  
  This moves the hash related definitions (ValidHash, InventoryHash,
  PatchHash, PristineHash, etc) to Darcs.Repository.Cache and changes the
  internal representation of these hashes from String to ByteString (still b16
  encoded). The same change is made for PatchSets and PatchInfoAnd. Conversion
  to String happens only at the boundary: when we need a FilePath for IO or to
  display messages. Since validation of hashes is now done on ByteStrings, we
  can afford to make it a bit more thorough.
  
  The type distinction between hashes for different file types allows the type
  checker to infer the name of the cache directory from the type. We
  therefore remove the HashedDir argument for most functions exported from
  Darcs.Repository.Cache.

patch d4c00aff0baa693b3820d2d99b63c446246f9e7f
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 20:39:59 CEST 2019
  * remove Darcs.Patch.RepoType
Attachments
msg21158 (view) Author: bf Date: 2019-08-21.07:15:14
Same as before, only rebased to no longer depend on patches we obliterated
from screened.

Regarding the last patch: Can you (Ganesh) tell me whether you agree with
this, in principle? I am not invested much in this change, it just occurred
to me that we don't actually need it anymore, since the rebase patch is now
kept separate from normal patches. If you think we should keep it, perhaps
because you have some plans to use it for a different purpose, just say so.

16 patches for repository http://darcs.net/screened:

patch 6f51164b4d6f4f9b025ef6bd9008ac46ab8c74c2
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Aug 17 11:30:37 CEST 2019
  * remove the size prefix when writing hashed inventories and patches
  
  I guess the motivation for prefixing the (uncompressed) file size to the
  file name was to allow allocation of a ByteString of the exact size we need.
  It doesn't seem like this was ever realized. Note that for pristine.hashed
  we don't support size prefixes except for the top-level entry. Given that
  Trees use lazy ByteStrings for file content it is clear that the size
  information is useless for pristine files.
  Our test scripts extensively cover reading from existing repositories, some
  of them are even saved as tar balls, so I think this change is fully
  backward compatible.

patch 4e4f497e3f651e6e887d56fb4c6a69015d5400f6
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Aug 17 13:17:05 CEST 2019
  * fix doc comment for filterDirContents
  
  We changed the behavior slightly a while ago.

patch c788c68a4936b6643d1c8c870c83ada5e145e618
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sat Aug 17 18:49:53 CEST 2019
  * add a compatibility comment to readRecorded

patch 1841803eea6976c8e7d32b661f21ece55d1537ea
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 10:00:45 CEST 2019
  * simplify the type of readTentativeRepo
  
  Reading the tentative patches from a repo makes sense only for the current
  repo and we already require that this is located at ".".

patch f674822894c2e40548ef1e5d8eb294b1f91f338a
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 11:11:02 CEST 2019
  * tests/clone.sh: fix removing of repo dirs

patch 9661ce4c1e5e7ee5bb0d7cf6f94350d0c6769edc
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 13:39:03 CEST 2019
  * fix some haddocks in Darcs.Repository.Hashed

patch 8f47d28a39aec5b33ab32289f841ba21fb270343
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:39:07 CEST 2019
  * make changes to pending follow our standard transaction protocol
  
  Like for other changes to the repository, any change to pending is now first
  made to the tentative copy of the pending patch, then finalized by renaming
  the tentative copy back to the "real" pending. Commands that modify only on
  the pending patch (and/or the working tree: add, mark-conflicts, move,
  remove, replace, revert, rollback, setpref, and unrevert) now have to call
  finalizePending, except rollback and unrevert which already perform full
  finalization. The readPending we export from D.R.State now reads the
  tentative pending, falling back to the real pending in case we are not in a
  transaction. This is safe since the tentative pending is now guaranteed to
  be gone when the transaction is finalized.
  
  A bit of complication arises since the pending patch is intricately
  interwoven with reading Trees and many functions in D.R.State assumed it
  starts at the recorded state. This means we have to go all the way and make
  them refer to the tentative state instead. This is highly desirable anyway:
  previously these functions could only be used before any changes were made
  to the repository, a restriction that is now lifted. This percolates through
  to a few other modules under Darcs.Repository. In particular,
  tentativelyMergePatches can now be used after (tentatively) adding or
  removing patches, which allows us to remove the intermediate finalize/revert
  calls in applyPatchesForRebaseCmd.

patch 41663730aeb89c8bdf827d81648fe2174cb3f232
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:40:40 CEST 2019
  * require wR ~ wT for readRecorded
  
  Note that readRecorded does exactly what it claims: it reads the recorded
  state, ignoring the tentative state. But we call it everywhere, regardless
  of whether we are in a transaction or not. Since we don't tag states with
  witnesses, it seems safer to insist that no changes have been made to the
  repo, rather than risk that it gets called erroneously when in fact reading
  the current tentative state was intended.

patch f817a945c46f02b9fb37e2fc327e156fdd983a93
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:46:46 CEST 2019
  * make restrictSubpaths filters pure

patch 5a877ebf6d69cae59de12e300b12de5142ab7ed5
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:51:20 CEST 2019
  * diff command: use restrictSubpaths, not restrictSubpathsAfter
  
  The way this filter is used in the diff command translating paths through
  pending really make no sense.

patch c664900a0bea0d5482af1deb33fcef4780f9eb33
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:51:31 CEST 2019
  * extend comment about unsafety of addPendingDiffToPending

patch fb3cae46ee99523c54b042d870e204a9777fb4da
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:51:39 CEST 2019
  * rename setTentativePending to tentativelySetPending
  
  This is more in line with the convention for other changes to the tentative
  state.

patch 23653c3fe926e7af11ddd5d0a0fe559e518fa840
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:53:38 CEST 2019
  * cleanup messages and filepath/url concatentation in D.R.Cache

patch ef95feec78454aeef1253e0fd441e713dffe43ca
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:53:06 CEST 2019
  * cleanup imports in Darcs.Patch.PatchInfoAnd

patch 7b10e505dbd744e93f05df74c07e12b9380dd584
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 16:53:45 CEST 2019
  * refactor representation of hashes and caching of hashed files
  
  This moves the hash related definitions (ValidHash, InventoryHash,
  PatchHash, PristineHash, etc) to Darcs.Repository.Cache and changes the
  internal representation of these hashes from String to ByteString (still b16
  encoded). The same change is made for PatchSets and PatchInfoAnd. Conversion
  to String happens only at the boundary: when we need a FilePath for IO or to
  display messages. Since validation of hashes is now done on ByteStrings, we
  can afford to make it a bit more thorough.
  
  The type distinction between hashes for different file types allows the type
  checker to infer the name of the cache directory from the type. We
  therefore remove the HashedDir argument for most functions exported from
  Darcs.Repository.Cache.

patch dd9a49ab094c9c189a443f1db474aaa8f9ea3dc4
Author: Ben Franksen <ben.franksen@online.de>
Date:   Tue Aug 20 20:39:59 CEST 2019
  * remove Darcs.Patch.RepoType
Attachments
msg21168 (view) Author: ganesh Date: 2019-08-23.22:19:23
Regarding the removal of RepoType: I have no particular objections
if it's not serving any purpose any more. I actually originally
introduced it, replacing the older "Rebasing p" wrapper type, to
make it easier to introduce a 'stash' feature for suspending patches
without changing their identity:

https://hub.darcs.net/ganesh/darcs-stash-draft-20150914

But there was various discussion about the final UI that I never
implemented and it'd now need a lot of work (and re-discussion) to
do so. I guess it would also look quite similar to rebase as it
is now so the arguments for keeping RepoType for it would be similar.

I guess the remaining argument for keeping RepoType would be that it
gives us static knowledge of the features of a repository (assuming
that the code that launches RepoJobs is correct). I don't know how
useful this is. Looking at current screened, I'm actually a bit confused
about how some of the code works now :-) For example, how does
the call in Darcs.Repository.Hashed.tentativelyAddPatches to
withTentativeRebase work if there's no rebase in progress?
msg21169 (view) Author: ganesh Date: 2019-08-23.23:06:14
Regarding dropping the size prefix, there's some history to its
existence that I can't really remember or fully understand from
searching old emails. The size prefix was dropped for pristine in 
2010:

http://bugs.darcs.net/patch156

There was also some discussion about how to manage the transition
(at least I think there was discussion, but I can only find my
summing up):

https://lists.osuosl.org/pipermail/darcs-users/2010-
March/023475.html

I guess my main questions about this would be
(a) what do we gain by dropping it
and
(b) how will the transition work in existing repositories (will
they end up with a mix of formats?) and in the cache (will we
lose sharing?)
msg21182 (view) Author: ganesh Date: 2019-08-25.17:15:49
A further thought about the RepoType parameter: I can't see any use to
keeping it on PatchInfoAnd, given that as you say we no longer drag
a containing rebase patch in it. But I think it may have some value
on Repository, as there is a difference between a rebase-in-progress
repository and a "normal" one. I don't know whether that value
outweighs the cost.
msg21190 (view) Author: bf Date: 2019-08-25.23:20:52
> Looking at current screened, I'm actually a bit confused
> about how some of the code works now :-) For example, how does
> the call in Darcs.Repository.Hashed.tentativelyAddPatches to
> withTentativeRebase work if there's no rebase in progress?

Ha! A good question. It took me a moment to find the answer. We are
indeed updating the rebase patch, regardless of whether there is a
rebase in progress. It's just that add/removeFixupsFromSuspended
immediately drops every change since it commutes cleanly past the empty
rebase patch!
msg21193 (view) Author: bf Date: 2019-08-26.06:18:13
> Regarding dropping the size prefix, there's some history to its
> existence that I can't really remember or fully understand from
> searching old emails. The size prefix was dropped for pristine in 
> 2010:
> 
> http://bugs.darcs.net/patch156

Hm, interesting read, though mostly of historical interest.

> I guess my main questions about this would be
> (a) what do we gain by dropping it

Code gets simpler and more uniform (also slightly more efficient). Same
for the on-disk representation.

One practical advantage (not yet implemented in the patch I sent) is
that we can now share all pristine files with the cache, not just the
top-level pristine tree. We cannot currently do that, because of the
difference in format, unless we pull the case distinction into
Darcs.Repository.Cache, which would further complicate the code in that
module. Using the cache for all pristine files would speed up cloning of
unpacked repos (over http) quite a bit, I think.

> and
> (b) how will the transition work in existing repositories (will
> they end up with a mix of formats?) and in the cache (will we
> lose sharing?)

Note we already read both formats. The only change would be that we
write without size-prefix.

But yes, with the patch as I sent it, there will be a mix in existing
repos and the cache... for a while. There will also be a certain amount
of duplication of files in the cache.

We could make the transition even less noticeable by hard-linking old
and new formats whenever we find both. However, this means eventually
getting rid of the old size-prefixed files will be a bit more difficult.

Excursion:

The way 'darcs optimize cache' currently works makes it practically
unusable. Searching the home directory for darcs repos is extremely
slow, while manually listing all the repos is completely out of the
question for anyone who has any actual work to do. I guess what people
do instead is to just rm -rf ~/.cache/darcs and accept that the next few
clones take a bit longer than usual.

I think we should just delete all files in the global cache that have a
hard-link count of one. This would make it a pretty fast operation which
means we could perform it implicitly on a regular basis, say, every 100
times a darcs command accesses the cache. This would help keeping the
size of the cache at a more reasonable level, even in the long run. Yes,
this removes cached files for repos that are not on the same file
system. For people who are regularly working in this kind of setting
(pretty uncommon nowadays, I guess) we can add an option to disable the
automatic garbage collection.

If there are hard-links between differently named files in the cache,
this becomes a bit more complicated and less efficient: we'd also have
to search for files with a hard-link count of two, then check if there
are both versions in the cache and that they are linked, before deleting
both.
msg21194 (view) Author: bf Date: 2019-08-26.07:37:20
> A further thought about the RepoType parameter: I can't see any use to
> keeping it on PatchInfoAnd, given that as you say we no longer drag
> a containing rebase patch in it. But I think it may have some value
> on Repository, as there is a difference between a rebase-in-progress
> repository and a "normal" one. I don't know whether that value
> outweighs the cost.

Regarding the cost/benefit ratio: I don't mind dragging around the 'rt'
type parameter so much. What I do mind is that actually using it (either
at runtime or at compile time) incurs this ridiculous amount of
syntactic overhead. It makes type signatures longer and harder to read
and the case distinctions themselves are riddled with complicated extra
type annotations and singleton type parameters. The fact that working
with singleton types in Haskell can be rather awkward has been
recognized even by experts that are otherwise ardent supporters of
dependent typing.

I am a personally big fan of Haskell's strong static type system and I
appreciate very much how it enables safe refactoring using some of the
more advanced features like phantom ("witness") types. I guess it
contributes a large part of the fun I have with hacking on darcs that I
/can/ do these large refactors and still get it working afterwards
without weeks of debugging.

But there are limits. If the typing overhead gets to the point of
obscuring code that is basically a plain and simple case distinction,
then IMO that limit is reached. We should also keep in mind that any
advanced type shenanigans make it harder for newcomers to contribute.
And finally, trying to capture ever more things at the type level isn't
the only, and perhaps not the best, way of making progress: re-writing
complicated procedures so that it becomes obvious what they do and why,
commenting anything that remains tricky, adding haddocks, consistent and
succinct but still suggestive names for variables etc etc..., in other
words improving code quality. This is what I have been doing for years
now, and I think it has proven quite effective.

Your use of RepoType when the rebase patch was mixed with normal patches
was (barely) justified (though it really gave me quite a headache to
understand it at first): it was important that we don't accidentally mix
rebase-in-progress patches with normal ones.

Nowadays, there is no longer any essential difference between a
rebase-in-progress repository and a "normal" one. We still make a
distinction when we start a RepoJob but I think that is no longer
needed. So what if we use e.g. 'darcs rebase unsuspend' or 'darcs rebase
log' and there is no rebase in progress? Nothing bad would happen at
all. This could be handled just like we handle other "empty" cases i.e.
simply report that there are no suspended patches after reading the
rebase patch.

We could even drop the final "Rebase in progress: N suspended patches"
message; I find it a bit distracting since almost all of my repos
nowadays have a few suspended patches. But I'm surely not the average
user, and I don't mind if we keep it. (It should certainly be reported
if we act on my proposal for an enhanced 'darcs status' command.)
msg21202 (view) Author: bf Date: 2019-08-26.10:00:59
It is good I haven't screened any of the patches yet.

I am having doubts about some details of the "make changes to pending
follow our standard transaction protocol" patch. I think the general
idea is fine, but there are issues with first reading the tentative
pending and falling back to reading the real pending if that fails. For
this to work correctly we have to make sure that we clean up the
tentative pending patch, even if the transaction fails (for any reason).
Otherwise the next time a non-transaction command reads pending it gets
the content of the aborted pending.tentative which is decidedly not
good. This is missing. It is interesting that this bug is not found by
any of the regression tests.
msg21327 (view) Author: ganesh Date: 2019-09-03.07:40:37
> One practical advantage (not yet implemented in the patch I sent) is
> that we can now share all pristine files with the cache, not just the
> top-level pristine tree. We cannot currently do that, because of the
> difference in format, unless we pull the case distinction into
> Darcs.Repository.Cache, which would further complicate the code in that
> module. Using the cache for all pristine files would speed up cloning of
> unpacked repos (over http) quite a bit, I think.

I don't understand what you mean - I see a pristine.hashed subfolder in
my cache that has recently updated individual pristine files in it. In
what way are we not already using the cache for all pristine files?

>> and
>> (b) how will the transition work in existing repositories (will
>> they end up with a mix of formats?) and in the cache (will we
>> lose sharing?)
> 
> Note we already read both formats. The only change would be that we
> write without size-prefix.
> 
> But yes, with the patch as I sent it, there will be a mix in existing
> repos and the cache... for a while. There will also be a certain amount
> of duplication of files in the cache.
> 
> We could make the transition even less noticeable by hard-linking old
> and new formats whenever we find both. However, this means eventually
> getting rid of the old size-prefixed files will be a bit more difficult.

You mean getting rid of the code to handle them, or the files
themselves? I'd have thought the old files would persist a long time if
people don't re-clone repos, as they'll be in the history. If we make
sure to hard-link between the old and the new names in the cache it's
probably not too painful.

> Excursion:
> 
> The way 'darcs optimize cache' currently works makes it practically
> unusable. Searching the home directory for darcs repos is extremely
> slow, while manually listing all the repos is completely out of the
> question for anyone who has any actual work to do. I guess what people
> do instead is to just rm -rf ~/.cache/darcs and accept that the next few
> clones take a bit longer than usual.

Yeah, I never really know how to do it. I don't clean up my cache very
often.


> I think we should just delete all files in the global cache that have a
> hard-link count of one.


The current behaviour was implemented in
http://darcs.net/GSoC/2014-Hashed-Files-And-Cache

http://bugs.darcs.net/issue2374
http://bugs.darcs.net/patch1180

I can't find any evidence that the idea of link counting was considered,
which is a bit surprising (perhaps it's only obvious in retrospect).

> This would make it a pretty fast operation which
> means we could perform it implicitly on a regular basis, say, every 100
> times a darcs command accesses the cache. This would help keeping the
> size of the cache at a more reasonable level, even in the long run. Yes,
> this removes cached files for repos that are not on the same file
> system. For people who are regularly working in this kind of setting
> (pretty uncommon nowadays, I guess) we can add an option to disable the
> automatic garbage collection.

I guess the problem is that darcs would be slower for those people by
default and they'd never know why. Filling your disk by default is also
not a great option but personally I never find the disk space used by
the cache significant even though I am often short of disk space.

But I agree it'd be rare. And I can't see any other downsides to
deleting cache files that aren't in a repository anywhere on disk. It
would hurt people that occasionally clone a repo and then delete it
again but I think that's an ok default.

> If there are hard-links between differently named files in the cache,
> this becomes a bit more complicated and less efficient: we'd also have
> to search for files with a hard-link count of two, then check if there
> are both versions in the cache and that they are linked, before deleting
> both.

We probably don't need to check they are linked, just that they both
exist. If they both exist then delete if link count <=2 , if not then if
<=1. If they both exist and aren't hard-linked then we might
accidentally delete something that was linked elsewhere. But it ought
not to happen if darcs has already done its job right, and it's not much
of a problem if it does happen.
msg21338 (view) Author: bf Date: 2019-09-03.14:18:59
>> One practical advantage (not yet implemented in the patch I sent) is
>> that we can now share all pristine files with the cache, not just the
>> top-level pristine tree. We cannot currently do that, because of the
>> difference in format, unless we pull the case distinction into
>> Darcs.Repository.Cache, which would further complicate the code in that
>> module. Using the cache for all pristine files would speed up cloning of
>> unpacked repos (over http) quite a bit, I think.
> 
> I don't understand what you mean - I see a pristine.hashed subfolder in
> my cache that has recently updated individual pristine files in it. In
> what way are we not already using the cache for all pristine files?

We don't write pristine files via Darcs.Repository.Cache but only
locally, see Darcs.Util.Tree.Hashed, except during fetch. Indeed, using
writeFileUsingCache for pristine files would currently be pointless as
this would write files with a size prefix.

We also don't read from the cache unless we transfer pristine files e.g.
during clone. So there is some sharing but it is limited to transfer
operations.

Suppose I record a new patch on top of a current clone of screened, then
push that, then make a fresh clone of screened. This clone will not
share any of the new pristine files with the old repo, so it must
download them. Contrast that with inventories and patches, where we need
not download a single file because all of them are already in the cache.

>> We could make the transition even less noticeable by hard-linking old
>> and new formats whenever we find both. However, this means eventually
>> getting rid of the old size-prefixed files will be a bit more difficult.
> 
> You mean getting rid of the code to handle them, or the files
> themselves?

The latter. Which is why I went on the excursion about optimize cache.

> I'd have thought the old files would persist a long time if
> people don't re-clone repos, as they'll be in the history. If we make
> sure to hard-link between the old and the new names in the cache it's
> probably not too painful.

That's what I meant. But it would be nice if at some point we would
automatically get rid of all the size-prefixed files. Not essential.

>> Excursion:
>>
>> The way 'darcs optimize cache' currently works makes it practically
>> unusable. Searching the home directory for darcs repos is extremely
>> slow, while manually listing all the repos is completely out of the
>> question for anyone who has any actual work to do. I guess what people
>> do instead is to just rm -rf ~/.cache/darcs and accept that the next few
>> clones take a bit longer than usual.
> 
> Yeah, I never really know how to do it. I don't clean up my cache very
> often.

It happens for me usually when I switch to a new machine.

>> I think we should just delete all files in the global cache that have a
>> hard-link count of one.
> 
> 
> The current behaviour was implemented in
> http://darcs.net/GSoC/2014-Hashed-Files-And-Cache
> 
> http://bugs.darcs.net/issue2374
> http://bugs.darcs.net/patch1180
> 
> I can't find any evidence that the idea of link counting was considered,
> which is a bit surprising (perhaps it's only obvious in retrospect).

I am still thinking about possible pitfalls.

>> This would make it a pretty fast operation which
>> means we could perform it implicitly on a regular basis, say, every 100
>> times a darcs command accesses the cache. This would help keeping the
>> size of the cache at a more reasonable level, even in the long run. Yes,
>> this removes cached files for repos that are not on the same file
>> system. For people who are regularly working in this kind of setting
>> (pretty uncommon nowadays, I guess) we can add an option to disable the
>> automatic garbage collection.
> 
> I guess the problem is that darcs would be slower for those people by
> default and they'd never know why.

Correct. Perhaps GCing the cache shouldn't be on by default.

> Filling your disk by default is also
> not a great option but personally I never find the disk space used by
> the cache significant even though I am often short of disk space.

I agree that nowadays a few GB for the cache aren't really a problem
anymore. It used to be for me where I work because on our development
machine we have quotas.This has regularly caused problems until I got
the idea to set XDG_CACHE_HOME for everybody to a directory on the disk
where we do our actual work and have lots of space.

> But I agree it'd be rare. And I can't see any other downsides to
> deleting cache files that aren't in a repository anywhere on disk. It
> would hurt people that occasionally clone a repo and then delete it
> again but I think that's an ok default.

So what we should do is to implement this but not make the automatic GC
the default. At least we will get a useful 'darcs optimize cache'
command, automatically calling it is an optional feature.

>> If there are hard-links between differently named files in the cache,
>> this becomes a bit more complicated and less efficient: we'd also have
>> to search for files with a hard-link count of two, then check if there
>> are both versions in the cache and that they are linked, before deleting
>> both.
> 
> We probably don't need to check they are linked, just that they both
> exist. If they both exist then delete if link count <=2 , if not then if
> <=1. If they both exist and aren't hard-linked then we might
> accidentally delete something that was linked elsewhere. But it ought
> not to happen if darcs has already done its job right, and it's not much
> of a problem if it does happen.

Yeah, something like that should do it.
History
Date User Action Args
2019-08-20 15:49:22bfcreate
2019-08-20 18:43:45bfsetfiles: + patch-preview.txt, remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch, unnamed
messages: + msg21152
2019-08-21 07:15:16bfsetfiles: + patch-preview.txt, remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch, unnamed
messages: + msg21158
2019-08-23 22:19:24ganeshsetmessages: + msg21168
2019-08-23 23:06:14ganeshsetmessages: + msg21169
2019-08-25 17:15:49ganeshsetmessages: + msg21182
2019-08-25 23:20:53bfsetmessages: + msg21190
2019-08-26 06:18:14bfsetmessages: + msg21193
2019-08-26 07:37:20bfsetmessages: + msg21194
2019-08-26 10:00:59bfsetmessages: + msg21202
2019-09-03 07:40:38ganeshsetmessages: + msg21327
2019-09-03 14:18:59bfsetmessages: + msg21338
2019-09-21 10:30:22bfsetstatus: needs-screening -> in-discussion