Patch 1887: remove the size prefix when writing hash... (and 14 more)

Title	remove the size prefix when writing hash... (and 14 more)
Superseder		Nosy List	bfrk
Related Issues
Status	obsoleted	Assigned To
Milestone

Created on 2019-08-20.15:49:22 by bfrk, last changed 2023-07-10.13:53:04 by bfrk.

Files
File name	Status	Uploaded	Type	Edit	Remove
patch-preview.txt		bfrk, 2019-08-20.15:49:21	text/x-darcs-patch
patch-preview.txt		bfrk, 2019-08-20.18:43:45	text/x-darcs-patch
patch-preview.txt		bfrk, 2019-08-21.07:15:14	text/x-darcs-patch
remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch		bfrk, 2019-08-20.15:49:21	application/x-darcs-patch
remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch		bfrk, 2019-08-20.18:43:45	application/x-darcs-patch
remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch		bfrk, 2019-08-21.07:15:14	application/x-darcs-patch
unnamed		bfrk, 2019-08-20.15:49:21	text/plain
unnamed		bfrk, 2019-08-20.18:43:45	text/plain
unnamed		bfrk, 2019-08-21.07:15:14	text/plain

See mailing list archives for discussion on individual patches.

Messages
msg21149 (view)	Author: bfrk	Date: 2019-08-20.15:49:21
As promised, a cleaned up version of my latest refactorings for the Repository subsystem. I won't screen this immediately as there are some details that might warrant a bit of discussion or a better explanation. 15 patches for repository http://darcs.net/screened: patch 6f51164b4d6f4f9b025ef6bd9008ac46ab8c74c2 Author: Ben Franksen <ben.franksen@online.de> Date: Sat Aug 17 11:30:37 CEST 2019 * remove the size prefix when writing hashed inventories and patches I guess the motivation for prefixing the (uncompressed) file size to the file name was to allow allocation of a ByteString of the exact size we need. It doesn't seem like this was ever realized. Note that for pristine.hashed we don't support size prefixes except for the top-level entry. Given that Trees use lazy ByteStrings for file content it is clear that the size information is useless for pristine files. Our test scripts extensively cover reading from existing repositories, some of them are even saved as tar balls, so I think this change is fully backward compatible. patch 4e4f497e3f651e6e887d56fb4c6a69015d5400f6 Author: Ben Franksen <ben.franksen@online.de> Date: Sat Aug 17 13:17:05 CEST 2019 * fix doc comment for filterDirContents We changed the behavior slightly a while ago. patch c788c68a4936b6643d1c8c870c83ada5e145e618 Author: Ben Franksen <ben.franksen@online.de> Date: Sat Aug 17 18:49:53 CEST 2019 * add a compatibility comment to readRecorded patch 1841803eea6976c8e7d32b661f21ece55d1537ea Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 10:00:45 CEST 2019 * simplify the type of readTentativeRepo Reading the tentative patches from a repo makes sense only for the current repo and we already require that this is located at ".". patch f674822894c2e40548ef1e5d8eb294b1f91f338a Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 11:11:02 CEST 2019 * tests/clone.sh: fix removing of repo dirs patch 9661ce4c1e5e7ee5bb0d7cf6f94350d0c6769edc Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 13:39:03 CEST 2019 * fix some haddocks in Darcs.Repository.Hashed patch 8f47d28a39aec5b33ab32289f841ba21fb270343 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:39:07 CEST 2019 * make changes to pending follow our standard transaction protocol Like for other changes to the repository, any change to pending is now first made to the tentative copy of the pending patch, then finalized by renaming the tentative copy back to the "real" pending. Commands that modify only on the pending patch (and/or the working tree: add, mark-conflicts, move, remove, replace, revert, rollback, setpref, and unrevert) now have to call finalizePending, except rollback and unrevert which already perform full finalization. The readPending we export from D.R.State now reads the tentative pending, falling back to the real pending in case we are not in a transaction. This is safe since the tentative pending is now guaranteed to be gone when the transaction is finalized. A bit of complication arises since the pending patch is intricately interwoven with reading Trees and many functions in D.R.State assumed it starts at the recorded state. This means we have to go all the way and make them refer to the tentative state instead. This is highly desirable anyway: previously these functions could only be used before any changes were made to the repository, a restriction that is now lifted. This percolates through to a few other modules under Darcs.Repository. In particular, tentativelyMergePatches can now be used after (tentatively) adding or removing patches, which allows us to remove the intermediate finalize/revert calls in applyPatchesForRebaseCmd. patch 41663730aeb89c8bdf827d81648fe2174cb3f232 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:40:40 CEST 2019 * require wR ~ wT for readRecorded Note that readRecorded does exactly what it claims: it reads the recorded state, ignoring the tentative state. But we call it everywhere, regardless of whether we are in a transaction or not. Since we don't tag states with witnesses, it seems safer to insist that no changes have been made to the repo, rather than risk that it gets called erroneously when in fact reading the current tentative state was intended. patch f817a945c46f02b9fb37e2fc327e156fdd983a93 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:46:46 CEST 2019 * make restrictSubpaths filters pure patch 5a877ebf6d69cae59de12e300b12de5142ab7ed5 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:51:20 CEST 2019 * diff command: use restrictSubpaths, not restrictSubpathsAfter The way this filter is used in the diff command translating paths through pending really make no sense. patch c664900a0bea0d5482af1deb33fcef4780f9eb33 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:51:31 CEST 2019 * extend comment about unsafety of addPendingDiffToPending patch fb3cae46ee99523c54b042d870e204a9777fb4da Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:51:39 CEST 2019 * rename setTentativePending to tentativelySetPending This is more in line with the convention for other changes to the tentative state. patch 6d0cf92107250cadbcc205abc9f7443f26897f93 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:53:06 CEST 2019 * cleanup imports in Darcs.Patch.PatchInfoAnd patch 23653c3fe926e7af11ddd5d0a0fe559e518fa840 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:53:38 CEST 2019 * cleanup messages and filepath/url concatentation in D.R.Cache patch f0c57a97373a9174b01322991d2fb3be6beeac8b Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:53:45 CEST 2019 * refactor representation of hashes and caching of hashed files This moves the hash related definitions (ValidHash, InventoryHash, PatchHash, PristineHash, etc) to Darcs.Repository.Cache and changes the internal representation of these hashes from String to ByteString (still b16 encoded). The same change is made for PatchSets and PatchInfoAnd. Conversion to String happens only at the boundary: when we need a FilePath for IO or to display messages. Since validation of hashes is now done on ByteStrings, we can afford to make it a bit more thorough. The type distinction between hashes for different file types allows the type checker to infer the name of the cache directory from the type. We therefore remove the HashedDir argument for most functions exported from Darcs.Repository.Cache. Attachments patch-preview.txt remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch unnamed
msg21152 (view)	Author: bfrk	Date: 2019-08-20.18:43:45
The only new patch is the last one that fully removes the RepoType cruft (sorry Ganesh ;-) 16 patches for repository http://darcs.net/screened: patch 6f51164b4d6f4f9b025ef6bd9008ac46ab8c74c2 Author: Ben Franksen <ben.franksen@online.de> Date: Sat Aug 17 11:30:37 CEST 2019 * remove the size prefix when writing hashed inventories and patches I guess the motivation for prefixing the (uncompressed) file size to the file name was to allow allocation of a ByteString of the exact size we need. It doesn't seem like this was ever realized. Note that for pristine.hashed we don't support size prefixes except for the top-level entry. Given that Trees use lazy ByteStrings for file content it is clear that the size information is useless for pristine files. Our test scripts extensively cover reading from existing repositories, some of them are even saved as tar balls, so I think this change is fully backward compatible. patch 4e4f497e3f651e6e887d56fb4c6a69015d5400f6 Author: Ben Franksen <ben.franksen@online.de> Date: Sat Aug 17 13:17:05 CEST 2019 * fix doc comment for filterDirContents We changed the behavior slightly a while ago. patch c788c68a4936b6643d1c8c870c83ada5e145e618 Author: Ben Franksen <ben.franksen@online.de> Date: Sat Aug 17 18:49:53 CEST 2019 * add a compatibility comment to readRecorded patch 1841803eea6976c8e7d32b661f21ece55d1537ea Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 10:00:45 CEST 2019 * simplify the type of readTentativeRepo Reading the tentative patches from a repo makes sense only for the current repo and we already require that this is located at ".". patch f674822894c2e40548ef1e5d8eb294b1f91f338a Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 11:11:02 CEST 2019 * tests/clone.sh: fix removing of repo dirs patch 9661ce4c1e5e7ee5bb0d7cf6f94350d0c6769edc Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 13:39:03 CEST 2019 * fix some haddocks in Darcs.Repository.Hashed patch 8f47d28a39aec5b33ab32289f841ba21fb270343 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:39:07 CEST 2019 * make changes to pending follow our standard transaction protocol Like for other changes to the repository, any change to pending is now first made to the tentative copy of the pending patch, then finalized by renaming the tentative copy back to the "real" pending. Commands that modify only on the pending patch (and/or the working tree: add, mark-conflicts, move, remove, replace, revert, rollback, setpref, and unrevert) now have to call finalizePending, except rollback and unrevert which already perform full finalization. The readPending we export from D.R.State now reads the tentative pending, falling back to the real pending in case we are not in a transaction. This is safe since the tentative pending is now guaranteed to be gone when the transaction is finalized. A bit of complication arises since the pending patch is intricately interwoven with reading Trees and many functions in D.R.State assumed it starts at the recorded state. This means we have to go all the way and make them refer to the tentative state instead. This is highly desirable anyway: previously these functions could only be used before any changes were made to the repository, a restriction that is now lifted. This percolates through to a few other modules under Darcs.Repository. In particular, tentativelyMergePatches can now be used after (tentatively) adding or removing patches, which allows us to remove the intermediate finalize/revert calls in applyPatchesForRebaseCmd. patch 41663730aeb89c8bdf827d81648fe2174cb3f232 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:40:40 CEST 2019 * require wR ~ wT for readRecorded Note that readRecorded does exactly what it claims: it reads the recorded state, ignoring the tentative state. But we call it everywhere, regardless of whether we are in a transaction or not. Since we don't tag states with witnesses, it seems safer to insist that no changes have been made to the repo, rather than risk that it gets called erroneously when in fact reading the current tentative state was intended. patch f817a945c46f02b9fb37e2fc327e156fdd983a93 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:46:46 CEST 2019 * make restrictSubpaths filters pure patch 5a877ebf6d69cae59de12e300b12de5142ab7ed5 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:51:20 CEST 2019 * diff command: use restrictSubpaths, not restrictSubpathsAfter The way this filter is used in the diff command translating paths through pending really make no sense. patch c664900a0bea0d5482af1deb33fcef4780f9eb33 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:51:31 CEST 2019 * extend comment about unsafety of addPendingDiffToPending patch fb3cae46ee99523c54b042d870e204a9777fb4da Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:51:39 CEST 2019 * rename setTentativePending to tentativelySetPending This is more in line with the convention for other changes to the tentative state. patch 6d0cf92107250cadbcc205abc9f7443f26897f93 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:53:06 CEST 2019 * cleanup imports in Darcs.Patch.PatchInfoAnd patch 23653c3fe926e7af11ddd5d0a0fe559e518fa840 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:53:38 CEST 2019 * cleanup messages and filepath/url concatentation in D.R.Cache patch f0c57a97373a9174b01322991d2fb3be6beeac8b Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:53:45 CEST 2019 * refactor representation of hashes and caching of hashed files This moves the hash related definitions (ValidHash, InventoryHash, PatchHash, PristineHash, etc) to Darcs.Repository.Cache and changes the internal representation of these hashes from String to ByteString (still b16 encoded). The same change is made for PatchSets and PatchInfoAnd. Conversion to String happens only at the boundary: when we need a FilePath for IO or to display messages. Since validation of hashes is now done on ByteStrings, we can afford to make it a bit more thorough. The type distinction between hashes for different file types allows the type checker to infer the name of the cache directory from the type. We therefore remove the HashedDir argument for most functions exported from Darcs.Repository.Cache. patch d4c00aff0baa693b3820d2d99b63c446246f9e7f Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 20:39:59 CEST 2019 * remove Darcs.Patch.RepoType Attachments patch-preview.txt remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch unnamed
msg21158 (view)	Author: bfrk	Date: 2019-08-21.07:15:14
Same as before, only rebased to no longer depend on patches we obliterated from screened. Regarding the last patch: Can you (Ganesh) tell me whether you agree with this, in principle? I am not invested much in this change, it just occurred to me that we don't actually need it anymore, since the rebase patch is now kept separate from normal patches. If you think we should keep it, perhaps because you have some plans to use it for a different purpose, just say so. 16 patches for repository http://darcs.net/screened: patch 6f51164b4d6f4f9b025ef6bd9008ac46ab8c74c2 Author: Ben Franksen <ben.franksen@online.de> Date: Sat Aug 17 11:30:37 CEST 2019 * remove the size prefix when writing hashed inventories and patches I guess the motivation for prefixing the (uncompressed) file size to the file name was to allow allocation of a ByteString of the exact size we need. It doesn't seem like this was ever realized. Note that for pristine.hashed we don't support size prefixes except for the top-level entry. Given that Trees use lazy ByteStrings for file content it is clear that the size information is useless for pristine files. Our test scripts extensively cover reading from existing repositories, some of them are even saved as tar balls, so I think this change is fully backward compatible. patch 4e4f497e3f651e6e887d56fb4c6a69015d5400f6 Author: Ben Franksen <ben.franksen@online.de> Date: Sat Aug 17 13:17:05 CEST 2019 * fix doc comment for filterDirContents We changed the behavior slightly a while ago. patch c788c68a4936b6643d1c8c870c83ada5e145e618 Author: Ben Franksen <ben.franksen@online.de> Date: Sat Aug 17 18:49:53 CEST 2019 * add a compatibility comment to readRecorded patch 1841803eea6976c8e7d32b661f21ece55d1537ea Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 10:00:45 CEST 2019 * simplify the type of readTentativeRepo Reading the tentative patches from a repo makes sense only for the current repo and we already require that this is located at ".". patch f674822894c2e40548ef1e5d8eb294b1f91f338a Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 11:11:02 CEST 2019 * tests/clone.sh: fix removing of repo dirs patch 9661ce4c1e5e7ee5bb0d7cf6f94350d0c6769edc Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 13:39:03 CEST 2019 * fix some haddocks in Darcs.Repository.Hashed patch 8f47d28a39aec5b33ab32289f841ba21fb270343 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:39:07 CEST 2019 * make changes to pending follow our standard transaction protocol Like for other changes to the repository, any change to pending is now first made to the tentative copy of the pending patch, then finalized by renaming the tentative copy back to the "real" pending. Commands that modify only on the pending patch (and/or the working tree: add, mark-conflicts, move, remove, replace, revert, rollback, setpref, and unrevert) now have to call finalizePending, except rollback and unrevert which already perform full finalization. The readPending we export from D.R.State now reads the tentative pending, falling back to the real pending in case we are not in a transaction. This is safe since the tentative pending is now guaranteed to be gone when the transaction is finalized. A bit of complication arises since the pending patch is intricately interwoven with reading Trees and many functions in D.R.State assumed it starts at the recorded state. This means we have to go all the way and make them refer to the tentative state instead. This is highly desirable anyway: previously these functions could only be used before any changes were made to the repository, a restriction that is now lifted. This percolates through to a few other modules under Darcs.Repository. In particular, tentativelyMergePatches can now be used after (tentatively) adding or removing patches, which allows us to remove the intermediate finalize/revert calls in applyPatchesForRebaseCmd. patch 41663730aeb89c8bdf827d81648fe2174cb3f232 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:40:40 CEST 2019 * require wR ~ wT for readRecorded Note that readRecorded does exactly what it claims: it reads the recorded state, ignoring the tentative state. But we call it everywhere, regardless of whether we are in a transaction or not. Since we don't tag states with witnesses, it seems safer to insist that no changes have been made to the repo, rather than risk that it gets called erroneously when in fact reading the current tentative state was intended. patch f817a945c46f02b9fb37e2fc327e156fdd983a93 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:46:46 CEST 2019 * make restrictSubpaths filters pure patch 5a877ebf6d69cae59de12e300b12de5142ab7ed5 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:51:20 CEST 2019 * diff command: use restrictSubpaths, not restrictSubpathsAfter The way this filter is used in the diff command translating paths through pending really make no sense. patch c664900a0bea0d5482af1deb33fcef4780f9eb33 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:51:31 CEST 2019 * extend comment about unsafety of addPendingDiffToPending patch fb3cae46ee99523c54b042d870e204a9777fb4da Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:51:39 CEST 2019 * rename setTentativePending to tentativelySetPending This is more in line with the convention for other changes to the tentative state. patch 23653c3fe926e7af11ddd5d0a0fe559e518fa840 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:53:38 CEST 2019 * cleanup messages and filepath/url concatentation in D.R.Cache patch ef95feec78454aeef1253e0fd441e713dffe43ca Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:53:06 CEST 2019 * cleanup imports in Darcs.Patch.PatchInfoAnd patch 7b10e505dbd744e93f05df74c07e12b9380dd584 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 16:53:45 CEST 2019 * refactor representation of hashes and caching of hashed files This moves the hash related definitions (ValidHash, InventoryHash, PatchHash, PristineHash, etc) to Darcs.Repository.Cache and changes the internal representation of these hashes from String to ByteString (still b16 encoded). The same change is made for PatchSets and PatchInfoAnd. Conversion to String happens only at the boundary: when we need a FilePath for IO or to display messages. Since validation of hashes is now done on ByteStrings, we can afford to make it a bit more thorough. The type distinction between hashes for different file types allows the type checker to infer the name of the cache directory from the type. We therefore remove the HashedDir argument for most functions exported from Darcs.Repository.Cache. patch dd9a49ab094c9c189a443f1db474aaa8f9ea3dc4 Author: Ben Franksen <ben.franksen@online.de> Date: Tue Aug 20 20:39:59 CEST 2019 * remove Darcs.Patch.RepoType Attachments patch-preview.txt remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch unnamed
msg21168 (view)	Author: ganesh	Date: 2019-08-23.22:19:23
Regarding the removal of RepoType: I have no particular objections if it's not serving any purpose any more. I actually originally introduced it, replacing the older "Rebasing p" wrapper type, to make it easier to introduce a 'stash' feature for suspending patches without changing their identity: https://hub.darcs.net/ganesh/darcs-stash-draft-20150914 But there was various discussion about the final UI that I never implemented and it'd now need a lot of work (and re-discussion) to do so. I guess it would also look quite similar to rebase as it is now so the arguments for keeping RepoType for it would be similar. I guess the remaining argument for keeping RepoType would be that it gives us static knowledge of the features of a repository (assuming that the code that launches RepoJobs is correct). I don't know how useful this is. Looking at current screened, I'm actually a bit confused about how some of the code works now :-) For example, how does the call in Darcs.Repository.Hashed.tentativelyAddPatches to withTentativeRebase work if there's no rebase in progress?
msg21169 (view)	Author: ganesh	Date: 2019-08-23.23:06:14
Regarding dropping the size prefix, there's some history to its existence that I can't really remember or fully understand from searching old emails. The size prefix was dropped for pristine in 2010: http://bugs.darcs.net/patch156 There was also some discussion about how to manage the transition (at least I think there was discussion, but I can only find my summing up): https://lists.osuosl.org/pipermail/darcs-users/2010- March/023475.html I guess my main questions about this would be (a) what do we gain by dropping it and (b) how will the transition work in existing repositories (will they end up with a mix of formats?) and in the cache (will we lose sharing?)
msg21182 (view)	Author: ganesh	Date: 2019-08-25.17:15:49
A further thought about the RepoType parameter: I can't see any use to keeping it on PatchInfoAnd, given that as you say we no longer drag a containing rebase patch in it. But I think it may have some value on Repository, as there is a difference between a rebase-in-progress repository and a "normal" one. I don't know whether that value outweighs the cost.
msg21190 (view)	Author: bfrk	Date: 2019-08-25.23:20:52
> Looking at current screened, I'm actually a bit confused > about how some of the code works now :-) For example, how does > the call in Darcs.Repository.Hashed.tentativelyAddPatches to > withTentativeRebase work if there's no rebase in progress? Ha! A good question. It took me a moment to find the answer. We are indeed updating the rebase patch, regardless of whether there is a rebase in progress. It's just that add/removeFixupsFromSuspended immediately drops every change since it commutes cleanly past the empty rebase patch!
msg21193 (view)	Author: bfrk	Date: 2019-08-26.06:18:13
> Regarding dropping the size prefix, there's some history to its > existence that I can't really remember or fully understand from > searching old emails. The size prefix was dropped for pristine in > 2010: > > http://bugs.darcs.net/patch156 Hm, interesting read, though mostly of historical interest. > I guess my main questions about this would be > (a) what do we gain by dropping it Code gets simpler and more uniform (also slightly more efficient). Same for the on-disk representation. One practical advantage (not yet implemented in the patch I sent) is that we can now share all pristine files with the cache, not just the top-level pristine tree. We cannot currently do that, because of the difference in format, unless we pull the case distinction into Darcs.Repository.Cache, which would further complicate the code in that module. Using the cache for all pristine files would speed up cloning of unpacked repos (over http) quite a bit, I think. > and > (b) how will the transition work in existing repositories (will > they end up with a mix of formats?) and in the cache (will we > lose sharing?) Note we already read both formats. The only change would be that we write without size-prefix. But yes, with the patch as I sent it, there will be a mix in existing repos and the cache... for a while. There will also be a certain amount of duplication of files in the cache. We could make the transition even less noticeable by hard-linking old and new formats whenever we find both. However, this means eventually getting rid of the old size-prefixed files will be a bit more difficult. Excursion: The way 'darcs optimize cache' currently works makes it practically unusable. Searching the home directory for darcs repos is extremely slow, while manually listing all the repos is completely out of the question for anyone who has any actual work to do. I guess what people do instead is to just rm -rf ~/.cache/darcs and accept that the next few clones take a bit longer than usual. I think we should just delete all files in the global cache that have a hard-link count of one. This would make it a pretty fast operation which means we could perform it implicitly on a regular basis, say, every 100 times a darcs command accesses the cache. This would help keeping the size of the cache at a more reasonable level, even in the long run. Yes, this removes cached files for repos that are not on the same file system. For people who are regularly working in this kind of setting (pretty uncommon nowadays, I guess) we can add an option to disable the automatic garbage collection. If there are hard-links between differently named files in the cache, this becomes a bit more complicated and less efficient: we'd also have to search for files with a hard-link count of two, then check if there are both versions in the cache and that they are linked, before deleting both.
msg21194 (view)	Author: bfrk	Date: 2019-08-26.07:37:20
> A further thought about the RepoType parameter: I can't see any use to > keeping it on PatchInfoAnd, given that as you say we no longer drag > a containing rebase patch in it. But I think it may have some value > on Repository, as there is a difference between a rebase-in-progress > repository and a "normal" one. I don't know whether that value > outweighs the cost. Regarding the cost/benefit ratio: I don't mind dragging around the 'rt' type parameter so much. What I do mind is that actually using it (either at runtime or at compile time) incurs this ridiculous amount of syntactic overhead. It makes type signatures longer and harder to read and the case distinctions themselves are riddled with complicated extra type annotations and singleton type parameters. The fact that working with singleton types in Haskell can be rather awkward has been recognized even by experts that are otherwise ardent supporters of dependent typing. I am a personally big fan of Haskell's strong static type system and I appreciate very much how it enables safe refactoring using some of the more advanced features like phantom ("witness") types. I guess it contributes a large part of the fun I have with hacking on darcs that I /can/ do these large refactors and still get it working afterwards without weeks of debugging. But there are limits. If the typing overhead gets to the point of obscuring code that is basically a plain and simple case distinction, then IMO that limit is reached. We should also keep in mind that any advanced type shenanigans make it harder for newcomers to contribute. And finally, trying to capture ever more things at the type level isn't the only, and perhaps not the best, way of making progress: re-writing complicated procedures so that it becomes obvious what they do and why, commenting anything that remains tricky, adding haddocks, consistent and succinct but still suggestive names for variables etc etc..., in other words improving code quality. This is what I have been doing for years now, and I think it has proven quite effective. Your use of RepoType when the rebase patch was mixed with normal patches was (barely) justified (though it really gave me quite a headache to understand it at first): it was important that we don't accidentally mix rebase-in-progress patches with normal ones. Nowadays, there is no longer any essential difference between a rebase-in-progress repository and a "normal" one. We still make a distinction when we start a RepoJob but I think that is no longer needed. So what if we use e.g. 'darcs rebase unsuspend' or 'darcs rebase log' and there is no rebase in progress? Nothing bad would happen at all. This could be handled just like we handle other "empty" cases i.e. simply report that there are no suspended patches after reading the rebase patch. We could even drop the final "Rebase in progress: N suspended patches" message; I find it a bit distracting since almost all of my repos nowadays have a few suspended patches. But I'm surely not the average user, and I don't mind if we keep it. (It should certainly be reported if we act on my proposal for an enhanced 'darcs status' command.)
msg21202 (view)	Author: bfrk	Date: 2019-08-26.10:00:59
It is good I haven't screened any of the patches yet. I am having doubts about some details of the "make changes to pending follow our standard transaction protocol" patch. I think the general idea is fine, but there are issues with first reading the tentative pending and falling back to reading the real pending if that fails. For this to work correctly we have to make sure that we clean up the tentative pending patch, even if the transaction fails (for any reason). Otherwise the next time a non-transaction command reads pending it gets the content of the aborted pending.tentative which is decidedly not good. This is missing. It is interesting that this bug is not found by any of the regression tests.
msg21327 (view)	Author: ganesh	Date: 2019-09-03.07:40:37
> One practical advantage (not yet implemented in the patch I sent) is > that we can now share all pristine files with the cache, not just the > top-level pristine tree. We cannot currently do that, because of the > difference in format, unless we pull the case distinction into > Darcs.Repository.Cache, which would further complicate the code in that > module. Using the cache for all pristine files would speed up cloning of > unpacked repos (over http) quite a bit, I think. I don't understand what you mean - I see a pristine.hashed subfolder in my cache that has recently updated individual pristine files in it. In what way are we not already using the cache for all pristine files? >> and >> (b) how will the transition work in existing repositories (will >> they end up with a mix of formats?) and in the cache (will we >> lose sharing?) > > Note we already read both formats. The only change would be that we > write without size-prefix. > > But yes, with the patch as I sent it, there will be a mix in existing > repos and the cache... for a while. There will also be a certain amount > of duplication of files in the cache. > > We could make the transition even less noticeable by hard-linking old > and new formats whenever we find both. However, this means eventually > getting rid of the old size-prefixed files will be a bit more difficult. You mean getting rid of the code to handle them, or the files themselves? I'd have thought the old files would persist a long time if people don't re-clone repos, as they'll be in the history. If we make sure to hard-link between the old and the new names in the cache it's probably not too painful. > Excursion: > > The way 'darcs optimize cache' currently works makes it practically > unusable. Searching the home directory for darcs repos is extremely > slow, while manually listing all the repos is completely out of the > question for anyone who has any actual work to do. I guess what people > do instead is to just rm -rf ~/.cache/darcs and accept that the next few > clones take a bit longer than usual. Yeah, I never really know how to do it. I don't clean up my cache very often. > I think we should just delete all files in the global cache that have a > hard-link count of one. The current behaviour was implemented in http://darcs.net/GSoC/2014-Hashed-Files-And-Cache http://bugs.darcs.net/issue2374 http://bugs.darcs.net/patch1180 I can't find any evidence that the idea of link counting was considered, which is a bit surprising (perhaps it's only obvious in retrospect). > This would make it a pretty fast operation which > means we could perform it implicitly on a regular basis, say, every 100 > times a darcs command accesses the cache. This would help keeping the > size of the cache at a more reasonable level, even in the long run. Yes, > this removes cached files for repos that are not on the same file > system. For people who are regularly working in this kind of setting > (pretty uncommon nowadays, I guess) we can add an option to disable the > automatic garbage collection. I guess the problem is that darcs would be slower for those people by default and they'd never know why. Filling your disk by default is also not a great option but personally I never find the disk space used by the cache significant even though I am often short of disk space. But I agree it'd be rare. And I can't see any other downsides to deleting cache files that aren't in a repository anywhere on disk. It would hurt people that occasionally clone a repo and then delete it again but I think that's an ok default. > If there are hard-links between differently named files in the cache, > this becomes a bit more complicated and less efficient: we'd also have > to search for files with a hard-link count of two, then check if there > are both versions in the cache and that they are linked, before deleting > both. We probably don't need to check they are linked, just that they both exist. If they both exist then delete if link count <=2 , if not then if <=1. If they both exist and aren't hard-linked then we might accidentally delete something that was linked elsewhere. But it ought not to happen if darcs has already done its job right, and it's not much of a problem if it does happen.
msg21338 (view)	Author: bfrk	Date: 2019-09-03.14:18:59
>> One practical advantage (not yet implemented in the patch I sent) is >> that we can now share all pristine files with the cache, not just the >> top-level pristine tree. We cannot currently do that, because of the >> difference in format, unless we pull the case distinction into >> Darcs.Repository.Cache, which would further complicate the code in that >> module. Using the cache for all pristine files would speed up cloning of >> unpacked repos (over http) quite a bit, I think. > > I don't understand what you mean - I see a pristine.hashed subfolder in > my cache that has recently updated individual pristine files in it. In > what way are we not already using the cache for all pristine files? We don't write pristine files via Darcs.Repository.Cache but only locally, see Darcs.Util.Tree.Hashed, except during fetch. Indeed, using writeFileUsingCache for pristine files would currently be pointless as this would write files with a size prefix. We also don't read from the cache unless we transfer pristine files e.g. during clone. So there is some sharing but it is limited to transfer operations. Suppose I record a new patch on top of a current clone of screened, then push that, then make a fresh clone of screened. This clone will not share any of the new pristine files with the old repo, so it must download them. Contrast that with inventories and patches, where we need not download a single file because all of them are already in the cache. >> We could make the transition even less noticeable by hard-linking old >> and new formats whenever we find both. However, this means eventually >> getting rid of the old size-prefixed files will be a bit more difficult. > > You mean getting rid of the code to handle them, or the files > themselves? The latter. Which is why I went on the excursion about optimize cache. > I'd have thought the old files would persist a long time if > people don't re-clone repos, as they'll be in the history. If we make > sure to hard-link between the old and the new names in the cache it's > probably not too painful. That's what I meant. But it would be nice if at some point we would automatically get rid of all the size-prefixed files. Not essential. >> Excursion: >> >> The way 'darcs optimize cache' currently works makes it practically >> unusable. Searching the home directory for darcs repos is extremely >> slow, while manually listing all the repos is completely out of the >> question for anyone who has any actual work to do. I guess what people >> do instead is to just rm -rf ~/.cache/darcs and accept that the next few >> clones take a bit longer than usual. > > Yeah, I never really know how to do it. I don't clean up my cache very > often. It happens for me usually when I switch to a new machine. >> I think we should just delete all files in the global cache that have a >> hard-link count of one. > > > The current behaviour was implemented in > http://darcs.net/GSoC/2014-Hashed-Files-And-Cache > > http://bugs.darcs.net/issue2374 > http://bugs.darcs.net/patch1180 > > I can't find any evidence that the idea of link counting was considered, > which is a bit surprising (perhaps it's only obvious in retrospect). I am still thinking about possible pitfalls. >> This would make it a pretty fast operation which >> means we could perform it implicitly on a regular basis, say, every 100 >> times a darcs command accesses the cache. This would help keeping the >> size of the cache at a more reasonable level, even in the long run. Yes, >> this removes cached files for repos that are not on the same file >> system. For people who are regularly working in this kind of setting >> (pretty uncommon nowadays, I guess) we can add an option to disable the >> automatic garbage collection. > > I guess the problem is that darcs would be slower for those people by > default and they'd never know why. Correct. Perhaps GCing the cache shouldn't be on by default. > Filling your disk by default is also > not a great option but personally I never find the disk space used by > the cache significant even though I am often short of disk space. I agree that nowadays a few GB for the cache aren't really a problem anymore. It used to be for me where I work because on our development machine we have quotas.This has regularly caused problems until I got the idea to set XDG_CACHE_HOME for everybody to a directory on the disk where we do our actual work and have lots of space. > But I agree it'd be rare. And I can't see any other downsides to > deleting cache files that aren't in a repository anywhere on disk. It > would hurt people that occasionally clone a repo and then delete it > again but I think that's an ok default. So what we should do is to implement this but not make the automatic GC the default. At least we will get a useful 'darcs optimize cache' command, automatically calling it is an optional feature. >> If there are hard-links between differently named files in the cache, >> this becomes a bit more complicated and less efficient: we'd also have >> to search for files with a hard-link count of two, then check if there >> are both versions in the cache and that they are linked, before deleting >> both. > > We probably don't need to check they are linked, just that they both > exist. If they both exist then delete if link count <=2 , if not then if > <=1. If they both exist and aren't hard-linked then we might > accidentally delete something that was linked elsewhere. But it ought > not to happen if darcs has already done its job right, and it's not much > of a problem if it does happen. Yeah, something like that should do it.

History
Date	User	Action	Args
2019-08-20 15:49:22	bfrk	create
2019-08-20 18:43:45	bfrk	set	files: + patch-preview.txt, remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch, unnamed messages: + msg21152
2019-08-21 07:15:16	bfrk	set	files: + patch-preview.txt, remove-the-size-prefix-when-writing-hashed-inventories-and-patches.dpatch, unnamed messages: + msg21158
2019-08-23 22:19:24	ganesh	set	messages: + msg21168
2019-08-23 23:06:14	ganesh	set	messages: + msg21169
2019-08-25 17:15:49	ganesh	set	messages: + msg21182
2019-08-25 23:20:53	bfrk	set	messages: + msg21190
2019-08-26 06:18:14	bfrk	set	messages: + msg21193
2019-08-26 07:37:20	bfrk	set	messages: + msg21194
2019-08-26 10:00:59	bfrk	set	messages: + msg21202
2019-09-03 07:40:38	ganesh	set	messages: + msg21327
2019-09-03 14:18:59	bfrk	set	messages: + msg21338
2019-09-21 10:30:22	bfrk	set	status: needs-screening -> in-discussion
2023-07-10 13:53:04	bfrk	set	status: in-discussion -> obsoleted

Patch 1887 remove the size prefix when writing hash... (and 14 more)

Attachments

Attachments

Attachments