darcs

Issue 1556: task: abandon tentative files and keep the information in memory

Priority: feature
Status: needs-implementation
Milestone:
Resolved in:
Superseder:
Nosy List: darcs-devel, ganesh, kowey
Assigned To:
Topics:

Created on 2009-08-23.17:57:59 by kowey, last changed 2020-07-31.21:59:10 by bf.

Messages
msg8413 Author: kowey Date: 2009-08-23.17:57:57
This is from David Roundy's msg5797 on issue992:

Anyhow, this should be combined with a safety refactor which would ensure that
the _darcs/hashed_inventory is only read once:  we should store its contents in
the Repository data structure, so we can't accidentally mix two views of a
remote repository during one command.  I don't think we currently make this
mistake, but it's troubling that we could.  

David goes on to comment on how this would fit into issue992:

Once this refactor is done (which
means that we'd read _darcs/hashed_inventory when first identifying the
Repository), we can easily make darcs read _darcs/inventories/xxx instead, if
the URL has some fancy format that includes a hash value.  Or if a file with
that hash isn't present in _darcs/inventories/ we'd look at
_darcs/hashed_inventory to see if that has the right hash.  This feature will
enable self-authenticating URLs, albeit URLs that only describe a specific version.
msg12472 Author: kowey Date: 2010-09-06.11:47:30
Petr says this is already done as part of his adventure refactor:
http://irclog.perlgeek.de/darcs/2010-09-06#i_2790255
msg17406 Author: gh Date: 2014-04-28.19:46:57
I think this was never ported from adventure to HEAD, so marking it as
"needs-implementation" again.
msg18751 Author: gh Date: 2015-09-22.14:36:46
I would extend the scope of the proposal, following Petr's observations:

> But the actual motivation for that was getting rid of the tentative
> files, which are superfluous and, to some extent, dangerous.
> You don't need to dump the intermediate states to disk, really.
> And they are dangerous because the API allows direct access to both
> non-tentative and tentative stuff.

So the scope would be:

* always read hashed_inventory once
* never write tentative_hashed_inventory, tentative_pristine and maybe
pending.tentative, instead keep them in memory.

That would reduce Darcs' filesystem IO footprint, which is welcome
especially in cases like repositories in sshfs or dropbox.
msg20215 Author: bf Date: 2018-07-19.11:23:02
Abandoning the tentative files and instead keeping them in memory sounds
like a viable alternative to my idea of tracking the transaction state
in a type witness.

However, we must make sure that we do not accidentally keep the whole
repo in memory, only the head inventory.

Another obvious idea is to abandon hashed_inventory (including the
tentative version). Instead, always store the head inventory hashed and
keep only its hash in a file under _darcs. It may be difficult to make
this change in a backward compatible way, though.
msg20216 Author: bf Date: 2018-07-21.15:06:32
Looking at the code for writeTentativeInventory reveals that we already
store the _darcs/hashed_inventory (minus the pristine hash) inside
_darcs/inventories in standard hashed format. Thus we could condense
_darcs/hashed_inventory into a file that contains only the pristine and
inventory hashes, e.g. _darcs/current_state. Reading this file, caching
the two hashes in the Repository token, and updating it whenever the
repo is modified would be cheap and could be done for each command. We
should be careful to write the parser for this file in a way that allows
future extensions of the format, e.g. for when we add versions or branches.
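A minimal sketch of such an extensible parser, assuming a line-oriented "key: value" layout for _darcs/current_state. The layout, the key names "inventory" and "pristine", and all Haskell names below are hypothetical and not taken from the Darcs code base. Keys this version does not understand are kept and written back unchanged, which is one way to leave room for later additions:

```haskell
import qualified Data.Map.Strict as M
import Data.Char (isSpace)
import Data.Maybe (mapMaybe)

-- Hypothetical in-memory form of _darcs/current_state.
data CurrentState = CurrentState
  { inventoryHash :: String
  , pristineHash  :: String
  , extraFields   :: M.Map String String  -- keys this version does not understand
  } deriving (Eq, Show)

-- Strip leading and trailing whitespace.
trim :: String -> String
trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Split "key: value" at the first colon; lines without a key or colon are skipped.
parseField :: String -> Maybe (String, String)
parseField line = case break (== ':') line of
  (key, ':' : value) | not (null (trim key)) -> Just (trim key, trim value)
  _ -> Nothing

-- Fails only if one of the two required hashes is missing.
parseCurrentState :: String -> Maybe CurrentState
parseCurrentState txt = do
  let fields = M.fromList (mapMaybe parseField (lines txt))
  inv  <- M.lookup "inventory" fields
  pris <- M.lookup "pristine" fields
  return CurrentState
    { inventoryHash = inv
    , pristineHash  = pris
    , extraFields   = M.delete "inventory" (M.delete "pristine" fields)
    }

-- Write the known keys first, then any unknown keys unchanged.
renderCurrentState :: CurrentState -> String
renderCurrentState st = unlines
  [ k ++ ": " ++ v
  | (k, v) <- ("inventory", inventoryHash st)
            : ("pristine", pristineHash st)
            : M.toList (extraFields st)
  ]
```

An older Darcs reading a file written by a newer one would then ignore the additional keys rather than fail, as long as the two required keys stay in place.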

We must continue to write hashed_inventory, though, to remain compatible
with previous versions of Darcs. So here is my revised plan:

* always read current_state and pending once
* never write tentative_hashed_inventory, tentative_pristine, and
pending.tentative, instead keep them in memory as members of Repository
* on finalization, write hashed_inventory, pending, and current_state
each atomically by first writing to a temporary name and then renaming
(like we already do for pending)
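The finalization step could look roughly like the sketch below. The function names, the ".tmp" suffix and the use of plain Strings are assumptions made for illustration; the only library calls are writeFile from the Prelude and renameFile from System.Directory, which replaces an existing target atomically. The sketch does not flush data to disk before renaming:

```haskell
import System.Directory (renameFile)
import System.FilePath ((</>), (<.>))

-- Write the new contents under a temporary name in the same directory,
-- then rename over the target, so readers see either the old or the new file.
writeAtomic :: FilePath -> String -> IO ()
writeAtomic target contents = do
  let tmp = target <.> "tmp"  -- hypothetical temporary name
  writeFile tmp contents
  renameFile tmp target

-- Hypothetical finalization: the three files are each replaced atomically.
-- The arguments are the already serialised contents of each file.
finalize :: FilePath -> String -> String -> String -> IO ()
finalize darcsDir inventory pending currentState = do
  writeAtomic (darcsDir </> "hashed_inventory") inventory
  writeAtomic (darcsDir </> "patches" </> "pending") pending
  writeAtomic (darcsDir </> "current_state") currentState
```

Each rename is atomic on its own, but the three together are not: an interruption between two renames leaves the files out of step with each other, so the order of the writes matters.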
msg20217 Author: bf Date: 2018-07-21.15:14:49
This move could be accompanied by adding a Repository witness for the
pending state (wP).
msg20225 Author: bf Date: 2018-07-23.08:48:53
I am currently pursuing the idea of making this change in a way that is
compatible with and prepares for future extensions regarding in-repo
branches.

The idea is to condense hashed_inventory and patches/pending into small
files that contain three hashes: the inventory hash, the pristine hash,
and the pending hash. This requires the current head inventory to be
hashed but apparently we already do that (see writeTentativeInventory).
We also need to hash pending, which we do not do yet.

These branch files live under a new directory _darcs/branches. The first
step adds only a single branch named "current". We read that file once
when we identify a repo, falling back to reading the old special files
if it does not exist, and creating the branch file if this is our local
repository. The Repository type is extended to contain the branch data,
consisting of the three hashes and the branch name. On finalization we
atomically write the branch back to disk (but we need to maintain the
old special files, too, for compatibility).
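As a rough sketch of the read-once-with-fallback step, assuming a simplified Repository type: all names are hypothetical, the path of the branch file is a parameter, reading the old special files is passed in as an action, and show/read serve purely as a placeholder for the real file format.

```haskell
import Control.Monad (when)
import System.Directory (doesFileExist)
import Text.Read (readMaybe)

-- Hypothetical branch data: the branch name plus the three hashes.
data Branch = Branch
  { branchName    :: String
  , inventoryHash :: String
  , pristineHash  :: String
  , pendingHash   :: String
  } deriving (Eq, Show, Read)

-- Simplified stand-in for the Repository type, extended with the branch data.
data Repository = Repository
  { repoIsLocal :: Bool
  , repoBranch  :: Branch  -- read once, when the repository is identified
  }

-- 'readLegacy' reconstructs the branch from the old special files.
identifyRepository :: IO Branch -> Bool -> FilePath -> IO Repository
identifyRepository readLegacy isLocal branchFile = do
  exists <- doesFileExist branchFile
  if exists
    then do
      txt <- readFile branchFile
      case readMaybe txt of
        Just b  -> return (Repository isLocal b)
        -- Do not silently overwrite a branch file we cannot parse.
        Nothing -> ioError (userError ("cannot parse " ++ branchFile))
    else do
      b <- readLegacy
      -- Only create the branch file in our own (local) repository.
      when isLocal (writeFile branchFile (show b))
      return (Repository isLocal b)
```

Every later operation in the same command would then consult repoBranch instead of going back to the files.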

If we take care to make the format extensible (in a forward and backward
compatible way) we can add more hashes later, e.g. for the unrevert
bundle or the rebase patch, once these are converted to a hashable
format or uncoupled from the normal patches, respectively.

I much prefer this refactor over the idea of tracking transaction mode
in the types.
msg20226 Author: bf Date: 2018-07-23.13:27:25
Okay, if this is a preparation for branches, then I want to get things
right. I made a conceptual mistake when I proposed a
_darcs/branches/current file.

So a named branch is a file under _darcs/branches named like the branch.
We need exactly one branch to be active ("checked out" in git terms).
How do we represent that? A simple solution is to maintain a copy of the
active branch file as _darcs/active_branch. We could also store just the
branch name in _darcs/active_branch (akin to a symbolic link); but that
doesn't fit well with the idea of introducing branches step-by-step:
we'd have to conjure up a name for the single branch in a traditional
single-branch-repo. So, for the first step we don't actually need the
_darcs/branches directory, just a single file _darcs/active_branch or
perhaps just _darcs/active.
msg20227 Author: bf Date: 2018-07-25.11:55:42
The more I hack on this the more I like it.

This refactor turns the Repository API into something much more
functional in style than it used to be. While most of the functions
still return IO actions (because we have to read and write the hashed
store), we largely avoid /stateful/ IO: what we read and write is
determined by the hashes, so we don't actually modify state in a
semantic sense... except right at the end when we finalize the branch
data for consumption by subsequent darcs commands. The whole "tentative"
vs. "recorded" distinction simply disappears.
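
To illustrate the style only, here is a small sketch with hypothetical names and none of the actual Darcs types. Operations return a new Repository value that points at different hashes, and the single stateful step is the final write of the branch data:

```haskell
-- All names here are hypothetical.
newtype Hash = Hash String deriving (Eq, Show)

data Branch = Branch
  { inventoryHash, pristineHash, pendingHash :: Hash
  } deriving (Eq, Show)

newtype Repository = Repository { repoBranch :: Branch }

-- Point the repository at a new head inventory. The old value is untouched,
-- so there is no separate "tentative" copy to keep in sync.
withInventory :: Hash -> Repository -> Repository
withInventory h r = r { repoBranch = (repoBranch r) { inventoryHash = h } }

-- Likewise for the pristine tree.
withPristine :: Hash -> Repository -> Repository
withPristine h r = r { repoBranch = (repoBranch r) { pristineHash = h } }

-- The only step that changes what later commands will see. The writer is
-- passed in; it would be whatever stores the branch data on disk.
finalize :: (Branch -> IO ()) -> Repository -> IO ()
finalize writeBranch = writeBranch . repoBranch
```

If a command fails before finalize runs, nothing visible has changed, and the hashed objects written along the way are merely unreferenced.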
History
Date                 User   Action  Args
2009-08-23 17:57:59  kowey  create
2009-08-23 18:00:41  kowey  link    issue992 superseder
2009-08-25 18:16:07  admin  set     nosy: + darcs-devel, - simon
2009-08-27 14:30:49  admin  set     nosy: kowey, darcs-devel, thorkilnaur, dmitry.kurochkin
2009-09-06 21:04:02  kowey  set     topic: + Hashed
                                    nosy: kowey, darcs-devel, thorkilnaur, dmitry.kurochkin
2010-04-03 12:06:40  kowey  set     topic: + Library
2010-09-06 11:47:31  kowey  set     status: needs-implementation -> has-patch
                                    assignedto: mornfall
                                    messages: + msg12472
                                    nosy: + mornfall
2014-04-28 19:46:58  gh     set     status: has-patch -> needs-implementation
                                    messages: + msg17406
2015-09-22 14:36:48  gh     set     topic: - Hashed, Library
                                    nosy: + ganesh, - thorkilnaur, dmitry.kurochkin, mornfall
                                    milestone: 2.12.0
                                    messages: + msg18751
                                    assignedto: mornfall ->
2018-07-19 11:23:04  bf     set     messages: + msg20215
2018-07-21 15:06:33  bf     set     messages: + msg20216
                                    title: task: safety refactor to ensure that hashed_inventory is only read once -> task: abandon tentative files and keep the information in memory
2018-07-21 15:14:50  bf     set     messages: + msg20217
2018-07-23 08:48:54  bf     set     messages: + msg20225
2018-07-23 13:27:26  bf     set     messages: + msg20226
2018-07-23 13:28:21  bf     set     milestone: 2.12.0 -> 2.14.2
2018-07-25 11:55:44  bf     set     messages: + msg20227
2020-07-31 21:59:10  bf     set     milestone: 2.14.2 ->