darcs

Patch 2135 index refactors

Title index refactors
Superseder Nosy List bfrk
Related Issues
Status accepted Assigned To
Milestone

Created on 2020-12-20.14:35:40 by bfrk, last changed 2021-03-22.06:23:37 by ganesh.

Files
File name Status Uploaded Type Edit Remove
avoid-useless-cpp-in-d_u_index-and-d_r_state.dpatch bfrk, 2020-12-20.14:35:38 application/x-darcs-patch
patch-preview.txt bfrk, 2020-12-20.14:35:38 text/x-darcs-patch
patch-preview.txt bfrk, 2020-12-27.20:09:28 text/x-darcs-patch
remove-catchall-from-show-index-command.dpatch bfrk, 2020-12-27.20:09:28 application/x-darcs-patch
unnamed bfrk, 2020-12-20.14:35:38 text/plain
unnamed bfrk, 2020-12-27.20:09:28 text/plain
See mailing list archives for discussion on individual patches.
Messages
msg22579 (view) Author: bfrk Date: 2020-12-20.14:35:38
Refactors in and around the index. This is mostly cleanup (with some
expected performance improvements) except for the patches that simplify the
index invalidating logic. For some of the changes I expect performance
improvements on Windows (mainly when updating the index after the user made
lots of changes), but I haven't measured them.

21 patches for repository http://darcs.net/screened:

patch 3ae83321a2b5fcf0c08cdcf6f1ac1cf19266e127
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Oct 28 00:20:18 CET 2020
  * avoid useless CPP in D.U.Index and D.R.State
  
  We can use the same action (renameFile instead of removeFile) for both
  Windows and Posix. The point here is that the index may still be open via an
  mmapped ForeignPtr, so removing it is not portable, but renaming it should
  work on all systems.

patch 731e44cba1951fdfcf418f033a3959914eef907c
Author: Ben Franksen <ben.franksen@online.de>
Date:   Thu Oct 29 08:59:33 CET 2020
  * index: re-use getFileStatus for getFileID' (Posix)

patch b4b0b8bd6ce7a877da4d0dddc71f0fcebd50cb28
Author: Ben Franksen <ben.franksen@online.de>
Date:   Thu Oct 29 09:11:45 CET 2020
  * index: move update of fileid out of createItem
  
  This is a preparation for a later refactor.

patch 2b527287ffef73537f33bb61fe667424d28b2970
Author: Ben Franksen <ben.franksen@online.de>
Date:   Thu Oct 29 09:15:35 CET 2020
  * index: use realPath instead of anchorPath ""

patch 25d707921b66aa1397478b63cead3a7c851a298f
Author: Ben Franksen <ben.franksen@online.de>
Date:   Thu Oct 29 11:07:39 CET 2020
  * index: in the docs, replace "line" with "Item", clarify size/alignment

patch 72d406e4401cdeefa2a6a5cbc1019d03ac41fca7
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Oct 30 10:53:36 CET 2020
  * show index: show everything, not just hash, id, and path

patch 1a2129f7cc702331539cc987ab41f67bedc6dcd4
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Oct 30 11:38:55 CET 2020
  * darcs check: add a warning when we cannot access the index
  
  This was already done for all other "read-only" commands.

patch 697902978ac4e171aad080a1baa57e34f3531bbf
Author: Ben Franksen <ben.franksen@online.de>
Date:   Mon Oct 26 18:33:10 CET 2020
  * Darcs.Util.Index: rename updateIndex to treeFromIndex
  
  This function does not update the index, it merely creates a Tree out of an
  Index. This also avoids the name collision with the function from
  Darcs.Repository.State which does update the index. Also add haddocks and
  rename a local variable that denotes a tree, not an index.

patch e51d37deed0bcf9e9d7216225b425118591b1eb5
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Oct 28 00:35:36 CET 2020
  * rename D.Util.Index.readIndex to openIndex
  
  This is a more fitting name since this function merely creates a cursor for
  the index without actually reading anything interesting (besides the magic
  word). It also resolves the last name conflict in D.R.State which means we
  can now import D.Util.Index unqualified.

patch 0c003b272def1c5c65cdc6827905acc2c84d5429
Author: Ben Franksen <ben.franksen@online.de>
Date:   Thu Oct 29 15:01:41 CET 2020
  * simplify the logic that handles existence and validity of the index

patch 7ee5c726892e78168248b33004bf1691770b0060
Author: Ben Franksen <ben.franksen@online.de>
Date:   Mon Nov  2 18:16:15 CET 2020
  * add some debug messages (index, patch index)

patch b78855ff7da3c19be0875a4f2c1e3fd3f341d18b
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Nov  4 11:24:37 CET 2020
  * index: update a few comments

patch 918a571e4f3081c5c3426104bb12e124b0497dca
Author: Ben Franksen <ben.franksen@online.de>
Date:   Thu Oct 29 10:42:32 CET 2020
  * index: remove unnecessary conversions

patch b22a29bcafb0cdbdb5fe2b6b5a315948a5136d16
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Oct 28 00:53:16 CET 2020
  * minor refactor in D.Util.Index.readDir

patch a569d04d6baed9f274f36a4aa95dd231ca697985
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Oct 28 23:50:15 CET 2020
  * use file IDs from unix-compat
  
  The Windows specific functionality in the index code to get at the Win32
  equivalent to Unix inode numbers is covered by the unix-compat package since
  version 0.4.1.3. This avoids unnecessary extra calls to createFile.

patch 52ce6f1c417cd95f2cb589f69d1c5dd4dd03b893
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Nov  4 08:49:29 CET 2020
  * index: make readDir/readFile local to readItem

patch fb8c363e683e926df9416753db33d307e2303ff5
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Nov  4 08:59:48 CET 2020
  * index: add progress reports when "reading" (actually updating) the index
  
  Indeed reading the index involves updating items: file size, timestamp, file
  ID, and content hash are updated in case they aren't properly indexed yet.
  This can take quite some time if the index is generated for the first time
  or there are lots of changes in the working tree.

patch f93aa806baf19528f73eec5ca9c2d21d834e1bf6
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Nov  4 08:52:51 CET 2020
  * index: use iPath instead of realPath
  
  This is more consistent and perhaps a bit faster.

patch ae8954802a462e96872abb169b3e215b9a977d8b
Author: Ben Franksen <ben.franksen@online.de>
Date:   Wed Oct 28 08:54:20 CET 2020
  * fix haddocks for D.U.Index.updateItem

patch 065c5a62dea13e7aa4ec87197ce6b08de96c2fec
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Oct 30 09:53:08 CET 2020
  * remove invalidateIndex
  
  Instead of creating _darcs/index_invalid we now directly call updateIndex
  where necessary. This was already done in finalizeRepositoryChanges (and
  also in the add and move commands) in order to support --look-for-moves
  which requires an up-to-date index. This is now handled uniformly by adding
  a call to updateIndex inside addToPending and addPendingDiffToPending, which
  are the two functions called by commands that modify only the pending patch
  and not any recorded patches and thus do not currently call
  finalizeRepositoryChanges. The UI layer no longer needs to concern itself
  with the index at all (other than passing down the value of the useIndex
  option).
  
  For compatibility we still check if _darcs/index_invalid exists and
  re-create the index in that case.

patch 01348e7c33365ef0836c443354ea1cc559431a80
Author: Ben Franksen <ben.franksen@online.de>
Date:   Fri Oct 30 09:02:26 CET 2020
  * D.R.State: build w/o warning with TEST_INDEX=1
Attachments
msg22610 (view) Author: ganesh Date: 2020-12-27.13:17:53
Partial review:

>   * avoid useless CPP in D.U.Index and D.R.State
>   
>   We can use the same action (renameFile instead of removeFile) for both
>   Windows and Posix. The point here is that the index may still be open via an
>   mmapped ForeignPtr, so removing it is not portable, but renaming it should
>   work on all systems.

> - -- TODO this conditional logic (rename or delete) is mirrored in
> - -- Darcs.Repository.State.checkIndex and should be refactored

I think it'd still be good to have a single function, or to keep the
TODO pointing between the two bits of code, even though the logic is
only one line now.

That way if someone does want to change it in future they won't
accidentally change one and not the other.

>   * index: re-use getFileStatus for getFileID' (Posix)

Is this intended to be behaviour-preserving? getFileStatus (defined in
the same file) seems to behave differently on symbolic links to the old
code.

>   * index: move update of fileid out of createItem
>   
>   This is a preparation for a later refactor.

OK - I shall assume the later refactor is worthwhile :-)

Does the doc comment for createItem need to change?

>   * index: use realPath instead of anchorPath ""

OK

>   * index: in the docs, replace "line" with "Item", clarify size/alignment

OK


>   * show index: show everything, not just hash, id, and path

[Darcs.UI.Commands.ShowIndex]
>  entries <- dumpIndex indexPath `catchall` return []

Should we avoid introducing new uses of catchall? Is there a specific
kind we could catch here?

>   * darcs check: add a warning when we cannot access the index
>   * Darcs.Util.Index: rename updateIndex to treeFromIndex
>   * rename D.Util.Index.readIndex to openIndex

OK
msg22612 (view) Author: bfrk Date: 2020-12-27.19:34:03
Am 27.12.20 um 14:17 schrieb Ganesh Sittampalam:
> 
> Ganesh Sittampalam <ganesh@earth.li> added the comment:
> 
> Partial review:
> 
>>   * avoid useless CPP in D.U.Index and D.R.State
>>   
>>   We can use the same action (renameFile instead of removeFile) for both
>>   Windows and Posix. The point here is that the index may still be open via an
>>   mmapped ForeignPtr, so removing it is not portable, but renaming it should
>>   work on all systems.
> 
>> - -- TODO this conditional logic (rename or delete) is mirrored in
>> - -- Darcs.Repository.State.checkIndex and should be refactored
> 
> I think it'd still be good to have a single function, or to keep the
> TODO pointing between the two bits of code, even though the logic is
> only one line now.
> 
> That way if someone does want to change it in future they won't
> accidentally change one and not the other.

This is indeed quite tricky. In the latest version I do have an extra
comment in both places. However, I am not 100% sure the reason we need
to use renameFile and not removeFile is the same. Inside
Darcs.Util.Index, the index is still open (mmapped) at this point
(creating the new index involves reading from the old one), so it makes
sense to me that we should not remove it, at least on Windows (on Posix
systems, removing a file does not invalidate access to an open handle to
that file). But in Darcs.Repository.State.checkIndex this should not be
the case; still, using removeFile there /also/ fails (with an "access
denied") and in this case it is pretty unclear to me why.

Can we postpone this question until after you have reviewed the full
bundle and seen the related changes, including the new comment I added?

>>   * index: re-use getFileStatus for getFileID' (Posix)
> 
> Is this intended to be behaviour-preserving?> getFileStatus (defined in
> the same file) seems to behave differently on symbolic links to the old
> code.

It is well possible that the new behavior differs in some corner cases
from the old one. Messy IO code like this is always difficult to reason
about ;-) I should also mention that the file IDs in the index are there
exclusively to detect file moves. So even if I made a mistake here, it
is likely that this does not break anything essential.

Anyway, I claim that the new code does what we want it to do (apart from
the extra warning): it should treat symbolic links as non-existent,
regardless of where the link points to. Note that
Darcs.Util.Index.getFileStatus calls Darcs.Util.File.getFileStatus which
calls getSymbolicLinkStatus and then returns Nothing if this fails with
an exception. This means that the existence tests are redundant (they
will cause exceptions and thus a return Nothing). If it exists and is a
symbolic link, Darcs.Util.Index.getFileStatus issues a warning and
returns Nothing. So the new code is "correct" in that it behaves as
specified above.

The extra warning is unwanted but is fixed in a later patch "use file
IDs from unix-compat" where we completely remove getFileID' and always
take it from the file status. This patch should be seen as a preparation
for that.

>>   * index: move update of fileid out of createItem
>>   
>>   This is a preparation for a later refactor.
> 
> OK - I shall assume the later refactor is worthwhile :-)

I think you will like it: it is the one I mentioned above, where we get
rid of getFileID' (including the CPP).

> Does the doc comment for createItem need to change?

No, I just checked and it still seems to be "correct" i.e. vague enough
not to be outright wrong ;-)

>>   * show index: show everything, not just hash, id, and path
> 
> [Darcs.UI.Commands.ShowIndex]
>>  entries <- dumpIndex indexPath `catchall` return []
> 
> Should we avoid introducing new uses of catchall? Is there a specific
> kind we could catch here?

Hmm. I must admit I was lazy here. My excuse is that 'darcs show index'
is more a debugging aid than a proper user level command. I think the
right thing to do here is to simply remove the catchall, which has the
same effect as catching and printing the exception and then aborting. I
just tried that and tested it by removing _darcs/index, which now gives me:

>darcs show index
darcs: _darcs/index: opening of '_darcs/index' failed: does not exist
(No such file or directory)

Will send a follow-up patch.
msg22613 (view) Author: bfrk Date: 2020-12-27.20:09:28
Following up on review.

1 patch for repository http://darcs.net/screened:

patch ff28712e9080cc63bbc272e945c8a0e664a69709
Author: Ben Franksen <ben.franksen@online.de>
Date:   Sun Dec 27 20:25:54 CET 2020
  * remove catchall from show index command
  
  This is a debugging aid and it is better to let it fail than hide problems.
Attachments
msg22617 (view) Author: bfrk Date: 2021-01-10.13:02:27
I have just now screened the follow-up patch.
msg22657 (view) Author: ganesh Date: 2021-03-21.22:15:59
Sorry for the long delay.

>   * simplify the logic that handles existence and validity of the index

OK - there are some slightly subtle changes here, but they seem fine:

 - In checkIndex, we now rely on indexFormatValid return False
   if the file doesn't exst
 - The 'invalid' file is just removed unconditionally instead of
   tracking its existence

>   * add some debug messages (index, patch index)

OK


>   * index: update a few comments

OK


>   * index: remove unnecessary conversions

OK


>   * minor refactor in D.Util.Index.readDir

OK

>   * use file IDs from unix-compat

OK (I didn't check every detail)

>   * index: make readDir/readFile local to readItem

OK (I assume it's just a code move)


>   * index: add progress reports when "reading" (actually updating) the index

OK

>   * index: use iPath instead of realPath

OK

>   * fix haddocks for D.U.Index.updateItem

OK

>   * remove invalidateIndex

>   Instead of creating _darcs/index_invalid we now directly call updateIndex
>   where necessary. This was already done in finalizeRepositoryChanges (and
>   also in the add and move commands) in order to support --look-for-moves
>   which requires an up-to-date index. This is now handled uniformly by adding
>   a call to updateIndex inside addToPending and addPendingDiffToPending, which
>   are the two functions called by commands that modify only the pending patch
>   and not any recorded patches and thus do not currently call
>   finalizeRepositoryChanges. The UI layer no longer needs to concern itself
>   with the index at all (other than passing down the value of the useIndex
>   option).
>   
>   For compatibility we still check if _darcs/index_invalid exists and
>   re-create the index in that case.

OK (I have not checked in detail that every path still updates the index
where needed, but your explanation makes sense).

>   * D.R.State: build w/o warning with TEST_INDEX=1

OK
msg22658 (view) Author: ganesh Date: 2021-03-21.22:16:46
>   * remove catchall from show index command

OK
msg22659 (view) Author: ganesh Date: 2021-03-21.22:22:31
>>> - -- TODO this conditional logic (rename or delete) is mirrored in
>>> - -- Darcs.Repository.State.checkIndex and should be refactored
>>
>> I think it'd still be good to have a single function, or to keep the
>> TODO pointing between the two bits of code, even though the logic is
>> only one line now.
>>
>> That way if someone does want to change it in future they won't
>> accidentally change one and not the other.
> 
> [...]
> Can we postpone this question until after you have reviewed the full
> bundle and seen the related changes, including the new comment I added?

OK, I'm at least somewhat convinced from reading the code as it is
with this whole bundle applied.

>>>   * index: re-use getFileStatus for getFileID' (Posix)
>>
>> Is this intended to be behaviour-preserving?> getFileStatus (defined in
>> the same file) seems to behave differently on symbolic links to the old
>> code.
> 
> It is well possible that the new behavior differs in some corner cases
> from the old one. Messy IO code like this is always difficult to reason
> about ;-) I should also mention that the file IDs in the index are there
> exclusively to detect file moves. So even if I made a mistake here, it
> is likely that this does not break anything essential.
> 
> Anyway, I claim that the new code does what we want it to do (apart from
> the extra warning): it should treat symbolic links as non-existent,
> regardless of where the link points to. [...]

OK. I guess if this affects any users we'll hear from them in due course!
History
Date User Action Args
2020-12-20 14:35:40bfrkcreate
2020-12-20 14:38:33bfrksetstatus: needs-screening -> needs-review
2020-12-27 13:17:55ganeshsetmessages: + msg22610
2020-12-27 19:34:07bfrksetmessages: + msg22612
2020-12-27 20:09:28bfrksetfiles: + patch-preview.txt, remove-catchall-from-show-index-command.dpatch, unnamed
messages: + msg22613
2021-01-10 13:02:27bfrksetmessages: + msg22617
2021-01-10 13:03:17bfrksetstatus: needs-review -> review-in-progress
2021-03-21 22:16:01ganeshsetmessages: + msg22657
2021-03-21 22:16:46ganeshsetmessages: + msg22658
2021-03-21 22:22:33ganeshsetmessages: + msg22659
2021-03-21 22:22:38ganeshsetstatus: review-in-progress -> accepted-pending-tests
2021-03-22 06:23:37ganeshsetstatus: accepted-pending-tests -> accepted