darcs

Patch 351 Optimize darcs diff.

Title Optimize darcs diff.
Superseder adventure branch
View: 374
Nosy List aslatter, ganesh, mndrix, mornfall
Related Issues
Status accepted Assigned To
Milestone

Created on 2010-08-18.21:00:47 by mornfall, last changed 2012-04-20.17:23:32 by mndrix. Tracked on DarcsWatch.

Files
File name Status Uploaded Type Edit Remove
first-stab-at-a-hashed_storage-0_6-port_.dpatch mornfall, 2010-08-18.21:00:46 text/x-darcs-patch
optimize-darcs-diff.dpatch gh, 2012-03-14.22:53:05 application/x-darcs-patch
optimize-darcs-diff.dpatch gh, 2012-03-15.00:33:02 application/x-darcs-patch
patch-preview.txt gh, 2012-03-14.22:53:05 text/x-darcs-patch
patch-preview.txt gh, 2012-03-15.00:33:02 text/x-darcs-patch
unnamed mornfall, 2010-08-18.21:00:46
unnamed gh, 2012-03-14.22:53:05
unnamed gh, 2012-03-15.00:33:02
See mailing list archives for discussion on individual patches.
Messages
msg12224 (view) Author: mornfall Date: 2010-08-18.21:00:46
Hi,

I have changed the diff code to only write those files that actually changed in
the temporary locations. On my other project (less than 1000 working copy
files):

(with cold cache)
head: darcs diff  1,09s user 0,58s system 9% cpu 16,752 total
now:  darcs diff  0,14s user 0,04s system 6% cpu 2,978 total

(with hot cache)
head: darcs diff  0,36s user 0,18s system 98% cpu 0,548 total
now:  darcs diff  0,06s user 0,01s system 86% cpu 0,078 total

Yours,
   Petr.

PS: Presumably, it would be yet more efficient to not write anything out and
not call an external diff tool in cases that can be avoided. I might release a
cabal library for formatting diffs, so if we know how to handle the options and
no explicit --diff-command is given, we could just do the diffing internally. I
guess the win is not going to be as big as from this change, though.

PPS: The bundle depends on the path refactor one, since I became too frustrated
with converting the paths half dozen times to satisfy the typechecker (I gave
up facing a FilePath -> SubPath conversion, which would probably require
another couple lines of path conversion code...)

8 patches for repository http://darcs.net/:

Wed Aug 11 17:39:29 CEST 2010  Petr Rockai <me@mornfall.net>
  * First stab at a hashed-storage 0.6 port.

Thu Aug 12 00:09:46 CEST 2010  Petr Rockai <me@mornfall.net>
  * Make SubPath just another alias for Relative.

Wed Aug 11 21:45:04 CEST 2010  Petr Rockai <me@mornfall.net>
  * Make FileName an alias to Relative (from Hashed.Storage.Path).

Thu Aug 12 00:02:43 CEST 2010  Petr Rockai <me@mornfall.net>
  * Replace FilePath with FileName in SelectChanges and ChooseTouching.

Thu Aug 12 00:16:21 CEST 2010  Petr Rockai <me@mornfall.net>
  * Introduce a new Darcs.Path module to centralise path handling.

Thu Aug 12 00:36:15 CEST 2010  Petr Rockai <me@mornfall.net>
  * Merge Darcs.Patch.FileName into Darcs.Path.

Thu Aug 12 00:47:39 CEST 2010  Petr Rockai <me@mornfall.net>
  * Remove the now-redundant sp2fn.

Wed Aug 18 22:53:38 CEST 2010  Petr Rockai <me@mornfall.net>
  * Optimize darcs diff.
Attachments
msg12225 (view) Author: mornfall Date: 2010-08-18.21:12:29
Hi,

Petr Ročkai <bugs@darcs.net> writes:
> I have changed the diff code to only write those files that actually changed in
> the temporary locations. On my other project (less than 1000 working copy
> files):

> Wed Aug 18 22:53:38 CEST 2010  Petr Rockai <me@mornfall.net>
>   * Optimize darcs diff.

this also breaks with non-hashed repositories. Personally I would like
to see them deprecated anyway, so that we can stop adding code to make
them work.

I.e. I would add a prominent note to 2.6 release notes that plain
(old-fashioned) repositories are deprecated, that darcs diff is not
supported with them and future releases will remove support for further
operations.

I believe all we really need to support is "darcs get" from an
old-fashioned repository.

Yours,
   Petr.
msg12226 (view) Author: ganesh Date: 2010-08-18.21:24:04
I didn't read this carefully, but it looks like it also restricts the 
diffed files when --diff-command is used, which isn't correct since some 
diff commands (e.g. meld) like to show the user the entire tree.
msg12227 (view) Author: aslatter Date: 2010-08-18.22:07:21
On Wed, Aug 18, 2010 at 4:24 PM, Ganesh Sittampalam <bugs@darcs.net> wrote:
>
> Ganesh Sittampalam <ganesh@earth.li> added the comment:
>
> I didn't read this carefully, but it looks like it also restricts the
> diffed files when --diff-command is used, which isn't correct since some
> diff commands (e.g. meld) like to show the user the entire tree.
>

The diff client build into Xcode does a whole-tree view as well, and
it's what I use with darcs.

Antoine
msg12229 (view) Author: mornfall Date: 2010-08-19.09:30:15
Antoine Latter <aslatter@gmail.com> writes:

> On Wed, Aug 18, 2010 at 4:24 PM, Ganesh Sittampalam <bugs@darcs.net> wrote:
>>
>> Ganesh Sittampalam <ganesh@earth.li> added the comment:
>>
>> I didn't read this carefully, but it looks like it also restricts the
>> diffed files when --diff-command is used, which isn't correct since some
>> diff commands (e.g. meld) like to show the user the entire tree.
>>
>
> The diff client build into Xcode does a whole-tree view as well, and
> it's what I use with darcs.
In these cases, the entire tree will consist of just the files that have
changed. I think that's reasonable (I would go as far as call that a
feature, not a bug). If you insist that seeing unchanged files in that
tree view is important, I'd suggest an extra option,
--include-unchanged-files or such to diff (I still don't see the
use-case, though).

Yours,
   Petr.

PS: All history browsers I can think of only show you files that have
actually changed. Interestingly enough, git difftool doesn't even
support tree diffs, it will run the tool once for each pair of files. I
agree *that* is inconvenient.
msg12234 (view) Author: kowey Date: 2010-08-19.13:34:37
On Wed, Aug 18, 2010 at 21:00:47 +0000, Petr Ročkai wrote:
> I have changed the diff code to only write those files that actually changed in
> the temporary locations. On my other project (less than 1000 working copy
> files):

Hmm, so the idea here is that this optimisation would benefit us in the
case of repositories with lots of files in them (and one nice thing
about it is that users don't have to think in advance about limiting the
search in order to get benefits).

> (with cold cache)
> head: darcs diff  1,09s user 0,58s system 9% cpu 16,752 total
> now:  darcs diff  0,14s user 0,04s system 6% cpu 2,978 total
> 
> (with hot cache)
> head: darcs diff  0,36s user 0,18s system 98% cpu 0,548 total
> now:  darcs diff  0,06s user 0,01s system 86% cpu 0,078 total

It sounds like we need some sort of darcs diff test for darcs-benchmark.
I'm not sure how, maybe systematically diff against the last tag?

> PS: Presumably, it would be yet more efficient to not write anything out and
> not call an external diff tool in cases that can be avoided. I might release a
> cabal library for formatting diffs, so if we know how to handle the options and
> no explicit --diff-command is given, we could just do the diffing internally. I
> guess the win is not going to be as big as from this change, though.

Risk assessment trying to do our own diff formatting?  I'd be a bit
concerned about increasing our areas of responsibility (even in the
form of a standalone library)...

> PPS: The bundle depends on the path refactor one, since I became too frustrated
> with converting the paths half dozen times to satisfy the typechecker (I gave
> up facing a FilePath -> SubPath conversion, which would probably require
> another couple lines of path conversion code...)

OK, if it helps, here's a quick look at the diff patch itself assuming
the refactor goes in and we figure out what to do about old-fashioned.

Optimize darcs diff.
--------------------
COMMENT: What can you say to re-assure me that this filtering does what
         we expect with file and directory moves?

         For example, do the contents of paths argument 
         need to be adjusted for move patches before we take the
         intersection with touches?

> -                            else fixSubPaths opts args
> +  paths <- if null args then return []
> +                        else fixSubPaths opts args

Path_list gets renamed to the simpler, better paths.

> +  patchset <- readRepo repository

> +  unrecorded <- fromPrims `fmap` unrecordedChanges (UseIndex, ScanKnown) repository paths
> +  unrecorded' <- n2pia `fmap` anonymous unrecorded

Would a point-freer approach be safer here (to avoid accidentally using
unrecorded when we really mean unrecorded'?)

> +  Sealed all <- return $ case (secondMatch opts, patchset) of
> +    (True, _) -> seal patchset
> +    (False, (PatchSet untagged tagged)) -> seal $ PatchSet (unrecorded' :<: untagged) tagged

If we're using a "second" matcher (--to-p, --index=N-m), then we don't
need to care about the working directory at all (because at this point
in time, using any sort of second matcher implies stopping at least at
recorded).  Otherwise we have to slip in the patches from recorded to
working.

> +  Sealed ctx <- return $ if firstMatch opts
> +                            then matchFirstPatchset opts patchset
> +                            else seal patchset

If we're diffing against some point in the past (so --last=N, --from-p,
--index=n-M) we need to unpull the all the way back to the starting
point.

If we don't specify any sort of firstMatch, then what we're really
asking for is a diff between recorded and working so the context *is*
the patchset.

> +  Sealed match <- return $ if secondMatch opts then matchSecondPatchset opts patchset
> +                                               else seal all

Now about that second matcher.  If we'd asked for one then we basically
have to unpull any patches that come "after" the first/second matchers
dependency-wise.  Without a second matcher, we just pass everything
through, working directory and all.

COMMENT: I'm not sure if 'patchset' or 'all' would be the one to pass in
here (although I guess they're the same anyway, so it's just a code
reading question).

> +  (_ :>> todiff) <- return $ findCommonWithThem match ctx
> +  (_ :>> tounapply) <- return $ findCommonWithThem all match

Recall from issue1873 that @findCommonWithThem us them@ uses PatchInfo
from them and commutes the sequence @us@ into @(common :>> just_us)@.

So above, the sequence is

  ctx --(todiff)--> match --(tounapply)--> all

> +  base <- if secondMatch opts then readRecorded repository
> +                              else readUnrecorded repository []

COMMENT: This almost sounds like we're redundantly making the decision
we made in deciding what 'all' is.  I wonder if there's any way around
it, like having some sort of big tuple

 (all, base, etc) <- if secondMatch opts
                        then ...
                        else ...

Perhaps something like this would help us keep the code bug free?
(although your version is nice and airy, sort of easy to
digest in little bits and pieces).

> +  let touched = listTouchedFiles todiff
> +      files = case paths of [] -> touched
> +                            _ -> touched `intersect` paths
> +  relevant <- restrictSubpaths repository files

Style tweak: if null?

Or maybe intersectWithUnlessEmpty xs with a less yucky name.

> +  {- print files
> +  print $ mapFL info todiff
> +  print $ mapFL info tounapply -}

Looks like debug code that accidentally got left in.

> +  let filt = applyTreeFilter relevant . snd
> +      ppath = "_darcs/pristine.hashed"

> +  oldtree <- filt `fmap` hashedTreeIO
> +                (apply . invert $ unsafeCoercePEnd todiff +>+ tounapply) base ppath
> +  newtree <- filt `fmap` hashedTreeIO
> +                (apply . invert $ tounapply) base ppath

COMMENT: Do we need the unsafeCoercePEnd? I'm not qualified to tell (at
least not without a Think-Real-Hard phase)

But this looks right otherwise.  The two trees are what you get by
unapplying down to the context and the match respectively.

Random remark: (apply . invert) seems like another one of these cases
where a little bit of duplication is actually quite nice than trying to
refactor every single possible thing you reuse, much like hPutStr
stderr.  There's a slight nuance I don't have the words for yet...

>    withTempDir ("old-"++thename) $ \odir -> do
> hunk ./src/Darcs/Commands/Diff.lhs 241
> -    setCurrentDirectory formerdir
>      withTempDir ("new-"++thename) $ \ndir -> do

[snip old tree-building machinery from the simpler more innocent code]

> -    thediff <- withCurrentDirectory (toFilePath odir ++ "/..") $
> -                   case path_list of
> -                   [] -> rundiff (takeFileName $ toFilePath odir) (takeFileName $ toFilePath ndir)
> -                   fs -> vcat `fmap`
> -                         mapM (\f -> rundiff
> -                               (takeFileName (toFilePath odir) ++ "/" ++ toFilePath f)
> -                               (takeFileName (toFilePath ndir) ++ "/" ++ toFilePath f)) fs
> -    morepatches <- readRepo repository
> -    putDoc $ changelog (getDiffInfo opts morepatches)
> -            $$ thediff
> +      withCurrentDirectory formerdir $ do
> +        writePlainTree oldtree (toFilePath odir)
> +        writePlainTree newtree (toFilePath ndir)
> +        thediff <- withCurrentDirectory (toFilePath odir ++ "/..") $
> +                       rundiff (takeFileName $ toFilePath odir) (takeFileName $ toFilePath ndir)
> +        morepatches <- readRepo repository
> +        putDoc $ changelog (getDiffInfo opts morepatches)
> +               $$ thediff

I *think* this block is just indentation change as a consequence of the
rest of the code (plus writing out the trees we built).  We run diff on
the tree as a whole.

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, try +44 (0)1273 64 2905 or
xmpp:kowey@jabber.fr (Jabber or Google Talk only)
msg12235 (view) Author: kowey Date: 2010-08-19.13:39:58
On Wed, Aug 18, 2010 at 23:15:19 +0200, Petr Rockai wrote:
> this also breaks with non-hashed repositories. Personally I would like
> to see them deprecated anyway, so that we can stop adding code to make
> them work.

In what way?  It didn't jump out at me when I tried to review the patch
(which seemed fairly straightforward)

> I.e. I would add a prominent note to 2.6 release notes that plain
> (old-fashioned) repositories are deprecated, that darcs diff is not
> supported with them and future releases will remove support for further
> operations.
> 
> I believe all we really need to support is "darcs get" from an
> old-fashioned repository.

I'll address this in a separate thread (perhaps not immediately)

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, try +44 (0)1273 64 2905 or
xmpp:kowey@jabber.fr (Jabber or Google Talk only)
msg12237 (view) Author: ganesh Date: 2010-08-19.22:29:00
On Thu, 19 Aug 2010, Petr Rockai wrote:

> Antoine Latter <aslatter@gmail.com> writes:
>
>> On Wed, Aug 18, 2010 at 4:24 PM, Ganesh Sittampalam <bugs@darcs.net> wrote:
>>>
>>> Ganesh Sittampalam <ganesh@earth.li> added the comment:
>>>
>>> I didn't read this carefully, but it looks like it also restricts the
>>> diffed files when --diff-command is used, which isn't correct since some
>>> diff commands (e.g. meld) like to show the user the entire tree.
>>>
>>
>> The diff client build into Xcode does a whole-tree view as well, and
>> it's what I use with darcs.
> In these cases, the entire tree will consist of just the files that have
> changed. I think that's reasonable (I would go as far as call that a
> feature, not a bug). If you insist that seeing unchanged files in that
> tree view is important, I'd suggest an extra option,
> --include-unchanged-files or such to diff (I still don't see the
> use-case, though).

Sounds fine, though it perhaps ought to default to true with 
--diff-command and false without.

Ganesh
msg12243 (view) Author: mornfall Date: 2010-08-20.08:50:37
Hi,

On Thu, Aug 19, 2010 at 02:42:59PM +0100, Eric Kow wrote:
> On Wed, Aug 18, 2010 at 23:15:19 +0200, Petr Rockai wrote:
> > this also breaks with non-hashed repositories. Personally I would like
> > to see them deprecated anyway, so that we can stop adding code to make
> > them work.
> 
> In what way?  It didn't jump out at me when I tried to review the patch
> (which seemed fairly straightforward)

because we apply the patches to trees in pristine.hashed to avoid keeping
everything in memory if the change happens to be really big (so that even
though it will take a long time, it won't swap your machine do death).

The 'efficient' patch application is only implemented for hashed trees, since
it's much trickier for plain trees. We also take advantage of the fact that
most files usually stay untouched, so they are shared with actual pristine.

So while it would be possible to do this with plain pristine, it would be extra
work and it would be less efficient.

Yours,
   Petr.
msg12381 (view) Author: kowey Date: 2010-08-31.09:52:22
Obsoleted by patch374 if I understand correctly
msg15320 (view) Author: gh Date: 2012-03-14.18:32:09
I'm reopening this patch since:

* as of now, old-fashioned repositories are deprecated in darcs HEAD,
hence this obstacle is gone. According to http://wiki.darcs.net/OF ,
"darcs diff" no longer supports OF repositories.
* there is patch374, but that bundle contains various changes, it seems
more convenient to discuss the new diff implementation here.
* the discussion that already occured here useful.

A question for Petr (mornfall): Should darcs switch to HS 0.6 first?
Implying, is HS 0.6 stable enough for that to happen?
msg15321 (view) Author: gh Date: 2012-03-14.22:53:05
Please don't screen it right away, I'd rather gather comments on that
one first..

This is a port from Petr Rockai's "adventure branch" to the current
code of darcs. It wasn't too hard to port it, and indeed path handling
seems dirty but I don't think we're losing much by not switching to Petr's
new path handling code (currently called pathlib).

Note that I ported the patch but I don't _understand_ this piece of code :-)

This new implementation does speed up darcs diff (twofold in the few cases
I tested).  When called with the -p flag, it no longer shows the
"copying pristine" message, which is nice.

Indeed there is a change of behaviour: it shows only the different files
when called with --diff-command, instead of passing the whole trees to
the external diff. On this point I agree with Petr: it's just a habit,
and apart from (ab)using the diff viewer as a file browser there is no
use for that. So I would suggest that if someone really wants the
--include-unchanged-files flag suggested above, let them implement it.

1 patch for repository http://darcs.net/screened:

Wed Mar 14 19:39:35 ART 2012  Guillaume Hoffmann <guillaumh@gmail.com>
  * Optimize darcs diff
  Port of Petr Rockai's code.
Attachments
msg15322 (view) Author: gh Date: 2012-03-15.00:33:02
Same patch with another that removes 2 unused function.
Screening it this time since Owen had a look at it.

2 patches for repository http://darcs.net/screened:

Wed Mar 14 19:39:35 ART 2012  Guillaume Hoffmann <guillaumh@gmail.com>
  * Optimize darcs diff
  Port of Petr Rockai's code.

Wed Mar 14 21:10:19 ART 2012  Guillaume Hoffmann <guillaumh@gmail.com>
  * remove two unused functions of Darcs.SelectChanges after darcs diff optimization
Attachments
msg15578 (view) Author: mndrix Date: 2012-04-20.17:23:32
It looks good: faster and easier to understand.
History
Date User Action Args
2010-08-18 21:00:47mornfallcreate
2010-08-18 21:05:16darcswatchsetdarcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-8bada56aeb3c333f82d16a7b415a792633306a9a
2010-08-18 21:12:30mornfallsetmessages: + msg12225
2010-08-18 21:24:04ganeshsetnosy: + ganesh
messages: + msg12226
2010-08-18 22:07:21aslattersetmessages: + msg12227
2010-08-19 09:30:16mornfallsetnosy: + aslatter
messages: + msg12229
2010-08-19 13:34:38koweysetmessages: + msg12234
2010-08-19 13:39:58koweysetmessages: + msg12235
2010-08-19 22:29:01ganeshsetmessages: + msg12237
2010-08-20 08:50:38mornfallsetmessages: + msg12243
2010-08-31 09:52:22koweysetstatus: needs-review -> obsoleted
messages: + msg12381
superseder: + adventure branch
2011-05-10 19:37:42darcswatchsetdarcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-8bada56aeb3c333f82d16a7b415a792633306a9a -> http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-8bada56aeb3c333f82d16a7b415a792633306a9a
2012-03-14 18:32:09ghsetstatus: obsoleted -> followup-in-progress
messages: + msg15320
2012-03-14 18:32:57ghsetnosy: + mndrix
2012-03-14 22:53:05ghsetfiles: + patch-preview.txt, optimize-darcs-diff.dpatch, unnamed
messages: + msg15321
2012-03-15 00:33:02ghsetfiles: + patch-preview.txt, optimize-darcs-diff.dpatch, unnamed
messages: + msg15322
2012-03-15 00:35:45ghsetstatus: followup-in-progress -> needs-review
2012-04-20 17:23:32mndrixsetstatus: needs-review -> accepted
messages: + msg15578