darcs

Patch 310 Resolve issue 1599: automatically expire unused caches

Title Resolve issue 1599: automatically expire unused caches
Superseder Nosy List abuiles, galbolle, ganesh, kowey, mornfall
Related Issues
Status accepted Assigned To mornfall
Milestone

Created on 2010-07-22.06:12:24 by abuiles, last changed 2011-05-10.22:07:23 by darcswatch. Tracked on DarcsWatch.

Files
File name Status Uploaded Type
-resolve-issue-1599_-automatically-expire-unused-caches.dpatch abuiles, 2010-07-28.13:47:46 text/x-darcs-patch
resolve-issue-1599_-automatically-expire-unused-caches.dpatch abuiles, 2010-07-22.06:12:24 text/x-darcs-patch
resolve-issue-1599_-automatically-expire-unused-caches.dpatch abuiles, 2010-07-28.20:55:19 text/x-darcs-patch
resolve-issue-1599_-automatically-expire-unused-caches.dpatch abuiles, 2010-08-03.14:32:49 text/x-darcs-patch
resolve-issue-1599_-automatically-expire-unused-caches.dpatch abuiles, 2010-08-06.06:41:02 text/x-darcs-patch
resolve-issue-1599_-automatically-expire-unused-caches.dpatch abuiles, 2010-08-07.00:41:37 text/x-darcs-patch
unnamed abuiles, 2010-07-22.06:12:24
unnamed abuiles, 2010-07-28.13:47:46
unnamed abuiles, 2010-07-28.20:55:19
unnamed abuiles, 2010-08-03.14:32:49
unnamed abuiles, 2010-08-06.06:41:02
unnamed abuiles, 2010-08-07.00:41:37
See mailing list archives for discussion on individual patches.
Messages
msg11815 Author: abuiles Date: 2010-07-22.06:12:24
1 patch for repository http://darcs.net:

Wed Jul 21 22:29:45 COT 2010  builes.adolfo@googlemail.com
  * Resolve issue 1599: automatically expire unused caches
Attachments
msg11825 Author: kowey Date: 2010-07-22.15:47:56
On Thu, Jul 22, 2010 at 06:12:24 +0000, Adolfo Builes wrote:
> Wed Jul 21 22:29:45 COT 2010  builes.adolfo@googlemail.com
>   * Resolve issue 1599: automatically expire unused caches

Resolve issue 1599: automatically expire unused caches
------------------------------------------------------
> builes.adolfo@googlemail.com**20100722032945
>  Ignore-this: 38455634b5648ed7555380336ea74464
> ] hunk ./src/Darcs/Global.hs 30
>                        whenDebugMode, withDebugMode, setDebugMode,
>                        debugMessage, debugFail, putTiming,
>                        addCRCWarning, getCRCWarnings, resetCRCWarnings,
> -                      darcsdir
> +                      addBadCache, getBadCacheList, isBadCache, darcsdir,
> +                      isTrustedHttp, addTrustedHttp

> hunk ./src/Darcs/Global.hs 73
>          Just actions <- swapMVar atexitActions Nothing
>          -- from now on atexit will not register new actions
>          mapM_ runAction actions
> +        badCaches <- getBadCacheList
> +        when ( not $ null badCaches ) $ reportBadCaches badCaches

OK, so after running all of the atExitActions, we print out our
accumulated warnings about caches.  (Should we worry about what
happens if one of the atexit actions fails?)

> -
> +    reportBadCaches caches = do
> +                               if length caches > 1
> +                                 then
> +                                  hPutStrLn stderr "\nI could not reach the following repositories (listed in _darcs/prefs/sources):"
> +                                 else
> +                                  hPutStrLn stderr "\nI could not reach the following repository (listed in _darcs/prefs/sources):"

For what it's worth, this could be more simply expressed using the
English module.  It lets you write something like

 "I could not reach the following"
 ++ englishNum (length caches) (Noun "repository") "(list in in ...)"

It looks like you'd have to modify Noun to account for this.
I'll submit a patch.
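
For what it's worth, a self-contained sketch of the idea (the helper
names and the pluralize rule here are illustrative; the real
englishNum/Noun live in Darcs's English module, and Noun would need the
"-y -> -ies" rule added, as mentioned above):

```haskell
newtype Noun = Noun String

-- Pluralize a noun, including the "-y -> -ies" case that "repository" needs.
pluralize :: Noun -> String
pluralize (Noun s)
  | not (null s) && last s == 'y' = init s ++ "ies"
  | otherwise                     = s ++ "s"

-- Pick singular or plural based on a count, then append the rest of the message.
englishNum :: Int -> Noun -> String -> String
englishNum n noun@(Noun s) rest
  | n == 1    = s ++ rest
  | otherwise = pluralize noun ++ rest

-- The warning header built with one code path instead of an if-then-else.
badCacheHeader :: [String] -> String
badCacheHeader caches =
  "\nI could not reach the following "
  ++ englishNum (length caches) (Noun "repository")
       " (listed in _darcs/prefs/sources):"
```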

It may also be a good idea to give the user some advice about what
to do here.

> +                               mapM_ (hPutStrLn stderr) caches
> +                               hPutStrLn stderr ""

Alternatively, hPutStr stderr (unlines caches ++ "\n")

> +type BadCacheList = [String]
> +{- NOINLINE _badCacheList -}
> +_badCacheList :: IORef BadCacheList
> +_badCacheList = unsafePerformIO $ newIORef []
> +
> +addBadCache :: String -> IO ()
> +addBadCache cache = modifyIORef _badCacheList (cache:)
> +
> +getBadCacheList :: IO [String]
> +getBadCacheList = readIORef _badCacheList
> +
> +{- NOINLINE isBadCache -}
> +isBadCache :: String -> Bool
> +isBadCache cache = unsafePerformIO $ do badCaches <- getBadCacheList
> +                                        return (cache `elem` badCaches)

Is it really necessary for this to be String -> Bool?
Why not String -> IO Bool?

Alternatively, why not IO (String -> Bool)?

I'm also a bit sceptical about the need for a type synonym.
Maybe just passing around [String] is clearer here.
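
A rough sketch of the String -> IO Bool shape (the IORef is passed as a
parameter here only to keep the example self-contained; the patch keeps
a single global one):

```haskell
import Data.IORef

-- Record a cache source as bad.
addBadCache :: IORef [String] -> String -> IO ()
addBadCache ref cache = modifyIORef ref (cache :)

-- Ask, in IO, whether a source has been recorded as bad;
-- no unsafePerformIO needed.
isBadCache :: IORef [String] -> String -> IO Bool
isBadCache ref cache = fmap (cache `elem`) (readIORef ref)
```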

> +type TrustedHttps = [String]
> +{- NOINLINE _trustedHttpList -}
> +_trustedHttpList :: IORef TrustedHttps
> +_trustedHttpList = unsafePerformIO $ newIORef []

So you maintain a list of URLs which you managed to reach the
first time you contacted them this session.

'Trust' is probably not the right notion to use here (since I'd
be thinking more in terms of security or something like that).
How about "reachable" instead?

> +addTrustedHttp :: String -> IO ()
> +addTrustedHttp http = modifyIORef _trustedHttpList (http:)
> +
> +getTrustedHttps :: IO TrustedHttps
> +getTrustedHttps = readIORef _trustedHttpList
> +
> +{- NOINLINE  isTrustedHttp -}
> +isTrustedHttp :: String -> Bool
> +isTrustedHttp http = unsafePerformIO $ do trustedHttps <- getTrustedHttps
> +                                          return (http `elem` trustedHttps)

Same complaint about possibly unnecessary unsafePerformIO.  I mean, we
certainly use it to track some global information in Darcs, but it's
something we should be somewhat reluctant to do lest Darcs surprise us
in the future.

> +-- | Returns the sources of a given cache.
> +cacheSource :: CacheLoc -> String
> +cacheSource (Cache _ _ s) = s

Potentially useful helper function.  Seems like we could alternatively
modify the definition of Cache to use the curly brace syntax.
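
Something along these lines, perhaps (the field names and the stand-in
component types are guesses for illustration; the real types live in
the cache module):

```haskell
-- Stand-in component types so the example compiles on its own.
data CacheType = Repo | Directory deriving (Eq, Show)
data WritableOrNot = Writable | NotWritable deriving (Eq, Show)

-- Record syntax gives the cacheSource accessor for free.
data CacheLoc = Cache { cacheType     :: CacheType
                      , cacheWritable :: WritableOrNot
                      , cacheSource   :: String
                      } deriving (Eq, Show)
```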

>  fetchFileUsingCache :: Cache -> HashedDir -> String -> IO (String, B.ByteString)
> -fetchFileUsingCache = fetchFileUsingCachePrivate Anywhere
> +fetchFileUsingCache c h s = do result <- fetchFileUsingCachePrivate Anywhere c h s
> +                               return result

This change does not seem necessary.  What was wrong with the
eta-reduced version of this code?

>            sfuc [] _ = return ()
> -          sfuc (c:cs) out | not $ writable c =
> +          sfuc (c:cs) out | (not $ badCache c) && (not $ writable c) =

> -                 then speculateFileOrUrl (fn c) out
> +                 then speculateFileOrUrl (fn c) out `catchNonSignal` (\e -> do
> +                                       addToBadCaches (show e) c)

> -                 else copyFileOrUrl [] (fn c) out Cachable
> +                 else copyFileOrUrl [] (fn c) out Cachable `catchNonSignal` (\e -> do
> +                                       addToBadCaches (show e) c)

OK, so if something goes wrong when speculating on or copying a file,
(but it's not just the user hitting ^-c), we note the cache down as
being problematic.
>  
> +-- | Checks if a given cache needs to be added to the list of bad caches.
> +-- It receives an error caught during execution and the cache.
> +-- For a local cache, if the given source doesn't exist anymore, it is added.
> +-- For HTTP sources if the error is timeout, it is added, if not we check if the source
> +-- still exist, if doesn't exist we added to the list of bad caches.
> +-- For SSH we check if the server is reachable.
> +-- The entries which get added to the cache are no longer tried for the rest of the command.
> +addToBadCaches :: String -> CacheLoc -> IO ()

Thanks for haddocking this.

It seems like addToBadCaches may be a bit of a misnomer: first because
this function primarily tests whether a cache is reachable, and second
because it adds the cache entry to either the whitelist or the
blacklist.  But I'm not sure; maybe I misunderstand the code.  After
all, you do only call it when something goes wrong...

Style point:
The code below is very heavily indented, which makes it harder to read
and review without a wide screen or window.  It's a tricky trade-off
between too much and not enough, so I don't really have any answers.
Perhaps this style could help here

   addToBadCaches e cache
    | condition 1 = 
        ...
    | condition 2 =
        ...

Note the guards pulled back to a shallow indent, and each case starting
on a new line.

> +addToBadCaches e cache | isFile (cacheSource cache) = case "No such file or directory" `isInfixOf`  (map toLower e) of
> +                                                        True  -> do
> +                                                                  exist <- doesDirectoryExist $ cacheSource $ cache
> +                                                                  when (not exist) $ addBadCache . cacheSource $ cache
> +                                                        False -> return ()

1. I'm not sure what you gain from pattern-matching on the
   Bool when you could just use an if-then-else.

2. Do we not have any exception types to work with here?  Matching
   on the string seems really fragile.

3. Case in point, you're matching "Foo" against (map toLower "foo")
   which means that this code path is never visited.
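
To illustrate point 2, here is a small example of classifying a failure
with the predicates from System.IO.Error instead of substring-matching
the rendered message (missingFile is a hypothetical helper, not code
from the patch):

```haskell
import Control.Exception (IOException, try)
import System.IO.Error (isDoesNotExistError)

-- True iff reading the file fails specifically because it does not exist.
missingFile :: FilePath -> IO Bool
missingFile path = do
  r <- try (readFile path) :: IO (Either IOException String)
  case r of
    Left e  -> return (isDoesNotExistError e)
    Right _ -> return False
```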

> +  | isUrl  (cacheSource cache) = case "timeout" `isInfixOf`  (map toLower e) of
> +                                   True  -> addBadCache . cacheSource $ cache
> +                                   False -> case isTrustedHttp (cacheSource cache) of
> +                                            True  -> return  ()
> +                                            False -> do
> +                                                      let url = if last (cacheSource cache) == '/'
> +                                                                 then cacheSource cache
> +                                                                 else (cacheSource cache) ++ "/"
> +                                                      rsp <- simpleHTTP (getRequest url )
> +                                                      do case rsp of
> +                                                          Left _ ->  return ()
> +                                                          Right response -> if rspCode response /= (2,0,0)
> +                                                                            then addBadCache . cacheSource $ cache
> +                                                                            else addTrustedHttp (cacheSource cache) >> return ()

OK earlier we see that this function is called if we receive some
exception when fetching over HTTP.

If the exception is not a timeout, we test if the URL is reachable and
whitelist it if so (the whitelist avoids future testing).  But why do we
do this?  And doesn't that confuse Darcs into thinking it successfully
fetched a file when it did not?  Have you tested the actual case where
you get an exception but not a timeout?

Also, are there not web server configurations where
http://example.com/foo gives you a 404, but
http://example.com/foo/_darcs/foo gives you a result?  What happens in
those cases?  In other words, does the test of trying to fetch
http://example.com/foo/ actually make sense?

Also, why is it important to normalise the URL to have a trailing slash?
Is it just to avoid duplicate entries in the whitelist?
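
If normalisation is kept, it could at least be factored into a tiny
helper (ensureTrailingSlash is a hypothetical name):

```haskell
import Data.List (isSuffixOf)

-- One canonical spelling per URL avoids duplicate whitelist entries
-- and servers that treat "foo" and "foo/" differently.
ensureTrailingSlash :: String -> String
ensureTrailingSlash url
  | "/" `isSuffixOf` url = url
  | otherwise            = url ++ "/"
```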

Code tidiness:

1. It could be useful to look into reducing the amount of indentation
   you use.  It's a tricky trade-off between too much and not enough,
   so I don't really have any answers.  Perhaps this style could help here

   addToBadCaches e cache
    | condition 1 = 
        ...
    | condition 2 =
        ...

   It's all about making the code as easy to read as possible.

2. It seems like you should just name a source = cacheSource cache,
   which I think will make the code a lot simpler looking

3. Why does addTrustedHttp need to be followed by return () if it already
   returns ()?
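
Putting points 1-3 together, the HTTP branch could look roughly like
this (the types and helpers are stand-ins so the shape compiles on its
own; the actual reachability probe is elided):

```haskell
import Data.Char (toLower)
import Data.IORef
import Data.List (isInfixOf, isPrefixOf)

-- Stand-ins for the real CacheLoc, isUrl, isSsh and the global bad list.
newtype CacheLoc = Cache { cacheSource :: String }

isUrl, isSsh :: String -> Bool
isUrl = ("http://" `isPrefixOf`)
isSsh s = '@' `elem` s

-- `source` is named once in a where clause; guards start at a shallow
-- indent; if-then-else replaces case-on-Bool.
addToBadCaches :: IORef [String] -> String -> CacheLoc -> IO ()
addToBadCaches badList e cache
  | isUrl source =
      if "timeout" `isInfixOf` map toLower e
        then modifyIORef badList (source :)
        else return ()            -- reachability probe elided
  | isSsh source = return ()      -- ssh handling elided
  | otherwise = fail ("unknown transport protocol for: " ++ source)
  where source = cacheSource cache
```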

> +                       | isSsh  (cacheSource cache) =  checkSshError (cacheSource cache) (getHttpUrl (cacheSource cache))

One principle I saw on some blog post somewhere is that when you have
a single unit of code, the code in that unit should tend to reside on
the same level of abstraction.

So a function like

  if foo
     then do low-level-A1
             low-level-A2
             low-level-A3
     else high-level-B

is a bit odd, when you could instead write something like

  if foo
     then do low-level-A1
             low-level-A2
             low-level-A3
     else do low-level-B1
             low-level-B2
             
OR alternatively

  if foo
     then do high-level-A
     else do high-level-B

But not a mix.  I think why this is useful for readability is that it
makes it very clear that A and B are on the same sort of level of
operation.  Anyway, don't take this sort of advice /too/ seriously.
It's one of these fuzzy things that I suspect you develop a feel for
over time.

In this particular case, you've chosen to break checkSshError into
a separate function, but not the HTTP stuff.  Why?

> +                      | otherwise                  = fail $ "unknown transport protocol for: " ++ (cacheSource cache)

> +-- | Helper function to check reachability of a ssh source.
> +checkSshError :: String -> String -> IO ()
> +checkSshError source url = do
> +                      simpleHTTP (getRequest url )
> +                        `catchNonSignal` (\e -> do
> +                                             case "timeout" `isInfixOf`  (map toLower (show e)) of
> +                                               True  -> (addBadCache source) >> return (Left (ErrorMisc "timeout"))
> +                                               False -> return () >> return (Left (ErrorMisc "Unkown error")))
> +                      return ()

> +-- | Given a SSH source it returns the server address appended with HTTP://
> +-- i.e: getHttpUrl "user@darcs.net:repo" returns "http://darcs.net"
> +getHttpUrl :: String -> String
> +getHttpUrl source = "http://" ++ ((\(_:xs) -> takeWhile (/=':') xs) . dropWhile (/='@') $ source)


Why does checkSshError take both the source and url argument?  Is there
a case where url cannot be systematically derived from the source?

What happens to the Either type that you return in your handler?
Doesn't it get discarded?

Why are we going about things in this roundabout way of checking the SSH
path by converting the source name to an HTTP URI and fetching it over
HTTP?  What about cases where people have a working SSH server but not
necessarily an HTTP server on a given machine?

>      where ffuc (c:cs)
> -           | not (writable c) && (Anywhere == fromWhere || isFile (fn c)) =
> +           | not (writable c) && (Anywhere == fromWhere || isFile (fn c)) && (not $ badCache c) =

This seems ok although incidentally,
  not (badCache c)
is probably simpler than
  (not $ badCache c)

> -              `catchall` ffuc cs
> +              `catchNonSignal` (\e -> do
> +                                       addToBadCaches (show e) c
> +                                       ffuc cs )

> -              `catchall` do (fname,x) <- ffuc cs
> -                            do createCache c subdir
> -                               createLink fname (fn c)
> -                               return (fn c, x)
> -                             `catchall`
> -                             do gzWriteFilePS (fn c) x `catchall` return ()
> -                                return (fname,x)
> +              `catchNonSignal` (\ e -> do addToBadCaches (show e) c ; (fname,x) <- ffuc cs
> +                                          do createCache c subdir
> +                                             createLink fname (fn c)
> +                                             return (fn c, x)
> +                                           `catchall`
> +                                           do gzWriteFilePS (fn c) x `catchall` return ()
> +                                              return (fname,x))

This block of code actually has very few changes in it, but that fact
would have been more apparent if you had tried to avoid the
re-indentation (not always possible).

In this case, you could have

-              `catchall` do (fname,x) <- ffuc cs
-                            do createCache c subdir
-                               createLink fname (fn c)
...
+              `catchNonSignal` (\ e ->
+                         do addToBadCaches (show e) c
+                            (fname,x) <- ffuc cs
+                            do createCache c subdir

I also took the liberty of simplifying away the semicolon.

> +-- | Checks if a given cache is in the list of bad caches.
> +badCache :: CacheLoc -> Bool
> +badCache = isBadCache . cacheSource

Do you really use this?
>  import Data.Maybe ( isJust, catMaybes )
>  import Control.Monad ( msum )

> -    dff <- fetchFilePS (repo ++ "/" ++ df) Cachable `catchall` return B.empty
> +    (timeOut,dff) <- (fetchFilePS (repo ++ "/" ++ df) Cachable >>= \ r -> return (False,r))
> +                    `catchNonSignal` (\ e -> return (isTimeOut e,B.empty))

This may be more readable using do notation

     (timeOut,dff) <- do r <- fetchFilePS (repo ++ "/" ++ df) Cachable
                         return (False,r)
                      `catchNonSignal` (\ e -> return (isTimeOut e,B.empty))

The same sort of worry as above.  What happens if your exception is not
a timeout?  I'm concerned that you succeed here, where what you really
want to do maybe is re-throw the exception.
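
The rethrow could be sketched like this (plain `catch` stands in for
catchNonSignal, which additionally filters out signals, and
fetchWithTimeoutFlag is a hypothetical name):

```haskell
import Control.Exception (SomeException, catch, throwIO)
import Data.Char (toLower)
import Data.List (isInfixOf)

-- On success, (False, result); on a timeout, (True, ""); any other
-- exception is re-thrown instead of being silently swallowed.
fetchWithTimeoutFlag :: IO String -> IO (Bool, String)
fetchWithTimeoutFlag fetch =
  fmap ((,) False) fetch `catch` handler
  where
    handler :: SomeException -> IO (Bool, String)
    handler e
      | isTimeOut e = return (True, "")
      | otherwise   = throwIO e
    isTimeOut :: SomeException -> Bool
    isTimeOut e = "timeout" `isInfixOf` map toLower (show e)
```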

>      -- below is a workaround for servers that don't return a 404 on nonexistent files
> hunk ./src/Darcs/Repository/Format.hs 57
> -    rf <- if B.null dff || isJust (BC.elemIndex '<' dff)
[snip]
> -          else return $ Right $ parseRepoFormat dff
> +    rf <- case timeOut of
> +          False ->
> +               if B.null dff || isJust (BC.elemIndex '<' dff)
[snip indentation change]
> +               else return $ Right $ parseRepoFormat dff
> +          True  -> return $ Left $ "Failed establishing a connection with "++ repo ++ " Timeout was reached"

>                                    (\e -> return (Left (prettyException e)))
> +          isTimeOut e = "timeout" `isInfixOf` map toLower (show e)

It seems like the notion of whether something is a timeout or not (you
check this by looking at the exception text) could be refactored.
I guess it's not a big deal if it is though.

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, please try +44 (0)1273 64 2905.
msg11828 Author: mornfall Date: 2010-07-22.18:08:14
Eric Kow <kowey@darcs.net> writes:
> If the exception is not a timeout, we test if the URL is reachable and
> whitelist it if so (the whitelist avoids future testing).  But why do we
> do this?  And doesn't that confuse Darcs into thinking it successfully
> fetched a file when it did not?  Have you tested the actual case where
> you get an exception but not a timeout?

I believe that we do this because a 404 could be quite legitimate: the
repository could be partially lazy but still have a lot of things we
want to use. We don't want to blacklist a source for this reason
alone. We also don't want to inspect a source each time it gives a 404,
since this is not for free, either.

> Also, are there not web server configurations where
> http://example.com/foo gives you a 404, but
> http://example.com/foo/_darcs/foo gives you a result?  What happens in
> those cases?  In other words, does the test of trying to fetch
> http://example.com/foo/ actually make sense?
>
> Also, why is it important to normalise the URL to have a trailing slash?
> Is it just to avoid duplicate entries in the whitelist?

I suspect that there are webservers that treat foo/ and foo
differently. But I am not saying one is better than the other. Anyway, I
think the right thing to look for is foo/_darcs/hashed_inventory (we
shouldn't be using caches for non-hashed repositories).
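
Constructing that probe target is simple enough (probeTarget is a
hypothetical helper; the HTTP request itself is elided):

```haskell
-- Probe the hashed inventory file rather than the repository root,
-- normalising away any trailing slash first.
probeTarget :: String -> String
probeTarget repo = stripSlash repo ++ "/_darcs/hashed_inventory"
  where stripSlash = reverse . dropWhile (== '/') . reverse
```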

I wholeheartedly agree with the indentation issue Eric has pointed out,
too. Please try to fit your code in 80-90 columns if possible, although
100 is acceptable and 110 is stretching it but could be forgiven if
there's a good reason. : - ) Code that is wider is usually hard to read
for almost everyone (or so I guess).

Yours,
   Petr.
msg11831 Author: abuiles Date: 2010-07-23.01:35:57
Hi Eric,

>
> OK, so after running all of the atExitActions, we print out our
> accumulated warnings about caches (should we worry about what
> happens if one of the atexit actions fails)?
>
Yes, we should care about that; I will fix it.

>
>> +type BadCacheList = [String]
>> +{- NOINLINE _badCacheList -}
>> +_badCacheList :: IORef BadCacheList
>> +_badCacheList = unsafePerformIO $ newIORef []
>> +
>> +addBadCache :: String -> IO ()
>> +addBadCache cache = modifyIORef _badCacheList (cache:)
>> +
>> +getBadCacheList :: IO [String]
>> +getBadCacheList = readIORef _badCacheList
>> +
>> +{- NOINLINE isBadCache -}
>> +isBadCache :: String -> Bool
>> +isBadCache cache = unsafePerformIO $ do badCaches <- getBadCacheList
>> +                                        return (cache `elem` badCaches)
>
> Is it really necessary for this to be String -> Bool?
> Why not String -> IO Bool?
>
> Alternatively, why not IO (String -> Bool)?
>
I used  (String -> Bool) basically to escape the IO so that I could
use it easily in the guards.

> I'm also a bit sceptical about the need for a type synonym.
> Maybe just passing around [String] is clearer here.
>
The synonym was to give better insight into what we were getting, but
indeed, it can be inferred from the function name.

>> +type TrustedHttps = [String]
>> +{- NOINLINE _trustedHttpList -}
>> +_trustedHttpList :: IORef TrustedHttps
>> +_trustedHttpList = unsafePerformIO $ newIORef []
>
> So you maintain a list of URLs which you managed to reach the
> first time you contacted them this session.
>
> 'Trust' is probably not the right notion to use here (since I'd
> be thinking more in terms of security or something like that).
> How about "reachable" instead?
>
"reachable" or "reliable" could be the words to express what I wanted.


>> +addTrustedHttp :: String -> IO ()
>> +addTrustedHttp http = modifyIORef _trustedHttpList (http:)
>> +
>> +getTrustedHttps :: IO TrustedHttps
>> +getTrustedHttps = readIORef _trustedHttpList
>> +
>> +{- NOINLINE  isTrustedHttp -}
>> +isTrustedHttp :: String -> Bool
>> +isTrustedHttp http = unsafePerformIO $ do trustedHttps <- getTrustedHttps
>> +                                          return (http `elem` trustedHttps)
>
> Same complaint about possibly unnecessary unsafePerformIO.  I mean, we
> certainly use it to achieve tracking of some global information in
> Darcs, but it's something we should be be somewhat reluctant to do lest
> Darcs surprise us in the future.
I totally agree that I overused unsafePerformIO; in this case I used
it for the same reason I gave before.

>> +-- | Returns the sources of a given cache.
>> +cacheSource :: CacheLoc -> String
>> +cacheSource (Cache _ _ s) = s
>
> Potentially useful helper function.  Seems like we could alternatively
> modify the definition of Cache to use the curly brace syntax.
>
I would totally prefer curly brace syntax; do we have any reason for
not using it?

>>  fetchFileUsingCache :: Cache -> HashedDir -> String -> IO (String, B.ByteString)
>> -fetchFileUsingCache = fetchFileUsingCachePrivate Anywhere
>> +fetchFileUsingCache c h s = do result <- fetchFileUsingCachePrivate Anywhere c h s
>> +                               return result
>
> This change does not seem necessary.  What was wrong with the
> eta-reduced version of this code?
>
There is nothing wrong with it; I was using something else in the
middle, then took it away and forgot to restore the eta-reduced
version.


>>            sfuc [] _ = return ()
>> -          sfuc (c:cs) out | not $ writable c =
>> +          sfuc (c:cs) out | (not $ badCache c) && (not $ writable c) =
>
>> -                 then speculateFileOrUrl (fn c) out
>> +                 then speculateFileOrUrl (fn c) out `catchNonSignal` (\e -> do
>> +                                       addToBadCaches (show e) c)
>
>> -                 else copyFileOrUrl [] (fn c) out Cachable
>> +                 else copyFileOrUrl [] (fn c) out Cachable `catchNonSignal` (\e -> do
>> +                                       addToBadCaches (show e) c)
>
> OK, so if something goes wrong when speculating on or copying a file,
> (but it's not just the user hitting ^-c), we note the cache down as
> being problematic.
>
If we determine that the cause of the error is one we consider
problematic, we add it.

>
> It seems like addToBadCaches is may be a bit of misnomer first this
> function primarily tests if a cache is reachable, and second because it
> adds the cache entry to either the white or the blacklist.  But I'm not
> sure; maybe I misunderstand the code.  After all, you do only call it
> when something goes wrong...
>
I use the whitelist basically to avoid testing again whether the repo
is reachable.  (I can get an error just because a certain file is not
in that repo, even though the repo does exist.)

> Style point:
> The code below is very heavily indented, which makes it harder to read
> and review without a wide screen or window.  It's a tricky trade-off
> between too much and not enough, so I don't really have any answers.
> Perhaps this style could help here
>
>   addToBadCaches e cache
>    | condition 1 =
>        ...
>    | condition 2 =
>        ...
>
Just a silly idea I had in my mind that I would have to do things this way:
addToBadCaches e cache
                                          | condition 1 =
                                                   ...
                                          | condition 2 =
                                                  ....


>
> 2. Do we not have any exception types to work with here?  Matching
>   on the string seems really fragile.
>
I agree with you.  I tried to work with the exception, but
unfortunately "fromException" was giving nothing for this error (I
will look at it again, though).

> 3. Case in point, you're matching "Foo" against (map toLower "foo")
>   which means that this code path is never visited.
I used it specifically for "timeout", since the error could show
"Timeout" or "timeout".

>> +  | isUrl  (cacheSource cache) = case "timeout" `isInfixOf`  (map toLower e) of
>> +                                   True  -> addBadCache . cacheSource $ cache
>> +                                   False -> case isTrustedHttp (cacheSource cache) of
>> +                                            True  -> return  ()
>> +                                            False -> do
>> +                                                      let url = if last (cacheSource cache) == '/'
>> +                                                                 then cacheSource cache
>> +                                                                 else (cacheSource cache) ++ "/"
>> +                                                      rsp <- simpleHTTP (getRequest url )
>> +                                                      do case rsp of
>> +                                                          Left _ ->  return ()
>> +                                                          Right response -> if rspCode response /= (2,0,0)
>> +                                                                            then addBadCache . cacheSource $ cache
>> +                                                                            else addTrustedHttp (cacheSource cache) >> return ()
>
> OK earlier we see that this function is called if we receive some
> exception when fetching over HTTP.
>
> If the exception is not a timeout, we test if the URL is reachable and
> whitelist it if so (the whitelist avoids future testing).  But why do we
> do this?  And doesn't that confuse Darcs into thinking it successfully
> fetched a file when it did not?  Have you tested the actual case where
> you get an exception but not a timeout?
>
I did test it, and the reason I add it to the whitelist is to avoid
inspecting it again (as Petr said).

> Also, are there not web server configurations where
> http://example.com/foo gives you a 404, but
> http://example.com/foo/_darcs/foo gives you a result?  What happens in
> those cases?  In other words, does the test of trying to fetch
> http://example.com/foo/ actually make sense?
>
> Also, why is it important to normalise the URL to have a trailing slash?
> Is it just to avoid duplicate entries in the whitelist?
>
I added the trailing slash basically for the reason Petr gave: some
servers treat foo/ and foo differently.  But as Petr suggests, what I
should test is whether foo/_darcs/hashed_inventory exists.


> Code tidiness:
>
> 1. It could be useful to look into reducing the amount of indentation
>   you use.  It's a tricky trade-off between too much and not enough,
>   so I don't really have any answers.  Perhaps this style could help here
>
>   addToBadCaches e cache
>    | condition 1 =
>        ...
>    | condition 2 =
>        ...
>
>   It's all about making the code as easy to read as possible.
>
> 2. It seems like you should just name a source = cacheSource cache,
>   which I think will make the code a lot simpler looking
>
I'll rewrite it thanks :).


>
> In this particular case, you've chosen to break checkSshError into
> a separate function, but not the HTTP stuff.  Why?
>
I did it just because it was getting too long (yes, I tried to keep
the columns <= 80, but I had that clumsy idea I mentioned before).

>
>> +-- | Given a SSH source it returns the server address appended with HTTP://
>> +-- i.e: getHttpUrl "user@darcs.net:repo" returns "http://darcs.net"
>> +getHttpUrl :: String -> String
>> +getHttpUrl source = "http://" ++ ((\(_:xs) -> takeWhile (/=':') xs) . dropWhile (/='@') $ source)
>
>
> Why does checkSshError take both the source and url argument?  Is there
> a case where url cannot be systematically derived from the source?
>
None; in fact the url is inferred from the source.
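
In that case the derivation can live next to checkSshError, which then
only needs the source (sshHost is a hypothetical helper name; the
behaviour matches the haddock example in the patch):

```haskell
-- Extract the host from an ssh source of the form "user@host:path".
sshHost :: String -> String
sshHost = takeWhile (/= ':') . drop 1 . dropWhile (/= '@')

-- e.g. getHttpUrl "user@darcs.net:repo" gives "http://darcs.net"
getHttpUrl :: String -> String
getHttpUrl source = "http://" ++ sshHost source
```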

> What happens to the Either type that you return in your handler?
> Doesn't it get discarded?
>
Yes, it is discarded.  I wasn't very happy with that part, and I
realized that it is wrong.

> Why are we going about things in this roundabout way of checking the SSH
> path by converting the source name to an HTTP URI and fetching it over
> HTTP?  What about cases where people have a working SSH server but not
> necessarily an HTTP server on a given machine?
>
You are right here, and I realized today that this is not the way I
should test that the ssh source is reachable.  (I was just assuming
that every server would give a 200 when fetching over HTTP; I forgot
about cases like the one you mentioned.)  I didn't use the error
message in this case, because with ssh we don't get an error like the
one with http (i.e. "timeout reached").  I will look into how to
handle that error better.  Also, in this patch the ssh check "wrongly"
only looks for a timeout error and doesn't check whether the repo is
still there; I guess I would have to do something like
"existRemoteFileSsh" and check for foo/_darcs/hashed_inventory.

>
> Do you really use this?
>>  import Data.Maybe ( isJust, catMaybes )
>>  import Control.Monad ( msum )
>
That's not mine; I just didn't clean it up, as I think that kind of
thing should go in a different patch.

>
>     (timeOut,dff) <- do r <- fetchFilePS (repo ++ "/" ++ df) Cachable
>                         return (False,r)
>                      `catchNonSignal` (\ e -> return (isTimeOut e,B.empty))
>
> The same sort of worry as above.  What happens if your exception is not
> a timeout?  I'm concerned that you succeed here, where what you really
> want to do maybe is re-throw the exception.
>

If the exception is not a timeout, there is still a possibility that
"repo++"/"++darcsdir++"/inventory" exists; if it doesn't, we throw
the final error.

Now I'm unsure how to proceed with the ssh server; I mean, how could I
check that the SSH server is reachable, or that _darcs/hashed_inventory
exists?  A first idea that came to mind was using ssh itself, but I'm
not sure about portability.

Thanks !

--
Adolfo
msg11837 (view) Author: kowey Date: 2010-07-23.11:50:46
On Thu, Jul 22, 2010 at 20:38:35 -0500, Adolfo Builes wrote:
> > Is it really necessary for this to be String -> Bool?
> > Why not String -> IO Bool?
> >
> > Alternatively, why not IO (String -> Bool)?
> >
> I used  (String -> Bool) basically to escape the IO so that I could
> use it easily in the guards.

...

> I totally agree that I abused unsafePerformIO; in this case I used it
> for the same reason I gave before.

I think an IO (String -> Bool) approach could maybe still work because
you can grab the (String -> Bool) function within IO, and then use it
within a guard.

You'll have to decide if minimising the unsafePerformIO is worth the
extra code complexity or not.  I tend to say it is, but others may
disagree with me.
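To make the suggestion concrete, here is a minimal, self-contained sketch of the IO (String -> Bool) approach; the IORef, `isBadSource`, and `describeSource` here are hypothetical stand-ins for darcs's real bad-source machinery, shown only to illustrate binding the predicate once in IO and then using it purely in guards:

```haskell
import Data.IORef

-- Hypothetical stand-in for darcs's global bad-source list.
isBadSource :: IORef [String] -> IO (String -> Bool)
isBadSource ref = do
  badSources <- readIORef ref
  return (`elem` badSources)

-- The predicate is grabbed once inside IO, then used purely in guards.
describeSource :: IORef [String] -> String -> IO String
describeSource ref src = do
  bad <- isBadSource ref
  let answer
        | bad src   = "skipping unreachable source"
        | otherwise = "trying source"
  return answer
```

This keeps the guard syntax Adolfo wanted while confining the IO to a single bind.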

> >> +cacheSource (Cache _ _ s) = s
> >
> > Potentially useful helper function.  Seems like we could alternatively
> > modify the definition of Cache to use the curly brace syntax.
> >
> I would totally prefer curly brace syntax, do we have any reason for
> not using it ?

I'd say go for it.  I trust you mean something like

  data Cache = Cache { ...
                     , cacheSource :: String }

> I use the whitelist basically to avoid testing again whether the repo
> is reachable or not.  (Given the fact that I can get an error just
> because a certain file is not in that repo, even though the repo does exist.)

OK, that makes some sense, that there may be a legitimate 404 on a
file that does not exist, but the repository is reachable.  Also, would
it be necessary to worry about other HTTP status codes, such as the
redirect ones (301 and 302?).  Is there a more robust way to do this?

[Being sort of an abstract, high-level kind of guy, working on Darcs
 has been a wonderful way for me to learn how things work a bit more
 on the lower level]

> I agree with you, I tried to work with the exception, unfortunately
> "fromException" was giving nothing to this error ( I will look at it
> again though).

Thanks.  System.IO.Error may help.

> > 3. Case in point, you're matching "Foo" against (map toLower "foo")
> >   which means that this code path is never visited.
> I used it specifically for "timeout", since the error could show "Timeout"
> or "timeout".

Sorry, I was referring to this:

  case "No such file or directory" `isInfixOf`  (map toLower e) of

> > If the exception is not a timeout, we test if the URL is reachable and
> > whitelist it if so (the whitelist avoids future testing).  But why do we
> > do this?  And doesn't that confuse Darcs into thinking it successfully
> > fetched a file when it did not?  Have you tested the actual case where
> > you get an exception but not a timeout?
> >
> I did test it, and the reason I add it to the white list is to avoid
> inspecting it again (as Petr said).

Do you test it automatically?
 
> I added the trailing slash basically because of the reason Petr said,
> since some servers treat foo/ and foo differently, but as Petr
> suggests, what I should test is whether foo/_darcs/hashed_inventory exists.

Yes.

> > In this particular case, you've chosen to break checkSshError into
> > a separate function, but not the HTTP stuff.  Why?
> >
> I did it just because it was getting too long (yes, I tried to keep
> the lines <= 80 columns, but I had that clumsy idea I mentioned before)

OK, well what I would suggest is to break the local and HTTP stuff out too if
you're going to break the SSH stuff out (now I may decide I was wrong after
looking at the results of this, because I don't yet have a good feel about
what is most readable)...
 
>  I guess I would have to do something like
> "existRemoteFileSsh" and check for foo/_darcs/hashed_inventory

Seems fair enough.

> > Do you really use this?
> >>  import Data.Maybe ( isJust, catMaybes )
> >>  import Control.Monad ( msum )
> >
> That's not mine.  I just didn't clean it up, as I think that kind of
> thing should go in a different patch.

I was referring to badCache :: CacheLoc -> Bool

This is one of these subtle and very subjective issues that I don't really
have an answer for, but I have a feeling that there is sometimes a point
where you're better off redoing application or composition than writing a
helper function.  An example of this would be a hypothetical

 putStrLnError = hPutStrLn stderr

(In fact, I use this sort of thing in some of my own code).  Seems like
there the cost of the extra helper function to think about is higher
than that of just repeating 'hPutStrLn stderr'.  Maybe somebody else has
clearer thinking on this tradeoff.  These things aren't big deals, of course.
Maybe there is no right answer and there's no point talking about them.

> > The same sort of worry as above.  What happens if your exception is not
> > a timeout?  I'm concerned that you succeed here, where what you really
> > want to do maybe is re-throw the exception.
> 
> If the exception is not a timeout, there is still a possibility that
> "repo++"/"++darcsdir++"/inventory" exists; if it doesn't, we throw
> the final error.

OK, I have to admit I don't really understand the problem yet (I think Petr
does) so I can trust the two of you.  What I was worried about was that you
were just absorbing the final error instead of throwing it.  Are you sure
you're actually throwing it?
 
> Now I'm unsure how to proceed with the ssh server; I mean, how could I
> check that the SSH server is reachable, or that _darcs/hashed_inventory
> exists?  A first idea that came to mind was using ssh itself, but I'm
> not sure about portability.

Why not use the preExisting copyFileOrUrl (or was it fetch? not too
sure I remember the difference) functions?

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, please try +44 (0)1273 64 2905.
msg11839 (view) Author: ganesh Date: 2010-07-23.11:58:05
Eric Kow wrote:
> On Thu, Jul 22, 2010 at 20:38:35 -0500, Adolfo Builes wrote:
>>> Is it really necessary for this to be String -> Bool?
>>> Why not String -> IO Bool?
>>> 
>>> Alternatively, why not IO (String -> Bool)?
>>> 
>> I used  (String -> Bool) basically to escape the IO so that I could
>> use it easily in the guards.
> 
> ...
> 
>> I totally agree that I abused unsafePerformIO; in this case I used it
>> for the same reason I gave before.
> 
> I think an IO (String -> Bool) approach could maybe still work
> because you can grab the (String -> Bool) function within IO, and
> then use it within a guard.  
> 
> You'll have to decide if minimising the unsafePerformIO is worth the
> extra code complexity or not.  I tend to say it is, but others may
> disagree with me.  

I'm totally against the use of unsafePerformIO simply to reduce local
code complexity.  Where we use it, it's to deal with _global_ issues
like needing to thread state or log messages everywhere.

Ganesh

msg11895 (view) Author: abuiles Date: 2010-07-28.13:47:46
1 patch for repository http://darcs.net:

Wed Jul 28 08:07:31 COT 2010  builes.adolfo@googlemail.com
  *  Resolve issue 1599: automatically expire unused caches
Attachments
msg11898 (view) Author: kowey Date: 2010-07-28.17:46:38
Just some quick English and Hlinting (not a proper review yet)

Resolve issue 1599: automatically expire unused caches
-------------------------------------------------------
> +        hPutStrLn stderr $ "\nI could not reach the following " ++
> +          englishNum (length sources)  (Noun "repository") " (listed in _darcs/prefs/sources):"
> +        hPutStrLn stderr $ (unlines sources) ++  "\nIt is recommendable to delete the " ++
> +         englishNum (length sources)  (Noun "entry") " listed aboved if not used"

How about this?

 I could not reach the repositories:
   foo
   bar
   baz
   quux
 If you're not using them, you should probably delete the
 the corresponding entries from _darcs/prefs/sources

> +isBadSource :: IO (String -> Bool)
> +isBadSource = do badSources <- getBadSourcesList
> +                 return (`elem` badSources)

Thanks for getting rid of the needless unsafePerformIO

> +checkCacheReachability :: String -> CacheLoc -> IO ()
> +checkCacheReachability e cache
> + | isFile (source cache) = do
> +   reachable <- isReachableSource
> +   if not $ reachable (source cache)
> +     then
> +      do
> +       exist <- doesDirectoryExist $ source cache
> +       when (not exist) $ addBadSource $ source cache
> +       when exist $ addReachableSource $ source cache
> +     else
> +       return ()

I think you want
      
      unless (reachable (source cache)) $ do

You could also maybe refactor this a little bit with a helper function
that takes the (String -> Bool) function as an argument.  Not sure  if
that would be a good thing.
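A sketch of that refactor under stated assumptions: the reachability predicate, the existence check, and the whitelist/blacklist are passed in as arguments (all the names below are hypothetical stand-ins for darcs's real helpers), combining the `unless` rewrite with the helper-function idea:

```haskell
import Control.Monad (unless)
import Data.IORef

-- Hypothetical helper: everything it touches is injected, so the
-- guard logic itself stays pure and small.
checkLocalCache :: (String -> Bool)      -- already known reachable?
                -> (String -> IO Bool)   -- does the directory exist?
                -> IORef [String]        -- whitelist
                -> IORef [String]        -- blacklist
                -> String -> IO ()
checkLocalCache reachable exists white black src =
  unless (reachable src) $ do
    ok <- exists src
    if ok
      then modifyIORef white (src:)  -- whitelist: don't re-check later
      else modifyIORef black (src:)  -- blacklist: stop trying this source
```

The `unless`/`if` shape replaces the original nested `if not ... then do ... else return ()`.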

> +      else if reachable $ source cache
> +            then return  ()
> +            else

see above

> + | otherwise                  = fail $ "unknown transport protocol for: " ++ source cache

those spaces are a bit superfluous

Overall, I think your code is looking a lot cleaner now!

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, please try +44 (0)1273 64 2905.
msg11899 (view) Author: abuiles Date: 2010-07-28.20:55:19
1 patch for repository http://darcs.net:

Wed Jul 28 15:37:43 COT 2010  builes.adolfo@googlemail.com
  * Resolve issue 1599: automatically expire unused caches
Attachments
msg11905 (view) Author: kowey Date: 2010-08-01.18:33:09
Sorry, Adolfo,

I've been remiss in my reviewing duty.

Anyway, I've interspersed Eric-reads-the-code style comments with
comments to the patch author.  The latter are highlighted with
REMARK

Resolve issue 1599: automatically expire unused caches
======================================================
> +        badSources <- getBadSourcesList
> +        when ( not $ null badSources ) $ reportBadSources badSources

So here we're modifying the withAtExit function directly to report on
the caches that we could not reach.

REMARK: An alternative approach may be to register this as an exit
action via atexit, which seems a bit cleaner/more modular to me.  I
guess the component that would register this would be the main function
for darcs.  On the other hand, I suppose one advantage of baking this
into withAtExit is that we control the order of execution; this fires
after all the actions have been run.  Are there other reasons?  If we
don't care about the order of execution so much, I'd lean more towards
the more modular approach

> +    reportBadSources sources = do
> +        hPutStderr $ "\nI could not reach the following " ++
> +          englishNum (length sources)  (Noun "repository") ":"
> +        hPutStderr $ (unlines sources) ++  "If you're not using " ++ itThem ( length sources) ++
> +          ", you should probably delete\nthe corresponding " ++
> +          englishNum (length sources) (Noun "entry") " from _darcs/prefs/sources."
> +    hPutStderr = hPutStrLn stderr
> +    itThem num = case num > 1 of
> +                  False -> "it"
> +                  True  -> "them"

REMARK: Nice touch on the pronoun.  This could probably go into the
English module.  You could even make a Pronoun type which implements the
Countable instance, which allows you to reuse englishNum.
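For illustration, the Pronoun idea might look like this; the Countable class below is a simplified stand-in for the one in darcs's English module, not its actual definition:

```haskell
-- Simplified stand-in for the English module's Countable machinery.
class Countable a where
  decline :: a -> Int -> String

newtype Noun = Noun String
instance Countable Noun where
  decline (Noun s) 1 = s
  decline (Noun s) _ = s ++ "s"  -- naive pluralisation; the real module is smarter

-- The new type: a pronoun that agrees in number, reusing englishNum.
data Pronoun = Pronoun
instance Countable Pronoun where
  decline Pronoun 1 = "it"
  decline Pronoun _ = "them"

englishNum :: Countable a => Int -> a -> String
englishNum n x = decline x n
```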

> +{- NOINLINE _badSourcesList -}
> +_badSourcesList :: IORef [String]
> +_badSourcesList = unsafePerformIO $ newIORef []
> +
> +addBadSource :: String -> IO ()
> +addBadSource cache = modifyIORef _badSourcesList (cache:)
> +
> +getBadSourcesList :: IO [String]
> +getBadSourcesList = readIORef _badSourcesList
> +
> +isBadSource :: IO (String -> Bool)
> +isBadSource = do badSources <- getBadSourcesList
> +                 return (`elem` badSources)

Looks fine; similar for reachableSources

> -data CacheLoc = Cache !CacheType !WritableOrNot !String
> +data CacheLoc = Cache { cacheType:: !CacheType, writableOrNot:: !WritableOrNot, source:: !String }

REMARK: I tend to prefix names of field accessors, which makes the names
uglier but reduces the chances I'll need to qualify them later on, for
example.

  data CacheLoc = Cache { cacheType:: !CacheType
                        , cacheWritable :: !WritableOrNot
                        , cacheSource:: !String }

copyFileUsingCache
------------------
> -       sfuc cache stickItHere
> +       badSource <- isBadSource
> +       sfuc cache stickItHere badSource

> -          sfuc [] _ = return ()
> -          sfuc (c:cs) out | not $ writable c =
> +          sfuc [] _ _ = return ()
> +          sfuc (c:cs) out badSource | not (badSource (source c)) && not (writable c) =
>                if oos == OnlySpeculate

This new version of sfuc makes use of the list of unreachable sources.

REMARK: Perhaps another approach you could take is to just apply a
filter on the cache, maybe changing it to

 case filter (\c -> not (badSource c || writable c)) cs of
  []    -> return ()
  (c:_) -> sfuc c stickItHere


Personally, I'd just drop the guards and use an if-then-else for clarity

> hunk ./src/Darcs/Repository/Cache.hs 222
> -                 then speculateFileOrUrl (fn c) out
> +                 then speculateFileOrUrl (fn c) out `catchNonSignal` (\e ->
> +                                       checkCacheReachability (show e) c)

> -                 else copyFileOrUrl [] (fn c) out Cachable
> -                          | otherwise = sfuc cs out
> +                 else copyFileOrUrl [] (fn c) out Cachable `catchNonSignal` (\e ->
> +                                       checkCacheReachability (show e) c)

OK, so the other change being that if something should go wrong when
speculating or trying to copy a file, you check to see if the cache
is alright.

> +                          | otherwise = sfuc cs out badSource

REMARK: The indentation for this otherwise case is now a bit off, which
makes the code harder to understand.  But it would be a moot point if
you went with the filter approach.

checkCacheReachability
-----------------------
> +-- | Checks if a given cache entry is reachable or not.
> +-- It receives an error caught during execution and the cache entry.
> +-- For a local cache, if the given source doesn't exist anymore, it is added.
> +-- For HTTP sources if the error is timeout, it is added, if not we check for the
> +-- _darcs/hashed_inventory file, if doesn't exist it means we are pointing to a repository
> +-- which used to exist there, but had been moved.
> +-- For SSH if we get an error we try to get the file _darcs/hashed_inventory again, if it fails
> +-- we add the entry to the list of sources which are no reachables.
> +-- The entries which get added to the cache are no longer tried for the rest of the command.
> +checkCacheReachability :: String -> CacheLoc -> IO ()
> +checkCacheReachability e cache
> + | isFile (source cache) = do
> +     reachable <- isReachableSource
> +     unless (reachable (source cache)) $ do
> +       exist <- doesDirectoryExist $ source cache
> +       unless exist $ addBadSource $ source cache
> +       when exist $ addReachableSource $ source cache

If a source has already been whitelisted, we don't need to check it;
otherwise, otherwise either whitelist or blacklist it depending on
whether the directory exists.

REMARK: I think you can share the source cache expression using a where
clause which will be common to all the guards.

REMARK: These last two cases look like they could be more simply
expressed using an if-then-else

  if exist
     then addReachableSource (source cache)
     else addBadSource (source cache)

> + | isUrl (source cache) = do
> +     reachable <- isReachableSource
> +     let string = case dropWhile (/='(') e of
> +                    (_:xs) -> fst (break (==')') xs)
> +                    _      -> e
> +     let cerror = case reads string ::[(HTTP.ConnectionError,String)] of
> +                     [(ce,_)] -> Just ce
> +                     _        -> Nothing

REMARK: We're still parsing error strings here, which seems rather
fragile to me.  Weren't we going to capture status codes from curl or
something?  It's an improvement at least that the error code parsing is
in one place and not scattered throughout the darcs source.

> +     if cerror /= Nothing
> +      then addBadSource $ source cache
> +      else
> +       unless (reachable (source cache)) $
> +                  withTemp $ \tempout -> do
> +                    let f = source cache ++ "/" ++darcsdir ++ "/" ++ "hashed_inventory"
> +                    copyFileOrUrl [] f tempout Cachable
> +                    addReachableSource $ source cache
> +                  `catchNonSignal` (\_ -> addBadSource $ source cache)

So the logic here is:

If we get a connection error of some sort, then we add the cache to the
blacklist, even if it was already previously whitelisted.

If it's not a connection error that we got, and we have not already
whitelisted the server, then we test for reachability by trying to
fetch the hashed_inventory file.  If the server is indeed reachable
it is whitelisted so we don't bother checking again; otherwise, we
add it to the blacklist so we know not to fetch from it again.

REMARK: It seems like you have a situation where an item could be
added to the blacklist even though it's already whitelisted.  Is
that OK?  Could there be some sort of unpredictable behaviour when
an entry is in both lists?  Perhaps one way to think about this -- not
necessarily the right way, mind you -- is that

 * If an entry is in the whitelist, it can be added to the blacklist
   (but then it's removed from the whitelist).  This would apply to
   cases where a server goes down in the middle of a fetch.
 * If an entry is in the blacklist, and you try to add it to the
   whitelist, we have an error (bug in darcs...)
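The proposed invariant could be sketched as follows (a hypothetical model, not darcs's actual representation of the two lists):

```haskell
import Data.IORef

data SourceLists = SourceLists { whiteL :: [String], blackL :: [String] }

-- Blacklisting evicts the entry from the whitelist, covering the case
-- where a server goes down in the middle of a fetch.
blacklistSource :: IORef SourceLists -> String -> IO ()
blacklistSource ref src = modifyIORef ref $ \(SourceLists w b) ->
  SourceLists (filter (/= src) w) (src : b)

-- Whitelisting something already blacklisted would indicate a bug in darcs.
whitelistSource :: IORef SourceLists -> String -> IO ()
whitelistSource ref src = do
  SourceLists w b <- readIORef ref
  if src `elem` b
    then error ("bug in darcs: whitelisting blacklisted source " ++ src)
    else writeIORef ref (SourceLists (src : w) b)
```

Under this scheme an entry can never be in both lists at once, which removes the unpredictable-behaviour worry.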

REMARK: we're baking in the assumption that hashed repositories have a
hashed_inventory file... hopefully this sort of baking it won't bite us
in the long run.  I guess there's not much we can do about this.

REMARK: You may find the isJust function to be useful.

> + | isSsh  (source cache) =
> +     withTemp $ \tempout -> do
> +       let f = source cache ++ "/" ++darcsdir ++ "/" ++ "hashed_inventory"
> +       copyFileOrUrl [] f tempout Cachable
> +       addReachableSource $ source cache
> +     `catchNonSignal` (\_ -> addBadSource $ source cache)

REMARK: It looks here like you have some duplication with the HTTP code,
maybe you could refactor this into a helper function like checkRemote.

REMARK: You seem to have omitted the case where the cache entry is
already whitelisted.  (Seems like that logic could go into checkRemote)

> + | otherwise = fail $ "unknown transport protocol for: " ++ source cache

fetchFileUsingCachePrivate
--------------------------
REMARK: What's the difference between copyFileUsingCache and
fetchFileUsingCachePrivate?  I think it's useful if you know,
since you're modifying these two functions.

> -       ffuc cache
> +       badSource <- isBadSource
> +       ffuc cache badSource

> -    `catchall` debugFail ("Couldn't fetch `"++f++"'\nin subdir "++(hashedDir subdir)++
> +    `catchall` debugFail ("Couldn't fetch `"++f++"'\nin subdir "++ hashedDir subdir ++

REMARK: This should be in a separate patch.  (Don't worry too much about
this, just being super-picky about minimal patches)

> hunk ./src/Darcs/Repository/Cache.hs 286
> -    where ffuc (c:cs)
> -           | not (writable c) && (Anywhere == fromWhere || isFile (fn c)) =
> +    where ffuc (c:cs) badSource
> +           | not (writable c) && (Anywhere == fromWhere || isFile (fn c)) && not (badSource  (source c)) =

REMARK: Similar remark as with copyFileUsingCache.  Maybe we could just
filter out the bad sources and leave the ffuc logic intact.

>                do debugMessage $ "In fetchFileUsingCachePrivate I'm going manually"
>                   debugMessage $ "    getting "++f
>                   debugMessage $ "    from " ++ fn c
> hunk ./src/Darcs/Repository/Cache.hs 299
>                                      fail $ "Hash failure in " ++ fn c
>                              return (fn c, x')
>                      else return (fn c, x) -- FIXME: create links in caches
> -              `catchall` ffuc cs
> +              `catchNonSignal` (\e -> do
> +                                       checkCacheReachability (show e) c
> +                                       foo <- isBadSource

OK, so if something goes wrong during fetchFileUsingCachePrivate, we
need to checkCacheReachability again.  This can modify the white and
black lists so we need to pass an updated isBadSource.

REMARK: wait a second, why do we need to care about the updated
isBadSource?  It's not like sources reappear in the list, do they?

REMARK: Actually naming your variables 'foo' is probably a bad idea
(particularly since foo tends to be used casually to refer to some
hypothetical variable, not an actual one)

 
> -              `catchall` do (fname,x) <- ffuc cs
> -                            do createCache c subdir
> -                               createLink fname (fn c)
> -                               return (fn c, x)
> -                             `catchall`
> -                             do gzWriteFilePS (fn c) x `catchall` return ()
> -                                return (fname,x)
> -           | otherwise = ffuc cs
> +              `catchNonSignal` (\ e ->
> +                                 do
> +                                   checkCacheReachability (show e) c
> +                                   foo <- isBadSource
> +                                   (fname,x) <- ffuc cs foo
> +                                   do createCache c subdir
> +                                      createLink fname (fn c)
> +                                      return (fn c, x)
> +                                    `catchall`
> +                                      do gzWriteFilePS (fn c) x `catchall` return ()
> +                                         return (fname,x))

Just a similar change; if something goes wrong we update the black/white
lists accordingly and try from the subsequent cache.

REMARK: I'm still worried about this moving from catchall to
catchNonSignal...  Are we relying on any of these handlers doing
something clever when the user kills darcs ?

> -    where handler' se =
> +    where handler' se = do

REMARK: this noise does not belong in the patch :-)

HTTP module
-----------
> +data ConnectionError = CouldntResolveHost     |
> +                       CouldntConnectToServer |
> +                       OperationTimeout
> +               deriving (Eq, Read, Show)

REMARK: I would avoid the contraction here and call it
CouldNotResolveHost, etc.  I can't really say why; I guess
some part of me feels it's clearer that way, or some part of
me wants ADTs and haddocks to be fairly formal.

> -waitNextUrl :: IO (String, String)
> +waitNextUrl :: IO (String, String, Maybe ConnectionError)

REMARK: It may be worth thinking about just replacing that second String
with Maybe ConnectionError...  (not sure how wise this is in practice)

>  #ifdef HAVE_HTTP
>  
> hunk ./src/HTTP.hs 96
>  waitNextUrl = do
>    (u, f) <- readIORef requestedUrl
>    if null u
> -     then return ("", "No URL requested")
> +     then return ("", "No URL requested", Nothing)
>       else do writeIORef requestedUrl ("", "")
>               e <- (fetchUrl u >>= \s -> B.writeFile f (BC.pack s) >> return "") `catch` h
> hunk ./src/HTTP.hs 99
> -             return (u, e)
> +             let ce = case e of
> +                       "timeout" -> Just OperationTimeout
> +                       _         -> Nothing

It looks like this e string is always an error.

REMARK: I guess there's not much more we can do than trust the strings here.  Have
you tested this with the HTTP module?

URL module
----------
Phew, this seems a bit confusing.  I think the URL module contains both
high-level code and curl-specific code which is #ifdef'ed out.

Seems like the low-level Curl code should be moved to its own module.

> +
> +

REMARK: another irrelevant change, grumble :-P

> hunk ./src/URL.hs 223

>    let l = pipeLength st
>    when (l > 0) $ do
>                  dbg "URL.waitNextUrl start"
> -                (u, e) <- liftIO $ waitNextUrl'
> +                (u, e, ce) <- liftIO $ waitNextUrl'
>                  let p = inProgress st
>                      new_st = st { inProgress = Map.delete u p
>                                  , pipeLength = l - 1 }
> hunk ./src/URL.hs 238
>                           else case Map.lookup u p of
>                                  Just (f, _, _) -> do
>                                    removeFileMayNotExist (f++"-new_"++randomJunk st)
> -                                  downloadComplete u e
> +                                  case ce of
> +                                    Just httpError -> downloadComplete u (show httpError)
> +                                    Nothing        -> downloadComplete u e
>                                    debugMessage $ "URL.waitNextUrl failed: "++
>                                                 u++" "++f++" "++e
>                                  Nothing -> bug $ "Another possible bug in URL.waitNextUrl: "++u++" "++e

This is the high-level waitNextUrl, extended to make use of high-level
errors or fall back to string type errors in our reporting.

REMARK: does this mean the two cannot co-exist?  If not, the type
signature of waitNextUrl' could probably reflect somehow, maybe
using Either or another type like

  HttpStatus = HttpOK
             | HttpConnectionError ConnectionError
             | HttpOtherError String
          

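For illustration, the suggested type could absorb the (String, String, Maybe ConnectionError) triple roughly like this; the curl error numbers (6, 7, 29) are the ones used in the patch, the CouldNot* spellings follow the earlier remark, and `statusFromCurl` itself is a hypothetical helper:

```haskell
data ConnectionError = CouldNotResolveHost
                     | CouldNotConnectToServer
                     | OperationTimeout
  deriving (Eq, Show)

data HttpStatus = HttpOK
                | HttpConnectionError ConnectionError
                | HttpOtherError String
  deriving (Eq, Show)

-- Map a curl error number plus the raw error string to a structured
-- status; an empty error string means the fetch succeeded, mirroring
-- the patch's "null e" check.
statusFromCurl :: Int -> String -> HttpStatus
statusFromCurl _  "" = HttpOK
statusFromCurl 6  _  = HttpConnectionError CouldNotResolveHost
statusFromCurl 7  _  = HttpConnectionError CouldNotConnectToServer
statusFromCurl 29 _  = HttpConnectionError OperationTimeout
statusFromCurl _  e  = HttpOtherError e
```

With a single HttpStatus result the success and failure cases can no longer coexist in one return value.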
> -                        unless (null e) (debugFail $ "Failed to download URL "++u++": "++e)
> +                        unless (null e) $ do
> +                          debugMessage $ "Failed to download URL "++u++": "++e
> +                          fail e

REMARK: irrelevant change (I understand this may be a bit of a pain, but
any effort you could put into making your patches easy to understand
pays off)

darcs revert and darcs record are interactive... (although if you find
them a bit of a pain to use, there could be a UI discussion there...)

> -waitNextUrl' :: IO (String, String)
> +waitNextUrl' :: IO (String, String, Maybe ConnectionError)

This looks like the curl version of waitNextUrl'

>  waitNextUrl' = do
>    e <- curl_wait_next_url >>= peekCString
> +  ce <- if not (null e)
> +       then do
> +         errorNum <- curl_last_error_num
> +         case errorNum of
> +           6  -> return $ Just CouldntResolveHost
> +           7  -> return $ Just CouldntConnectToServer
> +           29 -> return $ Just OperationTimeout
> +           _  -> return Nothing
> +       else
> +         return Nothing

Hmm, I'll take this on faith.

REMARK: potential minor refactor

 return $ case errorNum of
           6  -> Just CouldntResolveHost
           7  -> Just CouldntConnectToServer
           29 -> Just OperationTimeout
           _  -> Nothing

> +foreign import ccall "hscurl.h curl_last_error_num"
> +  curl_last_error_num :: IO CInt

Exciting; I keep avoiding the FFI for some reason.

hscurl.c
--------
> +static CURLcode errornum  = 0;
>  
>  const char *curl_wait_next_url()
>  {
> +  errornum = -1;
> +
>  
> -      if (result != CURLE_OK)
> +      if (result != CURLE_OK){
> +        errornum = result;
>          return curl_easy_strerror(result);
> +      }

REMARK: Hmm, this sort of stateful way of doing things tends to make me
nervous (also, do we need to worry about what happens when you call
wait_next_url several times in parallel?). Could we not return a tuple
instead?  I guess we'd have to  find a way to deal with structs in the
FFI?

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, please try +44 (0)1273 64 2905.
msg11906 (view) Author: kowey Date: 2010-08-01.18:37:41
On Sun, Aug 01, 2010 at 18:33:11 +0000, Eric Kow wrote:
> REMARK: Hmm, this sort of stateful way of doing things tends to make me
> nervous (also, do we need to worry about what happens when you call
> wait_next_url several times in parallel?). Could we not return a tuple
> instead?  I guess we'd have to  find a way to deal with structs in the
> FFI?

I cut out too much context here.

I was referring to this change:

  +const int *curl_last_error_num();                                                                                                                                           

which is how we grab the error code from our last attempt to fetch a
URL.  I was asking if there is a way we could somehow return this in the
fetch function.  Solutions I can think of with my
non-C-programmer-limited brain (*) are some kind of struct or maybe
passing in the error code as a pointer...  Maybe somebody wiser can
chime in.
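On the Haskell side, the pointer idea might be wrapped like this; the commented-out C binding is hypothetical (no such function exists in hscurl.h), and only the generic out-parameter wrapper is shown:

```haskell
import Foreign.Ptr (Ptr)
import Foreign.C.Types (CInt)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (peek, poke)  -- poke is used in the usage example

-- Hypothetical C binding taking an out-parameter for the error code,
-- instead of the global errornum:
-- foreign import ccall "hscurl.h curl_wait_next_url_err"
--   curl_wait_next_url_err :: Ptr CInt -> IO CString

-- Generic wrapper: allocate the out-parameter, run the action (which
-- is expected to write the code into it), then read the code back.
withErrorCode :: (Ptr CInt -> IO a) -> IO (a, CInt)
withErrorCode act = alloca $ \p -> do
  r <- act p
  code <- peek p
  return (r, code)
```

A caller would then write `withErrorCode curl_wait_next_url_err` and pattern-match on the pair, with no shared static state on the C side.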

(*) Yeah yeah, kids these days...

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, please try +44 (0)1273 64 2905.
msg11921 (view) Author: abuiles Date: 2010-08-03.05:50:38
Hi Eric,

I'm almost done writing the modifications to the patch.  I have a small
issue, though, with:

>
> REMARK: An alternative approach may be to register this as an exit
> action via atexit, which seems a bit cleaner/more modular to me.  I
> guess the component that would register this would be the main function
> for darcs.  On the other hand, I suppose one advantage of baking this
> into withAtExit is that we control the order of execution; this fires
> after all the actions have been run.  Are there other reasons?  If we
> don't care about the order of execution so much, I'd lean more towards
> the more modular approach
>

I registered the action using atexit, and it works, but it happens that
it calls the action twice, so I end up with the message repeated.

This is the part where I added the atexit

hunk ./src/darcs.hs 54
+  atexit reportBadSources
hunk ./src/darcs.hs 77
-
+    where
+    reportBadSources = do
+        sources <- getBadSourcesList
+        when ( not $ null sources ) $ do
+           hPutStderr $ "\nI could not reach the following " ++
+             englishNum (length sources)  (Noun "repository") ":"
+           hPutStderr $ (unlines sources) ++  "If you're not using " ++
+             englishNum (length sources) Pronoun ", you should probably delete\nthe corresponding " ++
+             englishNum (length sources) (Noun "entry") " from _darcs/prefs/sources."
+    hPutStderr = hPutStrLn stderr

Any idea what is going on and why the action gets called twice?

--
Adolfo
msg11923 (view) Author: kowey Date: 2010-08-03.10:25:32
On Tue, Aug 03, 2010 at 00:53:18 -0500, Adolfo Builes wrote:
> I register the action using atexit, and it works, but happens that it
> calls the action twice, so I end with the message repeated.

Perhaps it's being registered more than once?  To debug this, you could
modify atexit to take a string along with an action and count up the
number of instances of each string.
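The counting part of that debugging suggestion is a small pure helper; `countLabels` is a hypothetical name, sketched here only to show the idea:

```haskell
import Data.List (group, sort)

-- Count how many times each label was registered with atexit; if
-- "reportBadSources" shows up with a count of 2, it is being
-- registered (and hence run) twice.
countLabels :: [String] -> [(String, Int)]
countLabels ls = [ (head g, length g) | g <- group (sort ls) ]
```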

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, please try +44 (0)1273 64 2905.
msg11926 (view) Author: abuiles Date: 2010-08-03.14:32:49
1 patch for repository http://darcs.net:

Tue Aug  3 09:15:16 COT 2010  builes.adolfo@googlemail.com
  * Resolve issue 1599: automatically expire unused caches
Attachments
msg11975 (view) Author: kowey Date: 2010-08-05.13:37:59
Hi Adolfo,

OK! Only one more iteration with me.  In my opinion, after your next
revision (sorry! it's mostly nitpicks, and feel free to ignore me if
I'm saying things which are evidently stupid or just subjective), we
should send your bundle to Somebody Else to review for correctness.
I'm bound to have missed the forest for the trees somewhere.

We've seen most of this code before, so I'll only comment on what I
think are the new bits (or any old bits that may need modifying still)

Resolve issue 1599: automatically expire unused caches
------------------------------------------------------
> -       sfuc cache stickItHere
> +       badSource <- isBadSource
> +       let cacheFiltered = filter (\ cacheEntry -> not . badSource $ cacheSource cacheEntry ) cache
> +       sfuc cacheFiltered stickItHere

FIXME: This is one place that point-free style could produce nicer code

> -          sfuc (c:cs) out | not $ writable c =
>
> -                          | otherwise = sfuc cs out

> +          sfuc (c:cs) out
> +            | not (writable c) =
...

> +            | otherwise = sfuc cs out

I won't really grumble too much here about the irrelevant change
since it seems to be partly a consequence of your other work.
Improving code style is a good thing; I just try to encourage it
in separate patches when I can.

> -   then speculateFileOrUrl (fn c) out
> -   else copyFileOrUrl [] (fn c) out Cachable
> + then speculateFileOrUrl (fn c) out `catchNonSignal` (\e -> checkCacheReachability (show e) c)
> + else copyFileOrUrl [] (fn c) out Cachable `catchNonSignal` (\e -> checkCacheReachability (show e) c)

[note, whitespace chopped for review]

FIXME: Please amend out the conflict with Petr's DefaultDarcsRepo
change.

Old news on catching the copy/speculate failure

> +-- | Checks if a given cache entry is reachable or not.
> +-- It receives an error caught during execution and the cache entry.

> +-- For a local cache, if the given source doesn't exist anymore, it is added.

> +-- For HTTP sources if the error is timeout, it is added, if not we check for the
> +-- _darcs/hashed_inventory file, if doesn't exist it means we are pointing to a repository
> +-- which used to exist there, but had been moved.
>
> +-- For SSH if we get an error we try to get the file _darcs/hashed_inventory again, if it fails
> +-- we add the entry to the list of sources which are no reachables.
> +-- The entries which get added to the cache are no longer tried for the rest of the command.

FIXME: The haddock above is a bit ambiguous.  Added where?  You could
probably say blacklisted and/or whitelisted.  Also, making each of these
into bullet points could make it a bit easier to read.

> +checkCacheReachability :: String -> CacheLoc -> IO ()
> +checkCacheReachability e cache


> + | isUrl source = do
> +     reachable <- isReachableSource
> +     unless (reachable source) $ do
> +            let string = case dropWhile (/='(') e of
> +                          (_:xs) -> fst (break (==')') xs)
> +                          _      -> e
> +            let cerror = case reads string ::[(HTTP.ConnectionError,String)] of
> +                           [(ce,_)] -> Just ce
> +                           _        -> Nothing
> +            if isJust cerror
> +             then addBadSource source
> +              else do
> +               checkHashedInventoryReachability cache
> +               addReachableSource source
> +              `catchNonSignal` (\_ -> addBadSource source)

...

> +-- | Checks if the  _darcs/hashed_inventory exist and is reachable
> +checkHashedInventoryReachability :: CacheLoc -> IO ()
> +checkHashedInventoryReachability cache =
> +     withTemp $ \tempout -> do
> +       let f = cacheSource cache ++ "/" ++darcsdir ++ "/" ++ "hashed_inventory"
> +       copyFileOrUrl [] f tempout Cachable


FIXME: I see you've done a bit of refactoring.  I think you could
make it a bit better.  How about returning IO Bool instead?  The idea is
that you'd move the exception handling into here and the code above
could just test a conditional.

Code clarity is the number one thing I tend to go after (unfortunately
this does not necessarily mean I'm any good at it).  It's why I seem to
be quite insistent on minor details like that; I think they pay off in
the long run especially when working in a team.

> -       ffuc cache
> -    `catchall` debugFail ("Couldn't fetch `"++f++"'\nin subdir "++(hashedDir subdir)++
> +       badSource <- isBadSource
> +       let cacheFiltered = filter (\ cacheEntry -> not . badSource $ cacheSource cacheEntry ) cache
> +       ffuc cacheFiltered
> +    `catchall` debugFail ("Couldn't fetch `"++f++"'\nin subdir "++ hashedDir subdir ++
>                            " from sources:\n\n"++show (Ca cache))

FIXME: sounds like you could do another pointfree refactor here

By the way, notice how much a little higher-order function like filter
lets us do without changing very much code?  Just compose some stuff
together, feed it to filter and you're done.

> -              `catchall` ffuc cs
> +              `catchNonSignal` (\e -> do
> +                                      checkCacheReachability (show e) c
> +                                      badSource <- isBadSource
> +                                      let cacheFiltered = filter (\ cacheEntry -> not . badSource $ cacheSource cacheEntry ) cs
> +                                      ffuc cacheFiltered)


When I complained about catchall and catchNonSignal you pointed out that
  x `catchall` y
is actually just
  x `catchNonSignal` const y
As an aside, this kinda makes me think that maybe catchall isn't
/really/ needed, but I could be wrong.
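The equivalence mentioned above can be written out.  In this self-contained sketch, `catchNonSignal` is modelled with plain `Control.Exception.catch`; the real darcs combinator additionally re-throws signal exceptions, which this stand-in does not.

```haskell
import Control.Exception (SomeException, catch)

-- Stand-in for darcs' combinator: catch any (non-signal) exception.
catchNonSignal :: IO a -> (SomeException -> IO a) -> IO a
catchNonSignal = catch

-- catchall ignores the exception entirely, so it is just catchNonSignal
-- with a constant handler.
catchall :: IO a -> IO a -> IO a
catchall x y = x `catchNonSignal` const y
```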

FIXME: you're using filter (not . badSource . cacheSource) quite a lot.

Maybe what you want is something like this

  filterOutBadSources :: [Cache] -> IO [Cache]
  filterOutBadSources cs = do
    badSource <- isBadSource
    return $ filter (not . badSource . cacheSource) cs

HTTP module
~~~~~~~~~~~
> -waitNextUrl' = do
> -  e <- curl_wait_next_url >>= peekCString
> -  u <- curl_last_url >>= peekCString
> -  return (u, e)
> +waitNextUrl' =
> +  alloca $ \ errorPointer -> do
> +    e <- curl_wait_next_url errorPointer >>= peekCString
> +    ce <- if not (null e)
> +          then do
> +           errorNum <- peek errorPointer
> +           case errorNum of
> +             6  -> return $ Just CouldNotResolveHost
> +             7  -> return $ Just CouldNotConnectToServer
> +             29 -> return $ Just OperationTimeout
> +             _  -> return Nothing
> +          else
> +           return Nothing
> +    u <- curl_last_url >>= peekCString
> +    return (u, e, ce)

FIXME: You should have a comment pointing us to some documentation
on the error codes.

As a chatty aside, I like bracket functions like alloca which
prevent forgetful people like me from introducing random memory
leaks for want of a free and also avoid bugs like me accidentally
freeing something I shouldn't.  I think we have similar idioms in
Darcs.Lock.  Sort of an inane comment on my part :-)
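As a standalone illustration of that bracket pattern (not darcs code): `alloca` allocates the buffer, runs the action, and frees the memory even if the action throws, so there is no `free` to forget.

```haskell
import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (peek, poke)
import Foreign.C.Types (CInt)

-- Write a value through a temporary pointer and read it back; the
-- buffer is released automatically when the action returns (or throws).
roundTrip :: CInt -> IO CInt
roundTrip n = alloca $ \p -> do
  poke p n
  peek p
```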

>  foreign import ccall "hscurl.h curl_wait_next_url"
> -  curl_wait_next_url :: IO CString
> +  curl_wait_next_url :: Ptr CInt -> IO CString

Thanks for teaching me a little more FFI

>  main = withAtexit $ withSignalsHandled $
>    flip catch execExceptionHandler $
>    handle (\(AssertionFailed e) -> bug e) $ do
> +  atexit reportBadSources

I think this is much cleaner than modifying Darcs.Global.

> +    reportBadSources = do
> +        sources <- getBadSourcesList
> +        when ( not $ null sources ) $ do

FIXME: Clearer as unless (null sources), I think

> +           hPutStderr $ "\nI could not reach the following " ++
> +             englishNum (length sources)  (Noun "repository") ":"
> +           hPutStderr $ (unlines sources) ++  "If you're not using " ++
> +             englishNum (length sources) It ", you should probably delete\nthe corresponding " ++
> +             englishNum (length sources) (Noun "entry") " from _darcs/prefs/sources."

Not sure if this really belongs in main, maybe in one of the Cache
modules? I'm not the best person to judge.

FIXME: Maybe worth rewriting this using only one hPutStrLn stderr
plus some string concatenation.  Concat may also be helpful here.
Basically, just try to make the code as easy to read as possible.
It's a bit subjective though, but I bet there are things you can
do which are objectively nicer.

> +    hPutStderr = hPutStrLn stderr

FIXME: Ahah, for all my carrying on about refactoring, I actually think
this is one of those nuanced cases where the code duplication of just
saying hPutStrLn stderr is clearer than asking people to keep track of
an extra name, particularly people who have to maintain the code.
Subtle, huh?! You duplicate code and Eric complains.  You refactor code
and then he complains some more! Can't win...

On the other hand, maybe I'd refactor the length sources if I were
you.


>        int error = curl_easy_getinfo(easy, CURLINFO_PRIVATE, (char **)&url_data);
> -      if (error != CURLE_OK)
> +      if (error != CURLE_OK){

The if (foo) bar; style always seems to me like a clarity error
wrt if (foo) { bar; }... Good thing your change necessitates getting
rid of it.

> +        *errorCode = error;
> +      }

-- 
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
For a faster response, please try +44 (0)1273 64 2905.
msg11985 (view) Author: abuiles Date: 2010-08-06.06:41:02
1 patch for repository http://darcs.net:

Fri Aug  6 01:27:53 COT 2010  builes.adolfo@googlemail.com
  * Resolve issue 1599: automatically expire unused caches
Attachments
msg11986 (view) Author: kowey Date: 2010-08-06.08:08:47
Hi Florent,

This patch bundle has been through a few kowey-lints.  But I suspect 
that in all my tweaking, I may have missed the forest for the trees.  
May I hand this over to you? (let me know if you're still catching up 
from holidays, or otherwise are not quite available yet)
msg11987 (view) Author: kowey Date: 2010-08-06.08:50:00
Oh, Petr says he's got this one
msg12009 (view) Author: mornfall Date: 2010-08-06.21:13:41
Hi,

since Eric requested a parallel review, let's go for it. Overall, I am
of the opinion that the patch could be pushed as it is, with a few
things fixed in a followup patch. I am pointing out some possible
stylistic cleanups below, and some refactoring opportunities. I'll be
happier if you pick up on those (be it as a followup or through
amendment), although I am not going to be terribly unhappy if the patch
stays as it is.

Yours,
   Petr.

Resolve issue 1599: automatically expire unused caches
------------------------------------------------------

[export/import stuff]

> hunk ./src/Darcs/Global.hs 154
>  resetCRCWarnings :: IO ()
>  resetCRCWarnings = writeIORef _crcWarningList []
>  
> +{- NOINLINE _badSourcesList -}
> +_badSourcesList :: IORef [String]
> +_badSourcesList = unsafePerformIO $ newIORef []
> +
> +addBadSource :: String -> IO ()
> +addBadSource cache = modifyIORef _badSourcesList (cache:)
> +
> +getBadSourcesList :: IO [String]
> +getBadSourcesList = readIORef _badSourcesList
> +
> +isBadSource :: IO (String -> Bool)
> +isBadSource = do badSources <- getBadSourcesList
> +                 return (`elem` badSources)
> +
> +{- NOINLINE _reachableSourcesList -}
> +_reachableSourcesList :: IORef [String]
> +_reachableSourcesList = unsafePerformIO $ newIORef []
> +
> +addReachableSource :: String -> IO ()
> +addReachableSource src = modifyIORef _reachableSourcesList (src:)
> +
> +getReachableSources :: IO [String]
> +getReachableSources = readIORef _reachableSourcesList
> +
> +isReachableSource :: IO (String -> Bool)
> +isReachableSource =  do reachableSources <- getReachableSources
> +                        return (`elem` reachableSources)
> +
>  darcsdir :: String
>  darcsdir = "_darcs"
Some global bookkeeping. Doesn't make me very happy, but for now it
happens to be the best tradeoff we can get.

[import/export wibbles for Darcs.Repository]

[import/export bits of Darcs.Repository.Cache]

> hunk ./src/Darcs/Repository/Cache.hs 63
>  
>  data WritableOrNot = Writable | NotWritable deriving ( Show )
>  data CacheType = Repo | Directory deriving ( Eq, Show )
> -data CacheLoc = Cache !CacheType !WritableOrNot !String
> +data CacheLoc = Cache { cacheType:: !CacheType, cacheWritable:: !WritableOrNot, cacheSource:: !String }
>  newtype Cache = Ca [CacheLoc] -- abstract type for hiding cache
>  
>  instance Eq CacheLoc where
> hunk ./src/Darcs/Repository/Cache.hs 209
>      do debugMessage $ "I'm doing copyFileUsingCache on "++(hashedDir subdir)++"/"++f
>         Just stickItHere <- cacheLoc cache
>         createDirectoryIfMissing False (reverse $ dropWhile (/='/') $ reverse stickItHere)
> -       sfuc cache stickItHere
> +       cacheFiltered <- filterBadSources cache
> +       sfuc cacheFiltered stickItHere
Filter the cache. I'd probably write

filterBadSources cache >>= (flip sfuc) stickItHere

(unless you need the cacheFiltered identifier somewhere later, but I
don't see that)

This could go in as an extra patch, if you feel like changing it.

>      `catchall` return ()
>      where cacheLoc [] = return Nothing
>            cacheLoc (c:cs) | not $ writable c = cacheLoc cs
> hunk ./src/Darcs/Repository/Cache.hs 221
>                                 case othercache of Just x -> return $ Just x
>                                                    Nothing -> return $ Just (fn c)
>            sfuc [] _ = return ()
> -          sfuc (c:cs) out | not $ writable c =
> +          sfuc (c:cs) out
> +            | not (writable c) =
>                if oos == OnlySpeculate
> hunk ./src/Darcs/Repository/Cache.hs 224
> -                 then speculateFileOrUrl (fn c) out
> -                 else copyFileOrUrl DefaultRemoteDarcs (fn c) out Cachable
> -                          | otherwise = sfuc cs out
> +               then speculateFileOrUrl (fn c) out `catchNonSignal` (\e -> checkCacheReachability (show e) c)
> +               else copyFileOrUrl DefaultRemoteDarcs (fn c) out Cachable `catchNonSignal` (\e -> checkCacheReachability (show e) c)
> +            | otherwise = sfuc cs out
OK. Some more checking of source validity.

>            fn c = hashedFilePath c subdir f
>  
>  copyFilesUsingCache :: OrOnlySpeculate -> Cache -> HashedDir -> [String] -> IO ()
> hunk ./src/Darcs/Repository/Cache.hs 236
>  
>  data FromWhere = LocalOnly | Anywhere deriving ( Eq )
>  
> +-- | Checks if a given cache entry is reachable or not.
> +-- It receives an error caught during execution and the cache entry.
> +-- If the caches is not reachable it is blacklisted and not longer tried for
> +-- the rest of the session. If it is reachable it is whitelisted and future errors with such
> +-- cache get ignore.
> +-- To determine reachability:
> +--  * For a local cache, if the given source doesn't exist anymore, it is blacklisted.
> +--  * For remote sources if the error is timeout, it is blacklisted, if not,
> +--    it checks if _darcs/hashed_inventory  exist, if it does, the entry is whitelisted, if
> +--    it doesn't, it is blacklisted.
> +checkCacheReachability :: String -> CacheLoc -> IO ()
> +checkCacheReachability e cache
> + | isFile source = do
> +     reachable <- isReachableSource
> +     unless (reachable source) $ do
> +       exist <- doesDirectoryExist source
> +       if exist
> +        then
> +         addReachableSource source
> +        else
> +         addBadSource source
> + | isUrl source = do
> +     reachable <- isReachableSource
> +     unless (reachable source) $ do
> +            let string = case dropWhile (/='(') e of
> +                          (_:xs) -> fst (break (==')') xs)
> +                          _      -> e
> +            let cerror = case reads string ::[(HTTP.ConnectionError,String)] of
> +                           [(ce,_)] -> Just ce
> +                           _        -> Nothing
> +            if isJust cerror
> +             then addBadSource source
> +             else checkFileReachability
> +
> + | isSsh source = do
> +   reachable <- isReachableSource
> +   unless (reachable source) checkFileReachability
> +
> + | otherwise = fail $ "unknown transport protocol for: " ++ source
> + where source = cacheSource cache
> +       checkFileReachability = do
> +         reachable <- checkHashedInventoryReachability cache
> +         if reachable
> +          then
> +           addReachableSource source
> +          else
> +           addBadSource source
OK. I am not exactly excited about the abuse of String here, although
IIRC you have discussed that with Eric already. I won't dither on this
too much.

> +-- | Returns a list of reachables cache entries,
> +--   taking out the blacklisted entries
> +filterBadSources :: [CacheLoc] -> IO [CacheLoc]
> +filterBadSources cache = do
> +   badSource <- isBadSource
> +   return $ filter (not . badSource . cacheSource) cache
OK.


> +-- | Checks if the  _darcs/hashed_inventory exist and is reachable
> +checkHashedInventoryReachability :: CacheLoc -> IO Bool
> +checkHashedInventoryReachability cache =
> +     withTemp $ \tempout -> do
> +       let f = cacheSource cache ++ "/" ++darcsdir ++ "/" ++ "hashed_inventory"
> +       copyFileOrUrl DefaultRemoteDarcs f tempout Cachable
> +       return True
> +     `catchNonSignal` (\_ -> return False)
You could use System.FilePath.Posix ( (</>) ) above (note the _Posix_
there). Other than that, OK.
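Petr's suggestion, sketched (`hashedInventoryPath` is a name invented here): `(</>)` from `System.FilePath.Posix` always joins with forward slashes, which is what URLs and cache sources need even on Windows, where the plain `System.FilePath` would pick backslashes.

```haskell
import System.FilePath.Posix ((</>))

-- Build the path the reachability check fetches, without manual "/" ++.
hashedInventoryPath :: String -> String
hashedInventoryPath source = source </> "_darcs" </> "hashed_inventory"
```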

>  fetchFileUsingCachePrivate :: FromWhere -> Cache -> HashedDir -> String -> IO (String, B.ByteString)
>  fetchFileUsingCachePrivate fromWhere (Ca cache) subdir f =
>      do when (fromWhere == Anywhere) $ copyFileUsingCache ActuallyCopy (Ca cache) subdir f
> hunk ./src/Darcs/Repository/Cache.hs 303
> -       ffuc cache
> -    `catchall` debugFail ("Couldn't fetch `"++f++"'\nin subdir "++(hashedDir subdir)++
> +       cacheFiltered <- filterBadSources cache
> +       ffuc cacheFiltered
> +    `catchall` debugFail ("Couldn't fetch `"++f++"'\nin subdir "++ hashedDir subdir ++
>                            " from sources:\n\n"++show (Ca cache))
OK. Filter the cache. Why not (filterBadSources cache >>= ffuc)?

>      where ffuc (c:cs)
>             | not (writable c) && (Anywhere == fromWhere || isFile (fn c)) =
> hunk ./src/Darcs/Repository/Cache.hs 320
>                                      fail $ "Hash failure in " ++ fn c
>                              return (fn c, x')
>                      else return (fn c, x) -- FIXME: create links in caches
> -              `catchall` ffuc cs
> +              `catchNonSignal` (\e -> do
> +                                      checkCacheReachability (show e) c
> +                                      cacheFiltered <- filterBadSources cs
> +                                      ffuc cacheFiltered)
Another place to check and filter the cache.

Makes me wonder if you wouldn't like to write something like

    checkCaches (show e) c cs >>= ffuc

instead...
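A standalone model of that suggested helper: the IORef stands in for darcs' global `_badSourcesList`, cache entries are reduced to their source strings, and `checkCaches` is only Petr's hypothetical name for the combined operation.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef, readIORef)

type Source = String

-- Blacklist the failing source, then return the remaining entries with
-- every known-bad source filtered out -- the "check and filter" pair
-- that fetchFileUsingCachePrivate repeats inline.
checkCaches :: IORef [Source] -> Source -> [Source] -> IO [Source]
checkCaches badRef failing rest = do
  modifyIORef badRef (failing :)
  bad <- readIORef badRef
  return (filter (`notElem` bad) rest)
```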

>  
>             | writable c =
>                do x1 <- gzFetchFilePS (fn c) Cachable
> hunk ./src/Darcs/Repository/Cache.hs 337
>                        else return x1
>                   mapM_ (tryLinking (fn c)) cs
>                   return (fn c, x)
> -              `catchall` do (fname,x) <- ffuc cs
> -                            do createCache c subdir
> -                               createLink fname (fn c)
> -                               return (fn c, x)
> -                             `catchall`
> -                             do gzWriteFilePS (fn c) x `catchall` return ()
> -                                return (fname,x)
> +              `catchNonSignal` (\ e ->
> +                                 do
> +                                   checkCacheReachability (show e) c
> +                                   cacheFiltered <- filterBadSources cs
> +                                   (fname,x) <- ffuc cacheFiltered
> +                                   do createCache c subdir
> +                                      createLink fname (fn c)
> +                                      return (fn c, x)
> +                                    `catchall`
> +                                      do gzWriteFilePS (fn c) x `catchall` return ()
> +                                         return (fname,x))
Looks equivalent, plus fixes up the cache list on errors plus filters
the caches some more. May I add that fetchFileUsingCache is in dire
need of a refactor? If you have a bit of time later, I'd welcome that.

>             | otherwise = ffuc cs
>  
>            ffuc [] = debugFail $ "No sources from which to fetch file `"++f++"'\n"++ show (Ca cache)

> hunk ./src/Darcs/Repository/Cache.hs 409
> +-- | Prints an error message with a list of bad caches.
> +reportBadSources :: IO ()
> +reportBadSources = do
> +  sources <- getBadSourcesList
> +  let size = length sources
> +  unless (null sources) $ do
> +   hPutStrLn stderr $
> +    concat ["\nI could not reach the following ",
> +            englishNum size  (Noun "repository") ":"]
> +   hPutStrLn stderr $
> +    concat [unlines sources,
> +            "If you're not using ",
> +            englishNum size It ", you should probably delete\nthe corresponding ",
> +            englishNum size (Noun "entry") " from _darcs/prefs/sources."]
(I know Eric made you use concat here... I'd go with unlines
personally. You can do this in an extra patch if you want to, or just
forget it otherwise.)

[export]
> hunk ./src/HTTP.hs 22
>  import qualified Data.ByteString.Char8 as BC
>  #endif
>  
> +data ConnectionError = CouldNotResolveHost     |
> +                       CouldNotConnectToServer |
> +                       OperationTimeout
> +               deriving (Eq, Read, Show)
Enumerate some common connection errors for use with above.

> +
>  fetchUrl :: String -> IO String
>  postUrl
>      :: String     -- ^ url
> hunk ./src/HTTP.hs 35
>      -> IO ()  -- ^ result
>  
>  requestUrl :: String -> FilePath -> a -> IO String
> -waitNextUrl :: IO (String, String)
> +waitNextUrl :: IO (String, String, Maybe ConnectionError)
This Maybe ConnectionError feels a bit backwards, but OK.

>  #ifdef HAVE_HTTP
>  
> hunk ./src/HTTP.hs 94
>  waitNextUrl = do
>    (u, f) <- readIORef requestedUrl
>    if null u
> -     then return ("", "No URL requested")
> +     then return ("", "No URL requested", Nothing)
>       else do writeIORef requestedUrl ("", "")
>               e <- (fetchUrl u >>= \s -> B.writeFile f (BC.pack s) >> return "") `catch` h
> hunk ./src/HTTP.hs 97
> -             return (u, e)
> +             let ce = case e of
> +                       "timeout" -> Just OperationTimeout
> +                       _         -> Nothing
> +             return (u, e, ce)
Extract a possible connection error. OK.

>      where h = return . ioeGetErrorString
>  
>  getProxy :: IO String

[imports again]
> hunk ./src/URL.hs 225
>    let l = pipeLength st
>    when (l > 0) $ do
>                  dbg "URL.waitNextUrl start"
> -                (u, e) <- liftIO $ waitNextUrl'
> +                (u, e, ce) <- liftIO $ waitNextUrl'
>                  let p = inProgress st
>                      new_st = st { inProgress = Map.delete u p
>                                  , pipeLength = l - 1 }
> hunk ./src/URL.hs 240
>                           else case Map.lookup u p of
>                                  Just (f, _, _) -> do
>                                    removeFileMayNotExist (f++"-new_"++randomJunk st)
> -                                  downloadComplete u e
> +                                  case ce of
> +                                    Just httpError -> downloadComplete u (show httpError)
> +                                    Nothing        -> downloadComplete u e
Pass errors from waitNextUrl down the pipe. OK.

>                                    debugMessage $ "URL.waitNextUrl failed: "++
>                                                 u++" "++f++" "++e
>                                  Nothing -> bug $ "Another possible bug in URL.waitNextUrl: "++u++" "++e
> hunk ./src/URL.hs 265
>                   Just var -> do
>                          e <- readMVar var
>                          modifyMVar_ urlNotifications (return . (Map.delete u))
> -                        unless (null e) (debugFail $ "Failed to download URL "++u++": "++e)
> +                        unless (null e) $ do
> +                          debugMessage $ "Failed to download URL "++u++": "++e
> +                          fail e
Is this any different?

>                   Nothing  -> return () -- file was already downloaded
>  
>  dbg :: String -> StateT a IO ()
> hunk ./src/URL.hs 293
>  
>  setDebugHTTP :: IO ()
>  requestUrl :: String -> FilePath -> Cachable -> IO String
> -waitNextUrl' :: IO (String, String)
> +waitNextUrl' :: IO (String, String, Maybe ConnectionError)
>  pipeliningEnabled :: IO Bool
>  
>  #ifdef HAVE_CURL
> hunk ./src/URL.hs 306
>        err <- curl_request_url ustr fstr (cachableToInt cache) >>= peekCString
>        return err
>  
> -waitNextUrl' = do
> -  e <- curl_wait_next_url >>= peekCString
> -  u <- curl_last_url >>= peekCString
> -  return (u, e)
> +waitNextUrl' =
> +  bracket malloc free $ \ errorPointer -> do
> +    e <- curl_wait_next_url errorPointer >>= peekCString
> +    ce <- if not (null e)
> +          then do
> +           errorNum <- peek errorPointer
> +           case errorNum of
> +             6  -> return $ Just CouldNotResolveHost
> +             7  -> return $ Just CouldNotConnectToServer
> +             29 -> return $ Just OperationTimeout
> +             _  -> return Nothing
> +          else
> +           return Nothing
> +    u <- curl_last_url >>= peekCString
> +    return (u, e, ce)
Oi, ugly. Well, that's the destiny of FFI code... Just translates cURL
error codes to something more malleable (ConnectionError). (I'd probably
use alloca myself, but bracket malloc free is probably equivalent.)

>  pipeliningEnabled = do
>    r <- curl_pipelining_enabled
> hunk ./src/URL.hs 330
>    curl_request_url :: CString -> CString -> CInt -> IO CString
>  
>  foreign import ccall "hscurl.h curl_wait_next_url"
> -  curl_wait_next_url :: IO CString
> +  curl_wait_next_url :: Ptr CInt -> IO CString
>  
>  foreign import ccall "hscurl.h curl_last_url"
>    curl_last_url :: IO CString
FFI bits. OK.

[imports for darcs.hs]
> hunk ./src/darcs.hs 53
>  main = withAtexit $ withSignalsHandled $
>    flip catch execExceptionHandler $
>    handle (\(AssertionFailed e) -> bug e) $ do
> +  atexit reportBadSources
>    argv <- getArgs
>    case argv of
>      -- User called "darcs" without arguments.
[OK] Register bad sources reporting to run at exit time.

> hunk ./src/hscurl.c 253
>    return error_strings[RESULT_OK];
>  }
>  
> -const char *curl_wait_next_url()
> +const char *curl_wait_next_url(int* errorCode)
>  {
> hunk ./src/hscurl.c 255
> +  *errorCode = -1;
> +
>    if (last_url != NULL)
>      {
>        free(last_url);
> hunk ./src/hscurl.c 280
>        CURLcode result = msg->data.result;
>        struct UrlData *url_data;
>        int error = curl_easy_getinfo(easy, CURLINFO_PRIVATE, (char **)&url_data);
> -      if (error != CURLE_OK)
> +      if (error != CURLE_OK){
Iew, that ){ is hurting my eyes. A bit of whitespace couldn't hurt! :)

> +        *errorCode = error;
>          return curl_easy_strerror(error);
> hunk ./src/hscurl.c 283
> +      }
>  
>        last_url = url_data->url;
>        fclose(url_data->file);
> hunk ./src/hscurl.c 295
>          return curl_multi_strerror(error);
>        curl_easy_cleanup(easy);
>  
> -      if (result != CURLE_OK)
> +      if (result != CURLE_OK){
> +        *errorCode = result;
>          return curl_easy_strerror(result);
> hunk ./src/hscurl.c 298
> +      }
>      }
>    else
>      return error_strings[RESULT_UNKNOWN_MESSAGE];
> hunk ./src/hscurl.h 5
>                               const char *filename,
>                               int cache_time);
>  
> -const char *curl_wait_next_url();
> +const char *curl_wait_next_url(int *errorCode);
>  
>  const char *curl_last_url();
OK.
msg12010 (view) Author: abuiles Date: 2010-08-07.00:32:03
Hi Petr,

Thanks for the review! I'm amending the stylistic things and resending.
>> hunk ./src/URL.hs 265
>>                   Just var -> do
>>                          e <- readMVar var
>>                          modifyMVar_ urlNotifications (return . (Map.delete u))
>> -                        unless (null e) (debugFail $ "Failed to download URL "++u++": "++e)
>> +                        unless (null e) $ do
>> +                          debugMessage $ "Failed to download URL "++u++": "++e
>> +                          fail e
> Is this any different?
>
The message of debugFail is the whole string you pass it as a parameter.

--
Adolfo
msg12011 (view) Author: abuiles Date: 2010-08-07.00:41:37
1 patch for repository http://darcs.net:

Fri Aug  6 19:33:39 COT 2010  builes.adolfo@googlemail.com
  * Resolve issue 1599: automatically expire unused caches
Attachments
msg12021 (view) Author: darcswatch Date: 2010-08-07.13:42:16
This patch bundle (with 1 patches) was just applied to the repository http://darcs.net/.
This message was brought to you by DarcsWatch
http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-90eebb0527b9056e3ab6a71c4e2df7e296fd8866
msg14055 (view) Author: darcswatch Date: 2011-05-10.17:35:41
This patch bundle (with 1 patches) was just applied to the repository http://darcs.net/reviewed.
This message was brought to you by DarcsWatch
http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-90eebb0527b9056e3ab6a71c4e2df7e296fd8866
History
Date                 User        Action  Args
2010-07-22 06:12:24  abuiles     create
2010-07-22 06:14:48  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-a21fdcf51e31ec24feb7ff9c98bf3be16c0649b0
2010-07-22 15:47:57  kowey       set     nosy: + kowey; messages: + msg11825
2010-07-22 18:08:14  mornfall    set     nosy: + mornfall; messages: + msg11828
2010-07-23 01:35:58  abuiles     set     messages: + msg11831
2010-07-23 11:50:46  kowey       set     messages: + msg11837
2010-07-23 11:50:58  kowey       set     status: needs-review -> followup-in-progress
2010-07-23 11:51:15  kowey       set     assignedto: abuiles
2010-07-23 11:58:05  ganesh      set     nosy: + ganesh; messages: + msg11839
2010-07-28 13:47:46  abuiles     set     files: + -resolve-issue-1599_-automatically-expire-unused-caches.dpatch, unnamed; messages: + msg11895
2010-07-28 13:49:42  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-a21fdcf51e31ec24feb7ff9c98bf3be16c0649b0 -> http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-1092d9b218048d9df86a58a1dd51f87cec476246
2010-07-28 17:46:38  kowey       set     messages: + msg11898
2010-07-28 20:55:19  abuiles     set     files: + resolve-issue-1599_-automatically-expire-unused-caches.dpatch, unnamed; messages: + msg11899
2010-07-28 21:00:05  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-1092d9b218048d9df86a58a1dd51f87cec476246 -> http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-886b1eeb38de006ebd09934e0cc42cb01d4192ad
2010-08-01 18:33:11  kowey       set     messages: + msg11905
2010-08-01 18:37:41  kowey       set     messages: + msg11906
2010-08-03 05:50:38  abuiles     set     messages: + msg11921
2010-08-03 10:25:32  kowey       set     messages: + msg11923
2010-08-03 14:32:49  abuiles     set     files: + resolve-issue-1599_-automatically-expire-unused-caches.dpatch, unnamed; messages: + msg11926
2010-08-03 14:34:28  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-886b1eeb38de006ebd09934e0cc42cb01d4192ad -> http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-9d53d8c4757b9f6f451ddcb9cd5d1112aa3f73e6
2010-08-05 13:38:00  kowey       set     status: followup-in-progress -> followup-requested; messages: + msg11975
2010-08-06 06:41:02  abuiles     set     files: + resolve-issue-1599_-automatically-expire-unused-caches.dpatch, unnamed; messages: + msg11985
2010-08-06 06:43:50  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-9d53d8c4757b9f6f451ddcb9cd5d1112aa3f73e6 -> http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-3cbfa82b84c0f9e7da37c676f5f7203bd88bf716
2010-08-06 08:08:48  kowey       set     status: followup-requested -> needs-review; assignedto: abuiles -> galbolle; messages: + msg11986; nosy: + galbolle
2010-08-06 08:50:00  kowey       set     assignedto: galbolle -> mornfall; messages: + msg11987
2010-08-06 21:13:42  mornfall    set     messages: + msg12009
2010-08-07 00:32:03  abuiles     set     messages: + msg12010
2010-08-07 00:41:37  abuiles     set     files: + resolve-issue-1599_-automatically-expire-unused-caches.dpatch, unnamed; messages: + msg12011
2010-08-07 00:43:46  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-3cbfa82b84c0f9e7da37c676f5f7203bd88bf716 -> http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-90eebb0527b9056e3ab6a71c4e2df7e296fd8866
2010-08-07 13:42:16  darcswatch  set     status: needs-review -> accepted; messages: + msg12021
2011-05-10 17:35:41  darcswatch  set     messages: + msg14055
2011-05-10 17:36:13  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_.html#bundle-90eebb0527b9056e3ab6a71c4e2df7e296fd8866 -> http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-a21fdcf51e31ec24feb7ff9c98bf3be16c0649b0
2011-05-10 18:05:52  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-a21fdcf51e31ec24feb7ff9c98bf3be16c0649b0 -> http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-9d53d8c4757b9f6f451ddcb9cd5d1112aa3f73e6
2011-05-10 19:05:43  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-9d53d8c4757b9f6f451ddcb9cd5d1112aa3f73e6 -> http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-886b1eeb38de006ebd09934e0cc42cb01d4192ad
2011-05-10 19:37:40  darcswatch  set     darcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-886b1eeb38de006ebd09934e0cc42cb01d4192ad -> http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-3cbfa82b84c0f9e7da37c676f5f7203bd88bf716
2011-05-10 22:07:23darcswatchsetdarcswatchurl: http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-3cbfa82b84c0f9e7da37c676f5f7203bd88bf716 -> http://darcswatch.nomeata.de/repo_http:__darcs.net_reviewed.html#bundle-1092d9b218048d9df86a58a1dd51f87cec476246