darcs

Issue 1536 break repo caches up into subdirectories

Title: break repo caches up into subdirectories
Priority: feature
Status: unknown
Milestone: 3.0.0
Resolved in:
Superseder: break global cache up into subdirectories (issue 1624)
Nosy List: darcs-devel, dmitry.kurochkin, ganesh, kolibrie, kowey, mornfall, simonmar, thorkilnaur, tux_rocker
Assigned To:
Topics: Hashed, Performance

Created on 2009-08-18.08:36:25 by kowey, last changed 2024-08-11.10:13:39 by bfrk.

Messages
msg8228 (view) Author: kowey Date: 2009-08-18.08:36:19
Darcs caches can get very large, which can make looking up files in them slow.

This is really a hashed-storage feature request, so we should probably link to
its ticket, but I'm creating this task because it's particularly relevant to
darcs and I think we should be keeping track of it.
msg8597 (view) Author: simonmar Date: 2009-08-30.13:26:50
I have some figures for how much this may be affecting performance on Linux with
a recent ext3 (kernel 2.6.28).

For a GHC repository with 21000 files in _darcs/patches, I used a program that
opens and closes every file in the directory.

  - cold cache: 14.70s real   0.15s user   0.62s system 
  - warm cache:  0.26s real   0.09s user   0.14s system

(To flush the cache before running the test, I used
"echo 3 >/proc/sys/vm/drop_caches".)

After making 16 subdirectories 0/ 1/ ... e/ f/ and splitting the patches amongst
the subdirectories:

  - cold cache: 4.70s real   0.09s user   0.74s system
  - warm cache: 0.24s real   0.11s user   0.12s system 

Conclusion: with a warm cache, there's no difference - presumably Linux's name
lookup cache is big enough to hold all 21k lookups.  Without anything cached,
the subdirectory version is 3x faster (but the difference is all in real time,
not system time, which implies that this is due to reading less data from disk
rather than poor algorithms in the kernel's lookup code).

Program I used to measure this:

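import Control.Monad
import System.Posix
import System.Environment
import qualified Data.ByteString.Char8 as B

-- Opens and closes every file named in a list file (one path per line).
main :: IO ()
main = do
  [listFile] <- getArgs
  paths <- B.split '\n' `fmap` B.readFile listFile
  forM_ paths $ \path -> do
    let str = B.unpack path
    -- Skip empty lines, e.g. the one after a trailing newline.
    unless (null str) $ do
      -- With unix >= 2.8, openFd no longer takes the 'Nothing' argument:
      --   fd <- openFd str ReadOnly defaultFileFlags
      fd <- openFd str ReadOnly Nothing defaultFileFlags
      closeFd fd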
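
For illustration, the 16-way split measured above could be expressed as a mapping from a hash to a path under 0/ ... f/. This is only a sketch: the message does not say how the files were distributed among the subdirectories, so bucketing by the first hex digit of the hash is an assumption, and bucket16 is a hypothetical name, not a darcs function.

```haskell
import Data.Char (isHexDigit, toLower)

-- | Relative path of a file inside one of 16 subdirectories (0/ .. f/).
-- ASSUMPTION: the bucket is the first hex digit of the hash; the message
-- does not say how the files were actually distributed.
-- Returns Nothing when the name does not start with a hex digit.
bucket16 :: String -> Maybe FilePath
bucket16 hash = case hash of
  (c : _) | isHexDigit c -> Just (toLower c : '/' : hash)
  _ -> Nothing

main :: IO ()
main = mapM_ (print . bucket16) ["3fa9c1", "E07b22", "", "xyz"]
```

With uniformly distributed hashes, each subdirectory would hold roughly 1/16 of the 21000 files.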
msg8856 (view) Author: kowey Date: 2009-09-21.15:37:02
I'm splitting off issue1624 into a separate ticket.  This requires a format
change, so it may be best to lump it in together with other things like packs.
msg11494 (view) Author: tux_rocker Date: 2010-06-20.13:47:06
I'm going to bump this to 2.6 because no-one is working on it right now
and it requires a format change.
msg24073 (view) Author: bfrk Date: 2024-08-10.17:39:52
I am re-opening this issue because it turns out that this is not just a 
matter of performance. On simple file systems like FAT32 it leads to 
failures with large enough repositories (like that of darcs itself). The 
error message (on Linux) is, unfortunately, a misleading/unhelpful "no 
space left on device" (ENOSPC). Patch file names are 76 bytes long, so 
according to
https://superuser.com/questions/446282/max-files-per-directory-on-ntfs-vol-vs-fat32/1544848#1544848,
on FAT32 the maximum is (2^16 - 3) / ceil (76 / 13) ~= 11000 patch files.
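
A quick way to check that estimate is to divide the directory entries available by the entries one long file name needs. The sketch below only restates the figures from this message (2^16 - 3 entries, 13 characters per entry, 76-byte names); the function names are made up for illustration. It does not count the additional short-name (8.3) entry that each long file name also occupies, so the real limit is probably somewhat lower.

```haskell
-- | Directory entries available in one FAT32 directory, using the figure
-- quoted in the message (2^16 - 3).
fat32Entries :: Int
fat32Entries = 2 ^ (16 :: Int) - 3

-- | Long-name entries needed for a file name of the given length:
-- 13 characters per entry, rounded up.
entriesPerName :: Int -> Int
entriesPerName len = (len + 12) `div` 13

-- | Estimated maximum number of files with names of the given length.
maxFiles :: Int -> Int
maxFiles len = fat32Entries `div` entriesPerName len

main :: IO ()
main = print (maxFiles 76) -- 65533 `div` 6 = 10922
```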

The practical relevance is that FAT32 is often used for temporary storage 
e.g. on a USB memory stick.

This should be fixed in darcs-3 by using bucketed hashed dirs for 
repositories, too.
msg24074 (view) Author: bfrk Date: 2024-08-11.10:13:39
When we fix this we should also drop the (unused) size prefixes and 
change the file names so as not to repeat the two hex digits in the 
bucket. Like in

 (bucketdir,filename) = splitAt 2 (to_file_path hash)
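
A slightly fuller sketch of that idea, for discussion only. It assumes the current file names have the form "<decimal size>-<hex hash>"; dropSizePrefix and bucketedPath are hypothetical helpers, not existing darcs functions, and the plain String argument stands in for whatever to_file_path returns.

```haskell
import Data.Char (isDigit)

-- | Hypothetical helper: drop a leading "<digits>-" size prefix if present.
-- ASSUMPTION: size prefixes consist of decimal digits followed by '-'.
dropSizePrefix :: String -> String
dropSizePrefix s = case break (== '-') s of
  (digits, '-' : rest) | not (null digits), all isDigit digits -> rest
  _ -> s

-- | Split a hash into (bucket directory, file name) so that the two
-- bucket digits are not repeated in the file name.
splitBucket :: String -> (FilePath, FilePath)
splitBucket = splitAt 2

-- | Relative path of a hashed file inside a bucketed directory.
bucketedPath :: String -> FilePath
bucketedPath name = dir ++ "/" ++ file
  where
    (dir, file) = splitBucket (dropSizePrefix name)

main :: IO ()
main = mapM_ (putStrLn . bucketedPath) ["0000000123-ab12cd", "ab12cd"]
-- both lines print ab/12cd
```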
History
Date                 User        Action  Args
2009-08-18 08:36:25  kowey       create
2009-08-25 18:15:31  admin       set     nosy: + darcs-devel, - simon
2009-08-27 14:25:58  admin       set     nosy: kowey, darcs-devel, thorkilnaur, kolibrie, dmitry.kurochkin, mornfall
2009-08-30 13:26:53  simonmar    set     nosy: + simonmar
                                         messages: + msg8597
2009-09-14 10:57:37  kowey       set     topic: + Target-2.4
                                         nosy: kowey, darcs-devel, simonmar, thorkilnaur, kolibrie, dmitry.kurochkin, mornfall
2009-09-21 15:37:11  kowey       set     status: needs-implementation -> waiting-for
                                         title: break caches up into subdirectories -> break repo caches up into subdirectories
                                         nosy: kowey, darcs-devel, simonmar, thorkilnaur, kolibrie, dmitry.kurochkin, mornfall
                                         messages: + msg8856
                                         topic: + Target-2.5, - Target-2.4
                                         superseder: + break global cache up into subdirectories
2009-10-23 22:36:51  admin       set     nosy: + marlowsd, - simonmar
2009-10-23 23:35:21  admin       set     nosy: + simonmar, - marlowsd
2010-03-25 14:21:39  kowey       set     topic: + Hashed
2010-06-15 20:52:01  admin       set     milestone: 2.5.0
2010-06-15 20:59:28  admin       set     topic: - Target-2.5
2010-06-20 13:47:07  tux_rocker  set     nosy: + tux_rocker
                                         messages: + msg11494
                                         milestone: 2.5.0 -> 2.8.0
2014-06-19 23:24:24  gh          set     nosy: + ganesh
2017-07-31 01:28:43  gh          set     status: waiting-for -> given-up
2024-08-10 17:39:53  bfrk        set     status: given-up -> unknown
                                         messages: + msg24073
                                         milestone: 2.8.0 -> 3.0.0
2024-08-11 10:13:39  bfrk        set     messages: + msg24074