darcs

Issue 152 use FreeArc instead of gzip?

Priority wishlist Status given-up
Milestone Resolved in
Superseder packed storage combining many small files into fewer larger ones (issue 1535)
Nosy List darcs-devel, dmitry.kurochkin, jch, kowey, markstos, rs, thorkilnaur, tommy, zooko
Assigned To
Topics Performance, Provisional

Created on 2006-03-27.10:34:15 by rs, last changed 2017-07-30.23:07:01 by gh.

Messages
msg584 (view) Author: rs Date: 2006-03-27.10:34:14
Why does darcs use gzip to compress patches?
bzip2 compresses better than gzip, so why don't you use it instead?
msg585 (view) Author: zooko Date: 2006-03-27.11:04:19
For my repository here, it isn't obviously a significant win, although curiously
I was not able to recompress the patches with gzip and get them back to their
original gzipped size.  Pay attention to the times as well as to the sizes.  Try
this on your own repository and tell us what the results are (and please report
it anyway, even if you get disappointing or uninteresting results).

HACL yumyum:~/work$ cd experiment/_darcs/patches/
HACL yumyum:~/work/experiment/_darcs/patches$ ls  | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7555    .
HACL yumyum:~/work/experiment/_darcs/patches$ time gunzip *.gz

real    0m1.838s
user    0m0.168s
sys     0m0.140s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
19297   .
HACL yumyum:~/work/experiment/_darcs/patches$ ls  | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ time bzip2 200*

real    0m4.920s
user    0m4.356s
sys     0m0.248s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7041    .
HACL yumyum:~/work/experiment/_darcs/patches$ ls  | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ time bunzip2 *.bz2

real    0m1.474s
user    0m1.192s
sys     0m0.220s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
19297   .
HACL yumyum:~/work/experiment/_darcs/patches$ ls  | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ time gzip 200*

real    0m1.332s
user    0m1.084s
sys     0m0.172s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7563    .
HACL yumyum:~/work/experiment/_darcs/patches$ ls  | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ time gunzip *.gz

real    0m0.405s
user    0m0.188s
sys     0m0.164s
HACL yumyum:~/work/experiment/_darcs/patches$ time gzip -9 200*

real    0m3.798s
user    0m2.376s
sys     0m0.108s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7531    .
HACL yumyum:~/work/experiment/_darcs/patches$ time gunzip *.gz

real    0m0.375s
user    0m0.192s
sys     0m0.140s
HACL yumyum:~/work/experiment/_darcs/patches$ time gzip -6 200*

real    0m1.327s
user    0m1.116s
sys     0m0.144s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7563    .
HACL yumyum:~/work/experiment/_darcs/patches$ time gunzip *.gz

real    0m0.942s
user    0m0.192s
sys     0m0.132s
HACL yumyum:~/work/experiment/_darcs/patches$ time gzip -7 200*

real    0m1.742s
user    0m1.252s
sys     0m0.136s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7547    .
msg586 (view) Author: zooko Date: 2006-03-27.11:06:25
Oh, I see why it makes so little difference -- most of the patches are small,
so bzip2 (like gzip) doesn't have enough information to compress them much.
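
The effect described above can be reproduced with a minimal sketch. It assumes synthetic, made-up "patches" (the `make_patch` helper and its hunk text are hypothetical, not the real darcs patch format) and uses Python's standard `gzip` and `bz2` modules rather than the command-line tools:

```python
import bz2
import gzip


def make_patch(i: int) -> bytes:
    """Build a small synthetic 'patch' (hypothetical text, for illustration only)."""
    lines = [
        f"hunk ./src/module{i % 7}.hs {n}\n-old line {n}\n+new line {n}\n"
        for n in range(i, i + 5)
    ]
    return "".join(lines).encode()


patches = [make_patch(i) for i in range(200)]


def total_individual(compress) -> int:
    """Total size when every patch is compressed on its own."""
    return sum(len(compress(p)) for p in patches)


def total_combined(compress) -> int:
    """Size when all patches are concatenated and compressed once."""
    return len(compress(b"".join(patches)))


for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
    print(name,
          "individually:", total_individual(compress),
          "combined:", total_combined(compress))
```

Each small input pays the compressor's fixed header cost and gives it very little history to exploit, so compressing the patches one by one comes out larger than compressing them together, whichever compressor is used.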
msg587 (view) Author: droundy Date: 2006-03-27.13:16:52
On Mon, Mar 27, 2006 at 11:04:22AM +0000, Zooko wrote:
> For my repository here, it isn't obviously a significant win, although
> curiously I was not able to recompress the patches with gzip and get them
> back to their original gzipped size.  Pay attention to the times as well
> as to the sizes.  Try this on your own repository and tell us what the
> results are (and please report it anyway, even if you get disappointing
> or uninteresting results).

If we conclude that bzip2 is at least sometimes sufficiently nicer to be
worth using, we would at a minimum need to leave it as an option, so that
older darcs could read one's repository.  It'd probably always remain a
non-default option, since I'd rather not impose the inconvenience of
forcing anyone compiling darcs to install both libbzip2 and zlib.
-- 
David Roundy
http://www.darcs.net
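
To illustrate the compatibility concern, here is a hedged sketch of how a reader could tell the two formats apart by their leading magic bytes. This is not how darcs is implemented; `read_patch` is a hypothetical helper, and the fall-through case assumes anything else is stored uncompressed:

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream
BZIP2_MAGIC = b"BZh"       # first three bytes of every bzip2 stream


def read_patch(raw: bytes) -> bytes:
    """Return the patch contents, decompressing if a known header is present."""
    if raw.startswith(GZIP_MAGIC):
        return gzip.decompress(raw)
    if raw.startswith(BZIP2_MAGIC):
        return bz2.decompress(raw)
    # Assumption: no recognised header means the patch is stored uncompressed.
    return raw


original = b"hunk ./README 1\n+hello\n"
for stored in (gzip.compress(original), bz2.compress(original), original):
    print(read_patch(stored) == original)
```

A newer reader written this way copes with both formats, but an older binary that only knows gzip would still fail on a bzip2-compressed patch, which is why the new format would have to stay optional.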
msg589 (view) Author: rs Date: 2006-03-28.18:24:59
I have only just started using darcs, so my repository has only two patches: the
first is the initial import and the second is my work for one day.
Here are the results:

========== initial
total 356
359268 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
  2488 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz
========== unpacked

real	0m0.100s
user	0m0.077s
sys	0m0.015s
total 1868
1896946 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14
  10081 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb
========== packed bzip2

real	0m0.637s
user	0m0.624s
sys	0m0.031s
total 320
322471 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.bz2
  2641 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.bz2
========== packed bzip2 -9

real	0m0.199s
user	0m0.186s
sys	0m0.046s

real	0m0.617s
user	0m0.608s
sys	0m0.046s
total 320
322471 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.bz2
  2641 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.bz2
========== packed gzip

real	0m0.153s
user	0m0.155s
sys	0m0.030s
total 360
362098 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
  2550 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz
========== packed gzip -9

real	0m0.346s
user	0m0.343s
sys	0m0.047s
total 356
356772 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
  2552 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz
========== packed gzip -6

real	0m0.147s
user	0m0.155s
sys	0m0.015s
total 360
362098 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
  2550 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz
========== packed gzip -7

real	0m0.173s
user	0m0.186s
sys	0m0.015s
total 356
359642 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
  2552 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz




And here is the test script:

echo "========== initial"
ls -l

echo "========== unpacked"
time gunzip *.gz
ls -l

echo "========== packed bzip2"
time bzip2 200*
ls -l

echo "========== packed bzip2 -9"
time bunzip2 *.bz2
time bzip2 -9 200*
ls -l

echo "========== packed gzip"
bunzip2 *.bz2
time gzip 200*
ls -l

echo "========== packed gzip -9"
gunzip *.gz
time gzip -9 200*
ls -l

echo "========== packed gzip -6"
gunzip *.gz
time gzip -6 200*
ls -l

echo "========== packed gzip -7"
gunzip *.gz
time gzip -7 200*
ls -l
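
For anyone who would rather not depend on the command-line tools, a rough in-process equivalent of the script above might look like the following. It is only a sketch: the `CODECS` table and `benchmark` function are hypothetical names, the sample data is synthetic, and the levels are set explicitly because Python's `gzip.compress` does not share the command-line default of level 6:

```python
import bz2
import gzip
import time

# Compression settings mirroring the command lines used in the script.
CODECS = {
    "gzip -6": lambda d: gzip.compress(d, compresslevel=6),
    "gzip -9": lambda d: gzip.compress(d, compresslevel=9),
    "bzip2 -9": lambda d: bz2.compress(d, compresslevel=9),
}


def benchmark(blobs):
    """Compress every blob with every codec; return {name: (total_bytes, seconds)}."""
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        size = sum(len(compress(b)) for b in blobs)
        results[name] = (size, time.perf_counter() - start)
    return results


# Synthetic stand-in for the uncompressed patch files.
sample = [(f"line {i}\n" * 50).encode() for i in range(100)]
for name, (size, seconds) in benchmark(sample).items():
    print(f"{name:9s} {size:8d} bytes {seconds:.4f}s")
```

To try it on a real repository, replace `sample` with the bytes of the uncompressed files from `_darcs/patches`; timings on synthetic data say little about real patches.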
msg590 (view) Author: rs Date: 2006-03-28.18:32:53
As you can see, bzip2 is about two times slower and saves only about 10% of the space.
But you are right, the space is saved only on big patches. Funnily enough, on small
patches we can actually lose space (2488 bytes for gzip versus 2641 bytes for bzip2).

So gzip seems no worse than bzip2, and for darcs patches (which are small
files) it may even be better than bzip2.
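
As a worked check, the percentages and timing ratios can be recomputed from the figures in the listing posted earlier (original .gz sizes versus bzip2 sizes, and the "real" times for plain gzip, gzip -9 and bzip2). The helper name `percent_change` is just for illustration:

```python
def percent_change(before: int, after: int) -> float:
    """Relative size change in percent; negative means the file got smaller."""
    return (after - before) / before * 100


# Sizes in bytes, taken from the "initial" (.gz) and "packed bzip2" listings.
big = percent_change(359268, 322471)    # initial import patch
small = percent_change(2488, 2641)      # one day's work patch

# Wall-clock ("real") compression times in seconds from the same output.
bzip2_vs_gzip = 0.637 / 0.153
bzip2_vs_gzip9 = 0.637 / 0.346

print(f"big patch:   {big:+.1f}%")
print(f"small patch: {small:+.1f}%")
print(f"bzip2 time / gzip time:    {bzip2_vs_gzip:.1f}x")
print(f"bzip2 time / gzip -9 time: {bzip2_vs_gzip9:.1f}x")
```

On these numbers the large patch shrinks by roughly 10% under bzip2 while the small one grows by roughly 6%, and bzip2 takes about 4.2 times as long as default gzip, or about 1.8 times as long as gzip -9.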
msg600 (view) Author: jch Date: 2006-04-07.14:42:05
Thanks for the statistics, gentlemen.

So it looks like bzip2 might provide some minor savings, but they are too small
to offset the inconvenience of having interoperability problems between
Darcs versions compiled with and without bzip2 support.

If you disagree with that, please feel free to reopen this bug.
msg601 (view) Author: zooko Date: 2006-04-07.15:53:37
One last detail I would be curious about is the difference between gzip and
bzip2 on binary files.  My guess is that it would be still only marginally
better, and certainly not better than 50% savings.  But it would be nice to
check against somebody's real repository that included binary files.
msg801 (view) Author: markstos Date: 2006-07-09.01:06:46
bzip2 is *really* slow compared to other programs that give much better
compression, such as rzip. 7zip is as slow as bzip2 to compress (if not slower),
but as fast as rzip to uncompress - and is generally a better compressor -
although rzip seems to have the advantage on large text files.

I see that this is wont-fix, but I wanted to point this out in case the question
comes up again... bzip2 is totally and utterly obsoleted.

http://article.gmane.org/gmane.linux.ubuntu.devel/13670
msg2149 (view) Author: zooko Date: 2007-10-03.20:59:34
Here is size in KiB of the darcs repository for the allmydata.org tahoe project,
in various kinds of compression:

standard (after running "darcs optimize --compress" just to be sure)
22960   tahoe-comp

same thing tarred up (look at how tar is more efficient at packing in files than
the Mac OS X filesystem is.  And/or how the Mac OS X version of du reports space...)
18972   tahoe-comp.tar

uncompressed ("darcs optimize --uncompress"), and same thing tarred up:
31788   tahoe-uncomp
28392   tahoe-uncomp.tar

bzip2 compressed (by "cd _darcs/patches && bzip2 -9 *") and tarred up:
20948   tahoe-bzipcomp
17032   tahoe-bzipcomp.tar

rzip compressed (by "cd _darcs/patches && rzip -9 *") and tarred up:
20212   tahoe-rzipcomp
16400   tahoe-rzipcomp.tar

The tarball of the uncompressed darcs repo (from above), compressed with gzip,
bzip2 and rzip:
13508   tahoe-uncomp.tar.gz
12316   tahoe-uncomp.tar.bz2
7052    tahoe-uncomp.tar.rz

One really interesting thing to note is that the tarball of the
rzip-compressed-patches version of the repo is a shocking 2.3 times as big as
the rzip-compressed tarball of the uncompressed-patches version of the repo!

In fact, that's so shocking that I'm going to double-check the contents of those
two tarballs...

Yep, it looks legit.

So it might be worth switching from gzip to rzip in darcs.  Another take-home
message is that if you really want to pack your darcs repository into a small
space, uncompress the patches, tar it up, and then rzip it.

(By the way, if you are willing to limit yourself to Linux only then you could
use Con Kolivas's lrzip, which is a front-end to rzip that further improves its
compression ratio.)
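
The headline figures in this message follow from the table by simple division. A small sketch using only the sizes reported above (in KiB; the dictionary keys are the file names from the listing):

```python
# Sizes in KiB, copied from the listing in this message.
sizes = {
    "tahoe-comp.tar": 18972,       # gzip per patch, then tar
    "tahoe-rzipcomp.tar": 16400,   # rzip per patch, then tar
    "tahoe-uncomp.tar.gz": 13508,  # tar first, then gzip the whole tarball
    "tahoe-uncomp.tar.rz": 7052,   # tar first, then rzip the whole tarball
}

# Per-patch rzip versus whole-tarball rzip.
rzip_ratio = sizes["tahoe-rzipcomp.tar"] / sizes["tahoe-uncomp.tar.rz"]

# Saving from switching the per-patch compressor from gzip to rzip.
per_patch_saving = 1 - sizes["tahoe-rzipcomp.tar"] / sizes["tahoe-comp.tar"]

# Saving from gzipping the whole tarball instead of each patch.
whole_tar_gzip_saving = 1 - sizes["tahoe-uncomp.tar.gz"] / sizes["tahoe-comp.tar"]

print(f"per-patch rzip is {rzip_ratio:.1f}x the size of whole-tarball rzip")
print(f"rzip instead of gzip, per patch: {per_patch_saving:.1%} smaller")
print(f"gzip on the whole tarball:       {whole_tar_gzip_saving:.1%} smaller")
```

The numbers suggest that how the data is grouped before compression matters more than which compressor is applied to each patch.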
msg2150 (view) Author: zooko Date: 2007-10-03.20:59:58
Anyone interested in tackling this?
msg2151 (view) Author: zooko Date: 2007-10-03.21:06:18
Hm...  Another "take-home" message of this is that if you are not concerned
about disk space of an untarred repo, and you are concerned about access speed
of an untarred repo, and you are concerned about being able to conveniently pack
up a tarred, compressed, repo, then you ought to run with patch compression
turned off.
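
A rough sketch of that workflow, assuming synthetic patch contents and using gzip via Python's `tarfile` module as a stand-in for rzip (which has no standard-library binding). `tar_bytes` is a hypothetical helper:

```python
import gzip
import io
import tarfile


def tar_bytes(files: dict, mode: str) -> bytes:
    """Pack {name: data} into an in-memory tar archive opened with the given mode."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Synthetic stand-ins for uncompressed patches.
patches = {
    f"patch-{i:04d}": (f"hunk ./file{i % 5}.txt {i}\n+added line {i}\n" * 4).encode()
    for i in range(300)
}

# Option 1: compress each patch, then tar the results without further compression.
per_patch = {name + ".gz": gzip.compress(data) for name, data in patches.items()}
compressed_then_tarred = tar_bytes(per_patch, "w")

# Option 2: keep patches uncompressed and compress the whole tarball once.
tarred_then_compressed = tar_bytes(patches, "w:gz")

print("compress, then tar:", len(compressed_then_tarred))
print("tar, then compress:", len(tarred_then_compressed))
```

With files this small, tar's block padding accounts for much of the first figure, so treat the output as an illustration of the ordering rather than a measurement.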
msg2152 (view) Author: droundy Date: 2007-10-03.21:11:54
It appears that there is no librzip, so I guess this would either mean working
with an external executable (which is fragile if that executable doesn't happen
to be present) or writing our own library binding from the C source?

David
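
One way to soften the fragility of an external executable is to probe for it and fall back to the built-in format when it is missing. The sketch below only shows the probing step; `choose_compressor` and the preference list are hypothetical, and it deliberately does not invoke the tools, since their command-line options are not covered in this thread:

```python
import shutil

# Tool names mentioned in this thread; the order of preference is an assumption.
PREFERRED_TOOLS = ("rzip", "7z")


def choose_compressor(candidates=PREFERRED_TOOLS, which=shutil.which):
    """Return ("external", path) for the first tool found on PATH,
    or ("builtin", "gzip") if none of them is installed."""
    for tool in candidates:
        path = which(tool)
        if path is not None:
            return ("external", path)
    return ("builtin", "gzip")


print(choose_compressor())
```

The `which` parameter exists only so the lookup can be replaced in tests. Even with such a fallback, a repository written with the external tool would be unreadable on a machine that lacks it.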
msg2612 (view) Author: zooko Date: 2008-01-19.21:26:55
See also a newer experiment using 7z instead of rzip:

http://lists.osuosl.org/pipermail/darcs-devel/2008-January/006838.html

In the original experiment (below), rzip showed a 13% compression improvement
over gzip.  In the new experiment (linked), 7z showed a 14% compression
improvement over gzip.  (In both cases, we are talking about compression of
individual patches, which is much less efficient in space than compression of
multiple patches together -- see the linked thread for more detail.)

However, in the experiment, below, I suggest that it might be worth switching
from gzip to rzip, and in the linked discussion, I suggest that it probably
isn't worth switching from gzip to 7z!  (Unless we can compress multiple patches
together at once.)  Why did I consider the space gains from 7z to be
insufficient motivation, when I earlier considered the lesser gains from rzip to
be motivating?  Well, perhaps I've just gotten more demanding of my compression
schemes...
msg3305 (view) Author: markstos Date: 2008-02-11.01:10:47
I nominate this for "wont-fix" citing the estimated space savings not being
worth the inconvenience of an incompatible repo change.
msg3309 (view) Author: zooko Date: 2008-02-11.02:13:24
Is there a category for "a future research experiment"?  ;-)
msg3328 (view) Author: droundy Date: 2008-02-11.17:04:46
On Mon, Feb 11, 2008 at 02:13:26AM -0000, Zooko wrote:
> Is there a category for "a future research experiment"?  ;-)

Perhaps there should be.
-- 
David Roundy
Department of Physics
Oregon State University
msg3691 (view) Author: zooko Date: 2008-02-28.17:14:30
Oh, there's another compressor which has even better ratios than 7zip (although
I think it uses more CPU), and it is partially written in Haskell!  It is named
"FreeArc".

Someone should measure its effect as part of this research project.  :-)
History
Date  User  Action  Args
2006-03-27 10:34:15  rs  create
2006-03-27 11:04:22  zooko  set  status: unread -> unknown
    nosy: + zooko
    messages: + msg585
2006-03-27 11:06:28  zooko  set  nosy: droundy, tommy, zooko, rs
    messages: + msg586
2006-03-27 13:16:54  droundy  set  nosy: droundy, tommy, zooko, rs
    messages: + msg587
2006-03-28 18:25:00  rs  set  nosy: droundy, tommy, zooko, rs
    messages: + msg589
2006-03-28 18:32:55  rs  set  nosy: droundy, tommy, zooko, rs
    messages: + msg590
2006-04-07 14:42:07  jch  set  status: unknown -> wont-fix
    nosy: + jch
    messages: + msg600
2006-04-07 15:53:40  zooko  set  nosy: droundy, jch, tommy, zooko, rs
    messages: + msg601
2006-07-09 01:06:49  system  set  nosy: + system
    messages: + msg801
2007-10-03 20:59:37  zooko  set  nosy: + kowey, beschmi
    messages: + msg2149
    title: use bzip2 instead of gzip -> use rzip instead of gzip?
2007-10-03 20:59:59  zooko  set  status: wont-fix -> unknown
    nosy: system, zooko, kowey, droundy, jch, tommy, rs, beschmi
    messages: + msg2150
2007-10-03 21:06:19  zooko  set  messages: + msg2151
2007-10-03 21:11:56  droundy  set  messages: + msg2152
2008-01-19 21:26:56  zooko  set  nosy: droundy, jch, tommy, beschmi, kowey, zooko, rs
    messages: + msg2612
    title: use rzip instead of gzip? -> use rzip or 7z instead of gzip? bundle patches together for compression?
2008-02-11 01:10:48  markstos  set  status: unknown -> deferred
    nosy: + markstos
    messages: + msg3305
2008-02-11 02:13:26  zooko  set  nosy: droundy, jch, tommy, beschmi, kowey, markstos, zooko, rs
    messages: + msg3309
2008-02-11 17:04:47  droundy  set  nosy: droundy, jch, tommy, beschmi, kowey, markstos, zooko, rs
    messages: + msg3328
    title: use rzip or 7z instead of gzip? bundle patches together for compression? -> use rzip or 7z instead of gzip? bundle patches together for compression?
2008-02-28 17:14:32  zooko  set  nosy: droundy, jch, tommy, beschmi, kowey, markstos, zooko, rs
    messages: + msg3691
2008-02-28 17:14:46  zooko  set  nosy: droundy, jch, tommy, beschmi, kowey, markstos, zooko, rs
    title: use rzip or 7z instead of gzip? bundle patches together for compression? -> use FreeArc instead of gzip? bundle patches together for compression?
2009-08-06 17:36:56  admin  set  nosy: + jast, Serware, dmitry.kurochkin, darcs-devel, dagit, mornfall, simon, thorkilnaur, - droundy, jch, rs
2009-08-06 20:33:57  admin  set  nosy: - beschmi
2009-08-10 21:44:42  admin  set  nosy: + rs, jch, - darcs-devel, jast, dagit, Serware, mornfall
2009-08-25 17:50:52  admin  set  nosy: + darcs-devel, - simon
2009-08-27 13:56:04  admin  set  nosy: jch, tommy, kowey, markstos, darcs-devel, zooko, rs, thorkilnaur, dmitry.kurochkin
2009-09-02 14:33:53  kowey  set  status: deferred -> waiting-for
    nosy: jch, tommy, kowey, markstos, darcs-devel, zooko, rs, thorkilnaur, dmitry.kurochkin
    topic: + Performance, Provisional
    superseder: + packed storage combining many small files into fewer larger ones
    title: use FreeArc instead of gzip? bundle patches together for compression? -> use FreeArc instead of gzip?
2017-07-30 23:07:01  gh  set  status: waiting-for -> given-up