Created on 2006-03-27.10:34:15 by rs, last changed 2017-07-30.23:07:01 by gh.
msg584 (view) |
Author: rs |
Date: 2006-03-27.10:34:14 |
|
Why does darcs use gzip to compress patches?
bzip2 compresses better than gzip, so why don't you use it?
|
msg585 (view) |
Author: zooko |
Date: 2006-03-27.11:04:19 |
|
For my repository here, it isn't obviously a significant win, although curiously
I was not able to recompress the patches with gzip and get them back to their
original gzipped size. Pay attention to the times as well as to the sizes. Try
this on your own repository and tell us what the results are (and please report
it anyway, even if you get disappointing or uninteresting results).
HACL yumyum:~/work$ cd experiment/_darcs/patches/
HACL yumyum:~/work/experiment/_darcs/patches$ ls | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7555 .
HACL yumyum:~/work/experiment/_darcs/patches$ time gunzip *.gz
real 0m1.838s
user 0m0.168s
sys 0m0.140s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
19297 .
HACL yumyum:~/work/experiment/_darcs/patches$ ls | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ time bzip2 200*
real 0m4.920s
user 0m4.356s
sys 0m0.248s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7041 .
HACL yumyum:~/work/experiment/_darcs/patches$ ls | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ time bunzip2 *.bz2
real 0m1.474s
user 0m1.192s
sys 0m0.220s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
19297 .
HACL yumyum:~/work/experiment/_darcs/patches$ ls | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ time gzip 200*
real 0m1.332s
user 0m1.084s
sys 0m0.172s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7563 .
HACL yumyum:~/work/experiment/_darcs/patches$ ls | wc -l
814
HACL yumyum:~/work/experiment/_darcs/patches$ time gunzip *.gz
real 0m0.405s
user 0m0.188s
sys 0m0.164s
HACL yumyum:~/work/experiment/_darcs/patches$ time gzip -9 200*
real 0m3.798s
user 0m2.376s
sys 0m0.108s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7531 .
HACL yumyum:~/work/experiment/_darcs/patches$ time gunzip *.gz
real 0m0.375s
user 0m0.192s
sys 0m0.140s
HACL yumyum:~/work/experiment/_darcs/patches$ time gzip -6 200*
real 0m1.327s
user 0m1.116s
sys 0m0.144s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7563 .
HACL yumyum:~/work/experiment/_darcs/patches$ time gunzip *.gz
real 0m0.942s
user 0m0.192s
sys 0m0.132s
HACL yumyum:~/work/experiment/_darcs/patches$ time gzip -7 200*
real 0m1.742s
user 0m1.252s
sys 0m0.136s
HACL yumyum:~/work/experiment/_darcs/patches$ du -sk .
7547 .
|
msg586 (view) |
Author: zooko |
Date: 2006-03-27.11:06:25 |
|
Oh, I see why it makes so little difference -- most of the patches are small,
so bzip2 (like gzip) doesn't have enough data to work with to compress them much.
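A rough way to see this effect, assuming a flat directory of uncompressed patch files: compress each file on its own and sum the sizes, then compress the concatenation of all the files as a single stream. The function name compare_sizes and its output format are made up for this illustration; only gzip, cat and wc are assumed to be installed.

```shell
# compare_sizes DIR
# DIR is assumed to contain only regular, uncompressed files.
# Nothing in DIR is modified; all compression goes to stdout.
compare_sizes() {
    dir=$1
    separate=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Size of this one file when gzip-compressed on its own.
        n=$(gzip -c "$f" | wc -c)
        separate=$((separate + n))
    done
    # Size of all files concatenated and compressed as one stream.
    bundled=$(( $(cat "$dir"/* | gzip -c | wc -c) ))
    echo "separate=$separate bundled=$bundled"
}
```

Running compare_sizes on a directory of many small, similar files would be expected to report a noticeably larger "separate" total, since every file pays the fixed gzip header cost and the compressor starts each file without any history.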
|
msg587 (view) |
Author: droundy |
Date: 2006-03-27.13:16:52 |
|
On Mon, Mar 27, 2006 at 11:04:22AM +0000, Zooko wrote:
> For my repository here, it isn't obviously a significant win, although
> curiously I was not able to recompress the patches with gzip and get them
> back to their original gzipped size. Pay attention to the times as well
> as to the sizes. Try this on your own repository and tell us what the
> results are (and please report it anyway, even if you get disappointing
> or uninteresting results).
If we conclude that bzip2 is at least sometimes sufficiently nicer to be
worth using, we would at a minimum need to leave it as an option, so that
older darcs could read one's repository. It'd probably always remain a
non-default option, since I'd rather not impose the inconvenience of
forcing anyone compiling darcs to install both libbzip2 and zlib.
--
David Roundy
http://www.darcs.net
|
msg589 (view) |
Author: rs |
Date: 2006-03-28.18:24:59 |
|
I have only just started using darcs, so my repository has only two patches: the
first is the initial import and the second is one day of my work.
Here are the results:
========== initial
total 356
359268 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
2488 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz
========== unpacked
real 0m0.100s
user 0m0.077s
sys 0m0.015s
total 1868
1896946 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14
10081 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb
========== packed bzip2
real 0m0.637s
user 0m0.624s
sys 0m0.031s
total 320
322471 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.bz2
2641 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.bz2
========== packed bzip2 -9
real 0m0.199s
user 0m0.186s
sys 0m0.046s
real 0m0.617s
user 0m0.608s
sys 0m0.046s
total 320
322471 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.bz2
2641 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.bz2
========== packed gzip
real 0m0.153s
user 0m0.155s
sys 0m0.030s
total 360
362098 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
2550 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz
========== packed gzip -9
real 0m0.346s
user 0m0.343s
sys 0m0.047s
total 356
356772 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
2552 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz
========== packed gzip -6
real 0m0.147s
user 0m0.155s
sys 0m0.015s
total 360
362098 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
2550 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz
========== packed gzip -7
real 0m0.173s
user 0m0.186s
sys 0m0.015s
total 356
359642 Mar 27 14:15 20060327101508-1b432-a47a6bff379f4a7d697e1dd678fc4b215d350d14.gz
2552 Mar 28 16:04 20060327213744-65c63-d5b3b26df1f6bba4f3e8cead75f9b28734fd26cb.gz
And here is the test script:
echo "========== initial"
ls -l
echo "========== unpacked"
time gunzip *.gz
ls -l
echo "========== packed bzip2"
time bzip2 200*
ls -l
echo "========== packed bzip2 -9"
time bunzip2 *.bz2
time bzip2 -9 200*
ls -l
echo "========== packed gzip"
bunzip2 *.bz2
time gzip 200*
ls -l
echo "========== packed gzip -9"
gunzip *.gz
time gzip -9 200*
ls -l
echo "========== packed gzip -6"
gunzip *.gz
time gzip -6 200*
ls -l
echo "========== packed gzip -7"
gunzip *.gz
time gzip -7 200*
ls -l
|
msg590 (view) |
Author: rs |
Date: 2006-03-28.18:32:53 |
|
As you can see, I get only about a twofold slowdown and about a 10% reduction in
space. But you are right, space is saved only on big patches. Funnily enough, on
small patches we can lose space (2488 bytes for gzip versus 2641 bytes for bzip2).
So gzip seems no worse than bzip2, and for darcs patches (which are small files)
it may even be better than bzip2.
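The percentages can be checked from the byte counts in the listing above (original .gz sizes against the bzip2 sizes). The helper name saving is made up for this illustration; it only assumes a POSIX awk.

```shell
# saving OLD NEW
# Prints the percentage saved when a file shrinks from OLD to NEW bytes.
# A negative result means the file grew.
saving() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / a }'
}

saving 359268 322471   # initial import: gzip -> bzip2
saving 2488 2641       # one day of work: gzip -> bzip2
```

With those figures the first call prints 10.2 and the second prints -6.1, so the large patch shrinks by about 10% while the small one grows by about 6%.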
|
msg600 (view) |
Author: jch |
Date: 2006-04-07.14:42:05 |
|
Thanks for the statistics, gentlemen.
So it looks like bzip2 might provide some minor savings, but they are too small
to offset the inconvenience of having interoperability problems between
Darcs versions compiled with and without bzip2 support.
If you disagree with that, please feel free to reopen this bug.
|
msg601 (view) |
Author: zooko |
Date: 2006-04-07.15:53:37 |
|
One last detail I would be curious about is the difference between gzip and
bzip2 on binary files. My guess is that it would be still only marginally
better, and certainly not better than 50% savings. But it would be nice to
check against somebody's real repository that included binary files.
|
msg801 (view) |
Author: markstos |
Date: 2006-07-09.01:06:46 |
|
bzip2 is *really* slow compared to other programs that give much better
compression, such as rzip. 7zip is as slow as bzip2 to compress (if not slower),
but as fast as rzip to uncompress - and is generally a better compressor -
although rzip seems to have the advantage on large text files.
I see that this is wont-fix, but I wanted to point this out in case the question
comes up again... bzip2 is totally and utterly obsoleted.
http://article.gmane.org/gmane.linux.ubuntu.devel/13670
|
msg2149 (view) |
Author: zooko |
Date: 2007-10-03.20:59:34 |
|
Here is size in KiB of the darcs repository for the allmydata.org tahoe project,
in various kinds of compression:
standard (after running "darcs optimize --compress" just to be sure)
22960 tahoe-comp
same thing tarred up (look at how tar is more efficient at packing in files than
the Mac OS X filesystem is. And/or how the Mac OS X version of du reports space...)
18972 tahoe-comp.tar
uncompressed ("darcs optimize --uncompress"), and same thing tarred up:
31788 tahoe-uncomp
28392 tahoe-uncomp.tar
bzip2 compressed (by "cd _darcs/patches && bzip2 -9 *") and tarred up:
20948 tahoe-bzipcomp
17032 tahoe-bzipcomp.tar
rzip compressed (by "cd _darcs/patches && rzip -9 *") and tarred up:
20212 tahoe-rzipcomp
16400 tahoe-rzipcomp.tar
The tarball of the uncompressed darcs repo (from above), compressed with gzip,
bzip2 and rzip:
13508 tahoe-uncomp.tar.gz
12316 tahoe-uncomp.tar.bz2
7052 tahoe-uncomp.tar.rz
One really interesting thing to note is that the tarball of the
rzip-compressed-patches version of the repo is a shocking 2.3 times as big as
the rzip-compressed tarball of the uncompressed-patches version of the repo!
In fact, that's so shocking that I'm going to double-check the contents of those
two tarballs...
Yep, it looks legit.
So it might be worth switching from gzip to rzip in darcs. Another take-home
message is that if you really want to pack your darcs repository into a small
space, uncompress the patches, tar it up, and then rzip it.
(By the way, if you are willing to limit yourself to Linux only, then you could
use Con Kolivas's lrzip, which is a front-end to rzip that further improves its
compression ratio.)
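The "uncompress, tar, then compress the whole tarball" recipe could be sketched roughly as follows. The function name pack_repo and its arguments are hypothetical; the compressor defaults to gzip so the sketch runs without rzip, and any tool that takes a file name and replaces that file with a compressed copy should slot in the same way.

```shell
# pack_repo SRC_DIR OUT_TAR [COMPRESSOR]
# Tars SRC_DIR (assumed to already hold uncompressed patches) and then
# compresses the tarball as a whole, so the compressor sees all patches
# at once instead of one small file at a time.
pack_repo() {
    src=$1
    out=$2
    comp=${3:-gzip}
    # Store paths relative to the parent directory of SRC_DIR.
    tar -cf "$out" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # The compressor is expected to replace OUT_TAR with OUT_TAR.<suffix>.
    "$comp" "$out"
}
```

For example, `pack_repo tahoe-uncomp tahoe-uncomp.tar rzip` would be expected to leave tahoe-uncomp.tar.rz behind, provided rzip is installed. The patches would first need to be uncompressed, for instance with the "darcs optimize --uncompress" command quoted above; the exact spelling of that command may differ between darcs versions.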
|
msg2150 (view) |
Author: zooko |
Date: 2007-10-03.20:59:58 |
|
Anyone interested in tackling this?
|
msg2151 (view) |
Author: zooko |
Date: 2007-10-03.21:06:18 |
|
Hm... Another "take-home" message of this is that if you are not concerned
about disk space of an untarred repo, and you are concerned about access speed
of an untarred repo, and you are concerned about being able to conveniently pack
up a tarred, compressed, repo, then you ought to run with patch compression
turned off.
|
msg2152 (view) |
Author: droundy |
Date: 2007-10-03.21:11:54 |
|
It appears that there is no librzip, so I guess this would either mean working
with an external executable (which is fragile if that executable doesn't happen
to be present) or writing our own library binding from the C source?
David
|
msg2612 (view) |
Author: zooko |
Date: 2008-01-19.21:26:55 |
|
See also a newer experiment using 7z instead of rzip:
http://lists.osuosl.org/pipermail/darcs-devel/2008-January/006838.html
In the original experiment (above), rzip showed a 13% compression improvement
over gzip. In the new experiment (linked), 7z showed a 14% compression
improvement over gzip. (In both cases, we are talking about compression of
individual patches, which is much less efficient in space than compression of
multiple patches together -- see the linked thread for more detail.)
However, in the experiment above, I suggest that it might be worth switching
from gzip to rzip, and in the linked discussion, I suggest that it probably
isn't worth switching from gzip to 7z! (Unless we can compress multiple patches
together at once.) Why did I consider the space gains from 7z to be
insufficient motivation, when I earlier considered the lesser gains from rzip to
be motivating? Well, perhaps I've just gotten more demanding of my compression
schemes...
|
msg3305 (view) |
Author: markstos |
Date: 2008-02-11.01:10:47 |
|
I nominate this for "wont-fix" citing the estimated space savings not being
worth the inconvenience of an incompatible repo change.
|
msg3309 (view) |
Author: zooko |
Date: 2008-02-11.02:13:24 |
|
Is there a category for "a future research experiment"? ;-)
|
msg3328 (view) |
Author: droundy |
Date: 2008-02-11.17:04:46 |
|
On Mon, Feb 11, 2008 at 02:13:26AM -0000, Zooko wrote:
> Is there a category for "a future research experiment"? ;-)
Perhaps there should be.
--
David Roundy
Department of Physics
Oregon State University
|
msg3691 (view) |
Author: zooko |
Date: 2008-02-28.17:14:30 |
|
Oh, there's another compressor which has even better ratios than 7zip (although
I think it uses more CPU), and it is partially written in Haskell! It is named
"FreeArc".
Someone should measure its effect as part of this research project. :-)
|
|
Date | User | Action | Args |
2006-03-27 10:34:15 | rs | create | |
2006-03-27 11:04:22 | zooko | set | status: unread -> unknown; nosy: + zooko; messages: + msg585 |
2006-03-27 11:06:28 | zooko | set | nosy: droundy, tommy, zooko, rs; messages: + msg586 |
2006-03-27 13:16:54 | droundy | set | nosy: droundy, tommy, zooko, rs; messages: + msg587 |
2006-03-28 18:25:00 | rs | set | nosy: droundy, tommy, zooko, rs; messages: + msg589 |
2006-03-28 18:32:55 | rs | set | nosy: droundy, tommy, zooko, rs; messages: + msg590 |
2006-04-07 14:42:07 | jch | set | status: unknown -> wont-fix; nosy: + jch; messages: + msg600 |
2006-04-07 15:53:40 | zooko | set | nosy: droundy, jch, tommy, zooko, rs; messages: + msg601 |
2006-07-09 01:06:49 | system | set | nosy: + system; messages: + msg801 |
2007-10-03 20:59:37 | zooko | set | nosy: + kowey, beschmi; messages: + msg2149; title: use bzip2 instead of gzip -> use rzip instead of gzip? |
2007-10-03 20:59:59 | zooko | set | status: wont-fix -> unknown; nosy: system, zooko, kowey, droundy, jch, tommy, rs, beschmi; messages: + msg2150 |
2007-10-03 21:06:19 | zooko | set | messages: + msg2151 |
2007-10-03 21:11:56 | droundy | set | messages: + msg2152 |
2008-01-19 21:26:56 | zooko | set | nosy: droundy, jch, tommy, beschmi, kowey, zooko, rs; messages: + msg2612; title: use rzip instead of gzip? -> use rzip or 7z instead of gzip? bundle patches together for compression? |
2008-02-11 01:10:48 | markstos | set | status: unknown -> deferred; nosy: + markstos; messages: + msg3305 |
2008-02-11 02:13:26 | zooko | set | nosy: droundy, jch, tommy, beschmi, kowey, markstos, zooko, rs; messages: + msg3309 |
2008-02-11 17:04:47 | droundy | set | nosy: droundy, jch, tommy, beschmi, kowey, markstos, zooko, rs; messages: + msg3328; title: use rzip or 7z instead of gzip? bundle patches together for compression? -> use rzip or 7z instead of gzip? bundle patches together for compression? |
2008-02-28 17:14:32 | zooko | set | nosy: droundy, jch, tommy, beschmi, kowey, markstos, zooko, rs; messages: + msg3691 |
2008-02-28 17:14:46 | zooko | set | nosy: droundy, jch, tommy, beschmi, kowey, markstos, zooko, rs; title: use rzip or 7z instead of gzip? bundle patches together for compression? -> use FreeArc instead of gzip? bundle patches together for compression? |
2009-08-06 17:36:56 | admin | set | nosy: + jast, Serware, dmitry.kurochkin, darcs-devel, dagit, mornfall, simon, thorkilnaur, - droundy, jch, rs |
2009-08-06 20:33:57 | admin | set | nosy: - beschmi |
2009-08-10 21:44:42 | admin | set | nosy: + rs, jch, - darcs-devel, jast, dagit, Serware, mornfall |
2009-08-25 17:50:52 | admin | set | nosy: + darcs-devel, - simon |
2009-08-27 13:56:04 | admin | set | nosy: jch, tommy, kowey, markstos, darcs-devel, zooko, rs, thorkilnaur, dmitry.kurochkin |
2009-09-02 14:33:53 | kowey | set | status: deferred -> waiting-for; nosy: jch, tommy, kowey, markstos, darcs-devel, zooko, rs, thorkilnaur, dmitry.kurochkin; topic: + Performance, Provisional; superseder: + packed storage combining many small files into fewer larger ones; title: use FreeArc instead of gzip? bundle patches together for compression? -> use FreeArc instead of gzip? |
2017-07-30 23:07:01 | gh | set | status: waiting-for -> given-up |
|