darcs

Issue 1849 get over http hangs -- repository corruption (2.3.1 to 2.4.4)

Title get over http hangs -- repository corruption (2.3.1 to 2.4.4)
Priority bug Status given-up
Milestone Resolved in
Superseder Nosy List dmitry.kurochkin, kowey, zooko
Assigned To
Topics HTTP, Hashed

Created on 2010-05-24.05:59:16 by zooko, last changed 2017-07-31.00:06:25 by gh.

Files
File name Uploaded Type Edit Remove
both-uncompressed.7z zooko, 2010-05-24.05:59:14 application/octet-stream
Messages
msg11102 (view) Author: zooko Date: 2010-05-24.05:59:14
I have darcs 2.3.1 on server (linux) and client (osx) and this has been
working fine for months. Recently darcs get over http started hanging
for one of the three repositories. You can see this happening on this
buildbot: 

http://tahoe-lafs.org/buildbot-zfec/builders/mac-amd64-osx-10.4.11-ootles

e.g.:

http://tahoe-lafs.org/buildbot-zfec/builders/mac-amd64-osx-10.4.11-ootles/builds/30

It doesn't happen for the same client and server on other repositories,
e.g.:

http://tahoe-lafs.org/buildbot/builders/zooko%20ootles%20Mac-amd64%2010.4/builds/409

Other builders building at the same time from the same repo on the same
server but from a different client had no problem, e.g.:

http://tahoe-lafs.org/buildbot-zfec/builders/buildbot.rubenkerkhof.com/builds/19

This happens with this newly created repo (ticket1026-hashedformat) but
not with the repo that it came from (trunk-hashedformat).

I deleted the ~/.darcs/cache on the server side--this didn't change the
behavior.

I upgraded the client from 2.3.1 to 2.4.4 and that didn't change the
behavior. I upgraded the server from 2.3.1 to 2.4.4 and that didn't
change the behavior.

I ran "darcs check" and it reported all clean.

I rebooted the server and that didn't change the behavior.

Next I decided to try backing up the repo and then obliterating recent
patches to that branch. My first such attempt yielded surprising errors:

source@org:~/darcs/zfec/ticket1026-hashedformat$ darcs oblit
Sun May 23 14:59:52 PDT 2010  zooko@zooko.com
  * setup: add misc/run_trial.py
Shall I obliterate this patch? (1/15)  [ynWvplxdaqjk], or ? for help: y
Sun May 23 14:52:46 PDT 2010  zooko@zooko.com
  * setup: bundle the latest zetuptoolz as unpacked source egg and
change setup.py to load it directly instead of with ez_setup.py
Shall I obliterate this patch? (2/15)  [ynWvplxdaqjk], or ? for help: d
Warning: ./zfec/misc/run_trial.py-0: takeFile ./zfec/misc/run_trial.py-0
in /home/source/darcs/zfec/ticket1026-hashedformat: openFd: permission
denied (Permission denied)
Warning: Cannot remove non-empty file ./zfec/misc/run_trial.py
Warning: Not deleting ./zfec/misc because it is not empty.
Finished obliterating.

Hm, this means that there are permissions problems on the directory, right?

source@org:~/darcs/zfec/ticket1026-hashedformat$ ls -al zfec/misc/
total 12
drwxr-sr-x 2 zooko  source 4096 May 23 15:09 .
drwxrwsr-x 8 source source 4096 May 23 15:09 ..
-rw-r--r-- 1 zooko  source   44 May 23 15:09 run_trial.py

Yes the "source" user has read but not write permissions on that directory.

However, running chmod -R g+u on the repo didn't fix the problem. So I
restored the repo from backup, did the chmod, and then tried the oblit
again:

source@org:~/darcs/zfec/ticket1026-hashedformat$ darcs oblit 
Sun May 23 14:59:52 PDT 2010  zooko@zooko.com
  * setup: add misc/run_trial.py
Shall I obliterate this patch? (1/15)  [ynWvplxdaqjk], or ? for help: y
Sun May 23 14:52:46 PDT 2010  zooko@zooko.com
  * setup: bundle the latest zetuptoolz as unpacked source egg and
change setup.py to load it directly instead of with ez_setup.py
Shall I obliterate this patch? (2/15)  [ynWvplxdaqjk], or ? for help: d
Finished obliterating.

Next I tried the darcs get again, and it still hangs, but now it hangs
in a different place! Every run of the darcs get up until now has hung
after outputting like this (with --version --debug --timing --partial):

Sun May 23 23:16:45 MDT 2010: Adding new file to existing UrlRequest.
Sun May 23 23:16:45 MDT 2010: URL.waitUrl
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-8d5c515fc53ccd3deada935baa77b78f474dbc37129a8ead02d4b80226d71ec3

But now that I've obliterated this patch, now it ends like this:

URL.waitUrl
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-21808dcf8927e10480f7351b7ec5977531b23e031a79a5338325d0470837d38c
Sun May 23 23:21:28 MDT 2010: Adding new file to existing UrlRequest.

oh and when I try the same thing again then this time it ends like this:

Sun May 23 23:22:28 MDT 2010: Adding new file to existing UrlRequest.
Sun May 23 23:22:28 MDT 2010: URL.waitUrl
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-21808dcf8927e10480f7351b7ec5977531b23e031a79a5338325d0470837d38c

Oh I just noticed that I have "--partial" in there. That's not right.
What if I replace it with "--lazy"... Same thing:

Sun May 23 23:25:07 MDT 2010: Adding new file to existing UrlRequest.
Sun May 23 23:25:07 MDT 2010: URL.waitUrl
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-21808dcf8927e10480f7351b7ec5977531b23e031a79a5338325d0470837d38c

Okay, I'll try obliterating some more patches and see what happens...

source@org:~/darcs/zfec/ticket1026-hashedformat$ darcs oblit 
Sun May 23 14:52:46 PDT 2010  zooko@zooko.com
  * setup: bundle the latest zetuptoolz as unpacked source egg and
change setup.py to load it directly instead of with ez_setup.py
Shall I obliterate this patch? (1/14)  [ynWvplxdaqjk], or ? for help: y
Sun May 23 14:13:42 PDT 2010  zooko@zooko.com
  * setup: remove bundled copy of zetuptoolz -- we can just rely on the
ez_setup.py bootstrapping script installing the normal setuptools (or
whatever passes for the normal setuptools on your system)
Shall I obliterate this patch? (2/14)  [ynWvplxdaqjk], or ? for help: d
Finished obliterating.

Now it hangs like this:

Sun May 23 23:25:48 MDT 2010: Adding new file to existing UrlRequest.
Sun May 23 23:25:48 MDT 2010: URL.waitUrl
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-372cc4d342e54de2702fff72b6e4e2b6a99c1e37015d57cd56d2e72669214c80

Then I deleted the ~/.darcs/cache on the client side--that didn't change
the behavior.

Then I obliterated another patch:
source@org:~/darcs/zfec/ticket1026-hashedformat$ darcs oblit 
Sun May 23 14:13:42 PDT 2010  zooko@zooko.com
  * setup: remove bundled copy of zetuptoolz -- we can just rely on the
ez_setup.py bootstrapping script installing the normal setuptools (or
whatever passes for the normal setuptools on your system)
Shall I obliterate this patch? (1/13)  [ynWvplxdaqjk], or ? for help: y
Sat May 15 20:01:22 PDT 2010  zooko@zooko.com
  * setup: new improved show-tool-versions.sh
Shall I obliterate this patch? (2/13)  [ynWsfvplxdaqjk], or ? for help: d
Finished obliterating.

Now it hangs on:

Sun May 23 23:28:00 MDT 2010: Adding new file to existing UrlRequest.
Sun May 23 23:28:00 MDT 2010: URL.waitUrl
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2

Now I have obliterated all the patches that distinguish the
ticket1026-hashedformat branch from the trunk-hashedformat branch! And
yet if I try the darcs get again on the ticket1026-hashedformat branch
then the get hangs, but if I do the same command on the
trunk-hashedformat branch then the get succeeds.

Okay, I stop here. I will attach an archive file containing the complete
current contents of the ticket1026-hashedformat and trunk-hashedformat
directories, named "both.7z".

Aha! But wait! I just ran "darcs optimize --uncompress" on the
ticket1026-hashedformat directory and now the darcs get from it
succeeds! Here I now attach the archive showing the contents of that
directory after optimize --uncompress, named
"ticket1026-hashedformat-uncompressed.7z".

Just to be sure, I went back and restored to the version of
ticket1026-hashedformat from the both.7z archive, and then the client
quickly completed its darcs get until I thought to remove the client's
cache after which the client went back to hanging whenever I did a darcs
get.

Now, I did some darcs obliterating in the ticket1026-hashedformat
repository earlier today, so if there is a bug in darcs obliterate in
darcs-2.3.1 then that might explain the repository corruption.

Oh wait. Dammit. I tried to reproduce it one last time by running "darcs
optimize --uncompress" again and this time it did *not* fix the problem,
even after removing the client cache. :-( So now I'm confused. Also I
just noticed this in the get output from the failed gets:

Sun May 23 23:42:48 MDT 2010: URL.waitNextUrl succeeded:
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/prefs/prefs
_darcs/pref
s/prefs
Sun May 23 23:42:48 MDT 2010: Done identifying repository .
Sun May 23 23:42:48 MDT 2010: Identified darcs-1 repo:
/Users/zfecbuildslave/tmp/build
Sun May 23 23:42:48 MDT 2010: Writing working directory contents...
Sun May 23 23:42:48 MDT 2010: Copying hashed pristine tree:
0000000284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
Sun May 23 23:42:48 MDT 2010: Beginning copying pristine
Sun May 23 23:42:48 MDT 2010: Reading hash file
0000000284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
from pristine.hashed/
Sun May 23 23:42:48 MDT 2010: I'm doing copyFileUsingCache on
pristine.hashed/0000000284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba1
24945cc2
Sun May 23 23:42:48 MDT 2010: URL.copyUrlWithPriority
(http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/000000028
4-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
                      ->
/home/source/.darcs/cache/pristine.hashed/0000000284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2)
Sun May 23 23:42:48 MDT 2010: URL.urlThread
(http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-7bcd4756
519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
            ->
/home/source/.darcs/cache/pristine.hashed/0000000284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2)
Sun May 23 23:42:48 MDT 2010: URL.request_url
(http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-7bcd47
56519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
              ->
/home/source/.darcs/cache/pristine.hashed/0000000284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2)
Sun May 23 23:42:48 MDT 2010: URL.waitUrl
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-7bcd4756519
929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
Sun May 23 23:42:48 MDT 2010: Failed to start download URL
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/00000
00284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2:
fopen() failed
Sun May 23 23:42:48 MDT 2010: Failed to download URL
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-
7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2:
fopen() failed
Sun May 23 23:42:48 MDT 2010: In fetchFileUsingCachePrivate I'm going
manually
Sun May 23 23:42:48 MDT 2010:     getting
0000000284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
Sun May 23 23:42:48 MDT 2010:     from
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-7bcd4756519929
c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
Sun May 23 23:42:48 MDT 2010: URL.copyUrlWithPriority
(http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/000000028
4-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
                      -> ./darcs17275)
Sun May 23 23:42:48 MDT 2010: URL.urlThread
(http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-7bcd4756
519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
            -> ./darcs17275)
Sun May 23 23:42:48 MDT 2010: Adding new file to existing UrlRequest.
Sun May 23 23:42:48 MDT 2010: URL.waitUrl
http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-7bcd4756519
929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2

See that fopen() failed? That does not appear in any log file from a
successful download and it appears in every log file from a hung
download. Well, why does that fail? That file is world-readable. Here I
am su'ing to be the www-data user and catting that file:

$ cat
/home/source/darcs/zfec/ticket1026-hashedformat/_darcs/pristine.hashed/0000000284-7bcd4756519929c587b779cf72c63114aeb9e0650c94223717271ba124945cc2
                                               
MMjD1####N;####+ڼ#########+o#.{z#ί######c7@   
##X=###}O"T#######>EZ#####################

Okay now I'm really going to stop. Here is an archive that has these two
repositories which have the same set of patches from darcs's point of
view, yet pulling from the trunk-hashedformat one always succeeds and
pulling from the ticket1026-hashedformat one always gives this fopen()
failure and hangs. It is named "both-uncompressed.7z".
Attachments
msg11137 (view) Author: kowey Date: 2010-05-28.20:26:00
Thanks for the report, Zooko.  I've had time to read it properly yet
(some kind of quick summary could be handy).

Quick questions:
1. Do you have a way of reproducing hang+corruption systematically?
2. Do you get hanging or corruption when doing local fetches?
3. Is OS X a necessary component of this?

The optimize --uncompress stuff seems like an interesting clue indeed.

I wonder if HTTP hanging is affected by changes in HEAD to the HTTP code.

Note that --partial is treated as a synonym for --lazy when fetching
hashed repos (but the flag is deprecated, in my opinion)
msg12603 (view) Author: kowey Date: 2010-09-23.16:33:24
Sorry for the delay! I've had this marked in as "re-read" for some time 
now and it's taken me a while to get around to it.

Zooko: thanks for taking the trouble to write a detailed bug report. 
Long reports like this (1440 words) could be a bit easier to process 
with an executive summary, for what it's worth.  

Also, you may find that --no-cache to be handy for avoiding interference 
from the cache system.

My attempt:

1. darcs hangs when fetching http://tahoe-
lafs.org/source/zfec/ticket1026-hashedformat/

2. Obliterating patches does not seem to help, not even down to the same 
context as a working repository. Doing an obliterate just moves the 
hanging to a different file.

3. Zooko later discovered that running darcs optimize --uncompress on 
the broken repository made fetching appear succeed, but could not 
reproduce it.  Perhaps we shouldn't pay this too much heed with caching 
and what not.

4. While obliterating for the first time, Zooko discovered and fixed  
permissions problems in the working directory (due to multiple users, 
zooko/source).

I could sift out three things of potential interest:

A. The permissions problems
B. Something to do with optimize --compress/--uncompress
C. Weird fopen() error message when darcs get fails.  

For C, on my Linux box, I get "HTTP response code said error" (with 
darcs 2.4.3).  This *might* be Mac-related, but I don't know

It's not clear if this is Mac related in anyway.  There is no hang on 
Linux, so I recommend as a next action to try: darcs get --no-cache --
debug http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat on a Mac

Another thing to do may be to inspect/compare the two repositories that 
Zooko offered us.

At the time of this writing, on my Linux box, fetching this repository 
gives us:

Copying patches, to get lazy repository hit ctrl-C...    
                                  
darcs failed:  Couldn't fetch `0000006044-
f5d33bfcd7dac3d32c3bf32428b46dbd889afc64fdd6c5d926d180621d597472'
in subdir patches from sources:

thisrepo:/tmp/ticket1026-hashedformat
cache:/home/eykk10/.darcs/cache
repo:http://tahoe-lafs.org/source/zfec/ticket1026-hashedformat
msg12604 (view) Author: kowey Date: 2010-09-23.16:39:39
Note also about point C (the fopen() error): we appear to be able to 
fetch the file just fine (as Zooko observes).  So maybe reproduction on 
MacOS X is a good thing to try
History
Date User Action Args
2010-05-24 05:59:16zookocreate
2010-05-28 20:26:01koweysettopic: + HTTP, Hashed
nosy: + kowey
messages: + msg11137
title: get over http hangs -- repository corruption -> get over http hangs -- repository corruption (2.3.1 to 2.4.4)
2010-09-23 16:33:28koweysetstatus: unknown -> needs-reproduction
messages: + msg12603
2010-09-23 16:39:40koweysetmessages: + msg12604
2017-07-31 00:06:25ghsetstatus: needs-reproduction -> given-up