darcs

Issue 2703 darcs log hangs after quitting pager

Title darcs log hangs after quitting pager
Priority Status unknown
Milestone Resolved in
Superseder Nosy List falsifian
Assigned To
Topics

Created on 2023-05-18.16:40:33 by falsifian, last changed 2023-10-19.22:35:46 by falsifian.

Files
File name Uploaded Type
context falsifian, 2023-05-28.22:18:42 text/plain
wip-resolve-issue2702.dpatch bfrk, 2023-05-26.18:17:43 application/octet-stream
wip.dpatch falsifian, 2023-05-28.22:18:44 text/plain
Messages
msg23268 (view) Author: falsifian Date: 2023-05-18.16:40:32
1. Summarise the issue (what were you doing, what went wrong?)

Run "darcs log". Quit the pager (less) without scrolling to the end. less quits but darcs doesn't exit. CTRL-C doesn't work.

I could have sworn I reported this a few years ago, but I can't find the issue. http://bugs.darcs.net/issue1278 looks superficially similar, but I can't reproduce that one.

2. What behaviour were you expecting instead?

darcs should quit once less does.

3. What darcs version are you using? (Try: darcs --exact-version)

From OpenBSD ports; output of darcs --exact-version:

darcs compiled on May  8 2023, at 07:59:54

Weak Hash: not available

Context:

[TAG 2.16.5
Ganesh Sittampalam <ganesh@earth.li>**20220220083531
 Ignore-this: 51417cb5dd488dc0d962ffb051f63ee1
] 



4. What operating system are you running?

OpenBSD-current on amd64
msg23269 (view) Author: bfrk Date: 2023-05-25.09:46:30
This seems to be OS-specific; it doesn't happen on Linux.

The code is in src/Darcs/UI/External.hs, viewDocWith and getViewer, 
the actual process management is in pipeDocInternal where it says

  rval <- waitForProcess pid

1) What is your pager and with what arguments is it run? (use ps 
while the pager is still active). Depending on environment settings 
and availability, darcs will invoke different pagers.

2) How do you quit the pager exactly?

3) Can you still see the pager with ps after it has been "quit", 
i.e. is the process perhaps still running somehow?

The fact that Ctrl-C doesn't work even after quitting the pager 
strongly suggests that the pager is still running, since we use the 
delegate_ctlc feature of System.Process, that is, the parent process 
(darcs) is ignoring SIGINT until the child process is finished. 
This is so that pagers that cannot be quit other than via 
Ctrl-C can be quit without automatically quitting darcs itself, too.
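For reference, the delegate_ctlc mechanism looks roughly like this (a minimal sketch, not the actual darcs code; runPager is a made-up name):

```haskell
import System.Exit (ExitCode (..))
import System.Process (createProcess, delegate_ctlc, proc, waitForProcess)

-- With delegate_ctlc = True, the parent ignores SIGINT while the child
-- runs, so Ctrl-C is delivered only to the child; waitForProcess reaps
-- the child and restores the parent's signal handling.
runPager :: FilePath -> [String] -> IO ExitCode
runPager cmd args = do
  (_, _, _, pid) <- createProcess (proc cmd args) { delegate_ctlc = True }
  waitForProcess pid

main :: IO ()
main = runPager "true" [] >>= print
```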
msg23270 (view) Author: falsifian Date: 2023-05-25.21:30:38
On Thu, May 25, 2023 at 09:46:34AM +0000, Ben Franksen wrote:
> 
> Ben Franksen <ben.franksen@online.de> added the comment:
> 
> This seems to be OS-specific; it doesn't happen on Linux.
> 
> The code is in src/Darcs/UI/External.hs, viewDocWith and getViewer, 
> the actual process management is in pipeDocInternal where it says
> 
>   rval <- waitForProcess pid
> 
> 1) What is your pager and with what arguments is it run? (use ps 
> while the pager is still active). Depending on environment settings 
> and availability, darcs will invoke different pagers.

less with no arguments. Here is an excerpt of pstree output before I
(attempt to) quit less:

 | |-+= 88820 falsifia -ksh (ksh)
 | | \-+= 97789 falsifia darcs log
 | |   \--- 35117 falsifia less

> 2) How do you quit the pager exactly?

I press "q".

> 3) Can you still see the pager with ps after it has been "quit", 
> i.e. is the process perhaps still running somehow?

It's a zombie. After pressing q:

moth ~ $ ps -lp 35117
  UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT       TIME COMMAND
 1000 35117 97789   0 -22   0     0     0 -       Z+p    pp    0:00.00 (less)

Note this doesn't happen if I scroll to the end of the output before
quitting less. Is the line of code you quoted above reached in that
case?

I could gather a trace of Darcs's system calls (with ktrace, like Linux
strace) if that would help, e.g. to see what it's stuck on.
msg23271 (view) Author: bfrk Date: 2023-05-26.06:49:15
> It's a zombie.

This confirms my suspicion.

> Note this doesn't happen if I scroll to the end of the output before
> quitting less. Is the line of code you quoted above reached in that
> case?

We could find out: run the same darcs command with --debug and redirect 
stderr to a file, then post the last few lines of that file.

The fact that it works fine if you scroll to the end suggests that the 
problem is related to the way we transfer the data to the pager. It looks to 
me as if darcs is stuck trying to send more bytes to the pipe, while the 
receiving end (the pager) has already terminated. So I think we don't even 
reach the line where the pipe is closed but are stuck at

          PipeToOther printers      -> hPutDocWith printers i inp

A process that terminates should close all its open handles. The question is: 
what happens at the other end, i.e. the parent process that tries to write to 
the pipe?

> I could gather a trace of Darcs's system calls (with ktrace, like Linux
> strace) if that would help, e.g. to see what it's stuck on.

Before we go that far down, I suggest we exhaust the "printf debugging" 
method. I suggest you clone http://darcs.net/screened, but it should also be 
possible to start from http://darcs.net/releases/branch-2.16 (which has the 
advantage that you don't have to rebuild all the dependencies). The idea is 
to add extra debugMessage calls in pipeDocInternal to narrow down where it 
hangs.
msg23272 (view) Author: bfrk Date: 2023-05-26.07:47:40
I just realized that what we do here is incorrect even on Linux. In 
src/Darcs/Util/SignalHandler.hs it says

                     | s == sigPIPE = exitWith $ ExitFailure 1

And indeed, when I run `darcs log` and then hit 'q' to terminate the 
pager, the exit code is 1, which is wrong. I think that, at least when 
we pipe our output through a pager, we should ignore SIGPIPE and handle 
(i.e. catch and ignore) the EPIPE error when the write fails. Looking 
closer at the code in src/Darcs/UI/External.hs, I see that there *is* 
code to handle EPIPE in the function ortryrunning, which among other 
things also handles the ResourceVanished exception ... which (if I 
understand things correctly) should not be raised because we do have a 
handler for SIGPIPE.

The signal handling code is very old and quite messy, so I may have 
overlooked something. And it doesn't explain why the behavior differs 
between Linux and OpenBSD.
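A sketch of that approach (not the actual darcs code; hPutIgnoringEpipe is a made-up name): with SIGPIPE ignored, a write to a closed pipe fails with EPIPE, which we can catch and discard.

```haskell
import Control.Exception (catch, throwIO)
import GHC.IO.Exception (IOErrorType (ResourceVanished), IOException (..))
import System.IO (Handle, hPutStr, stdout)

-- With SIGPIPE ignored (e.g. installHandler sigPIPE Ignore from the
-- unix package), writing to a closed pipe fails with EPIPE, which GHC
-- surfaces as a ResourceVanished IOException; we catch and discard it
-- instead of exiting with code 1.
hPutIgnoringEpipe :: Handle -> String -> IO ()
hPutIgnoringEpipe h s =
  hPutStr h s `catch` \e ->
    case e of
      IOError {ioe_type = ResourceVanished} -> pure ()
      _ -> throwIO e

main :: IO ()
main = hPutIgnoringEpipe stdout "still alive\n"
```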
msg23273 (view) Author: bfrk Date: 2023-05-26.18:13:37
From the mailing-list, for reference, by gpiero:

I can reproduce this with a default fresh install of OpenBSD 7.3 using 
darcs-2.16.5p1 from packages.

The issue, as confirmed by ktrace, is that the darcs process doesn't get 
a SIGPIPE. I'm not knowledgeable enough about OpenBSD, but I suspect 
this is a form of the old unflushed-buffered-output problem. This 
doesn't happen if you scroll all the way down in less or if you execute 
`darcs log` in a repo with a short enough log because in that case darcs 
has already emitted the whole output. OTOH, if darcs doesn't flush the 
pending output it doesn't receive a SIGPIPE and it just waits idle.

How and when the buffered output is 'spontaneously' flushed, or if it's 
buffered at all, is OS-specific, so it explains why the issue only 
appears to be present in OpenBSD.

Force-flushing the output (or maybe using an unbuffered output) should 
solve this and confirm my suspicions. Unfortunately, I don't know how to 
do it in Haskell, but Ben/Ganesh surely do.
msg23274 (view) Author: bfrk Date: 2023-05-26.18:17:43
[My reply to the list:]

Easy enough, see attached patch which I recorded against 2.16.5. 
Please clone http://darcs.net/releases/branch-2.16, cd into the new 
directory, apply the patch, and then run

  cabal update
  cabal run darcs -- log

to see if the issue is indeed fixed or not. If yes, it would be nice 
if you could afterwards obliterate the patch and try again, to 
exclude the possibility that the problem was caused by an outdated 
dependency (such as a bug in the process package that has since 
been fixed).

[And from gpiero:]

James, can you test Ben's patch?

The OpenBSD instance I used for testing is a 512MB VM on my 
resource-starved laptop and there's no way I can recompile darcs 
over there.
Attachments
msg23275 (view) Author: bfrk Date: 2023-05-27.18:02:55
I am pretty sure now that this patch will not fix the problem. We close 
the write-end of the pipe immediately after the write, which should do an 
implicit flush.
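The implicit flush on close is easy to check in isolation (a sketch with a file handle instead of a pipe; the path is made up):

```haskell
import System.IO

-- hClose flushes any buffered output before closing the handle, so the
-- bytes below reach the file even though hFlush is never called.
main :: IO ()
main = do
  h <- openFile "/tmp/flush-demo.txt" WriteMode
  hPutStr h "buffered bytes"  -- may still sit in the handle's buffer
  hClose h                    -- implicit flush happens here
  readFile "/tmp/flush-demo.txt" >>= putStr  -- prints "buffered bytes"
```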

You (gpiero) say
> OTOH, if darcs doesn't flush the 
> pending output it doesn't receive a SIGPIPE and it just waits idle.

I don't understand what it is that darcs is waiting for. Why is a signal 
needed at all? What is "the old unflushed-buffered-output problem"?

In your ktrace, what is the last system call that darcs does (before it 
hangs)?
msg23276 (view) Author: gpiero Date: 2023-05-28.08:33:53
* [Sat, May 27, 2023 at 06:02:55PM +0000] Ben Franksen:
>I am pretty sure now that this patch will not fix the problem. We close
>the write-end of the pipe immediately after the write, which should do an
>implicit flush.

Yeah, I was sceptical too, but for another reason: I don't think the 
program reaches that part of code and the flush should have been placed 
more in depth. *But*, I've investigated a little more and darcs seems 
stuck inside hPutDocWith and, following the code, hPutBuf (or hPutBuf', 
I don't recall) *does* call flush. So it doesn't seem related to a lack 
of buffer flushing. Sorry I'm a bit vague here but I had to remove that 
VM and... well, I foolishly kept my notes INSIDE the VM (oops).

>You (gpiero) say
>> OTOH, if darcs doesn't flush the
>> pending output it doesn't receive a SIGPIPE and it just waits idle.
>
>I don't understand what it is that darcs is waiting for. Why is a signal
>needed at all? What is "the old unflushed-buffered-output problem"?

I/O is usually buffered: ttys are typically line-buffered (the buffer is 
flushed on newlines) and pipes are typically block-buffered (the buffer is 
flushed when it's full). You can also flush the buffer explicitly or, 
implicitly, by closing the fd.

For the sake of simplicity, here is an example in Perl (it's not 
language-dependent, but some languages or libraries can call flush under 
the hood):

$ perl -E 'print "foobar"; sleep 10;'

This writes the string "foobar" *to the buffer*, sleeps for 10 seconds, 
then exits, and in doing so closes stdout and flushes the buffer. 
Only then will you see "foobar" on the screen. Note that 'sleep 10' is 
just a stub: you can replace it with a days-long intensive computation, an 
infinite loop or a deadlock. In all cases, you'll only see "foobar" when 
the buffer is flushed to stdout.

If you don't want to wait 10 seconds for that important piece of 
information, you can append a newline to the string (because it's 
line-buffered):

$ perl -E 'say "foobar"; sleep 10;'

or explicitly flush the buffer:

$ perl -E 'print "foobar"; STDOUT->flush; sleep 10;'

Well, in Perl you can also set autoflush, but that's not relevant here.
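For comparison, the Haskell equivalents of the two Perl variants would be hFlush and hSetBuffering from System.IO (a sketch):

```haskell
import Control.Concurrent (threadDelay)
import System.IO

main :: IO ()
main = do
  -- hSetBuffering stdout NoBuffering here would be the analogue of
  -- Perl's autoflush: every write is passed through immediately.
  putStr "foobar"      -- on a pipe, this only fills the block buffer
  hFlush stdout        -- the analogue of Perl's STDOUT->flush
  threadDelay 1000000  -- stand-in for the 'sleep 10' (1 second here)
```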

>In your ktrace, what is the last system call that darcs does (before it
>hangs)?

Unfortunately, the ktrace.out was gone with the VM (oops 2). I will redo 
it in the next few days.

Ciao,
Gian Piero.
msg23280 (view) Author: falsifian Date: 2023-05-28.22:18:44
> I am pretty sure now that this patch will not fix the problem. We close 
> the write-end of the pipe immediately after the write, which should do an 
> implicit flush.

For the record, it indeed doesn't.

I wasn't able to build branch-2.16 because my base version is too high,
but I applied what ought to be the same change on top of screened, and
the problem is still there.

For reference, I've attached my version of the patch (which I redid
because yours depends on a tag not in screened) and the output of darcs
log --context after recording that.

With that change applied, and running "cabal run darcs -- log" (having
earlier today run "cabal update"), I still see the behaviour I
originally reported.

> You (gpiero) say
> > OTOH, if darcs doesn't flush the 
> > pending output it doesn't receive a SIGPIPE and it just waits idle.
> 
> I don't understand what it is that darcs is waiting for. Why is a signal 
> needed at all? What is "the old unflushed-buffered-output problem"?
> 
> In your ktrace, what is the last system call that darcs does (before it 
> hangs)?

It's poll(). Here's an excerpt of a system call trace of just the darcs
process (output of kdump -T) leading up to and including the SIGTERM I
eventually sent it.

 34354 darcs    1685311642.583002 CALL  read(10,0x2edbb57d010,0x1000)
 34354 darcs    1685311642.583012 GIO   fd 10 read 1 bytes
       "\M^?"
 34354 darcs    1685311642.583017 RET   read 1
 34354 darcs    1685311642.583032 CALL  poll(0x2edba704a10,2,INFTIM)
 34354 darcs    1685311655.373252 PSIG  SIGTERM caught handler=0x2eaeb32ed50 mask=0<>
 34354 darcs    1685311655.373285 RET   futex -1 errno 4 Interrupted system call
Attachments
msg23283 (view) Author: falsifian Date: 2023-05-28.22:55:23
In case it's helpful:

As an experiment, I tried piping the output of the following simple
program to less. I wrote it this way because after skimming the code
not-very-carefully, I thought "darcs log" amounted to calling
Data.ByteString.hPut a bunch of times (after forking off the pager).

	module Main where

	import qualified Data.ByteString as B
	import Data.String (fromString)
	import System.IO (stdout)

	main :: IO ()
	main = B.hPut stdout (fromString "Hello.\n") >> main

The behaviour is similar to darcs log, but not quite the same. I press
"q" to exit less, and then:

- The (above) program doesn't exit, like with darcs.

- Unlike darcs log, the system call it's stuck on is select. With darcs
  log it was poll. So, maybe I'm not successfully reproducing what darcs
  log is doing?

- Other differences probably explained by the fact that I did the
  pipe-to-less from the shell instead of having the Haskell program call
  less: (a) the less process is gone, not a zombie; (b) pressing ctrl-C
  (after having quit less) successfully terminates the program.
msg23286 (view) Author: bfrk Date: 2023-05-29.12:35:20
The select vs. poll is most probably due to a different version of 
ghc or perhaps different compiler flags. These calls are used by the 
GHC runtime system to multiplex IO between threads.

It is interesting that you can reproduce the gist of the problem 
with that simple Haskell program and pipe composition does by the 
shell. This is a strong hint that what we have here is a bug in the 
GHC runtime on BSD. Before we open an issue on the GHC bug tracker, 
let's do a few more experiments:

Does it make a difference whether you have "\n" in the string or 
not?

Does the same happen if you replace less-then-quit with /bin/true?

Also, which ghc version are you using? Can you test with the latest 
ghc (9.6.2)?
msg23287 (view) Author: bfrk Date: 2023-05-29.12:36:20
> with that simple Haskell program and pipe composition does by the 
> shell. This is a strong hint that what we have here is a bug in the 

s/does/done/
msg23288 (view) Author: falsifian Date: 2023-05-29.14:22:43
Thanks for the suggestions.

> Does it make a difference whether you have "\n" in the string or 
> not?

Only that there is a significant pause before less shows its ":" prompt
on the bottom line of the terminal; I guess less is waiting for the
input to reach some line length limit. After that, I press "q", and as
usual my program doesn't exit.

> Does the same happen if you replace less-then-quit with /bin/true?

No,
	./f | /usr/bin/true
exits immediately (./f is my compiled program).

> Also, which ghc version are you using? Can you test with the latest 
> ghc (9.6.2)?

I have 9.2.7. I'll try compiling the latest ghc, but it may be a while
before I find the time.
msg23290 (view) Author: gpiero Date: 2023-05-29.17:16:04
For the record, the minimal test case is really minimal:

$ cat <<EOF >test.hs
module Main where

main :: IO ()
main = putStrLn "foobar" >> main
EOF

$ ghc test

$ ./test | less

OpenBSD 7.3 with ghc-9.2.7p1 from packages.
msg23291 (view) Author: bfrk Date: 2023-05-29.18:40:47
If it fails with this extremely simple test program and a recent 
enough ghc (9.2.8 is the version recommended by ghcup) then I think 
you should file a bug on the ghc tracker. Sorry for delegating this 
to you but I am not in a good position to respond to clarification 
requests etc.

It is a bit unsettling that this does not happen with 'true', only 
with 'less'. Perhaps the error only manifests if the first write 
succeeds. You could try

 ./test | read x

The read here should terminate after reading one line. What I am 
trying to do here is to eliminate less as a possible culprit.
msg23292 (view) Author: gpiero Date: 2023-05-30.10:01:19
* [Mon, May 29, 2023 at 06:40:47PM +0000] Ben Franksen:
>If it fails with this extremely simple test program and a recent
>enough ghc (9.2.8 is the version recommended by ghcup) then I think
>you should file a bug on the ghc tracker.

James, is it something you can do? I'm not knowledgeable enough about 
OpenBSD nor GHC/Haskell and I'll have a hard time compiling code on that 
VM.

For completeness, in case it's not already solved in recent releases, 
it's something to do with the feedback mechanism of a full buffer in 
pipes. It doesn't happen with /bin/true, read, etc. because they don't 
saturate the buffer. It happens with less, more and similar.

To reproduce:

$ cat <<EOF >test.hs
module Main where

main :: IO ()
main = putStrLn "foobar" >> main
EOF

$ ghc test

$ ./test | perl -E '$SIG{ALRM} = sub { die "exiting on SIGALRM\n" }; 
alarm 3; while (<>) { sleep .5; print $_ }'

You may need to adjust the arguments to alarm() and sleep() in order 
to be sure the pipe is clogged on your system. Removing the sleep() is 
sufficient to avoid the issue.
msg23308 (view) Author: falsifian Date: 2023-06-02.02:56:37
On Mon, May 29, 2023 at 06:40:47PM +0000, Ben Franksen wrote:
...
> If it fails with this extremely simple test program and a recent 
> enough ghc (9.2.8 is the version recommended by ghcup) then I think 
> you should file a bug on the ghc tracker. Sorry for delegating this 
> to you but I am not in a good position to respond to clarification 
> requests etc.

I am happy to help with this; it will just take a while before I can get
to it.

On Tue, May 30, 2023 at 10:01:19AM +0000, Gian Piero Carrubba wrote:
> 
> Gian Piero Carrubba <gpiero@rm-rf.it> added the comment:
> 
> * [Mon, May 29, 2023 at 06:40:47PM +0000] Ben Franksen:
> >If it fails with this extremely simple test program and a recent
> >enough ghc (9.2.8 is the version recommended by ghcup) then I think
> >you should file an bug on the ghc tracker.
> 
> James, is it something you can do? I'm not knowledgeable enough about 
> OpenBSD nor GHC/Haskell and I'll have a hard time compiling code on that 
> VM.

Yes, I expect I'll be able to figure it out, unless recent changes in
ghc have created more work for the OpenBSD port maintainer.
msg23586 (view) Author: bfrk Date: 2023-07-16.11:57:49
falsifian wrote:
> I am happy to help with this; it will just take a while before
> I can get to it.

Just a reminder, in case you forgot ;-)
msg23629 (view) Author: falsifian Date: 2023-07-22.14:16:37
> Just a reminder, in case you forgot ;-)

It's still on my list!
msg23724 (view) Author: falsifian Date: 2023-10-19.22:35:44
>It's still on my list!

Small update: I tried naïvely updating the OpenBSD port to 9.8.1 but it looks like it will be some work (more than I have time for right now) because some patches no longer apply. I think I will wait until the next time someone updates the OpenBSD port.
History
Date User Action Args
2023-05-18 16:40:33falsifiancreate
2023-05-25 09:46:34bfrksetmessages: + msg23269
2023-05-25 21:30:42falsifiansetmessages: + msg23270
2023-05-26 06:49:17bfrksetmessages: + msg23271
2023-05-26 07:47:40bfrksetmessages: + msg23272
2023-05-26 18:13:37bfrksetmessages: + msg23273
2023-05-26 18:17:43bfrksetfiles: + wip-resolve-issue2702.dpatch
messages: + msg23274
2023-05-27 18:02:55bfrksetmessages: + msg23275
2023-05-28 08:33:53gpierosetmessages: + msg23276
2023-05-28 22:18:44falsifiansetfiles: + context, wip.dpatch
messages: + msg23280
2023-05-28 22:55:23falsifiansetmessages: + msg23283
2023-05-29 12:35:21bfrksetmessages: + msg23286
2023-05-29 12:36:20bfrksetmessages: + msg23287
2023-05-29 14:22:43falsifiansetmessages: + msg23288
2023-05-29 17:16:05gpierosetmessages: + msg23290
2023-05-29 18:40:47bfrksetmessages: + msg23291
2023-05-30 10:01:19gpierosetmessages: + msg23292
2023-06-02 02:56:39falsifiansetmessages: + msg23308
2023-07-16 11:57:49bfrksetmessages: + msg23586
2023-07-22 14:16:38falsifiansetmessages: + msg23629
2023-10-19 22:35:46falsifiansetmessages: + msg23724