Wednesday, April 14, 2010

Continue running a non nohup-ed command after logout (no SIGHUP)

It often happens that you start a command that takes a fairly long time to complete, and before it ends you must log out for some reason: maybe the network will go down soon, or you do not want to keep staring at the screen till it completes, or you just don't want to keep that terminal around.

A bit of shell behaviour for the uninformed. When you launch a command in a shell, the new process is created by fork()ing the current shell and immediately exec()ing the command's executable, which means the new process is a child of the shell process. You can suspend the running process and then resume it in the background as

% java some.long.running.application
{java program spits out something}
{hit ^z}
^Z
zsh: suspended  java some.long.running.application
% bg
[1]  + continued  java some.long.running.application

or you can start it in background as

% java some.long.running.application &
[1]  1001
{java program spits out something}

Of course, you can bring these jobs to the foreground any time you want

% jobs
[1]  - running    iostat -xd 100
[2]  + running    java some.long.running.application
[3]  + suspended  ~/bin/startOfflineIMAP.sh
% fg %2 {bash accepts %2 as well; for fg the % is optional there}
[2]    running    java some.long.running.application
{java program spits out something}
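The parent/child relationship described earlier is easy to check for yourself. A minimal sketch, assuming a Linux /proc filesystem; `sleep 30` is just a stand-in for the long-running command, and the variable names are made up for illustration:

```shell
# Start a background child, then read its parent pid from /proc (Linux only).
sleep 30 &
child_pid=$!
parent_pid=$(sed -n 's/^PPid:[[:space:]]*//p' "/proc/$child_pid/status")
echo "this shell: $$   parent of $child_pid: $parent_pid"
kill "$child_pid" 2>/dev/null || true   # clean up the stand-in process
```

The two pids printed should be the same, since the shell itself did the fork().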

Now you want to log out. The moment you log out of the terminal, the shell sends SIGHUP to all its running children, and SIGCONT followed by SIGHUP to all its stopped children. The default action for SIGHUP is to terminate the process, so unless an application handles or ignores the signal, everything started from this shell, foreground as well as background, is killed. We want our application to survive the logout.
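You can watch the default SIGHUP behaviour without logging out by sending the signal by hand. A small sketch, again with `sleep` as a stand-in, and assuming the shell running it was not itself started with SIGHUP ignored:

```shell
# Send SIGHUP to a background child and collect its exit status.
sleep 10 &
victim=$!
kill -HUP "$victim"
status=0
wait "$victim" || status=$?     # a signal death is reported as 128 + signal number
echo "exit status: $status"     # SIGHUP is signal 1, so this is 129
```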

The textbook way of doing this is to start the command with `nohup' as

% nohup java some.long.running.application &
[1] 1001
% logout

or the subshell trick :

% (java some.long.running.application &)
% {prompt returns, java disowned}
{java program spits out something}

zsh users can do it as :

% java some.long.running.application &!
% {prompt returns, java disowned}
{java program spits out something}

Or use the good old screen (my favorite)!
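To convince yourself that a nohup-ed command really shrugs off SIGHUP, here is a rough experiment. The one-second pauses are an assumption that this is long enough for nohup to get going; they are not a guarantee:

```shell
# Start a stand-in command under nohup, send it SIGHUP, and see if it is still there.
nohup sleep 30 >/dev/null 2>&1 &
pid=$!
sleep 1                          # give nohup time to ignore SIGHUP and exec sleep
kill -HUP "$pid"
sleep 1
if kill -0 "$pid" 2>/dev/null; then survived=yes; else survived=no; fi
echo "survived SIGHUP: $survived"
kill "$pid" 2>/dev/null || true  # clean up with the default SIGTERM
```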

Unfortunately, you did not start the process with nohup or the subshell trick, and say the process cannot be restarted for some reason, or it has already done significant work.

What if we could tell the shell not to send SIGHUP to a particular child?
`disown' command lets you do just that! :D

% jobs
[1]  - running    iostat -xd 100
[2]  + running    java some.long.running.application
[3]  + suspended  ~/bin/startOfflineIMAP.sh
% disown %2 {bash takes %2 as well; bashers can also add the -h option}
{java program spits out something}

This tells the shell not to send SIGHUP to our precious java process. And you'd think you can now happily log out with the java process still running.

Well, not quite. Say the shell has pid 1000 and java process has pid 1001, then

% ls -l /proc/1000/fd
total 0
lrwx------. {...} 0 -> /dev/pts/1
lrwx------. {...} 1 -> /dev/pts/1
lrwx------. {...} 2 -> /dev/pts/1

% ls -l /proc/1001/fd
total 0
lrwx------. {...} 0 -> /dev/pts/1
lrwx------. {...} 1 -> /dev/pts/1
lrwx------. {...} 2 -> /dev/pts/1


Which means process 1001 uses the terminal /dev/pts/1 as its stdin, stdout and stderr. Even if we disown the java process, the terminal device /dev/pts/1 goes away when the shell quits, so the next read or write by the java process on any of stdin/stdout/stderr will most likely fail with an I/O error, and the program may well abort. Even if it does not abort, you might want to capture the program's stdout and stderr in a file, and possibly feed it some file as input. That is not possible, because after logout the fds of the java process look like this

% ls -l /proc/1001/fd
total 0
lrwx------. {...} 0 -> /dev/pts/1 (deleted)
lrwx------. {...} 1 -> /dev/pts/1 (deleted)
lrwx------. {...} 2 -> /dev/pts/1 (deleted)

Sad, isn't it?

Not quite!

Let us analyze how nohup works. First, nohup sets SIGHUP to be ignored, and an ignored signal stays ignored across exec(), so the SIGHUP sent at logout has no effect on the command. Second, if the output is not redirected to some file (that is, stdout is a terminal), all the output of the nohup-ed program goes to a default file (nohup.out in the current directory, or $HOME/nohup.out if that cannot be written). In any case, nohup has a writeable file descriptor to the file where output is supposed to go. Before exec()ing the command, nohup duplicates this fd onto stdout, and onto stderr if that is a terminal too, using dup2(). This way, the command can keep running after the shell is gone (which means its parent becomes pid 1), as its stdout and stderr fds are still valid: they no longer refer to the terminal of the parent shell but to a real open file. Stdin matters less, as we are running the process in background, non-interactive mode after all.
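The same mechanism can be imitated in plain shell, which makes it easier to see what is going on. This is only a sketch of the idea, not how nohup is actually implemented: `trap '' HUP` stands in for nohup ignoring SIGHUP, the redirections stand in for its dup2() calls, and `sleep 30` stands in for the real command. It assumes Linux (/proc) and a `readlink` that understands -f:

```shell
log=$(readlink -f "$(mktemp)")     # canonical path, comparable with /proc entries
(
    trap '' HUP                    # an ignored signal stays ignored across exec
    exec sleep 30 </dev/null >>"$log" 2>&1
) &
pid=$!
sleep 1                            # let the child finish its redirections
stdout_target=$(readlink "/proc/$pid/fd/1")
stderr_target=$(readlink "/proc/$pid/fd/2")
kill -HUP "$pid"
sleep 1
if kill -0 "$pid" 2>/dev/null; then alive=yes; else alive=no; fi
echo "stdout -> $stdout_target, stderr -> $stderr_target, alive after SIGHUP: $alive"
kill "$pid" 2>/dev/null || true    # clean up
rm -f "$log"
```

Both fds should point at the temporary file rather than at /dev/pts/N, and the process should still be alive after the SIGHUP.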

The question is : all this is fine as it is done _before_ starting the java process. What can we do to change its stdout and stderr _after_ it has been launched already?
Note that we cannot simply modify /proc/1001/fd/1 to link to some real file (I wonder what issues would creep up if that were allowed).

Our good old friend gdb comes to the rescue! The solution is trivial. Just attach to the process, open the file you want the output to go to from within that program with open(), and dup2() the new fd onto 1 and 2 :D

% gdb -p 1001
....
Attaching to process 1001
Reading symbols from /usr/bin/java...(no debugging symbols found)...done.
...
(gdb) call open("/home/prashant/tmp/output", O_WRONLY | O_CREAT | O_APPEND, 0644)
$1 = 5
(gdb) call dup2(5,1)
$2 = 1
(gdb) call open("/home/prashant/tmp/output.err", O_WRONLY | O_CREAT | O_APPEND, 0644)
$3 = 6
(gdb) call dup2(6,2)
$4 = 2
(gdb) detach 
Detaching from program: /usr/bin/java, process 1001
(gdb) quit
%

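In case debug info is not available, gdb will not know the O_ macros, and you can replace them with their numeric values from fcntl.h. The values are platform-specific; on Linux (x86), O_WRONLY | O_CREAT | O_APPEND is 0x441. The third argument is the file mode, which open() needs whenever O_CREAT is given.

...
(gdb) call dup2(open("/home/prashant/tmp/output", 0x441, 0644),1)
$2 = 1
(gdb) call dup2(open("/home/prashant/tmp/output.err", 0x441, 0644),2)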
$3 = 2
(gdb) detach
...
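If you would rather not trust a magic number, the flag value can be worked out in the shell. The octal constants below are the ones used by Linux on x86 and most other Linux architectures; that is an assumption about your platform, so check your own fcntl.h (BSD and macOS use different values):

```shell
# Linux values for the open() flags, in octal as the kernel headers write them.
O_WRONLY=01
O_CREAT=0100
O_APPEND=02000
flags_hex=$(printf '0x%x' "$(( $O_WRONLY | $O_CREAT | $O_APPEND ))")
echo "$flags_hex"    # prints 0x441
```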

Note that you can redirect stdout and stderr to the same file if you wish (just be careful with append mode on NFS and truncate mode in general ;).
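If you do this often, the gdb session can be scripted. The sketch below only generates the gdb commands; `make_redirect_cmds` is a made-up helper name, 0x441 assumes the Linux flag values, and the `(int)` casts are there because newer gdb versions may refuse to call a function whose return type they do not know:

```shell
# Hypothetical helper (not part of gdb): print gdb commands that point
# fds 1 and 2 of the attached process at the two given files.
# 0x441 = O_WRONLY|O_CREAT|O_APPEND on Linux x86 (platform assumption);
# 0644 is the mode open() needs whenever O_CREAT is set.
make_redirect_cmds() {
    printf 'call (int)dup2((int)open("%s", 0x441, 0644), 1)\n' "$1"
    printf 'call (int)dup2((int)open("%s", 0x441, 0644), 2)\n' "$2"
    printf 'detach\n'
}

cmds=$(make_redirect_cmds /home/prashant/tmp/output /home/prashant/tmp/output.err)
printf '%s\n' "$cmds"

# To apply it to a live process (needs gdb and permission to ptrace pid 1001):
#   make_redirect_cmds out.log err.log > redirect.gdb
#   gdb -p 1001 -batch -x redirect.gdb
```

Keeping the generation separate from the gdb run lets you read the commands before they touch a process you care about.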

And that's about it. Go ahead and log out. Your process should be busy while you are gone.

PS : this will work as long as the program does not try to read anything from stdin. If and when it does, it may crash, depending on whether the program abort()s when it cannot do basic I/O on fds 0, 1 and 2. If you plan to provide input from a file, you might want to open another file for reading and dup2() it onto fd 0 in a similar way.