worker: use os._exit for posix worker in all cases
author Jun Wu <quark@fb.com>
Thu, 24 Nov 2016 01:15:34 +0000
changeset 30530 86cd09bc13ba
parent 30529 4338f87dbf6f
child 30531 7b3136bc7bfd
worker: use os._exit for posix worker in all cases

Like commandserver, the worker should never run other resource cleanup logic.

Previously this was not true for workers that raised exceptions other than KeyboardInterrupt.

This actually caused a real-world deadlock with remotefilelog:

1. remotefilelog/fileserverclient creates a sshpeer. pipei/o/e are created.
2. worker inherits that sshpeer's pipei/o/e.
3. worker runs sshpeer.cleanup (only happens without os._exit)
4. worker closes pipeo/i, which would normally make the sshpeer read EOF from its stdin and exit. But the master process still has pipeo open, so there is no EOF.
5. worker reads pipee (stderr of sshpeer), which never completes because the ssh process does not exit and therefore does not close its stderr.
6. master waits for all workers, which never completes because they never complete sshpeer.cleanup.

This could also be addressed by closing these fds after fork, which is not easy because Python 2.x does not have an official "afterfork" hook. Hacking os.fork is also ugly. Besides, sshpeer is probably not the only troublemaker.

The patch changes _posixworker so that all its code paths use os._exit, to avoid running unwanted resource clean-ups.
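
The EOF behaviour behind step 4 can be reproduced without ssh or fork. The following is a minimal Python 3 sketch, not Mercurial code: a duplicated descriptor stands in for the copy of pipeo that the master process still holds, and the helper name eof_visible is hypothetical. It only illustrates that a pipe reader sees EOF once every copy of the write end has been closed, not before.

```python
import os
import select


def eof_visible(rfd, timeout=0.2):
    """Return True if a read on rfd would not block (data or EOF is ready)."""
    ready, _, _ = select.select([rfd], [], [], timeout)
    return bool(ready)


rfd, wfd = os.pipe()

# Stand-in for the write end that the master process keeps open after fork.
master_wfd = os.dup(wfd)

# The "worker" closes its copy of the write end (as sshpeer.cleanup would).
os.close(wfd)

# Another copy of the write end is still open, so the reader sees no EOF.
still_blocked = not eof_visible(rfd)

# Only when the last copy is closed does the reader get EOF (an empty read).
os.close(master_wfd)
got_eof = eof_visible(rfd) and os.read(rfd, 1) == b''

os.close(rfd)
```

In the deadlock described above the master never closes its copy while it waits for the workers, so the state captured by still_blocked persists indefinitely.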
mercurial/worker.py
--- a/mercurial/worker.py	Thu Nov 24 00:48:40 2016 +0000
+++ b/mercurial/worker.py	Thu Nov 24 01:15:34 2016 +0000
@@ -15,6 +15,7 @@
 from .i18n import _
 from . import (
     error,
+    scmutil,
     util,
 )
 
@@ -132,15 +133,26 @@
         if pid == 0:
             signal.signal(signal.SIGINT, oldhandler)
             signal.signal(signal.SIGCHLD, oldchldhandler)
-            try:
+
+            def workerfunc():
                 os.close(rfd)
                 for i, item in func(*(staticargs + (pargs,))):
                     os.write(wfd, '%d %s\n' % (i, item))
-                os._exit(0)
+
+            # make sure we use os._exit in all code paths. otherwise the worker
+            # may do some clean-ups which could cause surprises like deadlock.
+            # see sshpeer.cleanup for example.
+            try:
+                scmutil.callcatch(ui, workerfunc)
             except KeyboardInterrupt:
                 os._exit(255)
-                # other exceptions are allowed to propagate, we rely
-                # on lock.py's pid checks to avoid release callbacks
+            except: # never return, therefore no re-raises
+                try:
+                    ui.traceback()
+                finally:
+                    os._exit(255)
+            else:
+                os._exit(0)
         pids.add(pid)
     os.close(wfd)
     fp = os.fdopen(rfd, 'rb', 0)
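
The control flow introduced by this hunk can be tried in isolation. The following is a minimal Python 3 sketch under stated assumptions, not the Mercurial implementation: run_in_child is a hypothetical helper, and traceback.print_exc stands in for scmutil.callcatch and ui.traceback. It shows a forked child that leaves through os._exit on every path, so no atexit handlers or inherited cleanup code run in the child.

```python
import os
import traceback


def run_in_child(func):
    """Fork, run func in the child, and return the child's exit status.

    Hypothetical helper: the child always terminates via os._exit, which
    skips atexit handlers and any cleanup logic inherited from the parent.
    """
    pid = os.fork()
    if pid == 0:
        try:
            func()
        except KeyboardInterrupt:
            os._exit(255)
        except BaseException:
            # Never return to the caller's frames, so nothing is re-raised.
            try:
                traceback.print_exc()
            finally:
                os._exit(255)
        else:
            os._exit(0)
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


def failing_task():
    raise ValueError("simulated worker failure")
```

Calling run_in_child with a function that returns normally yields status 0, while run_in_child(failing_task) prints the traceback in the child and yields 255, which mirrors the exit codes used in the patch.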