[Ace-users] [ace-bugs] ACE_OS::fork results in undefined behavior

Thu Oct 25 12:41:07 CDT 2007

Hi folks, it's been a while.  I just got caught up on (read: dumped)
130,000+ unread emails that built up over the summer.

    --jtc

    ACE VERSION: 5.6.1 (subversion version 79840)

    HOST MACHINE and OPERATING SYSTEM:

NetBSD orac.acorntoolworks.com 3.0_STABLE NetBSD 3.0_STABLE (ORAC) #0: Tue Feb 21 20:05:51 PST 2006  jtc at orac.acorntoolworks.com:/home/jtc/netbsd/NetBSD-3/src/sys/arch/i386/compile/ORAC i386

    TARGET MACHINE and OPERATING SYSTEM, if different from HOST:

NetBSD orac.acorntoolworks.com 3.0_STABLE NetBSD 3.0_STABLE (ORAC) #0: Tue Feb 21 20:05:51 PST 2006  jtc at orac.acorntoolworks.com:/home/jtc/netbsd/NetBSD-3/src/sys/arch/i386/compile/ORAC i386

    COMPILER NAME AND VERSION (AND PATCHLEVEL):

gcc version 3.3.3 (NetBSD nb3 20040520)

    THE $ACE_ROOT/ace/config.h FILE [if you use a link to a platform-
    specific file, simply state which one]:

config-netbsd.h

    THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE [if you
    use a link to a platform-specific file, simply state which one
    (unless this isn't used in this case, e.g., with Microsoft Visual
    C++)]:

platform_netbsd.GNU

    CONTENTS OF $ACE_ROOT/bin/MakeProjectCreator/config/default.features
    (used by MPC when you generate your own makefiles):

N/A

    AREA/CLASS/EXAMPLE AFFECTED:
    [What example failed?  What module failed to compile?]

ACE_Process, ACE_OS::fork(), etc.

    DOES THE PROBLEM AFFECT:

EXECUTION

    SYNOPSIS:

ACE_OS::fork(const ACE_TCHAR*) results in undefined behavior.

    DESCRIPTION:

I recently chased down a race condition in ACE_Process that has eluded me
for quite some time.  To summarize, ACE_Process invokes async-signal-unsafe
functions between fork() and exec(), which results in undefined behavior.

The POSIX / X/Open specification for fork() states:

  * A process shall be created with a single thread.  If a multi-threaded  
    process calls fork(), the new process shall contain a replica of the 
    calling thread and its entire address space, possibly including the 
    state of mutexes and other resources.  Consequently, to avoid errors,
    the child process may only execute async-signal-safe operations until 
    such time as one of the exec functions is called.

An example of this is if a thread calls fork() while another thread is 
within malloc(), and thus holds a system mutex protecting the heap.  If
the forked process calls any dynamic memory management function, it may
block on the mutex.  Since no other threads can execute in the child, it
blocks and the parent process waits indefinitely for the child to exit.
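
For contrast, here is a minimal sketch of a child that stays within the
rule quoted above.  It uses plain POSIX calls rather than ACE, and
spawn_and_wait() is a hypothetical helper name, not an ACE function.
Everything that could allocate or take a lock happens before fork(); the
child only calls execv(), write(), and _exit(), all of which are on the
POSIX async-signal-safe list.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of ACE): run `path` with `argv` and return
// the child's exit status, or -1 if fork()/waitpid() failed or the child
// did not exit normally.
int spawn_and_wait(const char* path, char* const argv[])
{
    // Anything that may allocate or lock must happen before fork().
    assert(path != nullptr && argv != nullptr);
    static const char msg[] = "exec failed\n";

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        // Child: only async-signal-safe calls from here on.
        execv(path, argv);
        // execv() returns only on failure.
        ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
        (void)ignored;
        _exit(127);
    }

    // Parent: wait for the child, retrying if interrupted by a signal.
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the child never touches the heap, stdio, or any logging
facility, so it cannot block on a mutex that some other parent thread
held at the time of the fork().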

It could be worse.  Undefined behavior really is undefined.  On NetBSD,
it appears fork()ing disables preemptive context switches, but the thread
engine remains "alive" enough to switch to a suspended thread when the
child blocks on a mutex.  In some cases, it switched to an ORB worker thread
and the child process attempted to handle new CORBA requests intended for 
the parent.  For the longest time, I thought that this was the result of
a bug in our code -- e.g. not wrapping an ACE_Process invocation in a 
mutex -- but ultimately I tracked it down to ACE internals.

ACE violates the async-signal-safety constraint in several places.
I'll address each in separate PRFs, so the tradeoffs for each fix can
be discussed in its own thread.

ACE_Process::spawn(...) calls ACE::fork(const ACE_TCHAR* program_name,
int avoid_zombies), which calls ACE_OS::fork(const ACE_TCHAR*
program_name), which calls ACE_Base_Thread_Adapter::sync_log_msg(),
which calls ACE_Log_Msg::sync_hook() (via the sync_log_msg_hook_
member variable function pointer). So far, this is safe.

However, ACE_Log_Msg::sync_hook() calls ACE_LOG_MSG->ACE_Log_Msg::sync().
ACE_LOG_MSG is a macro that expands to ACE_Log_Msg::instance(), which
gets a pointer to thread specific storage for an ACE_Log_Msg object.  If
unset, it allocates a new instance.  Although we can assume that the TSS
key has already been created (otherwise the sync_hook() method wouldn't 
have been registered with the ACE_Base_Thread_Adapter), getting the TSS 
value and especially dynamic memory allocation are not async-signal-safe.

Now ACE_Log_Msg::sync() updates program_name_ and pid_, which are static
member variables.  So it would be possible to make it a static method and
avoid calling ACE_LOG_MSG. This would also require changing program_name_
from an ACE_TCHAR* to an ACE_TCHAR[] to avoid dynamic memory allocation.
We'd need to decide on a reasonable size for that array.
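
To make the fixed-size array idea concrete, here is a rough sketch.  It
uses char instead of ACE_TCHAR, and the sketch namespace, the 256-byte
size, and the function names are all assumptions for illustration, not
ACE code.  The copy is a plain loop, so the setter makes no library
calls, allocates nothing, and takes no locks.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Assumed capacity; picking the real size is the open question above.
const std::size_t kProgramNameMax = 256;

// Stand-in for a static ACE_TCHAR[] program_name_ member.
static char program_name[kProgramNameMax] = "";

// Copy `name` into the fixed buffer, truncating if it is too long.
// A null `name` stores the empty string.
void set_program_name(const char* name)
{
    std::size_t i = 0;
    if (name != nullptr) {
        for (; i + 1 < kProgramNameMax && name[i] != '\0'; ++i)
            program_name[i] = name[i];
    }
    program_name[i] = '\0';
}

// Accessor for ordinary (non-fork-window) code.
const char* get_program_name()
{
    // Invariant: the last slot only ever holds the terminator.
    assert(program_name[kProgramNameMax - 1] == '\0');
    return program_name;
}

} // namespace sketch
```

The cost of this approach is silent truncation of long names, which is
why the array size matters.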

But the only reason we'd want to set program_name_ and pid_ is for use by
the ACE_Log_Msg class between fork() and exec() (since after exec() it
won't matter).  I see little point in that, as there is no reasonable way
to make the rest of ACE_Log_Msg async-signal-safe.

In my sources, I've #if'd out the call to ACE_Base_Thread_Adapter::
sync_log_msg() in ACE_OS::fork(const ACE_TCHAR* program_name).
Another possibility would be to do so only if ACE_HAS_THREADS.  Are there
any other viable options?  In the long term, I wonder whether the
program_name argument to ACE::fork(), ACE_OS::fork(), etc. should be
deprecated/removed, since there is little that can be done with it.

On the other hand, maybe we could generalize this to an
ACE_OS::atfork() function (similar to pthread_atfork()), so programs
that need to do things in the child process can register handlers to
do so, without the coupling and assumptions that the program name is
the only parameter.  (ACE_Process does this with the parent() and
child() virtual methods, but not everyone uses ACE_Process).
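
As a rough illustration of the atfork() idea, the sketch below wraps
pthread_atfork() directly.  The sketch namespace and every function name
in it are hypothetical; nothing here is an existing ACE interface.  The
child handler only stores to a sig_atomic_t flag, since handlers that
run in the child are subject to the same async-signal-safety limits as
the rest of the code between fork() and exec().

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace sketch {

// Set by the child handler; a plain store to sig_atomic_t is safe here.
static volatile sig_atomic_t child_handler_ran = 0;

void on_fork_child() { child_handler_ran = 1; }

// Hypothetical wrapper in the spirit of the proposed ACE_OS::atfork().
int atfork(void (*prepare)(), void (*parent)(), void (*child)())
{
    return pthread_atfork(prepare, parent, child);
}

// Register the child handler once, from ordinary (pre-fork) code.
void install_child_handler()
{
    int rc = atfork(nullptr, nullptr, on_fork_child);
    assert(rc == 0);
    (void)rc;
}

// Fork once; the child exits 0 if the handler ran in it, 1 otherwise.
// Returns that exit status, or -1 on failure.
int fork_and_check()
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(child_handler_ran ? 0 : 1);
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

} // namespace sketch
```

One limitation worth weighing: handlers registered with pthread_atfork()
cannot be unregistered, so an ACE-level registry might want its own
handler list if removal matters.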

What would be the best choice to get back into the master ACE sources?

    REPEAT BY:

Have your program spawn a subprocess at exactly the right (or wrong,
depending on your point of view) time.

    SAMPLE FIX/WORKAROUND:

-- 
J.T. Conklin