<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
<HEAD>
  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">
  <META NAME="GENERATOR" CONTENT="GtkHTML/4.4.4">
</HEAD>
<BODY>
Hello,<BR>
<BR>
I know this issue is hard to reproduce, but the failure sequence is worth investigating,<BR>
an applciation exit or crash is preferable to an infinite loop.<BR>
<BR>
Any insights or comments anyone?<BR>
<BR>
Thanks,<BR>
Erik Cumps<BR>
<BR>
On vr, 2015-08-07 at 14:59 +0200, Erik Cumps wrote: 
<BLOCKQUOTE TYPE=CITE>
<PRE>
    TAO VERSION: 2.3.0
    ACE VERSION: 6.3.0

    HOST MACHINE and OPERATING SYSTEM:
        32-bit i686, Linux 3.2.35, debian wheezy

    TARGET MACHINE and OPERATING SYSTEM, if different from HOST:
        same as HOST

    COMPILER NAME AND VERSION (AND PATCHLEVEL):
        gcc (Debian 4.7.2-5) 4.7.2

    THE $ACE_ROOT/ace/config.h FILE:
        // $Id$
        
        #ifndef ACE_CONFIG_H_INCLUDED
        #define ACE_CONFIG_H_INCLUDED
        #ifdef __FreeBSD_kernel__
        #include "config-kfreebsd.h"
        #elif defined(__GNU__)
        #include "config-hurd.h"
        #else // assume linux
        /*
         * Macros that were enabled in Debian are stored here.
         *
         * Rationale: those were captured in the generated libraries on
         * compilation; hence the same values must be used when
including
         * ACE+TAO headers, to avoid unexpected results.
         */
        
        #if defined(ACE_HAS_IPV6)
        #undef ACE_HAS_IPV6
        #endif
        
        #ifndef ACE_USES_IPV4_IPV6_MIGRATION
        #define ACE_USES_IPV4_IPV6_MIGRATION 0
        #endif
        
        #ifndef __ACE_INLINE__
        #define __ACE_INLINE__
        #endif
        
        #include "config-linux.h"
        #endif // __FreeBSD_version
        #endif /* ACE_CONFIG_H_INCLUDED */

    THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE:
        # $Id$
        
        debug          = 1
        optimize       = 1
        inline         = 1
        
        ssl            = 1
        
        xt             = 1
        tk             = 1
        fl             = 1
        fox            = 1
        qt4            = 1
        ace_qt4reactor = 1
        
        bzip2          = 1
        lzo1           = 1
        zlib           = 1
        
        # Work-around #593225
        ARMEL_TARGET := $(shell echo '__ARMEL__' | $(CC) -E - | tail -n
1)
        ifeq ($(ARMEL_TARGET),1)
          no_hidden_visibility = 1
        endif
        
        include $(ACE_ROOT)/include/makeinclude/platform_linux.GNU
        
        PLATFORM_FOX_CPPFLAGS=-I/usr/include/fox-1.6
        PLATFORM_FOX_LIBS=-lFOX-1.6

    CONTENTS OF
$ACE_ROOT/bin/MakeProjectCreator/config/default.features:
        // Misc
        acexml          = 1
        ace_svcconf     = 1
        ace_token       = 1
        ssl             = 1
        ipv6            = 0
        exceptions      = 1
        
        // GUI reactors
        xt              = 1
        ace_xtreactor   = 1
        tao_xtresource  = 1
        
        tk              = 1
        ace_tkreactor   = 1
        tao_tkresource  = 1
        
        fl              = 1
        ace_flreactor   = 1
        tao_flresource  = 1
        
        qt              = 1
        qt4             = 1
        ace_qtreactor   = 1
        tao_qtresource  = 1
        
        fox             = 1
        ace_foxreactor  = 1
        tao_foxresource = 1
        
        // ZIOP
        zlib          = 1
        zzip          = 1
        bzip2         = 1
        lzo1          = 1


    AREA/CLASS/EXAMPLE AFFECTED:
        Transport handling. (Transport_Connector.cpp)

    DOES THE PROBLEM AFFECT:
        EXECUTION?

    SYNOPSIS:
A process fails to complete its shutdown because the
TAO_Connector::connect()
method is stuck in an infinite loop.

    DESCRIPTION:
The system is under heavy load. While the process is stopping its
servants
and is shutting down the ORB, and because of scheduling delays
introduced by
the heavy load, it tries to perform a remote object invocation, which
requires
the setup of a new Transport connection.

This is handled by the TAO_Connector::connect() method, which states:
   // Stay in this loop until we find:
   // a usable connection, or a timeout happens

In this particular case the tcm.find_transport() call returns:
TAO::Transport_Cache_Manager::CACHE_FOUND_CONNECTING.

Which means the following code is executed:
        else if (found ==
TAO::Transport_Cache_Manager::CACHE_FOUND_CONNECTING)
          {
            if (r->blocked_connect ())
              {
                ...
                // If wait_for_transport returns no errors, the
base_transport
                // points to the connection we wait for.
                if (this->wait_for_transport (r, base_transport,
timeout, false))
                  {
                    // be sure this transport is registered with the
reactor
                    // before using it.
                    if (!base_transport->register_if_necessary ())
                      {
                          base_transport->remove_reference ();
                          return 0;
                      }
                  }
        
                ...
        // In either success or failure cases of wait_for_transport, the
                // ref counter in corresponding to the ref counter added
by
                // find_transport is decremented.
                base_transport->remove_reference ();
              }
            else
              {
                ...
                // return the transport in it's current, unconnected
state
                return base_transport;
              }
          }

The only way out of the loop in this particular state is if:
* r->blocked_connect() returns false
* wait_for_transport() returns true and the base transport fails to
register
* tcm.find_transport() returns a different result than
CACHE_FOUND_CONNECTING

In this particular case neither of these conditions are true and the
loop is
therefore not exited. Instead the code keeps invoking
wait_for_transport(),
which incidentally tries to send a notification event to the reactor (so
that
it can stop) and these notification events pile up in a queue because
the
reactor cannot consume them (it is blocked waiting for the remote object
invocation to complete and that itself is blocked waiting for a
transport
connection).

To give some further indication of the state of the code, here is a
(elided
and simplified) stacktrace, obtained after terminating the process with
a
SIGQUIT signal:

The first part of the stacktrace contains the part where the code tries
to
notify the reactor that it should stop. As you can see it is pushing the
notification events onto the queue. At this point, the queue contained
already 157297 notifications:
(gdb) print *this
$3 = {<ACE_Copy_Disabled> = {<No data fields>}, alloc_queue_ = {head_ =
0x8f7feb0, cur_size_ = 157297, allocator_ = 
The end_reactor_event_loop() is being called because the has_shutdown()
method of the orb_core_ is true.

#11 0xb567aa4c in ACE_Notification_Queue::allocate_more_buffers
#12 0xb567afa8 in ACE_Notification_Queue::push_new_notification
#13 0xb569b534 in ACE_Select_Reactor_Notify::notify
#14 0xb563b491 in ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token>
>::notify
#15 0xb563b538 in ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token>
>::wakeup_all_threads
#16 0xb563e5d6 in ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token>
>::deactivate
#17 0xb579dfe2 in end_reactor_event_loop
#18 TAO_Leader_Follower::reset_client_thread
#19 0xb579e530 in ~TAO_LF_Client_Thread_Helper

The next part contains the TAO_Connector::connect() invocation. From the
size
of the notification queue we can determine that it has already spent a
lot of
time in the loop (at least long enough for more than 150000
notifications)

#20 TAO_Leader_Follower::wait_for_event
#21 0xb57a04ad in TAO_LF_Connect_Strategy::wait_i
#22 0xb576983d in TAO_Connect_Strategy::wait
#23 0xb57f4f12 in wait_for_transport
#24 TAO_Connector::wait_for_transport
#25 0xb57f69ab in TAO_Connector::connect

The final part shows that the TAO_Connector::connect() is invoked
because the
process tries to perform a remote object invocation:

#26 0xb57ccb13 in TAO::Profile_Transport_Resolver::try_connect_i
#27 0xb57cccc3 in TAO::Profile_Transport_Resolver::try_connect
#28 0xb5799ca8 in TAO_Default_Endpoint_Selector::select_endpoint
#29 0xb57cc89c in TAO::Profile_Transport_Resolver::resolve
#30 0xb579821c in TAO::Invocation_Adapter::invoke_remote_i
#31 0xb5798cc0 in TAO::Invocation_Adapter::invoke_i
#32 0xb5798076 in TAO::Invocation_Adapter::invoke
#33 0xb57cf2fb in TAO::Remote_Object_Proxy_Broker::_is_a
#34 0xb57aa845 in CORBA::Object::_is_a
#35 0xb745ca2c in narrow
#36 MyApp::Dispatcher::_narrow
#37 0x08089b26 in downcast_objref<MyApp::Dispatcher>
#38 0x08089ce3 in lookup_initref<MyApp::Dispatcher>
#39 0x08088f9f in _get_service
#40 MyObject::doMyObjectAction_Unsafe
#41 0x08089780 in MyObject::doMyObjectAction

    REPEAT BY:
This bug is hard to induce.

    SAMPLE FIX/WORKAROUND:
Would it make sense for TAO_Connector::connect() to verify the time it
spends
waiting for the connection and exit the loop if it detects the timeout?


</PRE>
</BLOCKQUOTE>
<BR>
</BODY>
</HTML>