[ace-bugs] infinite loop in TAO_Connector::connect()

Mon Sep 14 09:14:01 CDT 2015

Hi Eric,

Thank you for the PRF.

First, I assume you are using either the round trip timeout policy or the TAO-specific connection timeout. I agree that a while (true) loop seems risky, I suspect you've uncovered a deeper issue. How many threads are in the ORB thread pool?
Which thread handles the sigquit? what all does the signal handler do?

Does the stack you shared go deeper? Like maybe there is another connect attempt pending?

I've got some ideas but I'd like to see your responses before going further.

Best regards,
Phil

> On Sep 14, 2015, at 6:10 AM, Erik Cumps <erik.cumps at esaturnus.com> wrote:
> 
> Hello,
> 
> I know this issue is hard to reproduce, but the failure sequence is worth investigating,
> an applciation exit or crash is preferable to an infinite loop.
> 
> Any insights or comments anyone?
> 
> Thanks,
> Erik Cumps
> 
> On vr, 2015-08-07 at 14:59 +0200, Erik Cumps wrote:
>>     TAO VERSION: 2.3.0
>>     ACE VERSION: 6.3.0
>> 
>>     HOST MACHINE and OPERATING SYSTEM:
>>         32-bit i686, Linux 3.2.35, debian wheezy
>> 
>>     TARGET MACHINE and OPERATING SYSTEM, if different from HOST:
>>         same as HOST
>> 
>>     COMPILER NAME AND VERSION (AND PATCHLEVEL):
>>         gcc (Debian 4.7.2-5) 4.7.2
>> 
>>     THE $ACE_ROOT/ace/config.h FILE:
>>         // $Id$
>>         
>>         #ifndef ACE_CONFIG_H_INCLUDED
>>         #define ACE_CONFIG_H_INCLUDED
>>         #ifdef __FreeBSD_kernel__
>>         #include "config-kfreebsd.h"
>>         #elif defined(__GNU__)
>>         #include "config-hurd.h"
>>         #else // assume linux
>>         /*
>>          * Macros that were enabled in Debian are stored here.
>>          *
>>          * Rationale: those were captured in the generated libraries on
>>          * compilation; hence the same values must be used when
>> including
>>          * ACE+TAO headers, to avoid unexpected results.
>>          */
>>         
>>         #if defined(ACE_HAS_IPV6)
>>         #undef ACE_HAS_IPV6
>>         #endif
>>         
>>         #ifndef ACE_USES_IPV4_IPV6_MIGRATION
>>         #define ACE_USES_IPV4_IPV6_MIGRATION 0
>>         #endif
>>         
>>         #ifndef __ACE_INLINE__
>>         #define __ACE_INLINE__
>>         #endif
>>         
>>         #include "config-linux.h"
>>         #endif // __FreeBSD_version
>>         #endif /* ACE_CONFIG_H_INCLUDED */
>> 
>>     THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE:
>>         # $Id$
>>         
>>         debug          = 1
>>         optimize       = 1
>>         inline         = 1
>>         
>>         ssl            = 1
>>         
>>         xt             = 1
>>         tk             = 1
>>         fl             = 1
>>         fox            = 1
>>         qt4            = 1
>>         ace_qt4reactor = 1
>>         
>>         bzip2          = 1
>>         lzo1           = 1
>>         zlib           = 1
>>         
>>         # Work-around #593225
>>         ARMEL_TARGET := $(shell echo '__ARMEL__' | $(CC) -E - | tail -n
>> 1)
>>         ifeq ($(ARMEL_TARGET),1)
>>           no_hidden_visibility = 1
>>         endif
>>         
>>         include $(ACE_ROOT)/include/makeinclude/platform_linux.GNU
>>         
>>         PLATFORM_FOX_CPPFLAGS=-I/usr/include/fox-1.6
>>         PLATFORM_FOX_LIBS=-lFOX-1.6
>> 
>>     CONTENTS OF
>> $ACE_ROOT/bin/MakeProjectCreator/config/default.features:
>>         // Misc
>>         acexml          = 1
>>         ace_svcconf     = 1
>>         ace_token       = 1
>>         ssl             = 1
>>         ipv6            = 0
>>         exceptions      = 1
>>         
>>         // GUI reactors
>>         xt              = 1
>>         ace_xtreactor   = 1
>>         tao_xtresource  = 1
>>         
>>         tk              = 1
>>         ace_tkreactor   = 1
>>         tao_tkresource  = 1
>>         
>>         fl              = 1
>>         ace_flreactor   = 1
>>         tao_flresource  = 1
>>         
>>         qt              = 1
>>         qt4             = 1
>>         ace_qtreactor   = 1
>>         tao_qtresource  = 1
>>         
>>         fox             = 1
>>         ace_foxreactor  = 1
>>         tao_foxresource = 1
>>         
>>         // ZIOP
>>         zlib          = 1
>>         zzip          = 1
>>         bzip2         = 1
>>         lzo1          = 1
>> 
>> 
>>     AREA/CLASS/EXAMPLE AFFECTED:
>>         Transport handling. (Transport_Connector.cpp)
>> 
>>     DOES THE PROBLEM AFFECT:
>>         EXECUTION?
>> 
>>     SYNOPSIS:
>> A process fails to complete its shutdown because the
>> TAO_Connector::connect()
>> method is stuck in an infinite loop.
>> 
>>     DESCRIPTION:
>> The system is under heavy load. While the process is stopping its
>> servants
>> and is shutting down the ORB, and because of scheduling delays
>> introduced by
>> the heavy load, it tries to perform a remote object invocation, which
>> requires
>> the setup of a new Transport connection.
>> 
>> This is handled by the TAO_Connector::connect() method, which states:
>>    // Stay in this loop until we find:
>>    // a usable connection, or a timeout happens
>> 
>> In this particular case the tcm.find_transport() call returns:
>> TAO::Transport_Cache_Manager::CACHE_FOUND_CONNECTING.
>> 
>> Which means the following code is executed:
>>         else if (found ==
>> TAO::Transport_Cache_Manager::CACHE_FOUND_CONNECTING)
>>           {
>>             if (r->blocked_connect ())
>>               {
>>                 ...
>>                 // If wait_for_transport returns no errors, the
>> base_transport
>>                 // points to the connection we wait for.
>>                 if (this->wait_for_transport (r, base_transport,
>> timeout, false))
>>                   {
>>                     // be sure this transport is registered with the
>> reactor
>>                     // before using it.
>>                     if (!base_transport->register_if_necessary ())
>>                       {
>>                           base_transport->remove_reference ();
>>                           return 0;
>>                       }
>>                   }
>>         
>>                 ...
>>         // In either success or failure cases of wait_for_transport, the
>>                 // ref counter in corresponding to the ref counter added
>> by
>>                 // find_transport is decremented.
>>                 base_transport->remove_reference ();
>>               }
>>             else
>>               {
>>                 ...
>>                 // return the transport in it's current, unconnected
>> state
>>                 return base_transport;
>>               }
>>           }
>> 
>> The only way out of the loop in this particular state is if:
>> * r->blocked_connect() returns false
>> * wait_for_transport() returns true and the base transport fails to
>> register
>> * tcm.find_transport() returns a different result than
>> CACHE_FOUND_CONNECTING
>> 
>> In this particular case neither of these conditions are true and the
>> loop is
>> therefore not exited. Instead the code keeps invoking
>> wait_for_transport(),
>> which incidentally tries to send a notification event to the reactor (so
>> that
>> it can stop) and these notification events pile up in a queue because
>> the
>> reactor cannot consume them (it is blocked waiting for the remote object
>> invocation to complete and that itself is blocked waiting for a
>> transport
>> connection).
>> 
>> To give some further indication of the state of the code, here is a
>> (elided
>> and simplified) stacktrace, obtained after terminating the process with
>> a
>> SIGQUIT signal:
>> 
>> The first part of the stacktrace contains the part where the code tries
>> to
>> notify the reactor that it should stop. As you can see it is pushing the
>> notification events onto the queue. At this point, the queue contained
>> already 157297 notifications:
>> (gdb) print *this
>> $3 = {<ACE_Copy_Disabled> = {<No data fields>}, alloc_queue_ = {head_ =
>> 0x8f7feb0, cur_size_ = 157297, allocator_ = 
>> The end_reactor_event_loop() is being called because the has_shutdown()
>> method of the orb_core_ is true.
>> 
>> #11 0xb567aa4c in ACE_Notification_Queue::allocate_more_buffers
>> #12 0xb567afa8 in ACE_Notification_Queue::push_new_notification
>> #13 0xb569b534 in ACE_Select_Reactor_Notify::notify
>> #14 0xb563b491 in ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token>
>> >::notify
>> #15 0xb563b538 in ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token>
>> >::wakeup_all_threads
>> #16 0xb563e5d6 in ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token>
>> >::deactivate
>> #17 0xb579dfe2 in end_reactor_event_loop
>> #18 TAO_Leader_Follower::reset_client_thread
>> #19 0xb579e530 in ~TAO_LF_Client_Thread_Helper
>> 
>> The next part contains the TAO_Connector::connect() invocation. From the
>> size
>> of the notification queue we can determine that it has already spent a
>> lot of
>> time in the loop (at least long enough for more than 150000
>> notifications)
>> 
>> #20 TAO_Leader_Follower::wait_for_event
>> #21 0xb57a04ad in TAO_LF_Connect_Strategy::wait_i
>> #22 0xb576983d in TAO_Connect_Strategy::wait
>> #23 0xb57f4f12 in wait_for_transport
>> #24 TAO_Connector::wait_for_transport
>> #25 0xb57f69ab in TAO_Connector::connect
>> 
>> The final part shows that the TAO_Connector::connect() is invoked
>> because the
>> process tries to perform a remote object invocation:
>> 
>> #26 0xb57ccb13 in TAO::Profile_Transport_Resolver::try_connect_i
>> #27 0xb57cccc3 in TAO::Profile_Transport_Resolver::try_connect
>> #28 0xb5799ca8 in TAO_Default_Endpoint_Selector::select_endpoint
>> #29 0xb57cc89c in TAO::Profile_Transport_Resolver::resolve
>> #30 0xb579821c in TAO::Invocation_Adapter::invoke_remote_i
>> #31 0xb5798cc0 in TAO::Invocation_Adapter::invoke_i
>> #32 0xb5798076 in TAO::Invocation_Adapter::invoke
>> #33 0xb57cf2fb in TAO::Remote_Object_Proxy_Broker::_is_a
>> #34 0xb57aa845 in CORBA::Object::_is_a
>> #35 0xb745ca2c in narrow
>> #36 MyApp::Dispatcher::_narrow
>> #37 0x08089b26 in downcast_objref<MyApp::Dispatcher>
>> #38 0x08089ce3 in lookup_initref<MyApp::Dispatcher>
>> #39 0x08088f9f in _get_service
>> #40 MyObject::doMyObjectAction_Unsafe
>> #41 0x08089780 in MyObject::doMyObjectAction
>> 
>>     REPEAT BY:
>> This bug is hard to induce.
>> 
>>     SAMPLE FIX/WORKAROUND:
>> Would it make sense for TAO_Connector::connect() to verify the time it
>> spends
>> waiting for the connection and exit the loop if it detects the timeout?
>> 
>> 
>> 
> 
> _______________________________________________
> ace-bugs mailing list
> ace-bugs at list.isis.vanderbilt.edu
> http://list.isis.vanderbilt.edu/cgi-bin/mailman/listinfo/ace-bugs

--
Phil Mesnier
Principal Software Engineer and Partner,   http://www.ociweb.com
Object Computing, Inc.                     +01.314.579.0066 x225