<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">

<HTML>

<HEAD>

  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">

  <META NAME="GENERATOR" CONTENT="GtkHTML/4.4.4">

</HEAD>

<BODY>

Hello,<BR>

<BR>

I know this issue is hard to reproduce, but the failure sequence is worth investigating:<BR>

an application exit or crash is preferable to an infinite loop.<BR>

<BR>

Does anyone have any insights or comments?<BR>

<BR>

Thanks,<BR>

Erik Cumps<BR>

<BR>

On Fri, 2015-08-07 at 14:59 +0200, Erik Cumps wrote: 

<BLOCKQUOTE TYPE=CITE>

<PRE>

    TAO VERSION: 2.3.0

    ACE VERSION: 6.3.0

    HOST MACHINE and OPERATING SYSTEM:

        32-bit i686, Linux 3.2.35, debian wheezy

    TARGET MACHINE and OPERATING SYSTEM, if different from HOST:

        same as HOST

    COMPILER NAME AND VERSION (AND PATCHLEVEL):

        gcc (Debian 4.7.2-5) 4.7.2

    THE $ACE_ROOT/ace/config.h FILE:

        // $Id$

        #ifndef ACE_CONFIG_H_INCLUDED

        #define ACE_CONFIG_H_INCLUDED

        #ifdef __FreeBSD_kernel__

        #include "config-kfreebsd.h"

        #elif defined(__GNU__)

        #include "config-hurd.h"

        #else // assume linux

        /*

         * Macros that were enabled in Debian are stored here.

         *

         * Rationale: those were captured in the generated libraries on

         * compilation; hence the same values must be used when including

         * ACE+TAO headers, to avoid unexpected results.

         */

        #if defined(ACE_HAS_IPV6)

        #undef ACE_HAS_IPV6

        #endif

        #ifndef ACE_USES_IPV4_IPV6_MIGRATION

        #define ACE_USES_IPV4_IPV6_MIGRATION 0

        #endif

        #ifndef __ACE_INLINE__

        #define __ACE_INLINE__

        #endif

        #include "config-linux.h"

        #endif // __FreeBSD_version

        #endif /* ACE_CONFIG_H_INCLUDED */

    THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE:

        # $Id$

        debug          = 1

        optimize       = 1

        inline         = 1

        ssl            = 1

        xt             = 1

        tk             = 1

        fl             = 1

        fox            = 1

        qt4            = 1

        ace_qt4reactor = 1

        bzip2          = 1

        lzo1           = 1

        zlib           = 1

        # Work-around #593225

        ARMEL_TARGET := $(shell echo '__ARMEL__' | $(CC) -E - | tail -n 1)

        ifeq ($(ARMEL_TARGET),1)

          no_hidden_visibility = 1

        endif

        include $(ACE_ROOT)/include/makeinclude/platform_linux.GNU

        PLATFORM_FOX_CPPFLAGS=-I/usr/include/fox-1.6

        PLATFORM_FOX_LIBS=-lFOX-1.6

    CONTENTS OF $ACE_ROOT/bin/MakeProjectCreator/config/default.features:

        // Misc

        acexml          = 1

        ace_svcconf     = 1

        ace_token       = 1

        ssl             = 1

        ipv6            = 0

        exceptions      = 1

        // GUI reactors

        xt              = 1

        ace_xtreactor   = 1

        tao_xtresource  = 1

        tk              = 1

        ace_tkreactor   = 1

        tao_tkresource  = 1

        fl              = 1

        ace_flreactor   = 1

        tao_flresource  = 1

        qt              = 1

        qt4             = 1

        ace_qtreactor   = 1

        tao_qtresource  = 1

        fox             = 1

        ace_foxreactor  = 1

        tao_foxresource = 1

        // ZIOP

        zlib          = 1

        zzip          = 1

        bzip2         = 1

        lzo1          = 1

    AREA/CLASS/EXAMPLE AFFECTED:

        Transport handling. (Transport_Connector.cpp)

    DOES THE PROBLEM AFFECT:

        EXECUTION?

    SYNOPSIS:

A process fails to complete its shutdown because the

TAO_Connector::connect() method is stuck in an infinite loop.

    DESCRIPTION:

The system is under heavy load. Because of scheduling delays introduced

by that load, the process tries to perform a remote object invocation

while it is stopping its servants and shutting down the ORB. The

invocation requires the setup of a new Transport connection.

This is handled by the TAO_Connector::connect() method, which states:

   // Stay in this loop until we find:

   // a usable connection, or a timeout happens

In this particular case the tcm.find_transport() call returns

TAO::Transport_Cache_Manager::CACHE_FOUND_CONNECTING,

which means the following code is executed:

        else if (found == TAO::Transport_Cache_Manager::CACHE_FOUND_CONNECTING)

          {

            if (r->blocked_connect ())

              {

                ...

                // If wait_for_transport returns no errors, the base_transport

                // points to the connection we wait for.

                if (this->wait_for_transport (r, base_transport, timeout, false))

                  {

                    // be sure this transport is registered with the reactor

                    // before using it.

                    if (!base_transport->register_if_necessary ())

                      {

                          base_transport->remove_reference ();

                          return 0;

                      }

                  }

                ...

                // In either success or failure cases of wait_for_transport, the

                // ref counter in corresponding to the ref counter added by

                // find_transport is decremented.

                base_transport->remove_reference ();

              }

            else

              {

                ...

                // return the transport in it's current, unconnected state

                return base_transport;

              }

          }

The only ways out of the loop in this particular state are:

* r->blocked_connect() returns false

* wait_for_transport() returns true and the base transport fails to register

* tcm.find_transport() returns a different result than CACHE_FOUND_CONNECTING

In this particular case none of these conditions is true, so the loop is

never exited. Instead the code keeps invoking wait_for_transport(), which

incidentally tries to send a notification event to the reactor (so that

it can stop). These notification events pile up in a queue because the

reactor cannot consume them: it is blocked waiting for the remote object

invocation to complete, and that invocation is itself blocked waiting for

a transport connection.
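
The three exit conditions listed above can be restated as a small,
self-contained C++ model. This is only a sketch of the reasoning in this
report, not TAO code: ConnectState, pass_exits() and stuck_passes() are
hypothetical names, and the state is assumed to stay frozen between passes,
which is what appears to happen here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical snapshot of the conditions named in the report; not TAO types.
struct ConnectState {
  bool found_connecting;  // find_transport() returned CACHE_FOUND_CONNECTING
  bool blocked_connect;   // r->blocked_connect()
  bool wait_succeeded;    // wait_for_transport() returned true
  bool registered;        // register_if_necessary() returned true
};

// True if a single pass leaves the CACHE_FOUND_CONNECTING branch.
bool pass_exits(const ConnectState& s) {
  if (!s.found_connecting) return true;                // another branch handles it
  if (!s.blocked_connect) return true;                 // returns the unconnected transport
  if (s.wait_succeeded && !s.registered) return true;  // returns 0
  return false;                                        // goes around the loop again
}

// Counts how many of max_passes passes go around again when the state
// never changes (the assumption made for this sketch).
std::size_t stuck_passes(const ConnectState& s, std::size_t max_passes) {
  assert(max_passes > 0);
  std::size_t stuck = 0;
  for (std::size_t pass = 0; pass < max_passes; ++pass) {
    if (pass_exits(s)) break;
    ++stuck;
  }
  return stuck;
}
```

With found_connecting and blocked_connect both true and wait_succeeded
false, every pass goes around again. If each such pass queues one reactor
notification, as the behaviour described above suggests, the queue grows
without bound.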

To give some further indication of the state of the code, here is an

(elided and simplified) stack trace, obtained after terminating the

process with a SIGQUIT signal:

The first part of the stack trace shows the code trying to notify the

reactor that it should stop. As you can see, it is pushing notification

events onto the queue. At this point, the queue already contained

157297 notifications:

(gdb) print *this

$3 = {<ACE_Copy_Disabled> = {<No data fields>}, alloc_queue_ = {head_ =

0x8f7feb0, cur_size_ = 157297, allocator_ = 

end_reactor_event_loop() is being called because the has_shutdown()

method of the orb_core_ returns true.

#11 0xb567aa4c in ACE_Notification_Queue::allocate_more_buffers

#12 0xb567afa8 in ACE_Notification_Queue::push_new_notification

#13 0xb569b534 in ACE_Select_Reactor_Notify::notify

#14 0xb563b491 in ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token> >::notify

#15 0xb563b538 in ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token> >::wakeup_all_threads

#16 0xb563e5d6 in ACE_Select_Reactor_T<ACE_Reactor_Token_T<ACE_Token> >::deactivate

#17 0xb579dfe2 in end_reactor_event_loop

#18 TAO_Leader_Follower::reset_client_thread

#19 0xb579e530 in ~TAO_LF_Client_Thread_Helper

The next part contains the TAO_Connector::connect() invocation. From the

size of the notification queue we can determine that it has already spent

a lot of time in the loop (at least long enough to queue more than 150000

notifications):

#20 TAO_Leader_Follower::wait_for_event

#21 0xb57a04ad in TAO_LF_Connect_Strategy::wait_i

#22 0xb576983d in TAO_Connect_Strategy::wait

#23 0xb57f4f12 in wait_for_transport

#24 TAO_Connector::wait_for_transport

#25 0xb57f69ab in TAO_Connector::connect

The final part shows that TAO_Connector::connect() is invoked because the

process tries to perform a remote object invocation:

#26 0xb57ccb13 in TAO::Profile_Transport_Resolver::try_connect_i

#27 0xb57cccc3 in TAO::Profile_Transport_Resolver::try_connect

#28 0xb5799ca8 in TAO_Default_Endpoint_Selector::select_endpoint

#29 0xb57cc89c in TAO::Profile_Transport_Resolver::resolve

#30 0xb579821c in TAO::Invocation_Adapter::invoke_remote_i

#31 0xb5798cc0 in TAO::Invocation_Adapter::invoke_i

#32 0xb5798076 in TAO::Invocation_Adapter::invoke

#33 0xb57cf2fb in TAO::Remote_Object_Proxy_Broker::_is_a

#34 0xb57aa845 in CORBA::Object::_is_a

#35 0xb745ca2c in narrow

#36 MyApp::Dispatcher::_narrow

#37 0x08089b26 in downcast_objref<MyApp::Dispatcher>

#38 0x08089ce3 in lookup_initref<MyApp::Dispatcher>

#39 0x08088f9f in _get_service

#40 MyObject::doMyObjectAction_Unsafe

#41 0x08089780 in MyObject::doMyObjectAction

    REPEAT BY:

This bug is hard to induce.

    SAMPLE FIX/WORKAROUND:

Would it make sense for TAO_Connector::connect() to track the time it

spends waiting for the connection and exit the loop once the timeout has

elapsed?
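
To make the question concrete, here is a minimal sketch of a retry loop
bounded by a deadline, written in plain C++ with std::chrono rather than
ACE time types. It is not a patch against Transport_Connector.cpp:
Attempt, Outcome and connect_with_deadline() are hypothetical names, and
the callback merely stands in for one find_transport()/wait_for_transport()
pass.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical result of one pass through the loop; not a TAO type.
enum class Attempt { Connected, StillConnecting, Failed };

// Hypothetical overall result of the bounded loop.
enum class Outcome { Connected, Failed, TimedOut };

// Repeats `attempt` until it resolves or `budget` has elapsed.
Outcome connect_with_deadline(const std::function<Attempt()>& attempt,
                              std::chrono::steady_clock::duration budget) {
  assert(budget >= std::chrono::steady_clock::duration::zero());
  const auto deadline = std::chrono::steady_clock::now() + budget;
  for (;;) {
    switch (attempt()) {
      case Attempt::Connected:       return Outcome::Connected;
      case Attempt::Failed:          return Outcome::Failed;
      case Attempt::StillConnecting: break;  // leave the switch, then check the clock
    }
    // The check happens on every pass, so a state that never changes
    // can no longer keep the loop alive forever.
    if (std::chrono::steady_clock::now() >= deadline) {
      return Outcome::TimedOut;
    }
  }
}
```

A monotonic clock is used so that wall-clock adjustments cannot extend or
shorten the wait. Whether connect() should return 0 or raise an exception
on Outcome::TimedOut is left open here.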

</PRE>

</BLOCKQUOTE>

<BR>

</BODY>

</HTML>