[tao-bugs] Stale connections with BiDirGIOP

Sat Nov 7 12:21:33 CST 2015

I have created Bug 4207 in Bugzilla, and submitted pull request #157
with an automated test and a bug fix.

Hope it makes it,

Milan.

Milan Cvetkovic wrote:
> OK, answering my own question to some extent...
>
> I narrowed the problem to Transport_Cache_Manager_T, and its use of
> Cache_ExtId::index_.
>
> First, how I see it working:
> ============================
>
> Transport_Cache_Manager_T uses ACE_Hash_Map_Manager to keep a mapping
> between Cache_ExtId and Cache_IntId. In case of IIOP (in my case it
> really was SSLIOP, but I doubt there is a difference there), Cache_ExtId
> represents IP-ADDR:PORT/index triple for a connection. IP-ADDR:PORT is
> the address that Transport connects to, and index is used to allow
> multiple connections to the same IP:port address. All three values (address,
> port, index) are used to calculate the hash when stored in ACE_Hash_Map_Manager.
>
> When a new Transport is created, it is registered with the cache manager,
> which creates an entry using ip:port:index(0). When a transport is
> needed again, Transport_Cache_Manager_T::find_i looks up
> an existing connection, and uses it if it is found and idle.
>
> The problem:
> ============
> Transport_Cache_Manager_T::find_i assumes that indexes of existing
> connections are all consecutive numbers starting with 0. It will try to
> lookup Transport with index=1 *only* if index=0 entry for the same
> IP:port exists, and if it is busy. If IP:port:index=0 entry is
> previously purged from the cache, Transport_Cache_Manager_T::find_i will
> never try to use index=1 (or any other index in the cache).
>
> This scenario is exactly what happens with BiDirGIOP when a client
> disappears from the network and later reconnects (and re-registers its
> callback with the same IP:PORT value):
> - server caches the first callback with IP:port:index=0
> - client reconnects/re-registers
> - server caches the second callback with IP:port:index=1
> - eventually, server cleans up the cache entry with IP:port:index=0
> - but it is never able to access the entry with IP:port:index=1
>
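The sequence above can be replayed in a small, self-contained model. This is a sketch only, not TAO code: `Key`, `Cache`, `find_idle` and `replay_reconnect_scenario` are hypothetical names, and the probing loop is an approximation of the behaviour attributed to find_i above (probe index 0, 1, 2, ... and stop at the first missing index).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Key: (endpoint "ip:port", index). Value: whether the transport is busy.
using Key = std::pair<std::string, std::size_t>;
using Cache = std::map<Key, bool>;

// Simplified model of the lookup: probe consecutive indexes starting at 0
// and give up at the first index that is missing from the cache.
// Returns true and sets found_index if an idle entry was found.
inline bool find_idle(const Cache& cache, const std::string& endpoint,
                      std::size_t& found_index) {
    for (std::size_t index = 0;; ++index) {
        Cache::const_iterator it = cache.find(Key(endpoint, index));
        if (it == cache.end()) {
            return false;  // gap in the index sequence: stop searching
        }
        if (!it->second) {
            found_index = index;  // entry exists and is idle
            return true;
        }
        // entry exists but is busy: try the next index
    }
}

// Replays the reconnect scenario; returns whether any entry is reachable.
inline bool replay_reconnect_scenario() {
    const std::string endpoint = "192.168.12.113:7771";
    Cache cache;
    cache[Key(endpoint, 0)] = false;  // first callback registration
    cache[Key(endpoint, 1)] = false;  // client re-registers, gets index=1
    cache.erase(Key(endpoint, 0));    // server purges the stale index=0 entry

    assert(cache.count(Key(endpoint, 1)) == 1);  // index=1 is still cached
    std::size_t found = 0;
    return find_idle(cache, endpoint, found);    // probe stops at missing index=0
}
```

In this model `replay_reconnect_scenario()` returns false: the index=1 entry is still in the map, but the probe stops at the missing index=0 and never looks further.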
> I am not too sure about the impact on regular TAO clients, since I didn't
> try it, but I would assume that:
> - if the index=0 entry is busy, a second transport is created
> - if the index=0 entry's transport is closed, the index=0
>    entry is purged from the cache, and the index=1 entry is no
>    longer reachable until an index=0 entry for the same IP:PORT is created.
>
> Potential solutions:
> ====================
> - I could fix Transport_Cache_Manager_T::unbind_i so that it makes sure
>    that the assumption made in find_i is true: if the cache has M elements,
>    when removing an entry at index=N (where N is in [0,M)), all remaining
>    entries for the same IP:port should have consecutive indexes
>    in range [0,M-1).
> - Alternatively, Transport_Cache_Manager_T can be rewritten
>    to actually use a multi-hash-map. The existing implementation with
>    a hash map and indexes seems inappropriate and sub-optimal.
>    Or perhaps there is a good reason not to use a multi-hash-map that I am
>    not aware of...
>    It seems that this would touch more files in TAO though.
>
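For comparison, the second option can be sketched with a standard multimap. This is an illustration under assumptions, not the fix submitted in the pull request: `Transport`, `MultiCache`, `find_idle`, `purge` and `demo_reconnect` are hypothetical names, and `std::unordered_multimap` stands in for whatever multi-hash-map the cache manager would use. The key is the endpoint alone, so no index sequence exists that could develop a gap.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical transport record; the real cache stores Cache_IntId values.
struct Transport {
    int id;
    bool busy;
};

// Keyed by endpoint only; several transports may share one key.
using MultiCache = std::unordered_multimap<std::string, Transport>;

// Returns the first idle transport for the endpoint, or nullptr if none.
inline const Transport* find_idle(const MultiCache& cache,
                                  const std::string& endpoint) {
    auto range = cache.equal_range(endpoint);
    for (auto it = range.first; it != range.second; ++it) {
        if (!it->second.busy) {
            return &it->second;
        }
    }
    return nullptr;
}

// Removes the transport with the given id; returns true if one was erased.
inline bool purge(MultiCache& cache, const std::string& endpoint, int id) {
    auto range = cache.equal_range(endpoint);
    for (auto it = range.first; it != range.second; ++it) {
        if (it->second.id == id) {
            cache.erase(it);
            return true;
        }
    }
    return false;
}

// Replays the reconnect scenario; returns the id of the transport found
// after the stale one is purged, or -1 if nothing is reachable.
inline int demo_reconnect() {
    const std::string endpoint = "192.168.12.113:7771";
    MultiCache cache;
    cache.emplace(endpoint, Transport{1, false});  // first registration
    cache.emplace(endpoint, Transport{2, false});  // re-registration
    bool erased = purge(cache, endpoint, 1);       // stale transport removed
    assert(erased);
    const Transport* t = find_idle(cache, endpoint);
    return t ? t->id : -1;
}
```

Here the transport added on re-registration stays reachable after the stale one is purged, because the lookup walks every entry for the endpoint rather than probing indexes.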
> I would like to contribute this patch. I would appreciate it if someone
> could advise me which direction I should take.
>
> Thanks, Milan.
>
> Milan Cvetkovic wrote:
>>      TAO VERSION: 2.2.1
>>      ACE VERSION: 6.2.1
>>
>>      HOST MACHINE and OPERATING SYSTEM: Debian wheezy on x86_64
>>
>>      THE $ACE_ROOT/ace/config.h FILE: config-linux.h
>>
>>      THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE:
>> c++11 = 1
>> ssl = 1
>> include ${ACE_ROOT}/include/makeinclude/platform_linux.GNU
>>
>>      AREA/CLASS/EXAMPLE AFFECTED:
>>      BiDirGIOP / Transport_Cache_Manager_T / SSLIOP
>>      DOES THE PROBLEM AFFECT:
>>          EXECUTION: YES
>>
>>     SYNOPSIS: After loss of network connection from a client, server is
>> no longer able to invoke callback RPCs, even after client reconnected,
>> and resubmitted its callback IOR.
>>
>> DESCRIPTION:
>>
>> I have BiDirGIOP setup over SSLIOP. Client is behind firewall router on
>> 192.168.12.x network. Client incarnates callback object, listening on
>> 192.168.12.113:7770 and port 7771 for SSL. Client contacts the server
>> over the internet, and it sends the IOR to callback object above. Server
>> later uses callback object to send various notifications. This setup
>> utilizes bidirectional GIOP, over SSLIOP.
>>
>> Everything works as desired, until client loses connectivity to server.
>> When client re-registers, server adds the new Transport to Transport
>> cache manager, however in some scenarios it does not remove the old
>> transport, and keeps using it for callbacks, failing on CORBA::TIMEOUT.
>>
>> My understanding is that Transport_Cache_Manager keeps the hash map
>> table of all connections. These connections have the same key, being
>> issued from the same IP:port every time (in the example above,
>> 192.168.12.113:7771). In some cases, the server does not replace the
>> existing transport entry, but adds it with an increased index, and keeps
>> using index:0 for making callbacks.
>>
>> I am attaching the portions of TAO logs. Note that second registration
>> binds with index :1. The stale transport is kept with index :0.
>>
>> How do I control the content of Transport_Cache_Manager_T? I removed the
>> references to callback objects from server, however the transport is
>> still cached.
>>
>> Thanks, Milan.
>