[tao-bugs] Stale connections with BiDirGIOP

Sat Oct 17 20:11:55 CDT 2015

OK, answering my own question to some extent...

I narrowed the problem to Transport_Cache_Manager_T, and its use of 
Cache_ExtId::index_.

First, how I see it working:
============================

Transport_Cache_Manager_T uses ACE_Hash_Map_Manager to keep a mapping
between Cache_ExtId and Cache_IntId. In the case of IIOP (in my case it
really was SSLIOP, but I doubt there is a difference there), Cache_ExtId
represents an IP-ADDR:PORT/index triple for a connection. IP-ADDR:PORT is
the address that the Transport connects to, and index is used to allow
multiple connections to the same IP/port address. All three values
(address, port, index) are used to calculate the hash when the entry is
stored in ACE_Hash_Map_Manager.

When a new Transport is created, it is registered with the cache manager,
which creates an entry using IP:PORT:index=0. When a transport is needed
again, Transport_Cache_Manager_T::find_i looks up an existing connection
and uses it if it is found and idle.
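
As a rough illustration of the keying scheme described above, here is a
minimal sketch in plain C++ that uses std::unordered_map in place of
ACE_Hash_Map_Manager. ExtId, IntId, ExtIdHash and bind_transport are
hypothetical stand-ins, not TAO's actual classes, and the "first free
index" policy in bind_transport is an assumption about how a second
entry for the same IP:PORT ends up at index=1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for Cache_ExtId: address, port and index
// together form the key.
struct ExtId {
  std::string addr;
  std::uint16_t port;
  std::size_t index;

  bool operator==(const ExtId& other) const {
    return addr == other.addr && port == other.port && index == other.index;
  }
};

// All three fields contribute to the hash, as described in the text.
struct ExtIdHash {
  std::size_t operator()(const ExtId& key) const {
    std::size_t h = std::hash<std::string>{}(key.addr);
    h = h * 31 + std::hash<std::uint16_t>{}(key.port);
    h = h * 31 + std::hash<std::size_t>{}(key.index);
    return h;
  }
};

// Hypothetical stand-in for Cache_IntId: the cached transport and its state.
struct IntId {
  int transport_id;
  bool idle;
};

using Cache = std::unordered_map<ExtId, IntId, ExtIdHash>;

// Registers a transport under the first free index for addr:port
// (assumed policy) and returns the index that was used.
std::size_t bind_transport(Cache& cache, const std::string& addr,
                           std::uint16_t port, int transport_id) {
  std::size_t index = 0;
  while (cache.count(ExtId{addr, port, index}) != 0) {
    ++index;
  }
  cache.emplace(ExtId{addr, port, index}, IntId{transport_id, true});
  return index;
}
```

In this model, two registrations for the same address and port occupy
two distinct hash-map slots that differ only in the index field.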

The problem:
============
Transport_Cache_Manager_T::find_i assumes that the indexes of existing
connections are all consecutive numbers starting with 0. It will try to
look up the Transport with index=1 *only* if the index=0 entry for the
same IP:PORT exists and is busy. If the IP:PORT:index=0 entry was
previously purged from the cache, Transport_Cache_Manager_T::find_i will
never try to use index=1 (or any other index in the cache).

This scenario is exactly what happens with BiDirGIOP when a client
disappears from the network and later reconnects (and re-registers its
callback with the same IP:PORT value):
- server caches the first callback with IP:PORT:index=0
- client reconnects/re-registers
- server caches the second callback with IP:PORT:index=1
- eventually, server cleans up the cache entry with IP:PORT:index=0
- but it is never able to access the entry with IP:PORT:index=1
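
The sequence above can be reproduced with a small model. This is only a
sketch of the lookup behaviour as described in this message, not the
actual find_i code: Entry, Key, Cache and find_idle are hypothetical
names, and std::map stands in for the ACE hash map.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical cache entry: which transport it is and whether it is idle.
struct Entry {
  int transport_id;
  bool idle;
};

using Key = std::tuple<std::string, int, int>;  // address, port, index
using Cache = std::map<Key, Entry>;

// Simplified model of the lookup: probe index 0, 1, 2, ... and give up
// at the first index that has no entry.
// Returns the id of an idle transport, or -1 if none is reachable.
int find_idle(const Cache& cache, const std::string& addr, int port) {
  for (int index = 0;; ++index) {
    auto it = cache.find(Key{addr, port, index});
    if (it == cache.end()) {
      return -1;  // gap in the index sequence: higher indexes are never probed
    }
    if (it->second.idle) {
      return it->second.transport_id;
    }
  }
}
```

With entries at index=0 and index=1, erasing the index=0 entry makes
find_idle return -1 even though the index=1 entry is still in the map,
which matches the stale-callback symptom.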

I am not too sure about the impact on regular TAO clients, since I didn't
try it, but I would assume that:
- if the index=0 entry is busy, a second transport is created
- if the index=0 entry's transport is closed, the index=0
   entry is purged from the cache, and the index=1 entry is no
   longer reachable until an index=0 entry for the same IP:PORT is created.

Potential solutions:
====================
- I could fix Transport_Cache_Manager_T::unbind_i so that it makes sure
   the assumption made in find_i holds: if the cache has M entries for
   an IP:PORT, then after removing the entry at index=N (where N is in
   [0,M)), all remaining entries for the same IP:PORT should have
   consecutive indexes in the range [0,M-1).
- Alternatively, Transport_Cache_Manager_T can be rewritten
   to actually use a multi-hash-map. The existing implementation with
   a hash map and indexes seems inappropriate and sub-optimal,
   unless there is a good reason not to use a multi-hash-map that I am
   not aware of...
   It seems that this would touch more files in TAO, though.
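
To make the first option concrete, here is a minimal sketch of an
unbind that keeps indexes consecutive by moving the highest-index entry
into the freed slot. It is a model only: Entry, Key, Cache and
unbind_compacting are hypothetical names, std::map stands in for the
ACE hash map, and the code assumes the indexes were consecutive before
the call. In the real cache, anything that refers back to a moved entry
would presumably need to be updated as well.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical cache entry: which transport it is and whether it is idle.
struct Entry {
  int transport_id;
  bool idle;
};

using Key = std::tuple<std::string, int, int>;  // address, port, index
using Cache = std::map<Key, Entry>;

// Removes addr:port:index. If that would leave a gap, the entry with the
// highest index for the same addr:port is moved into the freed slot, so
// the remaining indexes stay consecutive starting at 0.
// Returns false if there was no such entry.
bool unbind_compacting(Cache& cache, const std::string& addr, int port,
                       int index) {
  auto victim = cache.find(Key{addr, port, index});
  if (victim == cache.end()) {
    return false;
  }

  // Walk up to the highest index in use (assumes no gaps above `index`).
  int last = index;
  while (cache.count(Key{addr, port, last + 1}) != 0) {
    ++last;
  }

  auto tail = cache.find(Key{addr, port, last});
  if (tail != victim) {
    victim->second = tail->second;  // reuse the freed slot for the tail entry
  }
  cache.erase(tail);
  return true;
}
```

The second option would drop the index altogether, for example by
keying a std::unordered_multimap on address and port only and scanning
the result of equal_range for an idle transport.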

I would like to contribute this patch. I would appreciate it if someone
could advise me which direction I should take.

Thanks, Milan.

Milan Cvetkovic wrote:
>      TAO VERSION: 2.2.1
>      ACE VERSION: 6.2.1
>
>      HOST MACHINE and OPERATING SYSTEM: Debian wheezy on x86_64
>
>      THE $ACE_ROOT/ace/config.h FILE: config-linux.h
>
>      THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE:
> c++11 = 1
> ssl = 1
> include ${ACE_ROOT}/include/makeinclude/platform_linux.GNU
>
>      AREA/CLASS/EXAMPLE AFFECTED:
>      BiDirGIOP / Transport_Cache_Manager_T / SSLIOP
>      DOES THE PROBLEM AFFECT:
>          EXECUTION: YES
>
>     SYNOPSIS: After loss of network connection from a client, server is
> no longer able to invoke callback RPCs, even after client reconnected,
> and resubmitted its callback IOR.
>
> DESCRIPTION:
>
> I have BiDirGIOP setup over SSLIOP. Client is behind firewall router on
> 192.168.12.x network. Client incarnates callback object, listening on
> 192.168.12.113:7770 and port 7771 for SSL. Client contacts the server
> over the internet, and it sends the IOR to callback object above. Server
> later uses callback object to send various notifications. This setup
> utilizes bidirectional GIOP, over SSLIOP.
>
> Everything works as desired, until client loses connectivity to server.
> When client re-registers, server adds the new Transport to Transport
> cache manager, however in some scenarios it does not remove the old
> transport, and keeps using it for callbacks, failing with CORBA::TIMEOUT.
>
> My understanding is that Transport_Cache_Manager keeps the hash map
> table of all connections. These connections have the same key, being
> issued from the same IP:port every time (in the example above,
> 192.168.12.113:7771). In some cases, the server does not replace the
> existing transport entry, but adds it with an increased index, and keeps
> using index:0 for making callbacks.
>
> I am attaching the portions of TAO logs. Note that second registration
> binds with index :1. The stale transport is kept with index :0.
>
> How do I control the content of Transport_Cache_Manager_T? I removed the
> references to callback objects from server, however the transport is
> still cached.
>
> Thanks, Milan.