[tao-bugs] Stale connections with BiDirGIOP

Mon Nov 16 12:46:31 CST 2015

How do I get this patch merged to ACE/TAO?

Thanks, Milan.

Milan Cvetkovic wrote:
> I have created Bug 4207 in bugzilla, and submitted pull request #157
> with automated test and bug fix.
>
> Hope it makes it,
>
> Milan.
>
> Milan Cvetkovic wrote:
>> OK, answering my own question to some extent...
>>
>> I narrowed the problem to Transport_Cache_Manager_T, and its use of
>> Cache_ExtId::index_.
>>
>> First, how I see it working:
>> ============================
>>
>> Transport_Cache_Manager_T uses ACE_Hash_Map_Manager to keep a mapping
>> between Cache_ExtId and Cache_IntId. In case of IIOP (in my case it
>> really was SSLIOP, but I doubt there is a difference there), Cache_ExtId
>> represents IP-ADDR:PORT/index triple for a connection. IP-ADDR:PORT is
>> the address that Transport connects to, and index is used to allow
>> multiple connections to same ip/port address. All three values (address,
>> port, index) are used to calculate the hash when stored in
>> ACE_Hash_Map_Manager.
>>
>> When a new Transport is created, it is registered with the cache
>> manager, which creates an entry keyed by ip:port:index=0. When a
>> transport is needed again, Transport_Cache_Manager_T::find_i looks up
>> an existing connection, and uses it if it is found and idle.
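
A minimal sketch of the keying scheme described above, for readers who do not have the TAO sources at hand. It uses std::unordered_map rather than ACE_Hash_Map_Manager, and ExtId, IntId, ExtIdHash and bind_transport are hypothetical stand-ins, not TAO's real classes or signatures. The "first free index" policy in bind_transport is an assumption based on the behaviour described in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for Cache_ExtId: endpoint plus a per-endpoint index.
struct ExtId {
  std::string host;
  unsigned short port;
  std::size_t index;
  bool operator==(const ExtId& o) const {
    return host == o.host && port == o.port && index == o.index;
  }
};

// All three fields feed the hash, so entries that differ only in index
// are separate keys in the map.
struct ExtIdHash {
  std::size_t operator()(const ExtId& k) const {
    std::size_t h = std::hash<std::string>{}(k.host);
    h = h * 31 + std::hash<unsigned short>{}(k.port);
    h = h * 31 + std::hash<std::size_t>{}(k.index);
    return h;
  }
};

// Hypothetical stand-in for Cache_IntId: a transport id and a busy flag.
struct IntId {
  int transport_id;
  bool busy;
};

using Cache = std::unordered_map<ExtId, IntId, ExtIdHash>;

// Register a transport under the first free index for the endpoint
// (assumed policy) and return the index that was used.
std::size_t bind_transport(Cache& cache, const std::string& host,
                           unsigned short port, int transport_id) {
  std::size_t index = 0;
  while (cache.count(ExtId{host, port, index}) != 0) ++index;
  auto inserted =
      cache.emplace(ExtId{host, port, index}, IntId{transport_id, false});
  assert(inserted.second);  // the slot was free, so the insert must succeed
  (void)inserted;
  return index;
}
```

With this sketch, two transports bound to the same host and port end up under index 0 and index 1 as two independent map entries.
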
>>
>> The problem:
>> ============
>> Transport_Cache_Manager_T::find_i assumes that indexes of existing
>> connections are all consecutive numbers starting with 0. It will try to
>> lookup Transport with index=1 *only* if index=0 entry for the same
>> IP:port exists, and if it is busy. If IP:port:index=0 entry is
>> previously purged from the cache, Transport_Cache_Manager_T::find_i will
>> never try to use index=1 (or any other index in the cache).
>>
>> This scenario is exactly what happens with BiDirGIOP when a client
>> disappears from the network, and later reconnects (and re-registers
>> its callback with the same IP:PORT value):
>> - server caches the first callback with IP:port:index=0
>> - client reconnects/re-registers
>> - server caches the second callback with IP:port:index=1
>> - eventually, server cleans up the cache entry with IP:port:index=0
>> - but it is never able to access the entry with IP:port:index=1
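
The sequence above can be replayed with a small sketch. This is not TAO's find_i; it only mimics the probing rule described in this thread (start at index 0, move to the next index only while the current entry exists and is busy). Entry, Key, Cache, find_idle and stale_scenario_loses_entry are hypothetical names, and std::map stands in for the ACE hash map.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

struct Entry {
  int transport_id;
  bool busy;
};

// Key is (host, port, index), mirroring the IP:port:index triple.
using Key = std::tuple<std::string, unsigned short, std::size_t>;
using Cache = std::map<Key, Entry>;

// Probe index 0, 1, 2, ... and stop at the first missing index.
// Returns the first idle entry found, or nullptr.
const Entry* find_idle(const Cache& cache, const std::string& host,
                       unsigned short port) {
  for (std::size_t index = 0;; ++index) {
    auto it = cache.find(Key{host, port, index});
    if (it == cache.end()) return nullptr;      // first gap ends the search
    if (!it->second.busy) return &it->second;   // idle transport, reuse it
  }
}

// Replays the BiDirGIOP scenario; returns true if the index=1 entry
// becomes unreachable once index=0 has been purged.
bool stale_scenario_loses_entry() {
  Cache cache;
  cache[Key{"192.168.12.113", 7771, 0}] = Entry{1, false};  // first callback
  cache[Key{"192.168.12.113", 7771, 1}] = Entry{2, false};  // after reconnect
  assert(find_idle(cache, "192.168.12.113", 7771) != nullptr);
  cache.erase(Key{"192.168.12.113", 7771, 0});              // stale entry purged
  return find_idle(cache, "192.168.12.113", 7771) == nullptr;
}
```

In this sketch the entry for transport 2 is still in the map after the purge, but the lookup gives up at the gap left by index 0 and never reaches it.
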
>>
>> I am not too sure about the impact on regular TAO clients, since I
>> didn't try it, but I would assume that:
>> - if the index=0 entry is busy, a second transport is created
>>    with index=1
>> - if the index=0 entry's transport is closed, the index=0
>>    entry is purged from the cache, and the index=1 entry is no
>>    longer reachable, until an index=0 entry for the same IP:PORT is created.
>>
>> Potential solutions:
>> ====================
>> - I could fix Transport_Cache_Manager_T::unbind_i so that it preserves
>>    the assumption made in find_i: if the cache has M entries for an
>>    IP:port, then after removing the entry at index=N (where N is in
>>    [0,M)), all remaining entries for the same IP:port should have
>>    consecutive indexes in the range [0,M-1).
>> - Alternatively, Transport_Cache_Manager_T can be rewritten
>>    to actually use a multi-hash-map. The existing implementation with
>>    a hash map and indexes seems inappropriate and sub-optimal.
>>    Or perhaps there is a good reason not to use a multi-hash-map that
>>    I am not aware of...
>>    It seems that this would touch more files in TAO though.
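
A rough sketch of the first option, assuming the indexes for an endpoint are consecutive before the removal. It is an illustration only, not the change submitted in the pull request: unbind_compacting, Entry, Key and Cache are hypothetical names, and std::map stands in for the ACE hash map. The gap is closed by moving the endpoint's highest-index entry into the freed slot.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

struct Entry {
  int transport_id;
  bool busy;
};

// Key is (host, port, index), mirroring the IP:port:index triple.
using Key = std::tuple<std::string, unsigned short, std::size_t>;
using Cache = std::map<Key, Entry>;

// Remove the entry at `index` and keep the endpoint's indexes consecutive
// by re-binding its highest-index entry under the freed index.
void unbind_compacting(Cache& cache, const std::string& host,
                       unsigned short port, std::size_t index) {
  if (cache.erase(Key{host, port, index}) == 0) return;  // nothing to remove

  // Walk upward to one past the highest occupied index above the gap.
  std::size_t last = index + 1;
  while (cache.count(Key{host, port, last}) != 0) ++last;
  if (last == index + 1) return;  // removed entry was the last one, no gap
  --last;

  auto it = cache.find(Key{host, port, last});
  assert(it != cache.end());
  Entry moved = it->second;
  cache.erase(it);
  cache[Key{host, port, index}] = moved;  // fill the gap
}
```

After removing index 0 from an endpoint that holds indexes 0 and 1, the surviving transport sits at index 0 again, so a lookup that starts probing at index 0 can find it. The multi-hash-map alternative would avoid the index bookkeeping entirely, at the cost of a larger change.
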
>>
>> I would like to contribute this patch. I would appreciate it if someone
>> could advise me which direction I should take.
>>
>> Thanks, Milan.
>>
>> Milan Cvetkovic wrote:
>>>      TAO VERSION: 2.2.1
>>>      ACE VERSION: 6.2.1
>>>
>>>      HOST MACHINE and OPERATING SYSTEM: Debian wheezy on x86_64
>>>
>>>      THE $ACE_ROOT/ace/config.h FILE: config-linux.h
>>>
>>>      THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE:
>>> c++11 = 1
>>> ssl = 1
>>> include ${ACE_ROOT}/include/makeinclude/platform_linux.GNU
>>>
>>>      AREA/CLASS/EXAMPLE AFFECTED:
>>>      BiDirGIOP / Transport_Cache_Manager_T / SSLIOP
>>>      DOES THE PROBLEM AFFECT:
>>>          EXECUTION: YES
>>>
>>>     SYNOPSIS: After loss of network connection from a client, server is
>>> no longer able to invoke callback RPCs, even after client reconnected,
>>> and resubmitted its callback IOR.
>>>
>>> DESCRIPTION:
>>>
>>> I have BiDirGIOP setup over SSLIOP. Client is behind firewall router on
>>> 192.168.12.x network. Client incarnates callback object, listening on
>>> 192.168.12.113:7770 and port 7771 for SSL. Client contacts the server
>>> over the internet, and it sends the IOR to callback object above. Server
>>> later uses callback object to send various notifications. This setup
>>> utilizes bidirectional GIOP, over SSLIOP.
>>>
>>> Everything works as desired, until client loses connectivity to server.
>>> When client re-registers, server adds the new Transport to Transport
>>> cache manager, however in some scenarios it does not remove the old
>>> transport, and keeps using it for callbacks, failing with CORBA::TIMEOUT.
>>>
>>> My understanding is that Transport_Cache_Manager keeps the hash map
>>> table of all connections. These connections have the same key, being
>>> issued from the same IP:port every time (in the example above,
>>> 192.168.12.113:7771). In some cases, the server does not replace the
>>> existing transport entry, but adds it with an increased index, and keeps
>>> using index:0 for making callbacks.
>>>
>>> I am attaching the portions of TAO logs. Note that second registration
>>> binds with index :1. The stale transport is kept with index :0.
>>>
>>> How do I control the content of Transport_Cache_Manager_T? I removed the
>>> references to callback objects from the server, however the transport is
>>> still cached.
>>>
>>> Thanks, Milan.
>>
>