[tao-users] Notification Service sudden memory leak

Wed May 17 08:11:33 CDT 2017

     TAO VERSION: 2.1.6
     ACE VERSION: 6.1.6

     HOST MACHINE and OPERATING SYSTEM:
RHEL 6.3, using standard RPM

     DOES THE PROBLEM AFFECT:
         COMPILATION?
no
         LINKING?
no
         EXECUTION?
yes

     SYNOPSIS:
After running for some time, the TAO Notification Service starts leaking
large amounts of memory and eventually crashes (probably because it runs
out of memory).

     DESCRIPTION:
We use the tao-cosnotification service in an environment with mixed C++ 
(TAO) and Java applications. Most event channels have only one supplier 
and one or more consumers, some channels have several suppliers (either 
from C++ or Java apps).

After some time, we noticed that the memory usage (shown with top, %MEM)
starts to increase linearly, even though the load (number of events) is
unchanged. In our tests it took between 12 and 36 hours for the problem to
appear. Several channels are involved and we do not yet know what the
trigger is.

I started tao-cosnotification with -ORBDebugLevel 4 to gain some insight,
and it seems that there is an object whose refcount keeps increasing:

object:8a650e40 decr refcount = 4
object:8a650e40 incr refcount = 5
object:8a650e40 incr refcount = 6
object:8a650e40 incr refcount = 6
object:8a650e40 decr refcount = 5
object:8a650e40 decr refcount = 5
object:8a650e40 decr refcount = 4 # for many hours it is stable
[...]
object:8a650e40 decr refcount = 8125 # after some time a bit increased
[...]
object:8a650e40 decr refcount = 8538 # but now growing much faster
[...]
object:8b166010 incr refcount = 212842 # and very high
object:8b166010 incr refcount = 212843
object:8b166010 decr refcount = 212842
object:8b166010 incr refcount = 212843
object:8b166010 incr refcount = 212844
object:8b166010 decr refcount = 212843

I suspected that consumers not disconnecting properly from their channels
could be the cause. That does produce "some" leak, but it stabilizes and
does not grow further, so this does not seem to be the cause.

     QUESTIONS:
1.) Is such a problem known and fixed in newer versions?

2.) The object refcount shows 2 increments and only 1 decrement in this
"leaking" state. What could be the cause and how do I find out what this object
"8b166010" is all about?

3.) I wrote a test program to destroy all channels using ec->destroy(), but the
memory was not freed. What could be the cause? Is this expected?

4.) Is there a way to get timestamps into the debug log??!! :) Is ORBDebugLevel
4 enough? The logfile grew to ~30GB during this test, which is not easy to
handle ... I didn't dare to enable more logging :)

5.) What is the policy for disconnected consumers? Is there a way to
"improperly" disconnect and cause objects to be kept until the service exits?

     SAMPLE FIX/WORKAROUND:
Restart of tao-cosnotification is necessary :(

Kind regards,
Markus