[Ace-users] [ace-bugs] [ACE_Message_Queue] notify PIPE block causes Select_Reactor deadlock.

Thu Feb 28 13:09:26 CST 2008

Doug, thanks for the answer. In my experience the deadlock was not 
simple to reproduce, at least with my application. For the PC, it 
seemed to occur only on Vista (not on Windows XP) and while doing 
something specific. The whole thing worked fine 99% of the time. 
It also occurred on one Unix platform, though I forget which one. 

My main app has two threads, and talks to a remote server. The socket 
i/o to the remote server is done in thread2. It seemed that the problem 
occurred when thread1 was enqueuing lots of messages to thread2, and 
when thread2 was too busy with socket i/o with the remote server to 
dequeue the messages from thread1.
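
To make the failure mode concrete, here is a minimal sketch of the
pattern as I understand it. It uses only the C++ standard library, not
ACE, and it is a single-threaded simulation rather than a reproduction
of the bug. NotifyPipe and count_would_block are hypothetical names
invented for this illustration, and the capacity of 4 is arbitrary.
The sketch assumes that each enqueue writes one token to a bounded
notify pipe that only the reactor thread drains, so it counts the
writes that would have blocked had they been blocking writes.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for the reactor's notify pipe: a bounded buffer.
// This is NOT the ACE implementation, only a model of "a pipe can fill up".
class NotifyPipe {
public:
    explicit NotifyPipe(std::size_t capacity) : capacity_(capacity) {}

    // Non-blocking write: returns false where a blocking write would stall.
    bool try_write(int token) {
        if (tokens_.size() >= capacity_) {
            return false;  // pipe full
        }
        tokens_.push_back(token);
        return true;
    }

    // What the consumer (thread2 in my description) does when it finally
    // gets around to draining the pipe.
    bool try_read(int& token) {
        if (tokens_.empty()) {
            return false;
        }
        token = tokens_.front();
        tokens_.pop_front();
        return true;
    }

private:
    std::size_t capacity_;
    std::deque<int> tokens_;
};

// Simulates thread1 enqueuing `messages` notifications while thread2 is
// too busy to drain any of them. Returns how many writes found the pipe
// full, i.e. how many times a blocking write would have stalled.
std::size_t count_would_block(NotifyPipe& pipe, std::size_t messages) {
    std::size_t blocked = 0;
    for (std::size_t i = 0; i < messages; ++i) {
        if (!pipe.try_write(static_cast<int>(i))) {
            ++blocked;
        }
    }
    return blocked;
}
```

With a capacity of 4 and 10 notifications, the first 4 writes succeed
and the remaining 6 find the pipe full. If the thread that stalls on
such a write is also needed to drain the pipe, nothing can make
progress.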

I did try changing my code slightly to avoid the deadlock, but to
no avail. 

In the end I patched ACE (as you suggested). 

I don't think it is trivial to create a test case reproducing 
the deadlock. It is surely doable, but it might require some time, 
which I am lacking at the moment. 

Thanks,

greg

-----Original Message-----
From: schmidt at dre.vanderbilt.edu [mailto:schmidt at dre.vanderbilt.edu] 
Sent: Wednesday, February 27, 2008 5:17 PM
To: Greg Popovitch
Cc: Johnny Willemsen; Rudy Pot; ace-bugs at cs.wustl.edu
Subject: Re: [ace-bugs] [ACE_Message_Queue] notify PIPE block
causes Select_Reactor deadlock. 

Hi Greg,

> Thanks for the answer, and thanks for the great work with ACE!

You are very welcome!

> I agree that having a non-regression test case for the deadlock
> problem is useful. If there had been one, the 2003 fix that
> reintroduced the deadlock problem would probably not have happened.

Right!!

> However, having a test case for the deadlock issue would not help you
> create a fix avoiding the race conditions. If it were me, I'd do the
> "obvious" fix for the deadlock (after running all the available tests)
> and deal with the race conditions correctly if and when they
> re-occur.

The problem, of course, is that race conditions are hard to detect, so a
program may appear to work during simple smoke testing, when in fact it
has a bug that appears in production.  In contrast, the deadlock
situation manifests itself fairly easily, so it can be detected and
addressed, e.g., by designing the program to avoid a deadlock.

> Otherwise, we put ourselves in a deadlock. We are willing to fix this
> only if we can make sure we don't create race conditions, but we are
> not able to verify this because we don't have a test case for those
> race conditions.

Right, but hopefully you can understand why we don't want to apply a fix
that is known to cause problems since it will break a lot of production
code in subtle and pernicious ways.

> IMHO, the immediate problem is that we don't have a test case for
> reproducing the "race conditions"

How about we do the following:

1. Add a regression test for the current deadlock case.

2. Add suggested fixes for the current problem to bugzilla so they don't
   get lost.

3. Try to create a regression test that demonstrates the race
   condition. 

4. If we get #3 then we can enhance the solutions for #2 to avoid both
   problems #1 and #3.

In the meantime, you can patch your code locally to avoid the deadlock -
such is the power of open-source ;-)

Thanks,

        Doug