[Ace-users] Re: [ciao-users] redundant component groups

Matthew Gillen mgillen at bbn.com
Wed Jun 27 11:54:02 CDT 2007


Hi Friedhelm,

Friedhelm Wolf wrote:
> SYNOPSIS:
>     Which possibilities exist to have fault tolerance in CIAO and DAnCE?
> 
> DESCRIPTION:
> Hi there,
> 
> The basic challenge this post is about concerns the fault-tolerant behavior
> of a system of CCM components. So, before becoming too specific:
> Does anyone have experience with implementing CCM systems
> that have to be very reliable in terms of uptime/availability?
> What CCM mechanisms are there to achieve this?

There isn't really anything standardized on this front, or if it is
standardized, it isn't implemented.

We completed a project recently where we (with the Vanderbilt/DOC group and a
team from CMU) implemented a fault tolerance infrastructure for our particular
CCM components.  We built on MEAD (http://www.ece.cmu.edu/~mead/) which was
originally designed for CORBA-2 servers.  You can read one of the papers we
published about the project:
http://www.dist-systems.bbn.com/papers/2006/JCP/index.shtml

While we achieved our goals for the project as far as providing FT for our CCM
components, I wouldn't say that anything we did was standardized (esp. since
the version of MEAD we started with is ORB-specific).

On the one hand, the changes that were required in CIAO/DAnCE were relatively
small.  On the other hand, not all types of assemblies were supported: we had
a collection of independent, collocated assemblies (our unit of replication
was a process, and every assembly fit into a single process) and they were
hooked up to each other via the Naming Service.  In other words, we didn't
need to support distributed assemblies.


> To become more specific now:
> The following approach is inspired by the CORBA fault tolerance service:

(You should note that, AFAIK, no one ever implemented what the CORBA-FT spec
describes.)

> The basic idea is that a group of components (having interdependencies)
> provides some services which need to have very high availability.
> So all components will be instantiated more than once to have a
> redundant backup (keeping these components in sync might be necessary,
> depending on the component type, but this is not in the scope of this
> question). If one of these components fails (assuming that there is a way
> to find out when a component fails ... usually through CORBA exceptions),
> it will not only be necessary to replace this single component with its
> backup, but also to inform the whole component group to reconnect to the
> correct component.
> 
> Can you give me some advice on how to achieve this using standard CCM
> mechanisms?
> 
> I think that ReDaC might aim in this direction.

The DOC group was looking at something along these lines towards the end of
the aforementioned project.  ReDaC is certainly key functionality for what
you're talking about.

> Is it possible to dynamically create an assembly file, which reflects the
> necessary connection changes to integrate a backup component instead of an
> unresponsive component?

The DOC group's RACE framework is designed to dynamically create assembly
files, so it's certainly a candidate for this type of thing.  I'm not sure how
far they've gotten, but FT-for-CCM is listed as one of the things the DOC
group is working on:
http://www.dre.vanderbilt.edu/projects/ (bullet #6)

> Can you foresee any technical or performance problems that would
> conflict with such an approach?

Relying on CORBA exceptions for identifying failures will lead to very bad
fail-over times (unless you tweak your OS's TCP stack).  Specifically, if the
machine goes belly up, the normal TCP timeout setting can be upwards of a
minute, which is long no matter what your application is doing.  (Note: you
can get very fast CORBA exceptions if the remote machine stays up and only
the component's process dies.  However, handling this case quickly is of very
limited value, IMHO.)
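
As an illustration of detecting failures without waiting on the TCP timeout,
here is a minimal sketch of an application-level heartbeat monitor.  This is a
generic technique, not a description of what MEAD or the project above did.
It is in Python for brevity, and HeartbeatMonitor, heartbeat(), failed_peers()
and the 0.5 second timeout are all hypothetical, not CIAO/DAnCE APIs.

```python
import time


class HeartbeatMonitor:
    """Flags a peer as failed when no heartbeat arrived within `timeout` seconds.

    Hypothetical helper for illustration only; not part of CIAO, DAnCE or MEAD.
    """

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated per peer
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # peer name -> time of last heartbeat

    def heartbeat(self, peer):
        """Record that `peer` was alive just now."""
        self.last_seen[peer] = self.clock()

    def failed_peers(self):
        """Return the sorted names of peers silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=0.5)
    monitor.heartbeat("replica-1")
    print(monitor.failed_peers())
```

Each replica would report in periodically via heartbeat(), and anything
returned by failed_peers() becomes a candidate for fail-over.  The detection
time is then bounded by the timeout you choose rather than by the OS's TCP
settings, at the cost of extra traffic and the risk of false positives if the
timeout is set too aggressively.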

The state management of your components will be non-trivial (actually, it's
probably the hardest part of any FT framework), especially if your components
have state that doesn't have a CORBA data type associated with it (e.g. a
std::map, a database connection, etc.).
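
To make the state problem concrete, here is a minimal sketch of externalizing
native container state so that a backup can be brought in sync.  It assumes a
hypothetical get_state()/set_state() pair on a toy component (Counter is made
up for illustration), uses Python for brevity, and uses JSON bytes as a
stand-in for whatever CORBA-typed sequence you would define in IDL.

```python
import json


class Counter:
    """Toy component whose state lives in a native dict (hypothetical example)."""

    def __init__(self):
        self.hits = {}  # native container with no wire representation of its own

    def record(self, key):
        """Count one occurrence of `key`."""
        self.hits[key] = self.hits.get(key, 0) + 1

    def get_state(self):
        """Flatten the dict into sorted [key, value] pairs and encode as bytes."""
        pairs = sorted(self.hits.items())
        return json.dumps(pairs).encode("utf-8")

    def set_state(self, blob):
        """Rebuild the dict from bytes produced by get_state()."""
        pairs = json.loads(blob.decode("utf-8"))
        self.hits = {key: value for key, value in pairs}


if __name__ == "__main__":
    primary, backup = Counter(), Counter()
    primary.record("login")
    backup.set_state(primary.get_state())
    print(backup.hits)
```

Note that this only works for state that can be flattened into plain values.
Something like a database connection cannot be copied this way and has to be
re-established on the backup.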

I'll let the DOC folks answer the remainder of your questions.

HTH,
Matt

> Aside from technical issues:
> ReDaC seems to be a nonstandard enhancement of the CCM spec by DAnCE.
> Is that correct?
> 
> I ask this because the project I work on has to be based only on
> standard specifications at this level of the design. So, are there any
> efforts to standardize the redeployment features?
> 
> Are there other standard CCM features that I didn't think about which
> could provide fault tolerance at the component assembly level?
> 
> Thanks for reading all through this rather long explanation and for
> giving any helpful remarks,
> Friedhelm
> 


