[ciao-users] DAnCE: Plan Launcher throws CORBA MARSHAL exception

Wed Mar 21 08:09:23 CDT 2018

    DAnCE VERSION: 1.2.3
    TAO VERSION : 2.2.3
    ACE VERSION : 6.2.3

    HOST MACHINE and OPERATING SYSTEM:
        Intel(R) Core(TM) i7-6700HQ
        Microsoft Windows 10 Professional Version 1709
        Windows Socket 2

    TARGET MACHINE and OPERATING SYSTEM, if different from HOST:
        Intel Mobile Core 2 Duo T5600
        Microsoft Windows XP Professional Service Pack 3
    COMPILER NAME AND VERSION (AND PATCHLEVEL):
        Microsoft Visual Studio 2010 Version 10.0.40219.1 SP1Rel

    THE $ACE_ROOT/ace/config.h FILE: #include "ace/config-win32.h"

    THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE:
        Not used due to Microsoft Visual C++

    CONTENTS OF $ACE_ROOT/bin/MakeProjectCreator/config/default.features
    (used by MPC when you generate your own makefiles): Not used

    AREA/CLASS/EXAMPLE AFFECTED:
No module failed to compile.

    DOES THE PROBLEM AFFECT: EXECUTION
        COMPILATION? No
        LINKING? No
        EXECUTION? Yes
        OTHER (please specify)?

    SYNOPSIS:
Plan Launcher throws CORBA MARSHAL exception for large deployments.

    DESCRIPTION:
We have developed a framework of CCM components. Based on the various
needs of our applications we create deployment plans which make use
of these components. Since we established the framework we have created a lot
of deployment plans with different numbers of component instances and
connections.

Recently, however, we have been faced with CORBA MARSHAL exceptions with our
largest deployment plans. The exception occurs when the deployment plan is
started.

We define a large deployment plan as:

- approximately 150 component instances
- approximately 350 connection instances

Here I'm providing an error log snippet:

 (9300|6780) [LM_ERROR] -  13:43:25.043884 - Plan_Launcher::launch_plan - 
 Deployment failed, exception: Caught StartError  exception while invoking 
 finishLaunch: PLANXXX, 1 errors from node applications:        TestNode - 
 finishLaunch raised CORBA exception : system exception, 
 ID 'IDL:omg.org/CORBA/MARSHAL:1.0'
 Unknown vendor minor code id (0), minor code = 0, completed = NO

Here is a summary of the results we collected so far:

- Most important: The CORBA MARSHAL exception can be easily reproduced on weak
  PC hardware (Core 2 Duo target machine). On the host machine (Core i7) the
  exception does not occur at all.
- By reducing the number of CCM connections (via deployment plan) we could also
  reduce the number of occurrences of the CORBA MARSHAL exception.
- For release builds we can almost always reproduce the exception; for debug
  builds the exception occurs only sporadically.
- For debug builds we sometimes get the debug assertion
  "Invalid allocation size: 4294967295 bytes." in dance_node_manager.exe.
  In this scenario the CORBA MARSHAL exception always follows.
- In a debug session we were able to locate the area where the exception was
  thrown: The exception occurred when the plan_launcher called finishLaunch()
  on the execution_manager. The execution_manager replies with
  "SYSTEM_EXCEPTION:UNKNOWN_OBJECT" to this call.

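A note on the number in the debug assertion: 4294967295 is 2^32 - 1, the value
a 32-bit unsigned integer takes when -1 is converted to it or when a
subtraction wraps below zero. The sketch below only illustrates that
arithmetic. The helper remaining_length is hypothetical and nothing here is
taken from DAnCE or TAO sources; whether a 32-bit length is actually computed
this way at the failing location is an assumption.

```cpp
#include <cstdint>

// Hypothetical helper, not part of DAnCE or TAO: computes how many bytes
// remain using 32-bit unsigned arithmetic. Unsigned subtraction wraps
// around instead of going negative.
std::uint32_t remaining_length(std::uint32_t total, std::uint32_t consumed)
{
    return static_cast<std::uint32_t>(total - consumed);
}
```

For example, remaining_length(0u, 1u) and static_cast<std::uint32_t>(-1) both
evaluate to 4294967295, the size reported by the assertion.
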
We would appreciate any help to solve the CORBA MARSHAL exception.

At first glance this looks like a race condition, and we cannot completely
rule out that it is caused by our own code. So we would also be grateful for
any hints on how we could better narrow down the problem.

If needed, we could also supply log files, call stacks etc., but we would have
to review them first because the log files may contain company-confidential
data.

Thank you.

    REPEAT BY:
Start a large deployment plan on weak PC hardware, see the section "DESCRIPTION"
of the PRF for an explanation of "large deployment".

    SAMPLE FIX/WORKAROUND:
We have a small executable which is a wrapper for spawning the DAnCE runtime
processes. We have implemented a retry strategy in the wrapper: if launching a
plan fails, the wrapper tries to re-launch the deployment plan, up to a
configurable number of attempts. Currently this solves the problem for our
production systems.
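
The retry strategy can be sketched roughly as follows. This is a minimal
illustration, not the code of our wrapper: launch_with_retry and the
launch_once callback are hypothetical names, and the callback stands in for
whatever starts the plan launcher and reports success or failure. Process
spawning and everything specific to DAnCE is left out.

```cpp
#include <functional>
#include <iostream>

// Hypothetical sketch: calls launch_once until it reports success or
// max_attempts attempts have been made. Returns true on success.
bool launch_with_retry(const std::function<bool()>& launch_once,
                       unsigned max_attempts)
{
    for (unsigned attempt = 1; attempt <= max_attempts; ++attempt)
    {
        if (launch_once())
        {
            return true;
        }
        // Report each failed attempt so sporadic failures stay visible.
        std::cerr << "Plan launch attempt " << attempt << " of "
                  << max_attempts << " failed\n";
    }
    return false;
}
```

A caller would pass the configured attempt count, for example
launch_with_retry(start_plan, 3), where start_plan is an assumed function that
returns true once the plan has been launched. Note that retrying only hides
the failure; it does not address the underlying MARSHAL exception.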