[ciao-users] DAnCE: Plan Launcher throws CORBA MARSHAL exception

William R. Otte wotte at dre.vanderbilt.edu
Wed Mar 21 16:58:39 CDT 2018


Hi Holger -

Thanks for using the PRF.

Bearing in mind that I’m about six years removed from the last time I 
worked on DAnCE:

It looks to me like the error could be occurring on the node level (e.g. 
Node/Locality manager), not the domain level (execution manager).  Have 
you looked in detail at those logs?  Have you turned up the logging to 
debug level for both the node and domain infrastructure?

It would be helpful to attach those logs.

/-Will

On 21 Mar 2018, at 8:09, Haidinger, Holger wrote:

>     DAnCE VERSION: 1.2.3
>     TAO VERSION : 2.2.3
>     ACE VERSION : 6.2.3
>
>     HOST MACHINE and OPERATING SYSTEM:
>         Intel(R) Core(TM) i7-6700HQ
>         Microsoft Windows 10 Professional Version 1709
>         Windows Socket 2
>
>     TARGET MACHINE and OPERATING SYSTEM, if different from HOST:
>         Intel Mobile Core 2 Duo T5600
>         Microsoft Windows XP Professional Service Pack 3
>     COMPILER NAME AND VERSION (AND PATCHLEVEL):
>         Microsoft Visual Studio 2010 Version 10.0.40219.1 SP1Rel
>
>     THE $ACE_ROOT/ace/config.h FILE: #include "ace/config-win32.h"
>
>     THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE:
>         Not used due to Microsoft Visual C++
>
>     CONTENTS OF $ACE_ROOT/bin/MakeProjectCreator/config/default.features
>     (used by MPC when you generate your own makefiles): Not used
>
>     AREA/CLASS/EXAMPLE AFFECTED:
> No module failed to compile.
>
>     DOES THE PROBLEM AFFECT: EXECUTION
>         COMPILATION? No
>         LINKING? No
>         EXECUTION? Yes
>         OTHER (please specify)?
>
>     SYNOPSIS:
> Plan Launcher throws CORBA MARSHAL exception for large deployments.
>
>     DESCRIPTION:
> We have developed a framework of CCM components. Based on the various
> needs of our applications, we create deployment plans that make use
> of the components. Since establishing the framework we have created a lot
> of deployment plans with different numbers of component instances and
> connections.
>
> But recently we were faced with CORBA MARSHAL exceptions with our largest
> deployment plans. The exception occurs when the deployment plan is started.
>
> We define a large deployment plan as:
>
> - approximately 150 component instances
> - approximately 350 connection instances
>
> Here I'm providing an error log snippet:
>
>  (9300|6780) [LM_ERROR] -  13:43:25.043884 - Plan_Launcher::launch_plan -
>  Deployment failed, exception: Caught StartError  exception while invoking
>  finishLaunch: PLANXXX, 1 errors from node applications:        TestNode -
>  finishLaunch raised CORBA exception : system exception,
>  ID 'IDL:omg.org/CORBA/MARSHAL:1.0'
>  Unknown vendor minor code id (0), minor code = 0, completed = NO
>
> Here is a summary of the results we collected so far:
>
> - Most important: The CORBA MARSHAL exception can be easily reproduced on
>   weak PC hardware (Core 2 Duo target machine). On the host machine
>   (Core i7) the exception does not occur at all.
> - By reducing the number of CCM connections (via the deployment plan) we
>   could also reduce the number of occurrences of the CORBA MARSHAL exception.
> - For release builds we can almost always reproduce the exception; for
>   debug builds the exception occurs sporadically.
> - For debug builds we sometimes get the debug assertion
>   "Invalid allocation size: 4294967295 bytes." in dance_node_manager.exe.
>   In this scenario the CORBA MARSHAL exception always follows.
> - In a debug session we were able to locate the area where the exception
>   was thrown: The exception occurred when the plan_launcher called
>   finishLaunch() on the execution_manager. The execution_manager replies
>   with "SYSTEM_EXCEPTION:UNKNOWN_OBJECT" to this call.
>
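A side note on the debug assertion quoted above: 4294967295 is 2^32 - 1, which is the value a -1 takes when it is stored in a 32-bit unsigned integer (and size_t is 32 bits wide on a 32-bit Windows build). That may hint at a length or count being read as -1, or from corrupted or uninitialized data, just before an allocation. This is only a guess and has not been checked against the DAnCE or TAO sources. The helper names below are hypothetical and exist purely to illustrate the arithmetic:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical illustration only, not DAnCE or TAO code.
// Reinterprets a signed 32-bit value as an unsigned 32-bit length,
// the way a size field would if -1 slipped into it.
inline std::uint32_t as_unsigned_length(std::int32_t value)
{
  return static_cast<std::uint32_t>(value);
}

// Returns true when a 32-bit length has the all-ones bit pattern,
// i.e. the value reported by the assertion (4294967295).
inline bool is_suspicious_length(std::uint32_t length)
{
  return length == std::numeric_limits<std::uint32_t>::max();
}
```

A check like is_suspicious_length() placed in front of a suspect allocation in a debug build could help show which length field carries the bad value.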
> We would appreciate any help to solve the CORBA MARSHAL exception.
>
> At first glance this looks like a race condition, and we cannot completely
> exclude that it is caused by our own code. So we would also be grateful
> for any hints on how we could better narrow down the problem.
>
> If needed, we could also supply log files, call-stacks etc. But we would
> have to check that because log files may contain company-relevant data.
>
> Thank you.
>
>     REPEAT BY:
> Start a large deployment plan on weak PC hardware, see the section
> "DESCRIPTION" of the PRF for an explanation of "large deployment".
>
>     SAMPLE FIX/WORKAROUND:
> We have a small executable which is a wrapper for spawning the DAnCE
> runtime processes. We have implemented a retry strategy in the wrapper.
> If launching a plan fails, we try to re-launch the deployment plan for a
> configurable number of attempts. Currently this solves the problem for
> our production systems.
>
>
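For anyone reading along, the retry workaround described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the reporter's actual wrapper: launch_with_retry is a hypothetical name, and the `launch` callable stands in for whatever spawns the plan launcher and reports whether the launch succeeded.

```cpp
// Hypothetical helper, not part of DAnCE.
// Calls `launch` until it returns true or `max_attempts` calls were made.
// Returns the 1-based number of the attempt that succeeded,
// or 0 if every attempt failed (or max_attempts is not positive).
template <typename LaunchFn>
int launch_with_retry(LaunchFn launch, int max_attempts)
{
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (launch()) {
      return attempt;
    }
  }
  return 0;
}
```

Logging the returned attempt number would also give a cheap measure of how often the failure occurs on a given machine, which may help when narrowing down a suspected race condition.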