[cadynce] How Many Configurations Is Enough?

Tue May 22 17:23:56 CDT 2007

Raj -

> Ideally, 
> we should keep going until the last machine is standing 

Right! This is what Carol calls the battle-short condition, aka Go Down
Fighting. Forget all reserve requirements and just make the most
important missions work. 

> To tolerate such a high 
> failure rate, your results show that we need to store many 
> more configurations (orders of magnitude more).   
> Given that 
> we cannot test a huge number of configurations offline,...

The data I posted yesterday is really a pretty thin reed. We'll need a
more detailed examination of the problem before making any serious
changes in technical direction. For example, my analysis assumed that
all computers are identical, that the damage is to a random subset of
computers, and it asked only for the probability that, after the damage,
all app strings would still be running, having suffered at most some
failovers. A better analysis would account for co-failure probabilities
- at least that computers in one data center are more likely to go down
together than computers in different data centers. Also, the analysis
should distinguish app strings that survive untouched, app strings that
survive after some failovers, app strings that can be recovered by
restarting them, and app strings that cannot be recovered at all. 
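
To illustrate why co-failure matters, here is a rough Monte Carlo
sketch, not the analysis behind the posted graphs. It compares
independent damage with damage that can take out a whole data center at
once, using the same crude survival test as the toy model (enough
computers left standing). The function names, the 100-computer fleet,
the 4 data centers, and all probabilities are hypothetical; only the
figure of 43 needed computers comes from the toy model.

```python
import random


def surviving_count_independent(n_computers, p_fail, rng):
    """Count survivors when each computer fails independently."""
    return sum(1 for _ in range(n_computers) if rng.random() >= p_fail)


def surviving_count_correlated(n_centers, per_center, p_center, p_single, rng):
    """Count survivors when a whole data center can be lost at once.

    Each center is lost with probability p_center; computers in a
    surviving center then fail independently with probability p_single.
    """
    alive = 0
    for _ in range(n_centers):
        if rng.random() < p_center:
            continue  # every computer in this center goes down together
        alive += sum(1 for _ in range(per_center) if rng.random() >= p_single)
    return alive


def estimate_survival(sample, needed, trials):
    """Fraction of trials in which at least `needed` computers survive."""
    return sum(1 for _ in range(trials) if sample() >= needed) / trials


if __name__ == "__main__":
    rng = random.Random(2007)
    needed = 43  # minimum computer count from the toy model
    # Hypothetical numbers, chosen so both models lose 30% of the
    # computers on average: 1 - 0.8 * 0.875 = 0.3.
    indep = estimate_survival(
        lambda: surviving_count_independent(100, 0.3, rng), needed, 20000)
    corr = estimate_survival(
        lambda: surviving_count_correlated(4, 25, 0.2, 0.125, rng), needed, 20000)
    print("independent damage:", indep)
    print("correlated damage: ", corr)
```

Both damage models have the same average loss, so any difference in the
two estimates comes from the correlation alone. Distinguishing untouched,
failed-over, restartable, and unrecoverable app strings would need a
richer model than this head count.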

Also, regarding the need to store many configurations, recall that Carol
has agreed (verbally, informally) that certain modifications to a
certified configuration, such as swapping all apps on one shelf with the
apps on another shelf in the same cabinet, give rise to equally
certified (but never tested) configurations. It wouldn't be surprising
to find that a single stored, certified configuration could spawn many
thousands of certified configurations in this way, without the need to
store any but the first.
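
A back-of-the-envelope sketch of how fast this grows, under assumptions
that go beyond what Carol agreed to: that shelf swaps can be composed
freely (so any reordering of the shelves within a cabinet is allowed),
that cabinets are reordered independently of one another, and that each
reordering yields a distinct configuration. The function name and the
shelf and cabinet counts are hypothetical.

```python
from math import factorial


def shelf_swap_variants(shelves_per_cabinet, cabinets):
    """Configurations reachable from one stored configuration.

    Assumes every ordering of the shelves within a cabinet is allowed
    and that cabinets are permuted independently of each other.
    """
    return factorial(shelves_per_cabinet) ** cabinets


if __name__ == "__main__":
    # Hypothetical layout: 3 shelves per cabinet, 2 cabinets.
    print(shelf_swap_variants(3, 2))  # 3! ** 2 = 36
```

With 4 shelves per cabinet, 3 cabinets already give 24 ** 3 = 13824
variants from a single stored configuration.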

(Applying this argument with enthusiasm to the toy model I used to
generate the graphs leads to the conclusion that only one allocation is
ever needed: pack the applications into the smallest possible number of
computers - say 43 of them. Then if there are fewer than 43 computers
left operating, we're screwed; if there are 43 or more left operating,
allocate the applications to the appropriately permuted set of
computers. Bada bing. Did I mention that it was a thin reed?)
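
The single-allocation argument can be sketched as follows, assuming (as
the toy model does) that all computers are identical. The function name
and the data layout are hypothetical: `packed` holds one list of apps
per logical computer slot, and `operating` holds the ids of the
computers still running.

```python
def remap_allocation(packed, operating):
    """Place one packed allocation onto whatever computers survive.

    Returns a dict mapping physical computer id -> list of apps, or
    None when fewer computers are operating than the packing needs.
    """
    hosts = sorted(operating)
    if len(hosts) < len(packed):
        return None  # fewer survivors than slots: no feasible placement
    # Computers are interchangeable, so any assignment of slots to
    # survivors works; take the lowest-numbered survivors in order.
    return {host: list(apps) for host, apps in zip(hosts, packed)}


if __name__ == "__main__":
    packed = [["radar", "track"], ["display"]]  # hypothetical app names
    print(remap_allocation(packed, {9, 2, 5}))  # uses computers 2 and 5
    print(remap_allocation(packed, {7}))        # None: too few computers
```

Once computers differ in capacity or location, the permutation step
stops being free, which is exactly where this argument breaks down.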

> ...it seems to me that we should have the ability to 
> 
> 1.	Generate feasible configurations on the fly given 
> constraints at that point in time.   Since the space of 
> configurations can be large - albeit with scaling back 
> requirements or running only the most critical tasks that 
> would fit, we should be able to find one.  We do not have to 
> find a feasible configuration if it exists; we just have to 
> find a reasonable one that will run under the given constraints.

If I were a sailor on a burning ship, I'd sure agree with you. But the
Navy's attitude has been that on-the-fly generation of configurations
without something real close to certification-quality guarantees is
worse than not operating at all. (A really bad configuration could, e.g.,
bring down the CEC network of all other ships in the battlespace. Or
launch nukes.) 

What brought us to our present position is that nobody has any idea how
to certify an on-the-fly configuration generator, at least not to the
Navy's satisfaction. 

- Joe