Stress-Test the Plan, Not Just the Model

Stack Research

Before an agent acts, force its plan through a thousand bad futures. Ship what survives.

AI systems are built to produce the next answer. The better question is whether that answer still works when things go wrong.

A wind tunnel doesn’t predict the weather. It pushes a design through controlled turbulence to find where it breaks. Agent decisions should get the same treatment: fork the near future into hostile variants and see what survives.

Instead of “what should we do next?” — ask “what keeps working if reality turns against us?”

How It Works

Before a high-impact action is approved, the system runs the proposed plan across many plausible bad futures:

  • A dependency fails 20 minutes earlier than expected.
  • A credential rotates mid-workflow.
  • Two agents disagree on state freshness.
  • A “read-only” tool returns stale data.
  • A vendor API changes schema without notice.

The output isn’t a polished paragraph. It’s a failure map: which assumptions broke first, which actions couldn’t be undone, which constraints were violated, and which fallback path still completed safely.

Why This Matters

In operations, security, and governance, one robust plan beats one elegant answer. If the first surprise collapses the plan, the quality of the wording doesn’t matter.

Testing against hostile futures shifts the goal:

  • From “best answer” to “lowest regret.”
  • From single-path optimization to resilience under variation.
  • From persuasive output to auditable breakpoints.

This matters because real systems don’t fail all at once. They fail at weak seams under pressure.

What to Build

A practical version needs five components:

  1. State encoder. Convert live context into typed entities, constraints, and trust boundaries.
  2. Fork engine. Generate bounded counterfactual variants using explicit mutation rules.
  3. Policy runner. Replay candidate plans through tool mocks or sandboxed executors.
  4. Causal ledger. Record exact transition chains for every failed and successful branch.
  5. Decision contract. Produce a machine-readable output: allowed actions, blocked actions, and reversal cost.

Automation should execute the contract, not free-form text.
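The decision contract in step 5 can be as small as a typed record that downstream automation consumes directly. A sketch, with field names and values invented for illustration:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DecisionContract:
    """Machine-readable verdict from the stress test."""
    plan_id: str
    allowed_actions: list = field(default_factory=list)
    blocked_actions: list = field(default_factory=list)
    reversal_cost: dict = field(default_factory=dict)  # action -> unwind cost
    survival_rate: float = 0.0  # fraction of hostile branches survived

contract = DecisionContract(
    plan_id="triage-0042",
    allowed_actions=["isolate_host", "rotate_token"],
    blocked_actions=["close_incident"],  # failed under delayed token rotation
    reversal_cost={"isolate_host": 1.0, "rotate_token": 3.5},
    survival_rate=0.81,
)

# Automation executes this JSON, not the agent's prose.
print(json.dumps(asdict(contract), indent=2))
```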

Example: Incident Triage

An agent proposes: isolate host, rotate token, restart service, close incident. On the happy path, this looks right.

Now fork the environment:

  • Token rotation API is delayed.
  • Host isolation succeeds but telemetry lags.
  • Restart triggers autoscaler churn.
  • A second related alert fires during step 3.

Score the plan by what actually matters:

  • Did it preserve evidence?
  • Did it avoid irreversible cleanup too early?
  • Did it depend on hidden assumptions about tool latency?
  • Did it still converge when event order changed?

Only recommend plans that stay safe across a threshold of hostile branches.
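The gate can be expressed directly: a branch "survives" only if every safety check above held, and a plan is approved only if enough branches survive. A minimal sketch with illustrative check names and an assumed 0.8 threshold:

```python
def survives(branch_result):
    """A branch survives only if every safety check held."""
    return all([
        branch_result["evidence_preserved"],
        not branch_result["irreversible_cleanup_early"],
        not branch_result["hidden_latency_assumption"],
        branch_result["converged_after_reorder"],
    ])

def approve(plan_id, branch_results, threshold=0.8):
    """Return (approved, survival_rate) for a plan across hostile branches."""
    rate = sum(survives(r) for r in branch_results) / len(branch_results)
    return rate >= threshold, rate

results = [
    {"evidence_preserved": True, "irreversible_cleanup_early": False,
     "hidden_latency_assumption": False, "converged_after_reorder": True},
    {"evidence_preserved": True, "irreversible_cleanup_early": True,
     "hidden_latency_assumption": False, "converged_after_reorder": True},
]
ok, rate = approve("triage-0042", results)
# rate == 0.5, so the plan is not approved at a 0.8 threshold
```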

Measure It

  • Branch survival rate — percent of counterfactual branches where policy goals are met without violations.
  • Reversal cost — estimated cost to unwind wrong actions after new evidence arrives.
  • Assumption density — unstated assumptions required per action.
  • Cascade potential — how often local failures become cross-system failures.
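All four metrics fall out of the causal ledger's per-branch records. A minimal aggregation, with record fields assumed rather than standardized:

```python
def metrics(ledger):
    """Aggregate stress-test metrics from per-branch ledger records."""
    n = len(ledger)
    total_actions = sum(b["action_count"] for b in ledger)
    return {
        # percent of branches meeting policy goals without violations
        "branch_survival_rate": sum(
            b["goals_met"] and not b["violations"] for b in ledger) / n,
        # mean estimated cost to unwind wrong actions
        "reversal_cost": sum(b["unwind_cost"] for b in ledger) / n,
        # unstated assumptions required per action
        "assumption_density": sum(
            len(b["unstated_assumptions"]) for b in ledger) / total_actions,
        # how often local failures became cross-system failures
        "cascade_potential": sum(b["cross_system_failure"] for b in ledger) / n,
    }

ledger = [
    {"goals_met": True, "violations": [], "unwind_cost": 0.0,
     "unstated_assumptions": ["token_api_fast"], "action_count": 4,
     "cross_system_failure": False},
    {"goals_met": False, "violations": ["evidence_lost"], "unwind_cost": 12.0,
     "unstated_assumptions": [], "action_count": 4,
     "cross_system_failure": True},
]
m = metrics(ledger)
# branch_survival_rate == 0.5, cascade_potential == 0.5
```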

The Hard Parts

Bad simulators create false confidence. Branch generation can explode combinatorially. Teams will skip simulation under delivery pressure. These are engineering problems, not reasons to avoid the approach.

The Point

The next useful step for agent systems isn’t smoother language. It’s stress-tested decisions.

Build systems that can show where they break, replay why they broke, and earn execution rights by surviving structured adversity.