Let Machines Talk: Containing Failure
Every system fails. The question isn't whether yours will; it's how much it takes down when it does. The engineering that matters isn't preventing failure but deciding in advance how far failure travels.
Blast radius is the total damage a single failure can cause before anyone intervenes. A system with a small blast radius breaks and takes nothing else with it. A system with a large blast radius breaks and takes the building down. The difference is never luck. It’s architecture.
Permissions Are the First Boundary
The fastest way a system causes widespread damage is by having access to things it doesn’t need. An agent that can read and write to every database, call every API, and access every credential has an unlimited blast radius by default. One bad decision propagates everywhere.
Scoped permissions are the most basic containment tool:
- Least privilege. The system gets the minimum access required for its current task. Not its potential tasks. Not its future tasks. The task it’s doing right now.
- Short-lived credentials. Access tokens that expire in minutes, not days. If a system is compromised, the window of exploitation is bounded by the credential lifetime.
- Per-task scoping. Each unit of work gets its own permission set. A system processing invoices doesn’t need access to the HR database, even if the same underlying agent handles both workflows.
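The three practices above can be combined in one mechanism: issue a short-lived token per task, carrying only that task's scopes. This is a minimal sketch; `ScopedToken`, `issue_token`, and the scope names are hypothetical illustrations, not any particular vendor's API.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedToken:
    # Hypothetical token: grants only the listed scopes, and only until expiry.
    scopes: frozenset
    expires_at: float

    def allows(self, scope: str) -> bool:
        return scope in self.scopes and time.monotonic() < self.expires_at


def issue_token(task_scopes, ttl_s: float = 300.0) -> ScopedToken:
    """Issue a least-privilege token for one task, valid for minutes, not days."""
    return ScopedToken(frozenset(task_scopes), time.monotonic() + ttl_s)


# An invoice-processing task gets invoice access only; never the HR database.
token = issue_token({"invoices:read", "invoices:write"})
assert token.allows("invoices:read")
assert not token.allows("hr:read")
```

The key design choice is that the token is scoped to the task being done right now, so even a compromised token exposes one narrow slice of the system for a bounded window.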
The common failure here is convenience. Granting broad permissions is easier than scoping them correctly. It saves time during development and eliminates a whole class of “access denied” errors. It also means that when something goes wrong, everything is exposed.
Resource Boundaries
Permissions control what a system can touch. Resource boundaries control how much.
- Rate limits. Cap the number of actions per unit of time. A system that can send 10 API calls per second causes less damage in 30 seconds than one that can send 10,000.
- Budget caps. Set hard spending limits on compute, API calls, storage writes, and external service usage. When the cap is hit, the system stops. Not gracefully degrades — stops.
- Concurrency limits. Control how many operations run in parallel. A runaway process that spawns unlimited threads can exhaust an entire cluster. One that’s capped at 5 concurrent operations can’t.
- Output volume limits. Cap the amount of data a system can write, send, or publish in a given window. This prevents a malfunctioning agent from flooding a message queue, filling a disk, or spamming an external service.
Resource boundaries are blunt instruments. They don’t distinguish between legitimate high-volume work and a system going haywire. That’s fine. The point isn’t to be smart. The point is to set a ceiling on damage that holds regardless of what the system is doing or why.
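A rate limit and a budget cap can share one gate in front of every action. The sketch below assumes a token-bucket limiter (one common rate-limiting technique) with a hard budget ceiling; the class name and numbers are illustrative.

```python
import time


class BoundedExecutor:
    """Caps actions per second (rate limit) and total actions (budget cap)."""

    def __init__(self, rate_per_s: float, budget: int):
        self.rate = rate_per_s
        self.budget = budget        # hard ceiling: hitting it stops the system
        self.tokens = rate_per_s    # token bucket, refilled continuously
        self.last = time.monotonic()

    def try_act(self) -> bool:
        if self.budget <= 0:
            # Not graceful degradation -- a full stop, as the text prescribes.
            raise RuntimeError("budget exhausted: stopping")
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            return False            # over the rate limit; caller must wait
        self.tokens -= 1
        self.budget -= 1
        return True
```

Note the asymmetry: exceeding the rate limit is recoverable (return `False`, retry later), while exhausting the budget raises, because the budget is the ceiling on total damage.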
Isolation by Default
Two systems that share infrastructure share failure modes. A database outage takes down every service connected to it. A network partition affects every system in the same subnet. A CPU spike from one workload starves everything on the same host.
Isolation means accepting the overhead of separation in exchange for independent failure:
- Separate compute. Critical workloads run on dedicated infrastructure, not shared clusters. The cost is higher. The blast radius is smaller.
- Separate data stores. Each system owns its data. No shared databases. Communication happens through explicit interfaces — APIs, message queues, event streams — not shared tables.
- Network segmentation. Systems that don’t need to talk to each other can’t. Firewall rules and network policies enforce this at the infrastructure level, not the application level.
- Credential isolation. Each system has its own credentials. Compromising one system’s keys doesn’t give access to another system’s resources.
The temptation is to share. Shared databases are simpler. Shared clusters are cheaper. Shared credentials are easier to manage. Every shared resource is also a shared failure path.
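In production, "systems that don't need to talk to each other can't" is enforced by firewall rules or network policies at the infrastructure level. The sketch below only illustrates the underlying default-deny principle in application code; the service names and allow-list are hypothetical.

```python
# Hypothetical allow-list: the only permitted service-to-service edges.
# Everything not listed is denied by default.
ALLOWED_EDGES = {
    ("billing", "invoices-db"),
    ("billing", "payment-gateway"),
    ("reporting", "invoices-db"),
}


def may_connect(src: str, dst: str) -> bool:
    """Default deny: a connection is legal only if explicitly listed."""
    return (src, dst) in ALLOWED_EDGES


assert may_connect("billing", "payment-gateway")
assert not may_connect("reporting", "payment-gateway")  # reporting never reaches payments
```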
Transactions and Reversibility
Blast radius isn’t just about what breaks. It’s about what can be fixed.
A system that writes directly to a production database with no transaction boundaries and no audit log has an effectively infinite blast radius — not because the failure is large, but because the damage is invisible and irreversible. You don’t know what changed, when it changed, or how to change it back.
Reversibility shrinks the effective blast radius after the fact:
- Transactional writes. Group related changes into atomic operations. Either all of them succeed or none of them do. No half-written state to untangle.
- Append-only logs. Never overwrite. Every state change is a new record. The previous state is always recoverable.
- Compensation logic. For operations that can’t be rolled back (sent emails, published records, external API calls), build explicit compensation paths. What’s the undo for this action? If there isn’t one, the action needs extra scrutiny before execution.
- Soft deletes. Don’t destroy data. Mark it deleted and enforce that at the application layer. Real deletion is a separate, deliberate, audited operation.
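Append-only storage and soft deletes compose naturally: a delete is just another appended record. A minimal in-memory sketch, assuming a durable append-only file in a real system:

```python
import time


class AppendOnlyStore:
    """Never overwrites: every change is a new record, so prior state is recoverable."""

    def __init__(self):
        self._log = []  # in a real system: a durable, append-only log

    def write(self, key, value):
        self._log.append({"ts": time.time(), "key": key,
                          "value": value, "deleted": False})

    def soft_delete(self, key):
        # Mark deleted instead of destroying. Real deletion would be a
        # separate, deliberate, audited operation.
        self._log.append({"ts": time.time(), "key": key,
                          "value": None, "deleted": True})

    def read(self, key):
        # Latest record wins; a soft-deleted key reads as absent.
        for rec in reversed(self._log):
            if rec["key"] == key:
                return None if rec["deleted"] else rec["value"]
        return None

    def history(self, key):
        return [r for r in self._log if r["key"] == key]


store = AppendOnlyStore()
store.write("invoice:42", {"status": "paid"})
store.soft_delete("invoice:42")
assert store.read("invoice:42") is None        # looks deleted to the application...
assert len(store.history("invoice:42")) == 2   # ...but nothing was actually lost
```

Because every state change is attributable (timestamped) and recoverable, a bad write here is an inconvenience rather than a catastrophe.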
If every action your system takes is logged, reversible, and attributable, then even a large failure is recoverable. The blast radius in terms of broken state might be wide, but the blast radius in terms of permanent damage is small.
Failure Domains
A failure domain is the set of components that go down together when one of them fails. Mapping your failure domains before something breaks is one of the highest-value exercises in systems engineering.
Start by asking: if this component fails, what else stops working?
- If the answer is “just this component,” you have a small failure domain.
- If the answer requires a whiteboard, you have a problem.
Common sources of large failure domains:
- Single points of dependency. A shared authentication service that every other service depends on. When it’s down, everything is down.
- Cascading timeouts. Service A calls Service B, which calls Service C. C is slow, so B’s thread pool fills up waiting, so A’s thread pool fills up waiting on B. One slow service takes out the entire chain.
- Shared queues. Multiple consumers reading from the same message queue. A poison message that crashes one consumer can block or crash all of them.
- Global configuration. A single config change that propagates to every instance simultaneously. One bad value takes out the entire fleet.
The fix for each of these is the same principle applied differently: introduce boundaries so that failure in one place doesn’t automatically become failure everywhere.
Circuit breakers stop cascading timeouts. Dead letter queues isolate poison messages. Staged rollouts limit the blast radius of bad configuration. Redundant auth services eliminate the single point of failure.
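A circuit breaker can be sketched in a few lines: after enough consecutive failures it "opens" and fails fast, so upstream thread pools stop filling up waiting on a dead dependency. The thresholds below are illustrative defaults, not recommendations.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency so slowness can't cascade upstream."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None = closed (requests flow normally)

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Open: fail fast instead of tying up a thread waiting.
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the count
        return result
```

The breaker turns a slow, cascading failure into a fast, local one: Service A gets an immediate error from B's breaker instead of a queue of blocked threads.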
Blast Radius as a Design Metric
Treat blast radius like latency or uptime — something you measure, track, and actively reduce.
For every system capability, answer three questions:
- What’s the worst this can do? If this component fails in the worst possible way — not just crashes, but acts incorrectly — what’s the maximum damage?
- How long before we notice? The time between failure and detection is the exposure window. Damage accumulates linearly (or worse) during this window.
- How long before we contain it? The time between detection and containment. This is where kill paths matter — a fast kill path shrinks the effective blast radius even when the potential radius is large.
Your actual blast radius is roughly the maximum damage rate multiplied by the total exposure window: detection time plus containment time. Reducing any one of the three reduces the total.
A system with broad access but instant detection and a one-second kill path might have a smaller actual blast radius than a system with narrow access but no monitoring and a manual shutdown procedure.
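The comparison above can be made concrete. This is a back-of-the-envelope sketch, with made-up damage rates in arbitrary units:

```python
def actual_blast_radius(damage_per_s: float, detection_s: float,
                        containment_s: float) -> float:
    """Worst-case damage: rate of harm times the window before it's stopped."""
    return damage_per_s * (detection_s + containment_s)


# Broad access, but instant detection and a one-second kill path...
broad_but_watched = actual_blast_radius(damage_per_s=100,
                                        detection_s=1, containment_s=1)

# ...versus narrow access with no monitoring and a manual shutdown procedure.
narrow_but_blind = actual_blast_radius(damage_per_s=5,
                                       detection_s=3600, containment_s=600)

assert broad_but_watched < narrow_but_blind  # 200 vs 21,000 damage units
```

The numbers are invented, but the shape of the result is the point: a 20x higher damage rate is more than paid for by a detection-and-containment window that is thousands of times shorter.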
The Tradeoffs
Blast radius engineering has real costs:
- Performance. Isolation means more network hops, more serialization, more overhead. Shared resources are faster.
- Complexity. More boundaries mean more interfaces, more failure modes at the boundaries themselves, and more operational surface area.
- Cost. Dedicated infrastructure for every system is expensive. Shared infrastructure is cheaper per unit.
- Development speed. Scoped permissions, transactional writes, and compensation logic all take time to build. Broad access and direct writes are faster to ship.
These tradeoffs are real and the right answer depends on what your system does. A prototype that processes test data doesn’t need the same containment as an agent that executes financial transactions.
The mistake is treating blast radius reduction as optional for systems that run autonomously in production. An autonomous system that fails safely is more valuable than one that runs faster but fails catastrophically. Speed you can optimize later. Trust you have to build from the start.
