Let Machines Talk: Kill Paths

Stack Research
engineering research

A kill switch is a button. A kill path is an engineering discipline.

Every system that acts autonomously needs a way to stop. Not a checkbox in a compliance doc. An actual engineered path from full operation to full stop, with well-understood steps in between.

The problem is that “stop” is underspecified. Stop doing what? Stop when? Stop and then what? A kill switch with no answer to these questions is a liability dressed up as a safety feature.

The Spectrum

Kill paths aren’t binary. There are at least four distinct levels between “running normally” and “off,” and each one carries different costs.

1. Throttle

Reduce the rate of action. The system still operates, still makes decisions, but slower. Fewer API calls per minute. Longer delays between steps. Lower concurrency.

This is the lightest intervention. It buys time for a human to look at what’s happening without disrupting downstream systems that depend on output. It’s also the easiest to implement — a rate limiter with a configurable ceiling that an operator can dial down in real time.

The risk: throttling doesn’t change what the system is doing, only how fast. If the system is doing something harmful, it’s now doing it slowly. That’s not always better. Sometimes it’s worse, because it creates the illusion of control while damage continues to accumulate.

Use throttling when you’re uncertain. Something looks off, you want to observe, and you’re not ready to commit to a harder intervention.
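
A minimal sketch of this level, assuming a token-bucket rate limiter whose ceiling an operator can dial down while the system keeps running (class and method names are illustrative):

```python
import threading
import time

class AdjustableThrottle:
    """Token-bucket rate limiter with a ceiling an operator can lower live."""

    def __init__(self, rate_per_sec: float):
        self.rate = rate_per_sec          # current ceiling, adjustable at runtime
        self.tokens = rate_per_sec
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def set_rate(self, rate_per_sec: float) -> None:
        """Operator control: dial the ceiling down (or back up) in real time."""
        with self.lock:
            self.rate = rate_per_sec
            self.tokens = min(self.tokens, rate_per_sec)

    def acquire(self) -> None:
        """Block until one action is permitted under the current ceiling."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                wait = (1.0 - self.tokens) / self.rate if self.rate > 0 else 0.1
            time.sleep(wait)
```

Every action the system takes calls `acquire()` first; lowering the rate takes effect on the very next action, with no restart.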

2. Degrade

Remove capabilities. The system stays online but loses access to specific tools, actions, or decision paths. It can still read but not write. It can still recommend but not execute. It can still operate within a sandbox but not touch production resources.

Degradation is more surgical than throttling. Instead of slowing everything down, you cut specific edges in the system’s capability graph. This requires knowing, in advance, which capabilities can be removed independently. That means your system needs to be designed for it — capabilities as discrete, revocable permissions, not monolithic bundles.

The tradeoff is complexity. Degradation introduces partial states that are hard to reason about. A system that can read but not write might retry indefinitely, build up a queue of pending actions, or make decisions based on stale state because it can’t update its own context. You need to think through what the system does when a capability it expects to have is gone.

The other tradeoff is that degradation can be invisible to the people who need to know about it. If a system silently loses write access and continues to report healthy status, operators might not realize they’re looking at a lobotomized version of what should be running. Every degradation step needs a corresponding alert.
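
Capabilities-as-revocable-permissions, with the alerting requirement built in, might be sketched like this (the capability names and alert format are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Callable, Set

class CapabilityRevoked(Exception):
    """Raised when the system attempts an action whose capability is gone."""

@dataclass
class CapabilitySet:
    granted: Set[str]
    alert: Callable[[str], None] = print   # degradation must never be silent

    def revoke(self, capability: str) -> None:
        """Cut one edge in the capability graph and page the operators."""
        self.granted.discard(capability)
        self.alert(f"DEGRADED: capability '{capability}' revoked")

    def require(self, capability: str) -> None:
        """Called before every action; fails loudly if the capability is gone."""
        if capability not in self.granted:
            raise CapabilityRevoked(capability)
```

Revoking `db:write` here leaves reads intact, raises on write attempts instead of failing silently, and emits an alert, matching the "read but not write" partial state described above. What the caller does when `CapabilityRevoked` is raised (back off, queue, or halt) is exactly the part that has to be thought through in advance.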

3. Isolate

Cut the system off from everything external. It can still run, but it can’t reach other services, databases, APIs, or networks. It operates in a sealed environment.

Isolation is the right move when you suspect the system is compromised or when its actions are affecting other systems in ways you don’t understand. It’s the containment step — stop the blast radius from growing while you figure out what happened.

This is harder to implement than it sounds. Real systems have dozens of integration points: message queues, shared databases, webhook endpoints, file systems, DNS. True isolation means cutting all of them, and missing even one can leave a channel open. Network-level isolation (pulling the system into a quarantine VLAN or disabling its service mesh sidecar) is more reliable than application-level isolation (setting a flag that says “don’t call external services”) because it doesn’t depend on the system honoring its own restrictions.

The cost of isolation is data loss. Any in-flight work — incomplete transactions, unacknowledged messages, half-written records — gets stranded. Recovery from isolation is often harder than recovery from a hard stop, because you end up with partial state scattered across the boundary.
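
Because missing even one channel defeats isolation, it helps to keep an explicit inventory of integration points and verify each one after quarantine. A sketch of such a verifier, assuming the endpoints are known TCP services (the prober is injectable so network enforcement itself stays out of this code):

```python
import socket
from typing import Iterable, List, Tuple

def probe_tcp(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) still succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def verify_isolation(endpoints: Iterable[Tuple[str, int]],
                     prober=probe_tcp) -> List[Tuple[str, int]]:
    """Probe every known integration point; return those still reachable.

    An empty result is the only acceptable outcome: a single reachable
    endpoint means isolation failed.
    """
    return [(h, p) for h, p in endpoints if prober(h, p)]
```

Note this only verifies isolation; the enforcement should happen at the network layer (quarantine VLAN, sidecar shutdown) for the reasons above.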

4. Hard stop

Kill the process. Shut it down. No graceful cleanup, no final flush, no “just let me finish this one thing.” The system is off.

This is the option of last resort, and it needs to actually work. That means it can’t depend on the system’s cooperation. A hard stop that sends a shutdown signal and waits for the process to exit gracefully is not a hard stop. It’s a polite request. Hard stops are implemented at the infrastructure level: kill the container, terminate the VM, cut power to the hardware.

The cost is obvious: everything in memory is gone, in-flight operations are abandoned, and any state that wasn’t persisted is lost. Recovery requires starting from the last known good state, which means you need to have one. Systems that don’t checkpoint regularly can’t afford hard stops. Systems that can’t afford hard stops can’t afford to run autonomously.
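
The checkpointing that makes hard stops affordable has one essential property: a kill at any instant must leave either the previous checkpoint or the new one, never a torn file. A common sketch, assuming JSON-serializable state, is write-to-temp, fsync, then atomic rename:

```python
import json
import os
import tempfile

def write_checkpoint(path: str, state: dict) -> None:
    """Atomically persist state so a hard stop at any instant leaves a
    consistent checkpoint on disk."""
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())    # durable before it becomes visible
        os.replace(tmp, path)       # atomic rename: old or new, never torn
    except BaseException:
        os.unlink(tmp)
        raise

def load_checkpoint(path: str) -> dict:
    """Recovery starts here: the last known good state."""
    with open(path) as f:
        return json.load(f)
```

The temp file lives in the same directory as the target so the rename stays within one filesystem, which is what makes `os.replace` atomic on POSIX.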

The Order Matters

These four levels aren’t a menu. They’re a sequence. A well-engineered kill path moves through them in order, escalating only when the previous level fails to resolve the situation.

Throttle first. If the problem persists or worsens, degrade. If degradation doesn’t contain it, isolate. If isolation fails or isn’t fast enough, hard stop.

Skipping levels is sometimes necessary — if a system is actively exfiltrating data, you don’t start by throttling — but skipping should be the exception. Each level generates information. Throttling tells you whether the problem is rate-dependent. Degradation tells you which capability is causing harm. Isolation tells you whether the problem is internal or external. A hard stop is safe, but it teaches you nothing.

The escalation path should be automated with human checkpoints. The system can throttle itself automatically when anomaly metrics cross a threshold. Degradation should require confirmation, or at minimum emit a high-priority alert. Isolation and hard stop should involve a human unless the situation is so severe that waiting for human input is itself dangerous.
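
That policy — automatic throttle, human checkpoints above it, override only for critical situations — can be sketched as a small state machine (the 0.5 anomaly threshold and the level names are illustrative assumptions):

```python
from enum import IntEnum

class Level(IntEnum):
    NORMAL = 0
    THROTTLE = 1
    DEGRADE = 2
    ISOLATE = 3
    HARD_STOP = 4

def next_level(current: Level, anomaly_score: float,
               human_confirmed: bool, critical: bool) -> Level:
    """Escalate one level at a time. Throttling is automatic; everything
    above it needs a human, unless the situation is marked critical."""
    if anomaly_score < 0.5:
        return current                       # metrics look fine: hold
    proposed = Level(min(current + 1, Level.HARD_STOP))
    if proposed == Level.THROTTLE:
        return proposed                      # safe to automate
    if human_confirmed or critical:
        return proposed
    return current                           # hold at this level and alert
```

The one-level-at-a-time rule is what preserves the information each level generates; the `critical` flag is the escape hatch for cases where waiting for a human is itself dangerous.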

Why Kill Paths Fail

The engineering isn’t the hard part. The hard part is making sure the kill path works when you actually need it.

Kill paths rot. A kill path that worked six months ago might not work today. The system has new integrations, new capabilities, new dependencies. The isolation procedure that used to cut three network connections now needs to cut nine, but nobody updated the runbook. Kill paths need to be tested continuously, not just at design time.

Kill paths have dependencies. If your kill path runs through the same infrastructure as the system it’s supposed to stop, it can fail for the same reasons the system is failing. A monitoring service that runs on the same Kubernetes cluster as the workload it monitors will go down with the cluster. Kill path infrastructure should be independent of the system it controls — different network, different compute, different credentials.

Kill paths are too slow. An autonomous system that executes ten actions per second accumulates a lot of damage in the thirty seconds it takes an operator to open a dashboard, assess the situation, and click a button. The time between “something is wrong” and “the system is stopped” is your exposure window. Measure it. Shrink it. Automate the parts that don’t require judgment.

Kill paths are incomplete. The system is stopped, but its effects aren’t. A system that sent a thousand API calls before being killed has already changed the state of a thousand external systems. Kill paths stop future damage. They don’t undo past damage. That’s a separate capability — rollback — and it needs its own engineering.

Nobody practices using them. The first time an operator uses a kill path should not be during an actual incident. Teams should run kill path drills regularly. Trigger each level. Verify it works. Measure how long it takes. Fix what breaks. This is boring, repetitive, and absolutely necessary.

Kill Paths and Rollback Are Different Things

Stopping a system and undoing what it did are two separate problems. Kill paths handle the first. Rollback handles the second. Confusing them is dangerous.

A kill path answers: how do we stop this from doing more?

A rollback answers: how do we undo what it already did?

Rollback requires that every action the system takes is recorded, reversible, and attributed. That means append-only logs, idempotent operations where possible, and compensation logic for operations that can’t be reversed. Deleting a database row can be rolled back if you logged the row contents before deletion. Sending an email cannot be rolled back. Transferring money can be rolled back with a compensating transaction. Publishing a secret cannot be rolled back.
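
A sketch of that structure: journal before executing, keep a compensation handler per action type, and replay the journal in reverse, reporting what has no compensation (the action names and handlers here are hypothetical):

```python
from typing import Callable, Dict, List

class ActionJournal:
    """Append-only journal written BEFORE each action executes, with
    per-action-type compensation handlers to support rollback."""

    def __init__(self, compensations: Dict[str, Callable[[dict], None]]):
        self.compensations = compensations
        self.entries: List[dict] = []

    def execute(self, kind: str, payload: dict,
                action: Callable[[dict], None]) -> None:
        self.entries.append({"kind": kind, "payload": payload})  # journal first
        action(payload)

    def rollback(self) -> List[str]:
        """Replay in reverse; return the kinds that cannot be undone."""
        irreversible = []
        for entry in reversed(self.entries):
            comp = self.compensations.get(entry["kind"])
            if comp is None:
                irreversible.append(entry["kind"])   # e.g. a sent email
            else:
                comp(entry["payload"])               # e.g. compensating transfer
        return irreversible
```

Journaling before execution, not after, is the load-bearing detail: if the system dies mid-action, the journal still tells you what it may have done.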

The kill path decision and the rollback decision are often made by different people under different time pressures. The person who hits the kill switch is thinking about stopping the bleeding. The person who plans the rollback is thinking about reconstructing what happened and what can be recovered. Design for both, but don’t conflate them.

Designing for Killability

Systems that are easy to stop share a few properties:

  • Checkpointed state. The system writes its state to durable storage at regular intervals. A hard stop from any checkpoint produces a consistent, recoverable state.
  • Idempotent actions. Operations can be retried without additional side effects, so a system that’s killed mid-operation can be restarted without producing duplicates or corruption.
  • Scoped permissions. Each capability the system has is granted independently and can be revoked independently. Degradation is a matter of revoking the right permissions, not rewriting code.
  • External observability. The system’s behavior is visible from outside, through metrics, logs, and traces that don’t depend on the system itself being healthy. If the system is misbehaving, you can tell from outside it.
  • Independent control plane. The mechanism that stops the system doesn’t run on the system. It’s a separate service, on separate infrastructure, with separate credentials.
  • Action journaling. Every externally-visible action is logged before execution with enough detail to support rollback. The journal is the source of truth for what happened, not the system’s internal state.
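
To make one of these properties concrete: idempotency is often implemented with a deduplication key per operation, so a restart after a hard stop can safely re-run its whole work queue. A minimal sketch (the class name and in-memory store are illustrative; a real system keeps the completed-set in durable storage):

```python
from typing import Callable

class IdempotentExecutor:
    """Deduplicate actions by key so a restart after a hard stop can
    re-run its work queue without producing duplicate side effects."""

    def __init__(self):
        self.completed = set()   # in production: durable storage, not memory

    def run(self, key: str, action: Callable[[], object]):
        if key in self.completed:
            return "skipped"     # already executed before the kill
        result = action()
        self.completed.add(key)  # record success only after the action
        return result
```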

None of this is novel. These are the same properties that make any distributed system reliable. The difference is that autonomous systems make them non-negotiable. A web server that crashes and restarts is an inconvenience. An autonomous agent that crashes mid-operation with no checkpoint, no journal, and no rollback path is a liability.

The Real Test

A kill path is only as good as the last time it was tested. Not reviewed. Not diagrammed. Tested.

Run the drill. Trigger throttle and verify the rate drops. Revoke a capability and confirm the system degrades correctly without silent failures. Isolate the system and verify that every integration point is actually cut. Kill the process and confirm that recovery from the last checkpoint works and that no data is lost or corrupted.
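
The drill above can be sketched as a harness that triggers each level, verifies the expected effect, and records trigger-to-verified time — which is exactly the exposure window to measure and shrink (the trigger and verify callables are assumed hooks into your own infrastructure):

```python
import time
from typing import Callable, Dict, Tuple

def run_drill(steps: Dict[str, Tuple[Callable[[], None],
                                     Callable[[], bool]]]) -> dict:
    """For each kill-path level, trigger it, verify the effect actually
    happened, and record how long trigger-to-verified took."""
    report = {}
    for name, (trigger, verify) in steps.items():
        start = time.monotonic()
        trigger()
        ok = verify()
        report[name] = {"ok": ok, "seconds": time.monotonic() - start}
    return report
```

A level whose `ok` is False, or whose duration keeps growing between drills, is a kill path that has started to rot.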

Do this in production, not just staging. Staging environments lie. They have fewer integrations, less data, and different network topologies. A kill path that works in staging and fails in production is not a kill path.

If you can’t test it, you can’t trust it. And if you can’t trust it, you don’t have a kill path. You have a hope.