Let Machines Talk: Rollback

Stack Research
engineering research

Stopping the system is half the problem. Undoing what it already did is the other half.

Kill paths stop a system from doing more damage. Rollback answers the harder question: what do you do about the damage that already happened?

These are two different engineering problems with different constraints, different tooling, and different failure modes. Treating them as one — or worse, assuming that stopping a system is the same as fixing what it broke — is how teams end up with clean shutdowns and corrupted state.

Not Everything Can Be Undone

The first thing to accept about rollback is that some actions are irreversible. Not difficult to reverse. Impossible.

  • You can reverse a database write. You can’t reverse a sent email.
  • You can reverse a fund transfer with a compensating transaction. You can’t reverse a leaked credential.
  • You can reverse a config change. You can’t reverse a published API response that a thousand clients already consumed.

Every action your system takes falls somewhere on a spectrum from trivially reversible to permanently irreversible. Knowing where each action sits on that spectrum — before the system executes it — is the foundation of rollback engineering.

If you haven’t classified your actions, you don’t have a rollback strategy. You have optimism.
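One way to make that classification concrete is a table the system consults before executing anything. A minimal Python sketch; the action names are illustrative, and unknown actions deliberately default to irreversible:

```python
from enum import Enum

class Reversibility(Enum):
    RESTORABLE = "restorable"      # undo by restoring prior state
    COMPENSABLE = "compensable"    # undo by a semantic reverse action
    IRREVERSIBLE = "irreversible"  # cannot be undone once executed

# Hypothetical classification table for one system's action types.
ACTION_CLASS = {
    "db_write": Reversibility.RESTORABLE,
    "config_change": Reversibility.RESTORABLE,
    "fund_transfer": Reversibility.COMPENSABLE,
    "issue_token": Reversibility.COMPENSABLE,
    "send_email": Reversibility.IRREVERSIBLE,
    "publish_api_response": Reversibility.IRREVERSIBLE,
}

def classify(action_type: str) -> Reversibility:
    """Unknown actions are treated as irreversible: the safe default."""
    return ACTION_CLASS.get(action_type, Reversibility.IRREVERSIBLE)
```

The safe-default line is the point: an action you haven't classified gets the most conservative treatment, not the most convenient one.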

The Three Mechanisms

There are three ways to undo something, and each applies to different kinds of actions.

1. State restoration

Replace the current state with a previous known-good state. This is the simplest form of rollback and the most reliable when it applies.

Database snapshots, file system backups, configuration version history — all of these are state restoration. You don’t need to understand what changed or why. You just need a copy of what things looked like before and a way to put it back.

The limitation is scope. State restoration works for systems you fully control. It doesn’t work for state that lives in external systems, downstream services, or other people’s databases. You can restore your table to yesterday’s snapshot. You can’t restore your customer’s cache.

State restoration also requires that you actually have the previous state. That means regular checkpoints, immutable backups, and retention policies that keep snapshots long enough to be useful. A daily backup doesn’t help when you need to roll back to three hours ago.
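A checkpointing helper along these lines is easy to sketch. This toy version keeps timestamped deep copies in memory; a real one would write immutable snapshots to storage independent of the system being checkpointed:

```python
import copy
import time

class Checkpointer:
    """Keeps timestamped deep-copy snapshots of a state dict."""

    def __init__(self):
        self.snapshots = []  # (timestamp, state) pairs, oldest first

    def checkpoint(self, state, ts=None):
        # Deep copy so later mutations can't corrupt the snapshot.
        self.snapshots.append(
            (ts if ts is not None else time.time(), copy.deepcopy(state))
        )

    def restore_as_of(self, ts):
        """Return the latest snapshot taken at or before ts."""
        candidates = [s for t, s in self.snapshots if t <= ts]
        if not candidates:
            raise LookupError("no snapshot old enough to cover that time")
        return copy.deepcopy(candidates[-1])
```

The `restore_as_of` granularity is what a daily backup lacks: the more often you checkpoint, the closer you can land to "three hours ago."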

2. Compensation

Execute a new action that semantically reverses the original. Refund a payment. Revoke an issued token. Send a correction to a downstream service. Post a reversing journal entry.

Compensation is the mechanism for actions that change external state. You can’t restore a previous snapshot of someone else’s system, but you can ask that system to undo the effect of your action.

This requires three things:

  • A record of what the original action was, in enough detail to construct the reverse.
  • A defined compensation path for each action type. Not every action has an obvious inverse. What’s the compensation for “sent a notification”? For “created a user account”? These need to be decided at design time, not during an incident.
  • The ability to execute the compensation. If the downstream service is down, or the API doesn’t support the reverse operation, compensation fails. You need fallback plans for when compensation itself doesn’t work.

Compensation is inherently imperfect. A refunded payment isn’t the same as a payment that never happened. A revoked token might have already been used. A correction email doesn’t unsend the original. Compensation reduces harm. It doesn’t erase it.
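In code, a compensation path is a mapping from action type to a builder for the reversing action, decided at design time. A sketch with hypothetical action shapes; the key behavior is refusing to guess when no inverse is defined:

```python
# Map each action type to a function that constructs its compensating
# action from the original action record. (Action shapes are illustrative.)
COMPENSATIONS = {
    "payment": lambda a: {"type": "refund",
                          "payment_id": a["id"],
                          "amount": a["amount"]},
    "issue_token": lambda a: {"type": "revoke_token",
                              "token_id": a["id"]},
}

def compensate(action):
    builder = COMPENSATIONS.get(action["type"])
    if builder is None:
        # No defined inverse: escalate to a human instead of improvising.
        raise ValueError(f"no compensation path for {action['type']!r}")
    return builder(action)
```

The `ValueError` branch encodes the design-time rule from above: an action type without a decided compensation path is an incident-response gap, not something to invent mid-incident.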

3. Replay

Rebuild the current state by replaying a log of events from a known-good starting point, skipping or modifying the bad ones.

This is the most powerful mechanism and the most demanding. It requires an append-only event log, a way to identify which events are bad, and a system that can reconstruct its state from the log deterministically.

Event sourcing architectures get this for free — the event log is the source of truth, and any state can be rebuilt by replaying events. Traditional architectures can approximate it with write-ahead logs or change data capture streams, but the reconstruction is harder and less reliable.

Replay’s strength is precision. Instead of rolling back to a snapshot and losing everything after it, you can keep every good event and skip only the bad ones. The tradeoff is complexity. Replaying events takes time, requires the system to be offline or in a degraded state, and depends on the replay being deterministic — the same events producing the same state every time.
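The core of replay is small. A sketch, assuming an append-only list of events, a predicate that identifies the bad ones, and a pure `apply_event` function; purity is what makes the replay deterministic:

```python
def replay(events, is_bad, apply_event, initial_state):
    """Rebuild state by replaying events in order, skipping the bad ones.

    Determinism requirement: apply_event must be a pure function of
    (state, event) with no clock reads, randomness, or external calls.
    """
    state = initial_state
    for event in events:
        if is_bad(event):
            continue  # precision: drop only this event, keep the rest
        state = apply_event(state, event)
    return state
```

For example, with a balance ledger where each event carries a `delta`, replaying around one bad event keeps every good transaction instead of discarding everything after a snapshot.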

The Journal

All three mechanisms depend on the same thing: a complete, trustworthy record of what the system did.

Without a journal, state restoration is a guess. Compensation is impossible. Replay is fiction.

A rollback journal needs to capture:

  • What happened. The action, in enough detail to reverse it. Not “updated record” but “changed field X from value A to value B in table T, row R.”
  • When it happened. Precise timestamps, ordered, with enough resolution to distinguish concurrent operations.
  • Why it happened. The decision chain that led to the action. Which input triggered it, which rule matched, which model produced the recommendation. This matters for replay — you need to know which decisions were downstream of the bad one.
  • What it affected. The full list of systems, records, and external services that were touched. Compensation requires knowing every place the action had an effect.

The journal must be written before the action executes, not after. If the system crashes mid-action, a post-action log entry never gets written and you lose visibility into the most critical moment. Write the intent first, execute second, then write the outcome. If the outcome is missing, you know the action was attempted but may not have completed.

The journal must be stored independently of the system it records. A journal that lives in the same database as the data it tracks gets corrupted by the same failures it’s supposed to help you recover from.
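An intent-first journal can be sketched in a few lines. This toy version appends JSON records to a local file; a real journal would live on storage independent of the system it records, per the point above:

```python
import json

class Journal:
    """Write-ahead journal: intent is recorded before the action runs."""

    def __init__(self, path):
        self.path = path

    def _append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()  # intent must hit the journal before execution

    def record_intent(self, action_id, action):
        self._append({"phase": "intent", "id": action_id, "action": action})

    def record_outcome(self, action_id, outcome):
        self._append({"phase": "outcome", "id": action_id, "outcome": outcome})

    def incomplete_actions(self):
        """Intents with no matching outcome: attempted, maybe unfinished."""
        intents, outcomes = {}, set()
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                if rec["phase"] == "intent":
                    intents[rec["id"]] = rec["action"]
                else:
                    outcomes.add(rec["id"])
        return {i: a for i, a in intents.items() if i not in outcomes}
```

`incomplete_actions` is the payoff of writing intent first: after a crash, the intents with no outcome are exactly the actions that were attempted but may not have completed.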

Rollback Scope

When something goes wrong, the natural instinct is to roll everything back. Undo all the changes since the last known-good state. Full reset.

This is almost always more destructive than necessary. A full rollback throws away every action the system took — the correct ones along with the incorrect ones. If the system processed a thousand transactions and one was bad, rolling back all thousand creates nine hundred and ninety-nine new problems to fix.

Scoped rollback means identifying exactly which actions need to be reversed and reversing only those. This requires:

  • Traceability. Following the chain from the bad input to every action it influenced. If a bad value entered at step 3 affected decisions at steps 7, 12, and 15, you need to roll back those three steps — not everything from step 3 onward.
  • Dependency mapping. Understanding which actions depend on which other actions. Rolling back step 7 might invalidate step 9, which used step 7’s output as input. You need to know this before you start.
  • Isolation between independent work. Actions that don’t share inputs or state can be evaluated independently. A bad transaction in one workflow shouldn’t affect rollback decisions in an unrelated workflow.

Scoped rollback is harder than full rollback. It requires better tooling, better journals, and better understanding of the system’s dependency graph. It’s also the only kind of rollback that works at scale without causing secondary outages.
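Traceability reduces to a graph traversal: start from the bad actions and walk the dependency edges to find everything downstream. A sketch, assuming the journal already yields a map from each action to the actions that consumed its output:

```python
from collections import deque

def affected_actions(deps, bad_roots):
    """Compute the set of actions tainted by the bad roots.

    deps maps each action to the downstream actions that used its
    output as input. Breadth-first traversal marks everything reachable.
    """
    tainted = set(bad_roots)
    queue = deque(bad_roots)
    while queue:
        node = queue.popleft()
        for downstream in deps.get(node, []):
            if downstream not in tainted:
                tainted.add(downstream)
                queue.append(downstream)
    return tainted
```

With `deps = {3: [7, 12], 7: [15]}`, tainting step 3 yields `{3, 7, 12, 15}`: the scoped set to reverse, leaving independent work untouched.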

Testing Rollback

A rollback procedure that hasn’t been tested is a rollback procedure that doesn’t work. This isn’t cynicism. It’s experience.

Test each mechanism independently:

  • State restoration: Take a snapshot. Make changes. Restore the snapshot. Verify the state is correct and the system functions normally after restoration.
  • Compensation: Execute an action. Run the compensation. Verify the external effect is reversed and no side effects remain.
  • Replay: Introduce a bad event into a log. Replay the log without it. Verify the resulting state matches what it should be.
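The replay check in particular is easy to automate as a regression test. A sketch, assuming a pure apply function over an event list; the event shape is illustrative:

```python
def fold(events, apply_event, initial):
    """Rebuild state by applying events in order."""
    state = initial
    for event in events:
        state = apply_event(state, event)
    return state

def test_replay_skips_bad_event():
    # Pure, deterministic apply function over a toy balance ledger.
    apply_event = lambda balance, e: balance + e["delta"]
    log = [{"delta": 10}, {"delta": -999, "bad": True}, {"delta": 5}]
    rebuilt = fold([e for e in log if not e.get("bad")], apply_event, 0)
    assert rebuilt == 15  # every good event kept, only the bad one skipped

test_replay_skips_bad_event()
```

Run on every change to the event schema or the apply function, this catches the failure mode that kills replay in practice: nondeterminism or drift between the log and the code that interprets it.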

Then test them together. A real incident rarely needs just one mechanism. You might restore local state from a snapshot, compensate external effects, and replay a filtered event log to get the system current. The interaction between these mechanisms — the order, the timing, the edge cases — is where rollback plans fall apart.

Test in production. A staging environment with a clean database and no external integrations will tell you that your rollback scripts run without errors. It won’t tell you that they work.

The Cost of Not Building This

Rollback engineering is expensive. Journals take storage. Compensation logic takes development time. Replay infrastructure takes operational investment. Testing takes discipline.

The alternative is a system that can be stopped but not fixed. A system where every incident requires manual forensics, ad hoc scripting, and hours of uncertainty about what was affected and what’s recoverable. A system where the answer to “can we undo this?” is always “it depends” and sometimes “we don’t know.”

Every autonomous system that acts on the real world needs a rollback strategy proportional to the consequences of its actions. A system that renames files needs minimal rollback. A system that moves money needs extensive rollback. A system that makes irreversible decisions needs to be designed so that as few of its decisions as possible are actually irreversible.

The goal isn’t to make every action undoable. It’s to know, for every action, whether it can be undone — and to have already built the mechanism for the ones that can.