Let Machines Talk: Decision Lineage
Observability tells you what's happening. Lineage tells you why.
After an incident, the first question is always “what happened?” The second, harder question is “why did the system decide to do that?”
Logs tell you what happened. Metrics tell you when. Traces tell you where. None of them reliably tell you why. The decision that led to the action — the inputs it considered, the rules it applied, the alternatives it rejected — is usually gone. Reconstructing it means reading code, guessing at state, and hoping the system behaved the way you think it did.
Decision lineage is the practice of recording the full chain from input to decision to action, at the time the decision is made, so that any action the system takes can be traced back to the reasoning that produced it.
What Lineage Records
A lineage record answers five questions about every decision:
- What triggered it? The input, event, or condition that started the decision process. A user request, a scheduled job, a threshold crossing, a message from another system.
- What did the system know? The state the system had access to at decision time. Not the current state — the state as it existed when the decision was made. Context, variables, retrieved data, model outputs.
- What rules applied? The logic that evaluated the inputs and produced the decision. Policy rules, model weights, conditional branches, permission checks. Which ones fired and which ones didn’t.
- What alternatives existed? The options the system considered and rejected. This is the part that’s almost always missing. Knowing that the system chose option A is useful. Knowing it chose A over B and C, and why, is far more useful.
- What action resulted? The concrete thing the system did as a consequence of the decision. The API call, the database write, the message sent, the state change.
Each of these is linked. The trigger leads to the context, which feeds the rules, which evaluate the alternatives, which produce the action. Break any link and the lineage is incomplete.
Lineage Is Not Logging
Logs and lineage serve different purposes and have different requirements.
Logs are operational. They record events as they happen — requests received, errors thrown, services called. They’re optimized for volume, searchability, and real-time alerting. A log entry says “at 14:32:07, the system called the payments API with amount $450.”
Lineage is analytical. It records the decision structure that produced the event. A lineage record for that same action says “at 14:32:07, the system received invoice #8821, retrieved the vendor’s payment terms (net-30, no discount), checked the approval policy (auto-approve under $500), confirmed budget availability ($12,400 remaining in Q1 opex), selected immediate payment over scheduled payment because the due date was within 48 hours, and called the payments API with amount $450.”
The log tells you what to investigate. The lineage tells you what you’d find.
This distinction matters because logs are cheap and lineage is not. Recording the full decision context for every action takes more storage, more compute, and more careful engineering. You don’t need lineage for every decision your system makes. You need it for every decision that matters — the ones that change external state, spend money, grant access, or can’t be undone.
Recording Lineage at Decision Time
Lineage has to be captured when the decision happens, not reconstructed afterward. This is the constraint that makes it hard.
After an incident, the state that informed the decision may have changed. The model that produced the recommendation may have been updated. The policy rules may have been modified. The context the system had access to may have been overwritten. If you’re reconstructing lineage after the fact, you’re building a theory about what the system probably considered based on what you can still see. That’s forensics. It’s valuable, but it’s not lineage.
Capturing lineage at decision time means the decision logic itself has to emit the record. Every branch point, every data lookup, every rule evaluation writes its inputs and outputs to the lineage store before proceeding. The lineage record is complete before the action executes.
This has a performance cost. Adding a write to every decision point slows the system down. The mitigation is scoping — not every decision needs full lineage. Internal routing decisions, cache lookups, and retry logic don’t need the full treatment. Decisions that cross trust boundaries, change external state, or are irreversible do.
The other option is structured decision functions. Instead of instrumenting arbitrary code paths, you channel all significant decisions through a common interface that handles lineage recording automatically. The decision function takes inputs, evaluates rules, and returns an action — and the lineage record is a byproduct of execution, not an afterthought.
Lineage for Real-Time Oversight
Lineage is usually discussed as a post-incident tool. Something went wrong, you pull the lineage, you figure out why. But the more valuable use case is real-time.
An operator watching a running system can monitor actions through dashboards and logs. But actions alone don’t tell you if the system is making good decisions. A system might be taking the right actions for the wrong reasons — correct outputs from flawed reasoning that will eventually produce a bad outcome. Or it might be making subtly degraded decisions that don’t cross any threshold but drift further from correct over time.
Real-time lineage gives operators visibility into the decision process while the system is running:
- Decision auditing. Sample a percentage of decisions in real time and surface the full lineage for review. An operator can see not just what the system is doing but why it’s doing it, without waiting for something to break.
- Drift detection. Compare current decision lineage against historical baselines. If the system starts weighing inputs differently, considering fewer alternatives, or applying rules in a different order, the lineage reveals it before the actions show symptoms.
- Policy verification. Confirm that the rules you think are governing the system are actually governing the system. A policy that says “require human approval for transactions over $10,000” should show up in the lineage of every transaction over $10,000. If it doesn’t, you have a policy enforcement gap.
Real-time lineage turns oversight from reactive to proactive. Instead of waiting for a bad action and investigating backward, you watch the decisions and catch the bad reasoning before it produces a bad action.
The Lineage Store
Lineage records need to survive the system that produced them. If the system crashes, is killed, or is rolled back, the lineage for every decision it made up to that point must still be accessible.
This means:
- External storage. The lineage store is a separate service, on separate infrastructure, with separate credentials. It doesn’t share a failure domain with the system it records.
- Append-only writes. Lineage records are never updated or deleted by the system that created them. They’re immutable once written. This prevents a compromised or malfunctioning system from covering its tracks.
- Durable before action. The lineage record must be confirmed written before the action executes. If the lineage write fails, the action should not proceed. This is the hardest constraint to enforce and the most important. A gap in lineage is a gap in accountability.
The storage requirements are significant. Full decision context for high-frequency systems generates a lot of data. Tiered retention helps — keep full lineage for recent decisions, compress or summarize older ones, and retain full records indefinitely only for decisions that resulted in actions flagged for review.
Lineage and Rollback
Decision lineage makes rollback dramatically easier.
Without lineage, scoped rollback requires reconstructing which actions were affected by a bad input. You trace through logs, infer the decision chain, and hope you identified every downstream effect. It’s slow, error-prone, and incomplete.
With lineage, you query the lineage store: show me every decision that was influenced by input X. The full chain is already recorded — which decisions consumed that input, what actions they produced, and what downstream decisions used those actions as inputs. Scoped rollback becomes a graph traversal instead of a forensic investigation.
This is where lineage and kill paths intersect. A kill path stops the system. Lineage tells you what the system did while it was running. Rollback uses that information to undo the specific actions that need undoing. Without lineage, rollback is either surgical-but-slow (manual investigation) or fast-but-destructive (full state restoration). With lineage, it can be both surgical and fast.
The Hard Parts
Decision lineage sounds straightforward in principle. Record inputs, rules, and outputs. In practice, three things make it difficult.
Context is large. A decision that incorporates a language model’s output, a database query result, and three API responses has a lot of context. Recording all of it for every decision is expensive. You need a strategy for what to record verbatim, what to summarize, and what to reference by pointer (store the query, not the result, and re-fetch if needed during review).
Decisions are nested. A high-level decision (“process this invoice”) decomposes into dozens of sub-decisions (“look up vendor,” “check approval policy,” “select payment method”). Recording lineage at the wrong granularity gives you either too much noise or not enough detail. The right level is usually the decision that directly produces an externally-visible action.
Lineage has to be trustworthy. If the system is compromised, its lineage records may be fabricated. If the system is buggy, its lineage records may be incomplete or inaccurate. Lineage is only as reliable as the code that produces it. External validation — comparing lineage records against independent logs, checking that recorded inputs match what external systems actually sent — adds confidence but also complexity.
None of these problems are unsolvable. They’re engineering tradeoffs. The cost of solving them is real. The cost of not having lineage when you need it — when an autonomous system has done something you can’t explain and you’re working backward from effects to causes with nothing but timestamped log entries — is higher.
