
Building a tool that treats chain-of-thought reasoning as a program to debug, not a process to steer. It finds where the logic broke.
When a program crashes, you get a stack trace. It tells you where execution was, what called what, and which line broke. The trace doesn’t explain why the bug exists, but it tells you exactly where to look.
When an LLM’s reasoning goes wrong, you get nothing. You get a confident wrong answer, or a thousand-token thinking block that drifts quietly off course somewhere in the middle. There’s no trace. No structural map of what depended on what. No way to point at a specific step and say: this is where it broke.
We built trace-topology to change that.
Reasoning has structure
A chain-of-thought isn’t just text. It has a shape — claims that support other claims, corrections that reference earlier mistakes, tangents that branch off and may or may not come back. That shape matters. When reasoning works, it’s because the shape holds. When it fails, something in the structure went wrong.
This isn’t a metaphor. A research team at ByteDance published a paper in January 2026 called “The Molecular Structure of Thought” that made the case formally. They analyzed thousands of long reasoning traces and found that effective ones form stable molecular-like structures. They identified three types of connections between reasoning steps:
Covalent bonds — strong logical dependencies. Step B follows from step A. “Therefore.” “Because.” “Given that.” This is the backbone of any argument.
Hydrogen bonds — self-reflection. The model reviews what it just said and adjusts. “Wait, that’s not right.” “Let me reconsider.” “Actually…” These are the error-correction mechanism. They keep reasoning on track the way proofreading keeps writing on track.
Van der Waals forces — loose associations. The model explores a tangent, tries a different angle, meanders before converging. These are the weakest connections, but they’re where creative leaps happen.
The paper’s key finding: only traces with the right balance of these bond types produce stable reasoning. Too much self-reflection without deep reasoning spins in circles. Too much exploration without convergence never arrives anywhere. And when bonds are missing — when a conclusion doesn’t trace back to supported premises — the reasoning looks fluent but the structure is hollow.
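The balance check can be approximated with surface cues. Here is a minimal sketch: the keyword lists and the `bond_profile` helper are illustrative assumptions for this post, not the paper's actual classifier, which works at a much finer grain.

```python
from collections import Counter

# Illustrative opening cues per bond type -- an assumption for this sketch,
# not the paper's real classification method.
CUES = {
    "covalent": ("therefore", "thus", "because", "given that"),
    "hydrogen": ("wait", "actually", "let me reconsider"),
    "van_der_waals": ("alternatively", "what if", "another angle"),
}

def bond_profile(steps):
    """Count which bond type (if any) each reasoning step opens with."""
    counts = Counter()
    for step in steps:
        lowered = step.lower()
        for bond, cues in CUES.items():
            if lowered.startswith(cues):  # str.startswith accepts a tuple
                counts[bond] += 1
                break
    return counts

trace = [
    "The sequence is bounded and monotone.",
    "Therefore it converges.",
    "Wait, monotone in which direction?",
    "Alternatively, we could use the Cauchy criterion.",
]
print(bond_profile(trace))
```

A profile dominated by one bond type is the warning sign the paper describes: all hydrogen spins in circles, all van der Waals never converges.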
A debugger, not a steering wheel
The existing tools in this space — Tree of Thought, Graph of Thought, Hippo, ReasonGraph — are built to steer reasoning during inference. They run alongside the model, shaping what it does next. They're useful tools. But none of them answers the question we care about: given a reasoning trace that already happened, where did the structure break?
It's the difference between a flight controller and a crash investigator. Both care about the same system, but one intervenes in real time and the other works from the black-box recording after the fact.
trace-topology is the crash investigator. It takes a transcript — any transcript, from any model — and produces a structural analysis. No API calls needed. No model access required. Just the text.
What it does
The pipeline has four stages.
Parse. A raw reasoning trace gets segmented into discrete logical steps. Not sentences — sentences are too granular. Not paragraphs — paragraphs are too coarse. The parser looks for logical transitions: the places where the reasoning shifts from one claim or move to the next. Each step becomes a node.
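The segmentation idea can be sketched in a few lines: split into sentences first, then start a new step only at a transition cue, merging everything else into the current step. The cue list is an illustrative assumption, and this is exactly the kind of keyword heuristic that, as noted later, misses subtle transitions.

```python
import re

# Cue words that typically open a new logical move (illustrative list,
# not the tool's actual segmenter).
TRANSITION_CUES = ("therefore", "thus", "wait", "actually", "alternatively",
                   "however", "but", "first", "next", "finally")

def segment(trace: str) -> list[str]:
    """Split a raw trace into candidate logical steps: coarser than
    sentences, finer than paragraphs."""
    sentences = re.split(r"(?<=[.!?])\s+", trace.strip())
    steps: list[list[str]] = []
    for sent in sentences:
        opens_new_step = sent.lower().startswith(TRANSITION_CUES)
        if steps and not opens_new_step:
            steps[-1].append(sent)   # continuation of the current step
        else:
            steps.append([sent])     # a new logical move begins here
    return [" ".join(s) for s in steps]

trace = ("The sequence is bounded. It is also monotone. "
         "Therefore it converges. Wait, monotone in which direction? "
         "Actually the direction does not matter for convergence.")
for i, step in enumerate(segment(trace), 1):
    print(f"[S{i}] {step}")
```

The two opening sentences fuse into one step — they make a single claim — while each "Therefore", "Wait", "Actually" opens a new node.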
Graph. The tool builds a directed graph between steps. Which claim supports which conclusion? Where does the model reference its own earlier reasoning? Each edge gets classified as covalent, hydrogen, or van der Waals, following the molecular framework.
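A stripped-down version of the graph-building stage, under a simplifying assumption the real tool does not make (each step links only to its immediate predecessor; the cue lists are again illustrative):

```python
from dataclasses import dataclass, field

# Heuristic edge cues per bond type; anything without a logical cue
# falls through to a weak van der Waals progression edge.
BOND_CUES = {
    "covalent": ("therefore", "thus", "because", "given that"),
    "hydrogen": ("wait", "actually", "let me reconsider"),
}

@dataclass
class TraceGraph:
    steps: list[str]
    edges: list[tuple[int, int, str]] = field(default_factory=list)  # (src, dst, bond)

def build_graph(steps: list[str]) -> TraceGraph:
    """Link each step to its predecessor, typed by the cue that opens it."""
    g = TraceGraph(steps)
    for i in range(1, len(steps)):
        lowered = steps[i].lower()
        bond = "van_der_waals"             # default: loose progression
        for b, cues in BOND_CUES.items():
            if lowered.startswith(cues):
                bond = b
                break
        g.edges.append((i - 1, i, bond))   # predecessor supports current step
    return g

g = build_graph([
    "Assume n is even.",
    "Therefore n = 2k for some integer k.",
    "Wait, we also need k > 0 here.",
    "Consider the odd case next.",
])
print(g.edges)
```

The real graph is denser than this chain — a step can reference reasoning from much earlier — but the typed-edge representation is the same.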
Analyze. The analyzer walks the graph looking for structural problems:
Cycles — circular reasoning where A supports B supports A. The model believes it’s building an argument, but it’s running in a loop.
Dangling nodes — abandoned threads. The model started a line of reasoning, got distracted or stuck, and never came back. The thread hangs in the graph with no connections to the conclusion.
Unsupported terminals — final conclusions that don’t trace back to supported premises. The answer looks confident, but if you follow the dependency chain upward, you hit assertions with no backing.
Contradiction pairs — steps that assert X and not-X. More common than you’d expect, especially in long traces where the model has forgotten what it committed to a thousand tokens ago.
Entropy divergence — reasoning that gets less focused over time. Measurable as declining information density across the trace. The model is still generating tokens, but the thinking is getting broader and shallower.
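The first of these checks, cycle detection, is plain graph theory. A minimal depth-first-search sketch (the analyzer's other checks are omitted here), demonstrated on the P1–P4 loop from the figure below, reduced to the progression chain plus P4's explicit back-link:

```python
def find_cycles(n_nodes: int, edges: list[tuple[int, int]]) -> list[list[int]]:
    """DFS cycle detection over a directed step graph (edge = src supports dst)."""
    adj = {i: [] for i in range(n_nodes)}
    for src, dst in edges:
        adj[src].append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited / on current path / finished
    color = [WHITE] * n_nodes
    cycles: list[list[int]] = []

    def dfs(node: int, path: list[int]) -> None:
        color[node] = GRAY
        path.append(node)
        for nxt in adj[node]:
            if color[nxt] == GRAY:       # back edge onto the current path
                cycles.append(path[path.index(nxt):] + [nxt])
            elif color[nxt] == WHITE:
                dfs(nxt, path)
        path.pop()
        color[node] = BLACK

    for start in range(n_nodes):
        if color[start] == WHITE:
            dfs(start, [])
    return cycles

# Nodes 0..3 stand for P1..P4: progression P1 -> P2 -> P3 -> P4,
# plus P4's covalent back-link to P1, which closes the loop.
print(find_cycles(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
```

A step that appears in no cycle and has no path to the conclusion is a candidate dangling node; the same adjacency structure supports that check too.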
Render. The output is an ASCII directed graph in the terminal. Nodes are reasoning steps. Edges are bonds, typed and color-coded. Structural problems are highlighted. A JSON artifact sits underneath for programmatic use.
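A toy version of the renderer, nodes first and typed edges after. The `==>` and `->` arrows follow the legend in the figure below; the `~~>` glyph for hydrogen bonds is invented for this sketch, since the figure does not show one.

```python
# "==>" and "->" match the figure's legend; "~~>" for hydrogen bonds is
# an assumed glyph for this sketch only.
ARROWS = {"covalent": "==>", "hydrogen": "~~>", "van_der_waals": "->"}

def render(steps: list[str], edges: list[tuple[int, int, str]]) -> str:
    """Plain-text rendering: node list first, then typed edges."""
    lines = [f"[S{i + 1}] {text}" for i, text in enumerate(steps)]
    lines.append("")
    lines += [f"[S{src + 1}] {ARROWS[bond]} [S{dst + 1}]"
              for src, dst, bond in edges]
    return "\n".join(lines)

print(render(
    ["n is even.", "Therefore n = 2k.", "Wait, require k > 0."],
    [(0, 1, "covalent"), (1, 2, "hydrogen")],
))
```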
Closed-Loop Reasoning Topology (Real Harvested Trace)
Nodes:
[P1] Learning improves critical thinking.
[P2] P2 supports P1.
[P3] P3 supports P2.
[P4] P4 supports P3 and explicitly supports P1.
Primary logical edges (covalent, ==>):
[P2] ==> [P1]
[P3] ==> [P2]
[P4] ==> [P3]
[P4] ==> [P1] <-- explicit back-link that closes the loop
Progression edges (van der Waals, ->):
[P1] -> [P2] -> [P3] -> [P4]
Detected structural cycles:
1) [P1] -> [P2] ==> [P1]
2) [P1] -> [P2] -> [P3] -> [P4] ==> [P1]
Legend:
==> strong logical dependency (covalent)
-> weak progression association (van der Waals)
Figure (above). Trace-topology on a real harvested transcript with an explicitly induced closed loop. The model is prompted to construct labeled points (P1-P4) where the final point must also justify the first. The analyzer reconstructs strong dependency edges (==>) and weaker progression edges (->), then detects circular support structures. This example demonstrates the core claim of the project: reasoning failures can be identified as structural properties of the trace, not only as semantic mistakes.
The text is the thought
There’s a philosophical wrinkle worth addressing. Is it fair to call this “reasoning”? Are we debugging “thought”?
The Unaskable Question Machine pushed us toward a simpler answer: for a transformer, there is no thought that precedes the text in a form we can inspect directly. The internal representations exist to produce the next token. The trace is the record we have.
That’s why we called this tool trace-topology and not thought-topology. We’re honest about what the input is. It’s a trace. A recording. And the topology is the shape of its connections. Whether that constitutes “real” reasoning is a question for philosophers. Whether the structure is sound is a question for this tool.
Where it connects
trace-topology sits at an intersection of several threads we’ve been pulling on.
Intelligence Beyond Autocomplete argued for topology-native graph operators as an underbuilt substrate for AI systems. trace-topology is a concrete implementation of that idea — reasoning analyzed as a graph, not paraphrased as prose.
AI That Refuses to Predict made the case that auditable structure beats persuasive language. A bad transition in a state machine is visible. A missing edge in a causal graph is inspectable. A contradiction in constraints is detectable. trace-topology applies that principle to the reasoning traces themselves.
It also connects back to The Unaskable Question Machine. Its “crack” responses — structural breakdowns under impossible questions — were the first useful pathological test cases for what broken topology looks like.
What we don’t know yet
This is early. The honest list of open questions:
Segmentation is hard. Splitting continuous prose into logical steps is a judgment call. Keyword heuristics (“therefore”, “wait”, “actually”) get you started, but miss the subtle transitions. The current parser is a first draft and it acts like one.
Bond classification is approximate. Distinguishing a covalent bond from a van der Waals force requires understanding whether two statements are logically dependent or just topically related. Heuristics give you the obvious cases. The ambiguous middle — which is where the interesting structures live — needs either a human annotator or an LLM judge, both of which have their own failure modes.
We don’t know the base rate. What does “normal” topology look like for correct reasoning? How much cycle structure is acceptable? (Some circular reinforcement might be fine. Pure circularity isn’t.) We need more annotated transcripts to establish baselines, and annotation is slow.
Long traces are dense. A 10,000-token thinking block might contain 50-100 logical steps. The ASCII graph gets unwieldy. Hierarchical views — clustering steps into phases first, then drilling into individual phase topology — are a clear next step but not built yet.
What to do with this
Add structural analysis to your eval pipeline. You’re probably evaluating LLM reasoning by whether the final answer is correct. trace-topology evaluates whether the path to the answer is structurally sound. These are different things. A correct answer reached through broken reasoning is a lucky accident, not a reliable system.
Use topology as a curation signal. If you're choosing traces for fine-tuning data, RAG examples, or agentic planning corpora, structural soundness is a better filter than fluency: it catches outputs that sound coherent but do not hold together.
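A curation filter might look like the sketch below. The `findings` schema and the scoring rule are hypothetical — the tool's actual JSON artifact and any built-in score may differ — but the shape of the pipeline is the point: score structure, then threshold.

```python
# Hypothetical findings shape and scoring rule, for illustration only;
# trace-topology's real JSON artifact may use different field names.
def structural_score(findings: dict) -> float:
    """1.0 = no flagged steps; lower as more steps are cyclic or dangling."""
    n = findings["n_steps"]
    flagged = set(findings["cyclic_steps"]) | set(findings["dangling_steps"])
    return 1.0 - len(flagged) / n if n else 0.0

def curate(traces: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep only traces whose reasoning structure is mostly sound."""
    return [t for t in traces if structural_score(t["findings"]) >= threshold]

clean = {"id": "a",
         "findings": {"n_steps": 10, "cyclic_steps": [], "dangling_steps": [1]}}
loopy = {"id": "b",
         "findings": {"n_steps": 10, "cyclic_steps": [2, 3, 4], "dangling_steps": [5, 6]}}
print([t["id"] for t in curate([clean, loopy])])
```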
Debug your prompts. When a prompt reliably produces bad reasoning, the topology map shows you where. Maybe the model always abandons a specific thread. Maybe it always goes circular on a particular sub-question. The structural view tells you what to fix in the prompt, rather than guessing.
The repo is at github.com/stack-research/trace-topology.