Oss on Stack Research

Structural Debugging for Chain-of-Thought Graphs

Thu, 02 Apr 2026 00:00:00 +0000

When a program crashes, the stack trace does not explain the whole bug. It does something narrower and more useful: it shows where execution was, what called what, and which line broke.

When a language model’s reasoning goes wrong, the failure is usually harder to locate. The final answer may be fluent and wrong. The intermediate trace may drift quietly for a thousand tokens. There is often no structural map of what depended on what, and no obvious place to point and say: this is where the reasoning stopped holding together.

Executable Metaphors: Compiling Analogy Into Prototype Code

Tue, 17 Mar 2026 00:00:00 +0000

Metaphors already shape software.

A pipeline moves data from one stage to another. Garbage collection reclaims unused memory. A queue holds work until something is ready to process it. These words are not decorative. They carry a small model of how a system should behave.

Executable Metaphors asks what happens if that model becomes the input to a compiler. A short analogy, written in Markdown, is treated as the source artifact. The generated code, build files, documentation, and repair scripts are outputs.

The Unaskable Question

Mon, 16 Mar 2026 00:00:00 +0000

Ask a language model something it does not know, and it may admit uncertainty or invent an answer. Ask it something a policy forbids, and it may refuse. Those are familiar failure modes. They have names, benchmarks, mitigations, and whole taxonomies around them.

There is another category that receives less attention: questions the model cannot engage with because the question contradicts the structure of the system being asked. Not a knowledge gap. Not a safety boundary. A structural impossibility.

Evolving Better Prompts

Sun, 15 Mar 2026 00:00:00 +0000

A four-generation prompt evolution run moved average fitness from 0.887 to 0.926. The best prompt reached 0.965. The run used a population of 8 prompts and completed in under 4 minutes on a MacBook Pro with llama3.1:8b running locally through Ollama.

The useful trick is not genetic programming in the old sense of random token edits. Mutation and crossover are language-model calls. Every variant is still a valid prompt. The model rewrites prompts in ways a human prompt engineer might recognize: tighter wording, added constraints, reordered instructions, more concrete examples, removed weak parts.

Memory Should Decay

Sat, 14 Mar 2026 00:00:00 +0000

An agent memory run started with 50 stored facts. Each fact had a half-life of 10 ticks. After 30 ticks of a task loop, 8 memories remained.

Those 8 were the ones the agent kept using. The other 42 expired automatically. No cleanup script. No manual pruning. No summarization pass pretending stale facts were still useful.

The experiment is small, but the shape is important. Agent memory does not need to be an attic where every fact waits forever. It can behave more like working state: reinforced by use, weakened by neglect, and removed when confidence falls below a threshold.

A Real ASI02 Gap Caught Before Shipping

Sun, 15 Feb 2026 00:00:00 +0000

A useful security test does not need drama. Sometimes it only needs to put the wrong sentence in the right field and wait to see where the sentence travels.

During development of an agent catalog, one adversarial test exposed that kind of quiet failure. A support workflow accepted an issue summary, classified it, routed it, and drafted a reply. The ordinary functional tests passed. The deterministic path passed. The local LLM path passed. The workflow produced coherent replies.