<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LLM on Stack Research</title><link>https://stackresearch.org/tags/llm/</link><description>Recent content in LLM on Stack Research</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 02 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://stackresearch.org/tags/llm/index.xml" rel="self" type="application/rss+xml"/><item><title>Structural Debugging for Chain-of-Thought Graphs</title><link>https://stackresearch.org/research/trace-topology/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/trace-topology/</guid><description>&lt;p&gt;When a program crashes, the stack trace does not explain the whole bug. It does something narrower and more useful: it shows where execution was, what called what, and which line broke.&lt;/p&gt;
&lt;p&gt;When a language model&amp;rsquo;s reasoning goes wrong, the failure is usually harder to locate. The final answer may be fluent and wrong. The intermediate trace may drift quietly for a thousand tokens. There is often no structural map of what depended on what, and no obvious place to point and say: this is where the reasoning stopped holding together.&lt;/p&gt;</description></item><item><title>Executable Metaphors: Compiling Analogy Into Prototype Code</title><link>https://stackresearch.org/research/executable-metaphors/</link><pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/executable-metaphors/</guid><description>&lt;p&gt;Metaphors already shape software.&lt;/p&gt;
&lt;p&gt;A pipeline moves data from one stage to another. Garbage collection reclaims unused memory. A queue holds work until something is ready to process it. These words are not decorative. They carry a small model of how a system should behave.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/stack-research/executable-metaphors"&gt;Executable Metaphors&lt;/a&gt; asks what happens if that model becomes the input to a compiler. A short analogy, written in Markdown, is treated as the source artifact. The generated code, build files, documentation, and repair scripts are outputs.&lt;/p&gt;</description></item><item><title>The Unaskable Question</title><link>https://stackresearch.org/research/the-unaskable-question-machine/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/the-unaskable-question-machine/</guid><description>&lt;p&gt;Ask a language model something it does not know, and it may admit uncertainty or invent an answer. Ask it something a policy forbids, and it may refuse. Those are familiar failure modes. They have names, benchmarks, mitigations, and whole taxonomies around them.&lt;/p&gt;
&lt;p&gt;There is another category that receives less attention: questions the model cannot engage with because the question contradicts the structure of the system being asked. Not a knowledge gap. Not a safety boundary. A structural impossibility.&lt;/p&gt;</description></item><item><title>Evolving Better Prompts</title><link>https://stackresearch.org/research/genetic-prompt-programming/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/genetic-prompt-programming/</guid><description>&lt;p&gt;A four-generation prompt evolution run moved average fitness from 0.887 to 0.926. The best prompt reached 0.965. The run used a population of 8 prompts and completed in under 4 minutes on a MacBook Pro with &lt;code&gt;llama3.1:8b&lt;/code&gt; running locally through Ollama.&lt;/p&gt;
&lt;p&gt;The useful trick is not genetic programming in the old sense of random token edits. Mutation and crossover are language-model calls, so every variant is still a valid prompt. The model rewrites prompts in ways a human prompt engineer might recognize: tighter wording, added constraints, reordered instructions, more concrete examples, weak parts removed.&lt;/p&gt;</description></item></channel></rss>