<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Oss on Stack Research</title><link>https://stackresearch.org/categories/oss/</link><description>Recent content in Oss on Stack Research</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 02 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://stackresearch.org/categories/oss/index.xml" rel="self" type="application/rss+xml"/><item><title>Structural Debugging for Chain-of-Thought Graphs</title><link>https://stackresearch.org/research/trace-topology/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/trace-topology/</guid><description>&lt;p&gt;When a program crashes, the stack trace does not explain the whole bug. It does something narrower and more useful: it shows where execution was, what called what, and which line broke.&lt;/p&gt;
&lt;p&gt;When a language model&amp;rsquo;s reasoning goes wrong, the failure is usually harder to locate. The final answer may be fluent and wrong. The intermediate trace may drift quietly for a thousand tokens. There is often no structural map of what depended on what, and no obvious place to point and say: this is where the reasoning stopped holding together.&lt;/p&gt;</description></item><item><title>Executable Metaphors: Compiling Analogy Into Prototype Code</title><link>https://stackresearch.org/research/executable-metaphors/</link><pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/executable-metaphors/</guid><description>&lt;p&gt;Metaphors already shape software.&lt;/p&gt;
&lt;p&gt;A pipeline moves data from one stage to another. Garbage collection reclaims unused memory. A queue holds work until something is ready to process it. These words are not decorative. They carry a small model of how a system should behave.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/stack-research/executable-metaphors"&gt;Executable Metaphors&lt;/a&gt; asks what happens if that model becomes the input to a compiler. A short analogy, written in Markdown, is treated as the source artifact. The generated code, build files, documentation, and repair scripts are outputs.&lt;/p&gt;</description></item><item><title>The Unaskable Question</title><link>https://stackresearch.org/research/the-unaskable-question-machine/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/the-unaskable-question-machine/</guid><description>&lt;p&gt;Ask a language model something it does not know, and it may admit uncertainty or invent an answer. Ask it something a policy forbids, and it may refuse. Those are familiar failure modes. They have names, benchmarks, mitigations, and whole taxonomies around them.&lt;/p&gt;
&lt;p&gt;There is another category that receives less attention: questions the model cannot engage with because the question contradicts the structure of the system being asked. Not a knowledge gap. Not a safety boundary. A structural impossibility.&lt;/p&gt;</description></item><item><title>Evolving Better Prompts</title><link>https://stackresearch.org/research/genetic-prompt-programming/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/genetic-prompt-programming/</guid><description>&lt;p&gt;A four-generation prompt evolution run moved average fitness from 0.887 to 0.926. The best prompt reached 0.965. The run used a population of 8 prompts and completed in under 4 minutes on a MacBook Pro with &lt;code&gt;llama3.1:8b&lt;/code&gt; running locally through Ollama.&lt;/p&gt;
&lt;p&gt;The useful trick is not genetic programming in the old sense of random token edits. Mutation and crossover are language-model calls. Every variant is still a valid prompt. The model rewrites prompts in ways a human prompt engineer might recognize: tighter wording, added constraints, reordered instructions, more concrete examples, removed weak parts.&lt;/p&gt;</description></item><item><title>Memory Should Decay</title><link>https://stackresearch.org/research/memory-should-decay/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/memory-should-decay/</guid><description>&lt;p&gt;An agent memory run started with 50 stored facts. Each fact had a half-life of 10 ticks. After 30 ticks of a task loop, 8 memories remained.&lt;/p&gt;
&lt;p&gt;Those 8 were the ones the agent kept using. The other 42 expired automatically. No cleanup script. No manual pruning. No summarization pass pretending stale facts were still useful.&lt;/p&gt;
&lt;p&gt;The experiment is small, but the shape is important. Agent memory does not need to be an attic where every fact waits forever. It can behave more like working state: reinforced by use, weakened by neglect, and removed when confidence falls below a threshold.&lt;/p&gt;</description></item><item><title>A Real ASI02 Gap Caught Before Shipping</title><link>https://stackresearch.org/research/a-real-asi02-gap-we-caught-before-shipping/</link><pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/a-real-asi02-gap-we-caught-before-shipping/</guid><description>&lt;p&gt;A useful security test does not need drama. Sometimes it only needs to put the wrong sentence in the right field and wait to see where the sentence travels.&lt;/p&gt;
&lt;p&gt;During development of an agent catalog, one adversarial test exposed that kind of quiet failure. A support workflow accepted an issue summary, classified it, routed it, and drafted a reply. The ordinary functional tests passed. The deterministic path passed. The local LLM path passed. The workflow produced coherent replies.&lt;/p&gt;</description></item></channel></rss>