<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Engineering on Stack Research</title><link>https://stackresearch.org/categories/engineering/</link><description>Recent content in Engineering on Stack Research</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 28 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://stackresearch.org/categories/engineering/index.xml" rel="self" type="application/rss+xml"/><item><title>Making Agents Aware of Agentic Risk</title><link>https://stackresearch.org/research/agentic-risk-awareness/</link><pubDate>Tue, 28 Apr 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/agentic-risk-awareness/</guid><description>&lt;p&gt;A capable agent can fail in two very different ways.&lt;/p&gt;
&lt;p&gt;The first is loud. It breaks a rule, calls the wrong tool, or says something obviously false. You can see it.&lt;/p&gt;
&lt;p&gt;The second is quiet. It forms a plausible plan on bad assumptions, keeps moving, and leaves a trail of reasonable-looking steps that point to the wrong place. That one is harder. It looks like progress until the consequences arrive.&lt;/p&gt;</description></item><item><title>Agent Incident Response Needs a Measurable Drill</title><link>https://stackresearch.org/research/agent-incident-drill/</link><pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/agent-incident-drill/</guid><description>&lt;p&gt;Agent incident response needs a clock, a journal, and a stopping point.&lt;/p&gt;
&lt;p&gt;Without those three things, incident response remains theatrical. A bad action happens, someone opens logs, someone reconstructs intent, someone asks whether the system could have been stopped sooner. The answers arrive after the important interval has already passed.&lt;/p&gt;
&lt;p&gt;The useful question is narrower: can a controlled agent failure be made measurable while it is happening?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://stackresearch.org/research/control-ops/"&gt;ControlOps&lt;/a&gt; built the parts: scope validation, decision lineage, blast-radius assessment, and kill-path auditing. The drill described here connects those parts around one small incident. It does not prove that agent systems are safe. It proves something more modest and more useful: one proposed action can be checked, stopped, recorded, scored, and prepared for rollback before it becomes an invisible state change.&lt;/p&gt;</description></item><item><title>Artifact Intake Boundaries for Agentic Systems</title><link>https://stackresearch.org/research/artifact-intake-boundaries-for-agentic-systems/</link><pubDate>Sun, 05 Apr 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/artifact-intake-boundaries-for-agentic-systems/</guid><description>&lt;p&gt;Agentic systems do not only ingest prompts. They ingest files.&lt;/p&gt;
&lt;p&gt;A reasoning trace arrives for debugging. A benchmark archive is downloaded for evaluation. A support export is added to a retrieval corpus. A set of examples is copied into a training library. Each object may look like ordinary text, but it becomes active as soon as it is unpacked, parsed, rendered, indexed, transformed, or passed to another tool.&lt;/p&gt;
&lt;p&gt;That makes artifact intake a security boundary.&lt;/p&gt;</description></item><item><title>Structural Debugging for Chain-of-Thought Graphs</title><link>https://stackresearch.org/research/trace-topology/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/trace-topology/</guid><description>&lt;p&gt;When a program crashes, the stack trace does not explain the whole bug. It does something narrower and more useful: it shows where execution was, what called what, and which line broke.&lt;/p&gt;
&lt;p&gt;When a language model&amp;rsquo;s reasoning goes wrong, the failure is usually harder to locate. The final answer may be fluent and wrong. The intermediate trace may drift quietly for a thousand tokens. There is often no structural map of what depended on what, and no obvious place to point and say: this is where the reasoning stopped holding together.&lt;/p&gt;</description></item><item><title>Why Agent Memory Needs a Control Plane</title><link>https://stackresearch.org/research/why-agent-memory-needs-a-control-plane/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/why-agent-memory-needs-a-control-plane/</guid><description>&lt;p&gt;In an end-to-end memory governance scenario, a migrated record was present in the store but denied by default retrieval. The data existed, but policy correctly kept it out of the agent&amp;rsquo;s active context. That behavior sounds strict until a real system shows how quickly &amp;ldquo;just store it&amp;rdquo; turns into stale, unsafe memory that is hard to audit.&lt;/p&gt;
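&lt;p&gt;The separation matters mechanically: presence in the store and eligibility for retrieval are two different checks. A minimal sketch of that shape in Python, using a hypothetical policy gate rather than Agentic Memory Fabric&amp;rsquo;s actual API:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Record:
    key: str
    value: str
    lineage: str            # provenance, e.g. "migration-v2" (illustrative)

def retrieve(store, key, policy):
    # Presence in the store is necessary but not sufficient: the policy
    # decides whether the record may enter active context. policy.allows()
    # is a hypothetical interface standing in for the runtime check.
    record = store.get(key)
    if record is None:
        return None         # the data was never there
    if not policy.allows(record):
        return None         # the data exists, but default retrieval denies it
    return record
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The denied-by-default branch is the one the scenario exercised: the migrated record took the second &lt;code&gt;return None&lt;/code&gt;, not the first.&lt;/p&gt;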
&lt;p&gt;That gap is why &lt;a href="https://github.com/stack-research/agentic-memory-fabric"&gt;Agentic Memory Fabric&lt;/a&gt; is a control plane for memory, not another retrieval wrapper. The point is simple: memory used by agents should be treated like governed infrastructure, with clear lineage and retrieval policy enforced at runtime.&lt;/p&gt;</description></item><item><title>Executable Metaphors: Compiling Analogy Into Prototype Code</title><link>https://stackresearch.org/research/executable-metaphors/</link><pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/executable-metaphors/</guid><description>&lt;p&gt;Metaphors already shape software.&lt;/p&gt;
&lt;p&gt;A pipeline moves data from one stage to another. Garbage collection reclaims unused memory. A queue holds work until something is ready to process it. These words are not decorative. They carry a small model of how a system should behave.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/stack-research/executable-metaphors"&gt;Executable Metaphors&lt;/a&gt; asks what happens if that model becomes the input to a compiler. A short analogy, written in Markdown, is treated as the source artifact. The generated code, build files, documentation, and repair scripts are outputs.&lt;/p&gt;</description></item><item><title>The Unaskable Question</title><link>https://stackresearch.org/research/the-unaskable-question-machine/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/the-unaskable-question-machine/</guid><description>&lt;p&gt;Ask a language model something it does not know, and it may admit uncertainty or invent an answer. Ask it something a policy forbids, and it may refuse. Those are familiar failure modes. They have names, benchmarks, mitigations, and whole taxonomies around them.&lt;/p&gt;
&lt;p&gt;There is another category that receives less attention: questions the model cannot engage with because the question contradicts the structure of the system being asked. Not a knowledge gap. Not a safety boundary. A structural impossibility.&lt;/p&gt;</description></item><item><title>Evolving Better Prompts</title><link>https://stackresearch.org/research/genetic-prompt-programming/</link><pubDate>Sun, 15 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/genetic-prompt-programming/</guid><description>&lt;p&gt;A four-generation prompt evolution run moved average fitness from 0.887 to 0.926. The best prompt reached 0.965. The run used a population of 8 prompts and completed in under 4 minutes on a MacBook Pro with &lt;code&gt;llama3.1:8b&lt;/code&gt; running locally through Ollama.&lt;/p&gt;
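&lt;p&gt;The generation loop itself is small; what fills the operators is the interesting part, as the next paragraph explains. A minimal sketch in Python, assuming &lt;code&gt;fitness&lt;/code&gt;, &lt;code&gt;mutate&lt;/code&gt;, and &lt;code&gt;crossover&lt;/code&gt; are supplied as functions (the names are illustrative, not the project&amp;rsquo;s API):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random

def evolve(population, fitness, mutate, crossover, generations=4):
    # Each generation keeps the fittest half of the population and
    # refills the rest with children produced by the two operators.
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: len(ranked) // 2]
        children = []
        for _ in range(len(population) - len(parents)):
            a, b = random.sample(parents, 2)
            child = crossover(a, b)       # merge two parent prompts
            if random.random() &gt; 0.5:     # coin flip: also rewrite the child
                child = mutate(child)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
&lt;/code&gt;&lt;/pre&gt;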
&lt;p&gt;The useful trick is not genetic programming in the old sense of random token edits. Mutation and crossover are language-model calls. Every variant is still a valid prompt. The model rewrites prompts in ways a human prompt engineer might recognize: tighter wording, added constraints, reordered instructions, more concrete examples, removed weak parts.&lt;/p&gt;</description></item><item><title>ControlOps: Letting Machines Talk</title><link>https://stackresearch.org/research/control-ops/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/control-ops/</guid><description>&lt;p&gt;An autonomous system should not be judged only by the moment when it answers. The answer is the visible surface. Beneath it there are quieter questions: who allowed this action, which evidence shaped it, how far could the failure travel, and how quickly could the system be stopped?&lt;/p&gt;
&lt;p&gt;These questions are often asked after the fact. A runbook is opened. A trace is reconstructed. Someone searches logs for the decision that mattered. The machine has already acted, and the organization is trying to recover the shape of the action from its shadow.&lt;/p&gt;</description></item><item><title>Memory Should Decay</title><link>https://stackresearch.org/research/memory-should-decay/</link><pubDate>Sat, 14 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/research/memory-should-decay/</guid><description>&lt;p&gt;An agent memory run started with 50 stored facts. Each fact had a half-life of 10 ticks. After 30 ticks of a task loop, 8 memories remained.&lt;/p&gt;
&lt;p&gt;Those 8 were the ones the agent kept using. The other 42 expired automatically. No cleanup script. No manual pruning. No summarization pass pretending stale facts were still useful.&lt;/p&gt;
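&lt;p&gt;The mechanics fit in a few lines. A minimal sketch of the shape in Python, with exponential decay, reinforcement on use, and an assumed expiry cutoff of 0.2 (illustrative values, not the experiment&amp;rsquo;s code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import math

HALF_LIFE = 10                    # ticks until an untouched fact halves
DECAY = math.log(2) / HALF_LIFE
CUTOFF = 0.2                      # assumed threshold for expiry

class Fact:
    def __init__(self, text):
        self.text = text
        self.confidence = 1.0

    def tick(self):
        # Exponential decay: confidence halves every HALF_LIFE ticks, so
        # an untouched fact falls below 0.2 between ticks 23 and 24.
        self.confidence *= math.exp(-DECAY)

    def reinforce(self):
        self.confidence = 1.0     # use resets the clock

def sweep(store):
    # Keep only facts still above the cutoff; the rest expire.
    return [f for f in store if f.confidence &gt;= CUTOFF]
&lt;/code&gt;&lt;/pre&gt;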
&lt;p&gt;The experiment is small, but the shape is important. Agent memory does not need to be an attic where every fact waits forever. It can behave more like working state: reinforced by use, weakened by neglect, and removed when confidence falls below a threshold.&lt;/p&gt;</description></item><item><title>Build for the Hour After Failure</title><link>https://stackresearch.org/editorial/build-for-the-hour-after-failure/</link><pubDate>Sun, 08 Mar 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/editorial/build-for-the-hour-after-failure/</guid><description>&lt;p&gt;At 4 a.m., the model is rarely the whole problem. The missing recovery path is.&lt;/p&gt;
&lt;p&gt;Agent systems are often designed around the moment before action: the prompt, the tool schema, the evaluator, the approval check, the confidence score. Those pieces matter. They shape whether the system should act at all. But the harder question arrives after a bad action has already crossed the boundary into production.&lt;/p&gt;
&lt;p&gt;What stops next? What is still allowed to run? Which identity was used? Which records changed? Which downstream systems trusted the result? Which part can be reversed, and which part can only be compensated for?&lt;/p&gt;</description></item><item><title>Software That Expires</title><link>https://stackresearch.org/editorial/software-that-expires/</link><pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate><guid>https://stackresearch.org/editorial/software-that-expires/</guid><description>&lt;p&gt;Software accumulates by default.&lt;/p&gt;
&lt;p&gt;Features go in. Compatibility layers remain. Old state keeps its place because removing it feels riskier than carrying it. A temporary endpoint becomes a customer dependency. A migration flag survives long after the migration. A data field whose meaning has changed three times continues to answer because some quiet part of the system still asks for it.&lt;/p&gt;
&lt;p&gt;The usual word for this is technical debt, but debt is too clean a metaphor. Debt has a lender, a balance, and a date on the bill. Software decay is less orderly. It is closer to sediment. Each layer is understandable when it lands, and opaque once enough layers have settled above it.&lt;/p&gt;</description></item></channel></rss>