Metaphors already shape software.

A pipeline moves data from one stage to another. Garbage collection reclaims unused memory. A queue holds work until something is ready to process it. These words are not decorative. They carry a small model of how a system should behave.

Executable Metaphors asks what happens if that model becomes the input to a compiler. A short analogy, written in Markdown, is treated as the source artifact. The generated code, build files, documentation, and repair scripts are outputs.

That does not make metaphor a replacement for programming. It makes metaphor a useful compression layer for prototyping: a way to test whether a mental model implies a coherent architecture before committing to a conventional implementation.

The Claim, Narrowly

The useful claim is not that analogy should become a production programming language. That would be too broad, and the evidence is not there.

The narrower claim is more practical: an analogy can be a compact architecture brief. It can suggest components, data flow, state retention, naming, interfaces, and failure behavior in a form that a language model can unpack into a first-pass project.

For example:

a doorman who remembers every face but forgets names after an hour

That sentence implies more than a greeting function. It suggests identity recognition, temporary memory, retention policy, and behavior that changes as time passes. The compiler’s job is to translate those implied structures into a generated project that can be inspected, installed, run, and repaired.
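As a rough illustration, the sketch below shows one shape such a generated component might take. It is not output from the compiler: the `Doorman` class, the `greet` method, and the `name_ttl_seconds` parameter are hypothetical names chosen for this example, and the greeting strings are invented.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class Doorman:
    """Remembers every face indefinitely; treats names as forgotten after a TTL."""

    def __init__(
        self,
        name_ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._name_ttl = name_ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        # face_id -> (name or None, time the name was last learned)
        self._faces: Dict[str, Tuple[Optional[str], float]] = {}

    def greet(self, face_id: str, name: Optional[str] = None) -> str:
        now = self._clock()
        known = face_id in self._faces
        if name is not None:
            # A (re)introduction refreshes the name and its timestamp.
            self._faces[face_id] = (name, now)
        elif not known:
            # Faces are never dropped, even when no name is given.
            self._faces[face_id] = (None, now)

        stored_name, learned_at = self._faces[face_id]
        name_is_fresh = stored_name is not None and now - learned_at < self._name_ttl
        if name_is_fresh:
            prefix = "Welcome back" if known else "Nice to meet you"
            return f"{prefix}, {stored_name}."
        if known:
            return "I know your face, but not your name."
        return "Hello, stranger."
```

Even this small sketch forces decisions the sentence leaves open, such as whether a repeat introduction resets the hour. Those are the kinds of questions the interpret stage has to answer.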

The generated code is not sacred. It is a build artifact.

The Build Pipeline

The pipeline has four stages.

metaphor.md -> interpret -> generate -> scaffold -> validate -> builds/project/

| Stage | Input | Output | Method |
| --- | --- | --- | --- |
| Interpret | Metaphor text | architecture.json with components, language, dependencies, and file layout | LLM call |
| Generate | Architecture JSON | Source files under src/ | LLM call per file |
| Scaffold | Architecture JSON | Makefile, README.md, AGENTS.md, manifest, and metadata | Deterministic templates |
| Validate | Complete build | A working prototype or a patched retry | make install, make run, and repair loop |

The interpret stage is the important hinge. It converts a suggestive sentence into a structured architecture. The compiler asks the model to name components, choose dependencies, describe files, and explain how the pieces relate.
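To make the interpret output concrete, here is a minimal sketch of what an architecture document and a structural check on it could look like. The real architecture.json schema is not shown in this article, so the key names (`components`, `language`, `dependencies`, `files`) and the per-entry fields (`name`, `role`, `path`, `component`) are assumptions, as is the `parse_architecture` helper.

```python
import json
from typing import Any, Dict

# Assumed top-level keys; the actual schema may differ.
REQUIRED_KEYS = ("components", "language", "dependencies", "files")

# A hypothetical interpretation of the doorman metaphor.
EXAMPLE_ARCHITECTURE = """
{
  "language": "python",
  "dependencies": [],
  "components": [
    {"name": "FaceRegistry", "role": "remembers every face indefinitely"},
    {"name": "NameCache", "role": "forgets names after one hour"}
  ],
  "files": [
    {"path": "src/face_registry.py", "component": "FaceRegistry"},
    {"path": "src/name_cache.py", "component": "NameCache"}
  ]
}
"""


def parse_architecture(raw: str) -> Dict[str, Any]:
    """Parse interpreter output and reject structurally incomplete results."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("architecture must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"architecture is missing keys: {', '.join(missing)}")
    # Every planned file should trace back to a named component.
    component_names = {component["name"] for component in data["components"]}
    for entry in data["files"]:
        if entry["component"] not in component_names:
            raise ValueError(f"{entry['path']} refers to an unknown component")
    return data
```

Checking the structure before file generation keeps a malformed interpretation from turning into several wasted per-file model calls.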

The generate stage then uses that architecture and the original metaphor as context for file-level generation. The metaphor remains present, so the model is not only following a schema. It is still anchored to the design image that produced the schema.

The scaffold stage is deliberately less imaginative. Build files, metadata, assistant context, and project manifests come from templates. That avoids a common failure in code-generation systems: the source code may be plausible while the generated Makefile, run command, or dependency setup points at the wrong thing.
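A template-based scaffold step can be very small. The sketch below renders a Makefile from an architecture dictionary without any model call. The `render_makefile` function, the `entry_point` key, and the exact recipe commands are assumptions made for this example, not the project's actual templates.

```python
from typing import Any, Dict


def render_makefile(architecture: Dict[str, Any]) -> str:
    """Render a Makefile from fixed text plus two architecture fields."""
    if architecture["language"] != "python":
        # This sketch only carries a Python template.
        raise ValueError(f"no template for language {architecture['language']!r}")
    entry_point = architecture["entry_point"]  # hypothetical key, e.g. "src/main.py"
    lines = [
        ".PHONY: install run fix test",
        "",
        "install:",
        "\tpython -m pip install -r requirements.txt",
        "",
        "run:",
        f"\tpython {entry_point}",
        "",
        "fix:",
        "\tpython fix.py",
        "",
        "test:",
        "\tpython -m pytest",
        "",
    ]
    # Recipe lines start with a tab, which make requires by default.
    return "\n".join(lines)
```

Because the function is pure, the same architecture always yields the same Makefile, and the run command cannot drift away from the declared entry point.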

Validation Is Part of Generation

The compiler does not stop after writing files. It tries to run the project.

make install
make run

Installation catches dependency mistakes: wrong package names, missing libraries, or generated imports that do not match the declared environment. Runtime execution catches the next layer of errors: missing arguments, bad method names, incomplete files, and API mismatches.

When validation fails, the traceback and generated source files are passed back into a repair step. The compiler patches the project and retries, up to a bounded number of attempts.

This loop matters because first-pass generated code often fails in ordinary ways. A model may remember an API shape that no longer exists. It may choose a package name that is close but not installable. It may generate two files whose interfaces almost, but not quite, agree.

The repair loop does not solve deep design errors. It is not a substitute for review. Its value is mechanical: it converts many shallow generation failures into visible, repeatable build failures and gives the system a bounded chance to correct them.
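The control flow of a bounded validate-and-repair loop can be sketched as follows. This is a simplified illustration, not the compiler's implementation: `validate_with_repair`, `run_command`, and the `repair` callback are hypothetical names, and the callback receives only the captured error text, whereas the real repair step is described as also receiving the generated source files.

```python
import subprocess
from typing import Callable, Sequence, Tuple

Command = Sequence[str]
RunResult = Tuple[int, str]  # (exit code, captured stderr)


def run_command(cmd: Command) -> RunResult:
    """Run one validation command and capture its error output."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stderr


def validate_with_repair(
    commands: Sequence[Command],
    repair: Callable[[str], None],
    max_attempts: int = 3,
    run: Callable[[Command], RunResult] = run_command,
) -> bool:
    """Return True once every command succeeds, False when attempts run out."""
    for attempt in range(1, max_attempts + 1):
        failure = None
        for cmd in commands:
            code, stderr = run(cmd)
            if code != 0:
                failure = stderr
                break  # later commands depend on earlier ones succeeding
        if failure is None:
            return True
        if attempt < max_attempts:
            # Hand the failure to the repair step, then retry from the start.
            repair(failure)
    return False
```

A call such as `validate_with_repair([["make", "install"], ["make", "run"]], repair=patch_project)` would mirror the two commands above, where `patch_project` stands in for whatever applies the model's patch. The fixed attempt limit is what keeps the loop mechanical rather than open-ended.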

Each generated project also includes:

  • metaphor.md, the source analogy
  • architecture.json, the interpreted architecture
  • build_config.json, the provider and model used for compilation
  • manifest.json, project metadata and revision history
  • Makefile, with commands such as make run, make install, make fix, and make test
  • fix.py, a local repair script for later failures
  • README.md, generated project documentation
  • AGENTS.md, assistant-facing context for future coding work
  • src/, the generated source tree

The AGENTS.md artifact is especially useful in this workflow. If an AI coding assistant later opens the generated project, it receives the intended structure directly instead of inferring the system from raw files alone.

Test Artifacts

The repository includes a local pytest suite for the compiler mechanics. These are not benchmark results, and they do not prove that arbitrary metaphors produce useful software. They do show which parts of the system have been made testable without relying on a live model.

At the time of migration, the suite contains tests across six files:

| Test area | What it checks |
| --- | --- |
| Compilation pipeline | A mock provider can compile a metaphor into metaphor.md, architecture.json, source files, docs, AGENTS.md, and a deterministic Makefile. |
| Refactor flow | Updating the metaphor records a new revision, rewrites the canonical source, and regenerates project docs. |
| Project manager | Project creation, artifact writing, nested source paths, history, sorted listing, and deletion behavior. |
| Provider layer | Ollama and Anthropic provider configuration, API payload shape, timeout handling, connection errors, missing model guidance, and missing API key behavior. |
| Parser utilities | Markdown fence stripping, provider-context prompts, malformed JSON handling, and best-effort repair for truncated JSON. |
| Fix script | Source gathering, file patching, dependency updates, config loading, runtime failure detection, timeout classification, and repair JSON parsing. |
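The parser utilities are the easiest of these to picture in isolation. The sketch below shows one possible approach to fence stripping and best-effort repair of truncated JSON. The function names are hypothetical and the repair strategy (closing any open string, then any open brackets) is an assumption; the project's own heuristics may differ.

```python
import json
import re
from typing import Any, List


def strip_fences(text: str) -> str:
    """Remove a leading and trailing Markdown code fence, if present."""
    body = text.strip()
    body = re.sub(r"^```[\w-]*[ \t]*\n", "", body)  # opening fence with optional tag
    body = re.sub(r"\n?```$", "", body)  # closing fence, absent when truncated
    return body.strip()


def close_truncated_json(text: str) -> str:
    """Best effort: close an unterminated string and any open brackets."""
    closers: List[str] = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    if in_string:
        if escaped:
            text = text[:-1]  # drop a dangling backslash
        text += '"'
    return text + "".join(reversed(closers))


def parse_model_json(raw: str) -> Any:
    """Parse model output as JSON, attempting one repair before giving up."""
    body = strip_fences(raw)
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        pass
    try:
        return json.loads(close_truncated_json(body))
    except json.JSONDecodeError as exc:
        raise ValueError(f"model response is not recoverable JSON: {exc}") from exc
```

This repair only helps when the response is cut off inside a string or right after a complete value. A response that ends on a trailing comma or a key without a value still fails, which is the intended outcome: a visible error rather than a silently invented structure.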

The most important test is the mock-provider compile path. It verifies the architectural promise without needing a live LLM: the interpreter returns architecture JSON, file generation writes source under src/, deterministic scaffolding creates the build surface, and the provider call count stays bounded because the Makefile is not generated by the model.
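The idea behind that test can be shown with a toy version. Everything in the sketch below is hypothetical: `MockProvider`, `compile_metaphor`, the prompt strings, and the `files` key are stand-ins that only illustrate how a canned provider makes the call count checkable. The repository's real test and compiler interfaces are not reproduced here.

```python
import json
from typing import Dict, List


class MockProvider:
    """Returns canned responses in order and records every prompt it receives."""

    def __init__(self, responses: List[str]) -> None:
        self._responses = list(responses)
        self.calls: List[str] = []

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self._responses.pop(0)


def compile_metaphor(metaphor: str, provider: MockProvider) -> Dict[str, str]:
    """Toy compile path: one interpret call, then one call per source file."""
    architecture = json.loads(provider.complete(f"Interpret: {metaphor}"))
    artifacts = {
        "metaphor.md": metaphor + "\n",
        "architecture.json": json.dumps(architecture, indent=2),
    }
    for path in architecture["files"]:
        # The metaphor stays in the prompt alongside the file being requested.
        artifacts[path] = provider.complete(f"Metaphor: {metaphor}\nWrite {path}")
    # Scaffolding comes from a fixed template, so it costs no provider call.
    artifacts["Makefile"] = "run:\n\tpython src/main.py\n"
    return artifacts


def test_compile_uses_one_call_per_model_step() -> None:
    files = ["src/main.py", "src/doorman.py"]
    provider = MockProvider([json.dumps({"files": files}), "# main", "# doorman"])
    artifacts = compile_metaphor("a doorman who remembers every face", provider)
    assert len(provider.calls) == 1 + len(files)
    assert set(files) <= set(artifacts)
    assert "Makefile" in artifacts
```

If someone later moved Makefile generation into the model, the call count would rise and this assertion would fail, which is the property the test exists to protect.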

That is a modest but useful standard. The project has tests for the machinery around generation. It still needs empirical runs across multiple real metaphors and models before making claims about success rate or architecture quality.

Refactoring by Rewriting the Source

Executable Metaphors treats the analogy as the canonical source. Refactoring therefore happens by changing the metaphor and rebuilding.

Current metaphor:
a doorman who remembers every face but forgets names after an hour

New metaphor:
a librarian who files visitors by mood, not name, and purges records at closing time

This is intentionally blunt. The compiler does not preserve manual edits to generated code. It regenerates the project from the higher-level source.
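The bookkeeping half of a refactor can be sketched briefly. The function below rewrites metaphor.md and appends an entry to manifest.json, leaving the rebuild itself to the normal pipeline. The `record_refactor` name and the fields inside each revision entry are assumptions for illustration; the actual manifest layout is not documented here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def record_refactor(project_dir: Path, new_metaphor: str) -> Dict[str, Any]:
    """Rewrite metaphor.md and append a revision entry to manifest.json."""
    source_path = project_dir / "metaphor.md"
    manifest_path = project_dir / "manifest.json"

    manifest: Dict[str, Any] = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    revisions = manifest.setdefault("revisions", [])

    previous = None
    if source_path.exists():
        previous = source_path.read_text(encoding="utf-8").strip()

    revisions.append(
        {
            "revision": len(revisions) + 1,
            "previous_metaphor": previous,
            "metaphor": new_metaphor,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }
    )

    # The metaphor file is the canonical source, so it is overwritten in place.
    source_path.write_text(new_metaphor + "\n", encoding="utf-8")
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Keeping the previous metaphor in the history is what makes the blunt regenerate-everything model tolerable: an earlier design can be restored by recompiling its sentence.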

That makes the approach unsuitable for mature systems with hand-tuned behavior. It is much better suited to early exploration, where the question is still architectural: should this system behave like a doorman, a librarian, a switchboard, a ledger, or a checkpoint?

Compiling several analogies gives the developer something concrete to compare. The output may reveal which mental model produces a cleaner interface, a simpler state model, or a more obvious failure boundary.

Provider-Aware Generation

One practical design choice is that generated projects retain provider context. If a generated program needs LLM calls itself, the compiler can generate code against the same provider family selected for compilation, rather than silently assuming a different hosted API.

The provider and model are recorded in build_config.json. That record matters for later repair. If the project breaks and make fix runs, the repair script can use the same generation context that produced the original files.

This does not make output deterministic. Different models, and even repeated runs with the same model, can interpret the same metaphor differently. But the build record gives the prototype a traceable origin.
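A build record of this kind needs very little structure. The sketch below assumes build_config.json holds just a `provider` and a `model` field; the real file may carry more, and the `BuildConfig` class and the placeholder model name are inventions for this example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BuildConfig:
    """Provenance for a generated project: which provider and model built it."""

    provider: str
    model: str

    @classmethod
    def from_json(cls, raw: str) -> "BuildConfig":
        data = json.loads(raw)
        try:
            return cls(provider=data["provider"], model=data["model"])
        except KeyError as exc:
            raise ValueError(f"build_config.json is missing {exc.args[0]!r}") from exc

    def to_json(self) -> str:
        # Sorted keys keep the file stable across rewrites.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# "example-model" is a placeholder, not a real model identifier.
config = BuildConfig(provider="ollama", model="example-model")
restored = BuildConfig.from_json(config.to_json())
```

A repair script that loads this record first can fail early, with a clear message, when the provenance is missing instead of quietly falling back to a different provider.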

What This Is Good For

The strongest use case is rapid architectural exploration.

One analogy can be compiled, installed, run, inspected, and discarded. Another can replace it. The result is not a finished product; it is a working sketch. That sketch can expose whether the metaphor carries enough structure to be useful.

This also makes the limits of LLM code generation easier to see. The model is not simply completing syntax. It is translating from one representation into another: analogy to architecture, architecture to files, files to an executable project. Each translation can lose information. Each one can also surface hidden assumptions.

The approach is strongest when the metaphor is specific enough to imply behavior:

  • What enters the system?
  • What state is retained?
  • What expires?
  • What is routed, rejected, transformed, or remembered?
  • What failure should look local rather than global?

Vague metaphors produce vague architectures. Operational metaphors produce inspectable prototypes.

Limitations

The method has sharp edges.

  • Generated code quality is model-dependent.
  • The metaphor-to-architecture step is fragile and can vary across runs.
  • Ambiguous metaphors produce divergent interpretations.
  • Refactoring is all-or-nothing because generated files are replaced.
  • The repair loop handles shallow execution failures better than wrong algorithms or bad product logic.
  • Local model responses can truncate on large files or complex projects.

Those limitations are not incidental. They define the boundary of the tool. Executable Metaphors is useful when the cost of throwing away a generated prototype is low. It is risky when the generated output starts to accumulate production responsibility before being reviewed and maintained as ordinary software.

Practical Takeaways

Several design lessons survive even if metaphor-driven prototyping remains a niche workflow.

  • Treat higher-level intent as a real artifact, not only as a comment near the code.
  • Keep generated code disposable until it has been reviewed and adopted.
  • Generate deterministic scaffolding where possible; let the model focus on source files and architecture interpretation.
  • Run generated code immediately, because validation is part of generation.
  • Record provider and model context so later repair has provenance.
  • Use repair loops for mechanical failures, not for unbounded trust.

The central idea is modest: a metaphor can be made executable enough to test.

That is a different standard from being correct, complete, or production-ready. A good prototype only has to answer the next question clearly. In this case, the question is whether an analogy carries enough architecture to build from. Sometimes it does. When it does not, the failed build is useful evidence too.