Executable Metaphors: A Compiler Where the Source Code Is an Analogy

Stack Research
OSS Engineering

We built a compiler that turns natural-language metaphors into running programs. The metaphor is the source of truth; the code is the build artifact.

We built a system where you describe a program as an analogy — “a doorman who remembers every face but forgets names after an hour” — and it compiles that into source code, a Makefile, documentation, and a self-healing repair loop. The metaphor lives in a markdown file. The code lives in a build directory. To refactor, you rewrite the metaphor and recompile.

This is Executable Metaphors, a Python tool that treats natural language analogy as the canonical source of a program, and generated code as a disposable artifact.

The idea

Programming languages are already metaphors. We say “garbage collection” and mean memory reclamation. We say “pipeline” and mean sequential data transformation. But these metaphors are frozen — they’re baked into the language spec, and the developer has to translate from intent to implementation manually.

What if the metaphor itself was the program? Not pseudocode, not a spec, not a comment at the top of a file. The actual thing the compiler reads.

The hypothesis is straightforward: LLMs are good enough at code generation that the bottleneck isn’t writing code — it’s deciding what to write. A metaphor compresses a set of architectural decisions into a single sentence. “A librarian who files things by mood, not topic” implies a classification system, a storage layer, a non-standard taxonomy, and an interface for querying by affect. The LLM’s job is to unpack that compression into components and then into code.

How the compiler works

The compilation pipeline has four phases:

metaphor.md → [Interpret] → [Generate] → [Scaffold] → [Validate] → builds/project/
| Phase | Input | Output | Method |
| --- | --- | --- | --- |
| Interpret | Metaphor text | architecture.json — components, language, dependencies, file layout | LLM call |
| Generate | Architecture JSON | Source files in src/ | LLM call per file |
| Scaffold | Architecture JSON | Makefile, README.md, AGENTS.md, manifest | Deterministic templates |
| Validate | Complete build | Verified build, or patched build | make install + make run + LLM repair loop |

The interpret phase is where most of the interesting work happens. The LLM receives the metaphor and returns a structured JSON architecture: what language to use, what files to create, what each component does, what dependencies to install. This is the moment where “a doorman who remembers every face but forgets names after an hour” becomes a concrete spec — a Python service with a TTL-based in-memory cache, a face-recognition component, and a greeting interface.
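As a hedged illustration of what that spec might look like (field names here are guesses, not the tool's actual schema), the doorman metaphor could interpret to something like:

```json
{
  "language": "python",
  "components": [
    {"name": "face_store", "role": "TTL-based in-memory cache keyed by face embedding"},
    {"name": "recognizer", "role": "matches incoming faces against the store"},
    {"name": "greeter", "role": "greets known visitors; names expire after one hour"}
  ],
  "dependencies": ["opencv-python"],
  "files": ["src/face_store.py", "src/recognizer.py", "src/greeter.py", "src/main.py"]
}
```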

The generate phase takes that architecture and produces actual source files. One LLM call per file, with the architecture and the metaphor both in context. The metaphor stays in the prompt throughout — the LLM isn’t just implementing a spec, it’s implementing a spec anchored to an analogy that carries implicit design intent.
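A minimal sketch of this phase (function and field names are assumptions, not the tool's actual API) — the key point is that the metaphor rides along in every per-file prompt:

```python
import json

def generate_sources(architecture: dict, metaphor: str, llm) -> dict:
    """One LLM call per file; the metaphor stays in context so the
    generated code is anchored to the analogy, not just the JSON spec."""
    sources = {}
    for path in architecture["files"]:
        prompt = (
            f"Metaphor: {metaphor}\n"
            f"Architecture: {json.dumps(architecture)}\n"
            f"Write the complete contents of {path}."
        )
        sources[path] = llm(prompt)  # llm: str -> str, provider-agnostic
    return sources
```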

The scaffold phase is deterministic. The Makefile, manifest, and metadata files are generated from templates, not by the LLM. This avoids a common failure mode where LLM-generated Makefiles point to wrong paths or use inconsistent run commands.
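A deterministic scaffold step might look like this (the template contents are illustrative, not the tool's real templates):

```python
from pathlib import Path

# Illustrative Makefile template; real templates would cover more targets.
MAKEFILE_TEMPLATE = """\
install:
\tpip install {deps}

run:
\t{run_command}

fix:
\tpython fix.py
"""

def scaffold(architecture: dict, build_dir: Path) -> None:
    """Phase 3: build files come from templates, never from the LLM,
    so paths and run commands stay consistent across builds."""
    build_dir.mkdir(parents=True, exist_ok=True)
    makefile = MAKEFILE_TEMPLATE.format(
        deps=" ".join(architecture.get("dependencies", [])),
        run_command=architecture["run_command"],
    )
    (build_dir / "Makefile").write_text(makefile)
```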

Self-healing validation

Phase 4 is where things get practical. After generating code, the compiler actually runs it:

  1. make install — catches bad package names (the LLM might emit OpenCV when pip needs opencv-python)
  2. make run — catches runtime errors: ImportError, AttributeError, missing arguments
  3. On failure, the full traceback plus all source files go back to the LLM for diagnosis
  4. The LLM patches the code and the compiler retries, up to 3 attempts

This matters because LLM-generated code frequently doesn’t run on the first pass. Misremembered API signatures, hallucinated function names, version-incompatible imports. The repair loop catches most of these mechanically.
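The loop above can be sketched like this (a hedged illustration: the helper names and the patch format returned by the LLM are assumptions):

```python
import subprocess
from pathlib import Path

def run_make(target: str, cwd: Path):
    """Run a make target; return (succeeded, combined output)."""
    proc = subprocess.run(
        ["make", target], cwd=cwd, capture_output=True, text=True
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

def validate_and_repair(build_dir: Path, sources: dict, llm,
                        max_attempts: int = 3) -> bool:
    """Phase 4: run the build; on failure, send the traceback plus all
    sources back to the LLM and apply its patched files, then retry."""
    for _ in range(max_attempts):
        for target in ("install", "run"):
            ok, output = run_make(target, build_dir)
            if not ok:
                break
        if ok:
            return True
        # Assumed patch format: {path: new_file_contents} for changed files
        patches = llm(error=output, sources=sources)
        for path, body in patches.items():
            sources[path] = body
            (build_dir / path).write_text(body)
    return False
```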

Every compiled project also ships with fix.py and a build_config.json that records which provider and model built it. If the project breaks after compile, you run make fix in the build directory and it enters a 5-attempt repair loop using the same LLM that originally generated the code:

cd builds/my-project
make fix

The fix script captures the crash traceback, sends it along with all source files to the LLM, applies the patches, and retries. It’s not magic — it won’t fix deep architectural problems — but it handles the class of errors where the code is 95% right and needs a missing import or a corrected method signature.

Refactoring by rewriting the metaphor

This is the part we find most interesting. Traditional refactoring operates on code: rename this function, extract this class, move this module. The intent behind the change lives in the developer’s head.

With executable metaphors, refactoring operates on the metaphor. make refactor prompts you to rewrite the analogy for an existing project:

Current metaphor: a doorman who remembers every face but forgets names after an hour
New metaphor: a librarian who files visitors by mood, not name, and purges records at closing time

The compiler re-runs the full pipeline against the new metaphor. The architecture changes — maybe the storage model shifts from a TTL cache to a categorized log with a scheduled purge. The code regenerates from scratch.

This is a blunt instrument. You lose all manual edits to the generated code. But that’s the point — the metaphor is the source of truth. If you need to change the code, you change the metaphor. If the metaphor doesn’t support the change you need, the metaphor isn’t specific enough.

In practice, this works better for prototyping and exploration than for production systems. You can iterate on program architecture at the speed of natural language, testing whether an analogy produces code that actually does what you intend. Once you’ve found the right shape, you take the generated code as a starting point and maintain it conventionally.

Provider-aware code generation

One design detail worth calling out: when the generated program itself needs LLM capabilities — say the metaphor implies a chatbot or a vision system — the compiler generates code that uses the same provider the user chose for compilation.

If you compile with Ollama, generated code uses the local Ollama HTTP API. If you compile with Anthropic, generated code uses the anthropic SDK with claude-sonnet-4-6. The provider context is injected into the generation prompt via templates in prompts.py, including the actual model name.
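The injection itself is simple string templating; here is a hedged sketch (the snippet text and names are illustrative, not the contents of prompts.py):

```python
# Illustrative provider-context snippets; not the tool's actual prompts.py.
PROVIDER_CONTEXT = {
    "ollama": (
        "If the generated program needs an LLM, call the local Ollama HTTP API "
        "at http://localhost:11434/api/generate with model '{model}'."
    ),
    "anthropic": (
        "If the generated program needs an LLM, use the anthropic Python SDK "
        "with model '{model}'."
    ),
}

def provider_context(provider: str, model: str) -> str:
    """Inject the compile-time provider and model into generation prompts,
    so generated code targets the backend the user is already running."""
    return PROVIDER_CONTEXT[provider].format(model=model)
```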

This avoids a common problem with LLM code generation tools where the generated code defaults to OpenAI regardless of what the user is actually running.

What a build looks like

Every compiled project gets a consistent set of artifacts:

| File | Purpose |
| --- | --- |
| metaphor.md | The metaphor — canonical source of truth |
| architecture.json | LLM's interpretation: components, language, files, dependencies |
| build_config.json | Provider and model used to compile |
| manifest.json | Project metadata and revision history |
| Makefile | make run, make install, make fix, make test |
| fix.py | Self-healing repair script |
| README.md | Human-readable documentation (LLM-generated) |
| AGENTS.md | AI assistant context for the project |
| src/ | Generated source code |

The AGENTS.md file is a nice touch — if you later point an AI coding assistant at the build directory, it gets structured context about what the project does and how it’s organized, without having to infer it from the code.

Running it

The default setup uses Ollama with local models, which keeps iteration fast and free:

make new

The interactive session prompts for project name, provider, model, and metaphor. Then it compiles:

Project name: greeter
Provider [ollama]:
Model [llama3.1:8b]:
Metaphor: a doorman who remembers every face but forgets names after an hour

For better results on Apple Silicon, qwen2.5-coder:32b is the recommended local model. For the best results, select the Anthropic provider with claude-sonnet-4-6 (requires ANTHROPIC_API_KEY).

Limitations

  • Generated code quality is model-dependent. llama3.1:8b produces working code for simple metaphors but struggles with complex multi-component architectures. Larger models (qwen2.5-coder:32b, claude-sonnet-4-6) produce significantly better results; the gap between small and large models is substantial.
  • The metaphor-to-architecture step is fragile. The same metaphor can produce different architectures across runs. There’s no guarantee of determinism, and some metaphors are genuinely ambiguous. “A river that sorts its own fish” could mean half a dozen different things architecturally.
  • Refactoring is all-or-nothing. Rewriting the metaphor regenerates everything. There’s no incremental compilation, no diff-and-patch. If you’ve manually edited the generated code, those edits are gone.
  • Self-healing has limits. The repair loop handles syntax errors, import mistakes, and API misuse well. It doesn’t handle deeper issues like wrong algorithms, incorrect business logic, or architectural problems that require rethinking the metaphor.
  • Truncation from local models. Ollama responses sometimes get truncated, especially for large files. The compiler includes an auto-repair step that attempts to close open JSON brackets, but it doesn’t always recover gracefully.
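A minimal version of that bracket-closing repair might look like the following (a simplification of what a production repair step must handle; it closes an unterminated string and then appends missing closers in reverse order):

```python
import json

def close_truncated_json(text: str) -> str:
    """Best-effort repair for truncated JSON: close any open string,
    then append the missing closing brackets in reverse nesting order."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            if stack:
                stack.pop()
    if in_string:
        text += '"'
    return text + "".join(reversed(stack))
```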

What this is actually for

We don’t think metaphor-driven development replaces conventional programming. The generated code is a starting point, not a finished product.

What it’s good for is rapid architectural exploration. You can test five different mental models of a system in the time it takes to manually scaffold one. “Is this more like a switchboard or a mailroom?” becomes a question you can answer empirically — compile both, run both, compare the generated architectures.

It’s also a useful lens for thinking about what LLM code generation is actually doing. The models aren’t just autocompleting syntax. They’re translating from one representation (natural language description) to another (source code). Making the input representation explicitly metaphorical forces you to think about what information that translation needs, and what gets lost.

Practical takeaways

  1. Metaphors compress architectural intent. A well-chosen analogy encodes component relationships, data flow patterns, and behavioral constraints in a way that LLMs can unpack into concrete implementations.
  2. Self-healing build loops belong in every code generation pipeline. Generating code that doesn’t run is a waste of the generation step. A repair loop that feeds errors back to the LLM catches most first-pass failures mechanically.
  3. Provider-aware generation avoids API mismatch. If your tool generates code that calls LLMs, the generated code should use the same backend the user is already running.
  4. Deterministic scaffolding prevents drift. Let the LLM generate source code, but generate Makefiles, configs, and metadata from templates. LLM-generated build files are a consistent source of subtle bugs.
  5. Treat generated code as a build artifact. When you stop treating generated code as precious — when it’s something you recompile from a higher-level source — iteration speed increases dramatically.

The repo is at github.com/stack-research/executable-metaphors.