Evolving Better Prompts

Stack Research
oss engineering

An open source tool that breeds, mutates, and selects LLM prompts across generations.

We ran a genetic algorithm on a population of 8 prompts for 4 generations. The average fitness score started at 0.887 and ended at 0.926. The best prompt reached 0.965. The whole run took under 4 minutes on a MacBook Pro with llama3.1:8b running locally via Ollama.

The trick that makes it work: mutation and crossover are LLM calls, not random character edits. Every variant the algorithm produces is a valid, semantically meaningful prompt. The LLM rewrites prompts the way a human would — rephrasing for conciseness, adding constraints, restructuring ordering — except it does it systematically across a population under selection pressure.

This is Genetic Prompt Programming, a Python library and CLI that treats prompt optimization as an evolutionary search problem.

The problem with single-pass prompt generation

Tools like Anthropic’s prompt generator are good. You describe your task, a meta-prompt produces a polished output, and you’re done. For most use cases, that’s enough.

But it’s a one-shot guess. The meta-prompt encodes general best practices, not your specific task dynamics. There’s no feedback signal — no way to know whether the output is actually optimal for your inputs, or just well-formatted.

We wanted something that could answer a different question: given a specific task and specific test inputs, what’s the empirically best prompt?

How it works

The algorithm follows standard evolutionary structure, with one key difference in the operators:

Seed prompts → initial population
      ↓
  [Evaluate] — LLM judge scores each prompt against your task + test inputs
      ↓
  [Select]   — tournament selection favors high-fitness prompts
      ↓
  [Reproduce] — crossover combines two parents; mutation rewrites one
      ↓
  repeat for N generations → return best prompt

Component   Implementation
Selection   Tournament selection (k=3)
Crossover   LLM combines best elements of two parent prompts
Mutation    LLM applies one of 8 semantic rewrite operators
Fitness     LLM-as-judge, or bring your own scoring function
Elitism     Top 2 carry forward unchanged each generation

The population starts from seed prompts. If you provide fewer seeds than the population size, the engine generates variants by mutating the seeds you gave it.
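
The loop above can be sketched in a few lines of Python. This is a simplified illustration, not the engine's actual code: `fitness`, `mutate`, and `crossover` are stand-ins for the real operators, and the defaults mirror the configuration described in this post.

```python
import random

def tournament_select(population, fitness, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    contenders = random.sample(population, k)
    return max(contenders, key=fitness)

def evolve(seeds, fitness, mutate, crossover, pop_size=8, generations=4,
           elite_count=2, crossover_rate=0.3):
    # Fill the initial population by mutating the seeds if too few were given.
    population = list(seeds)
    while len(population) < pop_size:
        population.append(mutate(random.choice(seeds)))

    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:elite_count]  # elitism: top 2 carry forward unchanged
        while len(next_gen) < pop_size:
            if random.random() < crossover_rate:
                a = tournament_select(population, fitness)
                b = tournament_select(population, fitness)
                next_gen.append(crossover(a, b))
            else:
                next_gen.append(mutate(tournament_select(population, fitness)))
        population = next_gen

    return max(population, key=fitness)
```

With cheap string-level stand-ins for the operators, this skeleton runs without any LLM at all, which makes the loop itself easy to test.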

The LLM-as-operator twist

In traditional genetic programming, mutation is random — flip a bit, swap a token, insert noise. Most mutations produce garbage. The search works because the few viable mutations get selected, but it’s inefficient.

Here, every mutation is an LLM call. The model receives the original prompt and a specific rewrite instruction, chosen randomly from this set:

  • Rephrase for conciseness
  • Add a constraint or requirement
  • Change tone or framing
  • Remove the weakest part
  • Add a concrete example
  • Restructure ordering
  • Increase specificity
  • Add chain-of-thought instructions

The crossover operator works the same way — an LLM call that receives two parent prompts and combines their strongest elements into a child.
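
In sketch form, the instruction sent to the model looks something like this. The wording here is illustrative, not the library's exact template; the real one lives in operators.py.

```python
def build_crossover_prompt(parent_a: str, parent_b: str) -> str:
    """Build the instruction sent to the LLM to combine two parent prompts.

    Illustrative wording only; the actual template is in operators.py.
    """
    return (
        "You are a genetic crossover operator. Combine the strongest elements "
        "of the two parent prompts below into a single child prompt.\n\n"
        "Do NOT explain — output ONLY the child prompt.\n\n"
        f"PARENT A:\n{parent_a}\n\nPARENT B:\n{parent_b}"
    )
```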

From operators.py:

def mutate(individual: Individual, config: EvolutionConfig) -> Individual:
    mutation_instruction = random.choice(mutation_types)  # one of the 8 rewrite operators listed above

    client = make_client(config.base_url)
    response = client.messages.create(
        model=config.model,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"You are a genetic mutation operator. {mutation_instruction}\n\n"
                "Do NOT explain — output ONLY the mutated prompt.\n\n"
                f"ORIGINAL PROMPT:\n{individual.prompt}"
            ),
        }],
    )

    mutated_prompt = _strip_thinking(response.content[0].text)
    return Individual(
        prompt=mutated_prompt,
        parent_ids=[individual.individual_id],
        mutation_history=individual.mutation_history + [mutation_instruction],
    )

The _strip_thinking call handles reasoning models (like Qwen 3) that emit <think>...</think> blocks before their answer. The algorithm works with reasoning models out of the box.

Every individual tracks its full lineage — which seed it descended from, which operators were applied, which parents contributed to crossover. At the end of a run, you can trace exactly how the best prompt evolved.
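
A minimal version of that bookkeeping might look like this. Field names follow the article's code excerpt; the real model lives in models.py, and the `lineage` helper is a hypothetical illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import uuid

@dataclass
class Individual:
    prompt: str
    individual_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_ids: List[str] = field(default_factory=list)
    mutation_history: List[str] = field(default_factory=list)

def lineage(best: Individual, by_id: Dict[str, Individual]) -> List[str]:
    """Walk parent links back to the seed, returning ids oldest-first."""
    trail: List[str] = []
    current: Optional[Individual] = best
    while current is not None:
        trail.append(current.individual_id)
        pid = current.parent_ids[0] if current.parent_ids else None
        current = by_id.get(pid) if pid else None
    return list(reversed(trail))
```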

Fitness evaluation

The default fitness function is LLM-as-judge. For each prompt in the population, the evaluator:

  1. Runs the prompt against every test input to produce outputs
  2. Asks a judge model to score each output on your criteria (0.0 to 1.0)
  3. Averages the scores across test inputs

A typical judge configuration:
evaluator = LLMJudge(
    task_description="Summarize text concisely while capturing all key points.",
    test_inputs=["The James Webb Space Telescope has discovered...", ...],
    scoring_criteria=(
        "Score based on: (1) conciseness, (2) completeness, "
        "(3) clarity, (4) no hallucination."
    ),
    model="llama3.1:8b",
)

The judge model defaults to the same model used for mutation and crossover, but you can set judge_model separately if you want a stronger model scoring a weaker one’s output.

If LLM-as-judge doesn’t fit your use case, you can plug in any scoring function:

from genetic_prompt_programming.fitness import CustomFitness

evaluator = CustomFitness(lambda prompt: your_scoring_function(prompt))

The FitnessEvaluator protocol is a single method — evaluate(prompt: str) -> float — so anything that returns a number between 0 and 1 works.
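
That contract can be expressed as a typing.Protocol. This is a sketch of the shape described above, not the definition from fitness.py, and `KeywordFitness` is a toy evaluator invented for illustration.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class FitnessEvaluator(Protocol):
    def evaluate(self, prompt: str) -> float: ...

class KeywordFitness:
    """Toy evaluator: fraction of required keywords present in the prompt."""
    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def evaluate(self, prompt: str) -> float:
        hits = sum(1 for k in self.keywords if k in prompt.lower())
        return hits / len(self.keywords)
```

Anything with a matching `evaluate` method satisfies the protocol, so the engine never needs to know whether scores come from an LLM judge or a deterministic metric.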

The interactive CLI

The quickest way to try it is the interactive mode:

genetic-prompt-programming evolve --interactive

It walks you through setup: describe your task, review LLM-generated seed prompts (or provide your own), review generated test inputs (or paste your own). Then it runs evolution with a live terminal UI built on Rich:

  • Progress bar per generation
  • Current best prompt displayed in real time
  • Fitness sparkline history (▁▂▃▄▅▆▇█)
  • Latest evaluated prompt and its score
  • Full lineage trail for the current best (e.g., seed → rephrase → crossover → add-constraint)

At the end, you get a fitness curve, a ranked table of the top 3 prompts, and the best prompt printed as clean text for copying.

For scripted use, the CLI takes everything as flags:

genetic-prompt-programming evolve \
  --task "Write a haiku about the topic given." \
  --seed "Write a haiku:" \
  --seed "Compose a 5-7-5 syllable poem about:" \
  --test-input "autumn leaves" \
  --test-input "morning coffee" \
  --population 10 \
  --generations 5 \
  --output best_prompt.json

Results

We ran the included summarization example (examples/evolve_summarizer.py) with llama3.1:8b on an M3 Max MacBook Pro, using native Ollama for GPU-accelerated inference via Metal.

Generation   Best Fitness   Avg Fitness
1            0.915          0.887
2            0.935          0.904
3            0.950          0.918
4            0.965          0.926

Configuration: population of 8, mutation rate 0.7, crossover rate 0.3, elite count 2, tournament size 3.

The full run completed in approximately 3 minutes 50 seconds. For comparison, the same run under Docker (CPU-only on macOS) took around 15 minutes — roughly a 4x slowdown from losing GPU acceleration.

These are LLM-as-judge scores, so the absolute numbers should be taken with the usual caveats about self-evaluation. What matters more is the trend: fitness consistently climbed across generations, and the best prompt at generation 4 was measurably different from the seeds we started with. The algorithm found rewrite strategies — adding specificity constraints, restructuring instruction order — that we wouldn’t have tried manually.

Architecture

The codebase is small:

src/genetic_prompt_programming/
├── models.py      # Individual, Generation, EvolutionConfig
├── engine.py      # EvolutionEngine — the evolutionary loop
├── operators.py   # tournament_select, crossover, mutate
├── fitness.py     # LLMJudge, CustomFitness, FitnessEvaluator protocol
├── client.py      # Anthropic SDK client factory
├── interactive.py # Interactive CLI with Rich live UI
└── cli.py         # Click CLI

Two design choices worth noting:

EvolutionObserver protocol. The engine emits events (on_generation_start, on_individual_evaluated, on_generation_end, etc.) through an observer. The interactive UI is one observer implementation. You can write your own — log to a file, push to a dashboard, pipe to a webhook — without touching engine code.
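
A minimal observer might log each event to a list (or a file). The event names follow the article, but the callback signatures here are guesses for illustration; the actual protocol lives alongside engine.py.

```python
class ListLogObserver:
    """Observer sketch: records one line per engine event.

    Signatures are assumptions for illustration, not the library's exact API.
    """
    def __init__(self):
        self.lines = []

    def on_generation_start(self, generation: int) -> None:
        self.lines.append(f"gen {generation} start")

    def on_individual_evaluated(self, prompt: str, fitness: float) -> None:
        self.lines.append(f"scored {fitness:.3f}: {prompt[:40]}")

    def on_generation_end(self, generation: int, best_fitness: float) -> None:
        self.lines.append(f"gen {generation} best {best_fitness:.3f}")
```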

FitnessEvaluator protocol. A single evaluate(prompt: str) -> float method. LLMJudge is the default, CustomFitness wraps any callable. If you have a downstream metric you care about (BLEU score, regex match rate, user click-through), plug it in directly instead of relying on LLM-as-judge.

Limitations

  • Cost per run. Every fitness evaluation requires at least one LLM call (two if using separate judge models), plus one call per mutation and crossover. A run with population 10 over 5 generations makes hundreds of API calls. Running locally with Ollama keeps this free, but it’s not cheap on hosted APIs.
  • LLM-as-judge reliability. The default fitness function is only as good as the judge model’s ability to score outputs consistently. We’ve seen score variance across identical inputs, especially with smaller models. Using a stronger judge model helps, but doesn’t eliminate the problem.
  • Test input quality matters. The algorithm optimizes for performance on your test inputs specifically. If your test inputs aren’t representative of real usage, the evolved prompt may overfit to the test set.
  • Speed. This is inherently slower than single-pass generation. If you need a prompt in 10 seconds, use a prompt generator. If you need the best prompt you can find and have a few minutes, this is the tool.

Quick start

Local with Ollama (free, GPU-accelerated on Apple Silicon)

pip install -e .
brew install ollama
./scripts/setup-local.sh     # starts Ollama, pulls llama3.1:8b
genetic-prompt-programming evolve --interactive

With the Anthropic API

pip install -e .
export ANTHROPIC_API_KEY=sk-...
genetic-prompt-programming evolve --interactive --model claude-sonnet-4-6 --base-url anthropic

As a library

from genetic_prompt_programming.engine import EvolutionEngine
from genetic_prompt_programming.fitness import LLMJudge
from genetic_prompt_programming.models import EvolutionConfig

config = EvolutionConfig(population_size=10, generations=5, model="llama3.1:8b")

evaluator = LLMJudge(
    task_description="Summarize text concisely.",
    test_inputs=["Some long text here..."],
    scoring_criteria="Conciseness, completeness, clarity.",
)

engine = EvolutionEngine(
    config=config,
    fitness_evaluator=evaluator,
    seed_prompts=["Summarize this:", "Provide a brief summary:"],
    task_description="Find the best summarization prompt",
)

best = engine.run()
print(best.prompt, best.fitness)

What’s next

The obvious next step is parallelizing fitness evaluation. Each individual in the population can be scored independently, and right now they’re evaluated sequentially. On a machine with enough memory to run multiple Ollama instances — or against a hosted API with high rate limits — this could cut run times significantly.
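
A sketch of what that could look like with a thread pool. This is not code from the repo; it assumes fitness calls are I/O-bound LLM requests, for which threads (rather than processes) are sufficient.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_population(prompts, evaluate, max_workers=4):
    """Score each prompt concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, prompts))
```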

We’re also interested in whether the mutation operator set itself could be evolved. The current 8 operators were chosen by hand. A meta-evolutionary layer that discovers which operators produce the highest-fitness offspring would close the loop — evolution optimizing its own search strategy.

The repo is at github.com/stack-research/genetic-prompt-programming.