A four-generation prompt evolution run moved average fitness from 0.887 to 0.926. The best prompt reached 0.965. The run used a population of 8 prompts and completed in under 4 minutes on a MacBook Pro with llama3.1:8b running locally through Ollama.
The useful trick is not genetic programming in the old sense of random token edits. Mutation and crossover are language-model calls. Every variant is still a valid prompt. The model rewrites prompts in ways a human prompt engineer might recognize: tighter wording, added constraints, reordered instructions, more concrete examples, removed weak parts.
Genetic Prompt Programming treats prompt optimization as an evolutionary search problem. A population of prompts is evaluated against task inputs, selected by fitness, recombined, mutated, and evaluated again. The result is not one polished prompt from a meta-prompt. It is a prompt that survived several rounds of task-specific selection pressure.
Why Single-Pass Generation Is Not Enough
Prompt generators are useful. A task description goes in, a polished prompt comes out, and many workflows can stop there. The limitation is that single-pass generation is still a guess.
The generated prompt may follow general best practices, but it has not been tested against the specific inputs, outputs, scoring criteria, and failure modes of the task. There is no feedback loop. There is no population. There is no selection.
The question this project asks is narrower and more empirical: given a task and a set of test inputs, which prompt performs best under a scoring function?
That changes prompt writing from composition into search.
The Evolution Loop
The algorithm follows a standard evolutionary shape:
seed prompts
-> initial population
-> evaluate fitness
-> select high performers
-> crossover and mutate
-> repeat for N generations
-> return best prompt
The components are intentionally simple.
| Component | Implementation |
|---|---|
| Selection | Tournament selection with k=3 |
| Crossover | LLM combines the strongest elements of two parent prompts |
| Mutation | LLM applies one semantic rewrite operator |
| Fitness | LLM-as-judge by default, or a custom scoring function |
| Elitism | Top 2 prompts carry forward unchanged each generation |
The population starts from seed prompts. If the user provides fewer seeds than the configured population size, the engine generates variants by mutating the available seeds.
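The loop and components above can be sketched in a few dozen lines. This is a minimal illustration, not the project's engine: the `evolve` and `tournament_select` signatures are assumptions, and the toy `fitness`, `mutate`, and `crossover` callables stand in for the LLM calls so the example runs offline.

```python
import random
from typing import Callable, List, Tuple

Scored = Tuple[str, float]  # (prompt, fitness)


def tournament_select(scored: List[Scored], k: int, rng: random.Random) -> str:
    """Sample k individuals at random and return the fittest one's prompt."""
    contenders = rng.sample(scored, min(k, len(scored)))
    return max(contenders, key=lambda pair: pair[1])[0]


def evolve(
    seeds: List[str],
    fitness: Callable[[str], float],
    mutate: Callable[[str], str],
    crossover: Callable[[str, str], str],
    population_size: int = 8,
    generations: int = 4,
    elite_count: int = 2,
    tournament_size: int = 3,
    crossover_rate: float = 0.3,
    mutation_rate: float = 0.7,
    rng_seed: int = 0,
) -> Scored:
    if not seeds or generations < 1:
        raise ValueError("need at least one seed prompt and one generation")
    rng = random.Random(rng_seed)

    # Fill the population by mutating seeds when too few were supplied.
    population = list(seeds)[:population_size]
    while len(population) < population_size:
        population.append(mutate(rng.choice(seeds)))

    scored: List[Scored] = []
    for generation in range(generations):
        # Evaluate and rank the current population, best first.
        scored = sorted(
            ((p, fitness(p)) for p in population),
            key=lambda pair: pair[1],
            reverse=True,
        )
        if generation == generations - 1:
            break
        # Elitism: the top prompts carry forward unchanged.
        next_population = [p for p, _ in scored[:elite_count]]
        while len(next_population) < population_size:
            child = tournament_select(scored, tournament_size, rng)
            if rng.random() < crossover_rate:
                mate = tournament_select(scored, tournament_size, rng)
                child = crossover(child, mate)
            if rng.random() < mutation_rate:
                child = mutate(child)
            next_population.append(child)
        population = next_population
    return scored[0]


# Toy stand-ins for the LLM-backed pieces (assumptions for illustration only).
def toy_fitness(prompt: str) -> float:
    return min(len(prompt), 120) / 120


def toy_mutate(prompt: str) -> str:
    return prompt + " Be specific."


def toy_crossover(a: str, b: str) -> str:
    return a + " " + b


seeds = ["Summarize this:", "Provide a brief summary:"]
best_prompt, best_score = evolve(seeds, toy_fitness, toy_mutate, toy_crossover)
```

With a deterministic fitness function, elitism guarantees that the best score never drops between generations. With a noisy judge that guarantee no longer holds, because the same prompt can score differently on re-evaluation.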
LLMs as Genetic Operators
Traditional mutation flips bits, swaps tokens, or inserts noise. Most mutations are bad. Evolution works because selection keeps the few useful changes.
Prompt mutation has a different shape. A random character edit usually destroys the prompt. A language-model rewrite can preserve meaning while changing structure. The project uses mutation operators such as:
- Rephrase for conciseness.
- Add a constraint or requirement.
- Change tone or framing.
- Remove the weakest part.
- Add a concrete example.
- Restructure ordering.
- Increase specificity.
- Add reasoning instructions.
The crossover operator is also an LLM call. It receives two parent prompts and returns a child prompt that combines their strongest elements.
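A crossover call needs a message that presents both parents and asks for a single child. The project's actual crossover wording is not reproduced here; the template below is an assumed one, and `build_crossover_message` is a hypothetical helper name.

```python
def build_crossover_message(parent_a: str, parent_b: str) -> str:
    """Build the user message for a crossover call (assumed wording)."""
    return (
        "You are a genetic crossover operator. Combine the strongest elements "
        "of the two parent prompts below into one child prompt.\n\n"
        "Do NOT explain. Output ONLY the child prompt.\n\n"
        f"PARENT A:\n{parent_a}\n\n"
        f"PARENT B:\n{parent_b}"
    )


message = build_crossover_message(
    "Summarize this:",
    "Provide a brief summary that keeps every key point:",
)
```

The returned string would be sent as the `content` of a single user message, and the model's reply becomes the child prompt.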
One core mutation path looks like this:
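import random

# Individual, EvolutionConfig, make_client, _strip_thinking, and mutation_types
# (the list of operator instructions above) are defined elsewhere in the project.


def mutate(individual: Individual, config: EvolutionConfig) -> Individual:
    # Pick one semantic rewrite operator at random.
    mutation_instruction = random.choice(mutation_types)
    client = make_client(config.base_url)
    # Ask the model to apply that operator and return only the rewritten prompt.
    response = client.messages.create(
        model=config.model,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"You are a genetic mutation operator. {mutation_instruction}\n\n"
                "Do NOT explain — output ONLY the mutated prompt.\n\n"
                f"ORIGINAL PROMPT:\n{individual.prompt}"
            ),
        }],
    )
    # Drop any reasoning block the model emitted before the prompt itself.
    mutated_prompt = _strip_thinking(response.content[0].text)
    # Record lineage so the child can be traced back to its parent.
    return Individual(
        prompt=mutated_prompt,
        parent_ids=[individual.individual_id],
        mutation_history=individual.mutation_history + [mutation_instruction],
    )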
The _strip_thinking step handles reasoning models that emit <think>...</think> blocks before the final answer. The evolutionary loop needs the prompt, not the model’s scratch output.
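A helper of this kind can be written with one regular expression. The sketch below is an assumption about how such a function might work, not the project's own `_strip_thinking`; it removes every `<think>...</think>` block and trims the surrounding whitespace.

```python
import re

# Non-greedy match so several separate <think> blocks are each removed.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL | re.IGNORECASE)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks and trim leading/trailing whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


cleaned = strip_thinking("<think>plan the rewrite\nstep by step</think>\nSummarize this:")
```

An unterminated `<think>` tag is left in place by this pattern, so a truncated response would still need separate handling.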
Every individual carries lineage: seed, parent IDs, mutation history, crossover history, and fitness. At the end of a run, the best prompt is inspectable as an artifact, not just a string.
Fitness Evaluation
The default evaluator is LLM-as-judge. For each prompt, the evaluator runs the prompt against test inputs, asks a judge model to score the outputs against criteria, and averages the scores.
evaluator = LLMJudge(
    task_description="Summarize text concisely while capturing all key points.",
    test_inputs=["The James Webb Space Telescope has discovered...", ...],
    scoring_criteria=(
        "Score based on: (1) conciseness, (2) completeness, "
        "(3) clarity, (4) no hallucination."
    ),
    model="llama3.1:8b",
)
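The run-judge-average procedure described above can be sketched independently of any model client. Everything here is an assumption for illustration: `run_model` and `judge` are hypothetical callables, the judge is assumed to reply with a single number between 0 and 1, and unparseable replies are scored as zero.

```python
import re
from statistics import mean
from typing import Callable, Sequence


def parse_score(judge_reply: str) -> float:
    """Extract the first number in the reply and clamp it to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if match is None:
        return 0.0  # assumption: an unparseable reply counts as zero
    return max(0.0, min(1.0, float(match.group())))


def judge_fitness(
    prompt: str,
    test_inputs: Sequence[str],
    run_model: Callable[[str, str], str],
    judge: Callable[[str, str], str],
) -> float:
    """Run the prompt on each input, score each output, and average."""
    if not test_inputs:
        raise ValueError("at least one test input is required")
    scores = []
    for text in test_inputs:
        output = run_model(prompt, text)               # candidate prompt applied to the input
        scores.append(parse_score(judge(text, output)))  # judge's score for that output
    return mean(scores)


# Stub callables so the sketch runs without a model.
def fake_run_model(prompt: str, text: str) -> str:
    return text[:20]


def fake_judge(text: str, output: str) -> str:
    return "0.5" if text.startswith("short") else "Score: 1.0"


fitness = judge_fitness(
    "Summarize this:", ["short text", "a longer passage"], fake_run_model, fake_judge
)
```

Taking the first number in the reply is fragile if the judge restates the numbered criteria before scoring, which is one reason to instruct the judge to return only the score.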
The judge can be the same model used for mutation and crossover, or a separate stronger model. The project also supports custom fitness:
from genetic_prompt_programming.fitness import CustomFitness
evaluator = CustomFitness(lambda prompt: your_scoring_function(prompt))
The protocol is deliberately small: evaluate(prompt: str) -> float. Any task-specific metric that returns a number from 0 to 1 can replace the judge.
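A structural protocol with that one method might look like the sketch below. The `FitnessEvaluator` body and the `KeywordCoverage` class are assumptions written for illustration, not the project's code, and the keyword metric is a deliberately trivial stand-in that scores the prompt text itself rather than model outputs.

```python
from typing import List, Protocol, Sequence


class FitnessEvaluator(Protocol):
    """Anything with evaluate(prompt) -> float in [0, 1] qualifies."""

    def evaluate(self, prompt: str) -> float: ...


class KeywordCoverage:
    """Toy evaluator: fraction of required keywords mentioned in the prompt."""

    def __init__(self, keywords: Sequence[str]) -> None:
        self.keywords = [k.lower() for k in keywords]

    def evaluate(self, prompt: str) -> float:
        if not self.keywords:
            return 0.0
        text = prompt.lower()
        hits = sum(1 for keyword in self.keywords if keyword in text)
        return hits / len(self.keywords)


def score_all(evaluator: FitnessEvaluator, prompts: Sequence[str]) -> List[float]:
    """Score every prompt with any evaluator that satisfies the protocol."""
    return [evaluator.evaluate(p) for p in prompts]


scores = score_all(
    KeywordCoverage(["concise", "key points"]),
    ["Write a concise summary.", "Give a concise summary of the key points."],
)
```

A realistic custom metric would run the prompt against the test inputs and score the outputs, for example with exact match or a task-specific checker.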
The Run
The included summarization example, examples/evolve_summarizer.py, was run with llama3.1:8b through native Ollama on an M3 Max MacBook Pro.
| Generation | Best fitness | Average fitness |
|---|---|---|
| 1 | 0.915 | 0.887 |
| 2 | 0.935 | 0.904 |
| 3 | 0.950 | 0.918 |
| 4 | 0.965 | 0.926 |
Configuration:
| Setting | Value |
|---|---|
| Population | 8 |
| Generations | 4 |
| Mutation rate | 0.7 |
| Crossover rate | 0.3 |
| Elite count | 2 |
| Tournament size | 3 |
The full run completed in approximately 3 minutes and 50 seconds. The same run under Docker on macOS, without GPU acceleration, took around 15 minutes.
The absolute fitness numbers should be read with caution because they come from LLM-as-judge scoring. The trend is the useful result: scores rose across generations, and the best prompt at generation 4 was structurally different from the seeds. The successful changes were recognizable: more specific constraints, clearer ordering, and less ambiguous task framing.
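The trend can be read directly from the table with a little arithmetic. The snippet below uses only the four reported rows; the variable names are illustrative.

```python
# (generation, best fitness, average fitness) as reported in the table.
history = [
    (1, 0.915, 0.887),
    (2, 0.935, 0.904),
    (3, 0.950, 0.918),
    (4, 0.965, 0.926),
]

# Total change from the first to the last generation.
best_gain = round(history[-1][1] - history[0][1], 3)
avg_gain = round(history[-1][2] - history[0][2], 3)

# Generation-over-generation change in each column.
best_steps = [round(b[1] - a[1], 3) for a, b in zip(history, history[1:])]
avg_steps = [round(b[2] - a[2], 3) for a, b in zip(history, history[1:])]

print(f"best: +{best_gain} total, steps {best_steps}")
print(f"average: +{avg_gain} total, steps {avg_steps}")
```

Best fitness rose by 0.050 and average fitness by 0.039 over the run. The per-generation gains in the average shrink from 0.017 to 0.014 to 0.008, which is the kind of flattening worth checking against judge noise before adding more generations.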
CLI and Library Use
The fastest path is interactive mode:
genetic-prompt-programming evolve --interactive
The CLI asks for the task, seed prompts, and test inputs, then runs evolution with a Rich terminal interface. It shows generation progress, current best prompt, fitness history, the latest evaluated prompt, and lineage for the best candidate.
Scripted runs take the same inputs as flags:
genetic-prompt-programming evolve \
  --task "Write a haiku about the topic given." \
  --seed "Write a haiku:" \
  --seed "Compose a 5-7-5 syllable poem about:" \
  --test-input "autumn leaves" \
  --test-input "morning coffee" \
  --population 10 \
  --generations 5 \
  --output best_prompt.json
The library API exposes the same pieces:
from genetic_prompt_programming.engine import EvolutionEngine
from genetic_prompt_programming.fitness import LLMJudge
from genetic_prompt_programming.models import EvolutionConfig

config = EvolutionConfig(population_size=10, generations=5, model="llama3.1:8b")
evaluator = LLMJudge(
    task_description="Summarize text concisely.",
    test_inputs=["Some long text here..."],
    scoring_criteria="Conciseness, completeness, clarity.",
)
engine = EvolutionEngine(
    config=config,
    fitness_evaluator=evaluator,
    seed_prompts=["Summarize this:", "Provide a brief summary:"],
    task_description="Find the best summarization prompt",
)
best = engine.run()
print(best.prompt, best.fitness)
Architecture
The codebase is small enough to inspect quickly:
src/genetic_prompt_programming/
├── models.py # Individual, Generation, EvolutionConfig
├── engine.py # EvolutionEngine
├── operators.py # tournament_select, crossover, mutate
├── fitness.py # LLMJudge, CustomFitness, FitnessEvaluator protocol
├── client.py # Anthropic SDK client factory
├── interactive.py # Rich live UI
└── cli.py # Click CLI
Two interfaces matter most.
EvolutionObserver lets the engine emit events such as generation start, individual evaluation, and generation end. The interactive UI is one observer. A file logger, dashboard, or webhook can be another.
FitnessEvaluator keeps scoring pluggable. If a downstream metric matters more than a judge model, the evolutionary loop can optimize for that directly.
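The observer idea can be illustrated with a small protocol and an in-memory logger. The method names below (`on_generation_start`, `on_individual_evaluated`, `on_generation_end`) and the `run_generation` helper are hypothetical, chosen to match the events described above; the project's real `EvolutionObserver` may differ. The scores passed in are placeholders, not results from the run.

```python
from typing import List, Protocol, Sequence, Tuple


class EvolutionObserver(Protocol):
    """Hypothetical event interface; the project's method names may differ."""

    def on_generation_start(self, generation: int) -> None: ...
    def on_individual_evaluated(self, prompt: str, fitness: float) -> None: ...
    def on_generation_end(self, generation: int, best_fitness: float) -> None: ...


class ListLogger:
    """Observer that records events in memory; a file logger would write them out."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def on_generation_start(self, generation: int) -> None:
        self.events.append(f"start {generation}")

    def on_individual_evaluated(self, prompt: str, fitness: float) -> None:
        self.events.append(f"evaluated {fitness:.3f}")

    def on_generation_end(self, generation: int, best_fitness: float) -> None:
        self.events.append(f"end {generation} best={best_fitness:.3f}")


def run_generation(
    generation: int,
    scored: Sequence[Tuple[str, float]],
    observers: Sequence[EvolutionObserver],
) -> None:
    """Emit the three event types for one already-scored generation."""
    for observer in observers:
        observer.on_generation_start(generation)
    for prompt, fitness in scored:
        for observer in observers:
            observer.on_individual_evaluated(prompt, fitness)
    best = max(fitness for _, fitness in scored)
    for observer in observers:
        observer.on_generation_end(generation, best)


logger = ListLogger()
# Placeholder scores for illustration only.
run_generation(1, [("Summarize this:", 0.8), ("Provide a brief summary:", 0.9)], [logger])
```

Because the engine only depends on the event methods, a terminal UI, a file logger, and a webhook sender can all be attached to the same run.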
Limitations
Cost grows quickly. Every fitness evaluation requires model calls, and mutation and crossover require more. A population of 10 over 5 generations can produce hundreds of calls. Local Ollama makes experimentation cheap; hosted APIs need budgets and rate limits.
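A rough call count makes the budget concrete. The estimate below rests on stated assumptions rather than on the engine's actual accounting: every individual is re-evaluated each generation, each test input costs two calls (one to run the prompt, one to judge the output), and each non-elite child costs one operator call.

```python
def estimate_calls(
    population: int,
    generations: int,
    test_inputs: int,
    elite_count: int = 2,
    calls_per_input: int = 2,
) -> int:
    """Rough model-call count under the assumptions stated above."""
    # Fitness evaluation: every individual, every generation, every test input.
    evaluation_calls = population * generations * test_inputs * calls_per_input
    # Operators run between generations and only for non-elite children.
    operator_calls = (population - elite_count) * (generations - 1)
    return evaluation_calls + operator_calls


total = estimate_calls(population=10, generations=5, test_inputs=2)
print(total)  # 232
```

Under these assumptions, a population of 10 over 5 generations with two test inputs comes to 232 calls. The count scales linearly with the number of test inputs, so a larger evaluation set raises cost faster than a larger population does at a fixed input count.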
LLM-as-judge is noisy. Smaller judge models can vary across identical inputs, and even stronger judges can reward style over task success. Custom metrics are preferable when a reliable downstream metric exists.
Test inputs shape the result. If the test set is narrow, the evolved prompt can overfit to it. A sound evaluation setup needs held-out inputs or repeated runs across different random seeds.
Evolution is slower than single-pass prompt generation. If the task needs a prompt in seconds, a generator is the right tool. Evolution is for cases where a few minutes of search is worth a better candidate.
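One way to keep a held-out set is to split the inputs before evolution starts and score the final prompt on the part the search never saw. The `split_inputs` helper below is a hypothetical utility, not part of the project.

```python
import random
from typing import List, Sequence, Tuple


def split_inputs(
    test_inputs: Sequence[str],
    held_out_fraction: float = 0.3,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically and return (search_inputs, held_out_inputs)."""
    if len(test_inputs) < 2:
        raise ValueError("need at least two inputs to hold one out")
    rng = random.Random(seed)
    shuffled = list(test_inputs)
    rng.shuffle(shuffled)
    # Hold out at least one input, but always leave at least one for the search.
    n_held_out = max(1, round(len(shuffled) * held_out_fraction))
    n_held_out = min(n_held_out, len(shuffled) - 1)
    return shuffled[n_held_out:], shuffled[:n_held_out]


inputs = [f"document {i}" for i in range(10)]
search_inputs, held_out_inputs = split_inputs(inputs)
```

The search inputs go to the evaluator used during evolution. The held-out inputs are scored once at the end, and a large gap between the two scores suggests the prompt has fitted the search set.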
What This Does Not Prove
One run does not prove that genetic prompt optimization generalizes across tasks. It proves that the mechanism can produce a measurable improvement on the included summarization setup.
A stronger evaluation would repeat the run across several random seeds, compare against a single-pass prompt generator and a hand-written prompt, and score on held-out inputs. That would separate genuine task improvement from judge noise and test-set overfitting.
The project is useful now as a local optimization tool and as a research scaffold. It turns prompt iteration into an inspectable search process: population, lineage, fitness curve, best candidate, and limitations.
Genetic Prompt Programming is available at github.com/stack-research/genetic-prompt-programming.
