A practitioner’s tour of DSPy, MIPROv2 and GEPA. The reframe (prompts are parameters of an LLM program, not the artefact you ship), the five axes any optimiser can tune, how MIPROv2 and GEPA actually work, where this set of methods quietly disappoints, and a decision tree for picking one.
A prompt is a hyperparameter of an LLM application, not the artefact you ship.
Table of contents
- What prompt optimization is
- Why this matters now
- The DSPy thread
- The reframe in plain English
- A worked example: multi-hop question answering
- The five axes of prompt optimisation
- The four methods, in chronological order
- Cost reality check
- Why this often disappoints in practice
- A decision tree you can actually follow
- Further reading
What prompt optimization is
A prompt is the natural-language instruction you hand to a language model to get it to do a task. “Classify this email as spam or not.” “Given these search results, answer the user’s question and cite your sources.” “Read this contract and extract the termination clauses as JSON.” Almost every LLM-powered feature you have used in the last three years is, under the hood, one or more prompts wrapped around a model call.
Prompt optimization is the practice of systematically improving those instructions (plus the few-shot examples, formatting scaffolding, and per-module wording around them) so that the program performs measurably better on the task you actually care about. The key word is systematically. The naive alternative, which most teams still run, is to sit in a playground, type a prompt, run it on a handful of cases, eyeball the outputs, edit, repeat. Sometimes adding “think step by step.” Sometimes adding “be careful.” Sometimes removing “be careful” because it stopped working. The loop is intuitive but it is also slow, unrepeatable, biased toward whichever failure cases happen to be top of mind that morning, and fragile against the slightest change in the model or the data.
A prompt optimizer replaces the human-in-the-playground loop with a procedure. Define what good output looks like (a metric). Define what the inputs look like (a small dataset, typically 100 to 300 examples). Let an algorithm search the space of candidate prompts (or candidate few-shot demonstrations, or candidate per-module instructions) to maximize the metric. Ship the result as a compiled artefact, the same way you would ship a trained model.
Why this matters now
Three things changed in the last two years that make hand-tuning untenable for serious work.
First, prompts became load-bearing infrastructure. A non-trivial LLM application is no longer one prompt. It is half a dozen prompts, chained, with retrieval and tool calls between them. Each one has an instruction section, a few-shot section, possibly a persona, possibly a constraint list. The combinatorial surface is large enough that tuning each component in isolation by hand is not a strategy. It is a confession that you do not have one.
Second, models keep moving. The carefully-worded prompt you spent a week on for GPT-4 may produce subtly different outputs on Claude, on a fine-tuned open model, on the next version of the same model. Anyone who shipped LLM features in 2023 felt the pain of “we swapped the model and all the prompts broke.” Without a compile step, every model change is a re-authoring exercise.
Third, the metrics got cheap. Writing a programmatic check (“does the output parse as JSON”, “does the answer match the gold label”, “does the tool call use the right schema”) is fast. Running an LLM-as-judge over a few hundred examples is a few cents. Once you can score outputs cheaply, the question “which prompt scores highest” becomes answerable by search rather than intuition. And once it is answerable by search, you stop wanting a human to do it.
So the field has converged on a simple idea: treat prompts the way every other parameter of an ML system gets treated. As something an optimizer chooses against a metric, not something a human authors against vibes. That is what prompt optimization means as a discipline, and it is the lens for the rest of this post.
The DSPy thread
Hand-tuning prompts in a playground works fine for prototyping, exploring what a model can do, or shipping a single-call feature. It is how almost every LLM feature starts. It stops working once the prompt is one component of a larger system that nobody is treating as one. Every other parameter of an ML system gets its values from a compiled or optimised process. Prompts can too.
This post is about the work that takes that view seriously. The DSPy framework from Stanford NLP, plus a four-year line of papers (DSP, MIPRO, GEPA), has built up the tooling to treat prompts the way you would treat any other set of hyperparameters: declare a program, define a metric, let an optimiser sweep candidate prompts against that metric, ship the compiled output.
The headline empirical result, from the GEPA paper (Agrawal et al., July 2025): on multi-hop QA, instruction-following, and claim-verification benchmarks, a reflective prompt optimiser matches or beats GRPO-trained models using 3 to 35× fewer rollouts. The interpretable nature of language is, in their words, “a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards.” Reading a rollout trace and rewriting the prompt is denser signal than backpropagating a scalar reward.
What follows is a practitioner’s tour. What “compiling a prompt” means in code. The five axes any prompt optimiser can tune. The two methods most teams reach for in 2026 (MIPROv2 and GEPA), and the cases each is best for. Where this whole approach quietly disappoints. And a decision tree for picking the right method for what you are building.
The reframe in plain English
The move that makes the whole DSPy thread cohere. You write your LLM application as a program: a typed signature (`"question, context: list[str] -> answer: str"`), a module that calls an LLM against that signature, and a metric that tells you whether the output was right. You define a training set: 100 to 300 input/output pairs.
Then you do not write the prompt.
The framework writes the prompt. It runs your program on the training set, recording what every module produced internally. It harvests the intermediate outputs of successful runs as candidate few-shot demonstrations. It asks another LLM to propose candidate instructions, grounded in the dataset and the trajectories that have worked so far. It Bayesian-searches the joint space of (instruction, demonstrations) per module. And it returns a new program with the best instructions and demos baked in.
The output of all that is a JSON file. Instructions, demos, trial scores, chosen configuration, all in one artefact you commit to git like any other build output. When you swap models or your data shifts, you re-compile.
That last point matters more than it sounds. The hand-written prompt era is over for the kinds of tasks where you have a metric. And the qualifier matters: this whole approach is downstream of having a metric. Where you can write one (classification accuracy, multi-hop QA exact match, JSON schema validity, regex match, tool-call format), the framework is a clear win. Where you cannot (open-ended generation, “does this output feel right”), you become an LLM-judge engineer, and the optimisation is only as good as your judge. The DSPy literature is honest about this in the small print but undersells it in the headlines.
A worked example: multi-hop question answering
Here is what DSPy code actually looks like. The task is multi-hop QA over Wikipedia: given a question that requires hopping between two or more articles to answer, produce the answer. The standard benchmark is HotPotQA.
Rather than dump the whole program at you, here are the four moving parts.
The signatures are typed I/O contracts. They are also where the proposer LLM later writes its instruction.
```python
# Inside MultiHopQA.__init__ -- one predictor per stage of the pipeline.
self.gen_query = dspy.ChainOfThought("question, notes: list[str] -> next_query: str")
self.answer = dspy.ChainOfThought("question, context: list[str] -> answer: str")
```
The forward method is just Python. No graph, no special framework constructs. The two predictors are wired around a retrieval call.
```python
def forward(self, question):
    notes = []
    for _ in range(self.num_hops):
        # Rewrite the query based on what the notes still lack, then retrieve.
        q = self.gen_query(question=question, notes=notes).next_query
        notes.extend(search(q, k=self.k))  # search() is your retrieval helper
    return self.answer(question=question, context=notes)
```
The metric has the one DSPy gotcha worth memorising: return bool under trace (so the bootstrap filter is strict about which demos to keep), float otherwise (so eval gets a continuous score).
```python
from dspy.evaluate import answer_exact_match

def metric(ex, pred, trace=None):
    ok = answer_exact_match(ex, pred)
    # bool under trace = strict bootstrap filter; float = continuous eval score.
    return bool(ok) if trace is not None else float(ok)
```
The compile call is two lines.
```python
optimizer = dspy.MIPROv2(metric=metric, prompt_model=prompt_lm,
                         task_model=student_lm, auto="light")
compiled = optimizer.compile(MultiHopQA().deepcopy(),
                             trainset=trainset, valset=valset)
```
That is the whole thing. Roughly 15 lines of program, 3 lines of metric, 2 lines of compile. The output is a JSON file you commit.
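If you want to see the artefact, persist it. `save` and `load` are the standard DSPy module methods; the file holds the instructions, demos, and chosen configuration, which is what you diff in code review:

```python
compiled.save("multihop_qa.json")   # instructions, demos, config in one file

program = MultiHopQA()
program.load("multihop_qa.json")    # rebuild the compiled program for serving
```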
On HotPotQA dev, a baseline MultiHopQA with the default ChainOfThought prompts lands around 30 to 40% exact-match. The MIPROv2-compiled version typically picks up 8 to 15 points by inserting bootstrapped demonstrations into the gen_query module where the first-hop query rewrites were particularly good, and by editing the instructions to be more explicit about what to put in next_query given the current notes.
The compile run on auto="light", with gpt-4o-mini as the student and gpt-4o as the proposer, takes around 10 to 20 minutes and costs single-digit dollars. What you are paying for: hundreds to low thousands of LLM calls, mostly to evaluate candidate prompts on minibatches of your validation set.
The five axes of prompt optimisation
Tuning a prompt is not one thing. The DSPy literature treats it as a set of related-but-distinct moves, and being clear about which axis you are on saves time. Five axes recur across the papers.
Demonstrations. Which input/output pairs go into the few-shot section of your prompt. Bootstrapped from successful runs: run the program on the trainset, keep the trajectories where the metric passed, harvest each module’s intermediate (input, output) pair as a candidate demo for that module. The trick traces back to Khattab et al.’s 2022 Demonstrate-Search-Predict paper and is the workhorse of everything that came after. Across every paper in the literature, removing demonstration optimisation hurts more than removing any other axis. 0-shot MIPRO consistently underperforms full MIPRO.
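To make the mechanics concrete, here is a minimal sketch of that loop in plain Python. It assumes a `run_with_trace` helper that returns the final prediction plus each module's intermediate (input, output) pairs; DSPy does this internally via its trace machinery, and the names here are illustrative, not its actual API:

```python
def bootstrap_demos(program, trainset, metric):
    demo_pool = {name: [] for name in program.module_names}  # per-module pools
    for example in trainset:
        pred, trace = run_with_trace(program, example)
        if not metric(example, pred, trace=trace):
            continue  # only metric-passing trajectories supervise anything
        for module_name, mod_in, mod_out in trace:
            # No intermediate labels needed: a successful end-to-end run
            # vouches for every intermediate (input, output) pair inside it.
            demo_pool[module_name].append((mod_in, mod_out))
    return demo_pool
```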
Instruction text. The natural-language task description at the top of each module’s prompt. An LLM-as-proposer reads a structured bundle (dataset summary, program summary, prior trials and their scores) and generates candidate instructions. A search method picks among them. Smaller lever than demonstrations on average, but the biggest on tasks where wording carries genuinely hard-to-demonstrate information: negation handling, instruction-following constraints, tabular schema prose.
Program structure. Not the prompt at all. What modules exist, how they connect, what each one’s signature is. The DSP, IReRa, and STORM papers each show that picking the right structure (sub-queries, infer-retrieve-rank, persona-conditioned researchers) outperforms tuning prompts within a worse structure. The DSPy literature does not have an automated structure search. That part is on you. The pragmatic rule: if your DSPy program has one module and you are trying to optimise its prompt, you have probably under-decomposed.
Retrieval behaviour. When to retrieve, what to retrieve with, how to rewrite the query mid-pipeline. The bootstrap procedure recovers the intermediate retrieval query from a successful trajectory and uses it as a demo for the query-rewriting module. Same pattern as demonstration optimisation, applied to a different module.
Constraints (assertions). Post-conditions on outputs. JSON must parse. Citations must point to retrieved documents. Output length must be under N tokens. Tool calls must use a defined schema. The DSPy Assertions paper (Singhvi et al., 2023) shows that adding `dspy.Assert(condition, message)` does two useful things at once. At compile time, the optimiser filters its bootstrap pool to keep only trajectories that satisfied all assertions, so demos are not just metric-passing but also constraint-respecting. At inference time, a violation triggers a backtrack with the failure message injected into the prompt. Up to 164% improvement in constraint satisfaction and 37% in downstream quality in the paper’s case studies. Constraints are not just safety rails; they are a source of supervision.
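A minimal sketch of the inference-time side, adapting the worked example’s forward loop. The Assert API shown follows the Assertions paper; it has moved around across DSPy releases, so check your version:

```python
def forward(self, question):
    notes = []
    for _ in range(self.num_hops):
        q = self.gen_query(question=question, notes=notes).next_query
        # Post-condition: a failed check backtracks and re-runs the module
        # with this message injected into the prompt.
        dspy.Assert(len(q) <= 100, "Keep the search query under 100 characters.")
        notes.extend(search(q, k=self.k))
    return self.answer(question=question, context=notes)
```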
If you take one thing from this section: the axes are different problems. A team that conflates them, or that only ever tunes one of them, ends up doing the wrong work for too long.
The four methods, in chronological order
The DSPy line of work is a 2022-to-2025 progression where each method addresses a specific limitation of the previous one.
DSP (Dec 2022): bootstrap pipeline-aware demonstrations
The seed of everything that came after. The Demonstrate-Search-Predict paper introduced the trick that defines DSPy optimisers to this day: run your program end-to-end on training inputs, keep the runs whose final output is correct, and harvest the intermediate (input, output) pairs as demonstrations for every module along the way. You do not need labels for the intermediate predicates. You do not need to hand-write few-shots for each step. The pipeline runs itself and supervises itself.
Reported gains were 37 to 120% over vanilla GPT-3.5, 8 to 39% over standard retrieve-then-read, 80 to 290% over self-ask on multi-hop benchmarks. Big numbers, simpler headline message: pipeline-aware demonstrations beat hand-written ones.
DSPy + BootstrapFewShot (Oct 2023): productisation
This is the framework as most readers will encounter it. Signatures (typed I/O contracts), modules (Predict, ChainOfThought, ReAct, etc.), teleprompters (the optimisers). Five optimisers in the original release, all variations on the bootstrap trick: LabeledFewShot, BootstrapFewShot, BootstrapFewShotWithRandomSearch, BootstrapFinetune, Ensemble.
The headline number from the paper: GSM8K accuracy on GPT-3.5 goes from 25.2% (vanilla zero-shot) to 81.6% (CoT + bootstrap + ensemble). HotPotQA exact match goes from 31.5% to 45.6% with multihop + bootstrap + ensemble.
Honest reading: those deltas are real, but the contribution is partly the program structure (multihop, CoT, ensembles) and partly the bootstrap demos. The “compilation” framing is doing marketing work for what is, in the GSM8K case, “CoT prompting + self-distilled few-shots + ensembling”. Useful; not revolutionary on its own.
The insight that made the framing land in the community was not the optimisers. It was the separation of program from prompt: the same program runs against GPT-3.5, Llama-2, a fine-tuned T5, with the same code, recompiled per model. Anyone who shipped LLM features in 2023 felt the pain of “we swapped the model and all the prompts broke”. DSPy made that pain a type system. The teleprompter API gave a clean hook for future optimisers, which is exactly what the next paper fits into.
MIPROv2 (Jun 2024): joint instruction + demo optimisation
MIPROv2 is the first DSPy optimiser sophisticated enough to be worth singling out. Where BootstrapFewShot and its variants only learn demonstrations and leave the instruction string as whatever you typed in, MIPROv2 jointly optimises both the instruction and the demonstrations for every module in your program. It does this under a fixed evaluation budget, without per-module labels, and without gradients.
There are three innovations stacked on top of each other.
The first is a grounded instruction proposer. The optimiser makes an LLM call, typically using a stronger model than your task model, and prompts it with a structured bundle of context: a short two-or-three-sentence summary of the dataset generated up-front, a summary of the program describing how the pipeline is wired, examples drawn from the bootstrap pool of successful trajectories, the prior trial instructions tagged with the scores they earned, and a categorical “tip” knob with options like “be concise”, “be precise”, or “use specific reasoning”. The proposer reads all of this and generates the next candidate instruction. This is not a gradient on the instruction in any formal sense. It is closer to a metric-conditioned next-instruction generator.
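An illustrative rendering of that bundle, reusing the multi-hop example from above. The real template lives in the MIPROv2 source; every field name below is a paraphrase:

```python
proposer_context = {
    "dataset_summary": "Multi-hop factoid questions over Wikipedia; "
                       "answers are short entity strings.",
    "program_summary": "gen_query writes the next search query from notes; "
                       "answer reads the accumulated context.",
    "bootstrapped_examples": demo_pool["gen_query"][:3],  # successful trajectories
    "prior_trials": [
        ("Write a search query.", 0.34),
        ("Write the one query that fills the biggest gap in the notes.", 0.41),
    ],
    "tip": "be precise",  # categorical knob: concise / precise / specific reasoning
}
# The proposer LLM reads this bundle and emits the next candidate instruction.
```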
The second is a Bayesian search using TPE, the Tree-structured Parzen Estimator from Optuna. It searches over the joint discrete space of (instruction_choice, demo_set_choice) per module. TPE handles credit assignment implicitly because it learns a joint density over the entire combinatorial space, so it can figure out which module-level variable matters most without needing module-level labels to guide it.
The third is a stochastic minibatch surrogate for the evaluation step. Most trials get scored on small minibatches of the validation set, which is cheap but noisy. Periodically, the configuration that currently has the best mean score is promoted to a full validation evaluation, which is expensive but accurate. The combination keeps the total evaluation budget tractable while still anchoring the search against the real underlying metric.
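A minimal sketch of that search loop using Optuna’s TPE sampler directly, with `instructions[m]` and `demo_sets[m]` as per-module candidate pools and `eval_on` as a scorer. This is an illustration of the idea, not MIPROv2’s actual code:

```python
import random
import optuna

modules = ["gen_query", "answer"]

def objective(trial):
    # One categorical choice per module: which instruction, which demo set.
    config = {
        m: (trial.suggest_categorical(f"{m}_instr", list(range(len(instructions[m])))),
            trial.suggest_categorical(f"{m}_demos", list(range(len(demo_sets[m])))))
        for m in modules
    }
    # Cheap, noisy minibatch score; TPE models the joint density over configs,
    # which handles per-module credit assignment implicitly.
    batch = random.sample(valset, 32)
    return eval_on(config, batch)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=60)
# Periodically promote the best mean-scoring config to a full valset
# evaluation (omitted here) to anchor the noisy minibatch scores.
```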
The results back the design. MIPROv2 typically delivers 2 to 3 percentage points over BootstrapFewShotWithRandomSearch on most tasks, and up to 13 points on tasks where the wording of the instruction really matters, such as the paper’s ScoNe negation-handling benchmark. The 0-shot version of MIPRO, which optimises instructions but not demonstrations, consistently underperforms the full version. Demonstrations still carry most of the signal, even with all this instruction machinery sitting on top.
It is worth being clear about what MIPRO is not. It is not a method that “learns” prompts in any meaningful sense. It generates a pool of candidates through the proposer and then selects among them using TPE. The cleverness lies in how the search space is constructed, since the grounded proposer reliably produces good candidates, rather than in the optimiser itself. This is a useful clarification to keep on hand when a colleague says “MIPRO learned the prompt.” It did not. It picked from a pool.
GEPA (Jul 2025): reflective prompt evolution
GEPA is the newest method in this line of work and the most differentiated from what came before. Where MIPROv2 scores each candidate with a scalar and lets TPE drive the search, GEPA does something different. It lets an LLM read the full rollout traces in natural language and propose a prompt edit grounded in what specifically went wrong on each trace.
The quotable line from the paper:
“The interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards.”
The reasoning behind the design is informational. Each GEPA step extracts on the order of a full prompt’s worth of tokens of edit signal from a single rollout, while each step of policy-gradient RL extracts roughly one scalar per rollout. The language channel is simply bandwidth-richer per sample, and the reflection LLM has strong priors about what good prompts tend to look like.
The algorithm itself fits in a paragraph. The optimiser maintains a Pareto frontier of candidate prompts. On each iteration, it samples a candidate from the frontier and runs it on a minibatch. For every trial in the minibatch, it records the trace (reasoning, tool calls, and tool outputs) along with any textual feedback the metric emits, such as compiler errors, validator messages, or retrieval logs. All of that gets fed to a reflection LLM along with the current prompt, and the reflection LLM is asked to propose a revised version. If the revision beats the parent on the minibatch, it gets promoted to the full validation set and added to the pool. The use of a Pareto frontier rather than a greedy best-candidate approach is what prevents the search from collapsing onto a single lineage of edits.
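In pseudocode, with every helper name illustrative rather than GEPA’s actual API:

```python
pool = [seed_prompt]
for _ in range(num_iterations):
    parent = sample_pareto_frontier(pool)   # per-example bests, not one global best
    batch = sample_minibatch(valset)
    runs = [rollout(parent, ex) for ex in batch]   # each run: trace, feedback, score
    child = reflection_lm.propose_edit(
        current_prompt=parent,
        traces=[r.trace for r in runs],        # reasoning, tool calls, tool outputs
        feedback=[r.feedback for r in runs],   # compiler errors, validator messages...
    )
    child_runs = [rollout(child, ex) for ex in batch]
    if mean(r.score for r in child_runs) > mean(r.score for r in runs):
        pool.append(score_on_full_valset(child))   # promote minibatch winners only
```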
Compared to GRPO, GEPA matches or beats accuracy while using 3 to 35× fewer rollouts. The 35× figure comes from IFBench specifically, and the average across benchmarks sits closer to 10×.
There are two caveats worth taking seriously. The first, which the paper itself flags, is that GEPA is instruction-only. It does not do any demonstration optimisation at all, and demonstration optimisation is what MIPROv2 spends most of its budget on. The two optimisers are therefore not strictly comparable, since they are doing somewhat different things underneath the same interface.
The second caveat is about implementation. GEPA without rich textual feedback in the metric is, effectively, just an expensive version of MIPRO. If your metric returns a bare float with no surrounding context, the reflection LLM falls back to reasoning over “this trajectory got a score of X,” and most of GEPA’s advantage evaporates. The engineering lever the paper specifically calls out as the most under-explored is what they call feedback engineering: deliberately designing the textual signal your evaluator emits so the reflection LLM has something concrete to work with.
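What feedback engineering looks like in practice: make the metric return text alongside the score. A sketch against a JSON-extraction task; the five-argument signature and the `dspy.Prediction(score=..., feedback=...)` return follow recent DSPy GEPA docs (verify against your version), and the required keys are invented for the example:

```python
import json
import dspy

def metric_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Emit text the reflection LLM can act on, not just a scalar.
    try:
        payload = json.loads(pred.answer)
    except json.JSONDecodeError as e:
        return dspy.Prediction(score=0.0,
                               feedback=f"Output was not valid JSON: {e}")
    missing = [k for k in ("clause", "section") if k not in payload]
    if missing:
        return dspy.Prediction(score=0.5,
                               feedback=f"JSON parsed but missing keys: {missing}")
    return dspy.Prediction(score=1.0, feedback="Valid JSON with all required keys.")
```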
Picking between MIPROv2 and GEPA
This is the natural question to ask in 2026, and the decision rule is fairly simple in practice.
Default to MIPROv2 for most situations. It is the more mature option, it covers both of the main axes (instructions and demonstrations together), and a 2-to-3-point gain over BootstrapFewShotWithRandomSearch is a real win on the kinds of tasks most teams are actually working on. Reach for GEPA when your metric naturally emits real text explaining its failures: compiler errors, JSON schema validator output, retrieval hit/miss logs, judge rationales, unit-test diffs. That text is exactly what the reflection LLM converts into a targeted prompt edit, and it is what separates GEPA from a slower MIPRO when it works well. Without that kind of feedback in your metric, GEPA tends to underperform its own potential.
If neither approach lifts the needle on your task, the bottleneck is almost certainly the metric or the program decomposition, not the optimiser. That is worth stating plainly, because it is easy to spend weeks tuning the optimiser when the real problem is upstream.
Cost reality check
For a single-predictor program with a 200-example trainset and a 300-example valset on gpt-4o-mini, rough numbers from the docs and community reports:
| Optimiser | LLM calls | Wall clock | Dollar cost |
|---|---|---|---|
| BootstrapFewShot | ~200 | minutes | < $1 |
| BootstrapFewShotWithRandomSearch(N=10) | ~3.2k | ~10 min | $2-3 |
| MIPROv2(auto="light") | hundreds to low thousands | 10-20 min | $1-5 |
| MIPROv2(auto="heavy") | tens of thousands | hours | $20+ |
| GEPA(auto="medium") | metric calls + reflection LM | varies | reflection-LM dominated |
Multi-predictor programs scale roughly linearly in predictor count for the instruction-search step of MIPROv2.
For a grounded calibration point, one published full MIPROv2 run on a small program ran to about 238k tokens, $0.04, and 14 minutes on gpt-4o-mini. The fear of “thousands of dollars” compile costs is unfounded at small scale. The real cost spike is what you see when you run auto="heavy" on a multi-stage RAG program.
Why this often disappoints in practice
Five honest caveats are worth flagging before you go run any of this on your own application.
Metric quality is the actual hard part. Every DSPy optimiser is a metric-maximiser. The literature focuses on optimisation methods, but for the tasks where prompt optimisation is hardest in practice (open-ended generation, summarisation, agentic workflows), writing a metric that tracks what users actually care about is the bottleneck. DSPy will enthusiastically overfit a bad metric. The hours you would save by skipping prompt iteration get spent on judge tuning instead.
Reported gains are partly an artefact of weak baselines. Many DSPy papers compare to “vanilla prompt with no few-shots and no decomposition” rather than to what a competent engineer would write by hand in three hours. The deltas over a well-tuned BootstrapFewShotWithRandomSearch are often 2 to 3 points, not 30. The wins are real, but smaller than the abstracts suggest when graded against a strong human-written baseline.
Demonstrations still dominate. 0-shot MIPRO consistently underperforms full MIPRO. GEPA is instruction-only and still does well, but the comparison is murky. If you are picking one axis to start with, pick demonstration selection. It is the single biggest lever in this entire body of work.
Compilation cost is real for large programs. Plan for thousands of LLM calls. The framework hides this until you run it. Build a budget into the optimiser invocation (max_metric_calls, auto="light") before you discover what auto="heavy" costs on a four-stage pipeline.
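A hedged sketch of capping the spend up front; the parameter names follow current DSPy docs, so verify them against your installed version:

```python
# Cap compile budget before the first run, not after the bill arrives.
mipro = dspy.MIPROv2(metric=metric, auto="light")   # smallest search preset
gepa = dspy.GEPA(metric=metric_with_feedback,       # see feedback example above
                 reflection_lm=prompt_lm,           # assumed: a strong reflection model
                 max_metric_calls=600)              # hard ceiling on evaluations
```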
Program structure is the lever none of these optimisers touch. The biggest gains in the literature (DSP, IReRa, STORM) come from structural choices: sub-query decomposition, infer-retrieve-rank, persona-conditioned researchers. The optimiser only kicks in after you have made those structural choices. A bad decomposition will not be rescued by any prompt optimiser, full stop.
A decision tree you can actually follow
Given the task you have:
- If your task is a single LLM call with a clean metric (classification, structured extraction, simple QA), start with BootstrapFewShotWithRandomSearch. Cheapest, simplest, and the gap to fancier optimisers is small at this scale.
- If you have a multi-module program with a clean metric, reach for MIPROv2(auto="light"). Joint optimisation of instructions and demos per module is what the framework was built for. Plan for hundreds to thousands of LM calls.
- If your metric emits rich textual feedback already (compiler errors, retrieval logs, unit-test diffs, structured-output validation), use GEPA. This is where the language-as-gradient bet pays off.
- If you have hard correctness constraints, layer `dspy.Assert(condition, message)` over whichever optimiser you use. Compile-time filtering plus inference-time backtracking gets you reliability without sacrificing the optimisation signal.
- If your output space is huge (extreme multi-label, code generation against a large API), decompose with an Infer-Retrieve-Rank pattern (the IReRa paper) before you optimise prompts. Optimising within the wrong structure is a waste.
Further reading
- Khattab et al., 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. The framework paper.
- Opsahl-Ong et al., 2024. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. The MIPROv2 paper: joint instruction and demonstration optimisation.
- Agrawal et al., 2025. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. ICLR 2026 Oral.
Code:
- DSPy repo and docs
- GEPA standalone package for non-DSPy programs