
Inside MIPROv2: Bootstrap, Propose, Search

A walkthrough of MIPROv2 (Opsahl-Ong et al., 2024), DSPy's flagship prompt optimiser. The three-phase pipeline (bootstrap, propose, search), how Bayesian Optimisation makes the discrete combinatorial space tractable, what changes between the baseline and the compiled prompt, and a decision rule for when to run it.

Most LLM pipelines get built the same way. You write a prompt, try it on some examples, decide a few outputs aren’t quite right, tweak the wording, and run it again. After a week of this, whatever prompt you happened to stop at is what ships. We live with the loop because the alternative, asking an algorithm to optimise the prompt text for you, sounds slightly absurd. You can’t take a gradient through English.

You do not need a gradient. You need a metric, some training examples, and a search procedure that does not burn cash on every step. MIPROv2 (Multi-prompt Instruction PRoposal Optimiser, version 2) is the algorithm DSPy uses when you want it to handle the tuning. This post walks through what it actually does, with two small applets to build intuition.


Where this post sits

The Prompts are Hyperparameters post argued the reframe at breadth: prompts are parameters of an LLM program, not the artefact you ship. The GEPA walkthrough drilled into the reflective alternative that mutates prompts by reading their own trace logs. This post fills in the middle, the optimiser most teams use when they have a multi-step DSPy program with a stable metric and a few hundred labelled examples: MIPROv2.

A short word on DSPy

MIPROv2 only makes sense inside DSPy’s model of what an LLM program is, so a brief detour before we get to the algorithm itself.

DSPy is a Python framework for building LLM-powered programs without hand-writing prompts. The core idea is to declare what each LM call is supposed to do via a typed signature, then let the framework worry about the prose. A signature looks like question -> answer or, for a more involved task, passages, question -> reasoning, answer. The left side of the arrow is what goes in, the right is what comes out.

You wrap a signature in a module like Predict, ChainOfThought, or ReAct. That module (often called a predictor in DSPy docs, since each one drives one LM call) is what your code actually calls. The interesting part is what happens between “you call the module” and “the LM returns text”.
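
Before looking at that, here is roughly the smallest working usage, as a sketch: it assumes you have an LM configured, and the model name is illustrative.

Minimal DSPy module usage (sketch)
import dspy

# Assumed setup; the model name is illustrative
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Wrap a signature in a module; no prompt text anywhere in sight
qa = dspy.ChainOfThought("question -> answer")

pred = qa(question="What is 7 * 6?")
print(pred.reasoning)  # ChainOfThought adds a reasoning field ahead of the answer
print(pred.answer)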

At runtime, DSPy compiles the signature into a real prompt string: an instruction (initially the signature’s docstring or an auto-generated description), structured field markers it can parse outputs from, and optionally a few-shot demos block. The prompt text is not something you wrote. It is something DSPy materialised. A full DSPy program is a Python class that chains these modules together, and nothing about how they are prompted lives in your source code.

Because DSPy generates the prompt rather than asking you to write it, the prompt is just another parameter of the program. DSPy ships a family of optimisers (originally called “teleprompters”) that tune those parameters against a metric and training data you supply. MIPROv2 is the most widely used, and the rest of this post walks through how it works.

What MIPROv2 actually changes

Before getting into the algorithm, it helps to see what the algorithm’s output looks like in practice. The applet below toggles between the unoptimised version of a math-QA predictor and the version MIPROv2 spits out. Same signature, same field markers, very different prose.

Notice what changed and what did not. The field markers (Question:, Answer:) are structural; DSPy needs those to parse outputs and they stay put. The two things MIPROv2 hunts for are the instruction at the top and the few-shot demonstrations in the middle. Those are the parameters.

A quick expectations check before going further. MIPROv2 is for programs past the sketching phase, with stable structure, a metric you trust, and at least a few dozen labelled examples. If you are still iterating heavily on what your program should even do, or you only have a handful of examples, hand-tuning a prompt is faster than spinning up an optimiser. The optimiser is a sander, not a sketcher.

Stated as an optimisation problem

You have a program with $k$ predictors, some labelled data, and a metric. The metric can be anything that returns a number: exact match against a gold answer, F1 on a target span, a regex check, semantic similarity against a reference, or even a small LM-as-judge verifier. MIPROv2 uses two splits internally: a trainset for bootstrapping demonstrations from successful traces, and a valset for scoring candidate configurations during search. You can supply both directly (as in the snippet below) or pass only a trainset and let MIPROv2 carve a valset out of it. What you want, for each predictor, is the pair (instruction, demos) that maximises expected metric on the valset, which you hope is a reasonable proxy for production.
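
Written as a formula (notation mine: $\Phi$ is the program, $\mu$ the metric, $V$ the valset, and predictor $i$ carries instruction $I_i$ and demo set $D_i$):

$$ \max_{(I_1, D_1), \ldots, (I_k, D_k)} \; \frac{1}{|V|} \sum_{(x,\, y) \in V} \mu\big(\Phi_{(I_{1:k},\, D_{1:k})}(x),\, y\big) $$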

The space is discrete, combinatorial, and gradient-free. Each evaluation is an end-to-end run of your program, which costs real money and real latency. So the optimiser has to be sample-efficient. MIPROv2 splits the work into three sequential phases, each doing something slightly different.

In code, that splits into a program definition, a metric, and a single optimizer.compile() call.

End-to-end MIPROv2 setup in DSPy
import dspy

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # retrieve() is your own retrieval function, defined elsewhere
        return self.generate(context=retrieve(question), question=question)

# Metric: did the gold answer appear in the prediction?
def metric(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

# Optimise: this is where MIPROv2 does its work
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
compiled = optimizer.compile(RAG(), trainset=train, valset=val)

Three details worth flagging. The RAG class declares what the program does without ever specifying prompt text. The metric is just a Python function returning a score. And optimizer.compile() is where the entire MIPROv2 procedure runs: bootstrap, propose, search. Everything that follows in this post is what happens inside that one call.
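
If you want to see the two tuned parameters directly, you can peek inside the compiled program. The attribute names below match current DSPy, but treat this as a sketch rather than stable API:

Inspecting what compile() changed (sketch)
for name, predictor in compiled.named_predictors():
    print(name)
    print(predictor.signature.instructions)  # the tuned instruction
    print(len(predictor.demos))              # the tuned few-shot demo set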

The three-phase pipeline

[Figure 1 diagram. Phase 1, Bootstrap: run program on trainset, keep passing traces as demos → $N$ demo sets per predictor. Phase 2, Propose: prompt model writes $C$ grounded instruction candidates per predictor → $C$ instructions per predictor. Phase 3, Search: TPE picks the best joint (instruction, demos) per trial → best-scoring compiled program.]

Figure 1. The MIPROv2 pipeline. Bootstrap surfaces few-shot demos from successful program runs. Propose drafts candidate instructions with grounding signals from the data, the code, and stylistic tips. Search runs Bayesian Optimisation over the joint space of (instruction, demos) per predictor and returns the highest-scoring configuration on the full validation set.

Phase 1: Bootstrap

Run your current (unoptimised) program on training inputs. For each example, you get a full trace through every predictor. If the final output passes the metric, keep that trace, because it is now a valid demonstration of the program doing the right thing. Sample $N$ of those into a candidate demo set, repeat with different seeds to build multiple candidate sets per predictor.

What is doing the work here is that you did not need intermediate labels. If your program is question -> reasoning -> answer and only the final answer has ground truth, you still recover good intermediate reasoning, because the only traces that survive bootstrapping are the ones whose end-to-end answers happened to be correct. The supervision is weak and outcome-only: you trust the final metric and let it implicitly score the intermediate steps that led to the right answer.
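
A minimal sketch of that loop, with (example, prediction) pairs standing in for DSPy's full multi-predictor traces:

Sketch: Phase 1 bootstrap loop
import random

def bootstrap_demo_sets(program, trainset, metric, n_demos=4, n_sets=10, seed=0):
    # Keep only runs whose *final* output passes the metric
    passing = []
    for example in trainset:
        prediction = program(**example.inputs())   # full end-to-end run
        if metric(example, prediction):            # outcome-only supervision
            passing.append((example, prediction))  # stand-in for the full trace
    # Resample with different seeds to build several candidate demo sets
    rng = random.Random(seed)
    return [
        rng.sample(passing, min(n_demos, len(passing)))
        for _ in range(n_sets)
    ]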

Phase 2: Propose, with grounding

For each predictor, you now want a pool of candidate instruction strings to search over. Naively you could just ask an LM to “write me 10 versions of this instruction”. MIPROv2 does something more deliberate. It uses a second LM (the “prompt model”) inside a small DSPy module called GroundedProposer, and feeds the proposer four pieces of context:

  1. An auto-generated summary of properties of the training dataset, so the proposer knows what kind of inputs to expect.
  2. A summary of the program’s code and which specific predictor the instruction is being written for, so the instruction matches the predictor’s role.
  3. The bootstrapped demos from Phase 1, so the instruction is consistent with what successful traces look like.
  4. A randomly sampled tip like “be concise” or “be creative” or “think step by step”, so the candidates diversify across stylistic axes rather than collapsing onto one phrasing.

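Compressed into pseudocode, the proposer call looks something like this. The tip list and meta-prompt wording are illustrative, not DSPy's actual strings:

Sketch: grounded instruction proposal
import random

TIPS = ["Be concise.", "Be creative.", "Think step by step."]  # illustrative

def propose_instruction(prompt_lm, dataset_summary, program_summary, demos, seed):
    rng = random.Random(seed)
    meta_prompt = (
        f"Dataset summary: {dataset_summary}\n"   # grounding signal 1
        f"Program summary: {program_summary}\n"   # grounding signal 2
        f"Successful traces:\n{demos}\n"          # grounding signal 3
        f"Tip: {rng.choice(TIPS)}\n"              # grounding signal 4
        "Write an instruction for this predictor."
    )
    return prompt_lm(meta_prompt)  # one candidate; call C times for a pool
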
To make this less abstract, here is the kind of instruction transformation MIPROv2 produces on a grade-school maths task. The unoptimised instruction is whatever the signature’s docstring contained, often nothing more than a restatement of the field types. The optimised one tends to read like something a thoughtful engineer would write after actually looking at the data.

Before: “Given the question, produce the answer.”

After MIPROv2: “You are solving grade-school arithmetic word problems. Identify each quantity in the problem, perform the calculation step by step, and state the final answer as a single integer or simple fraction. Do not restate the question or add commentary.”

The grounding is visible if you know to look for it. “Grade-school arithmetic” came from the dataset summary. “Step by step” likely came from a sampled tip. The output-format constraint reflects the shape of the bootstrapped traces, which all happened to produce single-number answers. The proposer was not being creative; it was being specific.

Slot 0 of every predictor’s candidate list is always the original signature instruction. The narrow guarantee is that the baseline lives somewhere inside the search space, not that the optimiser will always pick it or beat it. In practice the search reliably lands at or above the baseline on the validation set, but that says nothing about production: small valsets, evaluation noise, the LM’s own stochasticity, and the gap between your validation distribution and real traffic all leave room for a winning configuration to regress once it ships.

Phase 3: Bayesian Optimisation over the joint space

At this point you have, per predictor, $C$ instruction candidates and $C$ demo-set candidates. With $k$ predictors and two slots each, the joint configuration space is $C^{2k}$. For modest values ($C = 10$, $k = 3$) that is already a million configurations. You are not evaluating a million configurations.

MIPROv2 uses Bayesian Optimisation, specifically Optuna's TPE (the Tree-structured Parzen Estimator), to search this space efficiently. The intuition first. After every trial, TPE looks at which configurations have produced good scores and which have produced bad ones, and at the next trial it samples more from regions of the space that have produced winners and less from regions that have produced losers. The “good” cutoff is just the top quantile of trials seen so far.

The formalism that implements that intuition is short. TPE maintains two distributions over configurations, $p(x \mid \text{score is good})$ and $p(x \mid \text{score is bad})$, and at each trial proposes a candidate that maximises their ratio. Configurations that are likely under the good distribution and unlikely under the bad one are exactly the regions you want to revisit.
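
Written out, the selection rule is just:

$$ x_{\text{next}} = \arg\max_{x}\; \frac{p(x \mid \text{good})}{p(x \mid \text{bad})} $$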

Each trial evaluates a configuration on a minibatch from the validation set (cheap), and periodically the running best is re-evaluated on the full validation set (expensive but rare). At the end, MIPROv2 hands back the program with whichever configuration scored highest on that full eval.
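
Stripped of MIPROv2's bookkeeping, the search loop is a stock Optuna study over integer slots. Here $k$, $C$, and evaluate_on_minibatch are stand-ins for the real internals:

Sketch: the TPE search loop in Optuna
import optuna

k, C = 3, 10  # predictors, candidates per slot

def objective(trial):
    # One instruction slot and one demo-set slot per predictor
    config = {
        i: (
            trial.suggest_int(f"instruction_{i}", 0, C - 1),
            trial.suggest_int(f"demos_{i}", 0, C - 1),
        )
        for i in range(k)
    }
    return evaluate_on_minibatch(config)  # hypothetical: score on a valset minibatch

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=0),
)
study.optimize(objective, n_trials=30)
print(study.best_params)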

Worth watching in that applet. Early trials are nearly random because TPE has no information to condition on. After roughly ten trials, sampling concentrates in the better rows and columns, and you can see clusters form. The “best so far” curve is monotone-non-decreasing by construction, but the per-trial scores are noisy because each evaluation uses a small minibatch. That noise is exactly why MIPROv2 periodically re-evaluates the running best on the full validation set; minibatch winners are not always real winners.

What you give up, and what you get

MIPROv2 has real strengths and real costs, and which side dominates depends on the shape of your task.

On the strengths side, MIPROv2 treats prompt tuning as the optimisation problem it always was, with a real metric instead of vibes. Bootstrap is outcome-supervised, so you do not need intermediate labels; end-to-end ground truth is enough. The optimisation runs jointly across multiple predictors, so a multi-step program’s prompts get tuned to each other rather than in isolation. The baseline instruction sits in the candidate pool from trial 0, so the search reliably lands at or above it on the validation set. And the whole framework decouples program logic from prompt prose, so you can edit one without touching the other.

The costs and caveats also track the mechanism. A typical auto="medium" run on a two-predictor program is in the low thousands of LM calls; auto="light" is in the hundreds, auto="heavy" can hit tens of thousands. Cost scales with predictors × trials × minibatch size, on top of bootstrap and propose. There is overfit risk on small validation sets, and the resulting instructions can be brittle and oddly long. Optimised prompts are tuned to a specific inference setup (LM, decoding settings, DSPy field markers), so transfer to a different stack is not guaranteed. TPE on a discrete combinatorial space is fine, not miraculous, and can stall on hard objectives. And the diagnostic story when MIPROv2 produces a strange winning instruction is harder than when a human-written prompt fails, because the optimiser's choices do not always have a legible reason behind them.
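
As a rough, assumed-numbers illustration of that scaling for a two-predictor program:

$$ \underbrace{2}_{\text{predictors}} \times \underbrace{30}_{\text{trials}} \times \underbrace{35}_{\text{minibatch}} = 2{,}100 \ \text{LM calls} $$

which lands in the low thousands before counting bootstrap and propose.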

When to use it

Three conditions determine whether MIPROv2 will pay off, and you want all three to hold before running a serious budget through it.

  1. A metric you trust to correlate with what you actually care about. If the metric is shaky, fix the metric first. An optimiser pointed at a bad metric will efficiently find prompts that game it.
  2. Around 100 or more labelled examples, enough to support a real train/val split. Below 50, use COPRO or hand-tune; Bayesian Optimisation does not have enough signal to work with.
  3. A stable program structure. If you are going to refactor the predictor graph next week, do not spend compute optimising the current shape of it.

When those conditions hold, the default move is to run MIPROv2 with auto="medium" once. The cost is real but bounded. The worst case is you spend an afternoon’s compute confirming your hand-written prompts were already good. The expected case is a meaningful jump on the validation set, and a smaller but real jump in production once you have validated the configuration on held-out traffic.

If your task naturally produces rich textual feedback (compiler errors, JSON schema validator output, judge rationales, unit-test diffs), the GEPA walkthrough covers the reflective alternative that consumes that signal directly. MIPROv2 is the right default for most DSPy programs; GEPA wins on the subset where the metric emits actionable English alongside its score, because the reflection step can target a specific clause of the prompt rather than searching over a fixed candidate pool.

The deeper point is the paradigm shift the framework is selling. Once you have internalised that prompts are parameters and not source code, the question stops being “is this prompt good?” and becomes “is this program’s metric and training set enough to let an optimiser find a good prompt for me?”. When the answer is yes, the right call is to let the optimiser do the work.

References

Opsahl-Ong, K., Ryan, M. J., Purtell, J., Broman, D., Potts, C., Zaharia, M., & Khattab, O. (2024). Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. Proceedings of EMNLP 2024. arXiv:2406.11695.