How to A/B Test AI Prompts (Multi-Criteria Guide with Real Examples)
Compare two AI prompts across multiple criteria and inputs — no code required. The systematic method teams use to make confident prompt decisions.
A/B testing prompts means comparing two prompt variants against the same inputs, judged by pre-defined criteria. Three structural requirements: (1) a representative test set — not just happy-path examples, (2) explicit success criteria defined before you run the test, and (3) enough inputs to surface meaningful differences — at minimum 10. Gut-feel comparison is not A/B testing.
Data from PromptEval's public leaderboard shows that even top-ranked production prompts — those scoring above 70 out of 100 — consistently show weak spots in at least one structural dimension. The prompt you think is better often isn't, in the specific ways that matter for your use case.
A/B testing is how you find out for certain. But the way most developers test prompts — run both a few times, see which looks better — doesn't produce reliable conclusions. This guide gives you a systematic method that does.
Why A/B testing prompts is harder than A/B testing web pages
Web A/B tests measure a single metric: click rate, conversion, time on page. One number, clear winner.
Prompt A/B tests are inherently multi-dimensional. A prompt can be better at following format instructions but worse at tone. Better for short inputs but worse for edge cases. Better for one model but not another. If you evaluate on only one dimension, you'll optimize for it while regressing on others you didn't measure.
This is why prompt A/B testing requires explicit, pre-defined criteria — not a vague sense of which output looks better. The criteria you choose before running the test determine what you actually learn from it. Choosing them after you see the results is how you confirm your pre-existing bias instead of testing it.
The other structural difference: sample size. A web A/B test can run on thousands of real users. A prompt test runs on a set of inputs you curate manually. If your test set doesn't cover edge cases, adversarial inputs, and format variations — not just the happy path — you'll get a winner that only wins on easy inputs.
The PACE Framework for Prompt A/B Testing
PACE is a four-step process for running prompt A/B tests that produce actionable results, not just impressions.
P — Pair your prompts. Define Prompt A and Prompt B with a single variable changed between them. The most common mistake is changing multiple things at once — system prompt, role, format, and tone — and then not knowing which change drove the result. Test one variable per run. If you changed the role definition and the output format simultaneously, you don't know which change mattered.
A — Assert your criteria. Write down what "better" means before you run a single test. Be specific and binary or scored where possible: "Response stays under 150 words" is testable. "Response is clear" is not. For each criterion, decide: is this pass/fail, or a score from 1 to 5? Aim for 3 to 7 criteria. Fewer than 3 misses meaningful dimensions; more than 7 creates noise in the results.
C — Cover your inputs. Build a test set that represents the actual distribution your prompt will face in production — not just the inputs you expect it to handle well. A representative set for most production prompts includes: 5-7 typical inputs (the happy path), 2-3 edge cases (unusual formats, missing information, borderline requests), and 1-2 adversarial inputs (attempts to confuse the prompt or produce off-topic output). Ten inputs total is the minimum for meaningful conclusions.
E — Evaluate and compare. Run both prompts against all inputs. Score each output against each criterion. Sum the scores. The prompt that wins on more criteria across more inputs is the better choice — but look at the dimensional breakdown before you decide. A prompt that wins 6 of 7 criteria but loses on the one that's most critical for your use case is not the right choice, even if the total score is higher.
Before applying PACE, a structural quality check on each prompt variant catches problems that will make your test results meaningless regardless of how well you designed the test. The four structural dimensions that determine prompt quality give you a framework for that review.
How to define evaluation criteria that actually measure what matters
The failure mode here is writing criteria that sound specific but aren't. "Accuracy" is not a criterion. "Response contains only information present in the provided context, with no hallucinated facts" is a criterion.
Six criteria categories that work for most production prompt tests:
- Format compliance: Does the output match the specified format? (JSON schema, bullet count, word count, section headers)
- Instruction following: Does the output address what was asked, without ignoring or reinterpreting the request?
- Factual grounding: Does the output avoid introducing information not in the input or context?
- Tone consistency: Does the output maintain the specified tone across different input types?
- Edge case handling: Does the output behave correctly when the input is ambiguous, missing data, or off-pattern?
- Output length: Is the response within the specified length range — neither truncated nor padded?
For most use cases, 3 to 5 of these six cover the dimensions that matter. Pick the ones most likely to break in production, not the ones easiest to evaluate.
Real example: two prompts, 5 criteria, 10 inputs
A SaaS support team is testing two system prompts for a ticket-triage agent. The agent reads support tickets and classifies them by urgency (P1/P2/P3) and category (billing, technical, general).
Prompt A (concise): "You are a support triage agent. Classify each ticket by urgency (P1: system down, P2: major feature broken, P3: minor issue or question) and category (billing, technical, general). Return JSON: {urgency, category, reason}."
Prompt B (detailed): "You are an expert support triage agent for a B2B SaaS product. Read each ticket and output a classification. Urgency: P1 (complete system outage, data loss risk), P2 (major feature unavailable, revenue impact possible), P3 (minor bug or question). Categories: billing (payment, invoice, subscription), technical (bug, error, integration), general (how-to, feature request, feedback). Return exactly: {urgency: string, category: string, reason: string}. Reason should be one sentence."
Criteria (5):
- JSON format valid and parseable
- Urgency classification matches expected label
- Category classification matches expected label
- Reason field present and one sentence
- Classification holds on edge-case inputs (multi-issue ticket, vague ticket with no category signals)
Test set (10 inputs): 6 typical tickets (billing, technical, general at P2 and P3 urgency levels), 2 edge cases (multi-issue ticket, vague ticket), 2 adversarial inputs (ticket in a different language, empty ticket body).
Results: Prompt B won on 4 of 5 criteria across all 10 inputs. Both tied on JSON validity (10/10 each). The gap was largest on edge-case handling: Prompt A failed on 3 of 4 edge/adversarial inputs; Prompt B failed on 1. The added specificity in the urgency definitions drove the difference — not the length of the prompt.
This is the kind of insight you don't get from running each prompt once and reading the outputs.
Four A/B testing mistakes that invalidate your results
1. Changing multiple variables between A and B. If Prompt B has a different role, different format instructions, and a different output schema than Prompt A, you don't know what caused any difference in results. Each test should isolate one variable.
2. Writing success criteria after you see the outputs. If you decide what "better" means after reading both outputs, you're selecting criteria to justify the output you already prefer. Define criteria before you run a single prompt.
3. Testing only on easy inputs. A test set of 10 typical, well-formed inputs doesn't reveal how each prompt handles the actual distribution it'll face in production. Include edge cases and at least one adversarial input in every test set.
4. Stopping at "which prompt won" without looking at the dimensional breakdown. A prompt that wins on 4 of 5 criteria but fails on format compliance is not a good choice for a system that parses JSON outputs downstream. The winner depends on which criteria matter most for your use case — not just the aggregate score.
For more on what makes a prompt structurally ready before any test, the complete pre-production evaluation process covers the structural review step that should happen first.
How to run batch A/B tests without writing code
The traditional approach requires a Python script, an API key, a JSON dataset, and 30-60 minutes of setup before you see a single result. That's a reasonable investment for CI/CD eval pipelines. It's a disproportionate barrier for a product manager or a founder who needs a directional answer before committing to a full eval pipeline.
PromptEval's Batch A/B Test runs the same structured test in a four-step wizard: define Prompt A and Prompt B, select up to 7 evaluation criteria (binary or scored), add up to 10 test inputs, and run. An LLM judge evaluates each combination — both prompts against each input, on each criterion — and displays results as a radar chart and bar chart. The visual comparison makes trade-offs immediately visible: Prompt A wins on format compliance and tone; Prompt B wins on edge-case handling and instruction following.
If you want to iterate on prompt variants before committing to a batch test, the Playground (Pro) lets you test prompts live against the Anthropic or OpenAI API with your own key, seeing outputs in real time across different inputs before you design the formal comparison.
Most tools on this list charge from day one. PromptEval gives you 3 full evaluations free — no credit card. That's enough to run a structural check on both prompt variants before you invest time in batch testing.
Frequently Asked Questions
What is prompt A/B testing?
Prompt A/B testing is the process of comparing two prompt variants against the same set of test inputs, judged by pre-defined evaluation criteria. The goal is to determine which prompt produces better outputs for a specific use case, based on measurable criteria rather than subjective impression.
How many test inputs do I need for a valid prompt A/B test?
A minimum of 10 inputs — 5-7 typical cases, 2-3 edge cases, and at least 1 adversarial input. Fewer than 10 is likely to give you a winner that only wins on easy inputs. For high-stakes production prompts, 20-50 inputs is better, but 10 is the minimum for directionally reliable results.
Can I A/B test prompts without writing code?
Yes. PromptEval's Batch A/B Test runs structured multi-criteria comparisons in a browser wizard — no SDK, no CLI, no Python required. You define two prompts, up to 7 criteria, and up to 10 test inputs, and the system handles evaluation and visualization. Braintrust also has a UI-based workflow for output testing, though it requires more initial setup.
How do I choose evaluation criteria for a prompt A/B test?
Write down what "better" means before you run any test. Make each criterion testable: not "accurate" but "contains only information present in the input." Aim for 3 to 7 criteria — the ones most likely to break in production. Format compliance, instruction following, and edge-case handling are almost always worth including.
What's the difference between prompt A/B testing and LLM evaluation?
LLM evaluation measures a model's general capabilities across standardized benchmarks. Prompt A/B testing measures which of your specific prompts produces better outputs for your specific use case, on your specific test set. The same model can return different quality results depending on which prompt you use — which is exactly what the A/B test tells you.
Apply what you just learned — evaluate your prompt free.
Try PromptEval →