2026-05-17·Francisco Ferreira·10 min read

How to Write Specific AI Prompts (With Before/After Examples and Scores)

Learn the 4 levels of prompt specificity with real before/after examples and PromptEval scores. The most practical guide to getting consistent AI outputs.

Quick Answer

A specific AI prompt defines the output format, length, audience, constraints, and edge cases, leaving as few decisions as possible to the model's interpretation. The fastest lever: define output format first. "Write an email" → "Write a 120-word email with subject line, one ROI metric, and a single CTA." Together with constraints and audience, that addition accounts for the 20–40 point specificity gap described below.

Across 110 prompts evaluated on PromptEval, specificity is the dimension with the widest gap between the highest and lowest scoring prompts. Prompts that define output format, constraints, and audience score 20–40 points higher on specificity than prompts that define only the task — while their clarity and structure scores stay nearly the same.

That gap has a direct production consequence. When specificity is low, the model fills the gaps on its own — and fills them differently on every run. This guide explains the 4 levels of prompt specificity, the 5 levers that move the score, and 3 before/after examples with real dimensional scores.

Why specificity breaks first in production

A prompt that works in testing often fails in production, not because the model changed but because the range of real inputs is wider than anything you tested. Specificity failures are the main cause, because a vague prompt relies on the model making consistent interpretations across wildly different inputs.

Three things happen when a prompt is underspecified in production:

  • Format drift. The model returns a bulleted list one day and a paragraph the next. If your downstream code parses the output, it breaks intermittently and is hard to debug.
  • Length variance. Without a word count or length constraint, outputs range from one sentence to several paragraphs depending on how the model reads the input complexity.
  • Scope creep. Without explicit exclusions ("don't include pricing details"), the model adds information that's outside your intended scope on some inputs but not others.

Specificity is a constraint problem, not a wording problem. You're not trying to phrase the request more cleverly — you're trying to reduce the number of decisions the model makes on your behalf.

The 4 levels of prompt specificity

Every prompt sits somewhere on this scale. Moving up one level typically raises the specificity dimension score by 15–25 points on PromptEval's 0–100 scale.

Level | What's defined | Example | Typical specificity score
--- | --- | --- | ---
1 (Vague) | Task only | "Write a follow-up email" | 15–30
2 (Constrained) | Task + format + length | "Write a 100-word follow-up email for a B2B SaaS prospect" | 40–60
3 (Parameterized) | Task + format + constraints + audience | "Write a 100-word follow-up email for a CFO who saw a demo. No jargon. Focus on ROI." | 65–80
4 (Production-ready) | Task + format + constraints + persona + examples + edge cases | Full system prompt with persona, format spec, one-shot example, explicit exclusions, fallback instruction | 82–95

Most prompts in production sit at Level 2. They define the task and a rough format, but leave audience, constraints, and edge cases to the model. Level 3 is where output consistency becomes reliable enough to ship. Level 4 is what you need when the prompt runs on thousands of different inputs per day.

The 5 specificity levers

These are the five things you can make specific in any prompt, ranked by impact on output consistency:

1. Output format definition

The highest-impact lever. Replace any description of quality with a description of structure. Instead of "write a professional summary," write "write a 3-sentence summary: sentence 1 states the outcome, sentence 2 gives the method, sentence 3 gives the metric."

Format definition covers: document type (email, report, JSON), structure (bullets vs. prose vs. table), length (word count or sentence count), and required elements (must include: subject line, one CTA, no more than 2 paragraphs).
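One way to keep these four parts from being forgotten is to build the prompt from a small structure instead of free text. The sketch below is a minimal illustration, not part of PromptEval: `FormatSpec` and `build_prompt` are hypothetical names, and the sample task and values are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FormatSpec:
    """Hypothetical container for the four parts of a format definition."""
    document_type: str
    structure: str
    max_words: int
    required_elements: list[str] = field(default_factory=list)


def build_prompt(task: str, spec: FormatSpec) -> str:
    """Append an explicit format block to a task description."""
    lines = [
        task,
        f"Document type: {spec.document_type}.",
        f"Structure: {spec.structure}.",
        f"Length: at most {spec.max_words} words.",
    ]
    # Only add the "Must include" line when there is something to require.
    if spec.required_elements:
        lines.append("Must include: " + ", ".join(spec.required_elements) + ".")
    return "\n".join(lines)


# Illustrative values only.
spec = FormatSpec(
    document_type="email",
    structure="prose, no more than 2 paragraphs",
    max_words=120,
    required_elements=["subject line", "one CTA"],
)
prompt = build_prompt("Write a follow-up email to a sales lead.", spec)
print(prompt)
```

Because every field except `required_elements` is mandatory, a prompt cannot be built without a document type, a structure, and a length.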

2. Constraints and exclusions

Explicit exclusions are more reliable than asking the model to "keep it short" or "be professional." Both are interpretations. Exclusions are instructions: "don't include pricing information," "no bullet points," "maximum 3 paragraphs," "never recommend a specific vendor."

Constraints also cover what the model should do when the input is outside the expected scope — the edge case instruction that turns a Level 3 prompt into a Level 4.
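Exclusions written as instructions can also be checked mechanically on the output side. The following is a rough sketch under simple assumptions: `find_violations` is a hypothetical helper, banned terms are matched as plain case-insensitive substrings, and a bullet is any line that starts with `-`, `*`, or `•`.

```python
import re


def find_violations(output: str, banned_terms: list[str], max_paragraphs: int) -> list[str]:
    """Return a list of human-readable constraint violations (empty if none)."""
    violations = []
    lowered = output.lower()

    # Exclusions such as "don't include pricing information".
    # Substring matching is naive: it will also flag words that contain the term.
    for term in banned_terms:
        if term.lower() in lowered:
            violations.append(f"banned term present: {term}")

    # "No bullet points": look for a line that begins with a bullet marker.
    if re.search(r"^\s*[-*•]\s+", output, flags=re.MULTILINE):
        violations.append("bullet points present")

    # "Maximum N paragraphs": paragraphs are blocks separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", output.strip()) if p.strip()]
    if len(paragraphs) > max_paragraphs:
        violations.append(f"{len(paragraphs)} paragraphs, limit is {max_paragraphs}")

    return violations


draft = "Our pricing starts low.\n\n- First benefit\n- Second benefit"
print(find_violations(draft, banned_terms=["pricing"], max_paragraphs=3))
```

A check like this only catches violations after the fact; the exclusions still belong in the prompt itself.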

3. Audience and purpose parameters

Telling the model who will read the output changes how it writes more than almost any other instruction. "CFO" vs "software engineer" vs "first-time user" produces different vocabulary, assumed knowledge, and level of technical detail — without you having to specify each one. One audience definition replaces dozens of individual tone instructions.

4. Scope definition

Scope is what the prompt covers and what it doesn't. Scope failures produce outputs that are technically correct but practically useless — a summary that includes tangential information, or a recommendation that covers adjacent products you didn't ask about. Define the scope boundary: "only cover the features mentioned in the input, don't recommend upsells."

5. One-shot examples

When the output format is complex or non-standard, one example of what you want is worth more than two paragraphs of description. "Format the response like this: [example]" removes most of the guesswork about format. One-shot examples come last because they cost the most tokens; use them when format definition alone isn't producing consistent structure.

Before and after: 3 prompts rewritten for specificity

Each example shows the original prompt, a Level 3 or Level 4 rewrite, and the change in the specificity score. The scores are based on the structural patterns PromptEval evaluates, not subjective quality ratings.

Example 1: Sales follow-up email

Before — Level 1 — specificity score: 18

"Write a follow-up email for a sales lead who didn't respond."

After — Level 4 — specificity score: 84

"You are a B2B SaaS account executive. Write a 120-word follow-up email to a CFO who attended a live product demo 5 days ago but hasn't responded. Tone: direct and confident, not apologetic. Include: subject line, one specific pain point the CFO mentioned (use [PAIN_POINT] placeholder), one ROI metric relevant to their industry (use [METRIC] placeholder), and one clear next step with a specific date. No bullet points. No sign-off other than first name. Don't mention competitors or pricing."

The specificity score jumped from 18 to 84 by adding: persona (account executive), audience (CFO), constraints (120 words, direct tone, no bullets), required elements (subject, pain point, metric, CTA), explicit exclusions (no competitors, no pricing), and template variables for the variable input.
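Template variables such as [PAIN_POINT] and [METRIC] only help if every one of them is filled before the prompt is sent. A minimal sketch of that step follows, assuming placeholders are upper-case names in square brackets as in the rewrite above; `fill_template` is a hypothetical helper and the sample values are invented for illustration.

```python
import re

# Matches placeholders like [PAIN_POINT]; lower-case text in brackets is left alone.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace each [NAME] placeholder, failing loudly if a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for [{name}]")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


template = "Mention [PAIN_POINT] and cite [METRIC]."
filled = fill_template(
    template,
    {"PAIN_POINT": "slow month-end close", "METRIC": "hours saved per report"},
)
print(filled)
```

Raising an error on a missing value is a deliberate choice here: a prompt that reaches the model with a literal placeholder in it is itself underspecified.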

Example 2: Product feature explanation

Before — Level 1 — specificity score: 22

"Explain how our token optimizer feature works."

After — Level 3 — specificity score: 71

"Write a 2-paragraph explanation of the token optimizer feature for a product landing page. Audience: non-technical product managers who understand AI tools but not token mechanics. Paragraph 1: what problem it solves (cost and inconsistency). Paragraph 2: how it works in 3 steps, without technical jargon. End with one concrete result: 'Prompts tested on this feature compressed from 112 tokens to 48 — a 57% reduction.' No feature names from competitor products."

Example 3: Data analysis summary

Before — Level 2 — specificity score: 41

"Summarize the key findings from this sales report in bullet points. Be concise."

After — Level 4 — specificity score: 88

"Summarize the key findings from this sales report. Output format: exactly 5 bullet points. Each bullet: one sentence, max 20 words, past tense, starts with a metric or percentage. Only include findings with data to support them — if the report doesn't contain a specific number, don't include that finding. If there are fewer than 5 data-supported findings, write only as many as exist and note how many are missing at the end."

The last example illustrates the edge case instruction at Level 4: "if there are fewer than 5 data-supported findings, write only as many as exist." This single line prevents the model from fabricating data to fill 5 bullets — a failure mode that's invisible in testing but dangerous in production.
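The format rules in the Example 3 rewrite are concrete enough to verify in code. The sketch below checks a model output against them under stated assumptions: `check_summary` is a hypothetical helper, bullets are lines starting with `-`, `*`, or `•`, and "starts with a metric or percentage" is approximated as "starts with a digit, optionally preceded by a sign or currency symbol". It does not check tense or the note about missing findings.

```python
import re

BULLET_MARKERS = ("-", "*", "•")
# Rough stand-in for "starts with a metric or percentage".
STARTS_WITH_NUMBER = re.compile(r"[+-]?[$€£]?\d")


def check_summary(output: str, max_bullets: int = 5, max_words: int = 20) -> list[str]:
    """Return a list of format problems found in a bulleted summary."""
    problems = []

    # Keep bullet lines only, with the leading marker removed.
    bullets = [
        line.strip()[1:].strip()
        for line in output.splitlines()
        if line.strip().startswith(BULLET_MARKERS)
    ]

    if not bullets:
        problems.append("no bullet points found")
    if len(bullets) > max_bullets:
        problems.append(f"{len(bullets)} bullets, limit is {max_bullets}")

    for number, text in enumerate(bullets, start=1):
        if len(text.split()) > max_words:
            problems.append(f"bullet {number} is longer than {max_words} words")
        if not STARTS_WITH_NUMBER.match(text):
            problems.append(f"bullet {number} does not start with a number")

    return problems


good = "- 12% rise in Q3 revenue was recorded.\n- 340 new accounts were opened in March."
bad = "- Revenue grew strongly across regions."
print(check_summary(good))
print(check_summary(bad))
```

Fewer than 5 bullets passes on purpose, which mirrors the fallback instruction in the prompt.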

Specificity traps — when being specific backfires

Over-specification is a real problem, not a theoretical one. Three patterns cause it:

Specifying what the model already handles correctly. If you ask for "professional tone" and the model already defaults to professional tone for your use case, that instruction occupies tokens without changing behavior. Test first — add constraints only for dimensions where the model guesses wrong.

Conflicting constraints. "Write a comprehensive explanation in under 50 words" is an internally contradictory instruction. The model will prioritize one constraint over the other, unpredictably. Audit your constraints for conflicts before shipping.

Format that doesn't fit all inputs. A rigid output format — "exactly 5 bullet points" — breaks when the input has 2 relevant items. Always add a fallback: "if fewer than 5 items are present, include only what the input supports."

The top-ranked prompt on PromptEval's leaderboard, a B2B sales agent scoring 87 overall, has clarity at 92 but specificity at 78. That 14-point gap between two dimensions suggests that specificity is often the constraint that gets skipped, even by experienced prompt writers. It's worth checking explicitly.

How to test whether your specificity improvements worked

Three tests, in order of cost:

  1. The interpretation test. Read the prompt out loud and ask: could a smart colleague who has never seen your product interpret this differently than you intended? Every ambiguity is a specificity gap. Fix it before running any model calls.
  2. The format test. Run the prompt 3 times on the same input. If the output format (length, structure, included elements) varies between runs, the format constraints aren't specific enough. Add word count, structure definition, or required elements until the format stabilizes.
  3. The edge case test. Run the prompt on 5 inputs that are slightly outside your ideal case — shorter than expected, in a different format, missing a field your prompt assumes. If the model breaks format or hallucinates missing data, you need an edge case instruction.
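The format test above can be automated with a small harness. This is a sketch under assumptions, not a reference implementation: `generate` stands for whatever function calls your model and returns text, `structure_signature` and `format_is_stable` are hypothetical names, and the notion of "same format" is reduced to three coarse features. The demo uses stub functions in place of a real model.

```python
from itertools import cycle
from typing import Callable


def structure_signature(output: str, word_tolerance: int = 25) -> tuple[bool, int, int]:
    """Summarize an output's shape: bullets present, paragraph count, length bucket."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    has_bullets = any(line.startswith(("-", "*", "•")) for line in lines)
    paragraphs = len([p for p in output.split("\n\n") if p.strip()])
    # Coarse bucket so small wording changes don't count as drift.
    # Outputs just either side of a bucket boundary will still differ.
    length_bucket = len(output.split()) // word_tolerance
    return has_bullets, paragraphs, length_bucket


def format_is_stable(generate: Callable[[str], str], prompt: str, runs: int = 3) -> bool:
    """Run the same prompt several times and report whether the shape never changed."""
    signatures = {structure_signature(generate(prompt)) for _ in range(runs)}
    return len(signatures) == 1


# Stub "models" for demonstration: one alternates formats, one is consistent.
canned = cycle(["One short paragraph.", "- bullet one\n- bullet two"])


def drifting_model(prompt: str) -> str:
    return next(canned)


def steady_model(prompt: str) -> str:
    return "One short paragraph."


print(format_is_stable(drifting_model, "Summarize the report."))  # False
print(format_is_stable(steady_model, "Summarize the report."))    # True
```

The same harness can be pointed at the edge case inputs from test 3 by calling it once per input.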

For a systematic approach to the full evaluation workflow — including how specificity interacts with clarity, structure, and robustness scores — this guide on evaluating AI prompt quality covers the four-dimension framework in detail. And if you want to see exactly where your prompt sits on the specificity scale before shipping, this comparison of AI prompt checkers shows which tools give you dimension-by-dimension scores.

Prompt specificity is a measurable property, not a feeling. A prompt that scores 18 on specificity can reach 84 in two revision cycles, as in Example 1, without changing the task, the model, or anything else. The revisions just replace interpretations with instructions.

You just learned how to write a more specific prompt

See exactly what score it gets — PromptEval evaluates it free with 3 credits. You'll see separate scores for clarity, specificity, structure, and robustness — with specific callouts for the phrases that are pulling the score down. Or test your prompt engineering skills today with the Daily Challenge.

Frequently Asked Questions

What does it mean to write a specific AI prompt?

A specific AI prompt defines exactly what the model should produce — output format, length, audience, constraints, and edge cases — leaving no decision to the model's interpretation. The more decisions you make explicit, the less the model has to guess, and the more consistent the output becomes across different runs and different inputs.

What's the fastest way to make an AI prompt more specific?

Define the output format first. Most prompts fail on specificity because they describe the task but not the result. "Write an email" is a task. "Write a 120-word email with a subject line, one ROI metric in sentence 2, and a single CTA as the last sentence" is a specific instruction. Adding format constraints — type, length, included elements, excluded elements — is the single highest-impact lever for output consistency.

How do I know if my prompt is specific enough?

Run the interpretation test: could two different people read your prompt and expect a different output format, length, or tone? Every "yes" is a specificity gap. A prompt is specific enough when the only variable left is the model's wording — not its interpretation of what you want. You can also score it structurally using prompt evaluation metrics before running it on real inputs.

Can a prompt be too specific?

Yes. Over-specification wastes tokens on constraints the model handles correctly by default, and creates conflicting instructions when format requirements don't fit every input. The practical rule: only specify constraints where the model currently guesses wrong. Test the default behavior first — add constraints only where you observe inconsistency.

How does prompt specificity affect AI output quality scores?

Specificity is one of the four dimensions scored by PromptEval (0–100). Prompts that define output format, constraints, and audience consistently score 20–40 points higher on specificity than prompts that define only the task. The current leaderboard top prompt, a B2B sales agent scoring 87 overall, has specificity at 78 and clarity at 92, which suggests that specificity is often the last dimension to be fully addressed even in high-performing production prompts.
