— AI prompt quality

Is your prompt good? Find out and improve it.

See what's failing in your prompt, get a better version, and spend fewer tokens.

PROMPTS EVALUATED · 1,330 and counting

TOKENS SAVED · 53,319 via optimizer

Start free → · See example report

3 free evaluations/month · no credit card · cancel anytime

— try it now

max. 12,000 chars

0/100

Clarity 83

Specificity 85

Structure 88

Robustness 73

example · click the box to test your own prompt

— the score

A number that means something specific.

Not “this prompt could be better.” Exactly which dimension is failing and why.

Dimension	What it measures	Common failure
Clarity	The task has exactly one reasonable interpretation. No guessing required.	Vague verbs — "help me with", "do something about", "improve this"
Specificity	Output requirements are measurable — not adjectives. The model has no decisions left about what "done" looks like.	"Write a concise summary" vs "Write a 3-sentence summary in plain language"
Structure	Instructions follow logical order: role first, context second, task third, format last.	Format spec buried after the task, role missing entirely, constraints scattered
Robustness	Edge cases have explicit instructions — not generic fallbacks, but specific handling for the most likely failures.	Prompt assumes clean, well-formed input when real users submit anything but

Before

“Help the customer with their issue and be professional.”

Score: 18 / 100

Clarity 22 · Specificity 8 · Structure 25 · Robustness 17

After

“You are a support specialist. Read the message and return: summary (1 sentence), customer_intent (only what’s stated), urgency (urgent / normal / low). If no clear request: intent = ‘unclear’, urgency = ‘normal’.”

Score: 79 / 100

Clarity 85 · Specificity 82 · Structure 77 · Robustness 72

61-point improvement. Same model. Better prompt.
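In both cards above, the headline score matches the unweighted mean of the four dimension scores. That is an observation about these two examples, not a documented formula; the function name below is hypothetical and the rounding rule is an assumption.

```python
def overall_score(dimensions: dict[str, int]) -> int:
    """Unweighted mean of the dimension scores, rounded to the nearest integer.

    Assumption: this mirrors the two example cards; the real scoring
    formula is not stated in the text.
    """
    return round(sum(dimensions.values()) / len(dimensions))


before = {"clarity": 22, "specificity": 8, "structure": 25, "robustness": 17}
after = {"clarity": 85, "specificity": 82, "structure": 77, "robustness": 72}

print(overall_score(before))                         # 18
print(overall_score(after))                          # 79
print(overall_score(after) - overall_score(before))  # 61
```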

“Output shape specification works because you’re constraining the whole output distribution at once. Scripted edge-case responses are high-durability: pre-built templates survive ambiguity better than abstract rules.”

— CodeMaitre · Reddit · came in skeptical, came out convinced
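One practical upside of a fixed output shape is that the reply can be checked mechanically. The sketch below validates the three fields requested by the "After" prompt, assuming the model's reply has already been parsed into a dict (the prompt itself does not name a wire format). `TriageResult` and `parse_triage` are hypothetical names used for illustration.

```python
from dataclasses import dataclass

# The three urgency values listed in the "After" prompt.
ALLOWED_URGENCY = {"urgent", "normal", "low"}
REQUIRED_FIELDS = {"summary", "customer_intent", "urgency"}


@dataclass
class TriageResult:
    summary: str
    customer_intent: str
    urgency: str


def parse_triage(raw: dict) -> TriageResult:
    """Check a parsed reply against the shape the prompt asks for."""
    missing = REQUIRED_FIELDS - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if raw["urgency"] not in ALLOWED_URGENCY:
        raise ValueError(f"unexpected urgency: {raw['urgency']!r}")
    return TriageResult(raw["summary"], raw["customer_intent"], raw["urgency"])


# The scripted fallback from the prompt: no clear request in the message.
fallback = parse_triage(
    {"summary": "Customer sent a greeting only.",
     "customer_intent": "unclear",
     "urgency": "normal"}
)
print(fallback.urgency)  # normal
```

Because the edge case has a scripted answer, the fallback passes the same validation as a normal reply, so downstream code needs no special branch for it.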

— regression gate

Ship prompts like code.

The GitHub Action evaluates the prompt on the pull request and blocks the merge if the score drops, if there is a contradicting instruction, or if it regresses against the production version.

✕ Some checks were not successful

✕ PromptEval / regression gate · failed in 7s

baseline (production) 82 → this PR 74

delta −8 · merge blocked

.github/workflows/prompt-check.yml

- uses: FranciscoFerreiraff/prompteval-action@v1
  with:
    api_key: ${{ secrets.PROMPTEVAL_API_KEY }}
    prompt_file: prompts/support-agent.md
    baseline_slug: support-agent
    min_score: 75
    fail_on_conflict: true

Lint mode works on any plan. Serving and the regression gate are on Pro. Read the docs
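The decision the gate makes can be sketched in a few lines. This is an illustration of the three blocking conditions described above, not the Action's actual implementation; the function name is hypothetical, and treating `min_score` as an absolute floor is an assumption based on the parameter name.

```python
def gate(score: int, baseline: int, min_score: int,
         has_conflict: bool, fail_on_conflict: bool = True) -> list[str]:
    """Return the reasons to block the merge; an empty list means pass."""
    reasons = []
    if score < min_score:
        reasons.append(f"score {score} is below min_score {min_score}")
    if score < baseline:
        reasons.append(
            f"score dropped {baseline - score} points against baseline {baseline}"
        )
    if fail_on_conflict and has_conflict:
        reasons.append("contradicting instructions found")
    return reasons


# The failing check shown above: baseline 82, this PR 74, min_score 75.
for reason in gate(score=74, baseline=82, min_score=75, has_conflict=False):
    print(reason)
```

With these inputs the gate reports two reasons: the score is under the floor of 75, and it is 8 points below the production baseline.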

— the loop

Six tools. One loop.

Other tools evaluate the model’s output.

We evaluate the prompt’s structure. Before it runs.

EVALUATE

Evaluator

0-100 score across 4 dimensions · critical errors with the exact line · improved prompt rewrite · conflict graph

Score: 43 · Robustness: 17 ← critical

Basic+

FIX

Iterator

Fix the exact instruction that’s failing · minimal surgical edit · without rewriting what already works

Robustness: 17 → 61 · line 3 instruction fixed

VERSION

Library

Save and version your prompts · each version keeps score, diff and change context · traceable history

V2 saved · Robustness +44pts · diff: 3 lines edited

Pro

SERVE

Prompt API

Serve library prompts via GET · no redeploy on each content change · updates when you save in the library

GET /api/v1/prompts/support-bot · V2 in prod

Basic+

TEST

A/B Playground

Test with your own API key · Batch A/B with two prompts · up to 7 criteria · LLM judge · radar chart results

V2 vs V1 · 10 inputs · 7 criteria · V2 won

COMPARE

Compare

Compare two versions side by side · score per dimension · automatically detects where it regressed and why

V1: 43 · V2: 74 · V3: 67 ← regressed here
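A minimal sketch of the "where it regressed" part, assuming a regression simply means a version scoring below its predecessor. The function name is hypothetical, and the product's own detection (including the "why") is not described in the text.

```python
def find_regressions(scores: list[int]) -> list[tuple[int, int]]:
    """Return (version_number, delta) for each version that scores
    below the version before it. Versions are numbered from 1."""
    regressions = []
    for index, (prev, cur) in enumerate(zip(scores, scores[1:]), start=2):
        if cur < prev:
            regressions.append((index, cur - prev))
    return regressions


history = [43, 74, 67]  # V1, V2, V3 from the example above
print(find_regressions(history))  # [(3, -7)]
```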

↩ prompt changed in production · a new cycle starts here

— who it’s for

Built for people who actually run prompts in production.

Free · 3 credits/month

Solo dev shipping an AI feature

“My prompt works in testing. Fails on 20% of real inputs. I don’t know why.”

Structural eval finds the exact robustness gap. Fix it once. No more "it worked yesterday."

Basic · $9/month

Developer using AI every day at work

“3 free evals don’t last a week. I want technical analysis and an improved prompt without paying $19.”

30 credits/month to evaluate, iterate, and map conflicts. Playground with your own key, unlimited library, and full technical analysis.

Pro · $19/month

Dev shipping prompts to production

“I changed the prompt, quality dropped, and nothing caught it before deploy.”

A CI regression gate + GitHub Action fail the PR when the score drops. Versioned history with diffs. Slug serving swaps production without a redeploy.

Team · $49/month

Team governing prompts in production

“Each prompt is a dependency. I need an approval process, not vibes.”

Workspaces with roles (viewer/editor/admin). Production approval flow before a version goes live. Audit log of who changed what. 250 API evals/month.

daily training · free · no signup

Learn prompt engineering on real workplace tasks.

Every session is a real work task with one technique to master. You write the prompt, get a per-criterion score, and see exactly what was missing. Sessions are organized into competency tracks, and each comes with an annotated reference answer.

“If people see what changed without getting the answer handed to them, it keeps the challenge intact but still teaches the pattern. That’s probably what makes it sticky instead of a one time try.”

LeaderAtLeading · Reddit

Fintech risk analysis · Support triage · Data extraction · Safety guardrails · + new scenarios weekly

train today

— plans

Start free. Upgrade when you outgrow it.

no contracts · cancel anytime

FREE

forever

3 web evals/month + API lint 10/mo (unlimited BYOK). Library up to 5 prompts with versioning. Free daily training.

BASIC

$9

/month

30 credits/month · technical analysis · improved prompt · iterator · map · playground · API lint 30/mo · unlimited library · daily training with tracks and reference answers.

MOST POPULAR

PRO

$19

/month

Unlimited web evals · slug serving (swap production, no redeploy) · CI regression gate + GitHub Action · full API · batch A/B · 35k chars.

TEAM

$49

/month

Pro + workspaces & roles · production approval · audit log · 250 API evals/mo · export · 60k chars.

compare plans in detail

payments processed securely via Stripe

faq

Frequently asked questions

Do I need an account to try PromptEval?

Evaluations require a free account — 3 per month, no credit card. The token counter and daily challenge run without any account.

How long does an evaluation take?

Typically 30–60 seconds depending on prompt length. You get a full dimensional breakdown, critical issues, strengths, and recommendations.

Can I use it in my CI/CD?

Yes. There is a REST API (POST /api/v1/eval) and an official GitHub Action that fails the pull request when the score drops, when there is a contradicting instruction, or when the prompt regresses against the production version. Lint mode works on any plan; full mode requires Pro/Team or BYOK.

Do I need a redeploy to change a production prompt?

No. On Pro+ you give a prompt a slug and serve the production version via GET /api/v1/prompts/{slug}. Change the production version in the library and it takes effect within ~60 seconds, no deploy.
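On the consuming side, one way to use slug serving is a small cache that refetches the prompt after a fixed interval, so a library change shows up without a deploy. This is a sketch under assumptions: the 60-second TTL simply mirrors the "~60 seconds" figure above, `PromptCache` is a hypothetical name, and the fetch function is injected because the base URL, auth scheme, and response format are not given in the text.

```python
import time
from typing import Callable


class PromptCache:
    """Cache prompt text per slug and refetch once the TTL has elapsed."""

    def __init__(self, fetch: Callable[[str], str],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # e.g. a GET to /api/v1/prompts/{slug}
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, slug: str) -> str:
        now = self._clock()
        cached = self._entries.get(slug)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        text = self._fetch(slug)
        self._entries[slug] = (now, text)
        return text


# Stub fetcher standing in for the real HTTP call.
def fake_fetch(slug: str) -> str:
    return f"prompt text for {slug}"


cache = PromptCache(fake_fetch)
print(cache.get("support-bot"))  # fetched once
print(cache.get("support-bot"))  # served from the cache
```

Injecting the fetcher and the clock keeps the cache testable without network access and leaves the HTTP details to whatever client the application already uses.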

Does the API work on any plan? What about BYOK?

Yes — the evaluation API is open on every plan, with a monthly managed quota (free 10 · basic 30 · pro 75 · team 250). With BYOK (your Anthropic key in the X-Provider-Key header) inference runs on your key: it consumes no quota and unlocks full mode on any plan.
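A sketch of what a call to the evaluation endpoint might look like with the standard library. Only the path (POST /api/v1/eval) and the X-Provider-Key header come from the text above. The base URL is a placeholder, and the `Authorization: Bearer` scheme and the `prompt` body field are assumptions, so check the API docs before relying on them.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://prompteval.example"  # placeholder, not the real host


def build_eval_request(prompt: str, api_key: str,
                       provider_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST to the evaluation endpoint."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
    }
    if provider_key:
        # BYOK: inference runs on your own Anthropic key.
        headers["X-Provider-Key"] = provider_key
    body = json.dumps({"prompt": prompt}).encode("utf-8")  # assumed field name
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/eval", data=body, headers=headers, method="POST"
    )


request = build_eval_request("You are a support specialist.", "pe_test_key",
                             provider_key="sk-ant-test")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```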

How is this different from asking Claude to review my prompt?

Claude gives conversational suggestions — subjective, not reproducible, no version tracking. PromptEval gives a numeric 0-100 score across 4 specific dimensions, versioned and comparable across iterations.

Is my prompt data private?

Yes. All prompts are stored with Row Level Security — only your account can access them. PromptEval does not use your prompts to train models.

Gate your next prompt before it breaks.

3 free evaluations · no credit card required

1,330 prompts evaluated · 53,319 tokens saved

Start free →