Ship prompts like code.
Evaluate and version your prompts, and gate regressions in CI. Serve the production version without a redeploy.
3 free evaluations/month · no credit card · cancel anytime
A number that means something specific.
Not “this prompt could be better.” Exactly which dimension is failing and why.
| Dimension | What it measures | Common failure |
|---|---|---|
| Clarity | The task has exactly one reasonable interpretation. No guessing required. | Vague verbs — "help me with", "do something about", "improve this" |
| Specificity | Output requirements are measurable — not adjectives. The model has no decisions left about what "done" looks like. | "Write a concise summary" vs "Write a 3-sentence summary in plain language" |
| Structure | Instructions follow logical order: role first, context second, task third, format last. | Format spec buried after the task, role missing entirely, constraints scattered |
| Robustness | Edge cases have explicit instructions — not generic fallbacks, but specific handling for the most likely failures. | Prompt assumes clean, well-formed input when real users submit anything but |
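
To make the Clarity row concrete, here is a rough sketch of a check for the vague phrases listed above. It is an illustration only, not how PromptEval scores a prompt: the function name `find_vague_phrases` and the phrase list are assumptions, and the list is simply copied from the table.

```python
from typing import List

# Vague phrases copied from the Clarity row; an illustrative list, not an official one.
VAGUE_PHRASES = ["help me with", "do something about", "improve this"]


def find_vague_phrases(prompt: str) -> List[str]:
    """Return the listed vague phrases found in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [phrase for phrase in VAGUE_PHRASES if phrase in lowered]


if __name__ == "__main__":
    # A vague request next to the measurable rewrite from the Specificity row.
    print(find_vague_phrases("Help me with this summary"))
    print(find_vague_phrases("Write a 3-sentence summary in plain language"))
```

A substring match like this only catches the exact wording in the list. It says nothing about whether the rest of the prompt is measurable or well ordered.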
61-point improvement. Same model. Better prompt.
Output shape specification works because you’re constraining the whole output distribution at once. Scripted edge-case responses are high-durability: pre-built templates survive ambiguity better than abstract rules.
— CodeMaitre · Reddit · came in skeptical, came out convinced
Gate it in CI, like you gate code.
The GitHub Action evaluates the prompt on the pull request and blocks the merge if the score drops, if there is a contradicting instruction, or if it regresses against the production version.
- uses: FranciscoFerreiraff/prompteval-action@v1
  with:
    api_key: ${{ secrets.PROMPTEVAL_API_KEY }}
    prompt_file: prompts/support-agent.md
    baseline_slug: support-agent
    min_score: 75
    fail_on_conflict: true

Works on any plan (lint). Serving + regression gate on Pro. Read the docs →
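
The three blocking conditions can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the action's actual implementation: the `Evaluation` shape, the `gate` function, and the 0-100 score range are hypothetical, and "the score drops" is read here as falling below `min_score`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Evaluation:
    """Hypothetical result shape; the real action's output may differ."""
    score: int             # overall prompt score, assumed to be 0-100
    conflicts: List[str]   # contradicting instructions found, if any


def gate(candidate: Evaluation,
         baseline: Optional[Evaluation],
         min_score: int = 75,
         fail_on_conflict: bool = True) -> List[str]:
    """Return the reasons to block the merge; an empty list means it passes."""
    reasons = []
    # Condition 1: the score falls below the configured threshold.
    if candidate.score < min_score:
        reasons.append(f"score {candidate.score} is below min_score {min_score}")
    # Condition 2: the prompt contains contradicting instructions.
    if fail_on_conflict and candidate.conflicts:
        reasons.append(f"{len(candidate.conflicts)} contradicting instruction(s)")
    # Condition 3: the score regresses against the production version.
    if baseline is not None and candidate.score < baseline.score:
        reasons.append(f"regression: {candidate.score} < production {baseline.score}")
    return reasons


if __name__ == "__main__":
    production = Evaluation(score=82, conflicts=[])
    candidate = Evaluation(score=70, conflicts=["'be brief' vs 'explain in detail'"])
    for reason in gate(candidate, production):
        print("BLOCK:", reason)
```

Returning every reason rather than stopping at the first one means a single pull request check can report all the problems at once.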
Six tools. One loop.
They evaluate the model’s output.
We evaluate the prompt’s structure. Before it runs.
Built for people who actually run prompts in production.
Practice prompt engineering daily.
Track how you improve.
A new constrained challenge every day. Pick difficulty modifiers to multiply your score. Share your result. Compete on the leaderboard.
If people see what changed without getting the answer handed to them, it keeps the challenge intact but still teaches the pattern. That’s probably what makes it sticky instead of a one time try.
— LeaderAtLeading · Reddit
Start free. Upgrade when you outgrow it.
no contracts · cancel anytime
Frequently asked questions
Do I need an account to try PromptEval?
How long does an evaluation take?
Can I use it in my CI/CD?
Do I need a redeploy to change a production prompt?
Does the API work on any plan? What about BYOK?
How is this different from asking Claude to review my prompt?
Is my prompt data private?
Gate your next prompt before it breaks.
3 free evaluations · no credit card required
Start free →