prompt linter · CI regression gate

Ship prompts like code.

Evaluate and version prompts, and gate regressions in CI. Serve the production version without a redeploy.

Start free → · See example report

3 free evaluations/month · no credit card · cancel anytime

the score

A number that means something specific.

Not “this prompt could be better.” Exactly which dimension is failing and why.

Dimension | What it measures | Common failure
Clarity | The task has exactly one reasonable interpretation. No guessing required. | Vague verbs: "help me with", "do something about", "improve this"
Specificity | Output requirements are measurable, not adjectives. The model has no decisions left about what "done" looks like. | "Write a concise summary" vs "Write a 3-sentence summary in plain language"
Structure | Instructions follow logical order: role first, context second, task third, format last. | Format spec buried after the task, role missing entirely, constraints scattered
Robustness | Edge cases have explicit instructions: not generic fallbacks, but specific handling for the most likely failures. | Prompt assumes clean, well-formed input when real users submit anything but
Before
“Help the customer with their issue and be professional.”
Score: 18 / 100
Clarity 22 · Specificity 8 · Structure 25 · Robustness 17
After
“You are a support specialist. Read the message and return: summary (1 sentence), customer_intent (only what’s stated), urgency (urgent / normal / low). If no clear request: intent = ‘unclear’, urgency = ‘normal’.”
Score: 79 / 100
Clarity 85 · Specificity 82 · Structure 77 · Robustness 72

61-point improvement. Same model. Better prompt.
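
In both examples above, the overall score happens to equal the unweighted mean of the four dimension scores. The sketch below checks that arithmetic. Treating the mean as the scoring formula is an assumption drawn from these two data points, not a documented rule.

```python
from statistics import mean

# Dimension scores copied from the Before/After example above.
before = {"clarity": 22, "specificity": 8, "structure": 25, "robustness": 17}
after = {"clarity": 85, "specificity": 82, "structure": 77, "robustness": 72}


def overall(dimensions: dict[str, int]) -> int:
    """Assumed formula: unweighted mean of the dimension scores, rounded."""
    return round(mean(dimensions.values()))


improvement = overall(after) - overall(before)
print(overall(before), overall(after), improvement)  # prints: 18 79 61
```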

Output shape specification works because you’re constraining the whole output distribution at once. Scripted edge-case responses are high-durability: pre-built templates survive ambiguity better than abstract rules.

— CodeMaitre · Reddit · came in skeptical, came out convinced
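
A fixed output shape also makes replies checkable by code. Here is a minimal sketch of a validator for the three fields named in the "After" prompt. It assumes the model is asked to reply as a JSON object, which the prompt itself does not state; `validate_reply` is a hypothetical helper, not part of PromptEval.

```python
import json

# Field names and urgency values come from the "After" prompt above.
REQUIRED_FIELDS = ("summary", "customer_intent", "urgency")
ALLOWED_URGENCY = ("urgent", "normal", "low")


def validate_reply(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the reply matches the shape."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return ["reply is not valid JSON"]
    if not isinstance(reply, dict):
        return ["reply is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in reply]
    if "urgency" in reply and reply["urgency"] not in ALLOWED_URGENCY:
        problems.append(f"urgency must be one of {ALLOWED_URGENCY}")
    return problems


# The scripted fallback from the prompt passes the check.
fallback = '{"summary": "No clear request.", "customer_intent": "unclear", "urgency": "normal"}'
print(validate_reply(fallback))
```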

regression gate

Gate it in CI, like you gate code.

The GitHub Action evaluates the prompt on the pull request and blocks the merge if the score drops, if there is a contradicting instruction, or if it regresses against the production version.

Some checks were not successful
PromptEval / regression gate · failed in 7s
baseline (production) 82 → this PR 74
delta −8 · merge blocked
.github/workflows/prompt-check.yml
- uses: FranciscoFerreiraff/prompteval-action@v1
  with:
    api_key: ${{ secrets.PROMPTEVAL_API_KEY }}
    prompt_file: prompts/support-agent.md
    baseline_slug: support-agent
    min_score: 75
    fail_on_conflict: true

Lint mode works on any plan. Serving and the regression gate require Pro. Read the docs →

the loop

Six tools. One loop.

Other tools evaluate the model’s output.

We evaluate the prompt’s structure. Before it runs.

EVALUATE · Evaluator
0-100 score across 4 dimensions · critical errors with the exact line · improved prompt rewrite · conflict graph
Score: 43 · Robustness: 17 ← critical

FIX · Iterator · Basic+
Fix the exact instruction that’s failing · minimal surgical edit · without rewriting what already works
Robustness: 17 → 61 · line 3 instruction fixed

VERSION · Library
Save and version your prompts · each version keeps score, diff and change context · traceable history
V2 saved · Robustness +44 pts · diff: 3 lines edited

SERVE · Prompt API · Pro
Serve library prompts via GET · no redeploy on each content change · updates when you save in the library
GET /api/v1/prompts/support-bot · V2 in prod

TEST · A/B Playground · Basic+
Test with your own API key · Batch A/B with two prompts · up to 7 criteria · LLM judge · radar chart results
V2 vs V1 · 10 inputs · 7 criteria · V2 won

COMPARE · Compare
Compare two versions side by side · score per dimension · automatically detects where it regressed and why
V1: 43 · V2: 74 · V3: 67 ← regressed here

↩ prompt changed in production · a new cycle starts here
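
The regression flag in the Compare example comes down to comparing each version's score with the one before it. A rough sketch of that idea follows; `find_regressions` is a hypothetical helper, and the real tool also compares per-dimension scores, which this example leaves out.

```python
def find_regressions(history: list[tuple[str, int]]) -> list[tuple[str, str, int]]:
    """Return (previous, current, delta) for each version that scored below its predecessor."""
    return [
        (prev_name, name, score - prev_score)
        for (prev_name, prev_score), (name, score) in zip(history, history[1:])
        if score < prev_score
    ]


# Version scores from the Compare example above.
history = [("V1", 43), ("V2", 74), ("V3", 67)]
print(find_regressions(history))  # prints: [('V2', 'V3', -7)]
```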
who it’s for

Built for people who actually run prompts in production.

Free · 3 credits/month
Solo dev shipping an AI feature
My prompt works in testing. Fails on 20% of real inputs. I don’t know why.
Structural eval finds the exact robustness gap. Fix it once. No more "it worked yesterday."
Basic · $9/month
Developer using AI every day at work
3 free evals don’t last a week. I want technical analysis and an improved prompt without paying $19.
30 credits/month to evaluate, iterate, and map conflicts. Playground with your own key, unlimited library, and full technical analysis.
Pro · $19/month
Dev shipping prompts to production
I changed the prompt, quality dropped, and nothing caught it before deploy.
A CI regression gate + GitHub Action fail the PR when the score drops. Versioned history with diffs. Slug serving swaps production without a redeploy.
Team · $49/month
Team governing prompts in production
Each prompt is a dependency. I need an approval process, not vibes.
Workspaces with roles (viewer/editor/admin). Production approval flow before a version goes live. Audit log of who changed what. 250 API evals/month.
daily challenge · free · no signup

Practice prompt engineering daily.
Track how you improve.

A new constrained challenge every day. Pick difficulty modifiers to multiply your score. Share your result. Compete on the leaderboard.

If people see what changed without getting the answer handed to them, it keeps the challenge intact but still teaches the pattern. That’s probably what makes it sticky instead of a one time try.

— LeaderAtLeading · Reddit

Common · No Caps
Uncommon · No Repetition
Rare · Alliteration
Epic · Whisper
Legendary · Zen
Mystery · ??? ×3.5
plans

Start free. Upgrade when you outgrow it.

no contracts · cancel anytime

FREE · $0 forever
3 web evals/month + API lint 10/mo (unlimited BYOK). Library up to 5 prompts with versioning.
BASIC · $9/month
30 credits/month · technical analysis · improved prompt · iterator · map · playground · API lint 30/mo · unlimited library.
PRO · $19/month
Unlimited web · slug serving (swap production, no redeploy) · CI regression gate + GitHub Action · full API · batch A/B · 35k chars.
TEAM · $49/month
Pro + workspaces & roles · production approval · audit log · 250 API evals/mo · export · 60k chars.
compare plans in detail →
payments processed securely via Stripe
faq

Frequently asked questions

Do I need an account to try PromptEval?
Evaluations require a free account — 3 per month, no credit card. The token counter and daily challenge run without any account.
How long does an evaluation take?
Typically 30–60 seconds depending on prompt length. You get a full dimensional breakdown, critical issues, strengths, and recommendations.
Can I use it in my CI/CD?
Yes. There is a REST API (POST /api/v1/eval) and an official GitHub Action that fails the pull request when the score drops, when there is a contradicting instruction, or when the prompt regresses against the production version. Lint mode works on any plan; full mode requires Pro/Team or BYOK.
Do I need a redeploy to change a production prompt?
No. On Pro+ you give a prompt a slug and serve the production version via GET /api/v1/prompts/{slug}. Change the production version in the library and it takes effect within ~60 seconds, with no redeploy.
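
One way an application might consume a served prompt is to re-read it at most once a minute instead of on every request. The sketch below is illustrative only. `CachedPrompt` is a hypothetical class, the `fetch` callable stands in for whatever HTTP client performs the GET, and the host, authentication and response format are not specified on this page, so they are left out.

```python
import time
from typing import Callable
from urllib.parse import quote


class CachedPrompt:
    """Keep the served prompt in memory and refresh it after ttl_seconds."""

    def __init__(self, slug: str, fetch: Callable[[str], str],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        # Path taken from the answer above; the slug is URL-escaped defensively.
        self._path = f"/api/v1/prompts/{quote(slug, safe='')}"
        self._fetch = fetch          # hypothetical: performs the GET and returns the prompt text
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: str | None = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch(self._path)
            self._fetched_at = now
        return self._value


# Usage with a stand-in fetcher, so the example runs without network access.
prompt = CachedPrompt("support-bot", fetch=lambda path: f"(prompt text from {path})")
print(prompt.get())
```

A 60-second TTL mirrors the propagation time quoted above. A shorter TTL would mean more requests without picking up changes any sooner.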
Does the API work on any plan? What about BYOK?
Yes — the evaluation API is open on every plan, with a monthly managed quota (free 10 · basic 30 · pro 75 · team 250). With BYOK (your Anthropic key in the X-Provider-Key header) inference runs on your key: it consumes no quota and unlocks full mode on any plan.
How is this different from asking Claude to review my prompt?
Claude gives conversational suggestions — subjective, not reproducible, no version tracking. PromptEval gives a numeric 0-100 score across 4 specific dimensions, versioned and comparable across iterations.
Is my prompt data private?
Yes. All prompts are stored with Row Level Security — only your account can access them. PromptEval does not use your prompts to train models.

Gate your next prompt before it breaks.

3 free evaluations · no credit card required


Start free →