PromptEval
pricing

From first score to
prompt in production.

Evaluate for free. Gate regressions in CI and serve production prompts when you ship.

personal use
Free
Free

See your first prompt score in 60 seconds. No credit card.

Best for: trying the method on one prompt
3 web evals/month: score + 4 dimensions
Eval API: lint 10/mo + unlimited BYOK
Issues, warnings and strengths
Library: up to 5 prompts, unlimited versions
web evaluation · prompts up to 8,000 characters
personal use
Basic
$9/month

Fix prompts before they hurt users in production. 30×/month.

Best for: people who work in prompts every day but don't deploy them

cancel anytime · no lock-in

30 credits/month: evals, iterator, map & optimizer
Eval API: lint 30/mo + unlimited BYOK
Technical analysis + prioritized recommendations
Improved prompt: AI rewrite with all issues fixed
Production iterator: surgical fix of what fails
Unlimited library + full version history
Playground: live testing with your own key (BYOK)
web evaluation · prompts up to 12,000 characters
✦ recommended
◆ in production
Pro
$19/month
≈ $0.63/day · cancel anytime

Ship with a safety net — slug serving + CI regression gate.

Best for: devs shipping prompts to production

cancel anytime · no lock-in

Everything in Basic: technical analysis, iterator, map, playground & library
Slug serving: prompt in production without redeploy
CI regression gate + GitHub Action
Full eval API + lint 75/mo
Unlimited web: evals, iterator and map
Batch A/B testing: LLM as judge
web evaluation · prompts up to 35,000 characters
◆ in production
Team
$49/month

Govern prompts as a team — roles, approval, audit.

Best for: teams governing prompts in production

cancel anytime · no lock-in

Everything in Pro: serving, CI gate, full API & batch A/B
Workspaces with roles: viewer, editor, admin
Production approval workflow
Audit log: who changed what and when
Eval API: lint 250/mo + unlimited BYOK
Library export as JSON and CSV
Priority support: response within 24h
web evaluation · prompts up to 60,000 characters

Output shape specification works because you're constraining the whole output distribution at once. Scripted edge-case responses are high-durability: pre-built templates survive ambiguity better than abstract rules.

— CodeMaitre · Reddit · came in skeptical, came out convinced

The character limit applies to the prompt submitted per evaluation. Prompts
above the plan limit won't be processed. Characters include spaces, line breaks and formatting.
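
If you want to check a prompt against the limits before submitting it, a local pre-check could look roughly like this. The limits are the ones listed on the plan cards above. How the service counts characters is not stated here, so the sketch assumes one character is one Unicode code point (Python's `len()`); the function names are illustrative only.

```python
from typing import Optional

# Per-plan limits as listed on this page (characters per evaluated prompt).
PLAN_CHAR_LIMITS = {"free": 8_000, "basic": 12_000, "pro": 35_000, "team": 60_000}


def fits_plan(prompt: str, plan: str) -> bool:
    """Return True if the prompt is within the plan's character limit.

    Assumption: one character == one Unicode code point (Python's len()).
    Spaces, line breaks and formatting characters all count.
    """
    return len(prompt) <= PLAN_CHAR_LIMITS[plan]


def smallest_plan_for(prompt: str) -> Optional[str]:
    """Return the first plan (smallest limit first) that fits, or None."""
    for plan, limit in PLAN_CHAR_LIMITS.items():
        if len(prompt) <= limit:
            return plan
    return None
```

For example, `smallest_plan_for(open("system_prompt.txt").read())` would tell you which tier a given prompt file needs, under the counting assumption above.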

Your prompts are processed and discarded — never used to train AI models.

What does PromptEval do beyond a chat (ChatGPT/Claude)?

A chat gives conversational, memoryless suggestions. Here you get a reproducible score (8 sub-criteria at temperature 0 against an anchored rubric — same prompt, same number) and, on top of it, what a chat doesn't have: versioning with diffs, a CI regression gate, and serving the production prompt by slug.

Is the score reliable? How is it calculated?

The score covers 8 sub-criteria across 4 dimensions (clarity, specificity, structure, robustness), each scored at temperature 0 against an explicit rubric: below 60 means serious gaps, above 85 means genuinely robust. It is then adjusted by up to ±8 points for technical factors such as instruction positioning (U-shaped attention weights the start and end of the context more heavily) and system/user separation. It is structured analysis, not an opinion.
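
To make the thresholds concrete, here is a minimal sketch of how a score might be adjusted and read. It assumes a 0 to 100 scale and a single signed adjustment, neither of which is spelled out on this page. It does not reproduce how PromptEval combines the 8 sub-criteria; only the ±8 cap and the 60/85 thresholds come from the text above.

```python
MAX_ADJUSTMENT = 8  # the page states adjustments of up to ±8


def adjusted_score(rubric_score: float, adjustment: float) -> float:
    """Apply a technical-factor adjustment, capped at ±8.

    Assumption: scores live on a 0-100 scale, so the result is clamped to it.
    """
    capped = max(-MAX_ADJUSTMENT, min(MAX_ADJUSTMENT, adjustment))
    return max(0.0, min(100.0, rubric_score + capped))


def band(score: float) -> str:
    """Map a score to the two bands named on this page."""
    if score < 60:
        return "serious gaps"
    if score > 85:
        return "genuinely robust"
    return "in between"  # the page does not name the 60-85 range
```

Under these assumptions, a rubric score of 62 with a -8 adjustment lands at 54, which reads as "serious gaps".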

Why not just run the prompt to test it?

Static analysis and runtime testing are complementary. Static catches what breaks before you run it — an instruction buried in the middle of the context, an unhandled edge case, a contradiction — failures that show up in the output regardless of input. For the behavioral side there's the Playground and Batch A/B. Use static as a cheap CI gate; runtime when you need it.

Does it work with any model (GPT, Gemini, etc.)?

Yes. The evaluation is model-agnostic — it analyzes the prompt as a technical instruction regardless of the target model. The Playground and BYOK accept Anthropic and OpenAI keys.

Can I use it in my CI/CD?

Yes, on Pro and Team. You get a REST API (POST /api/v1/eval) plus an official GitHub Action that fails the PR if the score drops, the instructions contradict each other, or the prompt regresses against the production version. Lint mode is open on every plan; full evaluation needs Pro/Team or BYOK.
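
If you call the API from your own CI script instead of the GitHub Action, the gate logic might be sketched as follows. Only the `POST /api/v1/eval` path comes from this page. The host, the `prompt` request field, the Bearer header, and the `score` and `contradictions` response fields are all hypothetical placeholders; check the API reference for the real shapes.

```python
import json
import urllib.request
from typing import List

BASE_URL = "https://prompteval.example"  # hypothetical host, not from this page


def build_eval_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/v1/eval.

    Assumptions: JSON body with a "prompt" field and Bearer authentication.
    """
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/eval",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def gate_failures(result: dict, baseline_score: float,
                  production_score: float) -> List[str]:
    """Return the reasons a PR should fail; an empty list means it passes.

    Assumption: `result` has a numeric "score" and a "contradictions" list.
    """
    reasons = []
    if result["score"] < baseline_score:
        reasons.append("score dropped below the baseline")
    if result.get("contradictions"):
        reasons.append("instructions contradict each other")
    if result["score"] < production_score:
        reasons.append("regression against the production prompt")
    return reasons
```

A CI step would send the request with `urllib.request.urlopen`, pass the parsed JSON to `gate_failures`, and exit non-zero when the list is not empty.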

Do I need a redeploy to change a production prompt?

No (Pro+). Give the prompt a slug and serve the production version via GET /api/v1/prompts/{slug}. Change it in the library and it takes effect in ~60s, no deploy.
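
On the application side, one way to consume a served prompt is a small in-process cache so you don't hit the endpoint on every request. This is a sketch under assumptions: the host is a placeholder, the response is treated as plain prompt text, and the 60-second TTL is a client-side choice that borrows the ~60s figure above, not a documented requirement. The transport is injected as `fetch`, so any HTTP client can be used.

```python
import time
from typing import Callable, Dict, Tuple
from urllib.parse import quote

BASE_URL = "https://prompteval.example"  # hypothetical host, not from this page


def prompt_url(slug: str) -> str:
    """Build the GET /api/v1/prompts/{slug} URL, escaping the slug."""
    return f"{BASE_URL}/api/v1/prompts/{quote(slug, safe='')}"


class PromptCache:
    """Cache served prompts per slug for a fixed time-to-live."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # performs the HTTP GET, returns prompt text
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, slug: str) -> str:
        now = self._clock()
        cached = self._entries.get(slug)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]         # still fresh: no network call
        text = self._fetch(prompt_url(slug))
        self._entries[slug] = (now, text)
        return text
```

Note that a client-side TTL adds to the service's own propagation time, so with these defaults an edit could take up to roughly two minutes to reach every process.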

What's the difference between web credits and API quota?

They are separate meters. Web credits cover the evaluator, iterator and playground on the site (Free 3, Basic 30; Pro and Team unlimited). The API quota applies only to HTTP calls (lint per month: Free 10, Basic 30, Pro 75, Team 250). BYOK is unlimited on both because it runs on your key.
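
The two meters can be pictured as two independent counters. The sketch below uses the monthly allowances listed on this page and assumes each web run or lint call costs exactly one unit, which the page does not confirm. The class and method names are illustrative, not part of the product.

```python
from dataclasses import dataclass

# Monthly allowances as listed on this page; None means unlimited.
WEB_CREDITS = {"free": 3, "basic": 30, "pro": None, "team": None}
API_LINT_CALLS = {"free": 10, "basic": 30, "pro": 75, "team": 250}


@dataclass
class Usage:
    plan: str
    web_used: int = 0   # web meter: evaluator, iterator, playground
    lint_used: int = 0  # API meter: HTTP lint calls

    def can_run_web(self, byok: bool = False) -> bool:
        """Web meter; BYOK runs on the user's own key and is not metered."""
        limit = WEB_CREDITS[self.plan]
        return byok or limit is None or self.web_used < limit

    def can_call_api(self, byok: bool = False) -> bool:
        """API lint meter, tracked separately from web credits."""
        return byok or self.lint_used < API_LINT_CALLS[self.plan]
```

Exhausting one meter does not touch the other: a Free account with all 3 web credits spent can still make API lint calls until it reaches 10.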

How is it different from promptfoo, LangSmith or PromptLayer?

Those are strong at runtime call tracing and observability. PromptEval is static analysis + a registry: it catches what breaks before you run it (structure, conflict, regression) and serves/versions the production prompt. They're complementary — see the comparisons at /en/compare.

How does the Free plan work?

3 web evaluations per month, auto-renewed, no credit card. Includes score, 4 dimensions, issues and warnings, library up to 5 prompts, and 10 API lint calls/month (unlimited BYOK).

Are my prompts private?

Yes. They're sent to Claude for evaluation and discarded after processing — never used to train models. Everything is stored with Row Level Security: only your account can access it.

secure payments via Stripe · prices in USD · cancel anytime · questions? get in touch