From first score to prompt in production.
Evaluate for free. Gate regressions in CI and serve production prompts when you ship.
See your first prompt score in 60 seconds. No credit card.
Fix prompts before they hurt users in production. 30×/month.
cancel anytime · no lock-in
Ship with a safety net — slug serving + CI regression gate.
Govern prompts as a team — roles, approval, audit.
Output shape specification works because you're constraining the whole output distribution at once. Scripted edge-case responses are high-durability: pre-built templates survive ambiguity better than abstract rules.
— CodeMaitre · Reddit · came in skeptical, came out convinced
The character limit applies to the prompt submitted per evaluation. Prompts above the plan limit won't be processed. Characters include spaces, line breaks and formatting.
Your prompts are processed and discarded — never used to train AI models.
A chat gives conversational, memoryless suggestions. Here you get a reproducible score (8 sub-criteria at temperature 0 against an anchored rubric — same prompt, same number) and, on top of it, what a chat doesn't have: versioning with diffs, a CI regression gate, and serving the production prompt by slug.
The score covers 8 sub-criteria across 4 dimensions (clarity, specificity, structure, robustness), each scored at temperature 0 against an explicit rubric: below 60 = serious gaps, above 85 = genuinely robust. The score is then adjusted within ±8 points for technical factors such as instruction positioning (U-shaped attention weights the start and end more) and system/user separation. Structured analysis, not an opinion.
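As a rough illustration of how those bands and the ±8 adjustment could be read, here is a minimal Python sketch. The function names, the label for the 60 to 85 range, and the clamping to 0-100 are assumptions made for the example; how PromptEval combines the 8 sub-criteria internally is not described here.

```python
def interpret_score(score: int) -> str:
    """Map a prompt score to the bands named in the rubric."""
    if score < 60:
        return "serious gaps"
    if score > 85:
        return "genuinely robust"
    # Assumption: the rubric text does not name the 60-85 band.
    return "in between"


def apply_adjustment(base: int, adjustment: int) -> int:
    """Apply a technical-factor adjustment, bounded to +/-8 points."""
    bounded = max(-8, min(8, adjustment))
    # Assumption: the final score stays within 0-100.
    return max(0, min(100, base + bounded))


if __name__ == "__main__":
    final = apply_adjustment(base=80, adjustment=5)
    print(final, interpret_score(final))
```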
Static analysis and runtime testing are complementary. Static catches what breaks before you run it — an instruction buried in the middle of the context, an unhandled edge case, a contradiction — failures that show up in the output regardless of input. For the behavioral side there's the Playground and Batch A/B. Use static as a cheap CI gate; runtime when you need it.
Yes. The evaluation is model-agnostic — it analyzes the prompt as a technical instruction regardless of the target model. The Playground and BYOK accept Anthropic and OpenAI keys.
Yes, on Pro+. There is a REST API (POST /api/v1/eval) plus an official GitHub Action that fails the PR if the score drops, instructions contradict each other, or the prompt regresses against production. Lint mode is open on every plan; full mode requires Pro/Team or BYOK.
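If you would rather script the gate yourself than use the GitHub Action, the logic might look roughly like the sketch below. Only the POST /api/v1/eval path comes from this page. The base URL, the bearer-token header, the JSON field names (`prompt`, `mode`) and the default threshold are hypothetical placeholders, so check the API reference for the real request and response shapes.

```python
import json
import urllib.request
from typing import List


def build_eval_request(base_url: str, api_key: str, prompt: str,
                       mode: str = "lint") -> urllib.request.Request:
    """Build (but do not send) a POST request to the eval endpoint."""
    # Assumption: the body fields and the auth scheme are illustrative only.
    body = json.dumps({"prompt": prompt, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/eval",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def gate(score: int, production_score: int, has_contradictions: bool,
         min_score: int = 60) -> List[str]:
    """Return the reasons a PR should fail; an empty list means pass."""
    reasons = []
    if score < min_score:
        reasons.append(f"score {score} is below the minimum {min_score}")
    if has_contradictions:
        reasons.append("instructions contradict each other")
    if score < production_score:
        reasons.append(f"regression vs production ({production_score} -> {score})")
    return reasons


if __name__ == "__main__":
    failures = gate(score=72, production_score=80, has_contradictions=False)
    # A non-zero exit code is what makes a CI job fail.
    raise SystemExit(1 if failures else 0)
```

Keeping the request builder separate from the pass/fail decision lets you unit-test the gate without network access.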
No (Pro+). Give the prompt a slug and serve the production version via GET /api/v1/prompts/{slug}. Change it in the library and it takes effect in ~60s, no deploy.
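On the application side, reading the production prompt by slug could be sketched as below. The GET /api/v1/prompts/{slug} path is from this page; the base URL, the `PromptCache` class and the injected `fetch` function are hypothetical, and the response format is not specified here, so the sketch treats the result as opaque text. The 60-second client-side cache is a design choice for the example, not a documented requirement, and it adds to the ~60s propagation delay.

```python
import time
import urllib.parse
from typing import Callable, Dict, Tuple


def prompt_url(base_url: str, slug: str) -> str:
    """Build the URL for a prompt slug, escaping unsafe characters."""
    safe_slug = urllib.parse.quote(slug, safe="")
    return f"{base_url.rstrip('/')}/api/v1/prompts/{safe_slug}"


class PromptCache:
    """Cache fetched prompts for a short time to avoid a call per request."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical: performs the GET for a slug
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, slug: str) -> str:
        now = self._clock()
        cached = self._entries.get(slug)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Entry is missing or older than the TTL: fetch again and store it.
        text = self._fetch(slug)
        self._entries[slug] = (now, text)
        return text
```

In real use, `fetch` would issue the HTTP GET against `prompt_url(...)` with whatever authentication the API requires.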
They are separate meters. Web credits cover the evaluator, iterator and playground on the site (Free 3, Basic 30; Pro and Team unlimited). The API quota applies only to HTTP calls (lint: 10/30/75/250 per month for Free/Basic/Pro/Team). BYOK is unlimited on both, since it runs on your own key.
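To make the two meters concrete, here is a small Python sketch of the limits as data. It assumes the four lint figures map to Free, Basic, Pro and Team in that order (the Free value of 10 is confirmed elsewhere on this page). The `remaining` helper and its names are hypothetical.

```python
from typing import Dict, Optional

# None means unlimited.
WEB_CREDITS: Dict[str, Optional[int]] = {"free": 3, "basic": 30, "pro": None, "team": None}
# Assumption: 10/30/75/250 correspond to Free/Basic/Pro/Team in order.
API_LINT_QUOTA: Dict[str, Optional[int]] = {"free": 10, "basic": 30, "pro": 75, "team": 250}


def remaining(plan: str, meter: str, used: int, byok: bool = False) -> Optional[int]:
    """Return the calls left this month, or None when the meter is unlimited."""
    if meter not in ("web", "api"):
        raise ValueError("meter must be 'web' or 'api'")
    if byok:
        # BYOK runs on the user's own key, so neither meter applies.
        return None
    table = WEB_CREDITS if meter == "web" else API_LINT_QUOTA
    limit = table[plan]
    if limit is None:
        return None
    return max(0, limit - used)
```

Because the meters are independent, using up web credits does not reduce the API quota, and vice versa.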
Those are strong at runtime call tracing and observability. PromptEval is static analysis + a registry: it catches what breaks before you run it (structure, conflict, regression) and serves/versions the production prompt. They're complementary — see the comparisons at /en/compare.
3 web evaluations per month, auto-renewed, no credit card. Includes score, 4 dimensions, issues and warnings, library up to 5 prompts, and 10 API lint calls/month (unlimited BYOK).
Yes. They're sent to Claude for evaluation and discarded after processing — never used to train models. Everything is stored with Row Level Security: only your account can access it.