2026-05-10·9 min read

How to Optimize Prompt Tokens (Cut Costs Without Breaking Your Prompts)

Seven techniques to reduce prompt token count without degrading output quality — with before/after examples and a free token optimizer tool.

Quick Answer

Token optimization is the practice of reducing prompt length to cut API costs without degrading output quality. The core principle: every word in a prompt has a cost, but not every word contributes value. Identify which tokens earn their price — and cut the rest.

A production system prompt that runs 500 times per day at 300 tokens each costs roughly 150,000 tokens daily. Cut that to 180 tokens — entirely achievable — and you've reduced that line item by 40%. Across a year, that's millions of tokens.

The problem is that most optimization advice is vague. "Make your prompts shorter" isn't a technique. What follows are seven specific techniques, ranked by effort and savings, with before/after examples showing actual token counts.

Why token count matters more than prompt length (the cost math)

You're billed per token, not per character. A "token" is roughly 4 characters or 0.75 words in English — but that varies by model and content type. Code, JSON, and technical terminology tokenize differently than plain prose. Claude and GPT-4o use different tokenizers, so a 300-token prompt on one isn't 300 on the other.
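
To see where a given prompt actually lands, count tokens with a real tokenizer rather than estimating from word count. Below is a minimal sketch using OpenAI's tiktoken library; Claude tokenizes differently, so treat the numbers as estimates rather than billing-exact figures (Anthropic's SDK exposes its own token-counting endpoint if you need Claude-accurate counts).

```python
# Rough token counting with tiktoken (OpenAI's tokenizer library).
# Counts approximate GPT-4o; Claude uses a different tokenizer,
# so treat these as estimates, not billing-exact figures.
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a general-purpose encoding for unknown model names.
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

prompt = "You are a senior software engineer. Write clean, maintainable code."
print(count_tokens(prompt))  # token count, not character or word count
```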

The cost structure compounds fast:

  • Input tokens — charged every time you send the prompt
  • Output tokens — often 2-3x more expensive per token than input
  • Context window usage — larger prompts shrink the space for conversation history, forcing earlier truncation

For most production applications, input token cost is the main lever. Output token cost is controlled by specifying output format and length constraints explicitly — which is technique #5 below.
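
To make the cost math concrete, here is a back-of-the-envelope model of the example from the quick answer. The per-token price is an illustrative placeholder, not the current rate for any specific model; substitute your own pricing and volume.

```python
# Back-of-the-envelope input-token cost for a prompt that runs all day.
# The price is an illustrative placeholder; substitute your model's real rate.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000   # e.g. $3 per million input tokens

def daily_input_cost(prompt_tokens: int, calls_per_day: int) -> float:
    return prompt_tokens * calls_per_day * PRICE_PER_INPUT_TOKEN

before = daily_input_cost(prompt_tokens=300, calls_per_day=500)  # the 150k-token/day example
after = daily_input_cost(prompt_tokens=180, calls_per_day=500)

print(f"before: ${before:.4f}/day, after: ${after:.4f}/day, saved: {1 - after / before:.0%}")

tokens_saved_per_year = (300 - 180) * 500 * 365
print(f"tokens saved per year: {tokens_saved_per_year:,}")  # 21,900,000
```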

The 7 token optimization techniques — ranked by effort vs. savings

| Technique | Tokens saved | Quality risk | Effort | When to use |
| --- | --- | --- | --- | --- |
| Trim verbose role definitions | High (20-30 tokens) | Near-zero | Low | Always — do this first |
| Remove filler and hedging | High (15-30 tokens) | Zero | Low | Always — pure noise removal |
| Compress system prompt instructions | High (30-80 tokens) | Low | Medium | Long system prompts with repeated rules |
| Reduce few-shot examples | Medium (40-140 tokens) | Medium | Medium | When examples are lengthy or redundant |
| Specify output format explicitly | Medium (output tokens) | Zero | Low | When outputs are unnecessarily verbose |
| Use structured shorthand | Medium (10-25 tokens) | Low | Low | Instructions written in paragraph form |
| Implement prompt caching | ~90% on repeated calls | Zero | High | High-volume apps with repeated system prompts |

1. Trim verbose role definitions

Role definitions are the most common source of token bloat. They tend to grow through iteration — someone adds "experienced," then "senior," then a list of technologies, then a philosophy statement. None of it changes model behavior in meaningful ways for straightforward tasks.

Cut everything that isn't load-bearing. The role needs to tell the model what mode to be in — not impress anyone.

2. Remove filler and hedging language

"Please," "carefully," "comprehensive," "thorough," "make sure to" — these words cost tokens and contribute nothing. The model doesn't respond to politeness markers. It doesn't try harder because you said "carefully." These are habits from human writing that have no effect on LLM output quality.

3. Compress system prompt instructions

Long system prompts often repeat the same constraint three different ways. "Be concise. Keep responses brief. Don't over-explain." Pick one. Consolidate repeated rules into a single, direct statement. Use bullets instead of paragraphs for lists of instructions — bullets parse faster and use fewer tokens.

4. Reduce few-shot examples

Few-shot examples are expensive — 60 tokens per example adds up fast. Evaluate whether you actually need all of them. Two well-chosen examples often outperform five mediocre ones. The goal is coverage of the most important patterns, not volume.
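
One cheap way to structure the examples you keep is as short user/assistant message pairs rather than long prose blocks inside the system prompt. Below is a minimal sketch of that shape; the classification task and labels are made up for illustration.

```python
# Two concise few-shot examples encoded as user/assistant message pairs.
# The ticket-classification task and labels are illustrative, not from a real system:
# one typical case, one edge case, no redundant third example.
few_shot_messages = [
    {"role": "user", "content": "Ticket: 'App crashes when I upload a photo.'"},
    {"role": "assistant", "content": "category: bug"},
    {"role": "user", "content": "Ticket: 'Can you add dark mode? Also it crashed once.'"},
    {"role": "assistant", "content": "category: feature_request"},  # edge case: mixed signal
]

# Prepend the examples to the real request when building the API call.
messages = few_shot_messages + [
    {"role": "user", "content": "Ticket: 'I was charged twice this month.'"}
]
```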

5. Specify output format explicitly

Output tokens cost more than input tokens on most APIs. If you don't constrain the output, the model will generate as much as it thinks is helpful — often 2-3x more than you need. Tell it exactly what to return: "Return only the JSON object. No explanation. No preamble."
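
In practice that means two levers: an explicit format instruction in the prompt, and a hard cap via the API's max_tokens parameter. A minimal sketch using the Anthropic SDK (the model name is a placeholder; use whichever model you run):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

document_text = "..."  # the text you want structured output from

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; substitute the model you use
    max_tokens=300,  # hard ceiling on output tokens, independent of the prompt
    system=(
        "Extract the requested fields from the user's text. "
        "Return only the JSON object. No explanation. No preamble."
    ),
    messages=[{"role": "user", "content": document_text}],
)

print(response.content[0].text)
```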

6. Use structured shorthand

Instructions written as prose take more tokens than the same information in bullets or structured shorthand. "You should respond by first acknowledging the user's question, then providing the answer, and then offering a follow-up suggestion" → three bullets. Same meaning, fewer tokens.

7. Implement prompt caching for repeated prefixes

Both Anthropic and OpenAI offer prompt caching — a feature that stores repeated prompt prefixes server-side. The first call pays full price (Anthropic charges a small premium to write the cache); subsequent calls that reuse the same prefix read the cached tokens at a steep discount, roughly 90% on Anthropic and smaller on OpenAI. For applications that send the same system prompt thousands of times per day, this is by far the highest-leverage optimization. On Anthropic it requires a small code change (a cache_control annotation on the cached block); on OpenAI, prefix caching is applied automatically on supported models. Either way, the economics are hard to ignore.
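
On Anthropic's API, caching is opt-in: you mark the stable prefix, typically the system prompt, with cache_control and keep it byte-identical across calls. A minimal sketch (model name and prompt are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # your stable, reused system prompt goes here

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; substitute the model you use
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks this block as a cacheable prefix. The first call writes the
            # cache (at a small premium); later calls that reuse the identical
            # prefix read from it at a steep discount.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "First user request"}],
)
```

The prefix has to match exactly; interpolating anything dynamic, such as a timestamp, into the cached block defeats the cache.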

Before/after examples: 3 real prompt patterns rewritten

These are patterns from the first 1,000 prompts evaluated through PromptEval. The token counts are measured using the Claude tokenizer.

Example 1 — Verbose role definition

Before — 38 tokens

"You are an experienced senior software engineer with 15 years of experience in Python, JavaScript, and cloud infrastructure who specializes in writing clean, maintainable code and enjoys helping junior developers learn best practices."

After — 11 tokens

"You are a senior software engineer. Write clean, maintainable code."

71% token reduction. For most tasks — code review, debugging, documentation — the compressed version produces identical quality output. The 27 extra tokens were describing personality traits and specializations the model doesn't need for the task.

Example 2 — Redundant instruction padding

Before — 40 tokens

"Please carefully read the following text and then provide a comprehensive and thorough summary that captures all the main points and key ideas present in the text, making sure not to miss any important details."

After — 10 tokens

"Summarize the following text. Cover all main points."

75% token reduction, zero quality loss. "Please," "carefully," "comprehensive and thorough," "making sure not to miss" — all padding. The model doesn't produce a worse summary because you left those out. It produces the same summary for a quarter of the price.

Example 3 — Few-shot bloat

Before — 180 tokens (examples section)

Three lengthy examples, averaging 60 tokens each. All three cover the same core pattern, with only minor variation between them.

After — 40 tokens (examples section)

2 concise examples, averaging 20 tokens each. One positive example, one edge case. No redundancy.

78% token reduction. Slight quality risk on rare edge cases — the third example covered a scenario the other two didn't. Worth measuring before shipping.

The trade-off most guides ignore: when compression hurts quality

Every optimization guide tells you to make prompts shorter. Almost none of them tell you when not to.

Four situations where cutting tokens creates real quality problems:

Removing context the model was using silently. Some context that looks redundant is actually load-bearing. A role description that mentions "you work in a regulated financial services environment" seems like flavor — but it's changing how the model handles ambiguous requests. Cut it and behavior shifts in ways you don't notice until production.

Dropping few-shot examples below two. One example is often worse than zero, because it can bias the model toward that specific pattern instead of generalizing. Two examples give the model enough to understand the pattern without over-fitting. Going from three to two is usually safe. Going from two to one isn't.

Aggressive system prompt compression in long conversations. In multi-turn conversations, the system prompt competes with a growing conversation history for the model's attention. Instructions that are already stated tersely are the first to get ignored as the history piles up, so a heavily compressed system prompt is more prone to instruction forgetting than one that spells out its constraints explicitly. Know how long your conversations run before compressing.

Role compression on complex multi-step reasoning tasks. "Senior software engineer" works for code review. For a prompt that orchestrates a multi-step agent workflow, the detailed role definition is the scaffolding the model uses to decide what to do next. Compressing it can break the reasoning chain.

The rule: cut freely from filler and hedging, carefully from examples, and never blindly from instructions that constrain behavior.

How to check if optimization worked

Don't ship a compressed prompt based on reading alone. Here's the verification process that matters:

Step 1 — Score the prompt before and after optimization. Use PromptEval's token optimizer to flag what to cut, and the eval to score the prompt on the four structural dimensions — clarity, specificity, structure, robustness. If clarity or structure drops after compression, you cut something load-bearing. Find what it was before moving on.

Step 2 — Test on 10 representative inputs. Include at least 2-3 edge cases: the weird inputs, the boundary cases, the malformed requests. These are where compressed prompts break first. If the optimized version handles all 10 cleanly, you're in good shape.

Step 3 — A/B test the original against the compressed version in a controlled sample before full deployment. Run both in parallel for a defined period and compare outputs systematically. See our guide to A/B testing prompts for the full method.

If you skip Step 2 and go straight to production, you'll find the failures there — which is the worst time to find them. Ten test inputs before deployment is cheap insurance.
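
Here is a minimal sketch of the Step 2 check: run the original and compressed prompts over the same test inputs and flag any outputs that differ for human review. The prompts, test inputs, and model name below are placeholders; wrap whatever client you already use.

```python
import anthropic

client = anthropic.Anthropic()

def call_model(system_prompt: str, user_input: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; substitute the model you use
        max_tokens=500,
        system=system_prompt,
        messages=[{"role": "user", "content": user_input}],
    )
    return response.content[0].text

ORIGINAL_PROMPT = "..."    # the prompt running in production today
COMPRESSED_PROMPT = "..."  # the optimized version you want to ship

test_inputs = [
    "a typical, well-formed request",
    "an edge case: a one-word input",
    "a malformed or ambiguous request",
    # ... extend to at least 10 inputs, including 2-3 edge cases
]

for user_input in test_inputs:
    before = call_model(ORIGINAL_PROMPT, user_input)
    after = call_model(COMPRESSED_PROMPT, user_input)
    # Exact-match comparison is only a first pass; outputs that differ
    # need a human read, not an automatic fail.
    status = "same" if before.strip() == after.strip() else "DIFFERS"
    print(f"[{status}] {user_input[:60]!r}")
```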

The goal isn't the smallest possible prompt. It's the smallest prompt that still produces the output quality you need. Those are different targets, and conflating them is how optimization causes regressions.

Want to see exactly what to cut in your current prompt? Paste it into PromptEval — the token optimizer flags the specific sections to trim, and the eval score tells you if quality held after changes. Free with 3 credits.

Frequently asked questions

How do I reduce token usage in my prompts?
Start with the two zero-risk techniques: trim verbose role definitions and remove filler language ("please," "carefully," "comprehensive"). Together they cut 40-70% of tokens in most prompts with no quality impact. Then score what's left using a prompt evaluator to find any remaining bloat.

Does reducing prompt tokens hurt output quality?
It depends on what you cut. Filler words and verbose role descriptions: near-zero risk. Few-shot examples below two: measurable quality risk on edge cases. Aggressively compressing system prompts: real risk of instruction forgetting in long conversations. The technique table above lists the quality risk for each approach.

What is prompt caching and how much does it save?
Prompt caching stores repeated prompt prefixes server-side. Anthropic and OpenAI both offer it; Anthropic discounts cached tokens by roughly 90%, while OpenAI's discount is smaller. For a system prompt that runs 10,000 times per day, caching can cut that cost by 80%+. On Anthropic it requires a small code change but no prompt rewriting; OpenAI applies prefix caching automatically on supported models.

How many tokens does a typical system prompt use?
In our first 1,000 PromptEval evaluations, system prompts ranged from 50 to 500 tokens. The median was around 180. Most had at least 40-60 tokens of pure filler — hedging language, redundant instructions, and verbose role descriptions that could be cut without touching behavior.

How do I check if prompt optimization worked?
Score the prompt before and after with a structured evaluation, test on 10 representative inputs including edge cases, and A/B test in production before full rollout. If the structured score drops, find what you cut that was load-bearing before deploying.

Apply what you just learned — evaluate your prompt free.

Try PromptEval →