How to Structure AI Prompts: 4 Techniques That Change Model Behavior

The 4 prompt structure techniques, system/user split, delimiters, chain of thought, few-shot, with concrete before/after examples and a decision guide for each.

Quick Answer

Prompt structure is how you organize the information inside a prompt. Where the role definition goes, how instructions are separated from input, whether reasoning steps are explicit. The 4 techniques that change model behavior: (1) system/user split for permanent vs. variable instructions, (2) delimiters to separate content blocks, (3) chain of thought to make reasoning explicit, (4) few-shot examples to define output patterns.

Most prompt problems come from wording. Most prompt failures in production come from structure. The two are different problems: wording is what you ask, structure is how the model knows where one piece of instruction ends and another begins.

A prompt scored 88 on structure on PromptEval's leaderboard, the current top-ranked prompt, uses all four structural techniques covered in this guide. Most first-draft prompts use none of them. That gap accounts for the 30–50 point difference in structure scores between early-stage and production-ready prompts.

Why structure beats wording

When a prompt fails, the instinct is to rephrase, find clearer words, adjust the tone, add more context. That fixes clarity problems. It doesn't fix structure problems, because structure failures happen at the parsing level, not the language level.

Three structural failures cause the most production issues:

Role bleed. The model doesn't know which part of your prompt is permanent behavior and which is the current input. It treats both as context and rebalances them with each request.
Boundary confusion. Without separators, the model reads instructions and input data as a single block. A system prompt that says "summarize the following document" followed immediately by the document, with no separator, creates ambiguity about where instructions end and content begins.
Absent reasoning chain. For multi-step tasks, the model jumps to an answer without working through intermediate steps. The answer is wrong not because the model can't reason, but because the prompt didn't ask it to.

Structure techniques fix all three. They're not stylistic choices, they're instructions about how to process the prompt.

The 4 prompt structure techniques, when to use each

Technique	What it solves	Use when	Token cost
System/user split	Role bleed, instruction drift across requests	Any prompt that runs more than once with different inputs	None (reorganizes existing tokens)
Delimiters	Boundary confusion between content blocks	Prompt contains 2+ distinct content types (instructions + data, or multiple examples)	Minimal (3–6 tokens per delimiter pair)
Chain of thought	Skipped reasoning on multi-step tasks	Task involves multiple dependent conclusions (math, decisions, logical chains)	Medium (reasoning steps add output tokens)
Few-shot examples	Format ambiguity for complex or non-standard outputs	Output format is hard to describe in words alone, easier to show	High (each example = full input + output in tokens)

Technique 1: System/user split

The system prompt is where you put everything that stays constant across every request: the role definition, output format, behavior rules, and constraints. The user message is where you put the variable input for each specific request.

Prompt structure is defined by where you put information, not just what information you include. Putting permanent instructions in the user message means the model re-evaluates them as context every request. And weights them against whatever else is in the message. Over time, this produces drift: the same prompt produces slightly different behavior depending on what the user input adds to the context.

Before, everything in user message

User: "You are a customer support agent. Reply in under 100 words. Be empathetic but direct. Never promise refunds without approval. Here is the customer message: [MESSAGE]"

After, system/user split

System: "You are a customer support agent. Reply in under 100 words. Be empathetic but direct. Never promise refunds without manager approval."

User: "[MESSAGE]"

The instruction content is identical. The structural difference: the system prompt version processes the role and constraints once, as permanent context. The user message is clean input. This prevents the customer's message from inadvertently overriding your instructions through prompt injection or context dilution.

Technique 2: Delimiters

A delimiter is a character sequence or tag that marks the boundary between different types of content in a prompt. Common choices: triple backtick fences, XML-style tags (<document>...</document>), or section markers (###).

Delimiters are necessary whenever the prompt contains two or more distinct content types. Instructions plus input data, multiple examples, or embedded reference material. Without them, the model reads the entire prompt as one continuous context block and can misidentify where instructions end and content begins.

Before, no delimiters

Summarize the following report in 3 bullet points. Focus on risks only. Report: Our Q1 revenue grew 24% year-over-year. Customer churn increased from 3.2% to 5.1%. Three enterprise contracts were delayed due to procurement issues. Gross margin declined from 68% to 61%.

After, with delimiters

Summarize the following report in 3 bullet points. Focus on risks only.

<report>
Our Q1 revenue grew 24% year-over-year. Customer churn increased from 3.2% to 5.1%. Three enterprise contracts were delayed due to procurement issues. Gross margin declined from 68% to 61%.
</report>

The delimiter version makes it unambiguous that everything inside <report> tags is input data, not instruction. This matters when the input contains phrases that look like instructions, "focus on costs" or "ignore previous context", which could otherwise override your actual instructions.

Technique 3: Chain of thought

Chain of thought (CoT) prompting asks the model to show its reasoning steps before stating the conclusion. The simplest form is appending "Think through this step by step" to your prompt. More structured versions define the reasoning steps explicitly.

CoT works because language models generate tokens sequentially, each token is influenced by the tokens before it. When you ask for a conclusion directly, the model generates an answer based on its highest-probability prediction, which may skip steps that would expose a reasoning error. When you ask it to reason step by step, each step becomes input that constrains the next, producing more reliable conclusions for multi-step problems.

Before, direct answer prompt

"A product costs $80 and we want 35% gross margin. What should the selling price be?"

After, chain of thought

"A product costs $80 and we want 35% gross margin. Calculate the selling price. Reason step by step: first state the gross margin formula, then apply the values, then state the result."

For simple, factual tasks, "translate this sentence" or "what is the capital of France", CoT adds tokens without improving accuracy. Save it for tasks with dependent reasoning steps: financial calculations, multi-criteria comparisons, diagnostic logic, or any task where the correct answer requires intermediate conclusions.

Technique 4: Few-shot examples

Few-shot prompting includes one or more input/output example pairs before the actual input. The model infers the pattern from the examples and applies it to the new input, without any additional instruction about what the pattern is.

Few-shot examples are most valuable when the output format is unusual, domain-specific, or hard to describe in natural language. If you can explain the format clearly in words, do that, it costs fewer tokens. If the format requires showing, use examples.

Few-shot structure, incident classification example

Classify the severity of the following customer support ticket.

Example 1:
Ticket: "My invoice shows the wrong amount and I have a payment due tomorrow."
Severity: HIGH, financial impact, time-sensitive

Example 2:
Ticket: "The color of the button in the app doesn't match your website."
Severity: LOW, cosmetic, no functional impact

Now classify:
Ticket: "[INPUT]"
Severity:

Two rules for few-shot examples: diversity beats quantity (two examples covering different cases work better than five similar ones), and always end with the same format you want the model to complete (the final "Severity:" with no answer forces the model to fill in the pattern).

The 3 most common prompt structure mistakes

Putting everything in one block without separators. A 500-word prompt with no delimiters is a single dense context block. The model has no way to distinguish which parts are instructions, which are input, and which are examples. Add structural markers before adding more words.

Using CoT on simple tasks. "What color is the sky? Think step by step" wastes output tokens and adds latency without improving the answer. CoT only pays off when the reasoning chain affects the conclusion. For lookup tasks, apply it.

Few-shot examples that all look the same. If your 4 examples are variations of the same case, the model learns a narrow pattern. One edge case example (an input that's slightly out of the expected range) is often worth more than three typical cases.

Structure dimension scores on PromptEval typically jump from 35–45 on unstructured prompts to 70–85 on prompts that use system/user split plus delimiters, before any other change. That's the baseline return on structural investment. For a full picture of how structure interacts with clarity, specificity, and production resilience, the prompt quality evaluation guide covers all four dimensions together.

And if you've just restructured a prompt using these techniques and want to see where each dimension lands, PromptEval gives you a 0–100 score on structure, specificity, clarity, and robustness in under 10 seconds, free with 3 credits, no setup required.

For the clarity side of prompt writing, the dimension most people confuse with structure, this guide on writing clear AI prompts covers the distinct set of techniques that reduce ambiguity without adding structure overhead.

You just restructured your prompt, now see the score

You just learned how to write a better prompt. See exactly what score it gets, PromptEval evaluates it free with 3 credits. Structure, specificity, clarity, and robustness, scored separately with callouts for each weak point. Or practice structural thinking with today's Daily Challenge.

Frequently Asked Questions

What does "prompt structure" mean in AI?

Prompt structure is how information is organized within a prompt. Where the role definition goes, how instructions are separated from input data, whether examples are included, and whether the model is asked to reason step by step. Structure determines how reliably a model interprets your intent, independent of the specific words used. A well-structured prompt can outperform a well-worded but unstructured one because it reduces the model's parsing ambiguity.

What is the difference between a system prompt and a user prompt?

A system prompt contains permanent instructions. Role definition, behavior rules, output format, and constraints that apply to every request. A user prompt contains the variable input specific to each request. Mixing both in the user message forces the model to reinterpret permanent instructions with every call, which causes output drift. Keeping them separate is the most impactful structural change for any prompt that runs repeatedly.

When should I use chain-of-thought prompting?

Use chain of thought when the task involves multiple dependent reasoning steps. Math problems, multi-criteria decisions, logical deductions, or diagnostic logic where each conclusion depends on a previous one. For simple, single-step tasks (translation, classification with clear criteria, factual lookup), chain of thought adds output tokens without improving accuracy. The test: if a wrong intermediate step would change the final answer, use CoT.

How many few-shot examples should I include in a prompt?

Two to four examples cover most use cases. Research shows example diversity matters more than quantity. Four examples covering different input types outperform eight similar examples. For simple classification, two examples are sufficient. For multi-step tasks with complex or non-standard output formats, three to four examples are worth the token cost. Never include more than five. Diminishing returns kick in quickly, and token cost scales linearly with every example pair.

What are delimiters in AI prompts and why do they matter?

Delimiters are characters or tags, triple backticks, XML-style tags, or section markers, that separate distinct sections of a prompt. They prevent boundary confusion: when a prompt contains instructions followed by input data, without a delimiter the model reads both as continuous context. Inputs that contain instruction-like language ("ignore the above" or "change the format to") can override your actual instructions if there's no structural boundary separating them.