AI Skill Certs
Prompt Engineering & Structured Output·Task 4.2·Bloom: understand·Difficulty 2/5·8 min read·Updated 2026-06-07

Few Shot Prompting: The Highest-Leverage Way to Make Claude Consistent

Apply few-shot prompting to improve output consistency and quality

SUBy Solomon UdohReviewed by Solomon UdohAI-assisted · human-reviewed
In short
Few-shot prompting means showing Claude two to four worked examples of a task inside the prompt so it copies the demonstrated pattern instead of guessing from instructions alone. For consistency problems it is the highest-leverage technique available, because examples communicate format and judgement far more reliably than extra prose rules.

What few shot prompting actually is

Few shot prompting means placing a small number of worked examples directly in the prompt, so Claude sees the task done correctly before it attempts its own. Rather than describing the behaviour you want in prose, you demonstrate it: here is an input, here is the ideal output, repeated two to four times. The model infers the pattern from the demonstrations and applies it to the new input. Anthropic's own documentation calls this technique multishot prompting and describes it as one of the most reliable ways to steer output format, tone, and structure.

The Claude Certified Architect exam treats this knowledge point as the gateway to Task 4.2 because it overturns a beginner instinct. When output looks wrong, the natural reflex is to write more rules. The few-shot principle says the opposite: show, do not tell. One unambiguous example often carries more steering signal than three carefully worded paragraphs, because a demonstration cannot be misread the way an instruction can.

Few-shot prompting
Including a handful of input-output examples inside the prompt so the model copies the demonstrated pattern. Two to four targeted examples are usually enough to lock in a consistent format and a consistent standard of judgement.

Why examples beat more instructions

An instruction describes the destination; an example shows the route. When you write "format the answer cleanly," the model has to invent what clean means, and it will invent something slightly different each run. When you instead show two answers already formatted the way you want, there is nothing left to interpret. The pattern is on the page. This is why the foundational principle of Task 4.2 is that examples, not additional instructions or confidence thresholds, are the most effective lever for consistency.

There is a deeper reason rooted in how the model reasons. Instructions are processed as constraints the model must translate into behaviour, and translation introduces variance. Examples are processed as patterns the model can match against directly, which is a far lower-variance operation. The closer your example is to the real input, the less translation happens and the more stable the output becomes. That is also why a vague instruction like "be conservative" fails for the same family of problems that explicit categorical criteria were designed to solve: both ask the model to guess at a standard you could simply have shown it.

Examples also teach judgement, not just formatting. If your task involves a borderline call, an example that resolves that call the way you would communicates a standard no adjective can capture. The model does not merely copy the shape of the output; it absorbs the decision the example embodies, which is what later lets it generalise to inputs you never demonstrated.

Showing beats prohibiting

Anthropic's prompting guidance sharpens this principle one notch further: positive examples that demonstrate the behaviour you want tend to steer more reliably than negative instructions that list what the model must not do. A rule such as "do not add a preamble" still leaves the model to imagine what an acceptable preamble-free answer looks like, and it will imagine a slightly different one each run. An example that simply shows the answer arriving with no preamble settles the question outright. Few-shot prompting is therefore not only show-do-not-tell; at its best it is show-the-target rather than forbid-the-alternatives, which is why one clean demonstration so often outperforms a paragraph of prohibitions.

The two-to-four sweet spot

How many examples is enough is the question every architect asks first. The practical answer is two to four targeted examples for most steering problems, in line with Anthropic's guidance of roughly three to five well-chosen demonstrations. The first example establishes the format. The second confirms it was not a coincidence. A third or fourth covers variation the first two missed. Past that, you are usually paying for tokens without buying more consistency.

More is not better here, and that surprises people. A long gallery of examples bloats the prompt, raises cost on every call, and can even hurt: if the examples lean one way, the model may overfit to an accidental pattern, such as always answering in three sentences because your examples happened to. The skill is choosing few examples that are individually high-signal, which is the focus of the downstream knowledge point on constructing effective few-shot examples.

2-4
targeted examples is the usual sweet spot
consistency
the property examples move most
show > tell
a demonstration beats a description

What Anthropic's guidance says about leverage

Anthropic's documentation does not hedge about how much examples buy you. It describes them as one of the most reliable ways to steer Claude's output format, tone, and structure, and says a handful of well-crafted ones can dramatically improve both accuracy and consistency. That is the empirical backbone of this knowledge point: the leverage is not folklore, it is the documented recommendation, which is exactly why the exam treats "add examples" as the default-correct move for a consistency complaint.

Two operational details from that guidance are worth carrying into the exam. First, examples earn their keep when they are clearly delimited from the surrounding prose. Wrapping each demonstration in an <example> tag, and the whole set in an <examples> tag, lets the model read them as demonstrations rather than as more rules to obey, which is the difference between a pattern it copies and an instruction it tries to interpret. Second, you are not on your own when choosing them: the same guidance suggests asking Claude to evaluate your own examples for relevance and coverage, or to propose additional ones from a starter set. Selecting which few demonstrations carry the most signal is its own skill, developed in constructing effective few-shot examples; what matters at this root knowledge point is simply that a small, clearly marked set of demonstrations is the single highest-return edit you can make to a prompt that will not behave.

Where examples sit inside the prompt
Loading diagram...
Demonstrations are wrapped in their own tagged block between the instructions and the real input, so the model treats them as patterns to match rather than commands to interpret.

How a few-shot example steers the model

A few-shot prompt is read top to bottom as a single sequence, and the examples sit before the real input. By the time the model reaches the input it must actually answer, it has already seen the task resolved correctly several times in the same context window. That recent, in-context demonstration is what biases its next response toward the same shape and the same standard.

How examples reshape one response
Loading diagram...
Adding examples replaces an open-ended interpretation step with a low-variance pattern match.

Crucially, the examples do not change the model's weights or persist between requests. The Messages API is stateless, so the demonstrations live only inside the one prompt you send. Send the same task tomorrow without the examples and the consistency evaporates. Few-shot prompting is a property of the prompt, not of the model, which is why it composes cleanly with everything else you put in the context window.

A worked example: taming an inconsistent classifier

Worked example

A support team asks Claude to tag each incoming ticket as billing, technical, or account, and complains that identical tickets sometimes get different tags.

The original prompt was pure instruction: "Classify each ticket into billing, technical, or account. Use your best judgement on ambiguous cases." It worked most of the time, but a ticket like "I was charged after cancelling my account" would land on billing one run and account the next. The team's first instinct was to add rules, and the prompt grew to a page of edge-case clauses without ever becoming stable.

The fix was to delete most of the rules and add three examples instead. Each paired a realistic ticket with its correct tag and a one-line reason: the cancellation-then-charge ticket was tagged billing, with the note that the user's concrete problem is the charge, not the closure. A login failure after a password reset was tagged technical. A request to add a teammate was tagged account. Three demonstrations, each resolving a genuinely ambiguous case the way the team would.

Consistency jumped immediately, because the model was no longer interpreting a paragraph of clauses; it was matching against a handful of resolved cases. Just as important, the behaviour was now auditable: when a new ambiguous ticket was mis-tagged, the team did not argue about wording, they added one more example that drew the boundary correctly. That is the quiet superpower of few-shot prompting. It turns a debate about instructions into a question of which example to show.

Common misreadings to avoid

Misconception

If the output is inconsistent, the answer is to write more detailed instructions.

What's actually true

More instructions add more for the model to interpret, which usually adds variance rather than removing it. The highest-leverage move for inconsistency is to show two to four examples of the correct output, because a demonstration removes the interpretation step entirely.

Misconception

More examples are always better, so load the prompt with as many as you can.

What's actually true

Two to four targeted examples is the sweet spot. Beyond that you pay extra tokens for little gain, and a long, lopsided set can make the model overfit to an accidental pattern in your examples rather than the real task.

Misconception

Telling the model what not to do steers it just as well as showing it what to do.

What's actually true

Anthropic's guidance is that positive demonstrations generally outperform negative instructions. A prohibition still leaves the target shape to the model's imagination, while an example fixes it directly. Show the output you want rather than listing the outputs you do not.

Where this sits in the knowledge graph

This knowledge point is the root of Task 4.2, and the rest of the task branches directly from it. Once you accept that examples are the primary lever for consistency, the next questions are practical: how do you recognise that a prompt needs them, covered in when to deploy few-shot examples, and how do you build examples that generalise rather than merely pattern-match, covered in constructing effective few-shot examples. The same idea reappears whenever Claude communicates by demonstration rather than description, which is the theme of example-based communication in Domain 3.

It also has a soft prerequisite. Explicit categorical criteria and few-shot examples are complementary precision tools: criteria narrow what counts as a finding, and examples lock in how each finding should look and be judged. Architects who understand both can reach for the right one instead of defaulting to ever-longer instructions.

How it shows up on the exam

Domain 4 carries twenty percent of the exam, and Task 4.2 is built on this single idea. Questions rarely ask you to define few-shot prompting. Instead they describe a working prompt whose output drifts, formats unevenly, or makes inconsistent calls on similar inputs, and then offer four plausible fixes. The credited answer is almost always the one that adds a small set of examples, while the distractors pile on more instructions, lower the temperature, or reach for a confidence threshold. Train yourself to read inconsistency as a signal for examples, and this family of questions becomes almost mechanical.

Check your understanding

A team's prompt classifies product reviews as positive, negative, or mixed. The instructions are detailed, yet near-identical reviews are sometimes labelled differently on repeated runs. Which change most directly improves consistency?

Check your understanding

An engineer wants Claude to stop adding a chatty preamble before its JSON output. The prompt already says, in plain words, not to add a preamble, yet the preamble still appears intermittently. Which change best reflects few-shot best practice?

People also ask

How many examples should a few-shot prompt include?
Two to four targeted examples are the usual sweet spot, in line with Anthropic guidance of roughly three to five. The first sets the format, the next confirm it, and a fourth covers variation; beyond that you pay tokens for little extra consistency.
Does few-shot prompting work better than adding more instructions?
For consistency problems, yes. Examples show the exact pattern to copy and remove the interpretation that prose leaves open, so a few demonstrations usually beat several extra paragraphs of rules.
What is the difference between zero-shot and few-shot prompting?
Zero-shot gives the instruction with no worked examples; few-shot includes a handful of input-output pairs first. Few-shot trades a little prompt length for markedly more consistent format and judgement.

Watch and learn

Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.

Official · Anthropic AcademyOpen full lesson in Academy

Providing examples

Why watch: Showing worked examples is the few-shot technique this KP names as the highest-leverage path to consistency.

More videos for this concept

References & primary sources

Adaptive study

Master this concept with Archie

Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.

Start studying