When to Use Few Shot Examples

In short: Few-shot examples are the right fix when a prompt produces inconsistent results despite clear instructions. The three classic signals are inconsistent formatting, inconsistent judgement on ambiguous cases, and extraction fields that come back empty even though the information is present in the source.

Reading the failure before choosing the fix

The previous knowledge point established that examples are the highest-leverage lever for consistency. This one is about diagnosis: learning to recognise the moment a prompt is asking for examples rather than more words. That distinction is what lifts this knowledge point to the analyse level of Bloom's taxonomy. You are not recalling a technique; you are inspecting a symptom and tracing it to the right cause.

Knowing when to use few shot examples matters because the wrong reflex is so tempting. When output disappoints, almost everyone reaches for the instruction block and starts adding clauses. Sometimes that is correct, when the instruction genuinely was unclear. But a large class of failures persists no matter how many sentences you add, because the problem was never the clarity of the rule. It was the absence of a concrete demonstration. The skill is telling those two situations apart.

Deployment signal: An observable symptom in a prompt's output that indicates examples will help more than additional instructions: drifting format, inconsistent judgement on similar inputs, or fields that come back empty despite the data being present.

First, rule out an unclear instruction

Before treating any symptom as a few-shot signal, an architect runs one cheaper test: is the instruction itself clear? Anthropic's guidance offers a memorable version of this check, sometimes called the golden rule of clear prompting. Show your prompt to a colleague who has minimal context on the task and ask them to follow it; if they would be confused, the model will be too. When the instruction is genuinely vague, the fix is to be clear and direct rather than to reach for examples: state the desired output format and constraints, and lay out any multi-step work as numbered steps so the order and completeness are unmistakable.

This gate is what makes the diagnosis an analyse-level skill rather than a reflex. A vague instruction and a clear-but-undemonstrated instruction can produce the very same disappointing output, yet they call for opposite fixes. If the rule is muddled, clarify the prose. Only once the instruction would survive the colleague test, and the output still drifts, are you looking at the kind of failure that examples were built for. Skipping this step is how teams end up adding examples to a prompt whose real problem was an ambiguous sentence, or piling on sentences when the real problem was a missing demonstration.

Signal one: inconsistent formatting

The first signal is output that is correct in substance but unstable in shape. The model answers the question, but one run returns a bulleted list, the next a paragraph, the next a numbered list with a preamble. You wrote "return the result as a short list," and yet the rendering wanders. Detailed instructions alone routinely produce this kind of formatting drift, because there are many shapes that all technically satisfy a prose description.

Formatting is the textbook case for examples because format is exactly what a demonstration pins down with zero ambiguity. Two examples that both render the answer the same way leave no room for the model to improvise a third shape. If you find yourself writing increasingly specific instructions about layout, spacing, ordering, or punctuation, that escalation is itself the signal: stop describing the format and show it.

Signal two: inconsistent judgement on similar inputs

The second signal is subtler and more important. The format is fine, but the model makes inconsistent calls on ambiguous cases. Two reviews that are substantively alike get different sentiment labels. Two borderline tickets route to different queues. The instruction names the categories clearly, yet the model still wobbles whenever a case sits near a boundary.

This wobble is not randomness to be suppressed with a lower temperature; it is genuine uncertainty about where your standard draws the line. The model has no way to know which side of an ambiguous case you would choose, so it guesses, and guesses differently. An example that resolves a near-boundary case the way you would communicates the standard precisely. This is the bridge to constructing effective few-shot examples, where the examples are deliberately chosen to show reasoning, and to few-shot for ambiguous edge cases, which targets exactly these boundary calls.

Signal three: empty extraction fields

The third signal shows up in extraction work and is the one most often misdiagnosed. You ask the model to pull structured data from documents, and certain fields come back empty or null even though the information is plainly present in the source. The model is not refusing; it is unsure that the value it sees belongs in that field, especially when the document's layout differs from whatever it implicitly expected.

This false-null pattern is a few-shot signal, but it must be distinguished carefully from a different failure with the same surface: a field that is empty because the information is genuinely absent from the document. That second case cannot be fixed by examples or by retrying, a boundary explored in retry effectiveness. The architect's job is to check the source first. If the data is there and the field is empty, an example showing that field filled from a similar layout usually fixes it; if the data is not there, no example will conjure it.

The one mechanism behind all three signals

The three signals look unrelated until you notice what they share. Claude's recent models follow instructions literally and explicitly; as Anthropic describes it, the model does not silently generalise an instruction from one item to another, nor does it infer requests you did not make. That literalism is a strength for predictable pipelines, but it is also precisely why a single clear rule can still produce inconsistent output across varied inputs. The model will not quietly extend your intent to a case it does not recognise.

Read the three signals through that lens and they collapse into one root cause. Formatting drifts because the instruction never showed which of several valid shapes you meant, and the model will not invent a canonical one on your behalf. Judgement wobbles because the rule named the categories but never demonstrated where the boundary falls, so each borderline case is decided afresh. Fields come back empty because a value buried in an unfamiliar layout does not obviously match the field the model was told to fill, and it will not assume that it does. In every case the gap is the same: a specific case your instruction did not cover and the model declined to extrapolate. An example is simply the most direct way to supply that missing case, which is why all three symptoms route to the same fix.

Diagnosing an inconsistent prompt

Loading diagram...

Examples are the answer for three specific symptoms; a genuinely missing fact is a different problem entirely.

A worked example: a false-null extraction field

Worked example

A contracts pipeline extracts effective_date, parties, and termination_notice_days from agreements, and effective_date comes back empty on roughly a third of documents.

The team's first move was to assume the prompt was unclear and to expand the instruction for effective_date into a paragraph describing every phrasing a contract might use. Accuracy barely moved. The empty fields persisted, and crucially they clustered: documents where the date appeared inside a dense opening recital, rather than in a labelled header, were the ones that failed.

That clustering was the real signal. The instruction was not unclear; the model simply was not confident that a date buried in a sentence like "this Agreement is made effective as of the third day of March" belonged in a field it associated with a labelled value. The information was present, so this was a false null, not an absent fact. That pointed squarely at examples rather than at more instructions or a retry loop.

The fix was two examples. One showed effective_date extracted from a clean labelled header. The other showed it extracted from exactly the kind of recital sentence that had been failing, with the parsed date filled in. With those two demonstrations, the false-null rate on recital-style documents collapsed. The lesson the team carried forward was procedural: before touching the instruction block, confirm the data is in the source, then ask whether the failure is a format wobble, a judgement wobble, or a false null. All three are few-shot signals, and naming which one you are looking at tells you what the example needs to demonstrate.

A second worked example: formatting that will not settle

Worked example

A release pipeline asks Claude to turn each merged pull request into a one-line changelog entry, and the entries come back in a different shape almost every run.

The instruction was clear enough to pass the colleague test: summarise each pull request as a single changelog line that begins with a verb. Yet the output wandered. One run produced a tidy imperative line, the next wrapped it in a bullet, another prefixed a category label, and a fourth split into two sentences. Nothing was wrong, exactly, but the changelog read as though four different people had written it.

This is signal one in its purest form. The instruction described the destination without showing the route, and a single changelog line can take many shapes that all technically obey the rule. The team had been escalating the prose, adding clauses about punctuation and capitalisation, which only made the prompt longer without making it converge. By the golden-rule standard the instruction was fine; the missing ingredient was a demonstration, not a clarification.

The fix was three examples, each pairing a realistic pull-request title with the exact changelog line they wanted: same verb-first phrasing, same length, no bullet and no label. With the shape demonstrated rather than described, the drift stopped on the next run. The team had spent a week refining adjectives when two minutes of showing would have done it, which is the lesson this signal keeps teaching.

Common misreadings to avoid

Misconception

If extraction returns empty fields, the prompt needs a longer, more detailed instruction for that field.

What's actually true

When the data is present in the source but the field is empty, the model is unsure the value belongs there for an unfamiliar layout. A single example showing the field filled from a similar document fixes this far more reliably than another paragraph of description.

Misconception

Inconsistent judgement on similar inputs is randomness you remove by lowering the temperature.

What's actually true

Lowering temperature reduces sampling variability but not the underlying uncertainty about where your standard draws the line. The model wobbles because it does not know which way you would resolve a borderline case; an example that resolves one is what removes the wobble.

Misconception

If the instruction makes sense to me, the prompt is clear, so any remaining inconsistency must be a model defect.

What's actually true

The golden-rule test is whether a colleague with minimal context would be confused, not whether the author is. And even a genuinely clear instruction can still need examples: when the gap is a missing demonstration rather than a missing rule, the output drifts despite flawless prose.

How it shows up on the exam

Task 4.2 questions at this level present you with a symptom and ask for a diagnosis. A stem might describe a prompt with thorough instructions whose output still formats unevenly, or an extraction job with stubborn empty fields, and the distractors will offer to add more instructions, tighten the wording, or lower the temperature. The credited answer recognises the symptom as a few-shot signal and adds examples. The harder variants add a twist: a field is empty because the information truly is not in the document, and the correct answer is to recognise that examples will not help. Reading the symptom precisely, rather than reflexively adding rules, is the entire skill being tested.

Check your understanding

An extraction prompt returns null for a customer_tier field on about a quarter of records. An engineer checks the failing source documents and confirms the tier is clearly stated in each one. What is the most appropriate next step?

Check your understanding

A summarisation prompt is reviewed by a teammate with no prior context, who follows it without confusion, yet the live output still alternates between bullet lists and paragraphs on similar inputs. What does this most strongly indicate?

Watch and learn

Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.

No videos curated for this concept yet

We are still curating the best official and community videos for this topic.

References & primary sources

Adaptive study

Master this concept with Archie

Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.

Start studying

When to Use Few Shot Examples: The Signals That You Need Them

Reading the failure before choosing the fix

First, rule out an unclear instruction

Signal one: inconsistent formatting

Signal two: inconsistent judgement on similar inputs

Signal three: empty extraction fields

The one mechanism behind all three signals

A worked example: a false-null extraction field

A second worked example: formatting that will not settle

Common misreadings to avoid

How it shows up on the exam

People also ask

Watch and learn

References & primary sources

Master this concept with Archie