- In short
- Few-shot scenario evaluation is the judgement of choosing few-shot examples only when the problem is inconsistent judgement or formatting, and choosing a different tool when it is not. Syntax errors call for tool use with a schema, missing or fabricated data calls for a validation-and-retry loop, and vague criteria call for explicit categorical rules.
Knowing when to use few shot prompting
Knowing when to use few shot prompting is a higher-order skill than knowing how to write the examples. This knowledge point sits at the evaluate level of Bloom's taxonomy because the exam is not asking you to apply a technique, it is asking you to choose among techniques under a described failure and defend the choice. Few-shot is powerful, but applying it to the wrong kind of problem wastes effort and leaves the real defect untouched.
The discipline is diagnosis first, prescription second. Before adding examples, you name the symptom precisely, then map that symptom to the technique built to address it. Few-shot earns its place only for one family of symptoms; three other families have better-matched fixes, and confusing them is the single most reliable way to lose marks on this task statement.
- Technique selection for prompts
- Diagnosing the specific failure in a prompt's output and selecting the matched fix: few-shot for inconsistency, tool use with a schema for malformed structure, a validation-retry loop for missing or fabricated data, and explicit categorical criteria for vagueness.
Four symptoms, four matched fixes
The cleanest way to internalise this knowledge point is as four pairings. Each symptom has a fix that targets its root cause, and the fixes are not interchangeable.
The first symptom is inconsistency: the output is structurally fine, but Claude makes different judgement calls on similar inputs, or formats equivalent answers differently from one run to the next. This is the few-shot zone. Examples show the model the consistent judgement and layout you want, which instructions alone struggle to pin down.
The second symptom is malformed structure: the content is broadly right, but the output is not valid, a JSON object with a trailing comma, a missing bracket, an unexpected field. Examples can nudge format, but they cannot guarantee it. The matched fix is tool use with a JSON schema, where the schema constrains the shape so invalid structure becomes impossible rather than merely discouraged.
The third symptom is missing or fabricated data: fields are blank that should be filled, or worse, the model invents plausible values. The matched fix is a validation-and-retry loop that checks the output against the source and feeds errors back for another attempt. No number of examples repairs a process that never verifies its own output.
The fourth symptom is vagueness: the criteria themselves are subjective, so the model cannot be consistent because the target is undefined. The matched fix is explicit categorical criteria, which replace "be conservative" or "only flag important issues" with concrete, checkable rules.
A decision rule you can run in your head
The value of holding the whole tree in mind is that it stops you reaching for your favourite tool by reflex. An architect who loves examples will try to solve malformed JSON with more examples; an architect who loves schemas will try to solve inconsistent judgement with a stricter schema. Both are misdiagnoses. The tree forces you to start from the observed behaviour and let the symptom select the fix.
Why misdiagnosis is the real trap
The exam rarely asks "what does few-shot do." It asks you to look at a described failure and pick the best response, and it stocks the options with techniques that each sound reasonable. The wrong answers are wrong not because they are bad techniques but because they are aimed at the wrong symptom. Offering tool use for an inconsistency problem, or few-shot for a malformed-structure problem, is the kind of plausible-but-mismatched choice the distractors are built from.
This is why the knowledge point is framed as evaluation. You are being scored on whether you can separate "the output is the wrong shape" from "the output is the right shape but the wrong content" from "the judgement is inconsistent." Each of those sentences points at a different fix, and the few-shot fix belongs to only one of them. It connects naturally to the precision versus recall trade-off in review prompts, where choosing what to optimise is itself an evaluate-level judgement.
Evaluating trade-offs once the symptom is clear
Matching symptom to fix is necessary but not sufficient. Even within the correct branch you weigh costs. Few-shot adds tokens to every request, so for a high-volume pipeline you balance the consistency gain against the per-call expense. Tool use with a schema adds rigidity that is exactly what you want for structure but unwelcome when you need open-ended prose. A validation loop adds latency and extra calls, which is fine for an overnight batch and painful for a synchronous, user-facing flow.
So the full evaluation is two steps: identify the symptom to choose the technique, then weigh that technique's costs against your operational constraints. A good answer on this knowledge point names the matched fix and shows awareness of what it costs, rather than treating few-shot as a free, universal upgrade.
Worked example
A structured-extraction service has three open complaints. Each describes a different failure, and a junior engineer wants to fix all three by adding more few-shot examples.
Complaint one: the service returns records where author judgement varies, sometimes classifying a borderline document as a report, sometimes as a memo, with no consistency. This is inconsistency in judgement, the few-shot branch. You add two examples that demonstrate the borderline call and the reasoning behind it.
Complaint two: roughly five percent of responses are not valid JSON, breaking the downstream parser. This is malformed structure. Few-shot is the wrong fix; you move the output behind a tool definition with a JSON schema, after which invalid shapes simply cannot be emitted. The engineer's instinct to add examples here would have reduced the failure rate slightly while leaving a parser that still crashes.
Complaint three: a key field is sometimes filled with confident but fabricated values when the source is silent. This is a semantic, data-integrity problem. The fix is a validation step that checks the field against the source and, when it is unsupported, returns the record for a retry with feedback, plus a nullable field so the model can legitimately say "not present" instead of inventing.
Three complaints, three different fixes, and only one of them is few-shot. The lesson the worked example drives home is the one the knowledge point is named for: you evaluate the failure before you prescribe the technique, and resisting the urge to apply your favourite tool to every symptom is the skill being tested.
A diagnosis checklist you can run under exam pressure
When a question describes a failing prompt, run a quick triage before you even read the answer options. First, is the output the right shape? If it is structurally invalid, you are in the tool-use branch and few-shot is almost certainly a distractor. Second, if the shape is fine, is the content present and faithful to the source, or is data missing or invented? If integrity is the issue, you are in the validation-loop branch. Third, if the content is present and well-formed but the decisions or formatting wander from run to run, you are finally in the few-shot branch. Fourth, if you cannot even articulate what a correct decision looks like, the criteria themselves are too vague and the fix is explicit categorical rules.
Running that four-step check in order keeps you from anchoring on the first plausible technique you happen to see. The exam authors know few-shot is the famous, attractive answer, so they frequently dangle it in front of structural or semantic failures. The checklist is your defence against picking it by reputation rather than by fit, and it is fast enough to run in your head while you read the stem.
Why this is an evaluate-level skill, not a recall one
It is worth pausing on why this knowledge point is pitched at the evaluate tier rather than apply. Applying few-shot is a procedure: once you have decided examples are needed, you write them. Evaluating is a judgement: you must weigh several viable techniques against a described situation and justify why one of them wins. That comparison is irreducible. There is no formula that converts a symptom into a fix without understanding what each technique actually does and what it costs to run. The exam is testing that you can hold the whole toolkit in view at once and reason about fit, which is exactly the competency a working architect needs the moment a production pipeline starts misbehaving and several fixes are on the table.
Misconceptions that cost marks
Misconception
If Claude's output is inconsistent in any way, few-shot examples are the answer.
What's actually true
Misconception
Few-shot examples are the most reliable way to guarantee valid JSON output.
What's actually true
The cost of choosing the wrong branch
It is worth being concrete about what a misdiagnosis costs, because that is what makes this evaluation matter in production rather than only on the exam. Solve a malformed-structure problem with few-shot and you ship a pipeline whose parser still crashes intermittently, now masked by a slightly lower failure rate that lulls you into believing it is fixed. Solve a missing-data problem with few-shot and you teach the model to format its fabrications more convincingly while the underlying integrity gap remains untouched. Solve an inconsistency problem with a stricter schema and you constrain the shape without ever reaching the judgement that was actually wandering. In each case the wrong branch produces motion without progress, effort spent that leaves the real defect in place and, worse, often hides it. Naming the symptom correctly is therefore not pedantry; it is the difference between a genuine fix and a convincing disguise that fails again next week.
The takeaway
Treat few-shot as a precision instrument with a specific indication, not a cure-all. Diagnose the symptom, select the technique built for it, then sanity-check the technique against your latency and cost constraints. Getting that sequence right is what this evaluate-level knowledge point rewards, and it is reinforced by the more specialised judgement of aiming few-shot at ambiguous edge cases once you have confirmed few-shot is the correct tool.
An extraction pipeline emits content that is usually correct, but about 8% of responses are invalid JSON that crashes the downstream parser. An engineer suggests adding few-shot examples of well-formatted output. What is the best evaluation of this proposal?
People also ask
When should you use few-shot prompting instead of tool use?
How do you decide between few-shot examples and a validation loop?
Is few-shot the right fix for malformed JSON output?
Watch and learn
Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.
Exercise on prompting
Why watch: Deciding when examples are the right fix versus another technique is the evaluation skill this KP tests.
More videos for this concept
References & primary sources
Master this concept with Archie
Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.