Claude Code Refinement Techniques

In short: The refinement technique hierarchy is the order in which you reach for ways to improve Claude Code output: concrete input/output examples are the highest-leverage move, followed by test-driven iteration that shares failing tests, then the interview pattern where Claude asks questions before building. Adding more prose description sits at the bottom because the model generalises from demonstrations more reliably than from longer instructions.

What the Claude Code refinement techniques hierarchy is

The Claude Code refinement techniques hierarchy is a ranked list of the moves you make when output is not yet what you want, ordered from highest leverage to lowest. It exists because, when results disappoint, almost everyone's instinct is the same: write a longer, more detailed instruction. That instinct is backwards. The exam for the Claude Certified Architect treats this knowledge point as a corrective: it wants you to recognise that the most effective refinements are usually the ones that show rather than tell.

There are three techniques worth memorising in order. At the top sit concrete input/output examples, two or three demonstrations of the exact transformation you expect. Below them is test-driven iteration, where you write tests that pin down the behaviour and feed Claude the failures. Below that is the interview pattern, where you ask Claude to question you before it builds anything. And underneath all of them, as the default people overuse, is plain prose description. The whole point of the hierarchy is that the bottom rung is where most people start and the top rung is where they should.

Refinement technique hierarchy: the effectiveness ordering of ways to improve Claude Code output. Concrete examples come first, then test-driven iteration that shares failing tests, then the interview pattern, with more prose description as the weakest fallback. Higher rungs work because they replace ambiguous description with demonstrations the model can generalise from.

Why examples sit at the top

A prose instruction asks the model to imagine the result you have in your head and reconstruct it from words. Two or three worked examples remove the imagining entirely: they hand the model the pattern directly. This is why Anthropic's own guidance leans so heavily on showing a transformation rather than narrating it, and why a developer who pastes a before-and-after pair almost always gets a tighter result than one who writes another paragraph of rules.

The reason is generalisation. Given a handful of consistent examples, the model infers the underlying rule and applies it to new inputs, including the subtle conventions that are genuinely hard to put into words. Spacing, naming, the shape of an edge case, the tone of a commit message: these resist description but are obvious in a demonstration. The cost of an example is also tiny compared with its payoff, which is exactly what "highest leverage" means. You spend a few lines and you move the output more than a long instruction would.

Rungs at a glance: examples (rung 1, highest leverage); failing tests (rung 2, unambiguous feedback); interview (rung 3, surfaces the unknown).

Why test-driven iteration is the second rung

When the thing you want is behaviour rather than format, a couple of examples can still leave that behaviour ambiguous, and that is where test-driven iteration takes over. You write tests that encode the expected behaviour, run them, and share the failures with Claude. A failing test is about the least ambiguous feedback you can give: it is machine-verifiable, it names the exact gap, and it is much harder to misread than an English description. Claude reads the failure, adjusts the implementation, runs the tests again, and closes the loop on its own.

This is the second rung rather than the first only because it costs more to set up. Writing good tests is real work, and not every refinement needs that rigour. But for complex logic, the kind where you cannot fully spell out correct behaviour in two examples, it is the most reliable technique available, and it pairs naturally with the broader principle that you should always give Claude a check it can run. The deep treatment lives in test-driven iteration with failing tests.

Why the interview pattern is the third rung

Sometimes the problem is not that Claude misunderstood you; it is that you have not yet decided what you want. In an unfamiliar domain you may not know which edge cases matter or which tradeoffs are in play. The interview pattern handles this: you ask Claude to interview you, with clarifying questions, before it writes anything. It probes the parts of the spec you skipped, and the act of answering forces the decisions that would otherwise have surfaced as bugs.

It ranks below examples and tests because it addresses a different failure, under-specification rather than miscommunication, and because it is slower, front-loading a conversation before any code appears. But when you genuinely do not know enough to give an example or write a test, interviewing is the move that produces the missing knowledge. The ranking is therefore not a podium where one technique always wins; it is a default reading order that you adjust to the symptom in front of you, which is the subject of when to switch refinement techniques.

The refinement technique hierarchy, top to bottom

Effectiveness runs from top to bottom. Most people start at the bottom rung; skilled users start at the top.

Reading the hierarchy as a decision, not a ladder you always climb

It helps to think of the ranking as a starting guess rather than a fixed sequence you march through. You do not always begin with examples and only fall back to tests if they fail. Instead, you read the symptom and jump to the rung that matches it. Inconsistent formatting despite clear instructions points straight to examples. Complex behaviour you cannot demonstrate in two lines points to tests. A domain you barely understand points to an interview. The hierarchy tells you which rung is usually strongest, and your read of the situation tells you which one to try first.

What the ranking firmly rules out is the reflex at the bottom. When output is wrong and your first move is to add another sentence of description, you are reaching for the weakest tool in the kit. That reflex is the single behaviour this knowledge point is designed to break, and recognising it in a scenario question is most of the battle.
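
The symptom-to-rung reading in this section can be written down as a small lookup. This is only an illustrative sketch: the `Symptom` and `Technique` names and the `first_move` function are hypothetical, and the three mappings are simply the ones described in the text.

```python
from enum import Enum


class Symptom(Enum):
    INCONSISTENT_FORMAT = "inconsistent formatting despite clear instructions"
    COMPLEX_BEHAVIOUR = "behaviour too complex to demonstrate in two lines"
    UNFAMILIAR_DOMAIN = "a domain you barely understand"


class Technique(Enum):
    EXAMPLES = "concrete input/output examples"
    TESTS = "test-driven iteration with failing tests"
    INTERVIEW = "interview pattern"
    MORE_PROSE = "longer prose description"


# Which rung to try first for each symptom. MORE_PROSE is deliberately
# never a first move: it is the reflex the hierarchy rules out.
FIRST_MOVE = {
    Symptom.INCONSISTENT_FORMAT: Technique.EXAMPLES,
    Symptom.COMPLEX_BEHAVIOUR: Technique.TESTS,
    Symptom.UNFAMILIAR_DOMAIN: Technique.INTERVIEW,
}


def first_move(symptom: Symptom) -> Technique:
    """Return the technique to reach for first, given the observed symptom."""
    return FIRST_MOVE[symptom]


if __name__ == "__main__":
    for symptom in Symptom:
        print(f"{symptom.value} -> {first_move(symptom).value}")
```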

Worked example

A developer wants Claude Code to write git commit messages in a strict house style. Their CLAUDE.md says 'use conventional commits, imperative mood, under 50 characters, with a scope', yet the messages keep drifting: sometimes past tense, sometimes no scope, sometimes too long.

The instinct is to expand the instruction: add more rules, spell out the format, list forbidden words. That is climbing down to the weakest rung, and it tends to produce a longer instruction that Claude still reads loosely.

The hierarchy says reach for examples instead. The developer adds three real before-and-after pairs to the project configuration, each a diff summary alongside the exact commit line it should produce: 'feat(auth): add OAuth callback handler', 'fix(api): reject empty pagination cursor', 'refactor(db): extract migration runner'. No new prose, just three demonstrations of the transformation.

The drift stops almost immediately, because the examples encode the conventions that the prose could only gesture at: the scope in parentheses, the imperative verb, the length, the absence of a trailing period. Claude generalises the pattern to new commits it has never seen. Notice the move that mattered was not writing more; it was switching rungs, from describing the style to demonstrating it. Had the task been about subtle logic rather than format, the right switch would instead have been to tests, but for a formatting problem, examples are the top rung and they did the job in three lines.
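
The mechanical parts of that house style can also be turned into a check, which is a quick way to confirm the three example lines really do encode the rules. The sketch below is an assumption-laden illustration: `style_problems` is a hypothetical helper, the list of allowed commit types is a guess at what the team permits, and imperative mood is deliberately left unchecked because it is hard to verify mechanically.

```python
import re

# Assumed house style: "type(scope): subject", whole line under 50
# characters, no trailing period. The allowed types are an assumption.
COMMIT_LINE = re.compile(r"^(feat|fix|refactor|docs|test|chore)\(([a-z0-9-]+)\): (.+)$")


def style_problems(line: str) -> list[str]:
    """Return the list of house-style rules the commit line breaks."""
    problems = []
    if COMMIT_LINE.match(line) is None:
        problems.append("expected 'type(scope): subject'")
    if len(line) >= 50:
        problems.append("line is 50 characters or longer")
    if line.endswith("."):
        problems.append("trailing period")
    return problems


if __name__ == "__main__":
    examples = [
        "feat(auth): add OAuth callback handler",
        "fix(api): reject empty pagination cursor",
        "refactor(db): extract migration runner",
    ]
    for line in examples:
        # Each of the three example lines should report no problems.
        print(line, style_problems(line))
```

A line such as "Added oauth handler" would be flagged for its missing type and scope, which is the kind of drift the examples eliminated.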

Where each rung lives in the Claude Code workflow

The hierarchy is not just an ordering of tricks; each rung attaches to a phase of the recommended Claude Code workflow, which runs explore, then plan, then implement, then commit. The interview pattern belongs at the front, in the explore-and-plan phase. Anthropic's own advice for a larger feature is to start with a minimal prompt and have Claude interview you, using its question-asking tool to probe technical implementation, interface, edge cases, and tradeoffs, until the specification is complete, and then begin a fresh session to build from that written spec. That is the third rung formalised into a workflow step, which is exactly why interviewing pays off most when you do not yet know what you want.

Examples and tests attach to the implement phase, where the leverage is highest, and both are expressions of a single broader principle in the guidance: give Claude a check it can run. A worked example is a check on format (does the output match the demonstrated shape?), while a failing test is a check on behaviour that the model can run, read, and iterate against without you in the loop. Seen this way, the ranking is less a podium than a map of when each technique earns its keep: interview to define the target, examples to pin down its shape, tests to lock in its behaviour, with added prose as the fallback that rarely closes any of those gaps on its own. The same logic explains why a strong demonstration is worth committing to CLAUDE.md: it converts a one-off refinement into a check the whole team inherits on every future session.

Common misreadings to avoid

All three of the traps below come from misjudging where the leverage actually lives.

Misconception

When Claude's output is wrong, the most reliable fix is to write a longer, more detailed prose instruction.

What's actually true

More prose is the weakest rung of the hierarchy. The reliable fix is usually to switch techniques: show two or three concrete examples, or share a failing test. Demonstrations move output more than added description because the model generalises from them, which is exactly what the ranking is meant to teach.

Misconception

Since examples are the top rung, they are the right first move for every refinement problem.

What's actually true

Examples top the ranking for inconsistent format, but they cannot capture behaviour too intricate to demonstrate in a couple of pairs, and they are useless when you cannot yet say what correct looks like. For complex logic reach for tests, and for an unfamiliar, under-specified task reach for the interview pattern. The ranking is a default reading order, not a single answer for every symptom.

Misconception

The three techniques are interchangeable, so it does not matter which one you pick first.

What's actually true

They address different failures and differ in leverage. Examples fix inconsistent format, tests fix ambiguous behaviour, and the interview pattern fixes under-specification. Picking the rung that matches the symptom is the skill; treating them as equivalent throws away the whole point of ranking them.

How the hierarchy connects to the rest of Domain 3

This knowledge point is the root of task statement 3.5, and the three rungs each open into their own topic. Examples expand into example-based communication; tests expand into test-driven iteration; and the question of how to deliver several fixes at once leads into batch versus sequential feedback. All of them assume you can first name the ranking and resist the pull toward more prose. The configuration topics earlier in the domain, the three-level configuration hierarchy and how it travels through version control, are where many of these examples and rules end up living, since a good demonstration committed to CLAUDE.md benefits the whole team.

How it shows up on the exam

Domain 3 (Claude Code Configuration and Workflows) carries 20 percent of the exam, and task statement 3.5 is its iterative-refinement strand, tested mostly through Scenario 2 (code generation with Claude Code) and Scenario 5 (Claude Code for continuous integration). At the understand level you are rarely asked to recite the ranking in the abstract. Instead a question describes a developer whose output is inconsistent and whose first move was to add more detailed instructions, then asks for the better approach. The reliable answer traces back to the hierarchy: switch rungs toward a demonstration. Internalise that examples beat prose, tests beat ambiguous examples for behaviour, and the interview pattern rescues an under-specified task, and the scenario questions in this strand become quick to read.

Check your understanding

A developer's CLAUDE.md gives a long, detailed description of the JSON shape they want Claude Code to emit, but the output keeps varying in field order and formatting. They are about to add another paragraph of formatting rules. According to the refinement technique hierarchy, what is the higher-leverage move?
