- In short
- Test-driven iteration means writing tests that define the expected behaviour before any implementation exists, running them so they fail, and handing that failing output to Claude Code as concrete feedback. Because a failing test is unambiguous and machine-verifiable, the model can iterate on the implementation until every test passes without a human re-checking each change.
What test driven development with Claude Code actually means
Test driven development with Claude Code is an iterative refinement technique: you write tests that encode the behaviour you want before the implementation exists, run them so they fail, and then hand that failing output to the model as the specification it must satisfy. The order is the whole point. You are not asking Claude to build something and then prove it works; you are defining "works" as a runnable check first, and letting that check drive the code into existence.
This matters because the model has a strong default tendency to do the opposite. Left to its own devices, Claude Code will usually write the implementation, then write tests that pass against the code it just produced. Those after-the-fact tests validate whatever was built rather than pinning down what should have been built. Genuine test-driven iteration inverts that habit on purpose, which is why doing it well is a deliberate, taught skill rather than something that happens automatically.
- Test-driven iteration with failing tests
- An iterative refinement workflow in which you author tests that define expected behaviour first, run them to confirm they fail, then give the failing output to Claude Code as concrete feedback so it iterates the implementation until the suite passes.
Why failing tests are the strongest feedback signal
Every refinement technique is, at heart, a way of telling Claude what to change. The quality of that signal determines how many rounds you need and how reliably the model converges on what you actually meant. A failing test is the highest-quality signal available because it is unambiguous, executable, and machine-verifiable. The assertion expected 90 but received 100 leaves no room for interpretation; the model knows precisely which input produced which wrong output and what the right output should have been.
Contrast that with a prose instruction such as "make it handle the edge cases properly." Prose has to be parsed, disambiguated, and turned back into concrete behaviour, and each of those steps is a chance for the model to guess differently from you. A test removes the guessing. It also lets the loop close on its own: Claude runs the suite, reads the result, and iterates until the check passes, instead of waiting for you to notice each mistake. Anthropic frames this as giving Claude "a way to verify its work," and notes that without a check it can run, "looks done" becomes the only signal and you become the verification loop. Failing tests turn that human bottleneck into an automated one.
Why this is tested on the Claude Certified Architect exam
Knowledge point 3.5.5 lives in Domain 3, Claude Code Configuration and Workflows, which carries 20 percent of the exam, and inside Task Statement 3.5, applying iterative refinement techniques for progressive improvement. It is assessed at Bloom level "apply," so questions will not ask you to recite a definition. They put you inside a situation, usually drawn from Scenario 2 (code generation with Claude Code) or Scenario 5 (Claude Code for continuous integration), and ask which refinement move best fits.
The architect-level insight the exam rewards is recognising when a problem is really an ambiguity problem in disguise. A developer who keeps re-prompting in prose and keeps getting subtly wrong code is fighting the wrong battle; the fix is to convert the desired behaviour into failing tests so the model gets verifiable feedback. The exam also leans on the related D3 principle that an independent check catches more than self-assertion, and a committed test suite is exactly that kind of external, deterministic check. If you can spot, in a scenario, that test-driven iteration would replace several noisy prose rounds with one precise one, you are answering at the level this knowledge point targets.
The loop, step by step
Test-driven iteration is a small cycle you repeat. The structure below is what separates it from simply asking for tests at the end.
Write the tests before the implementation
State explicitly that you are doing TDD and that no implementation should be written yet. Specificity helps: name the file, the function, and the cases. "Write failing unit tests in pricing.test.ts covering 9, 10, 49, 50, 99, 100 and 101 units; do not write the implementation" gives the model far more to work with than "add some tests."
Confirm the tests fail for the right reason
Run the suite and check that it is red because the behaviour is missing, not because of a typo or an import error. A test that passes immediately, or fails for an unrelated reason, is not yet exercising the behaviour you care about. This step validates the tests themselves before you trust them to validate the code.
Hand the failing output back as feedback
Commit the tests, then give Claude the actual failing output and ask it to make the suite pass by changing only the implementation. The concrete assertions are the feedback; the model reads which inputs produced which wrong results and works from there.
Iterate to green without editing the tests
Let Claude run the suite, read the failures, and adjust the implementation in a tight loop. The one rule it must not break is altering the committed tests to force a pass. For unattended runs you can make this deterministic with a Stop hook that re-runs the suite and blocks the turn from ending until it is green.
Prose descriptions versus failing tests
The single exam trap attached to this knowledge point is describing expected behaviour in prose when test cases would be more precise. It is a tempting mistake because prose feels faster: you type a sentence instead of authoring assertions. But the saving is an illusion. Prose pushes the disambiguation work onto the model and gives you no automatic way to confirm the result, so you pay the cost back in extra correction rounds and in bugs that slip through an eyeball review.
A failing test front-loads that effort into a form the machine can check forever. Writing expect(applyBulkDiscount(50, 100)).toBe(4400) takes a moment longer than writing "discount kicks in at fifty units," but it is exact, it is repeatable, and it catches the off-by-one at the tier boundary that prose would have quietly waved through. On the exam, an option that proposes "describe the requirements in more detail" is almost always the distractor; the keyed answer encodes the requirements as a test.
A worked example: pinning down tier boundaries first
Worked example
A developer uses Claude Code to implement applyBulkDiscount(quantity, unitPrice), where 10+ units get 5 percent off, 50+ get 12 percent off, and 100+ get 20 percent off, and wants the tier boundaries pinned down before any code exists.
Instead of writing a paragraph that says "apply tiered discounts at 10, 50, and 100 units," the developer asks Claude Code to write the tests first and nothing else: "Write failing unit tests in pricing.test.ts for applyBulkDiscount covering 9, 10, 49, 50, 99, 100 and 101 units. Do not write the implementation yet." Those boundary quantities are exactly where tiered logic tends to break, and a test makes each one explicit.
Claude creates the test file and the developer runs the suite. Every assertion errors because applyBulkDiscount does not exist yet, and that failure is the point: a red suite confirms the tests actually exercise the missing behaviour rather than passing against a stub. The developer commits the tests at this red stage, which fixes the target so it cannot quietly drift later.
Now the developer hands the failing output back: "Here is the test output: [paste]. Write only the implementation so all tests pass. Do not modify the tests." Claude reads the concrete assertions, such as expected 950 for quantity 10, and writes a function to satisfy them. It runs the suite, sees two boundary cases still red because it used > instead of >= at the tier edges, and iterates on its own, because the output tells it precisely what is wrong. When the suite is green, the developer reviews the diff to confirm that only pricing.ts changed and the committed tests are untouched.
Contrast this with the prose-first path. Had the developer simply said "add tiered bulk discounts" and eyeballed the result, the off-by-one at quantity 50 would most likely have shipped. The failing tests turned an ambiguous request into a machine-verifiable contract, and the model closed the loop against that contract without a human re-checking every branch.
Test behaviour, not implementation details
A failing test is only as good as what it asserts, and the most common way test-driven iteration goes wrong is testing the wrong thing. Tests should pin down externally observable behaviour through the public interface, the inputs a caller provides and the outputs or effects they can see, rather than the internal mechanics of how the code achieves them. A test that asserts applyBulkDiscount(50, 100) returns 4400 describes a contract the implementation must honour. A test that asserts the function called a private helper twice describes an implementation detail that has nothing to do with whether the discount is correct.
The distinction matters because brittle tests undermine the whole loop. When tests are coupled to internal structure, every legitimate refactor breaks them even though behaviour is unchanged, so Claude Code spends iterations chasing failures that signal nothing real, and you lose the ability to let it reshape the implementation freely. There is also a perverse incentive: the easy way to satisfy an internals-focused test can be to preserve an awkward internal structure rather than improve the code. Tests written against the public contract leave the implementation free to change as long as the observable result stays correct, which is exactly what you want a refinement loop to optimise.
In practice this means writing assertions a caller would care about: return values, raised errors, persisted records, emitted events. Avoid reaching into private fields, mocking internal collaborators the public behaviour does not depend on, or asserting call counts on helpers. When the tests describe what the code does rather than how, the red-to-green loop drives Claude toward correct behaviour while leaving it room to find a clean implementation, and the suite keeps its value through later refactors instead of fighting them.
Common misconceptions
Misconception
If I describe the expected behaviour clearly enough in prose, Claude Code does not need failing tests to get it right.
What's actually true
Misconception
Doing TDD just means asking Claude Code to add tests after it writes the function.
What's actually true
Misconception
When a test fails, the fastest fix is to let Claude adjust the test so the suite goes green.
What's actually true
Where test-driven iteration sits among the refinement techniques
This knowledge point builds directly on the technique hierarchy for refinement, which is its hard prerequisite: you choose test-driven iteration deliberately when a task has verifiable, well-defined behaviour, rather than reaching for it by reflex. It also pairs with neighbouring techniques. Sharing a whole failing suite at once is a form of batch versus sequential feedback, and writing precise assertions is a concrete instance of example-based communication, where a worked example beats an abstract description. Knowing when to switch refinement techniques tells you when tests are the right tool and when a quick prose nudge will do.
The workflow rule that the tests must be written first and not edited is also a natural fit for project configuration. Encoding "do test-driven development, never modify committed tests" into shared, version-controlled guidance follows the same logic as the three-level configuration hierarchy: standards your whole team relies on belong in project-level config, not in one developer's personal setup.
How it shows up on the exam
Expect a scenario, not a definition check. A team is generating or fixing code with Claude Code, an approach keeps producing subtly wrong results, and you must choose the refinement move that most reliably resolves it. The keyed answer turns the ambiguous requirement into failing tests and feeds the output back; the distractors usually offer more prose detail, a sampling tweak such as temperature, or a self-review step that produces no verifiable signal. Hold onto the core idea: test-driven iteration replaces interpretation with an executable contract, and a failing test is the most precise feedback you can give the model.
You are using Claude Code to fix a CSV parser that mishandles quoted fields containing commas. You describe the bug in prose, Claude produces a fix that looks right, but two weeks later a similar quoting edge case slips through to production. Which iterative-refinement change would most reliably prevent a recurrence, per the exam's guidance on test-driven iteration?
People also ask
How do you do test-driven development with Claude Code?
Why does Claude write the implementation before the tests?
Should you commit the tests before the implementation?
How do you stop Claude Code from editing tests to make them pass?
Watch and learn
Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.
Test-driven development (TDD) with Claude Code
Why watch: Shows the core loop of writing tests first and feeding failing test output back to Claude Code as concrete, machine-verifiable feedback to guide implementation.
More videos for this concept
References & primary sources
Master this concept with Archie
Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.