Test-Driven Development With Claude Code

In short: Test-driven iteration means writing tests that define the expected behaviour before any implementation exists, running them so they fail, and handing that failing output to Claude Code as concrete feedback. Because a failing test is unambiguous and machine-verifiable, the model can iterate on the implementation until every test passes without a human re-checking each change.

What test-driven development with Claude Code actually means

Test-driven development with Claude Code is an iterative refinement technique: you write tests that encode the behaviour you want before the implementation exists, run them so they fail, and then hand that failing output to the model as the specification it must satisfy. The order is the whole point. You are not asking Claude to build something and then prove it works; you are defining "works" as a runnable check first, and letting that check drive the code into existence.

This matters because the model has a strong default tendency to do the opposite. Left to its own devices, Claude Code will usually write the implementation, then write tests that pass against the code it just produced. Those after-the-fact tests validate whatever was built rather than pinning down what should have been built. Genuine test-driven iteration inverts that habit on purpose, which is why doing it well is a deliberate, taught skill rather than something that happens automatically.

Test-driven iteration with failing tests: An iterative refinement workflow in which you author tests that define expected behaviour first, run them to confirm they fail, then give the failing output to Claude Code as concrete feedback so it iterates the implementation until the suite passes.

Why failing tests are the strongest feedback signal

Every refinement technique is, at heart, a way of telling Claude what to change. The quality of that signal determines how many rounds you need and how reliably the model converges on what you actually meant. A failing test is the highest-quality signal available because it is unambiguous, executable, and machine-verifiable. An assertion message such as "expected 90 but received 100" leaves no room for interpretation; the model knows precisely which input produced which wrong output and what the right output should have been.

Contrast that with a prose instruction such as "make it handle the edge cases properly." Prose has to be parsed, disambiguated, and turned back into concrete behaviour, and each of those steps is a chance for the model to guess differently from you. A test removes the guessing. It also lets the loop close on its own: Claude runs the suite, reads the result, and iterates until the check passes, instead of waiting for you to notice each mistake. Anthropic frames this as giving Claude "a way to verify its work," and notes that without a check it can run, "looks done" becomes the only signal and you become the verification loop. Failing tests turn that human bottleneck into an automated one.

Tests first: behaviour defined before any code.

Red to green: iterate until the check passes.

Do not edit tests: change the code, not the target.

Why this is tested on the Claude Certified Architect exam

Knowledge point 3.5.5 lives in Domain 3, Claude Code Configuration and Workflows, which carries 20 percent of the exam, and inside Task Statement 3.5, applying iterative refinement techniques for progressive improvement. It is assessed at Bloom level "apply," so questions will not ask you to recite a definition. They put you inside a situation, usually drawn from Scenario 2 (code generation with Claude Code) or Scenario 5 (Claude Code for continuous integration), and ask which refinement move best fits.

The architect-level insight the exam rewards is recognising when a problem is really an ambiguity problem in disguise. A developer who keeps re-prompting in prose and keeps getting subtly wrong code is fighting the wrong battle; the fix is to convert the desired behaviour into failing tests so the model gets verifiable feedback. The exam also leans on the related Domain 3 principle that an independent check catches more than self-assertion, and a committed test suite is exactly that kind of external, deterministic check. If you can spot, in a scenario, that test-driven iteration would replace several noisy prose rounds with one precise one, you are answering at the level this knowledge point targets.

The loop, step by step

Test-driven iteration is a small cycle you repeat. The structure below is what separates it from simply asking for tests at the end.

Figure: The red-to-green test-driven iteration loop. The tests are written and committed before any implementation, then the failing output drives iteration until the suite is green.

Write the tests before the implementation

State explicitly that you are doing TDD and that no implementation should be written yet. Specificity helps: name the file, the function, and the cases. "Write failing unit tests in pricing.test.ts covering 9, 10, 49, 50, 99, 100 and 101 units; do not write the implementation" gives the model far more to work with than "add some tests."
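As a rough illustration of what that prompt could yield, the sketch below encodes the seven quantities as a boundary table and runs them before any implementation exists. It assumes a unit price of 100 and the tier percentages used in the worked example later in this guide (5, 12 and 20 percent). The names BOUNDARY_CASES, runCases and notImplemented are hypothetical, and a plain runner stands in for a real test framework so the block runs on its own.

```typescript
// Hypothetical boundary table for applyBulkDiscount(quantity, unitPrice).
// Assumes a unit price of 100 and tiers of 5% (10+), 12% (50+) and 20% (100+).
type DiscountFn = (quantity: number, unitPrice: number) => number;

interface BoundaryCase {
  quantity: number;
  expected: number;
}

const UNIT_PRICE = 100;

const BOUNDARY_CASES: BoundaryCase[] = [
  { quantity: 9, expected: 900 },    // below the first tier: no discount
  { quantity: 10, expected: 950 },   // first tier starts here: 5% off
  { quantity: 49, expected: 4655 },  // still 5% off
  { quantity: 50, expected: 4400 },  // second tier starts here: 12% off
  { quantity: 99, expected: 8712 },  // still 12% off
  { quantity: 100, expected: 8000 }, // third tier starts here: 20% off
  { quantity: 101, expected: 8080 }, // still 20% off
];

// Runs every case and returns one message per case that does not pass.
function runCases(fn: DiscountFn): string[] {
  const failures: string[] = [];
  for (const { quantity, expected } of BOUNDARY_CASES) {
    try {
      const received = fn(quantity, UNIT_PRICE);
      if (received !== expected) {
        failures.push(`quantity ${quantity}: expected ${expected} but received ${received}`);
      }
    } catch (err) {
      failures.push(`quantity ${quantity}: threw ${(err as Error).message}`);
    }
  }
  return failures;
}

// No implementation exists yet, so a placeholder stands in for it.
const notImplemented: DiscountFn = () => {
  throw new Error("applyBulkDiscount is not implemented");
};

const redRun = runCases(notImplemented);
console.log(`${redRun.length} of ${BOUNDARY_CASES.length} cases failing`);
```

Every case fails at this stage, which is the red state the next step asks you to confirm. In a real project the same table would more likely live in pricing.test.ts as framework assertions.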

Confirm the tests fail for the right reason

Run the suite and check that it is red because the behaviour is missing, not because of a typo or an import error. A test that passes immediately, or fails for an unrelated reason, is not yet exercising the behaviour you care about. This step validates the tests themselves before you trust them to validate the code.

Hand the failing output back as feedback

Commit the tests, then give Claude the actual failing output and ask it to make the suite pass by changing only the implementation. The concrete assertions are the feedback; the model reads which inputs produced which wrong results and works from there.

Iterate to green without editing the tests

Let Claude run the suite, read the failures, and adjust the implementation in a tight loop. The one thing it must not do is alter the committed tests to force a pass. For unattended runs you can make this deterministic with a Stop hook that re-runs the suite and blocks the turn from ending until it is green.
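The decision such a Stop hook has to make can be sketched as a small pure function. This is an assumption-laden illustration, not the hook API itself: it assumes the convention described in the Claude Code hooks documentation that exit code 0 lets the turn end while exit code 2 blocks it and passes stderr back to the model, so check that against the current docs before relying on it. The names decideStop and HookResult are hypothetical.

```typescript
// Hypothetical decision logic for a Stop hook that keeps the turn open until tests pass.
// Assumed convention (verify against the current Claude Code hooks docs):
//   exit code 0 lets the turn end; exit code 2 blocks it and feeds stderr to the model.
interface HookResult {
  exitCode: 0 | 2;
  stderr: string;
}

function decideStop(testExitCode: number, testOutput: string, maxChars = 2000): HookResult {
  if (testExitCode === 0) {
    // Suite is green: allow the turn to end.
    return { exitCode: 0, stderr: "" };
  }
  // Suite is red: keep only the tail of the output so the feedback stays short.
  const tail = testOutput.slice(-maxChars);
  return {
    exitCode: 2,
    stderr: `Test suite is still failing. Fix the implementation, not the tests.\n${tail}`,
  };
}

const green = decideStop(0, "7 passed");
const red = decideStop(1, "quantity 50: expected 4400 but received 4750");
console.log(green.exitCode, red.exitCode); // 0 2
```

A real hook script would run the project's test command, pass its exit code and output to a function like this, write the message to stderr, and exit with the returned code. It would also need some guard against looping forever when the suite cannot be made to pass.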

Prose descriptions versus failing tests

The single exam trap attached to this knowledge point is describing expected behaviour in prose when test cases would be more precise. It is a tempting mistake because prose feels faster: you type a sentence instead of authoring assertions. But the saving is an illusion. Prose pushes the disambiguation work onto the model and gives you no automatic way to confirm the result, so you pay the cost back in extra correction rounds and in bugs that slip through an eyeball review.

A failing test front-loads that effort into a form the machine can check forever. Writing expect(applyBulkDiscount(50, 100)).toBe(4400) takes a moment longer than writing "discount kicks in at fifty units," but it is exact, it is repeatable, and it catches the off-by-one at the tier boundary that prose would have quietly waved through. On the exam, an option that proposes "describe the requirements in more detail" is almost always the distractor; the keyed answer encodes the requirements as a test.

A worked example: pinning down tier boundaries first

A developer uses Claude Code to implement applyBulkDiscount(quantity, unitPrice), where 10+ units get 5 percent off, 50+ get 12 percent off, and 100+ get 20 percent off, and wants the tier boundaries pinned down before any code exists.

Instead of writing a paragraph that says "apply tiered discounts at 10, 50, and 100 units," the developer asks Claude Code to write the tests first and nothing else: "Write failing unit tests in pricing.test.ts for applyBulkDiscount covering 9, 10, 49, 50, 99, 100 and 101 units. Do not write the implementation yet." Those boundary quantities are exactly where tiered logic tends to break, and a test makes each one explicit.

Claude creates the test file and the developer runs the suite. Every assertion errors because applyBulkDiscount does not exist yet, and that failure is the point: a red suite confirms the tests actually exercise the missing behaviour rather than passing against a stub. The developer commits the tests at this red stage, which fixes the target so it cannot quietly drift later.

Now the developer hands the failing output back: "Here is the test output: [paste]. Write only the implementation so all tests pass. Do not modify the tests." Claude reads the concrete assertions, such as expected 950 for quantity 10, and writes a function to satisfy them. It runs the suite, sees two boundary cases still red because it used > instead of >= at the tier edges, and iterates on its own, because the output tells it precisely what is wrong. When the suite is green, the developer reviews the diff to confirm that only pricing.ts changed and the committed tests are untouched.
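The off-by-one in that story can be reproduced in a few lines. The sketch below is illustrative only: it assumes a unit price of 100 and assumes the slip affected the 50 and 100 thresholds, which yields exactly two red cases. The names firstDraft, priceWithDiscount and failures are hypothetical.

```typescript
type DiscountFn = (quantity: number, unitPrice: number) => number;

// Integer percentages keep the arithmetic exact for whole-number prices.
function priceWithDiscount(quantity: number, unitPrice: number, percentOff: number): number {
  return (quantity * unitPrice * (100 - percentOff)) / 100;
}

// Hypothetical first draft: `>` at the 50 and 100 thresholds excludes the boundary itself.
const firstDraft: DiscountFn = (quantity, unitPrice) => {
  if (quantity > 100) return priceWithDiscount(quantity, unitPrice, 20);
  if (quantity > 50) return priceWithDiscount(quantity, unitPrice, 12);
  if (quantity >= 10) return priceWithDiscount(quantity, unitPrice, 5);
  return priceWithDiscount(quantity, unitPrice, 0);
};

// Corrected version: `>=` includes each boundary quantity in its tier.
const applyBulkDiscount: DiscountFn = (quantity, unitPrice) => {
  if (quantity >= 100) return priceWithDiscount(quantity, unitPrice, 20);
  if (quantity >= 50) return priceWithDiscount(quantity, unitPrice, 12);
  if (quantity >= 10) return priceWithDiscount(quantity, unitPrice, 5);
  return priceWithDiscount(quantity, unitPrice, 0);
};

// [quantity, expected total] pairs at a unit price of 100.
const cases: Array<[number, number]> = [
  [9, 900], [10, 950], [49, 4655], [50, 4400], [99, 8712], [100, 8000], [101, 8080],
];

// Returns one message per case whose result differs from the expected total.
function failures(fn: DiscountFn): string[] {
  return cases
    .filter(([quantity, expected]) => fn(quantity, 100) !== expected)
    .map(([quantity, expected]) =>
      `quantity ${quantity}: expected ${expected} but received ${fn(quantity, 100)}`);
}

console.log(failures(firstDraft));        // two messages: quantity 50 and quantity 100
console.log(failures(applyBulkDiscount)); // no messages: the suite is green
```

The two messages name the exact quantities and the wrong totals (4750 and 8800), which is all the model needs to trace the fault to the comparison operators.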

Contrast this with the prose-first path. Had the developer simply said "add tiered bulk discounts" and eyeballed the result, the off-by-one at quantity 50 would most likely have shipped. The failing tests turned an ambiguous request into a machine-verifiable contract, and the model closed the loop against that contract without a human re-checking every branch.

Test behaviour, not implementation details

A failing test is only as good as what it asserts, and the most common way test-driven iteration goes wrong is testing the wrong thing. Tests should pin down externally observable behaviour through the public interface, the inputs a caller provides and the outputs or effects they can see, rather than the internal mechanics of how the code achieves them. A test that asserts applyBulkDiscount(50, 100) returns 4400 describes a contract the implementation must honour. A test that asserts the function called a private helper twice describes an implementation detail that has nothing to do with whether the discount is correct.

The distinction matters because brittle tests undermine the whole loop. When tests are coupled to internal structure, every legitimate refactor breaks them even though behaviour is unchanged, so Claude Code spends iterations chasing failures that signal nothing real, and you lose the ability to let it reshape the implementation freely. There is also a perverse incentive: the easy way to satisfy an internals-focused test can be to preserve an awkward internal structure rather than improve the code. Tests written against the public contract leave the implementation free to change as long as the observable result stays correct, which is exactly what you want a refinement loop to optimise.

In practice this means writing assertions a caller would care about: return values, raised errors, persisted records, emitted events. Avoid reaching into private fields, mocking internal collaborators the public behaviour does not depend on, or asserting call counts on helpers. When the tests describe what the code does rather than how, the red-to-green loop drives Claude toward correct behaviour while leaving it room to find a clean implementation, and the suite keeps its value through later refactors instead of fighting them.
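To make the contrast concrete, here is a small sketch with two implementations that behave identically from a caller's point of view. All names are hypothetical, and the call counter is a stand-in for the kind of spy a mocking library would provide. The behaviour check survives the refactor; the internals check does not, even though nothing a caller can observe has changed.

```typescript
type DiscountFn = (quantity: number, unitPrice: number) => number;

// Counter standing in for a spy on a private helper.
let helperCalls = 0;

function tierPercent(quantity: number): number {
  helperCalls += 1;
  if (quantity >= 100) return 20;
  if (quantity >= 50) return 12;
  if (quantity >= 10) return 5;
  return 0;
}

// Version A routes through the helper.
const discountViaHelper: DiscountFn = (quantity, unitPrice) =>
  (quantity * unitPrice * (100 - tierPercent(quantity))) / 100;

// Version B is a refactor that uses a lookup table and never calls the helper.
const TIERS: Array<[number, number]> = [[100, 20], [50, 12], [10, 5]];

const discountViaTable: DiscountFn = (quantity, unitPrice) => {
  let percentOff = 0;
  for (const [minQuantity, percent] of TIERS) {
    if (quantity >= minQuantity) {
      percentOff = percent;
      break;
    }
  }
  return (quantity * unitPrice * (100 - percentOff)) / 100;
};

// Behaviour check: asserts only on what a caller can observe.
function behaviourHolds(fn: DiscountFn): boolean {
  return fn(50, 100) === 4400 && fn(9, 100) === 900;
}

// Internals check: asserts that the private helper ran exactly once.
function helperCalledOnce(fn: DiscountFn): boolean {
  helperCalls = 0;
  fn(50, 100);
  return helperCalls === 1;
}

console.log(behaviourHolds(discountViaHelper), behaviourHolds(discountViaTable));     // true true
console.log(helperCalledOnce(discountViaHelper), helperCalledOnce(discountViaTable)); // true false
```

The second line is the brittle case: the refactor is correct, yet the internals-focused check reports a failure that signals nothing real.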

Common misconceptions

Misconception: If I describe the expected behaviour clearly enough in prose, Claude Code does not need failing tests to get it right.
What's actually true: Prose is ambiguous and unverifiable, so the model has to guess and you have no automatic check on the result. A failing test encodes the exact expected behaviour as something Claude can run, giving it unambiguous, machine-verifiable feedback to iterate against. Describe behaviour as test cases, not paragraphs.

Misconception: Doing TDD just means asking Claude Code to add tests after it writes the function.
What's actually true: That is test-after, not test-driven. In genuine TDD you write the tests first, confirm they fail against the not-yet-written behaviour, and only then implement. Tests added afterwards validate whatever code already exists rather than driving its design, which is the value the exam cares about.

Misconception: When a test fails, the fastest fix is to let Claude adjust the test so the suite goes green.
What's actually true: Editing the test to match the implementation defeats the entire purpose. The implementation must change to satisfy the committed test, not the other way around. Keep tests under version control and review the diff so a quietly weakened test cannot pass unnoticed.
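Part of that diff review can be automated. The following is a minimal sketch, assuming the project's tests follow a *.test.* or *.spec.* naming convention; the function name modifiedTestFiles is hypothetical. It takes the list of paths changed since the tests were committed and reports any that look like test files.

```typescript
// Hypothetical guard: flag changed paths that look like test files.
// Assumes test files are named *.test.* or *.spec.* with a JS or TS extension.
function modifiedTestFiles(changedPaths: string[]): string[] {
  const testFilePattern = /\.(test|spec)\.[cm]?[jt]sx?$/;
  return changedPaths.filter((path) => testFilePattern.test(path));
}

// Only the implementation changed: nothing is flagged.
console.log(modifiedTestFiles(["src/pricing.ts"]));

// A committed test was touched: it is flagged for human review.
console.log(modifiedTestFiles(["src/pricing.ts", "src/pricing.test.ts"]));
```

The list of changed paths could come from a command such as git diff --name-only against the commit that introduced the tests. A filename check only shows that a test file was touched, not how, so a flagged file still needs a human to read the change.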

Where test-driven iteration sits among the refinement techniques

This knowledge point builds directly on the technique hierarchy for refinement, which is its hard prerequisite: you choose test-driven iteration deliberately when a task has verifiable, well-defined behaviour, rather than reaching for it by reflex. It also pairs with neighbouring techniques. Sharing a whole failing suite at once is a form of batch versus sequential feedback, and writing precise assertions is a concrete instance of example-based communication, where a worked example beats an abstract description. Knowing when to switch refinement techniques tells you when tests are the right tool and when a quick prose nudge will do.

The workflow rule that the tests must be written first and not edited is also a natural fit for project configuration. Encoding "do test-driven development, never modify committed tests" into shared, version-controlled guidance follows the same logic as the three-level configuration hierarchy: standards your whole team relies on belong in project-level config, not in one developer's personal setup.

How it shows up on the exam

Expect a scenario, not a definition check. A team is generating or fixing code with Claude Code, an approach keeps producing subtly wrong results, and you must choose the refinement move that most reliably resolves it. The keyed answer turns the ambiguous requirement into failing tests and feeds the output back; the distractors usually offer more prose detail, a sampling tweak such as temperature, or a self-review step that produces no verifiable signal. Hold onto the core idea: test-driven iteration replaces interpretation with an executable contract, and a failing test is the most precise feedback you can give the model.

Check your understanding

You are using Claude Code to fix a CSV parser that mishandles quoted fields containing commas. You describe the bug in prose, Claude produces a fix that looks right, but two weeks later a similar quoting edge case slips through to production. Which iterative-refinement change would most reliably prevent a recurrence, per the exam's guidance on test-driven iteration?
