Multi Instance Review Pipeline Design

In short: A multi instance review pipeline is an end-to-end review system that combines three ideas: an independent review instance rather than self-review, a multi-pass structure of per-file and cross-file analysis, and confidence-based routing with calibrated thresholds that sends uncertain findings to humans. Designing one means evaluating how these components fit together to balance coverage, cost, and human effort.

What a multi instance review pipeline is

A multi instance review pipeline is the synthesis knowledge point of task statement 4.6: it takes the three ideas you have already met and asks you to assemble them into one working quality gate. The first building block is an independent review instance, a separate call that judges the work without the bias of having produced it. The second is a multi-pass structure, with per-file local passes for consistent depth and a cross-file integration pass for interactions. The third is confidence-based routing, where the reviewer attaches a confidence to each finding and a calibrated threshold decides which findings a human must see. A multi instance review pipeline is what you get when these three are wired together rather than used in isolation.

Because this is the capstone of the task statement, the exam tests it at the evaluate level. You are not asked merely to list the components; you are asked to judge a proposed pipeline, explain why a given design is weak, and defend a stronger one. That means understanding not just what each piece does but why it is there and what would break if it were removed. The skill being measured is architectural judgement under real constraints of cost, latency, and human time.

Multi instance review pipeline: An end-to-end review system that combines an independent review instance (not self-review), a per-file and cross-file multi-pass structure, and confidence-based routing with calibrated thresholds. The components are layered so that author bias, attention dilution, and human-attention allocation are each addressed deliberately.

Why each component is in the design

Good evaluation starts with knowing the job of every part. The independent instance exists because a model grading its own output rationalises its decisions; replacing self-review with a fresh instance removes that author bias at the structural level. The multi-pass structure exists because a single sweep over many files dilutes attention and reviews later files shallowly; per-file passes restore uniform depth and the cross-file pass recovers the interactions that no single-file view can show. Confidence routing exists because not every finding deserves a human, and human attention is the most expensive resource in the loop; routing concentrates it on the findings the model is least sure about, with the threshold calibrated against labelled data so the score actually means something.

The components are complementary rather than redundant, and that is the heart of the design argument. Independence without multi-pass still misses interaction bugs on large changes. Multi-pass without independence still suffers from a reviewer that defends its own prior work if the same instance produced the code. Either one without confidence routing is left choosing between flooding humans with trivially correct findings and trusting the automation blindly. Only the combination addresses all three failure modes at once, which is why a strong answer can name the specific weakness each component removes.

Evaluating trade-offs in the pipeline

Designing the pipeline is an exercise in balancing competing goods. More passes and an independent instance cost more inference and add latency, so on a tiny change you might collapse to a single independent pass, while on a sprawling refactor the full per-file and cross-file structure earns its keep. The routing threshold trades coverage of errors against human workload. Treating the threshold as the minimum confidence a finding needs to skip human review, raising it escalates more borderline findings and catches more mistakes, while lowering it automates more and moves faster. Precision and recall pull against each other here too, and the right balance depends on the stakes of a missed defect versus the cost of a false alarm.
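
The threshold trade-off can be made concrete with a small sweep. The sketch below is illustrative only: the labelled history is invented data, the names `LabelledFinding` and `sweep` are hypothetical, and it assumes that a finding whose confidence falls below the threshold goes to a human while everything else is auto-accepted.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class LabelledFinding:
    confidence: float  # reviewer's reported confidence, 0.0 to 1.0
    correct: bool      # human label: was the finding actually right?


def sweep(findings: Sequence[LabelledFinding],
          thresholds: Sequence[float]) -> List[Tuple[float, int, int]]:
    """For each threshold, return (threshold, human queue size, wrong findings auto-accepted)."""
    rows = []
    for t in thresholds:
        queued = [f for f in findings if f.confidence < t]
        auto = [f for f in findings if f.confidence >= t]
        slipped = sum(1 for f in auto if not f.correct)
        rows.append((t, len(queued), slipped))
    return rows


# Hypothetical labelled history, invented for illustration.
HISTORY = [
    LabelledFinding(0.95, True), LabelledFinding(0.90, True),
    LabelledFinding(0.85, False), LabelledFinding(0.70, True),
    LabelledFinding(0.60, False), LabelledFinding(0.40, False),
]

for t, queued, slipped in sweep(HISTORY, [0.5, 0.8, 0.9]):
    print(f"threshold={t:.2f} human_queue={queued} wrong_auto_accepted={slipped}")
```

On this toy data, each step up in the threshold grows the human queue and shrinks the number of wrong findings that slip through, which is the coverage-versus-workload dial in miniature.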

There is no single optimal pipeline; there is the pipeline that fits the risk profile of what is being reviewed. A security-sensitive service warrants more human routing and tighter thresholds; a low-risk internal tool can automate aggressively. Evaluating a design therefore means asking what failure would cost, how much human time is available, and how much latency the workflow can absorb, and then justifying the component choices against those answers. That reasoning, rather than a fixed recipe, is what the exam is probing.

independent: removes author bias from review

multi-pass: removes attention dilution on large changes

routing: allocates scarce human review by confidence

The assembled pipeline

The pieces compose into a clear flow. Code arrives, an independent instance reviews it through per-file and cross-file passes, each finding carries a calibrated confidence, and the router splits findings into auto-accepted and human-queued. The human decisions both finalise the uncertain cases and generate labelled data that keeps the confidence threshold honest over time.

Figure: A multi instance review pipeline. Independence, multi-pass depth, and confidence routing combine into one quality gate with a human-feedback loop.
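
The flow can also be sketched in code. This is a minimal sketch under stated assumptions, not a reference implementation: `Finding`, `run_pipeline`, and `stub_reviewer` are hypothetical names, the reviewer is a stub standing in for a separate model instance, and the confidence values are made up rather than calibrated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Finding:
    location: str      # file path, or "cross-file" for integration findings
    message: str
    confidence: float  # assumed to be calibrated, in [0, 1]


# A reviewer takes (scope label, text to review) and returns findings.
Reviewer = Callable[[str, str], List[Finding]]


def run_pipeline(files: Dict[str, str], review: Reviewer,
                 threshold: float) -> Tuple[List[Finding], List[Finding]]:
    """Per-file passes, then one cross-file pass, then confidence routing."""
    findings: List[Finding] = []
    for path, diff in files.items():                 # local passes, one file at a time
        findings.extend(review(path, diff))
    combined = "\n".join(f"# {p}\n{d}" for p, d in files.items())
    findings.extend(review("cross-file", combined))  # integration pass
    auto = [f for f in findings if f.confidence >= threshold]
    human = [f for f in findings if f.confidence < threshold]
    return auto, human


def stub_reviewer(scope: str, text: str) -> List[Finding]:
    """Hypothetical stand-in reviewer: flags TODO markers and one cross-file mismatch."""
    if scope == "cross-file":
        if "def load(" in text and "load_all(" in text:
            return [Finding(scope, "caller uses load_all but only load is defined", 0.55)]
        return []
    return [Finding(scope, "unresolved TODO", 0.92)] if "TODO" in text else []


files = {
    "a.py": "def load(path): ...  # TODO validate",
    "b.py": "rows = load_all(path)",
}
auto, human = run_pipeline(files, stub_reviewer, threshold=0.8)
print("auto-accepted:", [f.location for f in auto])
print("human queue:", [f.location for f in human])
```

The mismatch between `a.py` and `b.py` is invisible to either per-file pass and only surfaces in the cross-file pass, where its lower confidence sends it to the human queue.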

Running it inside Claude Code and CI

In practice this pipeline often lives in continuous integration. Claude Code can run as part of a CI workflow, for example as a GitHub Action that triggers on each pull request, reviews the diff, and posts its findings as review comments. That gives you the execution surface; the architecture is what you build on top of it. A naive setup asks one invocation to write and approve a change, which is the self-review anti-pattern wearing a CI costume. A designed pipeline instead uses a review invocation that is independent of whatever produced the code, structures the review into passes, and gates merge on the routed outcome so that low-confidence findings block until a human signs off.
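
One way to express the merge gate is a small script that a CI step runs after the review. The sketch below is an assumption about how such a gate might look, not part of Claude Code or any GitHub Action: `RoutedFinding` and `gate_exit_code` are hypothetical names, and it relies only on the common convention that a CI job fails when a step exits with a non-zero status.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RoutedFinding:
    finding_id: str
    confidence: float
    human_resolved: bool = False  # set once a reviewer signs off


def gate_exit_code(findings: Iterable[RoutedFinding], threshold: float) -> int:
    """Return 0 to allow merge, 1 to block it.

    A finding blocks merge when its confidence is below the threshold
    and no human has resolved it yet.
    """
    blocking = [f for f in findings
                if f.confidence < threshold and not f.human_resolved]
    for f in blocking:
        print(f"BLOCKED: {f.finding_id} (confidence {f.confidence:.2f}) awaits human review")
    return 1 if blocking else 0


# Hypothetical findings: one confident, one uncertain but already signed off.
status = gate_exit_code(
    [RoutedFinding("F1", 0.95), RoutedFinding("F2", 0.40, human_resolved=True)],
    threshold=0.8,
)
print("exit status:", status)
```

In a real workflow the script would end with `sys.exit(status)` so the job result reflects the gate.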

Because the CI context makes cost and latency visible, it is a natural place to reason about the trade-offs. You can decide that the full multi-pass structure runs only on pull requests above a certain size, that the human queue has a service-level target, and that recalibration runs on a schedule using the labels CI reviewers produce. The exam likes Scenario 5 precisely because it forces these concrete decisions, and a confident answer ties each pipeline component back to the failure it prevents.

Implementing fresh-context review in Claude Code

The independence the pipeline depends on has a concrete realisation in Claude Code. Anthropic's best-practices guidance treats a fresh-context review as a step you run before considering work done, and it ships a bundled /code-review skill that reviews the current diff in a separate subagent and returns findings. Running the review in a subagent is the whole point: the reviewing context is isolated from the implementation thread, so the reviewer sees the work with less of the context contamination that makes a same-thread self-review rationalise its own choices.

That same guidance supplies the other instances a multi instance review pipeline draws on. Parallel sessions let several Claude instances work at once when you want to scale output or compare solutions, and subagents keep isolated research or review off the main conversation so it stays focused. A persistent CLAUDE.md, which Claude reads at the start of every conversation, carries the repository's standing rules and review criteria into each instance, so adding independence does not cost you consistency.

The guidance is also blunt about the gate at the end: always provide verification through tests, scripts, or screenshots, and if you cannot verify the work, do not ship it. When you write your own review prompt instead of using the bundled skill, you still have to specify what work to check, what plan to check it against, and what counts as a finding, so the reviewer's job is unambiguous rather than left to interpretation.
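
A hand-written review prompt can enforce those three requirements mechanically. The helper below is a hypothetical sketch: `build_review_prompt` and the section wording are assumptions made for illustration, not the text used by the bundled skill.

```python
from typing import Sequence


def build_review_prompt(diff: str, plan: str, finding_criteria: Sequence[str]) -> str:
    """Assemble a review prompt that states the work, the plan, and what counts as a finding."""
    if not diff.strip() or not plan.strip() or not finding_criteria:
        raise ValueError("diff, plan, and at least one finding criterion are required")
    criteria = "\n".join(f"- {c}" for c in finding_criteria)
    return (
        "You are reviewing a change you did not write.\n\n"
        f"WORK TO CHECK (diff):\n{diff}\n\n"
        f"PLAN TO CHECK IT AGAINST:\n{plan}\n\n"
        f"REPORT A FINDING ONLY WHEN:\n{criteria}\n\n"
        "For each finding give the file, the problem, and a confidence from 0 to 1."
    )


prompt = build_review_prompt(
    diff="+ retries = 3",
    plan="Add a bounded retry to the upload client",
    finding_criteria=["behaviour differs from the plan", "an error path is unhandled"],
)
print(prompt)
```

Refusing to build a prompt when any of the three inputs is missing keeps an underspecified review from running at all.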

How this is tested on the Claude Certified Architect exam

As the evaluate-level capstone of task statement 4.6, this knowledge point shows up as a design-judgement question. You might be shown a review pipeline and asked which weakness most undermines it, or given a goal (catch more interaction bugs, reduce human load, stop confident-but-wrong merges) and asked which component to add or tune. The strongest distractors are partial pipelines that fix one failure while leaving another, so you must reason about which weakness the scenario actually exhibits.

The reliable instinct is to check for all three properties. Is the reviewer independent of the author? Is the review structured into passes so depth holds on large changes? Are uncertain findings routed to humans on a calibrated threshold? A pipeline missing any one of these has a predictable failure, and naming that failure, then prescribing the specific component that closes it, is exactly the evaluation the exam rewards. The single most common wrong design, and the one to reject on sight, is a pipeline whose only quality gate is the model reviewing its own output.
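
The three-question check can be written down as a tiny checklist. This is only a study aid under assumed names (`PipelineDesign`, `predicted_failures`); the failure descriptions paraphrase the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PipelineDesign:
    reviewer_is_independent: bool  # reviewer is a separate instance from the author
    uses_multi_pass: bool          # per-file passes plus a cross-file pass
    routes_by_confidence: bool     # uncertain findings go to humans on a calibrated threshold


def predicted_failures(design: PipelineDesign) -> List[str]:
    """Name the failure each missing property predicts, paired with the matching fix."""
    failures = []
    if not design.reviewer_is_independent:
        failures.append("self-review bias: add an independent review instance")
    if not design.uses_multi_pass:
        failures.append("attention dilution on large changes: add per-file and cross-file passes")
    if not design.routes_by_confidence:
        failures.append("uncertain findings bypass humans: add calibrated confidence routing")
    return failures


# A partial pipeline: independent reviewer and routing, but a single sweep over large diffs.
partial = PipelineDesign(reviewer_is_independent=True,
                         uses_multi_pass=False,
                         routes_by_confidence=True)
for failure in predicted_failures(partial):
    print(failure)
```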

Worked example

A team proposes a CI review pipeline: one Claude invocation writes the change and, satisfied with it, approves its own diff in the same run; large pull requests are reviewed in a single sweep; everything the model approves is merged with no human step.

Evaluate it against the three properties. Independence fails first: the approving invocation is the author, so it inherits self-review bias and rubber-stamps its own work. Multi-pass fails next: large pull requests get one diluted sweep, so defects in the files changed last and mismatches that span files go unseen. Routing fails last: there is no confidence gate and no human queue, so even findings the model is unsure about are merged automatically. This pipeline has all three weaknesses at once, which is why defects reach production despite a review step existing on paper.

Now redesign it. Separate the reviewer from the author so an independent instance judges the diff. Structure that review into per-file passes for uniform depth plus a cross-file pass for interactions. Have the reviewer attach a confidence to each finding, calibrate a threshold against a labelled set of past pull requests, and block merge on any low-confidence finding until a human resolves it. Each change targets one named failure: independence removes the rubber stamp, multi-pass restores depth on big changes, and routing stops uncertain findings from auto-merging while keeping the human queue small.
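
Calibrating the threshold against labelled pull requests could look like the following. It is one simple approach among several, sketched under assumptions: the labelled pairs are invented, `calibrate_threshold` is a hypothetical helper, and the rule used here (pick the lowest threshold whose auto-accepted findings reach a target precision) is an illustrative choice rather than a method the source prescribes.

```python
from typing import List, Optional, Tuple


def calibrate_threshold(labelled: List[Tuple[float, bool]],
                        target_precision: float) -> Optional[float]:
    """Return the lowest observed confidence usable as a threshold at the target precision.

    labelled: (confidence, was_correct) pairs taken from past human decisions.
    Returns None when no candidate threshold reaches the target.
    """
    candidates = sorted({conf for conf, _ in labelled})
    for t in candidates:  # ascending, so the first hit is the lowest threshold
        auto = [ok for conf, ok in labelled if conf >= t]
        precision = sum(auto) / len(auto)  # auto is never empty: t is an observed confidence
        if precision >= target_precision:
            return t
    return None


# Hypothetical labels from past pull requests, invented for illustration.
PAST = [(0.97, True), (0.91, True), (0.88, True), (0.76, False),
        (0.72, True), (0.55, False), (0.30, False)]

print(calibrate_threshold(PAST, target_precision=0.8))
print(calibrate_threshold(PAST, target_precision=0.9))
```

Choosing the lowest threshold that meets the target keeps the human queue as small as the precision requirement allows, and rerunning the function as new labels arrive is what keeps the threshold honest over time.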

The evaluation skill is not inventing a novel mechanism; it is recognising which of the three guarantees a proposed pipeline is missing and prescribing the matching fix. A pipeline that satisfies all three is trustworthy; one that satisfies two has a predictable hole, and you should be able to point straight at it.

Sizing the pipeline to the change

A strong design does not apply the same heavyweight pipeline to every input; it scales the structure to the stakes and size of the change. For a one-line fix, a single independent pass with a confidence check may be the whole pipeline, because multi-pass structure adds cost without much benefit on a trivial diff. For a large refactor touching many files, the full per-file and cross-file structure earns its keep, and the routing threshold may be tightened so more findings reach a human.
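
A sizing rule of this kind might be encoded as a small configuration function. Everything specific in the sketch is a placeholder assumption: the file-count cut-off, the two threshold values, and the names `PipelineConfig` and `size_pipeline` are invented for illustration, and real values would come from calibration on the team's own labelled data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    per_file_passes: bool
    cross_file_pass: bool
    threshold: float  # minimum confidence for auto-accept; higher sends more to humans


def size_pipeline(files_changed: int, high_risk: bool) -> PipelineConfig:
    """Pick a pipeline shape from change size and risk. All cut-offs are placeholders."""
    large = files_changed > 3                         # hypothetical size cut-off
    threshold = 0.9 if (high_risk or large) else 0.7  # hypothetical values, not calibrated
    # The independent reviewer is assumed to run in every configuration.
    return PipelineConfig(per_file_passes=large, cross_file_pass=large, threshold=threshold)


print(size_pipeline(files_changed=1, high_risk=False))   # one-line fix in a low-risk tool
print(size_pipeline(files_changed=40, high_risk=False))  # large refactor
print(size_pipeline(files_changed=1, high_risk=True))    # small change to sensitive code
```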

The evaluate-level skill is matching the pipeline to the situation: reading what is being reviewed, judging the cost of a missed defect against the cost of inference and human time, and turning the dials accordingly. A reviewer of payment code and a reviewer of an internal logging tweak should not run the same pipeline, and being able to justify why is exactly the architectural judgement the exam is measuring. The components stay the same; what changes is how much of each you spend, and that allocation is the design decision the question is really asking about.

Misconceptions to avoid

Misconception: A review pipeline is sound as long as a capable model reviews the code before merge, even if that model also wrote it.

What's actually true: If the reviewer is the author, the pipeline has no independence and inherits self-review bias regardless of model strength. A trustworthy pipeline reviews with a separate instance and adds multi-pass depth and confidence routing on top.

Misconception: Once you have an independent reviewer, multi-pass structure and confidence routing are optional extras.

What's actually true: They address different failures. Independence alone still dilutes attention on large changes and still floods or starves human review. The components are complementary, so omitting one leaves a specific, predictable gap in coverage or human oversight.

Check your understanding

You are asked to evaluate a CI review pipeline where a single Claude invocation both writes and approves each change, large diffs are reviewed in one sweep, and approved diffs merge with no human step. Which redesign best addresses the pipeline's weaknesses?
