Iterative Refinement in Multi-Agent Systems

In short: Iterative refinement is a coordinator pattern where, after subagents return, the coordinator evaluates the synthesised output for gaps in coverage or quality and re-delegates targeted follow-up tasks to fill them. It repeats until coverage is sufficient, rather than accepting a single-pass result.

What iterative refinement adds

Iterative refinement is the discipline of not trusting the first draft. In a multi-agent system, the coordinator gathers subagent outputs and synthesises an answer, but instead of shipping that synthesis immediately, it evaluates the result against the goal, looks for gaps in coverage or quality, and re-delegates targeted follow-up tasks to close them. It loops, improving the answer each pass, until coverage is good enough to finish. Anthropic's research system works this way: the lead agent "synthesizes these results and decides whether more research is needed, if so, it can create additional subagents or refine its strategy."

This knowledge point is rated "apply", so the exam expects you to recognise where iterative refinement belongs and to design a coordinator loop that uses it. It is the quality-assurance counterpart to good decomposition: decomposition tries to assign everything up front, and iterative refinement catches whatever the first pass still missed.

Iterative refinement: A coordinator loop that evaluates synthesised multi-agent output for gaps in coverage or quality, re-delegates targeted follow-ups to fill them, and repeats until the result is sufficient before finalising.

Why single-pass synthesis falls short

The alternative to iterative refinement is single-pass synthesis: run the subagents once, stitch their outputs together, and call it done. It is tempting because it is simple and fast, but it is brittle. A first pass often surfaces partial coverage, uneven depth, or a contradiction between two subagents, and a single-pass coordinator has no step that would ever notice. It ships whatever came back.

The deeper reason is that quality in a multi-agent system is emergent and hard to predict before you see the assembled result. You cannot always know in advance which facet will come back thin. Iterative refinement turns that uncertainty into a managed loop: produce a draft, inspect it for what it lacks, and fix exactly that. The willingness to look at the draft critically is precisely what single-pass designs skip.

Put plainly, single-pass synthesis optimises for speed and simplicity at the cost of reliability, and that trade is often a bad one for anything a user will depend on. The first assembly of several subagents' work is a draft, not a finished product, and treating it as finished is how confident-but-incomplete answers reach production. Iterative refinement is the modest discipline of giving that draft a second look before it ships, and on most non-trivial tasks, that second look is where a noticeable share of the final quality comes from.

How re-delegation stays targeted

A crucial subtlety is that re-delegation is surgical, not wholesale. When the coordinator spots a gap (say, the synthesis covers benefits thoroughly but barely touches risks), it does not rerun the entire pipeline. It issues a focused follow-up: a subagent invocation aimed at exactly the missing piece, with the relevant context passed in explicitly, because context isolation means the subagent knows only what the coordinator hands it. The new findings are folded into the synthesis, and the coordinator evaluates again.

Keeping re-delegation targeted is what makes the loop affordable. A full redo each iteration would be slow and wasteful; a precise follow-up costs little and converges quickly. The coordinator is effectively doing editorial review: identify the specific shortfall, commission exactly that, integrate, and repeat, rather than starting from scratch every round.
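As a rough illustration of a targeted follow-up, the sketch below builds a brief for one gap and passes in only the context that subagent needs. The `Gap` and `FollowUpTask` types and the `build_follow_up` helper are hypothetical names invented for this example; they are not part of any SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Gap:
    """One specific shortfall found in the draft (hypothetical type)."""
    facet: str     # e.g. "risks"
    problem: str   # e.g. "barely mentioned"


@dataclass
class FollowUpTask:
    """A focused brief for a single subagent invocation (hypothetical type)."""
    instruction: str
    context: dict = field(default_factory=dict)


def build_follow_up(gap: Gap, goal: str, known_findings: dict) -> FollowUpTask:
    """Commission only the missing piece.

    Subagents are context-isolated, so everything relevant is passed in
    explicitly rather than assumed to be remembered.
    """
    instruction = (
        f"Research only the '{gap.facet}' facet of this goal: {goal}. "
        f"The current draft is weak here: {gap.problem}."
    )
    context = {
        "goal": goal,
        # Tell the subagent what is already covered so it does not redo it.
        "already_covered": sorted(k for k in known_findings if k != gap.facet),
    }
    return FollowUpTask(instruction=instruction, context=context)


task = build_follow_up(
    Gap(facet="risks", problem="barely mentioned"),
    goal="Assess adopting tool X",
    known_findings={"benefits": "covered in depth", "risks": "one sentence"},
)
print(task.instruction)
print(task.context["already_covered"])
```

The point of the design is that the brief names one facet and one problem; nothing in it asks for the rest of the report to be regenerated.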

The evaluate-and-re-delegate loop

Each loop inspects the draft and commissions only the missing piece, converging on a complete answer.
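The loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: `find_gaps` is a crude length check standing in for the coordinator's real judgment, and `stub_subagent` is a fake that stands in for an actual subagent call. All names here are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Subagent = Callable[[str, str], str]  # (goal, instruction) -> findings text


def find_gaps(findings: Dict[str, str], required: List[str], min_chars: int = 40) -> List[str]:
    """Stand-in for evaluation: a facet is a gap if it is missing or too thin."""
    return [f for f in required if len(findings.get(f, "").strip()) < min_chars]


def refine(goal: str, required: List[str], subagent: Subagent,
           max_passes: int = 3) -> Tuple[Dict[str, str], int]:
    # First pass: delegate every facet once.
    findings = {f: subagent(goal, f"Cover {f}.") for f in required}
    passes = 1
    while passes < max_passes:
        gaps = find_gaps(findings, required)
        if not gaps:
            break  # coverage is sufficient, so finalise
        for facet in gaps:
            # Re-delegate only the thin facets, never the whole set.
            findings[facet] = subagent(goal, f"Go deeper on {facet} only; the draft is thin there.")
        passes += 1
    return findings, passes


calls: List[str] = []  # records every instruction sent, to show what was re-run


def stub_subagent(goal: str, instruction: str) -> str:
    """Fake subagent: thin on pricing at first, detailed when asked to go deeper."""
    calls.append(instruction)
    if instruction.startswith("Go deeper"):
        return "Follow-up findings with concrete detail. " * 2
    if "pricing" in instruction:
        return "Pricing varies."
    return "Thorough first-pass findings for this facet. " * 2


findings, passes = refine("Competitive analysis", ["features", "pricing"], stub_subagent)
print(passes)
print(calls)
```

In this run the stub is called three times: once per facet on the first pass, then once more for pricing alone.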

Evaluation is the step that makes it work

The pivot of the whole pattern is the evaluation step, the moment the coordinator stops and judges its own draft. Everything else is mechanics; this is the judgment. A coordinator that re-delegates blindly is no better than one that never re-delegates, because without a clear-eyed look at what the draft lacks, it cannot aim the follow-up. So the quality of the loop rises and falls with the quality of the coordinator's self-assessment against the original goal.

Good evaluation asks concrete questions: does the draft address every facet the request implied, is each facet covered to adequate depth, and do any two parts contradict each other? Those checks turn a vague sense of "this could be better" into a specific list of gaps, and a specific gap is something a targeted follow-up can fill. This is also why the coordinator, with its whole-task view, is the only agent positioned to run the evaluation: a single spoke sees only its slice and cannot judge the assembled whole.
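The three checks above can be sketched as a function that returns a concrete list of gaps. This is an illustration only: the `Finding` type, the word-count threshold, and the idea of comparing keyed `claims` to detect contradictions are all assumptions made for the example. In practice the coordinator model would usually make these judgments itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Finding:
    """One subagent's output for one facet (hypothetical type)."""
    facet: str
    text: str
    claims: Dict[str, str] = field(default_factory=dict)  # keyed facts, e.g. {"acme_seats": "5"}


def evaluate(findings: List[Finding], required_facets: List[str], min_words: int = 30) -> List[str]:
    """Return a specific list of gaps: missing facets, thin facets, contradictions."""
    gaps: List[str] = []
    by_facet = {f.facet: f for f in findings}

    # Checks 1 and 2: is every facet present, and covered to adequate depth?
    for facet in required_facets:
        found = by_facet.get(facet)
        if found is None:
            gaps.append(f"missing facet: {facet}")
        elif len(found.text.split()) < min_words:
            gaps.append(f"thin coverage: {facet}")

    # Check 3: do any two findings assert different values for the same claim?
    seen: Dict[str, tuple] = {}
    for f in findings:
        for key, value in f.claims.items():
            if key in seen and seen[key][1] != value:
                gaps.append(
                    f"contradiction on {key}: {seen[key][0]} says {seen[key][1]!r}, "
                    f"{f.facet} says {value!r}"
                )
            seen.setdefault(key, (f.facet, value))
    return gaps


draft = [
    Finding("features", "word " * 40, {"acme_seats": "unlimited"}),
    Finding("pricing", "Pricing varies by plan.", {"acme_seats": "5"}),
]
gaps = evaluate(draft, ["features", "pricing", "risks"])
for gap in gaps:
    print(gap)
```

Each entry in the returned list names one shortfall, which is exactly the shape a targeted follow-up needs.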

Cost, latency, and the limits of refining

Iterative refinement is powerful but not free, and an applied architect weighs its cost. Each pass adds at least one more round of subagent work and one more synthesis, which means more latency and more tokens. For a high-value report the trade is clearly worth it; for a quick, low-stakes answer a single pass may be the right call. The skill is matching the depth of refinement to what the task actually warrants rather than reflexively looping on everything.

There is also a point of diminishing returns. The first refinement pass usually closes the obvious gaps and delivers most of the benefit; the second catches subtler ones; beyond that, additional passes tend to polish rather than meaningfully improve. A sensible coordinator therefore treats refinement as a small, bounded loop: enough passes to reach sufficiency, with a cap to prevent an endless chase for marginal gains. Knowing where to stop is as much a part of the pattern as knowing how to refine, which is the subject of the next section.

Knowing when to stop

Iterative refinement needs a termination condition, or it never finishes. The loop continues while meaningful gaps remain and stops when coverage and quality are sufficient for the goal. A practical coordinator judges "sufficient" against the original requirements: have all the important facets been addressed to adequate depth, and are there no glaring contradictions? When the answer is yes, it finalises.

This mirrors the broader principle from single-agent loops, where stopping should be driven by a real signal rather than an arbitrary count. Here the signal is coverage sufficiency rather than stop_reason, but the lesson rhymes: stop because the work is genuinely done, and keep a sensible cap on iterations only as a backstop against an endless polish loop. Most tasks converge in a small number of passes.
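One way to express this stopping rule is a small decision function that checks the real signal first and the cap second. The `Decision` enum and `next_step` name are hypothetical, and the default cap of three passes is an arbitrary placeholder rather than a recommended value.

```python
from enum import Enum
from typing import List


class Decision(Enum):
    CONTINUE = "continue"        # meaningful gaps remain and budget is left
    SUFFICIENT = "sufficient"    # the real stop signal: nothing left to fix
    CAP_REACHED = "cap_reached"  # backstop only: gaps remain but the budget is spent


def next_step(open_gaps: List[str], passes_done: int, max_passes: int = 3) -> Decision:
    # Coverage sufficiency is checked first, so the cap never pre-empts a clean finish.
    if not open_gaps:
        return Decision.SUFFICIENT
    if passes_done >= max_passes:
        return Decision.CAP_REACHED
    return Decision.CONTINUE


print(next_step([], passes_done=1))
print(next_step(["thin coverage: pricing"], passes_done=1))
print(next_step(["thin coverage: pricing"], passes_done=3))
```

Keeping `SUFFICIENT` and `CAP_REACHED` distinct lets the caller tell a finished answer from one that was cut off with known gaps, so the latter can be flagged rather than silently shipped.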

evaluate: inspect the synthesis for gaps
re-delegate: targeted follow-up only
repeat: until coverage is sufficient

Worked example: refining a competitive analysis

A coordinator assembles a competitive analysis from several subagents. The first synthesis is strong on features but vague on pricing.

The subagents return, the coordinator stitches together a draft, and then, rather than shipping, it evaluates the draft against the brief. The feature comparison is thorough, but the pricing section is a single hand-wavy sentence and one competitor's pricing is missing entirely. That is a quality gap the first pass did not resolve.

Instead of rerunning every subagent, the coordinator issues one targeted follow-up: a pricing-focused subagent invocation, passed the list of competitors and the specific instruction to gather current pricing tiers. The follow-up returns concrete numbers, which the coordinator folds into the synthesis. It evaluates again, finds coverage now adequate, and finalises. Two passes, one surgical re-delegation, and a markedly better answer than a single-pass coordinator would have produced.

The economics of this episode are instructive. The targeted pricing follow-up was a fraction of the cost of the original multi-subagent run, yet it lifted the report from "mostly useful" to "complete." Had the coordinator instead rerun every subagent to fix one thin section, it would have paid for a full second round to repair a small gap. And had it shipped the first draft unexamined, it would have delivered a report that quietly failed the brief on pricing. The evaluate-then-target loop captured most of the upside for a small marginal cost, which is exactly why it is the pattern the exam favours for improving reliability without ballooning latency.
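The arithmetic behind that comparison can be made concrete. The token figures below are invented placeholders chosen only to make the sums easy to follow; they are not measurements from any real system, and `run_cost` is a hypothetical helper.

```python
def run_cost(subagent_calls: int, tokens_per_call: int, synthesis_tokens: int) -> int:
    """Tokens for one round: every subagent call plus one synthesis."""
    return subagent_calls * tokens_per_call + synthesis_tokens


# Placeholder figures (assumptions, not measurements).
TOKENS_PER_CALL = 20_000
SYNTHESIS_TOKENS = 10_000

first_pass = run_cost(4, TOKENS_PER_CALL, SYNTHESIS_TOKENS)          # four subagents
targeted_follow_up = run_cost(1, TOKENS_PER_CALL, SYNTHESIS_TOKENS)  # pricing only
full_rerun = run_cost(4, TOKENS_PER_CALL, SYNTHESIS_TOKENS)          # everything again

total_with_targeting = first_pass + targeted_follow_up
total_with_rerun = first_pass + full_rerun

print(first_pass, targeted_follow_up, full_rerun)
print(total_with_targeting, total_with_rerun)
```

Under these assumed numbers, the targeted follow-up costs a third of a full round, and the whole job costs 120,000 tokens instead of 180,000.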

Misconceptions to correct

Misconception: Once the subagents return their findings, the coordinator should synthesise once and deliver the answer.

What's actually true: Single-pass synthesis frequently misses gaps in coverage or quality. A robust coordinator evaluates the draft, re-delegates targeted follow-ups to fill specific gaps, and only finalises once coverage is sufficient.

Misconception: To refine a multi-agent answer, the coordinator should rerun the whole pipeline from the start.

What's actually true: Re-delegation should be targeted, not wholesale. The coordinator commissions only the specific missing piece and integrates it, which converges quickly and cheaply. Rerunning everything each iteration is slow and unnecessary.

Refinement complements parallel breadth

A common confusion is to treat "run more subagents in parallel" and "add iterative refinement" as competing answers to the same problem. They address different things. Parallel breadth, dispatching several subagents at once, improves how much ground the first pass covers. Iterative refinement improves how well the assembled draft is checked and completed after that first pass. A system can have excellent parallel breadth and still ship gaps, because breadth without an evaluation step never inspects the synthesis.

The strongest designs use both: wide parallel coverage to gather a rich first draft quickly, then a refinement loop to evaluate that draft and patch what slipped through. On the exam, watch for a question that offers "add more parallel subagents" as a tempting answer to a reliability problem that is really about unexamined gaps. More breadth helps the first pass; only refinement closes the loop. Recognising which lever a scenario actually needs is the applied judgment being tested.
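A small sketch can show the two levers working together: `asyncio.gather` supplies the parallel breadth, and a separate inspection step supplies the refinement. The `stub_subagent` coroutine is a stand-in invented for this example, and it deliberately returns nothing for one facet on the first pass to simulate a gap.

```python
import asyncio
from typing import Dict, List, Tuple


async def stub_subagent(facet: str, deep: bool = False) -> str:
    """Fake subagent; a real one would make a slow model call here."""
    await asyncio.sleep(0)
    if facet == "risks" and not deep:
        return ""  # simulate a facet that comes back empty on the first pass
    return f"findings on {facet}"


async def run(facets: List[str]) -> Tuple[Dict[str, str], List[str]]:
    # Breadth: dispatch every facet at once for a wide first draft.
    results = await asyncio.gather(*(stub_subagent(f) for f in facets))
    draft = dict(zip(facets, results))

    # Refinement: inspect the assembled draft, then re-delegate only the gaps.
    gaps = [facet for facet, text in draft.items() if not text]
    follow_ups = await asyncio.gather(*(stub_subagent(f, deep=True) for f in gaps))
    draft.update(zip(gaps, follow_ups))
    return draft, gaps


draft, gaps = asyncio.run(run(["benefits", "risks", "costs"]))
print(gaps)
print(draft)
```

Adding more facets to the first `gather` would widen the draft, but only the inspection step between the two calls notices that `risks` came back empty.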

Refinement is stateful by design, not automatic

A practical detail trips up architects who assume the platform remembers things for them: iterative refinement is stateful at the system level, not at the API call level. In the Claude Agent SDK, a one-shot query() is a single-turn interaction with no conversation memory, so a fresh call does not implicitly know what an earlier pass produced. The refinement loop works only because the application explicitly carries prior findings, constraints, and identified gaps forward into each new turn or each new subagent invocation, exactly as context isolation already requires.

The takeaway is that iterative refinement is a system-design pattern you build, not a parameter you switch on. There is no flag that makes a coordinator re-examine its own draft; you implement the evaluate-and-re-delegate loop yourself, and you persist the state it depends on, whether in the ongoing conversation, an external file, or a structured artifact. Treating refinement as something the model does for free is the mistake; treating it as an explicit loop over preserved state is the design the exam rewards.
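The sketch below shows one way to persist that state as a structured artifact and render it into the prompt for a fresh, memoryless call. It deliberately makes no SDK calls. `RefinementState`, `save`, `load`, and `follow_up_prompt` are hypothetical names, and JSON on disk is just one of the storage options mentioned above.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class RefinementState:
    """Everything the loop must carry between passes (hypothetical type)."""
    goal: str
    findings: Dict[str, str] = field(default_factory=dict)
    open_gaps: List[str] = field(default_factory=list)
    passes_done: int = 0


def save(state: RefinementState, path: str) -> None:
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(asdict(state), fh, indent=2)


def load(path: str) -> RefinementState:
    with open(path, encoding="utf-8") as fh:
        return RefinementState(**json.load(fh))


def follow_up_prompt(state: RefinementState, gap: str) -> str:
    """Render what a fresh call needs, since it remembers nothing by itself."""
    covered = "\n".join(f"- {facet}: {text}" for facet, text in state.findings.items())
    return (
        f"Goal: {state.goal}\n"
        f"Already established:\n{covered}\n"
        f"Fill this gap only: {gap}"
    )


state = RefinementState(
    goal="Competitive analysis",
    findings={"features": "thorough comparison"},
    open_gaps=["pricing"],
    passes_done=1,
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "state.json")
    save(state, path)
    restored = load(path)  # a later pass picks up where the earlier one stopped

prompt = follow_up_prompt(restored, restored.open_gaps[0])
print(prompt)
```

Whatever the storage medium, the follow-up prompt is built from the saved state and not from anything the model is assumed to remember.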

How this is tested on the exam

Task 1.2 questions present a multi-agent system whose answers are competent but slightly incomplete, and ask how to improve reliability. The right answer introduces an iterative refinement loop (evaluate the synthesis, then re-delegate for the gaps) rather than a one-shot synthesis or a brittle full rerun. Watch for distractors that propose adding more parallel subagents up front (helpful for breadth, but no substitute for inspecting the assembled draft) or that suggest accepting the first pass to save time. Because this knowledge point applies the coordinator's responsibilities to quality assurance, designing the evaluate-and-re-delegate loop is the applied skill being assessed.

Check your understanding

A multi-agent report is usually accurate but often omits a minor-but-important angle the team only notices after delivery. What design change best prevents this?
