Multi-Agent Error Propagation

In short: Multi-agent error propagation is the discipline of having each subagent recover from failures it can handle locally and pass only unresolvable errors up to the coordinator. When a subagent does propagate an error, it includes partial results and a record of what it already attempted, so the coordinator can decide without redoing the work.

What multi-agent error propagation means

In a coordinator-subagent system, multi-agent error propagation is the set of rules deciding which failures a subagent fixes itself, which it forwards to the coordinator, and what context it attaches when it does forward one. The design goal is a system that absorbs ordinary failures at the edges and surfaces to the hub only the problems that genuinely need a hub-level decision. Done well, the coordinator spends its attention on routing and judgement; done badly, it drowns in noise that a subagent should have handled.

The shape of multi-agent error propagation follows directly from the four error categories: transient, validation, business, and permission. A subagent can and should resolve transient failures locally by retrying with backoff, and it can fix its own validation errors. What it usually cannot resolve alone are business and permission failures, or a transient failure that persists past its retry budget. Those are the ones that earn a trip up the chain.
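The split between local handling and propagation can be sketched as a small decision function. This is an illustrative Python sketch rather than a prescribed implementation: the category names mirror the ones used in the text, while `Action`, `decide`, and the `retries_left` parameter are hypothetical names introduced here.

```python
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "transient"
    VALIDATION = "validation"
    BUSINESS = "business"
    PERMISSION = "permission"


class Action(Enum):
    RETRY_LOCALLY = "retry_locally"
    FIX_LOCALLY = "fix_locally"
    PROPAGATE = "propagate"


def decide(category: ErrorCategory, retries_left: int) -> Action:
    """Return what a subagent should do with a failure (hypothetical policy)."""
    if category is ErrorCategory.TRANSIENT:
        # Transient failures stay local until the retry budget runs out.
        return Action.RETRY_LOCALLY if retries_left > 0 else Action.PROPAGATE
    if category is ErrorCategory.VALIDATION:
        # The subagent produced the bad input, so it can correct it itself.
        return Action.FIX_LOCALLY
    # Business and permission failures need a decision above the subagent.
    return Action.PROPAGATE
```

Note that a transient failure changes sides once `retries_left` reaches zero, which is the "persists past its retry budget" case.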

Local recovery is the subagent's first job

The first principle is that recovery starts where the failure happened. A research subagent whose web-search tool times out should retry it, not immediately tell the coordinator "search failed." A synthesis subagent that hits a brief rate limit should back off and try again. Anthropic's own account of building a multi-agent research system describes exactly this blend: pairing the model's ability to adapt when a tool is failing with deterministic safeguards like retry logic and regular checkpoints, so the system handles issues gracefully rather than collapsing at the first error.
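As a rough illustration of the deterministic half of that blend, here is a minimal retry-with-backoff sketch in Python. It does not reproduce any particular system's code: `TransientError`, `retry_with_backoff`, and the default values are assumptions chosen for the example, and the `sleep` parameter is injectable only so the function can be exercised without real waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the subagent can propagate.
                raise
            # Wait base_delay * 1, 2, 4, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Only `TransientError` is retried here; any other exception passes straight through, which keeps deliberate refusals such as permission denials out of the retry loop.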

Local recovery matters because the subagent has the most context about its own task. It knows what it was trying to do, what it already retrieved, and whether another attempt is reasonable. The coordinator, by design, holds a higher-level view and lacks those local details. Pushing recovery down to where the context lives is both more efficient and more accurate than centralising it.

Propagate only what cannot be resolved

The second principle is restraint: forward only the errors you cannot fix. A subagent that propagates every transient hiccup turns the coordinator into a triage desk, forcing it to re-reason about failures the subagent was perfectly positioned to absorb. This is exactly the anti-pattern the exam targets: propagating every error to the coordinator without attempting local recovery first.

When a failure truly is unresolvable locally (a permission denial, a business rule, an outage that outlasts the retry budget), propagation is correct and necessary. The skill is in the threshold. Retryable and self-correctable failures stay local; deliberate refusals and exhausted retries go up. That threshold keeps the coordinator's queue small and meaningful.

Carry partial results and what was attempted

The third principle is about how you propagate. An error that arrives at the coordinator as a bare "I failed" forces the coordinator to start the subagent's work over. A well-formed propagated error instead carries two extra things: any partial results the subagent already gathered, and a record of what it attempted. Now the coordinator can decide intelligently among three options without discarding completed work: finish with what exists, reroute the missing piece to a different subagent, or escalate.
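One way to picture the coordinator's side of this is a function that turns a propagated error into a single decision. The sketch below is hypothetical throughout: the dictionary keys (`partial_results`, `attempted`, `needs`, `gap_is_blocking`), the `peers` capability map, and `coordinator_decide` itself are names invented for illustration.

```python
from typing import Any


def coordinator_decide(error: dict[str, Any], peers: dict[str, set[str]]) -> dict[str, Any]:
    """Pick one targeted action for a propagated error (illustrative policy)."""
    decision: dict[str, Any] = {
        "keep": list(error["partial_results"]),  # completed work is retained
        "avoid": list(error["attempted"]),       # do not repeat failed approaches
    }
    if not error["gap_is_blocking"]:
        # The partial results are enough to answer with.
        decision["action"] = "finish"
        return decision
    for name in sorted(peers):
        if error["needs"] in peers[name]:
            # A peer holds the capability the failing subagent lacked.
            decision["action"] = "reroute"
            decision["target"] = name
            return decision
    decision["action"] = "escalate"
    return decision
```

Whatever branch is taken, `keep` and `avoid` travel with the decision, so neither the gathered results nor the attempt log is lost.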

This is what makes propagation durable. Anthropic's research-system write-up stresses resuming from where an agent was when an error occurred rather than restarting from the beginning, because restarts are expensive and frustrating. Partial results plus an attempt log are precisely the breadcrumbs that make resumption, rather than restart, possible at the coordinator level.

Where recovery happens, and what travels up

Edges absorb the recoverable failures; the hub only sees what truly needs it, with context attached.

Worked example

A multi-agent research system has a coordinator dispatching three subagents to gather sources on a market-sizing question.

The first subagent searches an academic index. Its search call returns a 503 because the index is briefly overloaded. This is transient, so the subagent retries twice with backoff; the second retry succeeds. Nothing is propagated, because the subagent resolved it locally. The coordinator never even learns it happened.

The second subagent queries a paid data provider, but the credential it was given lacks access to the historical dataset, so the request returns a 403. This is a permission failure the subagent cannot fix on its own. So it propagates, but not as a bare failure. It sends up: the permission error, the partial results it did manage to collect from the free tier, and a note that it attempted the historical dataset and was denied. The coordinator can now reroute the historical-data request to a subagent that holds the right credential, while keeping the free-tier findings.

The third subagent finishes cleanly. When the coordinator assembles the final answer, it has two complete contributions and one partial one with a clear account of the gap. Because each subagent recovered what it could and propagated only the unresolvable failure with full context, the coordinator made one targeted reroute instead of restarting the whole research run. That is multi-agent error propagation working as designed.

Common misreadings to avoid

Misconception

Subagents should report every error to the coordinator so the coordinator has full visibility and stays in control.

What's actually true

Forwarding every transient blip floods the coordinator with noise it must triage and recreates work the subagent could fix itself. Subagents should recover retryable and self-correctable failures locally, and propagate only what they genuinely cannot resolve, so the coordinator's attention is reserved for real decisions.

Misconception

A propagated error just needs to say that the subtask failed; the coordinator can redo the work.

What's actually true

A bare failure forces the coordinator to restart the subagent's task from scratch. A propagated error should carry partial results and a record of what was attempted, so the coordinator can resume, reroute the missing piece, or escalate without discarding completed work.

The retry budget belongs to the subagent

A practical consequence of local-recovery-first is that the retry budget lives with the subagent, not the coordinator. The subagent is the actor that knows it just hit a 503, knows how many times it has already retried this specific call, and knows whether another attempt is reasonable given its remaining work. Centralising retry counting at the coordinator would force every transient blip to round-trip to the hub just to ask permission to try again, which is exactly the flooding the pattern is meant to avoid.

Keeping the budget local also keeps it bounded sensibly. Each subagent caps its own attempts for its own calls, so a system with several subagents does not multiply into an unbounded retry storm. When a subagent exhausts its budget on a particular call, that exhaustion is itself the trigger to propagate: the failure has graduated from "recoverable here" to "needs a decision above me." The budget, in other words, is not just a throttle; it is the dividing line between local recovery and escalation.
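A minimal sketch of a subagent-owned budget, assuming a per-call attempt counter, might look like the following. `RetryBudget`, `run_call`, and the result dictionary keys are hypothetical, the built-in `TimeoutError` stands in for any transient failure, and backoff delays are left out to keep the example short.

```python
from collections import defaultdict
from typing import Any, Callable


class RetryBudget:
    """Per-subagent cap on attempts for each distinct call (hypothetical helper)."""

    def __init__(self, max_attempts: int) -> None:
        self.max_attempts = max_attempts
        self._used = defaultdict(int)  # call_id -> attempts made so far

    def record_attempt(self, call_id: str) -> None:
        self._used[call_id] += 1

    def attempts(self, call_id: str) -> int:
        return self._used[call_id]

    def exhausted(self, call_id: str) -> bool:
        return self._used[call_id] >= self.max_attempts


def run_call(call_id: str, call: Callable[[], Any], budget: RetryBudget) -> dict[str, Any]:
    """Retry locally while budget remains; exhaustion is the trigger to propagate."""
    last_error = None
    while not budget.exhausted(call_id):
        budget.record_attempt(call_id)
        try:
            return {"status": "ok", "result": call()}
        except TimeoutError as exc:  # stand-in for a transient failure
            last_error = str(exc)
    return {
        "status": "propagate",
        "reason": "retry budget exhausted",
        "attempts": budget.attempts(call_id),
        "last_error": last_error,
    }
```

Because the counter is keyed by `call_id`, exhausting the budget for one call leaves the subagent free to keep working on its others.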

Propagation is not the same as failure

It is easy to read "propagate up" as "the subagent failed," but the two are different. A subagent that returns partial results plus a clear note about an unresolved gap has not failed; it has done as much of its job as it could and reported the remainder precisely. From the coordinator's perspective that is a useful, partially complete contribution, not a dead end. Framing propagation as a normal, expected outcome, rather than an exceptional crash, is what lets the system degrade gracefully instead of collapsing whenever any single piece cannot be finished.

This framing changes how you design the subagent's return. Instead of a binary success-or-failure, the subagent reports what it accomplished, what it could not, and why. The coordinator then composes a whole answer from a mix of complete and partial contributions. A research run where one of five sources was unreachable is still a successful run with a known, bounded gap, provided the gap was propagated honestly rather than hidden as an empty success.
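A non-binary return of this kind could be modelled roughly as below. The `SubagentReport` fields, the three status labels, and `compose` are all illustrative assumptions, not a defined schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubagentReport:
    source: str
    findings: list[str] = field(default_factory=list)  # what it accomplished
    unresolved: Optional[str] = None                    # what it could not do
    reason: Optional[str] = None                        # why

    @property
    def status(self) -> str:
        if self.unresolved is None:
            return "complete"
        # A gap with some findings is partial; a gap with none is blocked.
        return "partial" if self.findings else "blocked"


def compose(reports: list[SubagentReport]) -> dict[str, list[str]]:
    """Merge complete and partial contributions, listing gaps explicitly."""
    findings = [f for r in reports for f in r.findings]
    gaps = [f"{r.source}: {r.unresolved} ({r.reason})" for r in reports if r.unresolved]
    return {"findings": findings, "known_gaps": gaps}
```

The point of the `known_gaps` list is that an unreachable source shows up as a named gap in the final answer instead of disappearing into an empty success.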

Artifacts: when a subagent should bypass the coordinator

Not every subagent result has to travel back through the coordinator as conversational text. Anthropic's account of its multi-agent research system describes an artifact pattern: when a subagent produces something self-contained, such as a generated document, a dataset, or a block of code, it can write that output directly to an external store and return only a lightweight reference to the coordinator. The coordinator then passes the reference along instead of re-reading and re-summarising the whole payload.

The reason this belongs in a discussion of error propagation is that forcing every result through the hub creates a "game of telephone," where each hand-off risks distortion and adds cost. Each time a coordinator re-describes a subagent's work in its own words, detail can be lost or subtly altered. Artifact persistence sidesteps that: the authoritative output lives in the store unchanged, and only a pointer crosses the boundary, so there is nothing for an intermediary to garble.

For error handling specifically, this means a subagent that completed real work before hitting an unresolvable failure can persist that work as an artifact and propagate just the reference plus the unresolved error. The coordinator then has durable, faithful access to the partial results without the subagent having to inline a large payload into a message that might itself be truncated or summarised. Self-contained output goes to the store; only the decision-relevant signal goes up the chain.
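To make the pattern concrete, here is a small sketch in which a local directory stands in for the external store. Everything in it is an assumption made for illustration: the `ArtifactStore` class, the `artifact-<id>` reference format, and the `propagate_with_artifact` helper.

```python
import json
import tempfile
import uuid
from pathlib import Path
from typing import Any


class ArtifactStore:
    """A stand-in for an external store, backed by a local directory."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, payload: dict[str, Any]) -> str:
        """Persist the payload unchanged and return a lightweight reference."""
        ref = f"artifact-{uuid.uuid4().hex}"
        (self.root / f"{ref}.json").write_text(json.dumps(payload), encoding="utf-8")
        return ref

    def get(self, ref: str) -> dict[str, Any]:
        return json.loads((self.root / f"{ref}.json").read_text(encoding="utf-8"))


def propagate_with_artifact(
    store: ArtifactStore, partial_results: list[str], error: dict[str, str]
) -> dict[str, Any]:
    """Persist completed work, then send up only the error and a pointer."""
    ref = store.put({"partial_results": partial_results})
    return {"error": error, "partial_results_ref": ref}


store = ArtifactStore(Path(tempfile.mkdtemp()))
message = propagate_with_artifact(
    store, ["row 1", "row 2"], {"category": "permission", "detail": "403"}
)
```

The `message` that crosses the boundary stays small no matter how large the partial results are, and the coordinator (or a peer it reroutes to) can call `store.get` on the reference to read the work exactly as it was written.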

How this interacts with context isolation

Subagents typically run with isolated context: the coordinator cannot see a subagent's internal scratch work, its retries, or its reasoning. The only thing the coordinator sees is what the subagent chooses to return. That isolation is what makes the content of a propagated error so important. If the subagent silently retried, gave up, and returned a bare "failed," the coordinator has lost the entire history of what was tried. It cannot tell a transient outage from a permission wall, or know which partial results already exist.

Because the propagated message is the single channel across that isolation boundary, it must carry everything the coordinator needs to act: the category of the unresolved error, any partial results, and a record of what was attempted. This is the same lesson as structured error metadata, applied one level up. Where a tool structures its failure for an agent to read, a subagent structures its propagated failure for a coordinator to read. The boundary is different; the discipline is identical.

Designing the propagated message

Concretely, a well-formed propagation has four parts. It states the unresolved error and its category, so the coordinator knows whether the problem is a transient one it might reroute around or a permission wall that needs different credentials. It includes the partial results already gathered, so completed work is not discarded. It records what the subagent attempted, so the coordinator does not blindly repeat the same failing approach. And, where the subagent can offer one, it suggests a next step: reroute to a peer with different access, finish with what exists, or escalate to a human.
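The four parts map naturally onto a small structured message. The following is a sketch under stated assumptions: the `PropagatedError` class, its field names, and the JSON encoding are choices made for this example, and the sample values are borrowed from the worked example earlier in the text.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class PropagatedError:
    # Part 1: the unresolved error and its category.
    category: str
    error: str
    # Part 2: partial results already gathered.
    partial_results: list[str] = field(default_factory=list)
    # Part 3: what the subagent attempted.
    attempted: list[str] = field(default_factory=list)
    # Part 4: an optional suggested next step.
    suggested_next_step: Optional[str] = None

    def to_message(self) -> str:
        """Serialise to JSON, the single channel across the isolation boundary."""
        return json.dumps(asdict(self), indent=2)


handoff = PropagatedError(
    category="permission",
    error="403: credential lacks access to the historical dataset",
    partial_results=["free-tier findings"],
    attempted=["queried historical dataset with the provided credential"],
    suggested_next_step="reroute to a subagent that holds the right credential",
)
print(handoff.to_message())
```

Leaving `suggested_next_step` optional reflects the wording above: the subagent offers a next step only where it can.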

Assembled that way, a propagated error is less a complaint and more a handoff. The coordinator reads it and makes one targeted decision rather than reconstructing the situation from scratch. That is the difference between a multi-agent system that absorbs failures and one that amplifies them: the quality of the message that crosses the boundary when something genuinely cannot be fixed where it happened.

How this is tested

This is an apply-level knowledge point anchored to the multi-agent research scenario, so exam items hand you a coordinator-subagent setup and ask how a particular failure should flow. The correct answers consistently reward local recovery first and disciplined, context-rich propagation second. A stem describing subagents that forward every timeout to an overwhelmed coordinator is pointing at the no-local-recovery anti-pattern; a stem about a coordinator forced to redo work is pointing at propagation that dropped partial results. The skill assessed is placing recovery at the right layer and packaging escalations so the hub can act without repeating effort.

Check your understanding

In a multi-agent research system, a subagent's web-search tool returns a transient 503. The team is deciding how the subagent should behave. Which approach best follows multi-agent error propagation?
