Multi Agent Error Handling and Routing

In short: Multi agent error handling is the pattern where the coordinator owns error routing between subagents: each subagent recovers locally from transient failures, and only errors it cannot resolve are propagated to the coordinator, with any partial results attached so the coordinator can decide how to proceed.

What multi agent error handling means

Things go wrong in distributed work: a data source times out, a tool returns garbage, a subagent cannot complete its slice. Multi agent error handling is the design that decides who deals with each failure and how information about it moves. The governing principle follows from hub and spoke architecture: the coordinator owns error routing, because it is the only node with a view of the whole task. Subagents recover what they can locally; everything else flows up to the hub.

The exam rates this at the apply level, so it expects you to design or critique an error-handling scheme, not just define one. The two ideas you must wield are the split between local recovery and escalation, and the rule that propagated errors carry partial results. Get those right and a multi-agent system degrades gracefully instead of failing opaquely.

Multi agent error handling: A pattern where the coordinator routes errors across a multi-agent system: subagents recover locally from transient failures and escalate only unrecoverable ones to the coordinator, attaching any partial results so the coordinator can decide how to proceed.

Local recovery versus escalation

The first design decision is where a given failure should be handled. Transient, recoverable problems belong to the subagent: a momentary network blip, a rate-limited call, a tool that succeeds on retry. Pushing every such hiccup up to the coordinator would drown it in noise and waste the round trip. So subagents implement local recovery for the failures they can plausibly resolve themselves.
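Local recovery of this kind is often just a bounded retry loop. The sketch below is one minimal way to write it, assuming a hypothetical `TransientError` exception that marks failures worth retrying; the function name, the attempt limit, and the backoff values are illustrative choices, not part of any particular agent framework.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_with_local_retry(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff.

    Returns the operation's result. If every attempt fails, the last
    TransientError is re-raised so the caller can escalate it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # local recovery is exhausted; the caller must escalate
            # Exponential backoff with a little jitter to avoid synchronized retries.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
```

The retry count is bounded on purpose: once the limit is reached the failure is no longer treated as transient, and the exception leaves the subagent instead of looping forever.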

Escalation is reserved for failures a subagent cannot resolve locally: a missing capability, a hard dependency that is down, or a task it genuinely cannot complete. Those propagate to the coordinator, which has options a subagent does not: it can reroute the work to a different subagent, retry with altered context, proceed with partial coverage, or surface the problem. The division keeps recovery close to the problem while reserving cross-cutting decisions for the only node that can make them.
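The coordinator's four options can be pictured as a small decision function. This is only a sketch of one possible policy: the reason labels, the `min_coverage` threshold, and the order of the checks are all assumptions made for illustration, and a real coordinator would weigh them against the goal of the task.

```python
from enum import Enum


class Action(Enum):
    REROUTE = "reroute to a different subagent"
    RETRY_WITH_CONTEXT = "retry with altered context"
    PROCEED_PARTIAL = "proceed with partial coverage"
    SURFACE = "surface the problem"


def choose_action(reason, fallback_available, coverage, min_coverage=0.8):
    """Pick a coordinator response to an escalated failure (hypothetical policy).

    reason: short label from the subagent, e.g. "dependency_down".
    fallback_available: whether another subagent could take the work.
    coverage: fraction of the slice already completed (0.0 to 1.0).
    """
    if coverage >= min_coverage:
        return Action.PROCEED_PARTIAL  # enough of the slice arrived to continue
    if fallback_available:
        return Action.REROUTE
    if reason == "ambiguous_task":
        return Action.RETRY_WITH_CONTEXT  # a clearer brief may unblock the same subagent
    return Action.SURFACE
```

Every input to this function is whole-task knowledge (available fallbacks, acceptable coverage), which is why the decision sits with the coordinator and not with the failing subagent.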

Routing a subagent failure

Local recovery for transient errors; escalation, carrying partial results, for what the subagent cannot fix.

Why propagated errors must carry partial results

When a subagent does escalate, the worst thing it can do is throw away the work it already completed. A research subagent that gathered four of five sources before failing should propagate those four sources alongside the error, not a bare failure. Partial results let the coordinator salvage progress: it might accept the four, re-delegate only the missing fifth, or decide the four are enough. Discarding them forces wasteful rework or, worse, silent loss of coverage.
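One simple way to keep completed work attached to a failure is to carry it on the exception itself. The sketch below assumes a hypothetical `SubagentFailure` exception and a caller-supplied `fetch` function; it stops at the first source that fails and reports everything gathered up to that point.

```python
class SubagentFailure(Exception):
    """Hypothetical escalation error that keeps the work done so far."""

    def __init__(self, message, partial_results):
        super().__init__(message)
        self.partial_results = list(partial_results)


def gather_sources(source_ids, fetch):
    """Fetch each source in order; on failure, escalate with what was gathered."""
    gathered = []
    for source_id in source_ids:
        try:
            gathered.append(fetch(source_id))
        except Exception as exc:
            # Attach the completed work instead of raising a bare error.
            raise SubagentFailure(
                f"failed on source {source_id}: {exc}", gathered
            ) from exc
    return gathered
```

A caller that catches `SubagentFailure` can read `partial_results` and re-delegate only the source that is still missing.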

Attaching structured context about what failed is equally important. "Timed out fetching source 5 after three retries" tells the coordinator far more than a generic error, and it is the seam that connects to the broader error-propagation patterns covered in Domain 2. The richer the escalation, the better the coordinator's recovery decision, which is exactly why structured error propagation across tool and agent boundaries becomes its own knowledge point later in the certification.

local: subagent handles transient errors

escalate: only what it cannot resolve

partial results: always attached on propagation

Why the split mirrors good distributed design

The local-recovery-versus-escalation split is not a Claude-specific quirk; it is the same principle that governs resilient distributed systems generally. Handle what you can where the problem occurs, and surface only what you cannot to the layer with broader authority. A subagent is close to its own failure: it knows the call it made and can cheaply retry it, so it is the right place to absorb transient noise. The coordinator is far from the individual failure but holds the wide view, so it is the right place to make decisions that affect the whole task.

Mapping this onto the multi-agent system clarifies why each layer owns what it owns. The subagent owns recovery that needs only local knowledge; the coordinator owns recovery that needs whole-task knowledge, such as whether another subagent can cover for a failed one or whether partial coverage is acceptable for this particular goal. Putting a decision at the wrong layer (escalating a transient blip, or expecting a subagent to decide whether the overall answer can ship without its slice) is the design smell the exam wants you to catch.

Designing the escalation payload

When a subagent does escalate, what it sends matters as much as the fact that it escalated. A useful escalation is structured: it names the operation that failed, summarises why (timeout, missing dependency, invalid data), and attaches whatever partial results the subagent gathered before failing. That payload is what lets the coordinator make an informed choice rather than a blind one. A bare "I failed" forces the coordinator to either discard the slice or re-run it from scratch, both of which waste the partial progress.

This is also the bridge to the broader Domain 2 treatment of error propagation, where structured error context becomes a first-class design concern across tool and agent boundaries. The habit to build here is to treat an error as data to be reported precisely, not a dead end. A subagent that says "fetched 4 of 5 sources; source 5 timed out after 3 retries; partial findings attached" hands the coordinator a recoverable situation. The richer and more structured the escalation, the better every downstream decision becomes.
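A structured escalation can be modelled as a small record type. The following is a sketch under stated assumptions: the `Escalation` class and its field names are hypothetical, chosen to mirror the three things the payload should carry (the failed operation, the reason, and the partial results).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Escalation:
    """Hypothetical payload a subagent sends when it cannot recover locally."""

    operation: str   # what the subagent was trying to do
    reason: str      # short category, e.g. "timeout" or "missing_dependency"
    attempts: int    # local attempts made before giving up
    completed: int   # units of work finished
    expected: int    # units of work requested
    partial_results: list = field(default_factory=list)

    def summary(self):
        """One-line, human-readable description for logs or prompts."""
        return (
            f"{self.operation}: completed {self.completed} of {self.expected}; "
            f"{self.reason} after {self.attempts} attempts; "
            f"{len(self.partial_results)} partial results attached"
        )

    def to_json(self):
        """Serialize for transport across an agent boundary."""
        return json.dumps(asdict(self))
```

Keeping the reason as a short category and the counts as numbers means the coordinator can branch on them directly, while `summary()` gives a readable line for a log or a follow-up prompt.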

The failure mode to design out: silence

The single most dangerous pattern in multi-agent error handling is the silent failure: a subagent that errors without the coordinator ever finding out. Because the coordinator is the hub that owns synthesis and quality, an error it never hears about becomes an invisible hole in the final answer, indistinguishable from a coverage gap. The system looks like it succeeded while quietly omitting a chunk of the work.

Designing silence out means making escalation mandatory for unrecoverable failures and making the coordinator's log the one place where every such failure is recorded. The coordinator must be aware of every error that was not resolved locally, so it can choose to retry, reroute, or at least flag the gap to the user. An error the hub knows about is a managed risk; an error it does not know about is a latent defect.

This connects back to why the hub and spoke topology routes everything through one node. Because the coordinator owns synthesis, it is also the only place that can reconcile a failure against the overall answer, deciding, for instance, that a missing strand is serious enough to surface to the user rather than paper over. A subagent that swallows its own error denies the coordinator that judgment entirely. So "no silent failures" is not merely good hygiene; it is what preserves the coordinator's ability to make the whole-task decisions that justify having a coordinator in the first place. A multi-agent system is only as reliable as the coordinator's awareness of what actually happened across its spokes.
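The "no silent failures" rule can be enforced structurally by making the coordinator the only place subagents are invoked, so that every exception is recorded there. The sketch below assumes subagents are plain callables; the `Coordinator` class and its report shape are hypothetical.

```python
class Coordinator:
    """Hypothetical hub that records every failure it hears about."""

    def __init__(self):
        self.results = {}
        self.failures = {}

    def run(self, name, subagent):
        """Run one subagent; any exception becomes a recorded failure."""
        try:
            self.results[name] = subagent()
        except Exception as exc:  # deliberately broad: nothing may vanish
            self.failures[name] = f"{type(exc).__name__}: {exc}"

    def report(self):
        """Return findings plus an explicit list of known gaps."""
        return {
            "findings": dict(self.results),
            "gaps": dict(self.failures),
            "complete": not self.failures,
        }
```

Because the report always carries a `gaps` entry and a `complete` flag, a missing strand shows up as a stated gap and cannot be mistaken for full coverage.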

Worked example: a research run with a downed source

A coordinator dispatches three research subagents. One subagent's primary data source is unreachable after several retries.

The affected subagent first does the right local thing: it retries the flaky source a few times, because transient outages often clear. When the source stays down, the failure is no longer transient and is beyond what the subagent can fix alone, so it escalates. Critically, it propagates the two findings it did manage to gather from secondary sources, plus a structured note: "primary source unreachable after three attempts; returning two partial findings."

The coordinator now has a real decision to make, which is its job. It sees partial coverage and a clear reason, and it re-delegates a narrow follow-up to a different subagent pointed at an alternative source, then folds the recovered data into the synthesis. Had the subagent failed silently, the coordinator would have synthesised a confident report missing that entire strand, with no idea anything was wrong. Local recovery, escalation with partial results, and coordinator-owned routing turned a failure into a manageable detour.
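The same run can be traced in a few lines of code. This is a toy model, not a real agent runtime: sources are plain callables, a `ConnectionError` stands in for the unreachable primary source, and the function names are hypothetical.

```python
def research_subagent(primary, secondary, attempts=3):
    """Return (findings, error); error is None when the primary source worked."""
    findings = [fetch() for fetch in secondary]
    for _ in range(attempts):
        try:
            findings.append(primary())
            return findings, None
        except ConnectionError:
            continue  # possibly transient: retry locally
    # Retries exhausted: escalate with a structured note and the partial findings.
    note = (
        f"primary source unreachable after {attempts} attempts; "
        f"returning {len(findings)} partial findings"
    )
    return findings, note


def coordinator(subagent_run, alternative):
    """Keep partial findings and re-delegate only the missing piece."""
    findings, error = subagent_run()
    if error is not None:
        # Narrow follow-up against an alternative source, then fold it in.
        findings = findings + [alternative()]
    return findings
```

The coordinator never re-runs the secondary fetches; it commissions only the strand the note says is missing.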

Misconceptions to correct

Misconception

Every error a subagent encounters should be sent straight up to the coordinator to handle.

What's actually true

Transient, recoverable errors should be handled locally by the subagent, for instance by retrying a flaky call. Escalating every minor hiccup floods the coordinator. Only failures the subagent cannot resolve locally should propagate to the hub.

Misconception

When a subagent fails, it should return a clean error and discard whatever partial work it did.

What's actually true

A subagent should attach any partial results when it propagates an error, along with structured context about what failed. Partial results let the coordinator salvage progress and re-delegate only the missing piece, rather than redoing everything.

Partial results turn failures into decisions

The single habit that most improves multi-agent reliability is treating every failure as something that still carries value. A subagent rarely fails having done nothing; it usually fails partway, holding real progress. When that progress travels upward with the error, a failure stops being a dead end and becomes a decision point: the coordinator can keep the good parts and commission only the missing remainder. When the progress is discarded, the same failure forces a full redo or a silent gap.

This reframing is what separates a brittle system from a graceful one. Brittle systems treat a failure as binary (success or nothing), and so they either redo expensive work or lose coverage. Graceful systems treat a failure as a partial success plus a precise description of what is still needed, which is almost always cheaper to finish than to restart. Designing every escalation to carry its partial results is therefore not a nicety; it is the mechanism by which a multi-agent system degrades smoothly under the inevitable failures of real-world tools and data sources.

Retries, checkpoints, and resuming after failure

Routing failures to the coordinator is the structural rule; pairing the model's judgment with deterministic safeguards is what makes that routing dependable. Anthropic's guidance for long-running multi-agent work is to combine the model's adaptability with mechanical guarantees such as explicit retry logic and regular checkpoints, so a transient outage triggers a bounded retry rather than a cascade, and progress is saved at known-good points. The aim is a system that can resume from where it was when an error occurred instead of restarting the whole run, which matters most once the work has already consumed significant time and tokens.

Checkpointing also keeps the coordinator's context manageable as a run grows. Rather than re-reading every subagent's full output to recover, the coordinator can hold lightweight references to durable artifacts, such as files a subagent wrote, and reload only what a given recovery actually needs. That same context discipline curbs the compounding errors and token overhead that creep into long agent loops. The pattern to remember is layered: subagents absorb transient errors locally, the coordinator owns escalation and rerouting, and checkpoints plus retries give the whole system a way to fail partway and continue rather than collapse.
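Checkpoint-and-resume can be sketched with nothing more than a JSON file. The example below is a minimal illustration, assuming each step is a named callable whose result is JSON-serializable; the function name and the file layout are hypothetical.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path):
    """Run named steps in order, saving each result so a rerun can resume.

    steps: list of (name, callable) pairs.
    checkpoint_path: JSON file holding the results of completed steps.
    """
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for name, step in steps:
        if name in done:
            continue  # finished in an earlier run; do not redo it
        done[name] = step()  # an exception here leaves earlier results on disk
        path.write_text(json.dumps(done))
    return done
```

If a step raises, the results of earlier steps are already on disk, so calling the function again with the same path skips them and continues from the step that failed.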

How this is tested on the exam

Task 1.2 error-handling questions describe a multi-agent system reacting to a subagent failure and ask which design is sound. The correct answer keeps transient recovery local, escalates unrecoverable failures to the coordinator, and carries partial results upward. Distractors typically centralise every trivial error at the coordinator, let subagents fail silently, or have the failing subagent discard its partial work. Because this knowledge point applies coordinator responsibilities to the failure path and bridges into Domain 2's error propagation, designing graceful, observable failure routing is the practical competency being assessed.

Check your understanding

A subagent gathering data fails on one of its three sources after exhausting local retries. How should a well-designed multi-agent system handle this?
