AI Skill Certs
Context Management & Reliability · Task 5.3 · Bloom: understand · Difficulty 2/5 · 7 min read · Updated 2026-06-07

Workflow Termination Anti-Pattern: One Failure Should Not Kill the Pipeline

Implement error propagation strategies across multi-agent systems

By Solomon Udoh · Reviewed by Solomon Udoh · AI-assisted · human-reviewed
In short
The workflow termination anti-pattern is killing an entire pipeline because one step failed. It throws away the partial results that successful subagents already produced, so a single timeout in a five-source research job destroys four good answers. The resilient alternative is to isolate the failure, keep the good results, and continue with graceful degradation.

What the workflow termination anti-pattern is

The workflow termination anti-pattern is the over-reaction to a failure: one subagent or one tool call fails, and the code responds by tearing down the whole job. A single source times out and the entire research pipeline aborts. One extraction throws and a batch of two hundred documents is abandoned. The instinct to stop everything the moment something goes wrong feels safe, but in a multi-step, multi-agent system it is usually the most wasteful thing you can do, because it throws away all the work that actually succeeded.

This knowledge point lives in Task Statement 5.3, implement error propagation strategies across multi-agent systems, and it is the mirror image of the silent suppression anti-pattern you met earlier. Suppression hides failures so the pipeline never stops when it should; termination over-reacts so the pipeline always stops when it should not. Both are failures of propagation: they handle the error at the wrong altitude. The prerequisite for this knowledge point is structured error context, because deciding whether a failure is fatal requires knowing what kind of failure it was and what was lost.

Workflow termination anti-pattern
Killing an entire pipeline because a single step failed, discarding the partial results that successful steps already produced. It treats every failure as fatal regardless of whether the failed task was on the critical path.

Why throwing away partial results is so costly

In a single-call program, aborting on error is cheap: you lose one operation. In an agentic pipeline, the cost is everything that ran before the failure. Multi-agent jobs are often fan-outs: a coordinator dispatches several subagents in parallel, each doing expensive, token-heavy work against a different source. By the time the fifth subagent times out, the other four may have spent thousands of tokens producing genuinely useful findings. Terminate the workflow and you bin all of it, then start again from zero on the retry, paying the whole cost a second time.

Anthropic's engineering team makes this point directly in their account of building a multi-agent research system: when errors occur they do not restart from the beginning, because restarts are expensive and frustrating. Instead they build systems that resume from where the agent was, combining the adaptability of the model with deterministic safeguards like retry logic and regular checkpoints. The principle that follows is simple to state: a partial graph is better than no graph. Saving and using partial results is not a nice-to-have; it is the difference between a system that degrades and one that collapses.

Critical path?: the question that decides if a failure is fatal
Isolate, do not abort: the resilient response to a branch failure
Partial > none: why successful results must be preserved

How to decide whether a failure is actually fatal

Not every failure is equal, and the architect's job is to tell them apart. The deciding question is whether the failed task sits on the critical path, that is, whether everything downstream genuinely depends on its output. If a coordinator cannot proceed at all without a particular subagent's result, that failure may justify stopping and surfacing the problem. But most failures in a fan-out are not like that. One of five independent sources timing out leaves four-fifths of the answer intact, and the right move is to isolate the failure to its branch, keep the four good results, and let the synthesis proceed with an honest note about what is missing.

The trap to avoid is letting a downstream stage's unchecked assumption turn a non-critical failure into a critical one by accident. If a later stage blindly assumes every earlier stage produced output, then a missing branch will crash that stage and you are back to a cascade, even though the original failure was tolerable. Designing for partial results therefore means every consumer of a subagent's output has to handle the case where that output is absent. A stage that reads four results instead of five should carry on; a stage that summarises whatever it was given should summarise four; a stage that genuinely needs all five should be the only place that escalates. Build that tolerance in from the start and a single timeout costs exactly one branch, which is the entire point of refusing to terminate.
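
As a rough illustration of that tolerance, the sketch below contrasts a stage that summarises whatever arrived with a stage that genuinely needs every branch. The names (`summarise`, `reconcile_all`, `IncompleteInputError`) and the convention that a failed branch is recorded as `None` are assumptions made for this example, not part of any particular framework.

```python
from typing import Dict, Optional

# Assumed shape: branch name -> findings, or None when the branch failed.
BranchResults = Dict[str, Optional[str]]


def summarise(results: BranchResults) -> str:
    """Tolerant stage: summarises what arrived and names what did not."""
    lines = [f"{name}: {text}" for name, text in results.items() if text is not None]
    missing = sorted(name for name, text in results.items() if text is None)
    if missing:
        lines.append("Missing sources: " + ", ".join(missing))
    return "\n".join(lines)


class IncompleteInputError(Exception):
    """Raised only by a stage that cannot work without every branch."""


def reconcile_all(results: BranchResults) -> int:
    """Strict stage: the one place that escalates on a missing branch."""
    missing = sorted(name for name, text in results.items() if text is None)
    if missing:
        raise IncompleteInputError(f"cannot reconcile without: {missing}")
    return len(results)
```

Only the strict stage raises, so a missing branch stops the pipeline only where the dependency is real.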

Terminating versus isolating a single subagent failure
The same single timeout either destroys the whole job or costs only one branch, depending on whether the pipeline terminates or isolates.

Isolation, retry, and continuation

Avoiding the anti-pattern is mostly about where you put your error boundary. Wrap each subagent's work so a failure in one cannot propagate up and abort its siblings. As each subagent finishes, persist its result, a checkpoint, so a later failure cannot retroactively erase it. When a branch fails, give it a bounded retry; the failure type from your structured error context tells you whether a retry is even worth attempting, since a transient timeout may succeed but a permission error will not. If the retry also fails, the branch is skipped, and the pipeline continues with whatever completed.
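
The retry-or-skip decision can be sketched as a small function over the failure type. The taxonomy here (`FailureKind`, the `TRANSIENT` set) and the function name `next_action` are illustrative assumptions; a real system would use whatever categories its structured error context defines.

```python
from enum import Enum


class FailureKind(Enum):
    TIMEOUT = "timeout"                      # transient: a retry may succeed
    RATE_LIMITED = "rate_limited"            # transient
    PERMISSION_DENIED = "permission_denied"  # permanent: a retry cannot help
    NOT_FOUND = "not_found"                  # permanent


# Assumed classification; adjust to the failure types your system reports.
TRANSIENT = {FailureKind.TIMEOUT, FailureKind.RATE_LIMITED}


def next_action(kind: FailureKind, retries_used: int, max_retries: int = 1) -> str:
    """Return "retry" or "skip" for a failed branch.

    There is deliberately no "abort the job" outcome: a branch failure
    either gets another bounded attempt or is skipped.
    """
    if kind in TRANSIENT and retries_used < max_retries:
        return "retry"
    return "skip"
```

With the default of one retry, a first timeout yields `"retry"`, a second yields `"skip"`, and a permission error is skipped immediately.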

The one obligation that comes with continuing is honesty, and that is where the next knowledge point picks up. Continuing with partial results is only safe if the final output says so. Silently dropping the failed section would convert this resilient pattern back into suppression. Coverage annotations in synthesis is the discipline of marking exactly which parts of the answer are well-supported and which are missing, so that graceful degradation never becomes quiet incompleteness.

Worked example

A five-source research orchestrator loses one source to a timeout after the other four have already returned high-quality findings.

A coordinator is compiling a competitive-landscape report and fans the work out to five subagents, each assigned a different data source: a news index, a filings database, a patents search, a social-listening tool, and a niche industry portal. Four return within the budget with substantial findings. The industry portal hangs and eventually times out.

The naive implementation has a single try block around the whole fan-out. The timeout raises, the exception unwinds past all five subagents, and the orchestrator returns nothing. Four expensive, successful research passes are discarded, and on retry the entire job, including the four sources that worked perfectly the first time, runs again from scratch, doubling the cost and the latency.

The resilient implementation puts the error boundary around each subagent. The portal timeout is caught at its own branch and converted into structured error context (transient timeout, source recorded, no partial data, alternative: retry once or skip). The orchestrator retries the portal once; it times out again, so that branch is marked missing and skipped. Synthesis runs over the four completed sources, and the final report carries a coverage note that the industry-portal source could not be reached. The stakeholder gets four-fifths of a strong report immediately, instead of nothing, and the system spent its tokens once rather than twice.
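
A minimal simulation of the resilient version might look like the following. The `research` function is a stand-in for a real subagent call and is hard-coded so that the portal always times out; the source names, the result dictionaries, and the wording of the coverage note are all assumptions for the sake of the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

SOURCES = ["news", "filings", "patents", "social", "portal"]


def research(source: str) -> str:
    """Stand-in for a subagent call; the portal always times out here."""
    if source == "portal":
        raise TimeoutError("industry portal did not respond")
    return f"findings from {source}"


def run_branch(source: str, max_retries: int = 1) -> dict:
    """Error boundary around ONE subagent: never raises, always reports."""
    last_error = ""
    for _ in range(1 + max_retries):
        try:
            return {"source": source, "ok": True, "findings": research(source)}
        except TimeoutError as err:
            last_error = str(err)
    # Structured error context instead of an exception that unwinds the fan-out.
    return {"source": source, "ok": False, "kind": "transient_timeout",
            "detail": last_error, "attempts": 1 + max_retries}


def run_report() -> dict:
    """Fan out, then synthesise from whatever completed."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        outcomes = list(pool.map(run_branch, SOURCES))
    missing = [o["source"] for o in outcomes if not o["ok"]]
    return {
        "sections": [o["findings"] for o in outcomes if o["ok"]],
        "coverage_note": ("Could not reach: " + ", ".join(missing)
                          if missing else "All sources covered"),
    }


report = run_report()
```

The report here carries four sections and a note naming the portal, which is the four-fifths outcome described above.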

Setting the error boundary correctly

Avoiding termination is mostly a question of where you draw your error boundaries, and a few concrete practices turn the principle into working code. The first is to give each subagent its own boundary rather than wrapping the whole fan-out in one. If a single try block surrounds all five branches, the first exception unwinds past every branch and the job is gone; if each branch has its own boundary, a failure stays contained to the branch that produced it and the siblings never notice. The boundary is the difference between a contained fault and a cascade.
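
The difference between the two boundary placements can be shown side by side. This is a deliberately sequential sketch with a hypothetical `fetch` function in which the source named "portal" fails; a real fan-out would run branches concurrently, but the placement of the `try` is the point.

```python
def fetch(source: str) -> str:
    """Hypothetical subagent call; "portal" stands in for the failing branch."""
    if source == "portal":
        raise TimeoutError(source)
    return f"{source} findings"


def fan_out_single_boundary(sources):
    """Anti-pattern: one try block around every branch."""
    try:
        return [fetch(s) for s in sources]
    except TimeoutError:
        return []  # the good results are discarded along with the bad one


def fan_out_per_branch(sources):
    """One boundary per branch: a failure stays where it happened."""
    results, failed = [], []
    for s in sources:
        try:
            results.append(fetch(s))
        except TimeoutError:
            failed.append(s)
    return results, failed
```

Given the same five sources, the first function returns nothing and the second returns four results plus a record of the one branch that failed.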

The second practice is to checkpoint results as they arrive. The moment a subagent returns useful findings, persist them somewhere durable before any later step runs. Then a failure two steps downstream cannot retroactively erase the work that already succeeded, and a retry can resume from the checkpoint instead of starting the whole job over. This is the resume-rather-than-restart principle that resilient systems are built on, and it is what makes preserving partial results practical rather than aspirational.
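
One way to sketch checkpoint-and-resume is with a small JSON file on disk. The file layout, the function names, and the caller-supplied `fetch` are assumptions for illustration; a production system would more likely use a database or object store.

```python
import json
from pathlib import Path


def load_checkpoint(path: Path) -> dict:
    """Return the results saved so far, or an empty dict on a first run."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_result(path: Path, source: str, findings: str) -> None:
    """Persist one branch result as soon as it arrives."""
    done = load_checkpoint(path)
    done[source] = findings
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(done))
    tmp.replace(path)  # rename into place so a crash mid-write keeps the old file


def run_with_resume(path: Path, sources, fetch) -> list:
    """Run only branches with no checkpoint yet; return the sources attempted."""
    done = load_checkpoint(path)
    attempted = []
    for source in sources:
        if source in done:
            continue  # already succeeded on an earlier run: do not pay twice
        attempted.append(source)
        try:
            save_result(path, source, fetch(source))
        except TimeoutError:
            pass  # stays un-checkpointed, so a later run tries it again
    return attempted
```

On a second run only the branches that never checkpointed are attempted, which is what makes a retry a resume.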

The third practice is to bound retries and make them idempotent. A branch should retry a transient failure a small, fixed number of times with a short backoff, then give up and mark itself missing rather than retrying forever. And because a retry re-runs an operation, the operation must be safe to run twice. For read-only lookups that is automatic; for anything with side effects, an idempotency key or a check-before-write protects against duplicate effects when a call that actually succeeded is retried after a slow response.
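
Both halves of this practice can be sketched briefly. The helper `retry_bounded`, the `IdempotentWriter` class, and the default attempt count and delay are illustrative assumptions; the in-memory dictionary stands in for whatever durable store would hold idempotency keys in practice.

```python
import time


def retry_bounded(operation, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Retry a transient failure a fixed number of times, then give up.

    Returns (result, None) on success, or (None, last_error) once the
    attempts run out so the caller can mark the branch missing.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation(), None
        except TimeoutError as err:
            last_error = err
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # short exponential backoff
    return None, last_error


class IdempotentWriter:
    """Check-before-write keyed on an idempotency key (in-memory stand-in)."""

    def __init__(self):
        self.applied = {}

    def write(self, key: str, payload: str) -> bool:
        """Apply the write once; a repeat with the same key is a no-op."""
        if key in self.applied:
            return False  # retry of a call that already succeeded
        self.applied[key] = payload
        return True
```

Passing `sleep` as a parameter is a testing convenience in this sketch: it lets the backoff be observed without actually waiting.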

The fourth practice is explicit critical-path analysis at design time. Before the system runs, decide which subagent results the final answer genuinely cannot be produced without, and treat only those as fatal when they fail. Everything else is, by definition, tolerable to lose with an annotation. Doing this analysis up front means the runtime decision is already made: a non-critical failure is isolated and skipped, a critical one stops and escalates, and nothing depends on an engineer making the right call under pressure during an incident.
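
A design-time decision like this can be captured as plain configuration. In the sketch below the policy table, the enum names, and the choice of "filings" as the critical branch are hypothetical; the source does not say which sources, if any, are critical for the example report.

```python
from enum import Enum


class OnFailure(Enum):
    SKIP_AND_ANNOTATE = "skip_and_annotate"
    STOP_AND_ESCALATE = "stop_and_escalate"


# Hypothetical policy, decided before the system runs and reviewed like config.
POLICY = {
    "news": OnFailure.SKIP_AND_ANNOTATE,
    "filings": OnFailure.STOP_AND_ESCALATE,  # assumed critical for illustration
    "patents": OnFailure.SKIP_AND_ANNOTATE,
    "social": OnFailure.SKIP_AND_ANNOTATE,
    "portal": OnFailure.SKIP_AND_ANNOTATE,
}


def validate_policy(branches, policy=POLICY) -> None:
    """Fail at startup, not mid-incident, if any branch lacks a decision."""
    undecided = sorted(set(branches) - set(policy))
    if undecided:
        raise ValueError(f"no failure policy declared for: {undecided}")


def on_failure(branch: str, policy=POLICY) -> OnFailure:
    """Look up the pre-made decision for a failed branch."""
    return policy[branch]
```

Running `validate_policy` when the pipeline is constructed means a newly added branch cannot ship without someone deciding whether its failure is fatal.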

Common misconceptions

Misconception: Aborting the whole pipeline on any failure is the safe, conservative choice.

What's actually true: It is conservative about the correctness of the failed step but reckless about everything else. Termination discards all successful work and pays the full cost again on retry. Isolating the failure and continuing with partial results is both safer and cheaper for non-critical failures.

Misconception: If you keep partial results, you no longer need to report the failure.

What's actually true: You do. Continuing without disclosing the gap is just silent suppression wearing a different hat. Graceful degradation requires a coverage annotation so consumers know which parts of the result are complete and which are missing.

How it shows up on the exam

This knowledge point is almost always tested through the Multi-Agent Research System scenario. A question will describe a pipeline where one subagent fails and ask what the system should do. The tempting distractors will say to terminate the run to be safe, or to retry the entire job from the start. The correct answer isolates the failure, preserves the results from the subagents that succeeded, and continues, typically with a note about the missing piece. If a question instead frames continuing as risky, look for the qualifier: continuing is correct with an honest coverage annotation, never as a silent drop.

Check your understanding

In a multi-agent research pipeline, four of five source subagents return strong findings and the fifth times out repeatedly. The orchestrator currently aborts the entire run on any subagent failure. What should the architect change?

People also ask

Should one agent failure stop the whole pipeline?
Usually not. It depends on whether the failed task is on the critical path. A non-critical subagent timing out should be isolated to its own branch rather than terminating the entire job.
How do you preserve partial results when a subagent fails?
Checkpoint each subagent result as it completes, put the error boundary around individual branches, and let the coordinator synthesise from whatever finished. A partial graph beats no graph.
What is graceful degradation in a multi-agent system?
It is continuing to produce a useful, honestly scoped result when some component fails, instead of collapsing entirely, while clearly reporting what could not be completed.

Watch and learn


Steven Ge

Building Multi-Agent AI Research Systems at Anthropic

Why watch: Details Anthropic's multi-agent research pipeline where one subagent timing out must not abort the whole run, so partial results from successful subagents are preserved.
