Batch Failure Handling and Retry

In short: Batch failure handling and retry is the pattern of reading a finished batch, isolating the requests whose result type is not succeeded, correcting just those inputs, and resubmitting only the failures rather than the entire batch. Because one failed request never affects the others, recovery is targeted, and the cost of failure is minimised by testing prompts on a small sample before running the full batch.

What batch failure handling and retry really means

Batch failure handling and retry is the operational discipline that turns a one-shot batch into a reliable pipeline. When a batch of thousands of requests finishes, some of them may not have succeeded, and the naive reaction is to rerun the whole job. That is both wasteful and unnecessary. Anthropic processes each request independently, so the failure of one has no bearing on the rest; the successes are genuinely done. The correct response is to recover surgically: find the failures, understand why each one failed, fix the cause, and resubmit only those.

This is an apply-level knowledge point, so the exam expects you to enact the pattern, not just describe failures. The pattern has three moves that always appear in the same order: isolate the failures by reading result types, repair the inputs according to the failure cause, and resubmit just the repaired set. Everything else, including the cost discipline of testing first, exists to keep that failure set small.

Targeted resubmission: Recovering from a partial batch failure by resubmitting only the requests that did not succeed, identified by their custom_id, after correcting whatever caused each failure. Successful requests are left untouched because they are already complete and already paid for.

Isolate failures by result type

Recovery starts at the results file, where every request carries one of four result types. A succeeded result is done and needs nothing. The other three signal trouble in different ways. An errored result means the request hit a problem and produced no message; an expired result means the 24-hour ceiling arrived before the request ran; a canceled result means the batch was stopped before that request was sent. A reassuring detail that the exam sometimes checks: you are not billed for errored, expired, or canceled requests, so failures cost you nothing except the need to rerun them.

Because each result is tagged with its custom_id, isolating the failures is a single pass: collect the ids of everything whose type is not succeeded, and you have the exact work list for recovery. This is the payoff of disciplined custom_id correlation: without good ids you could not even name which inputs to retry.

succeeded: complete; leave it alone.
errored: failed with no message; inspect the error type before retrying.
expired: never ran within 24 hours; safe to resubmit unchanged.
canceled: stopped before sending; resubmit if still needed.
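The single isolating pass described above can be sketched in a few lines. This is a minimal illustration, not the official client: it assumes the results file has already been downloaded as JSONL text and that each line carries a custom_id and a result object with a type field. The function names partition_by_result_type and work_list are hypothetical.

```python
import json
from collections import defaultdict


def partition_by_result_type(jsonl_text):
    """Group custom_ids by result type from results given as JSONL text."""
    buckets = defaultdict(list)
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        # Assumed shape: {"custom_id": ..., "result": {"type": ...}}
        buckets[record["result"]["type"]].append(record["custom_id"])
    return dict(buckets)


def work_list(buckets):
    """Return the ids of every request whose type is not 'succeeded'."""
    return sorted(
        cid
        for result_type, ids in buckets.items()
        if result_type != "succeeded"
        for cid in ids
    )


sample_results = "\n".join([
    json.dumps({"custom_id": "doc-1", "result": {"type": "succeeded"}}),
    json.dumps({"custom_id": "doc-2", "result": {"type": "errored"}}),
    json.dumps({"custom_id": "doc-3", "result": {"type": "expired"}}),
    json.dumps({"custom_id": "doc-4", "result": {"type": "canceled"}}),
])

buckets = partition_by_result_type(sample_results)
recovery_ids = work_list(buckets)
print(recovery_ids)
```

Keeping the buckets separate, rather than collapsing straight to one list, preserves the information needed later to decide how each failure should be repaired.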

Repair before you resubmit

Not every failure should be retried the same way, and this is where architects show judgment. Within the errored bucket the error type matters. An invalid_request_error is a problem with the request itself, for example a document that exceeded the context window or a malformed parameter. Resubmitting it unchanged will fail again identically, so it must be fixed first, perhaps by chunking an oversized document or correcting the payload. A transient server error, by contrast, is not your fault; the same request can be resubmitted as-is and will usually succeed on a second pass. The distinction is documented in Anthropic's error reference, and recognising it is the difference between a retry loop that converges and one that spins forever.

Expired requests are the easy case: they never ran, so resubmitting them unchanged is correct, ideally in a fresh batch with more headroom or at a quieter time so they are not starved again. The unifying rule is simple to state and easy to apply: match the repair to the cause. Fix what is broken, retry what was merely unlucky, and never blindly resend a request whose own contents caused the failure.
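The rule of matching the repair to the cause can be written as a small decision function. The sketch below makes two assumptions that should be checked against the current API reference: that an errored result nests its error type at result["error"]["error"]["type"], and that api_error and overloaded_error are the types worth treating as transient. The function name plan_recovery and the action labels are hypothetical.

```python
# Assumption: these error types are treated as transient and retried as-is.
TRANSIENT_ERROR_TYPES = {"api_error", "overloaded_error"}


def plan_recovery(result):
    """Map one result object to a recovery action label."""
    result_type = result["type"]
    if result_type == "succeeded":
        return "done"
    if result_type == "expired":
        # Never ran, so nothing about the request itself is broken.
        return "resubmit_unchanged"
    if result_type == "canceled":
        return "resubmit_if_needed"
    if result_type == "errored":
        # Assumed nesting of the error type inside the result.
        error_type = result.get("error", {}).get("error", {}).get("type")
        if error_type == "invalid_request_error":
            return "fix_then_resubmit"
        if error_type in TRANSIENT_ERROR_TYPES:
            return "resubmit_unchanged"
        # Anything unrecognised is looked at rather than blindly resent.
        return "inspect"
    raise ValueError(f"unexpected result type: {result_type!r}")


too_long = {
    "type": "errored",
    "error": {"type": "error", "error": {"type": "invalid_request_error"}},
}
print(plan_recovery(too_long))
print(plan_recovery({"type": "expired"}))
```

Routing unrecognised error types to inspection instead of retry is a deliberate choice here: it keeps the loop from resending a request whose own contents may be the cause.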

The targeted recovery loop

Only the non-succeeded requests re-enter the loop; the successes exit once and are never reprocessed.

Prevent failures before they happen

The cheapest failure is the one that never occurs, which is why this knowledge point pairs recovery with prevention. Running a brand-new prompt against all 50,000 records and discovering a systematic flaw afterwards is the expensive way to learn. The disciplined approach is to refine and test the prompt on a small representative sample first, using ordinary synchronous calls where you get instant feedback, until the sample passes cleanly. Only then do you commit the full batch. A prompt that works on a careful sample of 50 will usually succeed on the first pass for the vast majority of the 50,000, shrinking the failure set you have to manage afterwards to a handful of genuine edge cases.

Testing first also protects you from a subtle batch property: validation of each request happens asynchronously, and validation errors only surface when the batch ends. A flaw you could have caught in seconds synchronously might otherwise hide for up to 24 hours inside the batch before revealing itself across thousands of requests at once. Sampling moves that discovery to the front, where it is cheap.
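One way to make the test-first habit concrete is a gate that must pass before the full batch is submitted. The sketch below is illustrative only: run_one stands in for whatever synchronous call you use to try the prompt on a single record, is_acceptable stands in for your own quality check, and both are hypothetical callables supplied by the caller.

```python
import random


def pick_sample(records, size, seed=0):
    """Draw a reproducible random sample without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, min(size, len(records)))


def sample_passes(sample, run_one, is_acceptable, min_pass_rate=1.0):
    """Return True if enough sampled records produce acceptable output."""
    if not sample:
        raise ValueError("sample must not be empty")
    passed = sum(1 for record in sample if is_acceptable(run_one(record)))
    return passed / len(sample) >= min_pass_rate


# Stand-ins for a real synchronous call and a real output check.
def fake_run_one(record):
    return record.upper()


def fake_is_acceptable(output):
    return output.startswith("RECORD")


records = [f"record-{i}" for i in range(1000)]
sample = pick_sample(records, 50)
ready_for_batch = sample_passes(sample, fake_run_one, fake_is_acceptable)
print(len(sample), ready_for_batch)
```

A fixed seed keeps the sample stable between prompt revisions, so a change in the pass rate reflects the prompt rather than a different draw.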

Worked example

You batch 5,000 legal documents for summarisation. When the batch ends, 4,780 succeeded, 200 returned errored with invalid_request_error because the documents were too long, and 20 returned expired.

Resist the urge to rerun all 5,000. The 4,780 successes are complete and already paid for; touching them again would double their cost for no benefit. Instead, read the results and partition by custom_id and result type.

Handle the 200 invalid-request failures by their cause. The error says the documents exceeded the context window, so resubmitting them unchanged would fail again in exactly the same way. The repair is to chunk each oversized document into smaller sections, possibly summarising the sections and then combining them, and to assign new but still meaningful custom_ids such as doc-3391-part-2. These 200, now fixed, go into a new batch.
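The chunk-and-rename step might look like the following sketch. It splits on a character count purely for simplicity; a real pipeline would more likely split on token counts or section boundaries, and the max_chars limit and the function name chunk_document are assumptions for illustration.

```python
def chunk_document(custom_id, text, max_chars):
    """Split text into pieces of at most max_chars, each with a derived id."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    # Derived ids keep the original id visible, e.g. doc-3391-part-2.
    return [
        (f"{custom_id}-part-{number}", piece)
        for number, piece in enumerate(pieces, start=1)
    ]


chunks = chunk_document("doc-3391", "a" * 25, max_chars=10)
for chunk_id, piece in chunks:
    print(chunk_id, len(piece))
```

Because each derived id still begins with the original document id, the partial summaries can be grouped back together by prefix when the recovery batch finishes.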

Handle the 20 expired requests differently. They never ran, so nothing about them is broken. You add them to the same recovery batch unchanged, and because the recovery batch is far smaller than the original, they have ample headroom to complete well within the window this time.

The result is a recovery batch of roughly 220 requests instead of a needless rerun of 5,000. You paid only for what succeeded, you fixed only what was actually broken, and the pipeline converges in a second short pass. That targeted discipline is precisely what batch failure handling and retry is meant to deliver.

The economics that make targeted retry obvious

The reason targeted resubmission is not merely tidier but genuinely cheaper comes down to billing. You are charged only for requests that succeed; anything that errors, expires, or is canceled costs nothing. So when a batch of thousands comes back with a small slice of failures, you have already paid for exactly the successful work and not a cent for the rest. Rerunning the entire batch would therefore mean paying a second time for every request that already succeeded, purely to recover a handful that did not. The arithmetic makes the wasteful option obvious the moment you see it: a full rerun bills every earlier success a second time, roughly doubling the total spend, all to recover a few percent of the work.

Targeted retry inverts that arithmetic. By resubmitting only the failures, your recovery spend is proportional to the size of the failure set, not the size of the job. A 50,000-request batch with 300 failures recovers for the price of 300 requests, a rounding error against the original. This is why the efficient recovery and the correct recovery are the same thing here, and why "just rerun it" is the wrong instinct on any workload large enough to have justified batching in the first place.
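The comparison is easy to check with request counts. The sketch below counts requests rather than currency, since the real cost depends on token usage per request, and it assumes that every previously successful request would succeed and be billed again in a full rerun. The function name recovery_comparison is hypothetical.

```python
def recovery_comparison(total_requests, failed_requests):
    """Compare targeted retry with a full rerun, measured in request counts."""
    if not 0 <= failed_requests <= total_requests or total_requests == 0:
        raise ValueError("need 0 <= failed_requests <= total_requests, total > 0")
    return {
        "targeted_requests": failed_requests,
        "full_rerun_requests": total_requests,
        # Successes that a full rerun would process and bill a second time.
        "rebilled_successes": total_requests - failed_requests,
        "targeted_share": failed_requests / total_requests,
    }


large_job = recovery_comparison(50_000, 300)
worked_example = recovery_comparison(5_000, 220)
print(large_job)
print(worked_example)
```

For the 50,000-request job the targeted retry touches 300 requests while a full rerun would reprocess 49,700 successes; for the worked example of 5,000 documents the figures are 220 and 4,780.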

Designing a retry loop that actually converges

A recovery loop needs a stopping condition, or a stubborn input can trap you in an endless cycle. The discipline is to bound the retries and to treat repeated failure as a signal rather than something to brute-force. A transient server error deserves a second and perhaps a third attempt, because it usually clears. But a request that keeps returning an invalid_request_error after you have fixed the obvious cause is telling you something the loop cannot fix on its own, and the right move is to set it aside for human inspection rather than resubmit it forever. Each pass should shrink the failure set; if a pass does not, that is the cue to escalate the survivors.

Thinking about persistent failures this way connects batch recovery to the wider reliability picture, where unrecoverable errors must be surfaced and routed rather than silently retried. A well-designed batch loop converges quickly to either complete success or a small, clearly-labelled set of genuine exceptions that a person can examine. That bounded, honest outcome is what separates a production-grade recovery design from a script that simply hopes every request eventually works. A practical guard is to cap the number of recovery passes and to record, for every input that exhausts them, the last error it returned. That record turns a frustrating dead end into actionable triage: the engineer who picks it up sees immediately whether the survivors share a cause, such as a single malformed template or a class of documents that always overflows the context window, and can fix the real problem once rather than chasing the same failures batch after batch.
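A bounded loop with an escalation record could be sketched as follows. This is one possible shape under stated assumptions, not a prescribed design: submit_pass is a hypothetical callable that submits the pending requests as one recovery batch, waits for it to end, and returns a dict mapping every custom_id to None on success or to the error it returned.

```python
def run_recovery(pending, submit_pass, max_passes=3):
    """Retry failures for at most max_passes; return survivors with last error."""
    last_error = {}
    for _ in range(max_passes):
        if not pending:
            break
        outcome = submit_pass(pending)
        failed = {cid: req for cid, req in pending.items() if outcome[cid] is not None}
        for cid in failed:
            last_error[cid] = outcome[cid]
        made_progress = len(failed) < len(pending)
        pending = failed
        if not made_progress:
            break  # the pass did not shrink the failure set, so escalate
    # Whatever is still pending is set aside for human inspection.
    return {cid: last_error[cid] for cid in pending}


attempts = {}


def fake_submit(batch):
    """Stand-in for a real recovery batch: one flaky id, one stubborn id."""
    results = {}
    for cid in batch:
        attempts[cid] = attempts.get(cid, 0) + 1
        if cid == "doc-bad":
            results[cid] = "invalid_request_error"
        elif cid == "doc-flaky" and attempts[cid] < 2:
            results[cid] = "api_error"
        else:
            results[cid] = None
    return results


escalated = run_recovery(
    {"doc-ok": {}, "doc-flaky": {}, "doc-bad": {}}, fake_submit, max_passes=3
)
print(escalated)
```

The returned mapping is the triage record the paragraph describes: each surviving id paired with the last error it produced, ready for an engineer to look for a shared cause.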

How this knowledge point shows up on the exam

Questions here describe a partly failed batch and ask for the most efficient correct recovery. The trap answer is almost always "resubmit the entire batch," which is wrong because it reprocesses and re-pays for thousands of already-successful requests. A second trap is retrying an invalid_request_error unchanged, which simply fails again. The right answer isolates the failures by custom_id, distinguishes a fixable invalid request from a retryable transient error, and resubmits only the corrected failures. When a scenario mentions documents that were "too long" or a job that "timed out," it is cueing you to chunk-and-fix or to resubmit-unchanged respectively. This recovery loop is also a building block of the broader batch processing strategy design knowledge point.

Misconception

If part of a batch fails, the safest fix is to resubmit the whole batch so nothing is missed.

What's actually true

Each request is processed independently, so the successes are already complete and already billed. Resubmit only the failed requests, identified by custom_id. Rerunning the whole batch wastes money and time on work that already succeeded.

Misconception

Any failed batch request can simply be retried as-is until it eventually succeeds.

What's actually true

It depends on the cause. A transient server error can be retried unchanged, but an invalid_request_error, such as a document that exceeded the context window, will fail again identically unless you fix the input first, for example by chunking it.

Check your understanding

A nightly batch of 8,000 invoices finishes with 7,600 succeeded, 350 errored with invalid_request_error for exceeding the context window, and 50 expired. What is the most efficient correct recovery?
