- In short
- Batch processing strategy design is the architect-level skill of assembling a complete, defensible Claude batch pipeline: routing blocking work to the synchronous API and latency-tolerant work to batch, testing prompts on a sample before committing the full run, correlating results with meaningful custom_ids, and recovering from partial failures by resubmitting only what failed. It is an evaluate-level competency because it weighs cost, latency, and reliability trade-offs rather than applying a single rule.
What batch processing strategy design demands
Batch processing strategy design is where the four earlier knowledge points stop being separate facts and become a single architecture. The exam frames this at the evaluate level because there is no one rule to apply; you are weighing cost against latency against reliability and defending the balance you strike. A strategy that only chases the 50% discount but ignores failure recovery is fragile. A strategy that is perfectly reliable but batches work people are waiting on is broken. The architect's job is to compose all the constraints into a pipeline that is cheap and timely and trustworthy, and to justify why each choice is right for the workload in front of them.
That is why this knowledge point sits at the top of the task statement, with hard prerequisites on both the synchronous versus batch decision rule and batch failure handling and retry. You cannot design the whole pipeline until you can route a workload correctly and recover from a partial failure. Here, those skills become stages in one coherent flow.
- Batch processing strategy
- An end-to-end design for running a bulk Claude workload that decides which steps are synchronous and which are batched, validates the prompt on a sample, correlates results with custom_ids, and recovers from partial failures by resubmitting only what failed.
Stage one: route the portfolio
Every strategy begins by separating the workload into blocking and latency-tolerant work, because that split decides which API each step uses. A real system rarely lives entirely on one side. A nightly compliance pipeline, for instance, might generate thousands of document reviews overnight, which is pure batch territory, yet still expose a synchronous endpoint so a human reviewer can re-run a single flagged document on demand the next morning. Designing the strategy means naming each step and asking the blocking question of it individually, not reaching for one global setting.
Getting this stage right is what separates a strategy from a shortcut. The manager who declares "batch everything" has skipped it, and the result fails the latency requirements of the interactive steps. The architect who routes each step deliberately captures the discount on the bulk work, where the volume and therefore the saving actually live, while keeping the interactive paths responsive. The cost win is large precisely because the bulk work is large; the interactive work is a small fraction of calls and loses little by staying synchronous.
Stage two: de-risk with a sample
Before committing a full batch you validate the prompt on a small, representative sample using synchronous calls. This stage is cheap insurance against an expensive class of mistake. A new extraction or classification prompt can carry a systematic flaw, a format the model misreads, an instruction that is ambiguous, an output schema that does not fit edge cases, and discovering that flaw only after 50,000 requests have run is the worst possible time to learn it. Worse, batch validation is asynchronous, so a defect can hide for up to 24 hours before surfacing across the entire batch at once.
Sampling moves discovery to the front. You refine against 30 to 100 representative items synchronously, where feedback is instant, until the sample passes cleanly, and only then do you scale to the full batch. The payoff is a high first-pass success rate, which keeps the eventual failure set small and the recovery cheap. Skipping this stage is the most common way a technically valid batch strategy still ends up slow and costly in practice.
Stage three: build in correlation and recovery from the start
A reliable strategy bakes in result handling before the first request is sent, not after the results arrive. That means assigning each request a meaningful custom_id derived from a stable business key, so the output file joins straight back to your records despite batch results returning in any order. And it means planning the recovery loop up front: read the results, partition by result type, keep the successes, and resubmit only the failures after fixing whatever caused them. Because one failed request never affects the others, recovery is always a small, targeted second pass rather than a full rerun.
These two habits are what make a strategy trustworthy rather than merely cheap. Without correlation, a cost-optimised batch can silently misalign outputs with inputs and corrupt the dataset it was meant to produce. Without a recovery plan, a partial failure forces an expensive full rerun or, worse, quietly drops data. An architect evaluating a proposed strategy should check for both: where is the correlation key, and what happens to the requests that do not succeed? A design that cannot answer those questions is not yet a strategy. The full lifecycle, from creation to result retrieval, is laid out in Anthropic's batch processing guide.
- Route: classify each step as blocking or latency-tolerant before choosing an API.
- De-risk: validate the prompt on a sample synchronously before the full run.
- Correlate: assign meaningful custom_ids so results join back to records.
- Recover: plan targeted resubmission of failures, never a full rerun.
Stage four: weigh the trade-offs and defend the design
Because this is an evaluate-level skill, the final stage is judgment. You hold cost, latency, and reliability in tension and decide where each matters most for this workload. If a bulk job has a contractual deadline tighter than 24 hours, you may keep it synchronous despite the cost, because the batch window cannot guarantee that deadline. If failures would propagate into a downstream system, you invest more in recovery and validation, connecting this design to broader error propagation strategy thinking. If the prompt is mature and stable, you can shrink the sampling stage; if it is brand new, you expand it. The strategy is not a fixed recipe but a set of defensible choices, and the exam rewards the candidate who can say why each lever was set where it was.
That defensibility is the heart of the competency. Two architects can produce different valid strategies for the same workload, and both can be correct if each can justify the balance they struck. What is never correct is an undefended absolute, batching everything for the discount, or refusing batch entirely out of caution, because both ignore the structure of the actual workload.
Worked example
A bank must review 60,000 loan files for missing-disclosure flags every night, store the results for an analyst team that arrives at 8 a.m., and let analysts re-run any single file on demand during the day. Leadership wants the lowest possible cost without missing the morning deadline.
Start by routing the portfolio. The nightly review of 60,000 files is latency-tolerant: a scheduler launches it after hours and the analysts only need results by 8 a.m., a window the 24-hour ceiling sits comfortably inside. That is the bulk of the spend, so it goes to batch and earns the 50% discount. The on-demand single-file re-run, however, is blocking: an analyst is waiting at their desk, so it must stay synchronous. Already the design mixes both APIs deliberately, and the cost win lands on the 60,000-file bulk where it is largest.
Next, de-risk. Before the first production night, validate the disclosure-flag prompt on a sample of perhaps 80 representative files synchronously, refining until it reads the documents reliably. This keeps the first-pass success rate high so the nightly failure set is small.
Then build in correlation and recovery. Each file goes into the batch with a custom_id like loan-2026-06-07-114829, so every result joins back to the exact file regardless of return order. After the batch ends, the pipeline partitions results: successes are written to the analyst store, oversized files that errored are chunked and resubmitted, and any expired requests are re-queued in a small recovery batch, all well before 8 a.m. given the early start.
Finally, defend the trade-offs. Cost is minimised by batching the 60,000-file bulk; the morning deadline is protected by starting early and keeping the recovery batch small; reliability is assured by custom_id correlation and targeted retry; and analyst responsiveness is preserved by keeping the on-demand re-run synchronous. Every lever is set with a reason, which is exactly what an evaluate-level answer must demonstrate.
Right-size the batch and split very large datasets
Part of designing the pipeline is choosing how much work goes into a single batch. The hard ceilings are 100,000 requests or 256 MB, whichever comes first, but the practical size is usually well below the maximum. For a very large dataset, Anthropic's guidance is to split the work across several right-sized batches rather than forcing everything into one, because smaller batches are easier to monitor, complete and clear sooner, and contain the blast radius when something goes wrong.
The reasoning is operational. You poll each batch for status, so a handful of moderate batches gives you finer-grained progress and lets a problem in one run be diagnosed and resubmitted without holding up the others. A single monolithic batch, by contrast, is all-or-nothing to observe and slower to recover from. Splitting also lets independent batches make progress in parallel within your rate limits, which can shorten the wall-clock time to finish a huge job. The judgment call is to make each batch large enough to capture the throughput benefit yet small enough to stay manageable, and to defend that size as one more deliberate lever in the strategy.
Treat the strategy as a living design
A batch strategy is not finished the night it first runs; the best ones are maintained as the workload and the prompt evolve. Volumes drift upward, source documents change shape, and a prompt that sailed through last quarter's sample can start producing a higher failure rate as the data shifts beneath it. An architect who treats the strategy as a living design watches the per-run mix of succeeded, errored, and expired results as a health signal. A creeping rise in invalid-request errors hints that inputs are growing past the context window and the chunking stage needs revisiting; a rise in expirations hints that the batch is competing for capacity and might run earlier or split into smaller pieces.
Capturing the reasoning behind each choice, why a step is batched, why the sample is a given size, how failures are routed, also pays off when someone else inherits the pipeline or when leadership questions the design under cost pressure. A short written rationale turns an implicit set of decisions into something a team can review, challenge, and adjust deliberately rather than by guesswork. This is the difference the evaluate level is really probing: not whether you can stand up a batch once, but whether you can own a strategy that stays correct, cheap, and timely as the conditions around it keep moving.
How this knowledge point shows up on the exam
Strategy questions are scenario-rich and ask you to choose or critique a complete design rather than a single setting. The strongest answer is almost never an absolute; it routes blocking and latency-tolerant work to different APIs, tests before committing, correlates with custom_ids, and recovers selectively. Weak answers betray a missing stage: they batch interactive work, skip sample testing, ignore how results will be matched back, or respond to a partial failure with a full rerun. When a scenario stresses both cost and a deadline, the examiner is checking whether you can hold those in tension and still produce a defensible pipeline rather than collapsing to one priority.
Misconception
A good batch strategy is simply moving the largest workloads to the Batch API to maximise the 50% discount.
What's actually true
Misconception
Because batch and synchronous are different APIs, a single pipeline should commit to one of them for consistency.
What's actually true
An architect proposes this nightly pipeline for tagging 30,000 documents: run the whole job as one batch to get the discount, read the results file top to bottom to apply tags, and if anything fails, rerun the entire batch the next night. Cost is the only stated priority. What is the most important flaw to correct?
People also ask
How do I design a batch processing strategy?
Should I always test prompts before running a full batch?
Can one system mix synchronous and batch processing?
What makes a batch strategy reliable rather than just cheap?
Watch and learn
Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.
Claude Certified Architect: Ep 17 | Batch API & Multi-Pass Review | Full Course Series
Why watch: Certification-focused episode that frames the end-to-end batch strategy, including when to choose batch versus synchronous and how to structure failure handling.
More videos for this concept
References & primary sources
Master this concept with Archie
Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.