- In short
- The Message Batches API is an asynchronous Claude endpoint that processes large volumes of Messages requests at 50% of standard token prices. In exchange for that discount it gives up real-time delivery: results arrive when the whole batch finishes or after a 24-hour ceiling, with no latency guarantee and no way to run an interactive tool-calling loop inside a single request.
What the Message Batches API actually is
The Message Batches API is Anthropic's asynchronous path for running large volumes of Claude requests at a discount. Instead of calling the model once and blocking for an answer, you hand it a list of independent requests, the system processes each one on its own schedule, and you come back later to download the results. It is purpose-built for bulk work that nobody is sitting and waiting on: overnight evaluations, content moderation sweeps, dataset enrichment, and large-scale summarisation.
For the Claude Certified Architect exam you are not asked to write batch code from memory. You are asked to recall the constraints that make batch the right or wrong tool for a scenario. Every one of those constraints flows from a single design choice: the endpoint is asynchronous, so it can schedule your work whenever capacity allows rather than answering on demand. Understand that and the cost, the window, and the limitations all follow logically.
- Message Batches API
- An asynchronous Anthropic endpoint that accepts a list of independent Messages requests, processes each one separately at 50% of standard token prices, and returns the combined results once the batch ends or a 24-hour ceiling is reached.
The cost trade is the whole point
Batch usage is charged at exactly half of the standard Messages API price, for both input and output tokens. That 50% discount is the reason the feature exists, and it is the single fact the exam most reliably tests. If a scenario describes a finance team running a nightly report over thousands of records and asks how to cut spend, the discount is the lever. There is no quality penalty for choosing batch; the model and the prompt behave identically. You are only trading away when you receive the answer.
The saving compounds with prompt caching, because the two discounts stack. A batch that reuses a large shared system prompt across every request can pay half price on tokens and still benefit from cache hits on the repeated prefix. For the exam, the headline to hold is simpler: batch is the half-price option for work that can wait.
The processing window and why there is no SLA
When you submit a batch it starts in a processing status of in_progress. The system works through the requests as fast as capacity allows, and most batches finish in well under an hour. Crucially, though, there is no latency guarantee. You may retrieve results when every request has finished or after 24 hours, whichever comes first, and any request still unprocessed at the 24-hour mark simply expires. Expired requests are never billed, but they are also never answered, so a batch is not a promise of completion by a deadline you control.
This is exactly the trap the exam sets for this knowledge point. A candidate who assumes batch results will be ready in minutes, or who wires a batch into a workflow that needs an answer within a fixed SLA, has misread the contract. The honest mental model is "sometime in the next 24 hours, probably soon, but I cannot bank on it." Results, once produced, remain downloadable for 29 days before they age out.
The hard limits you must recall
A single batch is capped at 100,000 Message requests or 256 MB of total payload, whichever it hits first. If you need to process more than that, you split the work across multiple batches. Two smaller details round out the constraint set and occasionally surface in questions: every batched request must set max_tokens to at least 1, and a batch is scoped to the Workspace whose API key created it, so visibility and spend tracking stay inside that boundary.
- Volume ceiling: up to 100,000 requests per batch.
- Size ceiling: up to 256 MB of request payload per batch.
- Time ceiling: results within 24 hours or the remainder expires.
- Token floor:
max_tokensof at least 1 on every request.
Rate limits apply on both sides of a batch
The size caps are not the only ceilings to respect, and the exam occasionally probes whether you know that rate limits apply on two separate layers. First, the ordinary HTTP rate limits still govern your calls to the Batches API itself: creating, listing, and polling batches all count against your request limits like any other API traffic, so a tight polling loop can throttle you just as it would on the synchronous API. Second, a distinct limit caps how many requests across your in-flight batches can be waiting to be processed at once, which bounds how much work you can have queued regardless of how many batches you have created.
The practical reading is that submitting one enormous batch does not let you sidestep throughput limits, because the queued requests still count against that in-flight ceiling. One more floor sits alongside these: a batched request cannot set max_tokens to 0, so the cache pre-warming trick some teams use synchronously is explicitly unsupported inside a batch. Plan your submission cadence and your polling interval with both rate-limit layers in mind, so a large job does not throttle the very calls you rely on to track it.
The constraint that catches architects: no interactive tool loop
The subtlest limitation of the Message Batches API is what it means for agentic work. Almost any single Messages request you can make synchronously can also go in a batch, including system prompts, vision, extended thinking, and even tool definitions. What you cannot do is run a multi-turn tool-calling loop inside one batched request. An interactive loop requires the model to emit a tool call, your code to execute that tool, and the conversation to continue with the result fed back in. Because a batch request is processed as a single asynchronous shot with no channel back to your code mid-flight, that round trip has nowhere to happen.
The practical reading for the exam: batch is for single-shot completions, not for autonomous agents that drive their own tool usage across turns. If a scenario describes an agent that must call a tool, inspect the result, and decide what to do next, that work belongs on the synchronous API, no matter how cost-sensitive the team is. You can read the full set of supported and unsupported parameters in Anthropic's batch processing guide.
Worked example
A data team must classify 40,000 archived support tickets by topic. The output feeds a quarterly trends report due next week, and the budget is tight.
Walk the constraints against the requirement. First, latency: the report is due next week, and no human is blocked waiting on any individual classification, so the 24-hour window is comfortably inside the deadline. That alone makes the work latency-tolerant, which is the precondition for batch.
Second, cost: 40,000 classifications is a meaningful spend at full price, and the half-price batch rate cuts it cleanly in two with no change to the prompt or the model. That is the lever the scenario is hinting at.
Third, volume: 40,000 requests sits well under the 100,000 ceiling, so the whole job fits in a single batch as long as the combined payload stays under 256 MB. If the tickets were larger and pushed past the size cap, the only adjustment would be to split them across two batches.
Fourth, shape: each ticket is a single-shot classification. There is no tool call, no follow-up turn, no agent deciding what to do next. So the no-interactive-loop limitation never bites. Every constraint lines up in favour of batch, which is why this is a textbook batch workload and a poor fit for synchronous calls.
What a batch returns
Once a batch reaches the ended status, each request carries one of four result types, and recalling them helps you reason about the others in this task statement. A succeeded result includes the model message. An errored result means the request failed, for example an invalid request or a server error, and you are not billed for it. A canceled result appears when you stopped the batch before that request ran. An expired result means the 24-hour ceiling arrived first. Because results can come back in any order, each one is tagged with the custom_id you assigned, which is the thread that ties results back to inputs.
You do not need to memorise the JSON shape for the foundations exam, but you should recognise that a batch never silently swallows failures. It reports a clear per-request status, which is what makes the failure-handling and correlation patterns in the rest of this task statement possible.
A drop box, not a phone call
The most durable way to remember every constraint at once is to picture the right mental model. A synchronous Claude call is a phone call: you ask, you wait on the line, and you get your answer before you hang up. The asynchronous batch path is a drop box: you post a stack of envelopes, walk away, and come back later to collect whatever has been processed. Almost every property you need to recall falls out of that single image. You do not stand at the drop box waiting, which is why there is no latency guarantee. The box is emptied on its own schedule, usually quickly but with a firm 24-hour collection deadline after which unposted items are returned unprocessed. And because you posted a whole stack at once, the replies come back jumbled, which is why each envelope needs a label to find its way home.
The drop box also explains the discount. Handling post in bulk on a flexible schedule is cheaper than answering every call the instant it rings, so the provider passes that efficiency back as the half-price saving. It explains the volume cap too: a single drop can hold a great many envelopes, up to 100,000 of them or 256 MB, but not an unlimited number. And it explains the no-interactive-loop rule, because a drop box cannot hold a back-and-forth conversation; it takes your stack and returns answers, with no chance for you to read one reply and post a follow-up before the rest are handled. Hold the image and the facts stop being a list to memorise and become a story that hangs together, which is exactly the kind of durable recall the exam is checking for on this foundational knowledge point.
How this knowledge point shows up on the exam
Domain 4 (Prompt Engineering and Structured Output) is weighted at 20%, and batch processing is its most operational corner, surfacing in the structured-extraction and continuous-integration scenarios. Questions on this knowledge point are recall-and-recognition: they describe a workload and expect you to know whether the constraints fit. The reliable answer pattern is to test the scenario against three facts in order. Is anyone waiting on the result? If yes, batch is wrong. Is the work a single-shot completion rather than an interactive loop? If no, batch is wrong. Does cost matter at volume? If yes, batch is the half-price win. Master those three checks and you can answer almost any constraint question this knowledge point can pose, and you are ready to apply the synchronous versus batch decision rule that builds directly on top of it.
Misconception
The Message Batches API returns results within a few minutes, like a slightly slower synchronous call.
What's actually true
Misconception
Because tool use can appear in a batch, I can run an autonomous agent inside one batched request.
What's actually true
A team wants to halve the cost of an agent that books travel by calling a search tool, reading the results, then calling a booking tool within the same conversation. They ask whether moving this agent to the Message Batches API will cut the bill by 50%. What is the correct response?
People also ask
How much does the Claude Message Batches API cost?
How long does a Claude batch take to process?
How many requests can a single Claude batch hold?
Does the Batches API support multi-turn tool calling?
Watch and learn
Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.
How to Reduce Claude API Costs with Batch Processing
Why watch: Walks through the Message Batches API and its core value proposition of cutting Claude API costs by 50% versus single synchronous requests, grounding the constraint that batch trades latency for cost.
More videos for this concept
References & primary sources
Master this concept with Archie
Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.