- In short
- Per category accuracy validation measures extraction accuracy separately for every document type and every field segment, then automates only the categories that clear the threshold on their own. It turns one go or no-go decision into a grid of independent decisions, so strong categories ship while weak ones stay under review.
What per category accuracy validation requires
Per category accuracy validation is the disciplined answer to the question the aggregate metrics trap leaves open: if one blended number cannot be trusted, what do you measure instead? You measure accuracy for every category on its own. A category here is a meaningful slice of the work: a document type (invoice, contract, claim form) and, within it, a field segment (dates, monetary amounts, names). The validation produces not a single score but a grid, and each cell of that grid is an independent verdict on whether that particular type-and-field combination is reliable enough to run without a human. This is an apply-level knowledge point in Task 5.5, because the skill is building and acting on the grid, not just agreeing that averages can mislead.
The reframing that makes this powerful is treating automation as a per-cell decision rather than a single switch. The naive model is binary: the pipeline is either good enough to automate or it is not. Per category accuracy validation replaces that with a portfolio: each category is evaluated against the threshold by itself, the ones that pass are automated, and the ones that fail stay under review or go back for improvement. You stop asking "is the system ready?" and start asking "which parts of the system are ready?", which is almost always a more useful and more honest question.
- Per-category accuracy validation
- Measuring extraction accuracy separately for each document type and field segment, producing a grid of independent results, and automating only the cells that meet the threshold on their own while the rest remain under human review.
Two axes: document type and field segment
The grid has two axes, and the exam expects you to use both. The first axis is document type, because different layouts and vocabularies fail in different ways: a clean digital invoice is easy, a scanned contract is hard, and a handwritten form is harder still. Validating per type stops a roomful of easy invoices from vouching for a stack of difficult contracts. The second axis is field segment, because even within one document type the fields are not equally extractable: a boxed total is reliable while a date buried in prose is not, and a single per-type score would average those together and hide the weak field. Crossing the two axes gives you the true unit of reliability, the cell, and it is the cell that gates automation.
Reasoning at the cell level is what separates this KP from a vague instinct that "you should test more." It tells you exactly what to test (every type crossed with every field), exactly what to compute (accuracy in isolation per cell), and exactly what to do with the result (automate the passing cells, hold the rest). Anthropic's evaluation guidance points the same way: it stresses designing evals that mirror the real task distribution and deliberately cover edge cases, which in extraction means building labelled test data for each type and field rather than one generic mixed set that quietly over-weights the easy majority.
The decision rule: meet the threshold in isolation
The rule that operationalises the grid is short and strict: a category may be automated only if it meets the accuracy threshold on its own, with no help from the corpus average. This is the exact inverse of the aggregate metrics trap. There, a strong majority lifted a weak minority over the line; here, every category must stand on its own feet. A contract type at 78% does not get automated because invoices are at 99%; it stays under review until it improves, regardless of how good the rest of the pipeline looks.
That isolation rule is also what makes partial automation safe and defensible. Because the decisions are independent, automating invoices tells you nothing about the safety of automating contracts, and you do not have to wait for the weakest category to catch up before capturing value from the strongest. You ship the cells that pass, you protect the cells that fail, and you can point at the grid to justify exactly why each cell is where it is. The same per-slice rigor connects this KP to field-level confidence calibration, which decides routing within a category, and to stratified random sampling, which keeps re-measuring the cells after launch so a passing category cannot silently slip below its threshold.
Worked example
A shared-services team wants to automate extraction across four document types: purchase orders, invoices, vendor contracts, and expense receipts. Leadership wants a single yes-or-no on automation by Friday.
The request is framed as a binary, and an architect who has internalised per category accuracy validation reframes it before answering. They build a labelled test set for each of the four types, then score each type and break each score down by field. The grid tells a far more useful story than a yes or no. Purchase orders and invoices clear the threshold on every field. Expense receipts pass on vendor and amount but fail on date, because receipts photograph poorly and dates blur. Vendor contracts pass on party names but fail on renewal terms and effective dates, which hide in dense clauses.
So the honest answer to "should we automate?" is "yes for two types fully, partially for a third, and not yet for the fourth." Purchase orders and invoices are fully automated. Expense receipts are automated except for the date field, which is routed to a human. Vendor contracts stay under review while the team improves the two failing fields, perhaps with a retry-with-error-feedback pattern or better prompting, and re-validates. Crucially, the team captures the large value of automating the two strong types immediately instead of holding the whole project hostage to the weakest one.
Contrast the binary the leadership asked for. A single yes would have automated the failing contract fields and produced exactly the expensive errors per category validation exists to prevent; a single no would have thrown away the easy, safe wins on purchase orders and invoices. The grid converts a false dilemma into a precise, staged rollout, which is the apply-level outcome the exam rewards.
Common misreadings to avoid
The exam tests this KP by tempting you to make one decision for the whole pipeline, in either direction.
Misconception
If the pipeline's overall accuracy clears the threshold, it is fine to automate every document type at once.
What's actually true
Misconception
If any important category is not yet accurate enough, you should hold the entire pipeline back from automation until everything passes.
What's actually true
Building a labelled test set for each category
Per category validation is only as trustworthy as the labelled data behind it, and that data has to be built per category rather than scraped from one generic pile. For each document type you assemble a set of examples with verified ground-truth values, large enough that the accuracy you measure for each field has a usable confidence interval rather than bouncing on a handful of cases. A rare document type needs proportionally more attention here, because a category you see infrequently is exactly the one a thin test set will misjudge, and it is often the high-stakes one. The set should also reflect the real variety within the category: different vendors, layouts, scan qualities, and the awkward edge cases the production stream actually contains, not a tidy selection of the easiest examples.
These sets are living assets, not artefacts you build once. As new templates appear and the input distribution shifts, the test set has to be refreshed, ideally with the very items that ongoing sampling and human review surface as fresh ground truth. A category that passed against a stale set can quietly fall below threshold against current reality, which is why validation and monitoring are two halves of one discipline rather than separate steps.
When a category sits right on the threshold
The cleanest decisions involve categories well above or well below the bar. The judgement that earns the evaluate-adjacent difficulty is what to do with a category sitting right on it, where the measured accuracy is close to the threshold and the confidence interval straddles the line. The disciplined default is to treat a borderline category as not yet passing and keep it under review, because the cost of wrongly automating a failing category usually dwarfs the saving from automating a marginal one. You narrow the uncertainty by enlarging that category's test set until the interval clears the line one way or the other, rather than acting on a noisy point estimate. Treating the threshold as a hard line tested with real statistical confidence, not an approximate target you eyeball, is what keeps partial automation genuinely safe instead of merely plausible.
When the category is a classifier, not a document type
Per category accuracy validation is framed here around document types and field segments, but the same discipline governs the other place extraction pipelines make categorical decisions: a classification or routing step that sorts each document into a type before extraction even begins. If that classifier is wrong, every downstream per-category number is measured against the wrong category, so the router itself has to clear per-category accuracy before you automate around it. Anthropic's classification guidance gives the concrete mechanism: assemble a labelled test set, run the classifier over it, and compare the predicted category to the expected category for each item, reporting accuracy per class rather than only overall.
Two details from that guidance matter for the exam. First, make the classification deterministic: Anthropic's classifier example sets temperature to 0.0 so the same input yields the same label, which is what makes a measured per-class accuracy reproducible rather than a moving target. Second, make the label space explicit, defining each category unambiguously in the prompt (Anthropic uses structured, XML-tagged category definitions) so the model and your test set agree on what each class means. A fuzzy category boundary produces disagreement that looks like model error but is really a specification gap.
The broader eval guidance reinforces the per-category instinct. Anthropic recommends balanced problem sets that include cases where a behaviour should and should not occur, and it stresses reading the actual transcripts of failures before trusting any automated grade. Both map straight onto category-level validation: a balanced set stops a dominant class from hiding a failing rare one, and transcript review is how you learn whether a category is failing because the model is weak or because the category was ill-defined.
How this is tested on the exam
In Scenario 6 (Structured Data Extraction), Task 5.5 questions present a multi-type pipeline and a pressure to give a single automation verdict. The strongest answer breaks accuracy down by document type and field, then automates only the categories that individually pass while keeping the rest under review. Distractors push the two failure modes this KP names: automate everything because the overall number is high (the optimistic version of the aggregate metrics trap), or automate nothing until every category is perfect (the pessimistic version that throws away safe value). A weaker distractor will validate by type but forget the field axis, missing that a type can pass on average while one of its fields fails.
Per category accuracy validation is a hard prerequisite of the capstone human review workflow, where the grid becomes the foundation: it decides which categories are automated at all, before calibration decides routing within them and sampling keeps watching them. Bring the grid to the exam, two axes, isolation rule, partial automation, and these questions become a matter of reading which cell the scenario is really asking about.
A team must decide whether to automate extraction across purchase orders, invoices, and vendor contracts. Testing shows purchase orders and invoices clear the accuracy threshold on every field, but contracts fail on renewal-term and effective-date fields. What is the best course of action?
People also ask
What is per-category accuracy validation?
How do you validate extraction accuracy by document type?
Should you automate every document type at once?
What accuracy threshold is needed to automate?
Watch and learn
Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.
LLM evaluation methods and metrics
Why watch: Explains ground-truth based evaluation and accuracy metrics, the foundation for validating extraction accuracy per document type and field before automating.
More videos for this concept
References & primary sources
Master this concept with Archie
Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.