Stratified Random Sampling for QA

In short: Stratified random sampling is an ongoing quality check that draws random samples from within each stratum of your output: every document type and every confidence band, including the high-confidence one. Because it deliberately reviews extractions the model felt sure about, it catches novel error patterns that a low-confidence-only review would never see.

What stratified random sampling actually does

Stratified random sampling is a verification technique that refuses to treat your output as one undifferentiated pile. Instead it partitions the output into strata, meaningful groups such as each document type and each confidence band, and then draws a random sample from within every stratum. The result is a small, representative cross-section that guarantees coverage of slices a naive random sample would under-represent or miss. On the Claude Certified Architect exam this is an apply-level knowledge point in Task 5.5, and it builds directly on the aggregate metrics trap: once you accept that reliability lives in the slices, you need a method that keeps watching the slices in production.

The word "random" inside each stratum matters as much as the word "stratified". Stratification guarantees you look at every group; randomness within the group guarantees the items you look at are not cherry-picked or clustered, so the per-stratum accuracy you estimate is unbiased. Together they give you a defensible, ongoing read on quality at a fraction of the cost of reviewing everything. You are not trying to inspect every output; you are trying to keep an honest, continuous estimate of how each part of the system is doing.

Stratified random sampling: A quality-monitoring method that divides extraction output into strata (document types and confidence bands) and randomly samples within each, ensuring every slice is observed and giving an unbiased per-stratum accuracy estimate over time.
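As a minimal sketch of the definition above, the partition-then-draw step could look like the following. It assumes each extraction is a dict with hypothetical `doc_type` and `confidence` keys, and it splits confidence into three bands at 0.7 and 0.9; those thresholds and names are illustrative choices, not part of any exam guide or API.

```python
import random
from collections import defaultdict


def confidence_band(score, low=0.7, high=0.9):
    """Map a confidence score to a band label (thresholds are illustrative)."""
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"


def stratified_sample(records, per_stratum, seed=None):
    """Group records by (doc_type, band) and draw randomly within each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        key = (rec["doc_type"], confidence_band(rec["confidence"]))
        strata[key].append(rec)

    sample = {}
    for key, items in strata.items():
        # Never ask for more items than the stratum holds.
        k = min(per_stratum, len(items))
        # rng.sample draws without replacement, uniformly within the stratum.
        sample[key] = rng.sample(items, k)
    return sample
```

Passing a fixed `seed` makes a draw reproducible for audit. Note that the high band is sampled exactly like the others; nothing in the function treats it as exempt.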

The counterintuitive rule: sample what the model is sure about

The instinct most teams bring is to spend their limited review budget on the extractions the model flagged as low-confidence. That instinct is half right and half dangerous. Routing low-confidence items to a human is the job of field-level confidence calibration, and it is correct as far as it goes. But if low-confidence is the only thing you ever review, you have built a blind spot exactly where it hurts most: the high-confidence band. A model that is confidently wrong produces an extraction that looks trustworthy, carries a high score, sails past any confidence gate, and gets automated without a second look. Those are the errors that reach production and cause incidents.

Stratified random sampling is the antidote because it deliberately spends a slice of the review budget on the high-confidence stratum. You are not reviewing high-confidence items because you expect them to be wrong; you are reviewing them because the only way to discover a systematic, novel failure mode (a new vendor template, a layout change, a field the model has started to hallucinate) is to look where you assumed everything was fine. Anthropic's guidance on reducing hallucinations makes the same point in the language of high-stakes validation: you verify critical outputs rather than assuming a generally strong system is strong everywhere. The high-confidence sample is how that verification stays alive after launch.

Every band: strata sampled, including high-confidence

Random: selection within each stratum, to stay unbiased

Continuous: runs in production, not just pre-launch

Why it is ongoing, not a launch gate

It is tempting to treat sampling as a one-time pre-launch evaluation: run it, see good numbers, and switch it off. That conflates two different jobs. Per-category accuracy validation is the launch gate: it decides, once, whether a category is good enough to automate. Stratified random sampling is the smoke detector that stays on afterwards. Real-world input distributions drift: new document templates appear, an upstream system changes a format, a vendor redesigns an invoice, and a category that passed its gate three months ago starts failing without anyone touching the model. A continuous stratified sample is what turns that silent drift into a visible signal while the damage is still small.
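One way to turn the continuous sample into a drift signal is to compare each stratum's recent reviewed accuracy with the accuracy it had when it passed its launch gate. The sketch below is an assumption about how that comparison might be wired; the function name, the 0.05 tolerance, and the 30-review minimum are all hypothetical.

```python
def drifting_strata(baseline, recent, tolerance=0.05, min_reviews=30):
    """Return strata whose recent accuracy fell below baseline by more than tolerance.

    baseline: dict mapping stratum -> accuracy measured at the launch gate.
    recent:   dict mapping stratum -> (correct, reviewed) from the latest window.
    """
    flagged = []
    for stratum, (correct, reviewed) in recent.items():
        if reviewed < min_reviews:
            # Too few reviews in this window to say anything trustworthy.
            continue
        if stratum not in baseline:
            continue
        accuracy = correct / reviewed
        if baseline[stratum] - accuracy > tolerance:
            flagged.append((stratum, accuracy))
    return flagged
```

A stratum that passed its gate at 0.98 and now reviews at 0.80 would be flagged, while one that dipped to 0.97 would not.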

This is why the technique is framed as an apply-level skill rather than a fact to recall. The exam wants to see that you can design the sampling regime, choosing the strata, setting a rate, and keeping it running, rather than just defining the term. The same monitoring instinct connects this knowledge point (KP) to the rest of Domain 5: just as a persistent case-facts block keeps critical information observable across a long conversation, a persistent sampling stream keeps quality observable across a long deployment.

Setting the sampling rate

A flat "review 2% of everything" is the wrong default, because not every stratum carries the same risk or diversity. The sampling rate should scale with two things: the cost of error in the stratum and how heterogeneous the stratum is. A high-stakes category (regulated forms, anything that triggers a payment) earns a higher rate because a missed error there is expensive. A diverse category, one with many templates or a long tail of formats, also earns a higher rate because there is more surface area for a novel failure to hide. A low-stakes, highly uniform category can be sampled lightly. Rare-but-critical strata may need oversampling relative to their volume precisely because random sampling proportional to volume would hand them too few items to say anything trustworthy.
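The allocation rule above can be sketched as a weighting of each stratum by risk times diversity, with a floor so that no stratum goes unobserved. The scoring scale, the `floor` value, and the field names below are hypothetical; a real team would calibrate them to its own cost of error.

```python
def allocate_budget(strata, total_reviews, floor=20):
    """Split a review budget across strata by risk and diversity, not by volume.

    strata: dict mapping name -> {"risk": float, "diversity": float, "volume": int}
    Returns a dict mapping name -> number of items to review.
    """
    weights = {name: s["risk"] * s["diversity"] for name, s in strata.items()}
    total_weight = sum(weights.values())

    plan = {}
    for name, s in strata.items():
        share = round(total_reviews * weights[name] / total_weight)
        # The floor guarantees coverage; it can push the total slightly over budget.
        # The cap stops the plan from asking for more items than the stratum holds.
        plan[name] = min(s["volume"], max(floor, share))
    return plan
```

With a budget of 320 reviews, a regulated form scored risk 5 and diversity 3 on a volume of 500 would receive 300 reviews, while a uniform receipt stream scored 1 and 1 on a volume of 50,000 would receive 20. Volume only enters as an upper cap.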

The practical design therefore allocates a fixed human-review budget across strata by risk and diversity rather than by raw volume, and it sets each rate high enough that the per-stratum accuracy estimate has a usable confidence interval. You do not need exhaustive review; you need enough sample in each meaningful slice that, if that slice starts to degrade, your numbers move before your customers notice.
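To judge whether a per-stratum sample is large enough to give a usable interval, one common choice is the Wilson score interval for a binomial proportion. The source does not prescribe a particular interval, so treat this as one reasonable option under that assumption; `z=1.96` corresponds to roughly 95% confidence.

```python
import math


def wilson_interval(correct, reviewed, z=1.96):
    """Wilson score interval for the accuracy of one stratum."""
    if reviewed == 0:
        # No reviews means no information: the interval spans everything.
        return (0.0, 1.0)
    p = correct / reviewed
    z2 = z * z
    denom = 1 + z2 / reviewed
    centre = (p + z2 / (2 * reviewed)) / denom
    half = z * math.sqrt(p * (1 - p) / reviewed + z2 / (4 * reviewed**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))
```

For 95 correct out of 100 reviewed, the interval is roughly 0.89 to 0.98. If that is too wide to distinguish a healthy stratum from a degrading one, the stratum needs a higher sampling rate.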

Stratified sampling across types and confidence bands

Every stratum, including the high-confidence band, contributes a random sample so novel and systematic errors surface even where the model felt certain.

Worked example

A claims processor has automated extraction for three document types and reviews only items the model scored below 0.7. Six weeks in, customers report wrong payout amounts on documents the system was confident about.

The team's review policy looked sensible: spend scarce human attention on the low-confidence items. But the complaints are landing on high-confidence extractions, which means the failures are invisible to the current process by construction. An architect redesigns the verification around stratified random sampling. The strata are the three document types crossed with three confidence bands, giving nine cells, and crucially the high-confidence cells are now sampled rather than trusted blindly.

Running it surfaces the root cause within days. One vendor changed its statement layout so that the payout field moved next to a similar-looking "prior balance" field. The model extracts the wrong number, but it extracts a clean, well-formatted number, so it reports high confidence. Because the old policy never sampled high-confidence items, the error had been compounding silently for six weeks. The new stratified sample caught it because it draws from the high-confidence band on purpose. The team also tunes the rates: the changed vendor's document type, now known to be diverse and high-stakes, gets a higher sampling rate than the stable, uniform types.
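The scenario can be reproduced with a toy simulation. Everything below is invented for illustration: the document type names, the counts, and the three fixed confidence scores are assumptions, chosen so that the confidently wrong items sit entirely in the high band of one document type.

```python
import random


def band(score):
    # Illustrative thresholds: below 0.7 is "low", 0.9 and above is "high".
    return "low" if score < 0.7 else ("high" if score >= 0.9 else "medium")


def build_population():
    records = []
    for doc_type in ("claim_form", "invoice", "statement"):
        for score in (0.5, 0.8, 0.95):
            for _ in range(20):
                records.append({"doc_type": doc_type, "confidence": score, "correct": True})
    # The changed vendor layout: wrong payout amounts reported with high confidence.
    for _ in range(40):
        records.append({"doc_type": "statement", "confidence": 0.95, "correct": False})
    return records


def low_confidence_only(records, threshold=0.7):
    """The old policy: review only what the model scored below the threshold."""
    return [r for r in records if r["confidence"] < threshold]


def stratified(records, per_cell, seed=0):
    """The new policy: a random draw from every (doc_type, band) cell."""
    rng = random.Random(seed)
    cells = {}
    for r in records:
        cells.setdefault((r["doc_type"], band(r["confidence"])), []).append(r)
    reviewed = []
    for items in cells.values():
        reviewed.extend(rng.sample(items, min(per_cell, len(items))))
    return cells, reviewed


population = build_population()
old_review = low_confidence_only(population)
cells, new_review = stratified(population, per_cell=30)

errors_seen_old = sum(not r["correct"] for r in old_review)
errors_seen_new = sum(not r["correct"] for r in new_review)
print(len(cells), errors_seen_old, errors_seen_new)
```

The old policy sees none of the confident errors by construction, because none of them score below 0.7. The stratified draw from the nine cells must see some: the high-confidence statement cell holds 20 correct and 40 wrong items, so any 30 drawn from it include at least 10 wrong ones.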

The lesson the scenario teaches is the one the exam rewards. Reviewing only low-confidence output optimises for the errors the model already suspects, while the expensive incidents come from the errors it does not suspect. Stratified random sampling is the only one of the Task 5.5 techniques that systematically looks where confidence is high, which is exactly why it belongs in the workflow alongside, not instead of, confidence-based routing.

Common misreadings to avoid

The exam baits this KP with the same shortcut every team is tempted by: skip the high-confidence band to save effort.

Misconception

To use review capacity efficiently, you should only sample the low-confidence extractions, since the high-confidence ones are almost certainly correct.

What's actually true

Reviewing only low-confidence items blinds you to confident errors, which are the ones that reach production and cause incidents. Stratified random sampling deliberately includes the high-confidence band so novel, systematic failures are caught even where the model felt sure.

Misconception

A single random sample of the whole corpus is just as good as stratifying, as long as it is large enough.

What's actually true

A plain random sample is dominated by the high-volume, easy strata and under-represents the rare, high-stakes ones, recreating the aggregate metrics trap. Stratifying guarantees every meaningful slice is observed, then randomising within each keeps the per-stratum estimate unbiased.

Keeping the sample honestly random

Stratification guarantees coverage of every slice, but the sample inside each stratum still has to be genuinely random, and several everyday shortcuts quietly break that. The most common is convenience sampling: a reviewer opens the queue and checks the first twenty items of the morning. Those items are not random; they are whatever arrived first, which often correlates with a particular customer, batch, or upstream system, so the estimate they produce is biased toward that source. A sample taken by hand from the top of a list is a sample of the list ordering, not of the population.

Three further pitfalls deserve naming. The first is clustering: if you draw whole batches rather than individual items, a single bad batch can swing the estimate wildly in either direction, because items within a batch are correlated. Selecting items independently across batches keeps the estimate stable. The second is time: input distributions move, so a sample taken only in a quiet first week tells you little about a busy month with new templates. An honest regime spreads its draws across time so seasonal and template drift are represented in the numbers. The third is reviewer fatigue, a quieter bias still: attention flags late in a shift and marginal errors get waved through. Rotating reviewers and capping daily review load keeps the human judgements that anchor the whole sample trustworthy, which is the entire point of sampling in the first place. None of these guards is exotic; they are simply the difference between a number you can act on and one that merely looks like measurement.
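The guards against convenience sampling, clustering, and time bias can be combined in one draw. The sketch below assumes each item carries hypothetical `id` and `week` keys; it draws individual items rather than whole batches, and it takes the same number from every time window instead of the top of the queue.

```python
import random
from collections import defaultdict


def draw_across_time(items, per_window, seed=None):
    """Draw individual items at random from every time window of one stratum."""
    rng = random.Random(seed)
    by_window = defaultdict(list)
    for item in items:
        by_window[item["week"]].append(item)

    picked = []
    for week in sorted(by_window):
        pool = by_window[week]
        # Item-level draw from the whole window: batch membership and arrival
        # order play no part in which items are chosen.
        picked.extend(rng.sample(pool, min(per_window, len(pool))))
    return picked
```

Reviewer rotation and daily load caps are process controls rather than code, so they are left out of the sketch.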

Grounding the sample in structured outputs

Stratified sampling assumes you can cleanly slice output by field and by confidence band, and that assumption only holds if the extraction itself returns structured, parseable data. This is where Claude's structured output features do the heavy lifting. Defining the target fields as a schema and using JSON output (the output_config.format control) makes every extraction land in the same shape, so a sampler can pull the date field or the total field out of thousands of documents without bespoke parsing. When schema fidelity is critical, strict tool use (setting strict: true on the tool) goes further and validates the tool name and input against the schema, so a malformed or hallucinated field cannot silently enter the population you are trying to audit.

It helps to keep the two controls distinct, because the exam does. JSON output governs the shape of Claude's response, while strict tool use validates the parameters of a tool call. They solve different problems and can be combined. Anthropic's tool-use guidance also warns that every optional parameter increases the grammar's complexity, so the practical advice is to mark only the genuinely critical tools as strict and to keep the schema lean. A tidy, validated schema is what makes the strata well-defined in the first place, because you cannot reliably stratify by a field that arrives in an unpredictable place.
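To make the link between a lean schema and well-defined strata concrete, here is a sketch of a record schema and the function that turns one record into a stratum key. The schema, its field names, and the idea that the record carries a `total_confidence` value are all hypothetical; how the schema is supplied to the model is deliberately left out, and the check shown is a simple required-field test rather than full schema validation.

```python
import json

# Hypothetical schema for one extraction record. Every field is required,
# which keeps the schema lean and avoids optional parameters.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "doc_type": {"type": "string", "enum": ["invoice", "claim_form", "statement"]},
        "total": {"type": "number"},
        "total_confidence": {"type": "number"},
    },
    "required": ["doc_type", "total", "total_confidence"],
    "additionalProperties": False,
}


def stratum_key(raw_json, band_edges=(0.7, 0.9)):
    """Parse one schema-shaped record and return its (doc_type, band) stratum."""
    record = json.loads(raw_json)
    missing = [name for name in EXTRACTION_SCHEMA["required"] if name not in record]
    if missing:
        # Refuse to stratify a record that lacks a field the strata depend on.
        raise ValueError(f"record missing required fields: {missing}")

    low_edge, high_edge = band_edges
    score = record["total_confidence"]
    if score < low_edge:
        band = "low"
    elif score < high_edge:
        band = "medium"
    else:
        band = "high"
    return (record["doc_type"], band)
```

Because every record has the same shape, the sampler needs no per-document parsing: the stratum is a pure function of two fields.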

One thing structured output does not give you is the sampling itself. Anthropic's API has no built-in stratified-sampling feature; the strata, the per-band draw, and the drift alerting are all application-level machinery you build around the model. Claude's job is to emit clean, schema-valid records, and your job is to sample them honestly. Keeping that boundary clear is what stops architects from assuming a managed QA primitive exists where there is only a well-shaped output to sample.

How this is tested on the exam

In Scenario 6 (Structured Data Extraction), Task 5.5 questions describe a verification scheme and ask you to improve it or to spot why errors are escaping. The strongest answer almost always introduces stratification and, specifically, sampling the high-confidence band, because that is the gap most naive schemes share. Distractors offer a flat overall sampling rate, or propose reviewing only low-confidence items, or suggest a one-off pre-launch check with no ongoing monitoring. Each is wrong in a way the KP names: the flat rate recreates the aggregate metrics trap, the low-confidence-only scheme is blind to confident errors, and the one-off check cannot catch drift.

Carry the apply-level instinct into the exam: design the regime, do not just define the word. Choose strata that matter (type and confidence band), set rates by risk and diversity, sample the band you were tempted to trust, and keep it running. That design feeds straight into the capstone human review workflow, where sampling is one of the instruments you compose into a complete reliability process.

Check your understanding

An extraction pipeline routes every item scored below 0.7 to human review and automates the rest. Quality looked fine at launch but wrong values are now reaching customers on high-confidence documents. Which change best addresses the gap?
