Precision vs Recall in Claude Review Prompts

In short: Precision is the share of a reviewer's flagged issues that are real; recall is the share of all real issues it manages to catch. They trade off: broadening criteria to raise recall tends to raise false positives and lower precision, so designing a review prompt means choosing and justifying an operating point rather than maximising both.

What the precision vs recall trade-off means for a review prompt

The precision vs recall trade-off is the central judgement call in designing an automated reviewer, and it is why Task Statement 4.1 sits at the evaluate level of Bloom's taxonomy rather than apply. Precision asks: of everything the reviewer flagged, what fraction was a genuine problem? Recall asks: of every genuine problem present, what fraction did the reviewer catch? A perfect tool would score high on both, but in practice moving one moves the other, and your job as an architect is to choose where to sit and defend the choice.

The trade-off appears the moment you tune a review prompt. Broaden the criteria, add more finding categories, lower the bar for raising a comment, and recall climbs because the reviewer catches more real issues. The same change also drags in more false positives, so precision falls. Tighten the criteria and the opposite happens. Understanding precision vs recall is understanding that these two dials are connected, so you cannot turn one without considering its effect on the other.

Precision and recall: Precision is the proportion of a reviewer's flagged issues that are genuine; recall is the proportion of all genuine issues that the reviewer flags. Raising recall by widening criteria typically lowers precision by admitting more false positives, and vice versa.

Why low precision quietly destroys recall in practice

It is tempting to treat recall as the safe default, on the logic that catching more is always better. In a human-read review system that logic backfires. Every false positive spends a developer's attention: they read the comment, investigate, conclude it is wrong, and dismiss it. A handful is tolerable. At volume it produces alert fatigue, the well-documented failure where reviewers stop engaging with a noisy tool altogether.

That is the cruel twist of the trade-off. A reviewer tuned for maximum recall but poor precision looks excellent on paper and fails in the field, because once developers stop reading it, its true findings go unread too. Effective recall collapses to near zero no matter what the metric says. This is the same dynamic the false positive trust problem describes, and it is why most production review tools deliberately favour precision: a tool that is trusted and read beats a tool that is comprehensive and ignored.

The lever that beats the trade-off: explicit criteria with examples

The trade-off is real, but it is not a straight line you can only slide along. Some interventions shift the whole curve outward, buying precision without paying it back in recall. The most reliable is the one this task statement keeps returning to: explicit criteria backed by concrete examples. Vague criteria force the model to guess at the boundary, and guesses are wrong in both directions, costing precision and recall at once. Sharpen the criteria with examples and the model stops guessing, so it catches the real issues (recall holds) while flagging fewer phantom ones (precision rises).

This is why designing explicit criteria for review systems is a hard prerequisite here. Anthropic's guidance on defining success criteria makes the measurement side concrete: state your target as something like an F1 score on a held-out test set, then iterate against it. F1 is the harmonic mean of precision and recall precisely because it refuses to let you win on one by tanking the other, which is exactly the discipline this knowledge point demands.

precision

share of flags that are real

recall

share of real issues caught

the metric that balances both

How the trade-off plays out across a review pipeline

Seeing the causal chain makes the trade-off easier to reason about under exam pressure. A change to the criteria flows through the reviewer to the developers and, crucially, loops back to affect how much of the output is ever read.

How a criteria change propagates through precision and recall

Loading diagram...

Maximising recall through broader criteria can backfire via alert fatigue; explicit examples shift the whole curve instead of sliding along it.

A third option: temporarily disabling a problem category

Choosing an operating point is not always a global decision. Often the precision problem lives in one or two finding categories that generate most of the false positives, while the rest of the reviewer is healthy. In that situation a valid and underused move is to disable the noisy category temporarily, restoring overall precision and trust while you work on it, rather than degrading the entire prompt to compensate for one bad actor.

Disabling a category is a deliberate, reversible trade: you knowingly drop some recall in that narrow area to protect precision everywhere else, then re-enable it once it has been fixed. Evaluating when that trade is worth making, and being able to justify it, is the evaluate-level skill this knowledge point assesses. It is also the on-ramp to category-specific prompt iteration, where the disabled category gets repaired in isolation before it rejoins the live review.

A worked example: justifying an operating point

Worked example

A security team runs a Claude reviewer across a monorepo. Leadership wants every vulnerability caught, so the team maximises recall, and within a month developers are ignoring the bot.

The team's first instinct was recall at all costs: broad categories, a low bar for raising a comment, and instructions to err on the side of flagging. Measured against a labelled test set, recall was an impressive 0.94. Precision, though, was 0.31, meaning roughly two of every three comments were noise. Developers learned within weeks that most red flags were nothing and began merging without reading the reviewer.

Asked to evaluate the situation, the team reframed it. The real-world goal was not the recall number; it was vulnerabilities actually fixed, which requires developers to read the tool. They rewrote the criteria explicitly, added boundary examples for the categories driving the false positives, and temporarily disabled one category that was almost pure noise. Precision rose to 0.82 while recall on the remaining categories held at 0.88, and engagement recovered because people trusted the output again.

Their justification is the model answer for this knowledge point: an operating point is chosen against the true objective, not against a single metric. For a human-read reviewer that objective is fixes shipped, which depends on trust, which depends on precision. They could articulate why a slightly lower headline recall produced far more real-world value, and that act of justification is exactly what the evaluate level is testing.

Common misreadings to avoid

Because this is an evaluate-level skill, the exam rewards judgement over recall of definitions. The two traps below are the ones that catch people who memorised the terms but not the trade.

Misconception

A good reviewer should always maximise recall so that no issue is ever missed.

What's actually true

Maximising recall usually tanks precision, and the resulting false positives cause alert fatigue until developers ignore the tool, which collapses effective recall anyway. The right operating point depends on the cost of a miss versus the cost of a false alarm, not on recall alone.

Misconception

Precision and recall are independent dials you can turn up at the same time.

What's actually true

With a fixed prompt they trade off: widening criteria to catch more raises false positives. Only a structural change, such as adding explicit criteria with examples, can lift one without lowering the other, which is why that lever matters so much.

Measuring the two numbers before you tune

You cannot manage a trade-off you have not measured, and this is where many review systems go wrong: they are tuned by vibe, on the strength of whichever recent comment annoyed someone most. The fix is a small labelled sample, a held-out set of diffs where you already know which findings are real. Run the reviewer over it and you can compute both metrics as plain fractions, and you can watch them move as you change the prompt. Anthropic's guidance on developing test cases stresses building this kind of representative set up front, because without it every tuning decision is a guess.

The sample does not have to be elaborate to be decisive. A few dozen diffs that span your common defect types give enough signal to see whether a change traded coverage for accuracy or genuinely improved both. The discipline is to change one thing, re-run the sample, and read the numbers together, never celebrating a gain on one side without checking what happened to the other.

When recall should win: the high-stakes exception

Favouring accuracy is the default for a human-read reviewer, but it is not a universal law, and the exam likes to probe whether you understand the exception. Some review gates guard against failures so costly that a single miss outweighs a great deal of noise. A check that must never let a known class of security vulnerability reach production is one of these: a missed finding is a potential breach, while a false alarm is merely an annoyance, so the rational operating point leans hard toward catching everything.

The judgement, then, is not one metric good and the other bad; it is matching the operating point to the asymmetry of the costs. Weigh the cost of a miss against the cost of a false alarm for this specific gate, and let that asymmetry decide. An architect who can name when to flip the usual preference, and justify it from the costs rather than a slogan, is demonstrating exactly the evaluate-level reasoning this knowledge point exists to test.

Why F1 keeps you honest

When you need a single number to track, the F1 score, the harmonic mean of the two metrics, is the one to reach for. Its value is that it refuses to be fooled by a lopsided result: a reviewer with brilliant coverage and dismal accuracy scores poorly on F1, just as the field would judge it. Reporting all three figures together for each tuning pass keeps the conversation grounded in evidence and stops anyone from cherry-picking the single number that flatters their change.

When a more capable model genuinely shifts the curve

Sharpening the prompt is the lever this task statement emphasises, but it is not the only thing that moves the whole curve rather than sliding along it. Model capability does too. Anthropic reports that its newer frontier models score higher on both precision and recall for review-style tasks, which means upgrading the model can lift accuracy and coverage together instead of trading one for the other. Model choice therefore belongs in an architect's toolkit alongside prompt design, not beneath it.

The catch, and the reason exam stems so often list switching to a bigger model as a distractor, is that capability is rarely the binding constraint when a reviewer is drowning in noise. A vague, example-free prompt produces false positives on any model, because the boundary it is guessing at is undefined for all of them. Anthropic's prompting guidance is blunt that the first move is to make the bar concrete, for example reporting bugs that could cause incorrect behaviour, a test failure, or a misleading result while omitting pure style and naming nits. Reach for a stronger model when a well-specified prompt has hit a genuine capability ceiling, not as a substitute for writing the criteria down in the first place.

How it shows up on the exam

Inside the continuous-integration review scenario, this knowledge point is where the exam stops asking what to build and starts asking what you would choose. A question typically hands you a reviewer with a lopsided profile, sky-high recall and dismal precision, or the reverse, and asks for the change that best serves the stated goal. The credit-earning answers reason about the trade-off: favour precision for a human-read tool because trust drives engagement, lift both with explicit criteria and examples, or disable a noisy category temporarily rather than wreck the whole prompt. Distractors usually propose maximising one metric in isolation or tweaking a sampling parameter, which signals a candidate who knows the words but not the judgement. If you can defend an operating point against the real objective, this family of questions is yours.

Check your understanding

A Claude code reviewer scores 0.95 recall and 0.30 precision on a labelled test set, and developers have begun ignoring its comments. Leadership still wants strong coverage. Which response best evaluates the trade-off?

Watch and learn

Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.

No videos curated for this concept yet

We are still curating the best official and community videos for this topic.

References & primary sources

Adaptive study

Master this concept with Archie

Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.

Start studying

Precision vs Recall Trade-off in Claude Review Prompts