AI Skill Certs
Prompt Engineering & Structured Output·Task 4.1·Bloom: apply·Difficulty 3/5·9 min read·Updated 2026-06-07

Severity Calibration with Code Examples for Claude Reviews

Design prompts with explicit criteria to improve precision and reduce false positives

By Solomon Udoh · Reviewed by Solomon Udoh · AI-assisted, human-reviewed
In short
Severity calibration is the practice of defining each review severity level (critical, major, minor) with a concrete code example of what that level looks like, rather than a prose adjective such as "high impact". The examples become a shared anchor, so the model classifies the same finding far more consistently from run to run instead of guessing where a vague label applies.

What severity calibration means in a review prompt

Severity calibration is the step where you tell an automated reviewer not just what to look for but how bad each thing is, using examples it cannot misread. When you ask Claude to review a pull request, it will find issues. The hard part is ranking them, because a flood of findings that are all marked the same way is almost as useless as no review at all. Calibration is how you make the ranking reliable.

The mistake almost everyone makes first is to describe severity in prose: "critical means high business impact, minor means low impact." Those words feel precise to the author and read as fuzzy to the model. "High impact" is a judgement that depends on context the prompt never supplied, so the model fills the gap with its own guess, and that guess changes from run to run. Severity calibration removes the guess by pinning each level to a concrete code example, which is the unambiguous anchor this knowledge point is built on.

Severity calibration
Defining each review severity level with a concrete code example of what that level looks like, instead of a prose adjective, so the model assigns the same finding to the same level on every run.

Why prose descriptions of severity fail

Prose severity definitions fail for the same reason vague instructions fail everywhere in prompt engineering: they assume the reader shares your mental model. A senior engineer reading "critical" pictures an exploitable security hole. A junior engineer pictures a failing test. The model, having read millions of codebases with millions of conventions, has the widest range of all and will pick whichever interpretation the surrounding tokens nudge it toward. The result is a reviewer that is internally inconsistent.

That inconsistency is corrosive because severity is what humans triage on. If the same category of bug is labelled critical in one file and minor in another, developers lose the ability to trust the top of the list. They start reading every comment to be safe, which defeats the purpose of ranking, or they start ignoring the labels entirely. Anthropic's own prompting guidance on being clear and direct makes the underlying point: spell out exactly what you mean and show it, because the model does what you say, not what you hoped you implied.

Code examples as the calibration anchor

The fix is to stop describing severity and start demonstrating it. For each level in your scheme, include a short, realistic snippet that earns that label, and say why. A critical example might be a query built by string concatenation that is open to SQL injection. A major example might be an unhandled error path that can crash a request handler. A minor example might be a variable named tmp where the house style wants a descriptive name. The model now has three fixed points to interpolate between, and new findings get measured against those points rather than against an adjective.
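The three anchors described above could be assembled into the severity section of a review prompt roughly as follows. This is a minimal sketch, not a prescribed format: the SeverityAnchor class, the build_severity_section function, the example tags, and the snippets themselves are all hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityAnchor:
    level: str    # "critical", "major", or "minor"
    snippet: str  # short code example that earns this level
    reason: str   # one sentence on why it sits at this level


# Hypothetical anchors mirroring the three examples in the text.
ANCHORS = [
    SeverityAnchor(
        level="critical",
        snippet='query = "SELECT * FROM users WHERE name = \'" + name + "\'"',
        reason="User input is concatenated into SQL, so the query is open to injection.",
    ),
    SeverityAnchor(
        level="major",
        snippet="data = json.loads(request.body)  # no error handling around the parse",
        reason="A malformed body raises an unhandled error that can crash the handler.",
    ),
    SeverityAnchor(
        level="minor",
        snippet="tmp = compute_invoice_total(order)",
        reason="The name tmp breaks the house style, which wants a descriptive name.",
    ),
]


def build_severity_section(anchors: list[SeverityAnchor]) -> str:
    """Render the anchors as the severity section of a review prompt."""
    parts = ["Assign each finding the severity of the closest example below."]
    for anchor in anchors:
        parts.append(
            f'<example severity="{anchor.level}">\n'
            f"{anchor.snippet}\n"
            f"Why: {anchor.reason}\n"
            f"</example>"
        )
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_severity_section(ANCHORS))
```

Keeping the anchors in a data structure rather than a hand-edited string makes it easier to swap in a sharper example later without touching the rest of the prompt.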

This is the same mechanism that makes few-shot prompting so powerful, applied to grading rather than formatting. Anthropic documents that a handful of concrete examples is one of the most effective ways to lift accuracy and consistency, because examples communicate a boundary that description cannot. Severity calibration is that principle aimed squarely at the classification step: the examples are not there to teach syntax, they are there to teach the threshold between one level and the next.

1 example per severity level, minimum
0 adjectives doing the calibration work alone
Same input should yield the same severity

How a calibrated severity rubric flows through a review

It helps to see where calibration sits in the pipeline. The rubric is not a separate model call; it is part of the same review prompt that receives the diff. The model reads the changed code, matches each finding against the calibrated examples, and emits a severity alongside the comment. Downstream automation then routes on that severity, blocking a merge on a critical finding and merely commenting on a minor one.
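The downstream routing step might look like the sketch below. It assumes the reviewer returns findings as dictionaries with "severity" and "comment" keys, which is a hypothetical shape, and it assumes that only critical findings block a merge; the text does not say how major findings are routed.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"


def route(findings: list[dict]) -> dict:
    """Decide the CI outcome from the severities the reviewer emitted.

    Each finding is assumed to look like
    {"severity": "critical", "comment": "..."}.
    """
    # Severity(...) raises ValueError on any label outside the scheme.
    severities = [Severity(f["severity"]) for f in findings]
    return {
        # Assumption: only a critical finding blocks the merge.
        "block_merge": Severity.CRITICAL in severities,
        # Every finding is still posted as a comment.
        "comments": [f["comment"] for f in findings],
    }
```

Because the automation keys on the label alone, a miscalibrated critical either blocks a harmless change or lets a dangerous one through, which is why the anchors matter.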

Figure: A calibrated severity rubric inside one review call. The rubric travels with the diff in a single call; the calibrated examples decide which branch each finding takes.

Building the rubric: one anchored example per level

A workable rubric is short. You do not need a taxonomy of forty rules; you need one vivid example per level and a one-line reason. Keep the examples in the same language and style as the code under review so the model is comparing like with like. When a level keeps attracting the wrong findings, the cure is usually a sharper example at the neighbouring level, not more prose around the failing one.

Two design habits keep a calibrated rubric healthy over time. First, make the examples mutually exclusive: a reader should never look at the critical and major snippets and wonder why one is not the other. Second, prefer examples drawn from your own history of real incidents, because they encode the impact your team actually cares about. A severity scheme calibrated on your past production bugs will rank future findings the way your engineers would, which is the entire goal.

A worked example: taming an over-eager reviewer

A platform team wires Claude into CI to review every pull request, but the bot marks almost everything critical, so engineers stop reading its output.

The first prompt said only: "Review this diff and rate each issue as critical, major, or minor based on impact." Within a week the team had a backlog of pull requests where formatting nits and genuine security flaws both carried a red critical badge. Trust collapsed, and people began merging without reading the review at all.

The team rewrote the severity section to calibrate it. They added one snippet per level: a critical example showing user input concatenated into a shell command; a major example showing a missing null check before a dereference; a minor example showing a function longer than the style guide prefers. Each snippet got a single sentence explaining why it sat at that level. Nothing else about the prompt changed.

On the next run the distribution shifted sharply. The shell-injection class of finding stayed critical, the missing-check class moved to major, and the style observations dropped to minor where they belonged. Because the examples were fixed, re-running the same pull request produced the same severities, and engineers could once again trust that a critical badge meant stop. The lesson the team took away is the heart of this knowledge point: severity calibration is an act of demonstration, and the demonstration has to be code, not commentary.
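A team could check the "same pull request, same severities" claim for itself by reviewing one diff several times and comparing the labels. The sketch below assumes each run has already been reduced to a mapping from a finding identifier to a severity label; the identifiers and the unstable_findings helper are hypothetical.

```python
from collections import defaultdict


def unstable_findings(runs: list[dict[str, str]]) -> dict[str, set[str]]:
    """Return the findings whose severity differed between runs.

    Each run maps a finding identifier to the severity assigned in that run.
    """
    seen: dict[str, set[str]] = defaultdict(set)
    for run in runs:
        for finding_id, severity in run.items():
            seen[finding_id].add(severity)
    # A finding is unstable if it received more than one distinct label.
    return {fid: levels for fid, levels in seen.items() if len(levels) > 1}


if __name__ == "__main__":
    runs = [
        {"shell-injection": "critical", "long-function": "minor"},
        {"shell-injection": "critical", "long-function": "major"},
    ]
    print(unstable_findings(runs))
```

An empty result across repeated runs is the signal that the anchors are doing their job; a non-empty one points at the boundary that needs a sharper example.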

Common misreadings to avoid

Severity calibration is an apply-level skill, and the exam tends to probe whether you would actually build the rubric correctly rather than whether you can define the term. The two traps below are the ones that catch people.

Misconception

A clear prose definition such as "critical means high business impact" is enough to calibrate severity.

What's actually true

Prose adjectives are interpreted differently from run to run because "high impact" depends on context the prompt never gives. Calibration needs a concrete code example per level as the fixed anchor; the adjective alone leaves the threshold to the model to guess.

Misconception

To calibrate severity well you should add as many rules and edge cases as possible.

What's actually true

Over-stuffing the rubric makes levels overlap and confuses the boundary. One vivid, mutually exclusive example per level is more reliable than a long list of qualifiers, because the model interpolates between clear reference points.

Where this sits in the knowledge graph

Severity calibration builds directly on explicit categorical criteria: once you have decided your categories, calibration is how you rank findings within them. It is also the foundation for the next two skills in this task statement. You cannot write a clean report-versus-skip boundary, the subject of "designing explicit criteria for review systems", until your levels are calibrated, and you cannot run category-specific prompt iteration productively if you cannot tell whether a category got better or worse between runs.

It connects sideways, too. Calibrated examples are a special case of the few-shot pattern explored in "few-shot as the highest-leverage technique", and the inconsistency that calibration cures is the same failure that drives "the false positive trust problem". Seeing those connections is what turns a list of tactics into a coherent design philosophy for review prompts.

Severity calibration versus asking for a confidence score

A tempting shortcut is to skip calibration entirely and ask the model for a numeric confidence on each finding, then sort by that number. It rarely holds up. A confidence score is another undefined quantity: the model has no fixed scale to map onto, so its 0.8 on one run is not its 0.8 on the next, and two findings of identical real severity can draw very different numbers. You have replaced one vague signal, the adjective, with another, the float, and gained nothing in consistency.

Calibrated severity is sturdier because it is categorical and anchored. Instead of inventing a position on an invisible scale, the model matches each finding against fixed example points and chooses the nearest level. That is the broader Domain 4 lesson, that explicit categorical criteria beat vague confidence thresholds, applied to ranking. When an exam item offers you a confidence-score knob as one option and a set of calibrated examples as another, the calibrated examples are almost always the better design for exactly this reason.
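One way to hold the categorical contract downstream is to accept only the named levels and reject anything else, including a numeric score. The parse_severity helper below is a hypothetical sketch of that guard, assuming the reviewer's label arrives as a plain string.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"


def parse_severity(raw: str) -> Severity:
    """Accept only a named level; reject scores and free text."""
    label = raw.strip().lower()
    try:
        return Severity(label)
    except ValueError:
        allowed = ", ".join(s.value for s in Severity)
        raise ValueError(f"expected one of {allowed}; got {raw!r}") from None
```

With this guard in place, a reply such as "0.8" fails loudly instead of being silently sorted into a ranking.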

Keeping a severity rubric healthy as the codebase grows

A calibrated rubric is not a one-time artefact; it drifts out of date as the system it guards changes. New classes of defect appear, an old example stops resembling anything in the current code, and a level slowly loses its anchor. The maintenance habit is to treat the rubric like code: review it whenever a finding lands in the wrong bucket, and when it does, add or sharpen the example at the neighbouring level rather than piling on more prose. A misclassification is a signal that a boundary needs a clearer demonstration, not a longer description.
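Treating the rubric like code suggests giving it a regression test. The sketch below assumes a small set of snippets whose severity the team has already agreed on, and takes the classifier as a plain callable so that the actual model call stays outside the example; GOLDEN_CASES, find_drift, and the snippets are hypothetical.

```python
from typing import Callable

# Hypothetical regression set: snippets the team has already agreed on.
GOLDEN_CASES = [
    ('os.system("ping " + user_host)', "critical"),
    ("return user.profile.email  # user may be None", "major"),
    ("def process(): ...  # longer than the style guide prefers", "minor"),
]


def find_drift(
    cases: list[tuple[str, str]],
    classify: Callable[[str], str],
) -> list[dict]:
    """Return every case the classifier put in the wrong bucket.

    classify stands in for the review call and returns a severity label.
    """
    drift = []
    for snippet, expected in cases:
        actual = classify(snippet)
        if actual != expected:
            drift.append(
                {"snippet": snippet, "expected": expected, "actual": actual}
            )
    return drift
```

Each entry in the result names the two levels whose boundary has blurred, which is where a new or sharper example belongs.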

It also helps to source the examples from real incidents your team has lived through. An example drawn from an outage carries the impact your engineers actually felt, so the level it anchors ranks future findings the way those engineers would. Over time the rubric becomes a compact record of what has hurt you before, which is both a calibration tool and a quiet form of institutional memory that survives staff turnover.

Calibrating severity in Claude's managed Code Review

The same calibration idea ships in Anthropic's managed Code Review, where the control surface is a pair of repository files rather than a hand-written rubric. By default the reviewer reads your project's CLAUDE.md files and treats any newly introduced violation as a nit-level finding. To change what gets flagged, at what severity, and how it is reported, you add a REVIEW.md file, which carries the highest priority in the pipeline because it is injected into every agent doing the review. REVIEW.md is where an architect performs severity calibration in production.

Anthropic's guidance uses a deliberately small, two-tier scale: Important and Nit. Reserve Important for findings that would break behaviour, leak data, or block a rollback, such as incorrect logic, an unscoped database query, PII written to logs or error messages, or a migration that is not backward compatible. Keep style, naming, and refactoring suggestions at Nit at most. The calibration runs both ways: you can also escalate, for instance by declaring that any CLAUDE.md violation should be treated as Important rather than the default nit. It is the same act of pinning levels to concrete classes of code, expressed in the product's own vocabulary.

One further control matters for false positives: the verification bar. Rather than let the reviewer infer a defect from a variable's name, you can require evidence before a class of finding is posted, such as a file-and-line citation in the source that demonstrates the behaviour. Pairing a calibrated severity scale with an evidence requirement is what keeps a high-severity label meaning what your team needs it to mean.
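The evidence requirement can be pictured as a filter over findings. The sketch below is an illustration of the idea only, not how the managed Code Review product implements it: the "evidence" field, the path:line citation format, and both function names are assumptions.

```python
import re

# Assumed citation format: a file path, a colon, then a line number.
CITATION = re.compile(r"^[\w./-]+:\d+$")


def has_citation(finding: dict) -> bool:
    """True if the finding cites a specific file and line as evidence."""
    return bool(CITATION.match(finding.get("evidence", "")))


def postable(findings: list[dict]) -> list[dict]:
    """Keep only the findings that meet the verification bar."""
    return [f for f in findings if has_citation(f)]


if __name__ == "__main__":
    findings = [
        {"comment": "PII written to logs", "evidence": "src/api/users.py:42"},
        {"comment": "variable name suggests a secret", "evidence": ""},
    ]
    print(postable(findings))
```

A finding inferred from a name alone carries no citation and is dropped, so the labels that do reach developers are backed by something they can open and check.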

How it shows up on the exam

Domain 4 carries twenty percent of the exam, and Task Statement 4.1 is its precision-and-false-positives anchor, usually surfacing inside the continuous-integration code-review scenario. Questions on this knowledge point rarely ask you to recite a definition. Instead they hand you a misbehaving reviewer, often one that grades everything the same way, and ask what change would fix it. The credit-earning answer is almost always to replace prose severity descriptions with concrete code examples for each level, and the distractors are plausible-sounding moves like adding more prose, lowering a confidence threshold, or switching models, none of which addresses the real cause. If you can explain why a code example is a fixed anchor and an adjective is not, you can answer the whole family of these questions.

Check your understanding

A team's Claude-powered CI reviewer labels both a SQL-injection risk and an inconsistent variable name as critical. The prompt currently says: "rate each issue critical, major, or minor by impact." Which change most directly fixes the inconsistent severities?

People also ask

What are the severity levels in code review?
Most review systems use a small ordered set such as critical, major, and minor, sometimes adding blocker or info. The labels are only useful once each one is anchored to a concrete code example of what earns it.
How do you define critical versus minor issues in a review prompt?
Show the model a short snippet for each level. A critical snippet might be an unparameterised SQL query, a minor snippet an inconsistent name. The contrast between snippets teaches the boundary far more reliably than adjectives like high or low impact.
Why use code examples instead of prose to set severity?
Prose adjectives are re-interpreted on every run, so severity drifts. A code example is a fixed point of reference, which keeps the same finding in the same bucket and lets developers trust the ranking.

Watch and learn

Official Anthropic Academy lessons first, then hand-picked walkthroughs.

Peace Of Code

Claude Certified Architect: Ep 14 | Prompt Engineering: Explicit Criteria & False Positives

Why watch: The exam-aligned episode on defining concrete criteria, reinforcing why severity must be defined with explicit examples rather than vague prose.
