Contradictory Findings in AI Code Review

In short: Contradictory findings detection is recognising when a single-pass review flags a code pattern as a problem in one file while approving the identical pattern in another, and correctly diagnosing that as a symptom of attention dilution rather than model randomness. The cure is structural: a multi-pass architecture that gives every file consistent depth prevents the contradiction from arising.

Reading the symptom: what contradictory findings detection means

Contradictory findings detection is the skill of looking at a review that disagrees with itself and correctly naming why. The classic signature is a review that flags a particular code pattern as problematic in one file and, in the same run, approves the very same pattern in another file. To a developer reading the output it looks like the reviewer cannot make up its mind, and the tempting conclusion is that the model is simply inconsistent. This knowledge point, which sits at the analyse level in Domain 4, exists to stop you reaching for that wrong conclusion, because the conclusion you draw determines the fix you apply.

The contradiction is real and observable, but it is a clue, not the disease. When the same pattern receives two different verdicts in one review, something made the reviewer treat the two occurrences differently even though the code was identical. The analytical work is to figure out what that something is. Get the diagnosis wrong and you will spend effort on remedies that cannot help; get it right and the cure is a structural change you already know from the multi-pass architecture.

Contradictory findings detection: Identifying when a review reaches opposite verdicts on identical code in different files, and diagnosing the cause as attention dilution rather than randomness. Because the root cause is uneven depth across a single overloaded pass, the remedy is a multi-pass structure, not a retry.
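
The detection half of this definition can be sketched mechanically. The following is a minimal illustration, assuming the review output can be reduced to (file, snippet, verdict) records; the `Verdict` type, the `fingerprint` normalisation, and the sample data are all hypothetical and not part of any real reviewer's output format. It only finds candidate contradictions; deciding whether the cause is dilution still needs the analysis described in the rest of this page.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    file: str
    snippet: str   # the code the reviewer judged (hypothetical record shape)
    flagged: bool  # True = reported as a problem, False = approved


def fingerprint(snippet: str) -> str:
    """Collapse whitespace so indentation differences do not hide identical code."""
    return re.sub(r"\s+", " ", snippet).strip()


def find_contradictions(verdicts: list[Verdict]) -> dict[str, list[Verdict]]:
    """Return groups of identical snippets that received both verdicts."""
    groups = defaultdict(list)
    for v in verdicts:
        groups[fingerprint(v.snippet)].append(v)
    return {
        key: group
        for key, group in groups.items()
        if {v.flagged for v in group} == {True, False}
    }


verdicts = [
    Verdict("orders.py", 'cur.execute("SELECT * FROM t WHERE id = " + uid)', True),
    Verdict("billing.py", '    cur.execute("SELECT * FROM t WHERE id = " + uid)', False),
    Verdict("utils.py", "return a + b", False),
]

for key, group in find_contradictions(verdicts).items():
    print(key, "->", [(v.file, v.flagged) for v in group])
```

Matching on a normalised snippet is deliberately strict: it surfaces only sites whose code is the same, which is the precondition for calling two verdicts contradictory at all.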

Why it happens: attention dilution, not randomness

The root cause is attention dilution. When a single pass is asked to review many files at once, its finite attention is spread unevenly across them. Some files, often the ones it reaches early, get close examination; others, often the ones it reaches late, get a skim. If a risky pattern appears in two files, the reviewer may study it carefully where it looked closely and flag it, then glide past the same pattern where it skimmed and approve it. The two verdicts are not a random coin flip; they are the systematic result of two different depths of attention applied to two occurrences of the same code.

This distinction matters because the obvious explanations point you the wrong way. Calling it randomness suggests you should retry the review or lower the temperature to make it more deterministic, but the problem was never sampling noise, so neither helps. Calling it model inconsistency suggests you need a smarter model, but a smarter model asked to review the same overloaded batch will dilute its attention the same way. Anthropic's guidance on chaining prompts captures the underlying principle: breaking a task into smaller steps, so that each call processes a more manageable input, is how you give every part full attention. It is the spread of attention, not the quality of the model, that produces the contradiction.

Tracing the cause to the right remedy

Analysing the symptom correctly leads straight to the cure you already know. If the contradiction comes from uneven depth in one pass, then the remedy is the multi-pass architecture: review each file in its own per-file pass with identical criteria, so the same pattern is judged at the same depth everywhere it appears, and add a cross-file pass to reconcile patterns that span files. Under that structure the second occurrence of a risky pattern cannot be skimmed, because it gets its own focused pass, so it is flagged just as the first one was. The contradiction disappears not because the model became more consistent but because no occurrence is ever starved of attention.
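
The structure described above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not a definitive implementation: `review_file` and `review_cross_file` are hypothetical callables standing in for model calls, and the stub versions below use a crude string check only so the sketch runs without any API. The cross-file pass here receives the per-file findings rather than the raw sources, which is one possible design choice, not something the source prescribes.

```python
from typing import Callable

# Hypothetical signatures; a real version would wrap model calls.
PerFileReviewer = Callable[[str, str, str], list[str]]           # (criteria, path, source) -> findings
CrossFileReviewer = Callable[[dict[str, list[str]]], list[str]]  # per-file findings -> cross-file findings


def multi_pass_review(
    files: dict[str, str],
    criteria: str,
    review_file: PerFileReviewer,
    review_cross_file: CrossFileReviewer,
) -> tuple[dict[str, list[str]], list[str]]:
    # Per-file passes: one call per file, identical criteria every time,
    # so no file competes with the others for attention.
    per_file = {path: review_file(criteria, path, source) for path, source in files.items()}
    # Cross-file pass: reconcile patterns that span files.
    cross_file = review_cross_file(per_file)
    return per_file, cross_file


def stub_review_file(criteria: str, path: str, source: str) -> list[str]:
    """Stand-in reviewer: flags SQL built by string concatenation."""
    if 'execute("' in source and '" +' in source:
        return ["string-built SQL query"]
    return []


def stub_review_cross_file(per_file: dict[str, list[str]]) -> list[str]:
    """Stand-in reconciler: reports findings that occur in more than one file."""
    where: dict[str, list[str]] = {}
    for path, findings in per_file.items():
        for finding in findings:
            where.setdefault(finding, []).append(path)
    return [
        f"{finding} appears in {', '.join(sorted(paths))}"
        for finding, paths in where.items()
        if len(paths) > 1
    ]


files = {
    "orders.py": 'cur.execute("SELECT * FROM t WHERE id = " + uid)',
    "billing.py": 'cur.execute("SELECT * FROM t WHERE id = " + uid)',
    "utils.py": "def add(a, b):\n    return a + b",
}
per_file, cross_file = multi_pass_review(
    files, "flag injection risks", stub_review_file, stub_review_cross_file
)
print(per_file)
print(cross_file)
```

Because each file is reviewed in isolation with the same criteria, both occurrences of the query receive the same verdict, and the cross-file pass confirms the pattern is reported consistently.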

This is why contradictory findings detection has the multi-pass architecture as a hard prerequisite. The knowledge point is essentially the diagnostic mirror image of that design: knowledge point 4.6.2, the multi-pass architecture, tells you to split passes to prevent dilution, and this knowledge point, 4.6.5, trains you to recognise the dilution when you see its fingerprint in a contradictory report and to reach for that same split. The skill being assessed is the trace from observed symptom to true cause to matching fix, which is precisely what an analyse-level question demands.

same pattern: flagged in one file, approved in another

uneven depth: the actual cause, not randomness

per-file pass: judges every occurrence at equal depth

A picture of the contradiction and its fix

The diagnosis is clearest when you see the two paths side by side: one overloaded pass produces inconsistent depth and a contradiction, while per-file passes produce uniform depth and a consistent verdict.

Figure: How dilution produces a contradiction, and how multi-pass removes it. Uneven attention in one sweep yields opposite verdicts; equal depth per file yields one consistent verdict.

Distinguishing dilution from genuine differences

Analysis also means not over-applying the diagnosis. Sometimes two occurrences of a pattern genuinely deserve different verdicts because their context differs: a raw SQL string is dangerous when built from user input but harmless when built from a constant, and a missing null check matters where the value can be null but not where it cannot. A careful reviewer should distinguish a true context-dependent difference from a false contradiction. The tell is whether the surrounding context actually differs. If the two sites are materially different, opposite verdicts are correct and there is no contradiction to fix. If the code and its relevant context are the same and the verdicts still differ, you are looking at dilution.
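
The raw SQL case can be made concrete with a small runnable sketch using Python's standard `sqlite3` module. The table, the constant, and the function names are invented for illustration. The two sites build the query the same way, yet only one is exploitable, because only one takes its value from outside the program.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("ana", "admin"), ("bo", "viewer")])

# Site A: the string is assembled from a constant the program controls.
# Approving this is defensible: no outside value can reach the query.
ROLE = "admin"
admins = conn.execute("SELECT name FROM users WHERE role = '" + ROLE + "'").fetchall()


# Site B: the same construction, but the value arrives from the caller.
# Flagging this is correct: crafted input changes the meaning of the query.
def find_by_role_unsafe(role: str) -> list[tuple[str]]:
    return conn.execute("SELECT name FROM users WHERE role = '" + role + "'").fetchall()


# The usual remedy for Site B: a parameterised query.
def find_by_role_safe(role: str) -> list[tuple[str]]:
    return conn.execute("SELECT name FROM users WHERE role = ?", (role,)).fetchall()


payload = "x' OR '1'='1"
leaked = find_by_role_unsafe(payload)  # the injected OR clause matches every row
blocked = find_by_role_safe(payload)   # treated as a literal role name, matches nothing

print(admins, leaked, blocked)
```

Opposite verdicts on Site A and Site B are justified by context. If a review instead split its verdicts across two copies of Site B, that would be a true contradiction.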

Making that distinction is the analytical core of the knowledge point. It stops you from forcing uniform verdicts where the world is genuinely non-uniform, and it stops you from excusing real dilution as if it were context sensitivity. The exam can test either direction: a scenario where identical code in identical context gets opposite verdicts (dilution, fix with multi-pass) or one where seemingly contradictory verdicts are actually justified by different contexts (not a contradiction, no structural fix needed).

How Claude Code surfaces findings, and apparent contradictions

There is a third kind of apparent contradiction that has nothing to do with attention and everything to do with how a reviewer reports. Claude Code's review behaviour does not put every finding in one place: findings can appear as inline comments, as annotations on the Files changed view, in the Checks tab check-run details, and in the review body under Additional findings. The check-run details include a severity table that lists every finding with its file, line, and summary, even when an inline comment was not accepted, so a finding missing from the inline thread is not necessarily a finding the reviewer dropped.

A genuine source of apparent inconsistency is a stale diff. If a pull request is updated while a review is still running, some findings refer to lines that no longer exist in the current diff, and those are surfaced as Additional findings rather than as inline comments. A reviewer reading only the inline thread can mistake this for the model contradicting itself when in fact the code moved under an in-flight review. Before diagnosing attention dilution, an analyst should rule this out by checking whether the diff changed mid-review.
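
As a rough aid for ruling this out, an analyst could compare each finding's location against the lines present in the current diff. The sketch below is a toy model of that check only; it is not how Claude Code implements its reporting, and the function name, the finding tuples, and the `current_diff_lines` mapping are all hypothetical.

```python
Finding = tuple[str, int, str]  # (path, line, summary), a hypothetical shape


def split_by_current_diff(
    findings: list[Finding],
    current_diff_lines: dict[str, set[int]],
) -> tuple[list[Finding], list[Finding]]:
    """Separate findings whose line is still in the diff from those whose line is gone."""
    still_in_diff: list[Finding] = []
    stale: list[Finding] = []
    for finding in findings:
        path, line, _summary = finding
        if line in current_diff_lines.get(path, set()):
            still_in_diff.append(finding)
        else:
            stale.append(finding)
    return still_in_diff, stale


findings = [
    ("orders.py", 42, "string-built SQL query"),
    ("billing.py", 17, "string-built SQL query"),
]
# Assumed state after a push that landed while the review was running:
# billing.py line 17 is no longer part of the diff.
current_diff_lines = {"orders.py": {40, 41, 42}, "billing.py": {30, 31}}

still_in_diff, stale = split_by_current_diff(findings, current_diff_lines)
print("still in diff:", still_in_diff)
print("stale:", stale)
```

If the finding that seems to be missing turns up in the stale group, the two sites were not judged differently; the second finding was simply reported somewhere other than the inline thread.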

Claude Code also closes the loop on noisy findings differently from a structural fix: reviewers mark each finding as useful or as wrong and noisy, and those reactions are collected after merge and used to tune the reviewer over time. That feedback loop, rather than a user-facing contradiction flag, is the documented mechanism for driving down findings that conflict with the final outcome.

How this is tested on the Claude Certified Architect exam

In Scenario 5, Claude Code for Continuous Integration, a contradictory review is a natural exam prompt. You are shown a review that flags a pattern in one file and approves it in another and asked for the most likely cause or the best remedy. The trap answers invoke randomness, temperature, or model capability, each of which sends you toward retries or upgrades that cannot fix uneven attention. The correct answer identifies attention dilution from a single overloaded pass and prescribes a multi-pass structure that gives every file consistent depth.

Because this is an analyse-level item, the question rewards causal reasoning rather than recall. You succeed by tracing the visible contradiction back to the mechanism that produced it and then selecting the remedy that targets that mechanism. Keep the chain explicit in your head: contradictory verdicts on identical code mean uneven depth, uneven depth means a diluted single pass, and a diluted single pass is cured by splitting the work into per-file passes plus a cross-file pass.

Worked example

A CI review of a fifteen-file change flags an unparameterised SQL query as an injection risk in one file but approves a byte-for-byte identical query in another, and the team is about to lower temperature and re-run.

Start by checking the obvious wrong diagnosis. The team assumes the model was random, so their instinct is to lower temperature and retry. But the two queries are identical and their surrounding context is the same, so randomness does not explain why one was flagged and one approved. Re-running with lower temperature would just produce another diluted pass, possibly contradicting itself on a different pair of files.

Now trace the real cause. The review ran as one sweep over fifteen files. The flagged query lived in a file the model examined early and deeply; the approved one lived in a file it reached late and skimmed. That is attention dilution: the same pattern got two depths of attention and therefore two verdicts. The fingerprint, identical code with opposite verdicts and no contextual difference, points squarely at uneven depth rather than sampling noise.

Apply the matching fix. Re-run the review as per-file passes so each file, including the one that was skimmed, is analysed with the same criteria at full depth; the second query is now flagged exactly like the first. Add a cross-file pass to confirm the pattern is treated consistently across the change. The contradiction is gone, not because the model became steadier, but because no occurrence was left under-examined. Diagnosing before prescribing is what turned a fruitless retry into the correct structural fix.

Reading a contradictory report like an analyst

The analyse-level habit this knowledge point builds is to slow down at the moment of the contradiction and ask a sequence of questions rather than jumping to a label. Are the two code sites actually identical, or do they merely look similar at a glance? Is the relevant surrounding context the same, or does one site take user input while the other takes a constant? Was the review a single pass over many files, which makes uneven depth likely, or already structured into per-file passes, which makes dilution unlikely? Only after answering these do you commit to a diagnosis.

If the code and context match and the review was one overloaded sweep, attention dilution is the strong conclusion and a multi-pass structure is the fix. If the contexts differ, the verdicts may both be correct and there is nothing to repair. Working through the questions in order is what separates a real diagnosis from a guess, and it is the reasoning an exam item at this level is checking for.
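
The question sequence can be written down as a small decision procedure. This is only a sketch of the reasoning order described above; the `Diagnosis` names and the `diagnose` function are invented for illustration, and the three inputs are judgements the analyst must make by reading the code and the review setup.

```python
from enum import Enum


class Diagnosis(Enum):
    NOT_SAME_CODE = "sites only look similar; compare them properly first"
    JUSTIFIED_BY_CONTEXT = "contexts differ; both verdicts may be correct"
    LIKELY_DILUTION = "attention dilution; restructure as a multi-pass review"
    LOOK_ELSEWHERE = "already per-file; dilution is unlikely, check other causes"


def diagnose(
    code_identical: bool,
    context_identical: bool,
    single_pass_over_many_files: bool,
) -> Diagnosis:
    # Question 1: are the two code sites actually identical?
    if not code_identical:
        return Diagnosis.NOT_SAME_CODE
    # Question 2: is the relevant surrounding context the same?
    if not context_identical:
        return Diagnosis.JUSTIFIED_BY_CONTEXT
    # Question 3: was the review one sweep over many files?
    if single_pass_over_many_files:
        return Diagnosis.LIKELY_DILUTION
    return Diagnosis.LOOK_ELSEWHERE


print(diagnose(True, True, True).value)
print(diagnose(True, False, True).value)
```

The order of the checks matters: dilution is only concluded after the cheaper explanations, different code or different context, have been excluded.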

The same pattern beyond code review

Although the canonical example is code review, the contradiction signature appears wherever a single pass judges many similar items at once. A document-extraction job that approves a malformed date in one record but flags the identical format in another, or a content-moderation sweep that allows a phrase on an early page and removes it on a later one, shows the same fingerprint of uneven attention. The diagnosis transfers directly: identical inputs in identical context receiving opposite verdicts point to dilution across an overloaded pass, and the remedy is the same decomposition into focused per-item passes. Recognising the pattern in a new domain, rather than only in the code-review form you first studied, is what an analyse-level understanding gives you, and the exam may dress the scenario in any of these clothes.

Misconceptions to avoid

Misconception: Contradictory findings mean the model is unreliable and randomly inconsistent, so retrying or lowering temperature will resolve them.

What's actually true: The contradiction is usually attention dilution, not sampling noise. Retries and lower temperature do not give the skimmed file more attention, so the inconsistency persists. The fix is a multi-pass structure that reviews every file at equal depth.

Misconception: Any time a reviewer reaches opposite verdicts on similar code, it is a contradiction that must be eliminated.

What's actually true: Not necessarily. If the surrounding context genuinely differs, opposite verdicts can be correct, for example a query is unsafe with user input but safe from a constant. A true contradiction requires identical code and identical relevant context; otherwise the difference is justified.

Check your understanding

A single-pass CI review of a large change flags an unparameterised query as an injection risk in one file but approves a byte-for-byte identical query, in identical context, in another. What is the most accurate diagnosis and fix?
