Detected Pattern Fields for Prompt Learning

In short: Detected pattern fields are a piece of provenance you add to every structured finding: a label naming the code construct or text pattern that triggered the model to flag it. Capturing that label turns a flat list of findings into analysable data, so when developers dismiss findings you can see which patterns drive the noise and refine the prompt against evidence rather than guesswork.

What detected pattern fields are

Detected pattern fields are the small but decisive habit of tagging every structured finding with the reason it exists. When a Claude-powered reviewer flags something, whether a bug in a code review, a clause in a contract, or a risk in a document, it does not only emit the finding text and a severity; it also emits a label naming the construct or pattern that triggered the judgement. That label is provenance. It records why this finding fired, and recording it is what later turns a pile of findings into something you can learn from.

The technique answers a question that quietly defeats most review systems: which of our findings are actually useful? A reviewer that produces a flat list gives you no way to answer that, because every dismissal looks identical. Once each finding carries a detected pattern label, a dismissal stops being noise and becomes a data point attached to a category, and categories are something you can count, rank, and act on.

Detected pattern field: A provenance label attached to each structured finding that names the code construct or text pattern which caused the model to raise it, enabling downstream analysis of which patterns are accepted or dismissed and, in turn, evidence-based refinement of the prompt.

Why a findings list without provenance cannot improve

Imagine a review tool that has been running for a month, with developers dismissing roughly two in five of its findings. That single number, a 40 percent dismissal rate, is almost useless. It does not tell you whether the noise comes from one overzealous rule or from a hundred small misjudgements spread evenly. You cannot tell whether the fix is a one-line prompt edit or a fundamental rethink, because the data has no internal structure. You are reduced to acting on whichever complaint was loudest in the last standup, which is exactly the anecdote-driven loop good engineering tries to escape.

The root problem is that the findings were collected without recording what produced them. Anthropic's broader reliability guidance leans on making model outputs auditable and grounded in evidence; its advice to have the model cite the basis for each claim is the same instinct applied to truthfulness. Detected pattern fields apply that instinct to improvement: an output you can trace is an output you can systematically get better at.

Adding the field to your schema

Because Domain 4 already has you enforcing structured output with tool use and JSON schemas, adding provenance is a schema change, not a new system. You extend each finding object with a field, commonly an enumerated one, that the model must populate with the triggering pattern. Using an enum keeps the labels stable and groupable rather than free-form prose that never aggregates cleanly.

{
  "name": "report_findings",
  "input_schema": {
    "type": "object",
    "properties": {
      "findings": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "message": { "type": "string" },
            "severity": { "type": "string", "enum": ["low", "medium", "high"] },
            "detected_pattern": {
              "type": "string",
              "enum": [
                "comment_contradicts_code",
                "unhandled_error_path",
                "string_concatenation_in_loop",
                "hardcoded_secret",
                "missing_input_validation"
              ]
            }
          },
          "required": ["message", "severity", "detected_pattern"]
        }
      }
    }
  }
}

The enum is the design decision that pays off. Free text would let the model describe the trigger a hundred different ways, defeating aggregation, whereas a fixed vocabulary means every instance of the same trigger lands under one banner you can count.
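To make the benefit of a fixed vocabulary concrete, here is a minimal sketch of how a pipeline might mirror the schema's enum on the receiving side. It assumes the `report_findings` tool input has already been decoded into a Python dict; the names `DetectedPattern`, `Finding`, and `parse_findings` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class DetectedPattern(str, Enum):
    """Mirror of the detected_pattern enum in the tool schema."""
    COMMENT_CONTRADICTS_CODE = "comment_contradicts_code"
    UNHANDLED_ERROR_PATH = "unhandled_error_path"
    STRING_CONCATENATION_IN_LOOP = "string_concatenation_in_loop"
    HARDCODED_SECRET = "hardcoded_secret"
    MISSING_INPUT_VALIDATION = "missing_input_validation"


@dataclass(frozen=True)
class Finding:
    message: str
    severity: str
    detected_pattern: DetectedPattern


def parse_findings(tool_input: dict) -> list[Finding]:
    """Convert decoded tool input into typed findings (hypothetical helper)."""
    findings = []
    for raw in tool_input.get("findings", []):
        # DetectedPattern(...) raises ValueError for any label outside the enum,
        # so a stray free-text label cannot slip into the stored data.
        pattern = DetectedPattern(raw["detected_pattern"])
        findings.append(Finding(raw["message"], raw["severity"], pattern))
    return findings


# Illustrative input only; not taken from a real review run.
example_input = {
    "findings": [
        {
            "message": "Docstring says the list is sorted, but the code never sorts it.",
            "severity": "medium",
            "detected_pattern": "comment_contradicts_code",
        }
    ]
}
parsed = parse_findings(example_input)
print(parsed[0].detected_pattern.value)  # prints: comment_contradicts_code
```

Rejecting unknown labels at the boundary is one possible design choice: it keeps every stored finding groupable under exactly one known banner.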

Key facts: 1 label per finding records the trigger; the value is realised offline, in analysis rather than inference; Bloom level for KP 4.4.3: Apply.

Why such a cheap technique is so often skipped

The striking thing about detected pattern fields is how little they cost relative to what they unlock. Adding one enumerated property to a schema you already maintain is a few minutes of work, it adds a negligible number of output tokens per finding, and it changes nothing about how the model reasons. Yet most review systems never add it, and the reason is a sequencing mistake: teams treat the prompt as the thing to perfect and the output as a byproduct, when in a system meant to improve over time the output is the instrument that tells you how to perfect the prompt.

By the time a team feels the pain of an unimprovable reviewer, they have usually accumulated months of findings with no provenance, and that history is unrecoverable, because you cannot retroactively label what triggered a finding you logged as plain text. The lesson is to instrument from the first day: the field is cheapest to add before you have any data and most valuable once you have a great deal of it. Building the provenance in early is the difference between a review system that compounds in quality and one that merely accumulates age.

The field only earns its place once you join findings to outcomes. Each time a human accepts or dismisses a finding, you have a labelled example: this pattern, this verdict. Aggregate those over a few weeks and the dismissal rate per pattern tells you where the prompt is misfiring. A pattern accepted almost every time is pulling its weight. A pattern dismissed almost every time is generating noise, and now you know precisely which instruction to change, because the label names the behaviour to suppress or sharpen.

This closes a feedback loop that mirrors Anthropic's evaluator-optimizer pattern, except the evaluator is your team of developers and the optimisation target is the prompt itself. The loop is slow and offline rather than per-request, but it is the mechanism by which a review system gets measurably better instead of merely older.

The prompt-improvement feedback loop

Provenance labels make the loop measurable: each noisy pattern points to one instruction to revise.

Joining findings to outcomes in practice

The label on its own is inert; the value appears only when each finding is joined to what a human eventually did with it. That means your pipeline needs to persist two things together: the finding as the model emitted it, including its detected pattern label, and the verdict a reviewer later assigned, accepted or dismissed. In a code-review setting the verdict is often implicit and free to collect: a finding the developer resolves before merge is an acceptance, and one they explicitly wave through or mark as not-an-issue is a dismissal. Capturing that signal turns every review into a labelled training example for your prompt, at no extra cost to the reviewer.
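As a rough sketch of persisting the two things together, the example below keeps each finding and its later verdict in one record. It uses an in-memory dict as a stand-in for whatever store a real pipeline would use; `StoredFinding`, `FindingLog`, the `finding_id` key, and the verdict strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredFinding:
    finding_id: str
    message: str
    detected_pattern: str
    verdict: Optional[str] = None  # "accepted", "dismissed", or None while pending


class FindingLog:
    """In-memory stand-in for a real findings store (hypothetical)."""

    def __init__(self) -> None:
        self._by_id: dict[str, StoredFinding] = {}

    def record_finding(self, finding_id: str, message: str, detected_pattern: str) -> None:
        # Store the finding exactly as emitted, including its provenance label.
        self._by_id[finding_id] = StoredFinding(finding_id, message, detected_pattern)

    def record_verdict(self, finding_id: str, verdict: str) -> None:
        if verdict not in ("accepted", "dismissed"):
            raise ValueError(f"unknown verdict: {verdict!r}")
        # Attach the human outcome to the finding it refers to.
        self._by_id[finding_id].verdict = verdict

    def labelled_examples(self) -> list[tuple[str, str]]:
        # Only findings that already have a verdict count as labelled examples.
        return [
            (f.detected_pattern, f.verdict)
            for f in self._by_id.values()
            if f.verdict is not None
        ]


log = FindingLog()
log.record_finding("f1", "Exception from fetch() is never handled.", "unhandled_error_path")
log.record_finding("f2", "String built with += inside a loop.", "string_concatenation_in_loop")
log.record_finding("f3", "API key literal in source.", "hardcoded_secret")
log.record_verdict("f1", "accepted")   # developer resolved it before merge
log.record_verdict("f2", "dismissed")  # developer marked it not-an-issue
examples = log.labelled_examples()
print(examples)
# prints: [('unhandled_error_path', 'accepted'), ('string_concatenation_in_loop', 'dismissed')]
```

Finding `f3` has no verdict yet, so it stays out of the labelled set rather than being counted as either outcome.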

With that join in place, the analysis is a single grouping. Count, for each pattern, how many findings were raised and how many were dismissed, and sort by dismissal rate. The patterns at the top of that list are where your prompt is spending credibility it has not earned, and the ones at the bottom are where it is doing real work. You do not need a sophisticated statistics pipeline for this; a weekly query over a few hundred findings is enough to make the noisy categories unmistakable. The discipline that matters is collecting the data continuously, so that the moment you suspect the reviewer is too noisy, the evidence to act is already sitting there rather than needing a fresh study.
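The grouping described above can be sketched in a few lines. This assumes the joined data is available as `(pattern, verdict)` pairs; the function name `dismissal_rates` and the sample rows are invented for illustration and do not come from a real review run.

```python
from collections import Counter


def dismissal_rates(examples):
    """Return (pattern, raised, dismissed, rate) rows, noisiest pattern first."""
    raised = Counter()
    dismissed = Counter()
    for pattern, verdict in examples:
        raised[pattern] += 1
        if verdict == "dismissed":
            dismissed[pattern] += 1
    rows = [
        (pattern, raised[pattern], dismissed[pattern], dismissed[pattern] / raised[pattern])
        for pattern in raised
    ]
    # Sort by dismissal rate, highest first, so noisy patterns surface at the top.
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Made-up sample data: four findings per pattern.
sample = (
    [("string_concatenation_in_loop", "dismissed")] * 3
    + [("string_concatenation_in_loop", "accepted")]
    + [("unhandled_error_path", "accepted")] * 4
    + [("comment_contradicts_code", "accepted")] * 3
    + [("comment_contradicts_code", "dismissed")]
)

rows = dismissal_rates(sample)
for pattern, n_raised, n_dismissed, rate in rows:
    print(f"{pattern}: {n_dismissed}/{n_raised} dismissed ({rate:.0%})")
# prints:
# string_concatenation_in_loop: 3/4 dismissed (75%)
# comment_contradicts_code: 1/4 dismissed (25%)
# unhandled_error_path: 0/4 dismissed (0%)
```

With only a handful of findings per pattern the rates are unstable, so in practice it is worth reading the raised count alongside the rate before acting on it.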

When the taxonomy itself needs to evolve

A pattern taxonomy is not fixed for all time; it is a living interface between the model's behaviour and your ability to reason about it. Two signals tell you it needs to change. The first is a catch-all label, an "other" or "miscellaneous" bucket, that grows to absorb a large share of findings. When that happens, real, distinct failure modes are hiding inside an undifferentiated lump, and you have lost the resolution the field was meant to give you. The fix is to split the bucket: read a sample of the findings inside it, identify the recurring constructs, and promote the common ones to their own labels in the enum.

The second signal is a label that never appears. A pattern in your enum that the model has not used in weeks is either genuinely rare in your corpus or no longer relevant, and carrying it adds noise to the schema and to your analysis. Pruning it keeps the vocabulary tight. Treating the taxonomy as something you revise on the same cadence as the prompt, rather than as a one-time decision, is what keeps the feedback loop sharp as your codebase and your reviewer both change. The label set and the prompt co-evolve: each refinement to one tends to suggest a refinement to the other.
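Both signals can be checked mechanically. The sketch below is one possible health check, assuming the taxonomy includes a catch-all label named `other` and that a 20 percent share is the point at which the bucket should be split; both the label name and the threshold are assumptions, as is the function name `taxonomy_signals`.

```python
from collections import Counter


def taxonomy_signals(enum_labels, observed_labels, catch_all="other", max_catch_all_share=0.2):
    """Report the two taxonomy warning signs for a window of findings.

    enum_labels: every label currently allowed by the schema.
    observed_labels: the label of each finding raised in the window.
    """
    counts = Counter(observed_labels)
    total = sum(counts.values())
    # Signal 1: the catch-all bucket is absorbing too large a share of findings.
    catch_all_share = counts[catch_all] / total if total else 0.0
    # Signal 2: labels in the enum that were never used in the window.
    unused = [label for label in enum_labels if counts[label] == 0]
    return {
        "catch_all_share": catch_all_share,
        "split_catch_all": catch_all_share > max_catch_all_share,
        "unused_labels": unused,
    }


# Made-up window of ten findings.
enum_labels = ["comment_contradicts_code", "unhandled_error_path", "hardcoded_secret", "other"]
observed = ["other"] * 3 + ["unhandled_error_path"] * 5 + ["comment_contradicts_code"] * 2

signals = taxonomy_signals(enum_labels, observed)
print(signals["split_catch_all"])  # prints: True
print(signals["unused_labels"])    # prints: ['hardcoded_secret']
```

An unused label is only a prompt to look, not an automatic deletion: as noted above, it may simply be rare in the corpus.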

Designing the pattern taxonomy

The labels are only as useful as the taxonomy behind them. Patterns should be mutually distinct, so a finding maps cleanly to exactly one, and they should sit at a level of granularity you can act on. Too coarse, and a label like "style" lumps together fixes that need different treatment. Too fine, and every finding becomes its own category, so nothing aggregates. The sweet spot is a vocabulary where each label names a behaviour you could plausibly write a sentence of prompt about, which keeps the path from analysis to refinement short and concrete. This connects directly to explicit categorical criteria: the same precision that makes review instructions effective makes the pattern taxonomy analysable.

Worked example

A team runs a Claude code-review prompt across pull requests. After a month, developers are dismissing about 40 percent of findings, but the team cannot tell whether the prompt is broadly miscalibrated or noisy in one specific area.

The team adds a detected_pattern enum to the finding schema and re-deploys, changing nothing else about the prompt. Each finding now arrives stamped with the construct that produced it. They also log, for every finding, whether the reviewer marked it resolved or dismissed.

Two weeks of data make the picture sharp. Findings labelled comment_contradicts_code and unhandled_error_path are accepted more than 90 percent of the time, so those instructions are working. But findings labelled string_concatenation_in_loop are dismissed 85 percent of the time: developers consider the micro-optimisation irrelevant in the codebases under review. That single number is the actionable signal the raw 40 percent could never give.

Armed with it, the team makes a one-line prompt edit, instructing the reviewer to skip performance micro-optimisations unless they occur on a hot path, rather than attempting a wholesale rewrite. They ship the edit, watch the overall dismissal rate fall, and keep the loop running to catch the next noisy pattern. Notice that the model behaviour during review never changed because of the field itself; the improvement came entirely from the analysis the field made possible.

Common misconceptions

Misconception

Adding a detected_pattern field makes the model produce more accurate findings.

What's actually true

It does not change inference-time behaviour at all. The model still flags what it would have flagged; the field merely records why. Its value is realised offline, when you aggregate verdicts by pattern and use the result to revise the prompt. The accuracy gain is downstream of human analysis, not an automatic property of the field.

Misconception

A high overall dismissal rate is enough to know the prompt needs work.

What's actually true

An overall rate tells you that something is noisy but not what. Without per-pattern provenance you cannot distinguish one overzealous rule from diffuse miscalibration, so you cannot target a fix. The detected pattern field is what converts an opaque aggregate into a ranked list of specific instructions to change.

How it shows up on the exam

In the continuous-integration and code-review scenario, the exam likes to present a review system that produces too many low-value findings and a team that wants to improve it systematically rather than by trial and error. The assessable insight is that you cannot improve what you cannot attribute: the right move is to add provenance to each finding so dismissals can be analysed by category, after which the prompt edits become obvious and evidence-backed. Distractors will offer global levers such as lowering temperature, switching output formats, or regenerating findings, none of which gives you the per-pattern visibility that systematic improvement requires.

Check your understanding

A team runs a Claude code-review prompt in CI. Developers dismiss about 40 percent of findings, but the team has no way to tell which kinds of findings are the noisy ones, so prompt edits are guesswork. Which change best enables systematic, data-driven improvement?
