Method·8 min read·23 June 2026

LLM as Judge: Build Reliable Eval Pipelines for Claude Agents

Learn how to use LLM as judge to evaluate Claude agent reliability, design multi-dimensional rubrics, mitigate bias, and harden production eval pipelines.

By Solomon Udoh · AI Architect & Certification Lead

LLM as Judge: Build Reliable Eval Pipelines for Claude Agents

Using an LLM as judge is now the dominant technique for evaluating Claude agents at scale, replacing brittle exact-match checks with rubric-driven assessments that can score instruction following, reasoning quality, and tool efficiency in a single pass. This post explains how to design those rubrics, mitigate the biases that make LLM judges unreliable, and harden the surrounding pipeline for production governance.

What does "LLM as judge" actually mean?

An LLM-as-judge setup routes the output of one model call through a second model call whose sole job is to score or classify that output against an explicit rubric. The judge model returns a structured verdict: a score, a category label, or a pass/fail decision, together with a brief rationale. Because the judge reads natural language, it can evaluate dimensions that deterministic checks cannot, such as whether a response stays on-topic, whether a tool was chosen efficiently, or whether a multi-step reasoning chain is internally consistent.

The technique is particularly important for agentic architecture work, where a single user request may trigger dozens of tool calls and sub-agent delegations. Human review of every trace is not feasible; a well-calibrated judge is.

Why is pass^k a better reliability metric than pass@k?

The distinction matters enormously in production. pass@k asks whether at least one of k independent trials succeeds. pass^k asks whether all k trials succeed. For a task with a 70% single-trial success rate, pass@3 is approximately 97%, which looks excellent. pass^3 is approximately 34%, which reveals a system that fails two out of three times in real deployment where you cannot cherry-pick runs.

For high-stakes agentic tasks, such as financial report generation or compliance checks, pass^k is the correct metric. Reserve pass@k for creative or exploratory tasks where any good output is acceptable. When the CCA-F exam presents a scenario about measuring agent reliability, it consistently rewards deterministic, worst-case framing over optimistic best-case framing.

MetricFormula (approximate)When to use
pass@k1 - (1 - p)^kCreative tasks; any success is acceptable
pass^kp^kProduction tasks; every run must succeed
Mean scoreavg(score_i)Continuous rubric dimensions
Failure rate by severitycount(severity >= N) / totalDeployment gate decisions

How do you design multi-dimensional rubrics for an LLM judge?

A rubric is a structured scoring guide that tells the judge model exactly what to measure and how to weight each dimension. Vague rubrics produce noisy scores; explicit categorical criteria produce consistent ones.

A production rubric for a Claude agent handling customer support might include:

  1. Instruction following (0 to 3): Did the response address every explicit constraint in the system prompt?
  2. Reasoning quality (0 to 3): Is the chain of thought internally consistent and free of logical gaps?
  3. Tool efficiency (0 to 2): Were tools called the minimum number of times needed, with no redundant or hallucinated calls?
  4. Factual accuracy (0 to 3): Are all stated facts verifiable against the provided context?
  5. Tone compliance (0 to 1): Does the response match the required register?

Deliver the rubric to the judge inside a structured system prompt. Use prompt engineering best practices to make each criterion unambiguous: define what a score of 0 looks like and what a score of 3 looks like, with a concrete example for each.

json
{
"rubric": {
"instruction_following": {
"max_score": 3,
"anchors": {
"0": "One or more explicit constraints ignored.",
"1": "All constraints addressed but one partially missed.",
"2": "All constraints addressed; minor ambiguity in one.",
"3": "All constraints addressed precisely and completely."
}
},
"tool_efficiency": {
"max_score": 2,
"anchors": {
"0": "Redundant or hallucinated tool calls present.",
"1": "No redundant calls but a more efficient path existed.",
"2": "Minimum necessary tool calls; correct selection throughout."
}
}
}
}

Pass this JSON block verbatim into the judge's system prompt alongside the conversation trace being evaluated.

What biases affect LLM judges and how do you mitigate them?

Three biases consistently degrade LLM judge reliability:

Position bias: judges tend to favour the first response when comparing two options side by side. Mitigation: run the comparison twice with positions swapped and average the scores.

Verbosity bias: longer responses receive higher scores independent of quality. Mitigation: add an explicit rubric criterion that penalises unnecessary length, and include a short high-quality anchor example in the prompt.

Self-preference bias: a model used as judge tends to rate outputs from the same model family more favourably. Mitigation: use a different model family as judge, or use a smaller model fine-tuned specifically for evaluation.

For the context management dimension specifically, watch for a fourth failure mode: the judge's own context window fills with long traces, causing it to miss errors in the middle of the conversation. This is the attention dilution problem applied to the eval layer. Chunk long traces into segments and run the judge per segment, then aggregate.

Judges should be treated as models with their own failure modes, not as ground truth. Calibrate them against human labels before trusting their scores in a deployment gate.

Anthropic , Claude Documentation (Model Evaluation Guidance)

How should you structure the judge call in code?

Keep the judge call separate from the agent call. Never ask the same model instance that produced the output to also score it; the shared context biases the score upward.

python
import anthropic
client = anthropic.Anthropic()
def judge_response(
system_prompt: str,
user_turn: str,
agent_response: str,
rubric_json: str,
) -> dict:
judge_system = f"""You are a strict evaluator. Score the AGENT RESPONSE
against the RUBRIC. Return valid JSON only, with keys matching rubric
dimension names and an integer score for each.
RUBRIC:
{rubric_json}
"""
judge_user = f"""SYSTEM PROMPT GIVEN TO AGENT:
{system_prompt}
USER TURN:
{user_turn}
AGENT RESPONSE:
{agent_response}
Return your scores as JSON now."""
response = client.messages.create(
model="claude-opus-4-5",
max_tokens=512,
system=judge_system,
messages=[{"role": "user", "content": judge_user}],
)
import json
return json.loads(response.content[0].text)

Note the max_tokens cap: a judge that produces verbose rationale before the JSON wastes tokens and risks truncation. If you need rationale, ask for it in a separate field after the scores.

How do you harden an eval pipeline for production governance?

Basic benchmarking is not a production eval pipeline. A hardened pipeline addresses five concerns:

1. Version management: every judge prompt, rubric JSON, and model version must be pinned and stored in version control alongside the agent code it evaluates. A rubric change is a breaking change; treat it as one.

2. Cost control: judge calls add latency and cost. For a 60-question batch, a single judge pass at claude-haiku-4-5 costs a fraction of a pass at claude-opus-4-5. Use a tiered strategy: cheap model for first-pass triage, expensive model only for borderline cases.

3. Data protection: traces often contain PII or proprietary data. Route eval traffic through the same data-handling controls as production traffic. Do not log raw traces to a shared eval dashboard without redaction.

4. Governance gates: define numeric thresholds that block deployment. A common pattern is: overall mean score >= 2.4/3.0 AND no severity-3 failures in the last 100 runs. Encode these as CI assertions, not manual checks.

5. Regression sets: maintain a frozen set of cases that the agent must always pass. New eval cases can be added; regression cases are never retired unless the task definition itself changes. This prevents the silent backsliding that occurs when teams optimise for aggregate metrics while specific failure modes worsen.

bash
# Example CI gate using a hypothetical eval CLI
eval-runner \
--agent-config ./agents/support-agent.json \
--rubric ./evals/rubric-v3.json \
--judge-model claude-haiku-4-5 \
--escalate-model claude-opus-4-5 \
--escalate-threshold 1.5 \
--pass-gate "mean_score >= 2.4 AND severity_3_count == 0" \
--regression-set ./evals/regression-frozen.jsonl

Why do agents fail even when individual components score well?

The most common production failure mode is goal-plan-action misalignment: the agent correctly interprets the goal, produces a plausible plan, but the actions taken do not implement the plan faithfully. Each step looks reasonable in isolation; the composite output is wrong.

This is why iterative refinement in multi-agent systems separates planning, building, and evaluation into distinct roles. The evaluator role is not an afterthought; it is a first-class agent that checks whether the builder's output matches the planner's specification.

An LLM-as-judge pipeline maps directly onto this architecture. The judge is the evaluator agent. Feed it the original goal (from the planner), the intended plan, and the actual output (from the builder). Score misalignment between plan and output as a distinct rubric dimension, separate from output quality.

When stakes are high, prefer deterministic enforcement over probabilistic prompting. An LLM judge that sometimes misses a severity-3 violation is not a substitute for a programmatic check that never misses it.

Anthropic , Claude Documentation (Agentic and Multi-Agent Frameworks)

How do you build a living eval suite that does not go stale?

A static eval suite degrades in two ways: the agent improves past it (ceiling effect), and the real distribution of inputs drifts away from it (coverage gap). A living suite addresses both.

Difficulty evolution: track per-case pass^k scores. Cases with pass^k above 0.95 for three consecutive releases are candidates for replacement with harder variants. Cases with pass^k below 0.20 are candidates for decomposition into smaller diagnostic cases.

Diversity maintenance: cluster your eval cases by input type, domain, and failure mode. Monitor cluster coverage. When production logs reveal a new failure pattern, add cases to that cluster before the next release cycle.

Regression anchoring: the frozen regression set (described above) is the counterweight. It prevents the living suite from drifting so far toward novelty that you lose signal on the core capabilities the agent was originally built for.

For teams preparing for the CCA-F exam, note that Domain 5 (Context Management and Reliability, 15% of the exam) and Domain 4 (Prompt Engineering and Structured Output, 20%) together cover the material most directly tested by eval pipeline design questions. Our concept library maps all 174 atomic concepts to the five exam domains, including the reliability and structured output patterns that underpin LLM-as-judge work.

What are the most consequential failure modes to gate on?

Not all failures are equal. A response that is slightly verbose is a quality issue. A response that states an incorrect financial figure is a deployment blocker. Structure your severity taxonomy before you write your first rubric.

SeverityDefinitionGate action
3 (Critical)Factual error with financial, legal, or safety consequenceBlock deployment immediately
2 (High)Instruction constraint violated; output unusableBlock if rate > 1% in regression set
1 (Medium)Quality degradation; output usable but suboptimalTrack; alert if trend worsens
0 (Low)Style or verbosity issueLog only

The key insight from production deployments is that overall violation rates can be low while severity-3 rates remain unacceptably high. A system with a 2% overall failure rate but a 0.5% severity-3 rate may be undeployable in a regulated domain. Always report severity-stratified metrics, never aggregate-only.

For multi-agent error handling, the same severity taxonomy should govern routing decisions: severity-3 failures trigger immediate escalation to a human reviewer, not a retry loop.

Frequently asked questions

Can I use Claude to judge its own outputs?
You can, but self-preference bias means scores will skew high. For production gates, use a different model family as judge, or use a smaller model fine-tuned specifically for evaluation. If you must use Claude as judge, calibrate scores against human labels first and apply a correction factor to borderline cases.
How many judge calls do I need per eval run to get stable scores?
For continuous rubric dimensions, a single judge call per case is often sufficient if the rubric is well-anchored with examples. For pass/fail decisions on high-stakes cases, run three independent judge calls and take the majority verdict. This reduces variance from judge stochasticity without tripling cost across the full suite.
What temperature should I use for the judge model?
Set temperature to 0 for the judge model. You want deterministic, reproducible scores, not creative variation. A judge that returns different scores for identical inputs on successive runs is not a reliable gate. Low temperature also makes the judge's rationale more consistent and easier to audit.
How do I handle cases where the judge and human raters disagree?
Disagreement is a calibration signal, not a failure. Log every case where the judge score and human label differ by more than one point. Review those cases weekly. Common causes are ambiguous rubric anchors and domain-specific knowledge gaps in the judge model. Update the rubric or add few-shot examples to address each root cause.
Is LLM-as-judge covered on the CCA-F exam?
The CCA-F exam does not name LLM-as-judge as a labelled topic, but the underlying skills are tested across Domain 4 (Prompt Engineering and Structured Output, 20%) and Domain 5 (Context Management and Reliability, 15%). Scenario questions about eval design, structured output validation, and reliability measurement draw directly on these concepts.
How do I prevent the judge from being manipulated by prompt injection in the agent output?
Wrap the agent output in a clearly delimited block with a fixed XML or JSON envelope before passing it to the judge. Instruct the judge in its system prompt to treat everything inside that block as untrusted data to be evaluated, not as instructions to follow. Never interpolate agent output directly into the judge's instruction text.

People also ask

What is LLM as judge in AI evaluation?
LLM as judge is an evaluation technique where a second language model scores or classifies the output of a first model against an explicit rubric. It replaces brittle exact-match checks with flexible, rubric-driven assessments that can measure instruction following, reasoning quality, and tool efficiency in a single structured pass.
What are the main biases in LLM as judge systems?
The three main biases are position bias (favouring the first option in a comparison), verbosity bias (rewarding longer responses regardless of quality), and self-preference bias (a model rating outputs from its own family more highly). Each can be mitigated through prompt design, swapped-position runs, or using a different model family as judge.
How do you use LLM as judge in a CI/CD pipeline?
Pin the judge prompt, rubric JSON, and model version in version control. Run the judge against a frozen regression set on every pull request. Define numeric pass gates such as mean score above a threshold and zero severity-3 failures. Fail the build automatically if gates are not met, treating eval regressions the same as test failures.
What is the difference between pass@k and pass^k for agent evaluation?
pass@k measures whether at least one of k trials succeeds; pass^k measures whether all k trials succeed. For production agents where every run must be correct, pass^k is the appropriate metric. pass@k can make an unreliable system look strong by hiding the majority of failures behind a single successful run.
How many dimensions should an LLM judge rubric have?
Three to six dimensions is the practical range. Fewer than three dimensions misses important quality axes; more than six increases judge confusion and score noise. Each dimension needs explicit numeric anchors with concrete examples. Instruction following, reasoning quality, and factual accuracy are the most commonly essential dimensions.

About the author

Solomon Udoh

AI Architect & Certification Lead

Solomon Udoh is an AI Architect who designs and ships production agent systems on the Claude API and Claude Code. He built AI Skill Certs' adaptive engine and authored its 174-concept knowledge graph, mapping every Claude Certified Architect - Foundations objective to hands-on, exam-aligned practice.

  • Designs production multi-agent systems on the Claude API and Agent SDK
  • Author of the AI Skill Certs knowledge graph (174 mapped exam concepts)
  • Builds with MCP, Claude Code, structured outputs, and agentic loops daily
  • Reviews every concept page against the official Anthropic exam guide

You might also like

Ready to put it into practice?

Study every exam concept with an adaptive tutor.

Start studying