LLM as Judge: Build Reliable Eval Pipelines for Claude Agents
Learn how to use LLM as judge to evaluate Claude agent reliability, design multi-dimensional rubrics, mitigate bias, and harden production eval pipelines.
By Solomon Udoh · AI Architect & Certification Lead

Using an LLM as judge is now the dominant technique for evaluating Claude agents at scale, replacing brittle exact-match checks with rubric-driven assessments that can score instruction following, reasoning quality, and tool efficiency in a single pass. This post explains how to design those rubrics, mitigate the biases that make LLM judges unreliable, and harden the surrounding pipeline for production governance.
What does "LLM as judge" actually mean?
An LLM-as-judge setup routes the output of one model call through a second model call whose sole job is to score or classify that output against an explicit rubric. The judge model returns a structured verdict: a score, a category label, or a pass/fail decision, together with a brief rationale. Because the judge reads natural language, it can evaluate dimensions that deterministic checks cannot, such as whether a response stays on-topic, whether a tool was chosen efficiently, or whether a multi-step reasoning chain is internally consistent.
The technique is particularly important for agentic architecture work, where a single user request may trigger dozens of tool calls and sub-agent delegations. Human review of every trace is not feasible; a well-calibrated judge is.
Why is pass^k a better reliability metric than pass@k?
The distinction matters enormously in production. pass@k asks whether at least one of k independent trials succeeds. pass^k asks whether all k trials succeed. For a task with a 70% single-trial success rate, pass@3 is 1 - 0.3^3, or approximately 97%, which looks excellent. pass^3 is 0.7^3, or approximately 34%: roughly two out of every three sets of three runs contain at least one failure, which is what you face in real deployment, where you cannot cherry-pick runs.
For high-stakes agentic tasks, such as financial report generation or compliance checks, pass^k is the correct metric. Reserve pass@k for creative or exploratory tasks where any good output is acceptable. When the CCA-F exam presents a scenario about measuring agent reliability, it consistently rewards deterministic, worst-case framing over optimistic best-case framing.
| Metric | Formula (p = per-trial success rate, independent trials) | When to use |
|---|---|---|
| pass@k | 1 - (1 - p)^k | Creative tasks; any success is acceptable |
| pass^k | p^k | Production tasks; every run must succeed |
| Mean score | avg(score_i) | Continuous rubric dimensions |
| Failure rate by severity | count(severity >= N) / total | Deployment gate decisions |
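The first two rows of the table can be checked numerically. The following is a minimal sketch that assumes trials are independent and share one success probability `p`; the function names `pass_at_k` and `pass_hat_k` are illustrative, not taken from any library.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


if __name__ == "__main__":
    p, k = 0.70, 3
    print(f"pass@{k} = {pass_at_k(p, k):.3f}")   # 0.973
    print(f"pass^{k} = {pass_hat_k(p, k):.3f}")  # 0.343
```

In practice `p` is itself an estimate from a finite number of trials, so both figures inherit its uncertainty.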
How do you design multi-dimensional rubrics for an LLM judge?
A rubric is a structured scoring guide that tells the judge model exactly what to measure and how to weight each dimension. Vague rubrics produce noisy scores; explicit categorical criteria produce consistent ones.
A production rubric for a Claude agent handling customer support might include:
- Instruction following (0 to 3): Did the response address every explicit constraint in the system prompt?
- Reasoning quality (0 to 3): Is the chain of thought internally consistent and free of logical gaps?
- Tool efficiency (0 to 2): Were tools called the minimum number of times needed, with no redundant or hallucinated calls?
- Factual accuracy (0 to 3): Are all stated facts verifiable against the provided context?
- Tone compliance (0 to 1): Does the response match the required register?
Deliver the rubric to the judge inside a structured system prompt. Use prompt engineering best practices to make each criterion unambiguous: define what a score of 0 looks like and what a score of 3 looks like, with a concrete example for each.

    {
      "rubric": {
        "instruction_following": {
          "max_score": 3,
          "anchors": {
            "0": "One or more explicit constraints ignored.",
            "1": "All constraints addressed but one partially missed.",
            "2": "All constraints addressed; minor ambiguity in one.",
            "3": "All constraints addressed precisely and completely."
          }
        },
        "tool_efficiency": {
          "max_score": 2,
          "anchors": {
            "0": "Redundant or hallucinated tool calls present.",
            "1": "No redundant calls but a more efficient path existed.",
            "2": "Minimum necessary tool calls; correct selection throughout."
          }
        }
      }
    }

Pass this JSON block verbatim into the judge's system prompt, and supply the conversation trace being evaluated in the user turn.
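Because the rubric carries a `max_score` per dimension, the same JSON can be reused to sanity-check whatever the judge returns. The sketch below assumes the judge replies with a flat mapping of dimension name to integer score; `validate_scores` and `normalised_total` are hypothetical helper names, and the trimmed `RUBRIC_JSON` omits the anchors for brevity.

```python
import json

# Trimmed copy of the rubric above (anchors omitted for brevity).
RUBRIC_JSON = """
{"rubric": {
  "instruction_following": {"max_score": 3},
  "tool_efficiency": {"max_score": 2}
}}
"""


def validate_scores(rubric: dict, scores: dict) -> dict:
    """Return scores unchanged if every rubric dimension has an in-range integer."""
    for name, spec in rubric["rubric"].items():
        if name not in scores:
            raise ValueError(f"missing dimension: {name}")
        value = scores[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name}: expected an integer, got {value!r}")
        if not 0 <= value <= spec["max_score"]:
            raise ValueError(f"{name}: {value} outside 0..{spec['max_score']}")
    return scores


def normalised_total(rubric: dict, scores: dict) -> float:
    """Sum of scores divided by the sum of max scores, in the range [0, 1]."""
    dimensions = rubric["rubric"]
    earned = sum(scores[name] for name in dimensions)
    possible = sum(spec["max_score"] for spec in dimensions.values())
    return earned / possible


if __name__ == "__main__":
    rubric = json.loads(RUBRIC_JSON)
    scores = validate_scores(rubric, {"instruction_following": 3, "tool_efficiency": 1})
    print(normalised_total(rubric, scores))
```

Rejecting out-of-range or missing scores early keeps a malformed judge reply from silently skewing aggregate metrics.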
What biases affect LLM judges and how do you mitigate them?
Three biases consistently degrade LLM judge reliability:
Position bias: judges tend to favour the first response when comparing two options side by side. Mitigation: run the comparison twice with positions swapped and average the scores.
Verbosity bias: longer responses receive higher scores independent of quality. Mitigation: add an explicit rubric criterion that penalises unnecessary length, and include a short high-quality anchor example in the prompt.
Self-preference bias: a model used as judge tends to rate outputs from the same model family more favourably. Mitigation: use a different model family as judge, or use a smaller model fine-tuned specifically for evaluation.
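The position-swap mitigation can be sketched as follows. The `PairJudge` signature is an assumption made for illustration: a callable that receives the two responses in display order and returns one score for each. `biased_stub` is a toy stand-in for a real judge call, built to exaggerate the bias.

```python
from typing import Callable

# Hypothetical signature: takes (first, second), returns (score_first, score_second).
PairJudge = Callable[[str, str], tuple[float, float]]


def debiased_pair_scores(judge: PairJudge, a: str, b: str) -> tuple[float, float]:
    """Judge (a, b) and (b, a), then average each response's two scores."""
    a_first, b_second = judge(a, b)
    b_first, a_second = judge(b, a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def biased_stub(first: str, second: str) -> tuple[float, float]:
    """Toy judge: scores by length and adds a 1-point bonus to the first slot."""
    return len(first) + 1.0, float(len(second))


if __name__ == "__main__":
    # The first-slot bonus is applied once to each response, so it cancels
    # out of the difference between the two averaged scores.
    print(debiased_pair_scores(biased_stub, "aaa", "bb"))
```

Averaging removes a constant positional bonus; if the two orderings disagree on the winner, treating the pair as a tie is a common alternative.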
For the context management dimension specifically, watch for a fourth failure mode: the judge's own context window fills with long traces, causing it to miss errors in the middle of the conversation. This is the attention dilution problem applied to the eval layer. Chunk long traces into segments and run the judge per segment, then aggregate.
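One way to chunk and aggregate is sketched below. Two choices here are assumptions rather than anything the approach prescribes: segments overlap by one turn so that an error spanning a boundary stays visible, and the aggregate is the minimum segment score so that one bad segment fails the whole trace. `judge_segment` is a hypothetical callable wrapping the real judge call.

```python
from typing import Callable


def chunk_trace(turns: list[str], max_turns: int, overlap: int = 1) -> list[list[str]]:
    """Split a trace into segments of at most max_turns, sharing `overlap` turns."""
    if max_turns <= overlap:
        raise ValueError("max_turns must be greater than overlap")
    step = max_turns - overlap
    chunks = []
    for start in range(0, len(turns), step):
        chunks.append(turns[start:start + max_turns])
        if start + max_turns >= len(turns):
            break  # this segment already reaches the end of the trace
    return chunks


def judge_long_trace(
    turns: list[str],
    judge_segment: Callable[[list[str]], int],
    max_turns: int = 20,
) -> int:
    """Score each segment and keep the minimum (assumed worst-case aggregation)."""
    chunks = chunk_trace(turns, max_turns)
    if not chunks:
        raise ValueError("empty trace")
    return min(judge_segment(chunk) for chunk in chunks)
```

A mean would hide a single failing segment behind many clean ones, which is why this sketch uses the minimum; swap in a different aggregate if your rubric calls for it.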
Judges should be treated as models with their own failure modes, not as ground truth. Calibrate them against human labels before trusting their scores in a deployment gate.
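Calibration needs a number to track. The sketch below computes raw agreement and Cohen's kappa between judge and human labels on the same cases; kappa is one common chance-corrected agreement statistic, chosen here as an illustration rather than prescribed by the approach above.

```python
from collections import Counter


def agreement_rate(judge: list[int], human: list[int]) -> float:
    """Fraction of cases where judge and human assigned the same label."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    judge_labels = [1, 1, 0, 0]
    human_labels = [1, 1, 0, 1]
    print(agreement_rate(judge_labels, human_labels))  # 0.75
    print(cohens_kappa(judge_labels, human_labels))    # 0.5
```

Raw agreement can look high when one label dominates, which is the case the chance correction is meant to expose.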
How should you structure the judge call in code?
Keep the judge call separate from the agent call. Never ask the same model instance that produced the output to also score it; the shared context biases the score upward.

    import json

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


    def judge_response(
        system_prompt: str,
        user_turn: str,
        agent_response: str,
        rubric_json: str,
    ) -> dict:
        """Score one agent response against the rubric in a separate judge call."""
        judge_system = f"""You are a strict evaluator. Score the AGENT RESPONSE
    against the RUBRIC. Return valid JSON only, with keys matching rubric
    dimension names and an integer score for each.

    RUBRIC:
    {rubric_json}"""

        judge_user = f"""SYSTEM PROMPT GIVEN TO AGENT:
    {system_prompt}

    USER TURN:
    {user_turn}

    AGENT RESPONSE:
    {agent_response}

    Return your scores as JSON now."""

        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=512,
            system=judge_system,
            messages=[{"role": "user", "content": judge_user}],
        )
        # Raises json.JSONDecodeError if the judge returns anything other than JSON.
        return json.loads(response.content[0].text)

Note the max_tokens cap: a judge that produces verbose rationale before the JSON wastes tokens and risks truncation. If you need rationale, ask for it in a separate field after the scores.
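If you do request a rationale, truncation is worth detecting explicitly. The sketch below assumes a reply shape of `{"scores": {...}, "rationale": "..."}`, which differs from the flat score mapping in the call above, and takes the reply text and `stop_reason` as plain arguments so that it can be tested offline. The Messages API reports `stop_reason` as `"max_tokens"` when a reply hits the cap; `parse_verdict` itself is a hypothetical helper.

```python
import json


def parse_verdict(text: str, stop_reason: str) -> tuple[dict, str]:
    """Parse judge output of the assumed form {"scores": {...}, "rationale": "..."}."""
    if stop_reason == "max_tokens":
        # The reply hit the token cap, so the JSON is probably cut off.
        raise ValueError("judge output truncated; raise max_tokens or shorten the rationale")
    verdict = json.loads(text)
    scores = verdict["scores"]
    rationale = verdict.get("rationale", "")  # rationale is optional
    return scores, rationale


if __name__ == "__main__":
    raw = '{"scores": {"tool_efficiency": 2}, "rationale": "Single lookup, correct tool."}'
    print(parse_verdict(raw, "end_turn"))
```

Placing `scores` before `rationale` in the requested shape means a reply that runs long loses rationale text first, and the `stop_reason` check still catches it.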
How do you harden an eval pipeline for production governance?
Basic benchmarking is not a production eval pipeline. A hardened pipeline addresses five concerns:
1. Version management: every judge prompt, rubric JSON, and model version must be pinned and stored in version control alongside the agent code it evaluates. A rubric change is a breaking change; treat it as one.
2. Cost control: judge calls add latency and cost. For a 60-question batch, a single judge pass with claude-haiku-4-5 costs a fraction of a pass with claude-opus-4-5. Use a tiered strategy: a cheap model for first-pass triage, and the expensive model only for borderline cases.
3. Data protection: traces often contain PII or proprietary data. Route eval traffic through the same data-handling controls as production traffic. Do not log raw traces to a shared eval dashboard without redaction.
4. Governance gates: define numeric thresholds that block deployment. A common pattern is: overall mean score >= 2.4/3.0 AND no severity-3 failures in the last 100 runs. Encode these as CI assertions, not manual checks.
5. Regression sets: maintain a frozen set of cases that the agent must always pass. New eval cases can be added; regression cases are never retired unless the task definition itself changes. This prevents the silent backsliding that occurs when teams optimise for aggregate metrics while specific failure modes worsen.

    # Example CI gate using a hypothetical eval CLI
    eval-runner \
      --agent-config ./agents/support-agent.json \
      --rubric ./evals/rubric-v3.json \
      --judge-model claude-haiku-4-5 \
      --escalate-model claude-opus-4-5 \
      --escalate-threshold 1.5 \
      --pass-gate "mean_score >= 2.4 AND severity_3_count == 0" \
      --regression-set ./evals/regression-frozen.jsonl

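The governance gate from item 4 can also be expressed as a small function that a CI job asserts on. This is a sketch under stated assumptions: `RunResult` is a hypothetical record type, each run carries a mean rubric score on the 0 to 3 scale, and the mean is taken over the same 100-run window as the severity check.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RunResult:
    score: float                 # mean rubric score for the run, 0-3 scale (assumed)
    max_severity: Optional[int]  # worst failure severity in the run, None if no failure


def gate_passes(runs: list[RunResult], min_mean: float = 2.4, window: int = 100) -> bool:
    """True when the most recent `window` runs meet both thresholds."""
    recent = runs[-window:]
    if not recent:
        return False  # no evidence means no deployment
    no_critical = all(r.max_severity is None or r.max_severity < 3 for r in recent)
    return mean(r.score for r in recent) >= min_mean and no_critical


if __name__ == "__main__":
    history = [RunResult(2.5, None)] * 10
    print(gate_passes(history))                           # True
    print(gate_passes(history + [RunResult(3.0, 3)]))     # False: one severity-3 run
```

In a pipeline, the job would exit non-zero when `gate_passes` returns `False`, which makes the threshold an assertion and not a manual check.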
Why do agents fail even when individual components score well?
The most common production failure mode is goal-plan-action misalignment: the agent correctly interprets the goal, produces a plausible plan, but the actions taken do not implement the plan faithfully. Each step looks reasonable in isolation; the composite output is wrong.
This is why iterative refinement in multi-agent systems separates planning, building, and evaluation into distinct roles. The evaluator role is not an afterthought; it is a first-class agent that checks whether the builder's output matches the planner's specification.
An LLM-as-judge pipeline maps directly onto this architecture. The judge is the evaluator agent. Feed it the original goal (from the planner), the intended plan, and the actual output (from the builder). Score misalignment between plan and output as a distinct rubric dimension, separate from output quality.
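A minimal sketch of the judge input for this setup follows. `AlignmentCase`, `build_alignment_prompt`, and the dimension names `plan_alignment` and `output_quality` are all hypothetical, introduced only to show the three artefacts travelling together as labelled sections.

```python
from dataclasses import dataclass


@dataclass
class AlignmentCase:
    goal: str    # original goal, as handed to the planner
    plan: str    # the planner's intended plan
    output: str  # what the builder actually produced


def build_alignment_prompt(case: AlignmentCase) -> str:
    """Render the three artefacts as labelled sections for the judge's user turn."""
    return (
        f"GOAL:\n{case.goal}\n\n"
        f"PLAN:\n{case.plan}\n\n"
        f"ACTUAL OUTPUT:\n{case.output}\n\n"
        "Score plan_alignment (how faithfully the output implements the plan) "
        "separately from output_quality. Return JSON only."
    )


if __name__ == "__main__":
    case = AlignmentCase(
        goal="Refund the duplicate charge",
        plan="1. Look up the order. 2. Issue one refund.",
        output="Issued two refunds.",
    )
    print(build_alignment_prompt(case))
```

Keeping the plan in the judge input is what lets a well-written but unfaithful output score high on quality and low on alignment.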
When stakes are high, prefer deterministic enforcement over probabilistic prompting. An LLM judge that sometimes misses a severity-3 violation is not a substitute for a programmatic check that never misses it.
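As one illustration of a programmatic check, the sketch below flags any numeric figure in a response that never appears in the source context. It is deliberately narrow and assumes figures are written as digits in both places: it will not reconcile "1.25 million" with "1,250,000". The helper names are hypothetical.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_figures(text: str) -> set[str]:
    """Return numeric tokens with thousands separators removed."""
    return {m.group().replace(",", "") for m in _NUMBER.finditer(text)}


def unsupported_figures(response: str, context: str) -> set[str]:
    """Figures stated in the response that never appear in the source context."""
    return extract_figures(response) - extract_figures(context)


if __name__ == "__main__":
    context = "Q3 revenue: 1,250,000"
    print(unsupported_figures("Revenue was 1,300,000 in Q3.", context))  # {'1300000'}
    print(unsupported_figures("Revenue was 1,250,000 in Q3.", context))  # set()
```

A non-empty result can be treated as a hard failure before the judge is consulted at all, so the critical check does not depend on a probabilistic score.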
How do you build a living eval suite that does not go stale?
A static eval suite degrades in two ways: the agent improves past it (ceiling effect), and the real distribution of inputs drifts away from it (coverage gap). A living suite addresses both.
Difficulty evolution: track per-case pass^k scores. Cases with pass^k above 0.95 for three consecutive releases are candidates for replacement with harder variants. Cases with pass^k below 0.20 are candidates for decomposition into smaller diagnostic cases.
Diversity maintenance: cluster your eval cases by input type, domain, and failure mode. Monitor cluster coverage. When production logs reveal a new failure pattern, add cases to that cluster before the next release cycle.
Regression anchoring: the frozen regression set (described above) is the counterweight. It prevents the living suite from drifting so far toward novelty that you lose signal on the core capabilities the agent was originally built for.
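The difficulty-evolution thresholds can be encoded as a small triage function. This sketch assumes one pass^k value per release, oldest first, and applies the 0.20 floor to the latest release only, since the rule above does not say how many releases the low score must persist. The name `triage_case` and the returned labels are hypothetical.

```python
def triage_case(
    history: list[float],
    high: float = 0.95,
    low: float = 0.20,
    streak: int = 3,
) -> str:
    """Classify an eval case from its per-release pass^k history (oldest first)."""
    if len(history) >= streak and all(v > high for v in history[-streak:]):
        return "replace_with_harder_variant"
    if history and history[-1] < low:
        return "decompose"
    return "keep"


if __name__ == "__main__":
    print(triage_case([0.96, 0.97, 0.99]))  # replace_with_harder_variant
    print(triage_case([0.50, 0.10]))        # decompose
    print(triage_case([0.96, 0.97]))        # keep: streak too short
```

Cases in the frozen regression set would be excluded from this triage, since they are never retired on score alone.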
For teams preparing for the CCA-F exam, note that Domain 5 (Context Management and Reliability, 15% of the exam) and Domain 4 (Prompt Engineering and Structured Output, 20%) together cover the material most directly tested by eval pipeline design questions. Our concept library maps all 174 atomic concepts to the five exam domains, including the reliability and structured output patterns that underpin LLM-as-judge work.
What are the most consequential failure modes to gate on?
Not all failures are equal. A response that is slightly verbose is a quality issue. A response that states an incorrect financial figure is a deployment blocker. Structure your severity taxonomy before you write your first rubric.
| Severity | Definition | Gate action |
|---|---|---|
| 3 (Critical) | Factual error with financial, legal, or safety consequence | Block deployment immediately |
| 2 (High) | Instruction constraint violated; output unusable | Block if rate > 1% in regression set |
| 1 (Medium) | Quality degradation; output usable but suboptimal | Track; alert if trend worsens |
| 0 (Low) | Style or verbosity issue | Log only |
The key insight from production deployments is that overall violation rates can be low while severity-3 rates remain unacceptably high. A system with a 2% overall failure rate but a 0.5% severity-3 rate may be undeployable in a regulated domain. Always report severity-stratified metrics, never aggregate-only.
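Severity-stratified reporting and the gate actions in the table can be sketched as follows. The input is assumed to be one severity label per run of the regression set, with `None` for runs that had no failure; `severity_rates` and `gate_action` are hypothetical names.

```python
from collections import Counter
from typing import Optional


def severity_rates(severities: list[Optional[int]]) -> dict[int, float]:
    """Share of runs at each severity level; None means the run had no failure."""
    if not severities:
        raise ValueError("no runs to report on")
    counts = Counter(s for s in severities if s is not None)
    total = len(severities)
    return {level: counts[level] / total for level in (3, 2, 1, 0)}


def gate_action(rates: dict[int, float]) -> str:
    """Apply the table: any severity 3 blocks; severity 2 blocks above 1%."""
    if rates[3] > 0:
        return "block"
    if rates[2] > 0.01:
        return "block"
    return "pass"  # severity 1 is tracked and severity 0 is logged elsewhere


if __name__ == "__main__":
    runs = [None] * 196 + [1, 1, 0, 3]
    rates = severity_rates(runs)
    print(rates)               # overall failure rate is 2%, severity-3 rate is 0.5%
    print(gate_action(rates))  # block
```

The example mirrors the point above: a 2% overall failure rate looks tolerable until the 0.5% severity-3 share is reported on its own line.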
For multi-agent error handling, the same severity taxonomy should govern routing decisions: severity-3 failures trigger immediate escalation to a human reviewer, not a retry loop.
About the author
Solomon Udoh is an AI Architect who designs and ships production agent systems on the Claude API and Claude Code. He built AI Skill Certs' adaptive engine and authored its 174-concept knowledge graph, mapping every Claude Certified Architect - Foundations objective to hands-on, exam-aligned practice.
- Designs production multi-agent systems on the Claude API and Agent SDK
- Author of the AI Skill Certs knowledge graph (174 mapped exam concepts)
- Builds with MCP, Claude Code, structured outputs, and agentic loops daily
- Reviews every concept page against the official Anthropic exam guide