- In short
- Provenance and uncertainty evaluation is the judgement skill of scoring a finished multi-source synthesis against four criteria: every claim carries source, excerpt, and date; conflicts are annotated with both values rather than arbitrarily resolved; temporal context distinguishes real contradictions from time differences; and content is rendered in its appropriate format. A synthesis that misses any one of these is incomplete, however confident it reads.
What evaluating provenance and uncertainty involves
Evaluating provenance and uncertainty is the point in Task Statement 5.6 where you stop building a synthesis and start judging one. It is an evaluate-level skill, which means the exam is not asking you to produce attribution or annotate a conflict; it is handing you a finished output and asking whether it is good enough, and why. That shift matters, because a synthesis can be articulate, well-organised, and completely wrong about the things this task statement cares about. Your job is to look past the polish and apply a rubric.
The rubric has four criteria, and they are exactly the four earlier knowledge points turned into checks. Complete provenance: does every claim carry its source, excerpt, and date? Conflict handling: are disagreements annotated with both values rather than arbitrarily resolved? Temporal context: are differing figures distinguished as time differences rather than mislabelled contradictions? And rendering: is each piece of content presented in the format its type demands? A synthesis earns trust only by passing all four; failing any single one is disqualifying, no matter how good the writing is.
- Provenance and uncertainty evaluation
- The judgement of a completed multi-source synthesis against four criteria (complete provenance, annotated conflicts, temporal context, and content-appropriate rendering) to determine whether it is trustworthy and complete rather than merely fluent.
The four-point scorecard
Treat the criteria as a scorecard you run top to bottom, because each one catches a different and common failure.
- Complete provenance. Every claim has a source, an exact excerpt, and a date. This is claim-source mapping turned into an audit: a single unattributed assertion fails the criterion.
- Annotated conflicts. Where credible sources disagree, both values appear with their attribution. A synthesis that silently presents one number where two existed has resolved a conflict it had no right to resolve.
- Temporal context. Differing figures are checked against their dates first, so a real contradiction is distinguished from ordinary change over time. A false contradiction is as much a defect as a hidden one.
- Content-appropriate format. Financial data is tabular, narrative is prose, findings are lists. Correct content in the wrong format still costs the reader.
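The provenance and temporal checks are mechanical enough to sketch in code. The following is a minimal illustration, not a prescribed implementation: the `Claim` record, its field names, and the rule that two figures with different dates count as a time difference are all assumptions made for the example. A real pipeline would compare measurement periods rather than exact dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """Hypothetical record for one claim in a synthesis."""
    text: str
    value: Optional[float] = None
    source: Optional[str] = None
    excerpt: Optional[str] = None
    as_of: Optional[date] = None


def has_complete_provenance(claim: Claim) -> bool:
    # A claim passes only if source, excerpt, and date are all present.
    return all([claim.source, claim.excerpt, claim.as_of])


def classify_difference(a: Claim, b: Claim) -> str:
    """Label two figures reported for the same quantity.

    Dates are checked before anything is called a conflict, as the
    temporal-context criterion requires.
    """
    if a.value == b.value:
        return "agreement"
    if a.as_of is None or b.as_of is None:
        return "undated"  # cannot tell; provenance has to be fixed first
    if a.as_of != b.as_of:
        return "time difference"  # assumption: different dates explain the gap
    return "conflict"  # same date, different values: annotate both


first = Claim("market size", 10.0, "Report A", "size was 10", date(2023, 1, 1))
later = Claim("market size", 12.0, "Report B", "size was 12", date(2024, 1, 1))
rival = Claim("market size", 12.0, "Report C", "size was 12", date(2023, 1, 1))

print(classify_difference(first, later))
print(classify_difference(first, rival))
```

The ordering inside `classify_difference` is the point of the sketch: a pair of figures is only called a conflict after the date check has failed to explain the difference.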
Fluency is the trap, not the test
The hardest thing about evaluating a synthesis is resisting the pull of good prose. A confident, well-structured paragraph signals quality to the reader's instincts, and a sloppy one signals the opposite, but neither instinct tracks the criteria. An output can be beautifully written and quietly resolve a genuine conflict by picking whichever source it read first; another can be clumsy but scrupulously attribute and annotate everything. The evaluation skill is to score the second higher than the first, because the rubric measures trustworthiness, not readability.
This is also where uncertainty handling meets escalation. When a synthesis confronts a conflict it cannot honestly resolve, the right output is not a manufactured single answer; it is a surfaced disagreement and, where a decision is genuinely required, an escalation to a human. An evaluator should reward an agent that says the evidence is divided and reject one that smoothed the division away to look decisive. Decisiveness about contested facts is a vice here, not a virtue.
Running the evaluation
The gates are ordered for convenience, not priority: a real evaluation checks all four regardless of where the first failure appears, because the author needs the complete list of defects to fix, not just the first one. Even so, running them in sequence makes the standard concrete: passing is the conjunction of four conditions, and most flawed syntheses fail quietly at exactly one of them.
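As a rough sketch of that flow, the snippet below runs every gate and collects every defect before deciding the verdict. The `notes` dictionary, its keys, and the four lambda gates are hypothetical stand-ins: they assume the reviewer's findings have already been recorded and only show the control flow, not how each finding is detected.

```python
from typing import Callable, Dict, List

# A gate inspects the reviewer's notes and returns a list of defect
# descriptions. An empty list means the gate passed.
Gate = Callable[[dict], List[str]]


def run_all_gates(notes: dict, gates: Dict[str, Gate]) -> Dict[str, List[str]]:
    """Run every gate, even after an earlier one has failed."""
    return {name: gate(notes) for name, gate in gates.items()}


def passes(report: Dict[str, List[str]]) -> bool:
    """Passing is a conjunction: every gate must report zero defects."""
    return all(len(defects) == 0 for defects in report.values())


# Hypothetical gates that read pre-recorded findings from the notes.
GATES: Dict[str, Gate] = {
    "provenance": lambda n: [f"unattributed: {c}" for c in n.get("unattributed", [])],
    "conflicts": lambda n: [f"silently resolved: {c}" for c in n.get("hidden_conflicts", [])],
    "temporal": lambda n: [f"false contradiction: {c}" for c in n.get("false_contradictions", [])],
    "rendering": lambda n: [f"wrong format: {c}" for c in n.get("format_mismatches", [])],
}

notes = {
    "unattributed": [],
    "hidden_conflicts": ["market size"],
    "false_contradictions": [],
    "format_mismatches": ["quarterly figures in prose"],
}

report = run_all_gates(notes, GATES)
for gate_name, defects in report.items():
    print(gate_name, "PASS" if not defects else defects)
print("verdict:", "pass" if passes(report) else "fail")
```

Because `run_all_gates` never short-circuits, the report names both the hidden conflict and the format mismatch, which is the complete list the author needs.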
Worked example: reviewing a junior agent's market synthesis
You are the reviewing step in a pipeline. A research agent has produced a polished two-page market synthesis for a client, and your task is to accept it, return it for revision, or escalate.
The writing is excellent, which is the first thing to set aside. You run the scorecard. Provenance: most claims cite a dated source and excerpt, but two key figures appear with no attribution at all (first gate, fail, because even a single unattributed assertion fails the criterion). Conflict handling: the synthesis reports a single market-size figure, yet your spot-check finds the underlying notes contained two credible, conflicting estimates, and the agent simply used the first one it retrieved (second gate, fail, and the exact trap this knowledge point names). Temporal context: a growth figure is flagged as contradicting an earlier report, but the two were measured a year apart, so the contradiction is false (third gate, fail). Rendering: the quarterly figures are buried in prose rather than tabulated (fourth gate, fail).
Your verdict is not subtle: despite the fluent writing, the synthesis fails all four criteria. You return it with a specific, criterion-by-criterion list: attribute the two orphan claims, surface and annotate the market-size conflict, re-check the growth figures against their dates, and tabulate the quarterly data. Where the conflict cannot be resolved by evidence and a decision is needed, you note that the right move is to escalate, not to let the agent pick a number. A reviewer who had judged on polish alone would have shipped a confident, untrustworthy report.
Common misreadings to avoid
Misconception
When two sources conflict, resolving it by going with the first source found keeps the output clean and decisive.
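What's actually true
Going with the first source found is an arbitrary resolution, not a clean one. Where credible sources disagree, both values should appear with their attribution, and where a decision is genuinely required the conflict should be escalated to a human.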
Misconception
If a synthesis is well-written and confident, it has probably handled provenance and uncertainty correctly.
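What's actually true
Fluency does not track the criteria. A polished synthesis can still omit attribution or silently resolve a conflict, so it has to be scored against the four-criterion rubric rather than on how it reads.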
Diagnose, do not just grade
An evaluation that produces only a verdict, pass or fail, is half-finished. The point of running a rubric is to return a precise, criterion-by-criterion diagnosis the author can act on, because a synthesis that fails three gates needs three specific fixes, not a vague instruction to do better. This is why the four criteria are checked in full even after the first failure is found: an evaluator who stops at the first defect hands back an incomplete punch-list and guarantees another round of revision. The discipline is to score every gate, name every gap, and tie each one to the knowledge point it violates.
That habit also makes the evaluation defensible. Saying this synthesis is weak is an opinion; saying two claims lack attribution, a market-size conflict was silently resolved, a growth figure was mislabelled a contradiction across a year, and the quarterly data should be a table is a diagnosis anyone can verify against the output. The evaluate-level skill the exam is testing is exactly this move from impression to itemised judgement, and it is what separates a reviewer who improves a pipeline from one who merely reacts to it.
Distinguishing fixable defects from genuine uncertainty
Not every gap a rubric surfaces is the author's fault, and a mature evaluation says which is which. Some failures are defects of execution (a missing excerpt, an untabulated figure, a false temporal contradiction), and the right response is a revision request. Others reflect irreducible uncertainty in the world: two equally credible sources genuinely disagree and no date, definition, or revision resolves them. For that second kind, the synthesis cannot be faulted for failing to produce a single answer, because there is no honest single answer to produce. Faulting it would be demanding the very arbitrary resolution the task statement forbids.
The evaluator's job there is to confirm the uncertainty was handled honestly (surfaced, annotated, and, where a decision is required, routed to a human through the escalation pathway) rather than to insist it be resolved. This is the most sophisticated judgement in Domain 5: knowing when a clean answer is missing because the agent failed and when it is missing because the evidence itself is divided. An evaluator who cannot tell those apart will either pass concealment as confidence or punish honesty as indecision, and the exam is interested in precisely those candidates who can hold the distinction.
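The distinction between fixable defects and honestly handled uncertainty can be sketched as a small triage function. The three inputs and the `Action` names are assumptions chosen for illustration. In particular, a concealed conflict is treated here as an execution defect, while a conflict that was surfaced and annotated is not.

```python
from enum import Enum
from typing import List


class Action(Enum):
    ACCEPT = "accept"
    REVISE = "return for revision"
    ESCALATE = "escalate to a human"


def triage(execution_defects: List[str],
           open_conflict_annotated: bool,
           decision_required: bool) -> Action:
    """Choose the reviewer's next step (hypothetical decision rule)."""
    # Execution defects, including a concealed conflict, are the author's to fix.
    if execution_defects:
        return Action.REVISE
    # A surfaced, annotated conflict is not a defect. It is escalated only
    # when someone actually has to decide between the competing values.
    if open_conflict_annotated and decision_required:
        return Action.ESCALATE
    return Action.ACCEPT


print(triage(["missing excerpt"], True, True).value)
print(triage([], True, True).value)
print(triage([], True, False).value)
```

Note that no branch asks the agent to pick a number: the only ways out of an open conflict are to leave it annotated or to hand it to a human.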
The rubric is a conjunction, and that is the point
The single most important property of the four-criterion rubric is that it is a logical conjunction: a synthesis passes only if it satisfies all four conditions, and fails if it misses any one. This is deliberately unforgiving, and it is unforgiving for a reason. The four criteria are not a weighted scorecard where strength on three can offset weakness on the fourth, because each one defends against a distinct way a synthesis can mislead. A perfectly attributed report that hides a conflict is still dangerous; a perfectly formatted one that mislabels a temporal difference still misinforms. There is no average that makes a hidden conflict acceptable, so there is no averaging in the rubric.
Holding that line is what makes the evaluation meaningful, and it is also what the exam probes with its hardest distractors. A tempting wrong answer will list the genuine strengths of a synthesis (clear writing, mostly complete citations, clean tables) and conclude that it is good enough, inviting you to trade a strength against the one fatal gap. The disciplined response refuses the trade: it names the failed criterion and fails the synthesis on it, regardless of how much else is right. Internalising that the rubric is a conjunction rather than a sum is, in the end, the whole evaluate-level lesson of Task Statement 5.6, because it is the difference between judging a synthesis on its weakest load-bearing property and being charmed by its strongest cosmetic one.
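A short numerical contrast shows why averaging is the wrong model. The per-criterion scores and the 0.7 threshold below are invented for illustration: the rubric itself defines no weights or thresholds, which is exactly the point.

```python
from typing import Dict


def weighted_verdict(scores: Dict[str, float], threshold: float = 0.7) -> bool:
    """Averaging rule, the one the rubric rejects. The threshold is arbitrary."""
    return sum(scores.values()) / len(scores) >= threshold


def conjunctive_verdict(scores: Dict[str, float]) -> bool:
    """Conjunction rule: every criterion must be fully satisfied."""
    return all(score >= 1.0 for score in scores.values())


# Strong on three criteria, but a conflict was silently resolved.
scores = {"provenance": 1.0, "conflicts": 0.0, "temporal": 1.0, "rendering": 1.0}

print(weighted_verdict(scores))     # True: the 0.75 average clears the 0.7 threshold
print(conjunctive_verdict(scores))  # False: the hidden conflict is disqualifying
```

The averaging rule is the distractor in code form: three strengths buy back one fatal gap. The conjunction cannot be bought.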
Calibrated uncertainty: supported, inferred, and unknown
The uncertainty half of this knowledge point asks for more than annotating conflicts; it asks that a synthesis be honest about how well each claim is grounded. A useful discipline is to sort claims into three kinds: those directly supported by a cited source and excerpt, those inferred by the agent from supported facts but not stated by any source, and those that remain unknown because the evidence does not settle them. A trustworthy synthesis keeps these visibly distinct rather than presenting an inference or a guess in the same confident register as a sourced fact.
This reflects what Anthropic asks of the model itself. Its published constitution says Claude should assert only what it believes to be true and should acknowledge uncertainty or the limits of its knowledge rather than projecting false confidence. An evaluator applying the rubric should extend that standard to the whole synthesis: reward an output that marks an inference as an inference and names what it could not determine, and penalise one whose fluent prose flattens supported, inferred, and unknown claims into a single decisive voice. Because the model's own text is untrusted until verified, calibrated uncertainty is not only an honesty property but a safety one: it tells the reader exactly which claims still need checking before anyone acts on them.
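One way to keep the three kinds of claim visibly distinct is to label each statement before it is rendered. The sketch below is an assumption-laden illustration: the `Statement` record, the `derived_from` field, and the bracketed label format are hypothetical, and the labelling rule is deliberately simple.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Grounding(Enum):
    SUPPORTED = "supported"
    INFERRED = "inferred"
    UNKNOWN = "unknown"


@dataclass
class Statement:
    """Hypothetical record for one sentence in a synthesis."""
    text: str
    source: Optional[str] = None
    excerpt: Optional[str] = None
    # Texts of supported statements this one was inferred from, if any.
    derived_from: List[str] = field(default_factory=list)


def grounding(s: Statement) -> Grounding:
    if s.source and s.excerpt:
        return Grounding.SUPPORTED  # backed directly by a cited excerpt
    if s.derived_from:
        return Grounding.INFERRED   # the agent's reasoning, not a source's claim
    return Grounding.UNKNOWN        # the evidence does not settle it


def render(s: Statement) -> str:
    # Keep the label visible so the reader knows what still needs checking.
    return f"[{grounding(s).value}] {s.text}"


sourced = Statement("Revenue grew in Q2.", source="Q2 filing", excerpt="revenue rose")
guess = Statement("Growth will likely continue.", derived_from=["Revenue grew in Q2."])
gap = Statement("Competitor pricing is not stated in any source.")

for s in (sourced, guess, gap):
    print(render(s))
```

An evaluator can then check that no statement labelled inferred or unknown is presented in the same register as a supported one.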
How this shows up on the exam
As the evaluate-level peak of Task Statement 5.6, this knowledge point appears as scenario questions that hand you a complete synthesis and ask you to judge it or pick the best critique. The decisive skill is applying the four-criterion rubric and refusing to be swayed by how authoritative the output sounds. The most common correct answer identifies a hidden or arbitrarily resolved conflict, a missing attribution, a false temporal contradiction, or a format mismatch, and the most common wrong answer praises the synthesis for its clarity while missing one of those defects. Because it draws on every other 5.6 knowledge point at once, doing well here confirms you have internalised the whole task statement, not just its parts.
You are evaluating a fluent, confident market synthesis. It cites most claims with dated sources, presents data in clear tables, and reads professionally, but it reports a single market-size figure where the source notes contained two credible, conflicting estimates the agent did not reconcile by any dated evidence. How should you score it?