- In short
- Workflow enforcement scenario analysis is the evaluate-level skill of examining a concrete production scenario, judging the severity, reversibility, and blast radius of an enforcement failure, and selecting the matching mechanism: programmatic for financial, security, or compliance consequences, and prompt based for cosmetic ones.
What workflow enforcement scenario analysis asks of you
The earlier knowledge points in task statement 1.4 give you the pieces: the difference between a prompt and a gate, the decision rule for stakes, and how to build a gate. Workflow enforcement scenario analysis is where you put them together under judgement. At the evaluate level of Bloom's taxonomy, you are no longer recalling a definition or applying a fixed rule; you are weighing a concrete, sometimes messy scenario and defending the enforcement choice you make for it.
That shift matters because real scenarios are rarely labelled. A stem will describe an agent, a workflow, and a failure, and it will not announce "this is a compliance operation." You have to read the consequences out of the situation yourself, then reason about which mechanism is proportionate. The skill is part classification and part trade-off analysis, and the exam rewards candidates who can do both quickly and explain why.
- Workflow enforcement scenario analysis
- The evaluate-level practice of analysing a specific scenario to pick an enforcement mechanism. You assess the severity, reversibility, and reach of a potential failure, classify the operation as high or low stakes, and choose programmatic enforcement or a prompt accordingly, justifying the choice against the scenario's consequences.
A useful way to frame the evaluate level is that the earlier knowledge points give you correct answers to clean questions, while this one gives you the judgement to handle messy ones. A clean question states the stakes; a messy scenario makes you infer them, weigh competing concerns, and sometimes accept that two designs are both defensible while one is clearly better. That is why this knowledge point sits at difficulty four and at the top of Bloom's hierarchy for the task statement: it is not testing whether you know the rule but whether you can wield it when the situation refuses to announce which rule applies. Approach each scenario as a small argument you have to win, not a fact you have to recall.
The three lenses for judging a scenario
A reliable analysis looks at a failure through three lenses before deciding.
- Severity: how bad is one failure? Money moved, data exposed, or a law breached is severe; a clumsy sentence is not. Severity is the primary signal and usually settles the question on its own.
- Reversibility: can the failure be undone? A wrong refund can sometimes be clawed back; a leaked medical record cannot. The less reversible a failure, the stronger the case for a hard gate.
- Blast radius: how many people or accounts does one failure touch? A misformatted reply affects one customer; a flawed compliance check can affect every user in a jurisdiction. A wide blast radius pushes toward determinism.
When any lens shows real harm (financial, security, or compliance), the analysis lands on programmatic enforcement. When all three are benign, a prompt is the proportionate answer. The lenses also explain the rarer mistake the exam can test: over-enforcement, where someone hard-gates a cosmetic rule and pays in rigidity and cost to prevent a failure that never mattered.
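The three lenses can be sketched as a small decision function. This is an illustrative model only: the `Failure` fields, the `WIDE_BLAST_RADIUS` threshold, and the function names are assumptions made for the example, not part of any exam material or library.

```python
from dataclasses import dataclass
from enum import Enum


class Mechanism(Enum):
    PROGRAMMATIC = "programmatic gate"
    PROMPT = "prompt instruction"


@dataclass(frozen=True)
class Failure:
    """One potential failure viewed through the three lenses (hypothetical model)."""
    description: str
    harm: str             # "financial", "security", "compliance", or "cosmetic"
    reversible: bool      # can the failure be undone afterwards?
    people_affected: int  # rough blast radius of a single failure


REAL_HARM = {"financial", "security", "compliance"}
WIDE_BLAST_RADIUS = 100  # assumed threshold, chosen only for illustration


def lenses_showing_harm(failure: Failure) -> list:
    """Return the names of the lenses that flag this failure."""
    flagged = []
    if failure.harm in REAL_HARM:
        flagged.append("severity")
    if not failure.reversible:
        flagged.append("reversibility")
    if failure.people_affected > WIDE_BLAST_RADIUS:
        flagged.append("blast radius")
    return flagged


def choose_mechanism(failure: Failure) -> Mechanism:
    # Any flagged lens lands on a gate; all three benign means a prompt.
    if lenses_showing_harm(failure):
        return Mechanism.PROGRAMMATIC
    return Mechanism.PROMPT


deletion = Failure("delete customer data on request", "compliance", False, 1)
tone = Failure("jargon in a fee explanation", "cosmetic", True, 1)
print(choose_mechanism(deletion).value, "|", choose_mechanism(tone).value)
```

Real scenarios need judgement rather than a lookup, so treat the function as a way to check your reasoning, not as a substitute for it.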
Why the prompt based option keeps looking right
The reason these questions are hard is that the distractors are engineered to be tempting. For a genuinely high stakes scenario the exam will present a more detailed system prompt, a set of few-shot examples, and a classifier that screens risky requests, each described in confident, professional language. They look like the careful work of a thoughtful engineer, and under time pressure the most elaborate option is seductive.
Scenario analysis defeats the temptation by anchoring on consequence rather than craft. Once your three lenses show that a failure moves money or breaks a law, every probabilistic option collapses into the same category: it reduces the failure rate but cannot reach zero, and zero is what the consequence demands. Anthropic's own guidance reinforces the instinct from the other direction, advising that you add complexity only when it demonstrably improves the outcome, so you neither under-enforce a dangerous step nor over-engineer a harmless one.
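A short calculation shows why "reduces the failure rate" is not the same as "reaches zero". The per-call failure rates and the call volume below are made-up illustrative numbers, and the sketch assumes each call fails independently of the others.

```python
def chance_of_any_failure(per_call_failure_rate: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls fails."""
    return 1.0 - (1.0 - per_call_failure_rate) ** calls


# Illustrative rates for increasingly elaborate prompt based controls.
for rate in (0.01, 0.001, 0.0001):
    risk = chance_of_any_failure(rate, 100_000)
    print(f"{rate:.2%} per call -> {risk:.2%} chance of at least one failure")

# A deterministic gate corresponds to a rate of exactly zero.
print(chance_of_any_failure(0.0, 100_000))
```

Under these assumptions, even the lowest non-zero rate makes at least one failure over 100,000 calls close to certain, while the gate's rate of zero stays zero at any volume.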
Worked example: evaluating a mixed set of scenarios
The evaluate skill is best practised across several scenarios at once, because the contrast is the lesson.
Evaluate four behaviours from a single banking assistant and decide, with justification, which need programmatic enforcement and which can stay on a prompt.
Scenario one: the assistant must permanently delete a customer's data on request under privacy law. Severity is high, reversibility is nil, and the blast radius is regulatory. Every lens screams high stakes, so this needs programmatic enforcement, a gate that only deletes after the request is validated and logged. A prompt here would be negligent.
Scenario two: the assistant must never move funds between accounts without a confirmed recipient. Money, irreversible once sent, potentially large. Programmatic again, a gate on the transfer tool keyed to a confirmation flag. Any prompt based option in this slot is a distractor no matter how detailed.
Scenario three: the assistant should use plain language and avoid jargon when explaining fees. Severity is cosmetic, fully reversible (the customer can ask again), blast radius of one mildly confused reader. This is a textbook prompt: an instruction to write plainly is proportionate, and gating it would be absurd over-enforcement.
Scenario four: the assistant should end each chat by asking if there is anything else. Trivial severity, reversible, tiny blast radius. Prompt based, and not worth a line of enforcement code.
Laid side by side, the four make the discipline obvious: the first two earn determinism because a single failure is costly and irreversible, while the last two stay on prompts because a failure costs nothing. The analyst's job is to read those consequences out of the scenario and spend enforcement effort only where the lenses justify it.
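The split in the worked example can be sketched in code. Everything here is hypothetical: `TransferRequest`, `execute_transfer`, `GateRefused`, and the prompt wording are invented for illustration, and a real banking system would involve far more than a single flag.

```python
from dataclasses import dataclass

# Scenarios three and four: cosmetic behaviours live in the prompt.
SYSTEM_PROMPT = (
    "Explain fees in plain language and avoid jargon. "
    "End each chat by asking if there is anything else."
)


class GateRefused(Exception):
    """Raised when a deterministic check blocks a tool call."""


@dataclass
class TransferRequest:
    from_account: str
    to_account: str
    amount_cents: int
    recipient_confirmed: bool = False


def execute_transfer(request: TransferRequest, ledger: list) -> None:
    # Scenario two: the gate runs in code on every call,
    # whatever the model generated beforehand.
    if not request.recipient_confirmed:
        raise GateRefused("recipient not confirmed; transfer blocked")
    ledger.append((request.from_account, request.to_account, request.amount_cents))


ledger = []
try:
    execute_transfer(TransferRequest("acct-1", "acct-2", 5_000), ledger)
except GateRefused as refusal:
    print("blocked:", refusal)
execute_transfer(TransferRequest("acct-1", "acct-2", 5_000, recipient_confirmed=True), ledger)
print(ledger)
```

The design point is where each rule lives: the transfer rule sits in the tool itself, so no wording of the prompt can bypass it, while the tone rules cost one sentence each.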
Misconceptions at the evaluate level
Misconception
If an enforcement question offers a very detailed, professional prompt based solution, that thoroughness makes it the strongest answer.
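What's actually true
Thoroughness does not change the category. A detailed system prompt, few-shot examples, or a screening classifier can each reduce the failure rate, but none can reach zero. For a high stakes step, the programmatic gate remains the correct answer however polished the prompt based option looks.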
Misconception
The safest move is always to enforce every rule programmatically, so picking the code option is never wrong.
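What's actually true
Over-enforcement is a mistake too. Hard-gating a cosmetic rule pays in rigidity and cost to prevent a failure that never mattered, so for a low stakes behaviour the proportionate choice is a prompt.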
Reading hidden stakes from an unlabelled scenario
Because exam scenarios do not carry tidy labels, the first real skill is detecting stakes that the wording disguises. Train your eye on cues. Money verbs (refund, charge, transfer, credit, payout) signal a financial consequence. Access and identity terms (verify, permission, credential, recover, reset) signal security. Regulatory and data words (delete, disclose, consent, log, retain, jurisdiction) signal compliance. A scenario can sound routine while quietly containing any of these, and spotting one is usually enough to settle the analysis.
Irreversibility is the second cue to hunt for. A failure you can undo is less demanding than one you cannot, so phrases like "permanently," "cannot be recalled," or "non-refundable" raise the bar toward a hard gate even if the headline action seems modest. When you read a scenario, underline the verbs and nouns that describe an effect on the world, then ask whether any of them moves money, changes access, invokes a rule, or cannot be taken back. That habit converts an unlabelled vignette into a labelled one you can judge with confidence.
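The cue-hunting habit can be approximated with a simple word scan. The word lists come from the cues named above; the `detect_stakes` function and its output shape are assumptions for illustration. It matches exact words only, so inflected forms such as "refunds" are missed, and a human read of the scenario remains the real test.

```python
import re

CUES = {
    "financial": {"refund", "charge", "transfer", "credit", "payout"},
    "security": {"verify", "permission", "credential", "recover", "reset"},
    "compliance": {"delete", "disclose", "consent", "log", "retain", "jurisdiction"},
}
IRREVERSIBILITY_PHRASES = ("permanently", "cannot be recalled", "non-refundable")


def detect_stakes(scenario: str) -> dict:
    """Return the stake categories and irreversibility phrases found in the text."""
    text = scenario.lower()
    words = set(re.findall(r"[a-z]+", text))
    stakes = {
        label: sorted(cues & words)
        for label, cues in CUES.items()
        if cues & words
    }
    irreversible = [phrase for phrase in IRREVERSIBILITY_PHRASES if phrase in text]
    return {"stakes": stakes, "irreversible": irreversible}


print(detect_stakes("The agent may transfer funds that cannot be recalled."))
print(detect_stakes("The agent should greet the customer warmly."))
```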
When both mechanisms belong together
Evaluation is not always a clean either-or. Sometimes the strongest design uses both mechanisms at once, a gate for safety and a prompt for experience, and recognising that prevents you from treating them as mutually exclusive. Consider an agent that must never issue a refund without verified ownership but should also explain warmly to the customer why a verification step is needed. The gate guarantees the safety property; the prompt shapes how the agent talks about it. Neither replaces the other, and a good answer can rightly involve both layers.
What evaluation rules out is using the prompt in place of the gate for the safety-critical part. Defense in depth means the prompt steers the model toward the right path while the gate guarantees it cannot leave that path, so even a perfectly worded prompt is backed by a deterministic floor. When an exam option offers a prompt as the sole control for a high stakes step, it fails; when it pairs a prompt with a gate, or reserves the prompt for the cosmetic layer, it can be exactly right. Telling those two situations apart is the evaluate-level judgement on display.
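A minimal sketch of the two layers working together, using the refund example. The `dispatch` function, the `GATES` registry, the session dictionary, and the prompt text are all hypothetical names invented for this illustration rather than any real agent framework's API.

```python
from typing import Callable, Dict, Optional

# Experience layer: shapes how the agent talks, guarantees nothing.
SYSTEM_PROMPT = (
    "Before any refund, explain warmly why we need to verify account ownership."
)


def refund_gate(session: dict, args: dict) -> Optional[str]:
    """Return a refusal reason, or None when the call may proceed."""
    # The check reads server-side session state, not anything the model claims.
    if not session.get("ownership_verified", False):
        return "ownership not verified"
    return None


# Safety layer: only high stakes tools get a gate.
GATES: Dict[str, Callable[[dict, dict], Optional[str]]] = {
    "issue_refund": refund_gate,
}


def dispatch(tool_name: str, args: dict, session: dict) -> dict:
    """Run the gate (if any) before executing a tool call the model requested."""
    gate = GATES.get(tool_name)
    reason = gate(session, args) if gate else None
    if reason is not None:
        return {"status": "blocked", "reason": reason}
    return {"status": "executed", "tool": tool_name, "args": args}


print(dispatch("issue_refund", {"amount_cents": 2_000}, {}))
print(dispatch("issue_refund", {"amount_cents": 2_000}, {"ownership_verified": True}))
```

Tools without an entry in `GATES` pass straight through, which reflects the proportionality rule: the deterministic floor is reserved for the safety-critical step.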
Justifying the verdict
At the evaluate level you are expected not just to pick a mechanism but to defend it, and a tight justification follows a fixed shape: name the consequence of a failure, pass it through severity, reversibility, and blast radius, then state the mechanism that the consequence demands. "Cancelling a dispatched shipment triggers a non-refundable penalty, which is financial and irreversible, so it requires a deterministic gate rather than a prompt" is the kind of one-sentence argument that demonstrates real understanding rather than pattern-matching.
This justification habit also guards you against the over-enforcement trap from the other side. If you cannot articulate a real consequence for a rule (money, access, or law), then you have likely found a cosmetic behaviour that a gate would only burden. Being able to say out loud why a rule is or is not high stakes is the clearest sign you are evaluating rather than guessing, and it is the same disciplined reasoning the other scenario-analysis knowledge points across the exam reward.
How this knowledge point is tested
This is the capstone of task statement 1.4, and it is where the exam's scenario-based style is most explicit. A question hands you an unlabelled situation and four options, and your task is to evaluate rather than recall. Because the Bloom level is evaluate, the difficulty is in the judgement: spotting that an innocuous-sounding step is actually a compliance action, or that a frightening-sounding step is actually cosmetic, and choosing the proportionate mechanism with a reason you could defend.
Run every such question through the same loop: name the consequence of a failure, pass it through the severity, reversibility, and blast-radius lenses, then select the mechanism that matches and reject the rest. That loop generalises beyond enforcement to the other scenario-analysis knowledge points across the exam, such as session management scenario analysis and error response scenario analysis, all of which reward the same habit of reasoning from consequences rather than surface cues.
An exam item describes a logistics agent that can cancel shipments. One behaviour under review: the agent must not cancel a shipment that has already been dispatched, because doing so triggers a non-refundable carrier penalty. Four solutions are offered. Evaluating by consequence, which is correct?
Watch and learn
How We Build Effective Agents: Barry Zhang, Anthropic
Why watch: Anthropic's Barry Zhang explains when to lean on the model versus deterministic code, the core judgment for choosing an enforcement mechanism by consequence severity.