Plan Mode vs Direct Execution Classification

In short: Mode selection classification is the judgement of whether a given task belongs in plan mode or direct execution, decided by its scope, complexity, and uncertainty. Broad, multi-file, or architecturally uncertain work earns planning; a contained change with a known fix is executed directly.

What classifying plan mode vs direct execution involves

Choosing plan mode vs direct execution is, at heart, a classification problem: you read a task description and decide which mode the work belongs in, then defend the choice. This knowledge point sits at the top of Task Statement 3.4 because it integrates everything below it. Recognising what plan mode is for, recognising what direct execution is for, and understanding the hybrid pattern all feed into a single evaluative act: looking at an unfamiliar task and assigning it correctly. The exam pitches this at the evaluate level, so the questions are less "what is plan mode" and more "here is a task; which mode, and why."

The reason this deserves its own concept is that the signals can conflict and have to be weighed, not merely counted. A task might touch one file yet hide a real design decision; another might touch thirty files yet be a purely mechanical find-and-replace. A rubric that just counts files gets both wrong. The architect's job is to read past surface features to the underlying scope, complexity, and uncertainty, and that judgement is what an evaluate-level item is designed to reward.

Mode selection classification: The evaluative judgement of whether a task belongs in plan mode or direct execution, made by weighing its scope, complexity, and uncertainty rather than any single surface feature.

The three-axis rubric

A reliable way to classify is to score the task on three axes drawn from the best-practices guidance. The first axis is scope: does the change live in one contained place, or does it spread across many files and modules? The second is complexity: is the work mechanical and well-trodden, or does it force a structural decision about how the system is shaped? The third is uncertainty: is the correct approach already known, or are there several plausible designs that have to be weighed first?

The decision rule is asymmetric. If all three axes read low (contained scope, mechanical complexity, known approach), the task is direct execution, and planning would only restate the obvious. But a high reading on any single axis is enough to justify planning, because broad scope, architectural weight, or genuine uncertainty each independently make a half-built, improvised change expensive to unwind. You are not averaging the axes; you are looking for any one of them to cross the line.
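
The asymmetric rule can be sketched in a few lines of Python. This is an illustration only: `Level`, `AxisScores`, and `classify` are hypothetical names, not part of Claude Code or any Anthropic API. The low/high readings are judgements the architect supplies after reading the task; they are not derived from a file count.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class AxisScores:
    """Hypothetical record of an architect's judgement on each axis."""
    scope: Level
    complexity: Level
    uncertainty: Level


def classify(scores: AxisScores) -> str:
    """Return "plan" if any single axis is high, otherwise "direct"."""
    axes = (scores.scope, scores.complexity, scores.uncertainty)
    # Logical OR over the axes: the readings are never averaged.
    if any(axis is Level.HIGH for axis in axes):
        return "plan"
    return "direct"


all_low = AxisScores(Level.LOW, Level.LOW, Level.LOW)
one_high = AxisScores(Level.LOW, Level.LOW, Level.HIGH)
print(classify(all_low))   # direct
print(classify(one_high))  # plan
```

Two low readings do not outvote one high reading: `one_high` lands in plan mode on uncertainty alone.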

3 axes: scope, complexity, uncertainty

Any high: one high axis justifies planning

All low: direct execution

Worked classifications from the exam's own examples

The task statement supplies three canonical cases, and running each through the rubric shows how the axes resolve. A monolith restructuring scores high on scope (it moves code widely) and high on complexity (it forces decisions about module boundaries), so it is plan mode without hesitation. A null-pointer fix in a single function scores low everywhere (one place, mechanical work, and a fix already known from the stack trace), so it is direct execution. A library migration across thirty files scores high on scope and carries real uncertainty about the migration strategy even though each edit is small, so it too is plan mode.

Notice that the thirty-file migration and the single-function fix are separated mainly by scope and uncertainty, not by how hard any individual edit is. That is the discriminating insight the evaluate-level item is probing: the migration's difficulty is in coordinating a consistent transformation across many files and choosing the order, which is precisely what a plan provides. Misreading it as "lots of easy edits, so just execute" is the trap.
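
As a rough sketch, the three canonical cases can be written down as axis readings and resolved mechanically, with the high axes doubling as the justification. The `CASES` table and the `justify` helper are hypothetical. The text gives no uncertainty reading for the monolith restructuring, so it is assumed low here purely for illustration.

```python
# Axis readings for the three canonical cases. True means "high".
CASES = {
    "monolith restructuring": {
        "scope": True,
        "complexity": True,
        "uncertainty": False,  # not stated in the text; assumed low here
    },
    "null-pointer fix in one function": {
        "scope": False,
        "complexity": False,
        "uncertainty": False,
    },
    "library migration across thirty files": {
        "scope": True,
        "complexity": False,  # each individual edit is small
        "uncertainty": True,
    },
}


def justify(readings: dict[str, bool]) -> tuple[str, list[str]]:
    """Return the mode plus the high axes that justify it."""
    high = [axis for axis, is_high in readings.items() if is_high]
    return ("plan" if high else "direct", high)


for name, readings in CASES.items():
    mode, reasons = justify(readings)
    print(f"{name}: {mode} ({', '.join(reasons) or 'all axes low'})")
```

The migration and the null-pointer fix both read low on complexity; only scope and uncertainty separate them.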

Scoring a task on the three axes

Classification is a logical OR over three axes, not an average: one high axis is enough to plan.

Why this matters for the Claude Certified Architect exam

Evaluate-level questions are where the exam separates architects who have memorised definitions from those who can apply judgement. Because this knowledge point integrates the whole task statement, a single item can quietly test all of it: the scenario describes a task, the four options each pair a mode with a justification, and you must pick the option whose mode and reasoning both hold. Wrong options frequently get the mode right but the reason wrong, or offer a superficially plausible justification (file count alone, or "planning is always safer") that the rubric exposes.

Scenario 2 and Scenario 4 are the usual settings, and the principle the exam leans on is explicit in the documentation: planning is most useful when you are uncertain about the approach, when the change modifies multiple files, or when you are unfamiliar with the code, and you should skip it when you could describe the diff in one sentence. An answer that classifies correctly and cites the matching signal is the one that scores.

Worked example

An exam item presents two tasks back to back and asks an architect to assign each a mode with a justification: (A) rename a misspelled variable used in three adjacent lines of one function; (B) extract a shared authentication concern out of four controllers into middleware.

Task A scores low on every axis. The scope is three lines in one function, the complexity is mechanical, and the approach, rename the symbol, is fully known. There is no design to weigh and nothing to coordinate. The correct classification is direct execution, justified by contained scope and a known fix. Reaching for plan mode here would produce a plan that says "rename the variable," which is overhead with no payoff.

Task B looks small in line count but is not. Extracting a cross-cutting concern into middleware forces a structural decision: where the middleware sits, how the four controllers hand off to it, and whether their current behaviours are truly identical or merely similar. Scope spans four files, complexity is architectural, and there is real uncertainty about the seam. Any one of those alone would justify planning; together they make it clear-cut. The correct classification is plan mode, justified by multi-file scope and an architectural decision that deserves review before code is written.

The evaluative skill is holding both tasks against the same rubric and resisting the pull to treat them the same. The architect who classifies every task identically, whether always planning or always executing, fails exactly the discrimination this item is built to test.

Common misreadings to avoid

The evaluate level invites subtle errors, and two recur.

Misconception

The number of files a change touches is a reliable rule: many files means plan mode, one file means direct execution.

What's actually true

File count is one signal among three, not a rule. A single file can hide a genuine design decision that warrants planning, and many files can be a purely mechanical edit that does not. Weigh scope alongside complexity and uncertainty rather than counting files in isolation.

Misconception

A consistent workflow is better, so picking one mode and applying it to every task makes an architect more disciplined.

What's actually true

Applying one mode regardless of complexity is the central anti-pattern of this task statement. It over-plans trivial fixes and under-plans risky changes. Discipline here means classifying each task on its merits and matching the mode to it, not enforcing uniformity.

Reading past the surface to the real decision

The hardest part of classification is resisting the surface features the exam deliberately makes salient. File count is the obvious lure: a change described as touching one file invites a direct-execution answer, and one described as touching many invites a plan-mode answer, regardless of what the change actually requires. The three-axis rubric exists to pull your attention back to scope, complexity, and uncertainty, which are properties of the work rather than of its description. A one-file change that quietly reshapes how a module is structured is high on complexity even though its scope reads as small, and the rubric catches that where a file-count heuristic would not.

A second surface feature is the language of routineness. Describing a task as "just a swap" or "only a rename" primes you to classify it as mechanical, but the words are the author's framing, not a fact about the task. The thirty-file library migration is described as a set of small swaps precisely so that an unwary reader treats it as direct execution; the discriminating detail, an unresolved decision about how to handle the old behaviour at a few call sites, is what actually sets the uncertainty axis high. Evaluate-level items reward the reader who weighs the buried detail over the breezy framing.

Why the distractors pair a right mode with a wrong reason

The most instructive thing about evaluate-level options is how they are wrong. Rarely is a distractor simply absurd; far more often it reaches a defensible mode through an indefensible justification. An option might correctly say plan mode but justify it with an absolute rule, such as "any multi-file change must be planned", which the rubric rejects because file count alone is a signal and not a law. Another might correctly say direct execution but justify it by ignoring a real architectural decision hiding in the task. Scoring well means checking both halves of each option: the mode and the reason have to hold together. An answer whose mode is right for a reason the rubric would not endorse is still wrong, and the exam is built to reward you for noticing the difference.

This is why classification is an evaluate-level skill rather than an apply-level one. You are not merely selecting a mode; you are judging the quality of a justification, which is a higher-order act. Running each option through the three axes, and asking whether the stated reason is the reason the rubric would give, turns a set of plausible-sounding choices into a clear ranking and leaves you with one option whose mode and reasoning both survive scrutiny.
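
The two-part check can be sketched as a small function that accepts an option only when both its mode and its stated reason match what the rubric would give. This is a deliberately reduced model: real justifications are prose, and the reason labels, the `option_holds` name, and the sample options below are hypothetical, chosen only to mirror the thirty-file migration discussed earlier.

```python
def option_holds(mode: str, reason: str, high_axes: frozenset[str]) -> bool:
    """Accept an option only if its mode AND its reason survive the rubric."""
    expected_mode = "plan" if high_axes else "direct"
    # The rubric's own reasons: the high axes for plan mode, or the fact
    # that every axis is low for direct execution.
    acceptable = high_axes if high_axes else frozenset({"all axes low"})
    return mode == expected_mode and reason in acceptable


# Thirty-file migration: high on scope and uncertainty.
migration = frozenset({"scope", "uncertainty"})

options = [
    ("plan", "any multi-file change must be planned"),  # right mode, wrong reason
    ("direct", "all axes low"),                         # wrong mode
    ("plan", "uncertainty"),                            # mode and reason both hold
]

for mode, reason in options:
    print(mode, "|", reason, "->", option_holds(mode, reason, migration))
```

Only the third option passes. The first fails even though its mode is correct, which is the distractor pattern described above.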

Two decisions the exam tries to merge: execution mode vs permission mode

A trap built into classification items is conflating two independent decisions. Execution mode is the plan-versus-direct choice this knowledge point is about: does Claude reason about the approach before editing, or proceed straight to the change? Permission mode is a different axis entirely; it governs how much autonomy Claude has while it executes, from prompting before every action to accepting edits unattended. Classifying a task on scope, complexity, and uncertainty answers the execution-mode question, but it does not set the permission level, and the two should be reasoned about separately.

The distinction matters because a wrong option can pair a correct execution-mode call with permission-mode language to sound authoritative, for example justifying plan mode as a way to make Claude "more careful" while it edits. Plan mode is read-only and produces no edits at all, so that framing confuses a workflow choice with an autonomy setting. Keeping the axes apart (scope and uncertainty drive plan versus direct, while sensitivity and trust drive the permission level) is what stops you from being pulled toward a plausible-sounding but wrong justification.

There is a related conflation worth naming: Claude Code's plan-versus-direct task strategy is not the same thing as the API's code execution tool. The code execution tool is an API capability you include in a request when you want Claude to run Bash commands and manipulate files in a sandbox; it is not a "plan mode" request field. The official docs separate Claude Code task strategy from API tool use, so an option that answers a Claude Code classification question by invoking an API parameter has crossed a boundary the exam expects you to hold.

How this is tested on the exam

Because this is the capstone of Task 3.4 and has no downstream dependents, expect it to appear as the hardest mode-selection item on a sitting: a scenario with two or more tasks, or one task with a misleading surface feature, where you must classify and justify. The reliable method is the three-axis rubric applied honestly: score scope, complexity, and uncertainty, and let any single high axis pull the task into plan mode. This concept builds on the hybrid plan-then-execute pattern and rhymes with other evaluate-level Domain 3 classification points such as convention application scenario analysis, all of which reward the same move: read past the surface to the underlying decision.

Check your understanding

An architect must classify a task: migrate a date-handling library across roughly thirty files. Each individual edit is a small, mechanical swap, but the team has not decided how to handle three call sites that rely on the old library's timezone behaviour. Which classification and justification is correct?
