Agentic AI Reference Architecture: Patterns That Hold
A practical agentic AI reference architecture for Claude-based systems: orchestration patterns, tool design, session management, and CCA-F exam alignment in one guide.
By Solomon Udoh · AI Architect & Certification Lead

Building a production agent is not hard. Building one that stays correct across hundreds of turns, recovers from tool failures, and hands off cleanly to a human when it should stop is considerably harder. This post lays out a practical agentic AI reference architecture for Claude-based systems, grounded in the five domains of the Claude Certified Architect, Foundations exam (CCA-F) and in patterns we see rewarded on the exam itself.
We will cover orchestration topology, tool and MCP integration, session management, and the enforcement decision that trips up most candidates. Code snippets are minimal but concrete.
What does an agentic AI reference architecture actually contain?
An agentic architecture is the set of structural decisions that determine how a model perceives its environment, selects actions, executes them via tools, and decides when to stop or escalate. At minimum it specifies:
- Orchestration topology (single agent, hub-and-spoke, pipeline, or swarm)
- Tool surface (which tools exist, how they are described, how errors surface)
- Loop control (how the agent decides to continue, branch, or terminate)
- Session and context strategy (how state is preserved or discarded across turns)
- Enforcement layer (what is enforced by prompt vs. by code)
The CCA-F exam weights Domain 1 (Agentic Architecture & Orchestration) at 27%, the single largest domain, which signals how central these structural choices are to professional Claude work.
Which orchestration topology should you choose?
The topology decision is the first fork in any architecture. Four patterns cover the vast majority of production use cases.
| Topology | When to use | Key risk |
|---|---|---|
| Single agent with tools | Self-contained tasks, low latency budget | Context window fills; attention dilutes |
| Fixed sequential pipeline | Deterministic multi-step workflows | Brittle to upstream failures |
| Hub-and-spoke (coordinator + subagents) | Parallel or specialised subtasks | Coordinator becomes a bottleneck |
| Dynamic adaptive decomposition | Tasks whose shape is unknown at design time | Harder to test and audit |
The hub-and-spoke architecture is the pattern the CCA-F exam returns to most often. A coordinator model receives the user goal, decomposes it, delegates to specialised subagents, and synthesises results. The coordinator never does the work itself; it routes, monitors, and decides.
A minimal coordinator loop in Python looks like this:
```python
import anthropic

client = anthropic.Anthropic()


def coordinator_turn(goal: str, tools: list, history: list) -> anthropic.types.Message:
    """Send one coordinator turn and return the raw API response."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        system="You are a coordinator. Decompose the goal, delegate to tools, synthesise results.",
        tools=tools,
        messages=history + [{"role": "user", "content": goal}],
    )
    return response
```
The coordinator inspects stop_reason on every response. If stop_reason == "tool_use", it executes the requested tool call, appends the result, and loops. If stop_reason == "end_turn", the model has finished its turn and the task is complete. Any other value, such as "max_tokens", means the response was cut short and must not be treated as completion. Misreading this field is one of the most common agentic loop anti-patterns we see in practice.
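The loop described here can be sketched without a network call by injecting the model as a function. This is a minimal sketch under stated assumptions: run_agent_loop, call_model, execute_tool, and max_turns are illustrative names, and responses are represented as plain dicts shaped like Messages API responses rather than the SDK's own objects.

```python
from typing import Callable


def run_agent_loop(
    call_model: Callable[[list], dict],
    execute_tool: Callable[[str, dict], str],
    messages: list,
    max_turns: int = 20,
) -> dict:
    """Loop until the model stops asking for tools or the turn budget runs out."""
    for _ in range(max_turns):
        response = call_model(messages)
        # Keep the assistant turn in history so tool results have a turn to answer.
        messages.append({"role": "assistant", "content": response["content"]})

        if response["stop_reason"] != "tool_use":
            # "end_turn" means done. Any other value is handed back to the
            # caller to inspect instead of being treated as success here.
            return response

        # Answer every tool_use block in a single user turn.
        results = []
        for block in response["content"]:
            if block["type"] == "tool_use":
                output = execute_tool(block["name"], block["input"])
                results.append(
                    {"type": "tool_result", "tool_use_id": block["id"], "content": output}
                )
        messages.append({"role": "user", "content": results})

    raise RuntimeError(f"No final answer after {max_turns} turns")
```

The hard turn limit is a design choice assumed for this sketch: it gives the loop a deterministic stopping condition even if the model never returns "end_turn".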
For tasks that benefit from parallel execution, the coordinator can spawn subagents concurrently. Parallel subagent spawning cuts wall-clock time but requires the coordinator to merge results carefully, because attribution can be lost when outputs from independent agents are concatenated naively.
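One way to keep attribution intact is to tag each result with the subagent that produced it before anything is merged. The sketch below assumes a hypothetical async run_subagent(name, task) function supplied by the caller; run_subagents_in_parallel and the result dict shape are illustrative, not part of any SDK.

```python
import asyncio
from typing import Awaitable, Callable


async def run_subagents_in_parallel(
    subtasks: dict[str, str],
    run_subagent: Callable[[str, str], Awaitable[str]],
) -> list[dict]:
    """Run one subagent per subtask and keep each output tagged with its source."""
    names = list(subtasks)
    outputs = await asyncio.gather(
        *(run_subagent(name, subtasks[name]) for name in names),
        return_exceptions=True,  # one failing subagent must not discard the others
    )
    merged = []
    # gather() returns results in the order the awaitables were passed in,
    # so zipping with names preserves attribution.
    for name, output in zip(names, outputs):
        if isinstance(output, Exception):
            merged.append({"agent": name, "ok": False, "error": str(output)})
        else:
            merged.append({"agent": name, "ok": True, "output": output})
    return merged
```

The coordinator can then synthesise from a list of labelled records rather than from one concatenated string.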
How should tools be designed for agentic systems?
Tool design is Domain 2 of the CCA-F exam (18% weight) and the area where small decisions have outsized consequences. The model selects tools primarily from their names and descriptions. A vague description produces misrouting; a precise one produces correct selection.
"Descriptions are the primary mechanism by which Claude decides which tool to call. Treat them as a selection contract, not documentation."
Three principles govern good tool design for agentic systems:
- One tool, one concern. If a tool does two things, split it. Tool splitting for specificity reduces ambiguity and makes descriptions easier to write accurately.
- Errors must be structured. A tool that returns a plain string on failure gives the model nothing to act on. Use the MCP isError flag pattern and include a machine-readable error category alongside the human-readable message.
- Distinguish access failure from valid empty result. A database query that returns zero rows is not an error. A query that times out is. Conflating the two causes the model to retry valid empty results indefinitely.
A well-structured error payload looks like this:
```json
{
  "isError": true,
  "error": {
    "category": "access_failure",
    "code": "DB_TIMEOUT",
    "message": "Query exceeded 5 s limit on table orders",
    "retryable": true
  }
}
```
The retryable flag lets the coordinator decide whether to retry immediately, back off, or escalate to a human without re-reading the message text.
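A coordinator could act on that payload roughly as follows. This is a sketch, not a prescribed policy: next_step, backoff_seconds, the three-attempt limit, and the backoff constants are all assumptions chosen for illustration.

```python
def next_step(result: dict, attempt: int, max_attempts: int = 3) -> str:
    """Map a structured tool result to 'continue', 'retry', or 'escalate'."""
    if not result.get("isError"):
        # Zero rows is a valid answer, not a failure: never retry it.
        return "continue"
    error = result.get("error", {})
    if error.get("retryable") and attempt < max_attempts:
        return "retry"
    return "escalate"


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff delay for the given 1-based attempt number."""
    return min(cap, base * 2 ** (attempt - 1))
```

Because the decision reads only isError and retryable, it does not depend on parsing the human-readable message.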
For systems with many tools, tool overload degrades selection accuracy. The fix is scoping: give each subagent only the tools it needs for its role, rather than exposing the full surface to every agent in the system.
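Scoping can be as simple as an allowlist per role. In the sketch below the tool names, the roles, and the tools_for_role helper are hypothetical, and the tool definitions omit input_schema for brevity.

```python
# Hypothetical tool definitions; only the names matter for this sketch.
ALL_TOOLS = [
    {"name": "search_orders", "description": "Look up orders by customer ID."},
    {"name": "issue_refund", "description": "Refund a single order by order ID."},
    {"name": "send_email", "description": "Send an email to a customer."},
]

# Hypothetical role-to-tool allowlist.
ROLE_TOOLS = {
    "researcher": {"search_orders"},
    "billing": {"search_orders", "issue_refund"},
}


def tools_for_role(role: str) -> list[dict]:
    """Return only the tool definitions the given role is allowed to see."""
    allowed = ROLE_TOOLS.get(role, set())  # unknown roles get no tools
    return [tool for tool in ALL_TOOLS if tool["name"] in allowed]
```

Defaulting unknown roles to an empty tool list keeps a misconfigured subagent from silently inheriting the full surface.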
How does MCP fit into the reference architecture?
The Model Context Protocol (MCP) is the standardised transport layer for connecting Claude to external tool servers. In the reference architecture, MCP sits between the orchestrator and the external world. Each MCP server exposes a set of tools; the orchestrator or subagent calls them via the protocol without needing to know the implementation details.
The MCP scoping hierarchy determines which tools are visible at which level. Global MCP servers are available to all agents; project-level servers are scoped to a specific workflow; user-level servers are personal. Getting this hierarchy wrong is a common source of privilege escalation bugs in multi-agent systems.
Environment variable expansion in MCP configuration keeps secrets out of version control:
```json
{
  "mcpServers": {
    "database": {
      "command": "npx",
      "args": ["-y", "@company/db-mcp-server"],
      "env": {
        "DB_CONNECTION_STRING": "${DB_CONNECTION_STRING}"
      }
    }
  }
}
```
The ${VAR} syntax is expanded at runtime from the shell environment, not stored in the config file itself.
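To make the expansion step concrete, here is a minimal sketch of how ${VAR} placeholders could be resolved. It is not the code any MCP client actually runs: expand_env is a hypothetical helper, and raising on an unset variable is a design choice assumed here so that a missing secret fails at startup.

```python
import os
import re
from typing import Mapping, Optional

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${VAR} with its value, failing loudly if VAR is unset."""
    env = os.environ if environ is None else environ

    def _lookup(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"Environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_lookup, value)
```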
How should session and context be managed across long tasks?
Context management is Domain 5 of the CCA-F exam (15% weight) and the domain most likely to cause silent degradation rather than loud failure. As a session grows, two problems compound: the attention dilution problem (the model pays less attention to early content) and the stale context problem (facts from early turns may be superseded by later tool results).
The three session management options and their appropriate use cases are:
| Option | Use when | Risk |
|---|---|---|
| Resume same session | Task is continuous and context is still fresh | Stale facts accumulate |
| Fork session | Exploring divergent paths from a known-good state | Fork overhead; merge complexity |
| Fresh session with summary injection | Long tasks that cross a natural checkpoint | Summary quality determines continuity |
Summary injection for fresh sessions is the most reliable approach for tasks that span more than a few dozen turns. The coordinator generates a structured summary of completed work, confirmed facts, and open questions, then injects it as the system prompt of the new session. This keeps the context window clean while preserving continuity.
A summary block might look like this:
```markdown
## Task state as of checkpoint 3
- Goal: Migrate customer records from legacy DB to new schema
- Completed: Tables users, orders (12,400 rows each)
- Pending: Table payments (88,000 rows)
- Known issues: 3 rows in orders have null customer_id; flagged for human review
- Next action: Begin payments migration, skip null-id rows
```
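Generating that block from structured state, rather than free text, makes the summary reproducible. The following is a sketch under stated assumptions: Checkpoint, render_summary, and fresh_session_request are hypothetical names, and the returned dict is meant to be passed as keyword arguments alongside model and max_tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Structured task state captured at a natural checkpoint."""
    number: int
    goal: str
    completed: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)
    known_issues: list[str] = field(default_factory=list)
    next_action: str = ""


def render_summary(cp: Checkpoint) -> str:
    """Render the checkpoint as the summary block injected into a new session."""
    lines = [
        f"## Task state as of checkpoint {cp.number}",
        f"- Goal: {cp.goal}",
        f"- Completed: {'; '.join(cp.completed) or 'none'}",
        f"- Pending: {'; '.join(cp.pending) or 'none'}",
        f"- Known issues: {'; '.join(cp.known_issues) or 'none'}",
        f"- Next action: {cp.next_action or 'none'}",
    ]
    return "\n".join(lines)


def fresh_session_request(base_system: str, cp: Checkpoint, first_message: str) -> dict:
    """Build arguments for a new session: summary in the system prompt, no old history."""
    return {
        "system": f"{base_system}\n\n{render_summary(cp)}",
        "messages": [{"role": "user", "content": first_message}],
    }
```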
The subagent context isolation pattern complements this: each subagent receives only the context slice relevant to its subtask, not the full coordinator history. This keeps subagent context windows small and focused.
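Context isolation can be sketched as building each subagent's first message from a selected slice of the coordinator's facts. The helper name build_subagent_brief and the flat facts dictionary are assumptions made for illustration.

```python
def build_subagent_brief(
    subtask: str,
    facts: dict[str, str],
    relevant_keys: list[str],
) -> list[dict]:
    """Build a fresh message list containing only the facts this subtask needs."""
    selected = {key: facts[key] for key in relevant_keys if key in facts}
    context = "\n".join(f"- {key}: {value}" for key, value in selected.items())
    prompt = f"Subtask: {subtask}\n\nRelevant context:\n{context}"
    # The coordinator's own history is deliberately not passed along.
    return [{"role": "user", "content": prompt}]
```

The subagent starts from a single message, so its context window grows only with its own work.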
When should enforcement be in the prompt vs. in code?
This is the question the CCA-F exam tests most directly in Domain 1, and the answer is deterministic: when stakes are high, use programmatic enforcement. Prompt-based constraints are probabilistic; they can be overridden by sufficiently unusual inputs or long context. Code-based constraints are not.
"For high-stakes decisions, prefer deterministic solutions. Probabilistic enforcement is appropriate only when the cost of a constraint violation is low and recoverable."
The high-stakes enforcement decision rule maps cleanly to a decision table:
| Constraint type | Stakes | Enforcement method |
|---|---|---|
| Output format (JSON schema) | Low to medium | Prompt + schema validation |
| Rate limiting per user | Medium | Code (middleware) |
| PII redaction | High | Code (regex or classifier before output) |
| Irreversible actions (delete, send, pay) | High | Code (human approval gate) |
| Regulatory compliance | High | Code (audit log + block) |
Tool call interception hooks are the standard mechanism for programmatic enforcement in Claude-based systems. A PreToolUse hook can inspect every tool call before execution and block or modify it. A PostToolUse hook can normalise or redact the result before it enters the model's context. This is the correct place to enforce PII handling, not the system prompt.
```python
def pre_tool_use_hook(tool_name: str, tool_input: dict) -> dict:
    """Return the (possibly modified) input to allow the call. Raise to block."""
    # Assumes approved_by_human is set by trusted application code, not by the model.
    if tool_name == "send_email" and not tool_input.get("approved_by_human"):
        raise PermissionError("send_email requires human approval flag")
    return tool_input
```
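A PostToolUse counterpart for PII redaction can be sketched in the same style. The function signature mirrors the hook above and is hypothetical, and the two regular expressions are deliberately simple illustrations; a production system would more likely rely on a vetted classifier or library.

```python
import re

# Illustrative patterns only: an email address and a 13 to 16 digit card number.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def post_tool_use_hook(tool_name: str, tool_output: str) -> str:
    """Redact obvious PII from a tool result before it enters the model's context."""
    # tool_name is unused here; it is kept so per-tool policies can be added later.
    redacted = EMAIL.sub("[REDACTED_EMAIL]", tool_output)
    redacted = CARD.sub("[REDACTED_CARD]", redacted)
    return redacted
```

Because the redaction runs in code on every tool result, it holds regardless of what the system prompt says or how long the session has run.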
The hooks vs. prompts decision framework gives a structured way to make this call for any constraint in your system.
How does the reference architecture map to the CCA-F exam domains?
The five exam domains are not independent; they map directly onto the layers of the reference architecture. Understanding this mapping helps candidates study efficiently and helps architects communicate design decisions to stakeholders.
| CCA-F Domain | Weight | Architecture layer |
|---|---|---|
| Domain 1: Agentic Architecture & Orchestration | 27% | Topology, loop control, enforcement |
| Domain 2: Tool Design & MCP Integration | 18% | Tool surface, error handling, MCP scoping |
| Domain 3: Claude Code Configuration & Workflows | 20% | Configuration hierarchy, CI/CD integration |
| Domain 4: Prompt Engineering & Structured Output | 20% | Goal-based prompts, schema design, few-shot |
| Domain 5: Context Management & Reliability | 15% | Session strategy, summarisation, isolation |
Candidates who treat these as separate topics tend to struggle with scenario questions that span two or three domains simultaneously. The exam consistently presents a broken system and asks for the root cause and the proportionate fix. The answer almost always traces to one of the structural decisions in this reference architecture.
Our concept library at /concepts covers 174 atomic concepts mapped to all 30 task statements across the five domains, which is useful for identifying exactly which layer a given exam scenario is testing.
What does a complete reference architecture look like end to end?
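Pulling the layers together, a production-grade Claude agent system has the following structure: a coordinator decomposes the goal and delegates to subagents, every tool call passes through PreToolUse and PostToolUse hooks, and external tools are reached through scoped MCP servers.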
Each subagent is context-isolated, receives only its relevant tool subset, and returns structured results. The coordinator synthesises, checks for attribution loss, and either delivers the final output or routes to human review if an escalation trigger fires.
The three valid escalation triggers are: a tool returns an unretryable error, the task requires an action the coordinator is not authorised to take, or the coordinator detects a contradiction it cannot resolve from available evidence. Everything else should be handled autonomously.
For teams preparing for the CCA-F exam, the structured handoff to human agents concept covers exactly how to design the escalation payload so the human reviewer has everything they need without reading the full conversation history.
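The three triggers and a handoff record can be expressed directly in code. This is a sketch, assuming tool errors use the structured payload with a retryable flag; EscalationTrigger, check_escalation, and the Handoff field names are hypothetical and are not taken from the exam guide.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class EscalationTrigger(Enum):
    UNRETRYABLE_ERROR = "unretryable_error"
    UNAUTHORISED_ACTION = "unauthorised_action"
    UNRESOLVED_CONTRADICTION = "unresolved_contradiction"


def check_escalation(
    tool_error: Optional[dict],
    action_authorised: bool,
    contradiction_found: bool,
) -> Optional[EscalationTrigger]:
    """Return the trigger that fired, or None to continue autonomously."""
    if tool_error is not None and not tool_error.get("retryable", False):
        return EscalationTrigger.UNRETRYABLE_ERROR
    if not action_authorised:
        return EscalationTrigger.UNAUTHORISED_ACTION
    if contradiction_found:
        return EscalationTrigger.UNRESOLVED_CONTRADICTION
    return None


@dataclass
class Handoff:
    """Hypothetical payload for the human reviewer; field names are illustrative."""
    trigger: EscalationTrigger
    goal: str
    blocking_issue: str
    completed_steps: list[str] = field(default_factory=list)
    recommended_action: str = ""
```

Returning None for every other situation encodes the rule that anything outside the three triggers is handled autonomously.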
Where should you start when implementing this architecture?
Start with the enforcement layer, not the orchestration layer. Most teams build the happy path first and bolt on constraints later. The CCA-F exam, and production incidents, consistently show that the failure modes live in the enforcement and error-handling layers. Define your hook points, your escalation triggers, and your tool error taxonomy before you write the first coordinator prompt.
From there, work outward: design your tool surface, write precise descriptions, scope tools to roles, then build the coordinator logic on top of a stable foundation. Session management is the last layer to tune, because the right strategy depends on the actual context growth rate of your specific task, which you can only measure empirically.
The coordinator responsibilities concept is a good next read if you want to go deeper on what the coordinator should and should not own in a hub-and-spoke system.
About the author
AI Architect & Certification Lead
Solomon Udoh is an AI Architect who designs and ships production agent systems on the Claude API and Claude Code. He built AI Skill Certs' adaptive engine and authored its 174-concept knowledge graph, mapping every Claude Certified Architect - Foundations objective to hands-on, exam-aligned practice.
- Designs production multi-agent systems on the Claude API and Agent SDK
- Author of the AI Skill Certs knowledge graph (174 mapped exam concepts)
- Builds with MCP, Claude Code, structured outputs, and agentic loops daily
- Reviews every concept page against the official Anthropic exam guide