Architecture · 10 min read · 1 July 2026

Agentic AI Reference Architecture: Patterns That Hold

A practical agentic AI reference architecture for Claude-based systems: orchestration patterns, tool design, session management, and CCA-F exam alignment in one guide.

By Solomon Udoh · AI Architect & Certification Lead

Building a production agent is not hard. Building one that stays correct across hundreds of turns, recovers from tool failures, and hands off cleanly to a human when it should stop is considerably harder. This post lays out a practical agentic AI reference architecture for Claude-based systems, grounded in the five domains of the Claude Certified Architect, Foundations exam (CCA-F) and in patterns we see rewarded on the exam itself.

We will cover orchestration topology, tool and MCP integration, session management, and the enforcement decision that trips up most candidates. Code snippets are minimal but concrete.

What does an agentic AI reference architecture actually contain?

An agentic architecture is the set of structural decisions that determine how a model perceives its environment, selects actions, executes them via tools, and decides when to stop or escalate. At minimum it specifies:

Orchestration topology (single agent, hub-and-spoke, pipeline, or swarm)
Tool surface (which tools exist, how they are described, how errors surface)
Loop control (how the agent decides to continue, branch, or terminate)
Session and context strategy (how state is preserved or discarded across turns)
Enforcement layer (what is enforced by prompt vs. by code)

The CCA-F exam weights Domain 1 (Agentic Architecture & Orchestration) at 27%, the single largest domain, which signals how central these structural choices are to professional Claude work.

Which orchestration topology should you choose?

The topology decision is the first fork in any architecture. Four patterns cover the vast majority of production use cases.

Topology	When to use	Key risk
Single agent with tools	Self-contained tasks, low latency budget	Context window fills; attention dilutes
Fixed sequential pipeline	Deterministic multi-step workflows	Brittle to upstream failures
Hub-and-spoke (coordinator + subagents)	Parallel or specialised subtasks	Coordinator becomes a bottleneck
Dynamic adaptive decomposition	Tasks whose shape is unknown at design time	Harder to test and audit

The hub-and-spoke architecture is the pattern the CCA-F exam returns to most often. A coordinator model receives the user goal, decomposes it, delegates to specialised subagents, and synthesises results. The coordinator never does the work itself; it routes, monitors, and decides.

A minimal coordinator loop in Python looks like this:

python

import anthropic

client = anthropic.Anthropic()

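def coordinator_turn(goal: str, tools: list, history: list) -> anthropic.types.Message:
    # One coordinator turn: send the goal plus prior history and return the
    # full Message so the caller can inspect stop_reason and content blocks.
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        system="You are a coordinator. Decompose the goal, delegate to tools, synthesise results.",
        tools=tools,
        messages=history + [{"role": "user", "content": goal}],
    )
    return response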

The coordinator inspects stop_reason on every response. If stop_reason == "tool_use", it executes the requested tool call, appends the result, and loops. If stop_reason == "end_turn", the task is complete. Misreading this field is one of the most common agentic loop anti-patterns we see in practice.
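
The loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `create_message` and `execute_tool` are hypothetical callables supplied by the caller (in production, `create_message` would wrap `client.messages.create`), and `max_turns` is an assumed safety cap that the text above does not mention.

```python
from typing import Any, Callable


def run_agent_loop(
    create_message: Callable[[list], Any],
    execute_tool: Callable[[str, dict], str],
    messages: list,
    max_turns: int = 10,  # assumed safety cap, not part of the API
) -> Any:
    """Drive the coordinator until stop_reason is "end_turn"."""
    for _ in range(max_turns):
        response = create_message(messages)
        # Keep the assistant turn in history so tool results can refer to it.
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            return response

        if response.stop_reason == "tool_use":
            results = []
            for block in response.content:
                if block.type == "tool_use":
                    output = execute_tool(block.name, block.input)
                    results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": output,
                    })
            # Tool results go back to the model as a user message.
            messages.append({"role": "user", "content": results})
            continue

        # Other stop reasons (for example max_tokens) need explicit handling.
        raise RuntimeError(f"Unhandled stop_reason: {response.stop_reason}")

    raise RuntimeError("Agent loop exceeded max_turns")
```

Raising on any unrecognised `stop_reason` is a deliberate choice in this sketch: it makes a misread field fail loudly instead of silently ending the task.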

For tasks that benefit from parallel execution, the coordinator can spawn subagents concurrently. Parallel subagent spawning cuts wall-clock time but requires the coordinator to merge results carefully, because attribution can be lost when outputs from independent agents are concatenated naively.
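
One way to avoid losing attribution is to keep every output labelled with the subagent that produced it, rather than concatenating raw text. The sketch below assumes a hypothetical `run_subagent(name, task)` callable and uses a thread pool purely for illustration; an async client would work equally well.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_subagents(
    subtasks: dict[str, str],
    run_subagent: Callable[[str, str], str],
) -> list[dict]:
    """Run each subtask concurrently and keep outputs labelled by source."""
    if not subtasks:
        return []
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        futures = {
            name: pool.submit(run_subagent, name, task)
            for name, task in subtasks.items()
        }
        # Collect in submission order; each record names the agent behind it.
        return [
            {"subagent": name, "task": subtasks[name], "output": future.result()}
            for name, future in futures.items()
        ]
```

The coordinator can then synthesise from the labelled records and cite which subagent each claim came from.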

How should tools be designed for agentic systems?

Tool design is Domain 2 of the CCA-F exam (18% weight) and the area where small decisions have outsized consequences. The model selects tools largely on the basis of their names and descriptions. A vague description produces misrouting; a precise one produces correct selection.

"Descriptions are the primary mechanism by which Claude decides which tool to call. Treat them as a selection contract, not documentation."

Anthropic, Claude Tool Use Documentation

Three principles govern good tool design for agentic systems:

One tool, one concern. If a tool does two things, split it. Tool splitting for specificity reduces ambiguity and makes descriptions easier to write accurately.
Errors must be structured. A tool that returns a plain string on failure gives the model nothing to act on. Use the MCP isError flag pattern and include a machine-readable error category alongside the human-readable message.
Distinguish access failure from valid empty result. A database query that returns zero rows is not an error. A query that times out is. Conflating the two causes the model to retry valid empty results indefinitely.

A well-structured error payload looks like this:

json

{
  "isError": true,
  "error": {
    "category": "access_failure",
    "code": "DB_TIMEOUT",
    "message": "Query exceeded 5 s limit on table orders",
    "retryable": true
  }
}

The retryable flag lets the coordinator decide whether to retry immediately, back off, or escalate to a human without re-reading the message text.
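
A sketch of how the coordinator might act on that payload follows. The helper names, the `rows` field on success, and the `max_attempts` limit are assumptions made for illustration; only the `isError` and `error` fields come from the payload above.

```python
def empty_result() -> dict:
    """A query that succeeded but matched nothing: not an error."""
    return {"isError": False, "rows": []}


def access_failure(code: str, message: str, retryable: bool) -> dict:
    """A query that could not be completed, in the structured error shape."""
    return {
        "isError": True,
        "error": {
            "category": "access_failure",
            "code": code,
            "message": message,
            "retryable": retryable,
        },
    }


def next_action(result: dict, attempt: int, max_attempts: int = 3) -> str:
    """Return "continue", "retry", or "escalate" for a tool result."""
    if not result.get("isError"):
        return "continue"  # includes valid empty results
    if result["error"].get("retryable") and attempt < max_attempts:
        return "retry"
    return "escalate"
```

Because the decision reads only `isError` and `retryable`, a valid empty result never triggers a retry.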

For systems with many tools, tool overload degrades selection accuracy. The fix is scoping: give each subagent only the tools it needs for its role, rather than exposing the full surface to every agent in the system.
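
Scoping can be as simple as an allow-list per role. In this sketch the role names and tool names are hypothetical, and tool definitions are assumed to be dicts with a `name` key, as in the Messages API `tools` parameter.

```python
# Hypothetical role-to-tool allow-list; unknown roles get no tools.
ROLE_TOOLS: dict[str, set[str]] = {
    "billing_agent": {"get_invoice", "issue_refund"},
    "search_agent": {"search_docs"},
}


def tools_for_role(role: str, all_tools: list[dict]) -> list[dict]:
    """Return only the tool definitions this role is allowed to see."""
    allowed = ROLE_TOOLS.get(role, set())
    return [tool for tool in all_tools if tool["name"] in allowed]
```

Defaulting to an empty set means a new or misnamed role sees nothing until someone grants it tools explicitly.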

How does MCP fit into the reference architecture?

The Model Context Protocol (MCP) is the standardised transport layer for connecting Claude to external tool servers. In the reference architecture, MCP sits between the orchestrator and the external world. Each MCP server exposes a set of tools; the orchestrator or subagent calls them via the protocol without needing to know the implementation details.

The MCP scoping hierarchy determines which tools are visible at which level. Global MCP servers are available to all agents; project-level servers are scoped to a specific workflow; user-level servers are personal. Getting this hierarchy wrong is a common source of privilege escalation bugs in multi-agent systems.

Environment variable expansion in MCP configuration keeps secrets out of version control:

json

{
  "mcpServers": {
    "database": {
      "command": "npx",
      "args": ["-y", "@company/db-mcp-server"],
      "env": {
        "DB_CONNECTION_STRING": "${DB_CONNECTION_STRING}"
      }
    }
  }
}

The ${VAR} reference is expanded at runtime from the shell environment, so the secret value itself never appears in the config file.
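
To make the expansion behaviour concrete, here is a small Python illustration of substituting `${VAR}` references from an environment mapping. It is not how any particular MCP client implements expansion; the `expand_env` helper and its fail-fast behaviour on unset variables are assumptions for the sake of the example.

```python
import os
import re
from typing import Mapping, Optional

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} in value with env[NAME]; fail if NAME is unset."""
    source = os.environ if env is None else env

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in source:
            raise KeyError(f"Environment variable {name} is not set")
        return source[name]

    return _VAR.sub(replace, value)
```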

How should session and context be managed across long tasks?

Context management is Domain 5 of the CCA-F exam (15% weight) and the domain most likely to cause silent degradation rather than loud failure. As a session grows, two problems compound: the attention dilution problem (the model pays less attention to early content) and the stale context problem (facts from early turns may be superseded by later tool results).

There are three session management options, each suited to a different situation:

Option	Use when	Risk
Resume same session	Task is continuous and context is still fresh	Stale facts accumulate
Fork session	Exploring divergent paths from a known-good state	Fork overhead; merge complexity
Fresh session with summary injection	Long tasks that cross a natural checkpoint	Summary quality determines continuity

Summary injection for fresh sessions is the most reliable approach for tasks that span more than a few dozen turns. The coordinator generates a structured summary of completed work, confirmed facts, and open questions, then injects it as the system prompt of the new session. This keeps the context window clean while preserving continuity.

A summary block might look like this:

text

## Task state as of checkpoint 3
- Goal: Migrate customer records from legacy DB to new schema
- Completed: Tables users, orders (12,400 rows each)
- Pending: Table payments (88,000 rows)
- Known issues: 3 rows in orders have null customer_id; flagged for human review
- Next action: Begin payments migration, skip null-id rows
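
A summary in this shape can be rendered from structured state and used to seed the new session. The sketch below is one possible approach under stated assumptions: `render_summary` and `fresh_session_request` are hypothetical helpers, and the returned dict is meant to be passed to `client.messages.create` alongside `model` and `max_tokens`.

```python
def render_summary(checkpoint: int, state: dict[str, str]) -> str:
    """Render task state as a heading followed by one bullet per field."""
    lines = [f"## Task state as of checkpoint {checkpoint}"]
    lines += [f"- {label}: {value}" for label, value in state.items()]
    return "\n".join(lines)


def fresh_session_request(summary: str, first_message: str) -> dict:
    """Build keyword arguments for a new session seeded with the summary."""
    return {
        "system": summary,  # summary becomes the new session's system prompt
        "messages": [{"role": "user", "content": first_message}],
    }
```

The new session starts with a single user message, so none of the old turn history is carried over.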

The subagent context isolation pattern complements this: each subagent receives only the context slice relevant to its subtask, not the full coordinator history. This keeps subagent context windows small and focused.
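
As a rough sketch of that isolation, the coordinator can build each subagent's opening message from a selected subset of confirmed facts instead of forwarding its own history. The `subagent_messages` helper and the fact keys are hypothetical.

```python
def subagent_messages(
    subtask: str,
    facts: dict[str, str],
    relevant_keys: list[str],
) -> list[dict]:
    """Build a fresh message list containing only the facts this subtask needs."""
    context = "\n".join(
        f"- {key}: {facts[key]}" for key in relevant_keys if key in facts
    )
    return [
        {"role": "user", "content": f"Context:\n{context}\n\nTask: {subtask}"}
    ]
```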

When should enforcement be in the prompt vs. in code?

This is the question the CCA-F exam tests most directly in Domain 1, and the answer is deterministic: when stakes are high, use programmatic enforcement. Prompt-based constraints are probabilistic; they can be overridden by sufficiently unusual inputs or long context. Code-based constraints are not.

"For high-stakes decisions, prefer deterministic solutions. Probabilistic enforcement is appropriate only when the cost of a constraint violation is low and recoverable."

Anthropic, Claude Model Specification

The high-stakes enforcement decision rule maps cleanly to a decision table:

Constraint type	Stakes	Enforcement method
Output format (JSON schema)	Low to medium	Prompt + schema validation
Rate limiting per user	Medium	Code (middleware)
PII redaction	High	Code (regex or classifier before output)
Irreversible actions (delete, send, pay)	High	Code (human approval gate)
Regulatory compliance	High	Code (audit log + block)

Tool call interception hooks are the standard mechanism for programmatic enforcement in Claude-based systems. A PreToolUse hook can inspect every tool call before execution and block or modify it. A PostToolUse hook can normalise or redact the result before it enters the model's context. This is the correct place to enforce PII handling, not the system prompt.

python

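def pre_tool_use_hook(tool_name: str, tool_input: dict) -> dict:
    """Return the (possibly modified) input to allow the call; raise to block it."""
    # Assumption: the approval flag is set by trusted application code after a
    # human signs off, and is stripped from model-supplied input beforehand.
    if tool_name == "send_email" and not tool_input.get("approved_by_human"):
        raise PermissionError("send_email requires human approval flag")
    return tool_input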

The hooks vs. prompts decision framework gives a structured way to make this call for any constraint in your system.
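
The PostToolUse side can be sketched in the same style. The function signature below is hypothetical and mirrors the pre-hook example, not any specific hook configuration format; the single email regex is a deliberately simple stand-in for whatever regex set or classifier a real system would use.

```python
import re

# Simple email pattern for illustration; real PII detection needs more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def post_tool_use_hook(tool_name: str, tool_output: str) -> str:
    """Redact email addresses before the result enters the model's context."""
    return EMAIL.sub("[REDACTED_EMAIL]", tool_output)
```

Because redaction runs in code on every tool result, it does not depend on the model following a system prompt instruction.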

How does the reference architecture map to the CCA-F exam domains?

The five exam domains are not independent; they map directly onto the layers of the reference architecture. Understanding this mapping helps candidates study efficiently and helps architects communicate design decisions to stakeholders.

CCA-F Domain	Weight	Architecture layer
Domain 1: Agentic Architecture & Orchestration	27%	Topology, loop control, enforcement
Domain 2: Tool Design & MCP Integration	18%	Tool surface, error handling, MCP scoping
Domain 3: Claude Code Configuration & Workflows	20%	Configuration hierarchy, CI/CD integration
Domain 4: Prompt Engineering & Structured Output	20%	Goal-based prompts, schema design, few-shot
Domain 5: Context Management & Reliability	15%	Session strategy, summarisation, isolation

Candidates who treat these as separate topics tend to struggle with scenario questions that span two or three domains simultaneously. The exam consistently presents a broken system and asks for the root cause and the proportionate fix. The answer almost always traces to one of the structural decisions in this reference architecture.

Our concept library at /concepts covers 174 atomic concepts mapped to all 30 task statements across the five domains, which is useful for identifying exactly which layer a given exam scenario is testing.

What does a complete reference architecture look like end to end?

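Pulling the layers together, a production-grade Claude agent system has a coordinator at the centre, specialised subagents around it, MCP tool servers reached through interception hooks, and a human review path for escalations.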

Each subagent is context-isolated, receives only its relevant tool subset, and returns structured results. The coordinator synthesises, checks for attribution loss, and either delivers the final output or routes to human review if an escalation trigger fires.

The three valid escalation triggers are: a tool returns an unretryable error, the task requires an action the coordinator is not authorised to take, or the coordinator detects a contradiction it cannot resolve from available evidence. Everything else should be handled autonomously.
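
These triggers are simple enough to encode directly. In the sketch below, `TurnState` and its field names are assumptions made for illustration, and `tool_error` is assumed to carry a `retryable` flag like the structured error payload shown earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnState:
    tool_error: Optional[dict] = None       # structured error, if a tool failed
    action_authorised: bool = True          # may the coordinator take the action?
    unresolved_contradiction: bool = False  # evidence conflict it cannot settle


def escalation_reason(state: TurnState) -> Optional[str]:
    """Return the trigger that fired, or None to continue autonomously."""
    if state.tool_error is not None and not state.tool_error.get("retryable", False):
        return "unretryable_tool_error"
    if not state.action_authorised:
        return "unauthorised_action"
    if state.unresolved_contradiction:
        return "unresolved_contradiction"
    return None
```

Returning the reason, not just a boolean, gives the human reviewer the trigger without having to read the conversation.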

For teams preparing for the CCA-F exam, the structured handoff to human agents concept covers exactly how to design the escalation payload so the human reviewer has everything they need without reading the full conversation history.

Where should you start when implementing this architecture?

Start with the enforcement layer, not the orchestration layer. Most teams build the happy path first and bolt on constraints later. The CCA-F exam, and production incidents, consistently show that the failure modes live in the enforcement and error-handling layers. Define your hook points, your escalation triggers, and your tool error taxonomy before you write the first coordinator prompt.

From there, work outward: design your tool surface, write precise descriptions, scope tools to roles, then build the coordinator logic on top of a stable foundation. Session management is the last layer to tune, because the right strategy depends on the actual context growth rate of your specific task, which you can only measure empirically.

The coordinator responsibilities concept is a good next read if you want to go deeper on what the coordinator should and should not own in a hub-and-spoke system.

Frequently asked questions

What is an agentic AI reference architecture?

An agentic AI reference architecture is a set of structural decisions that define how an AI model perceives its environment, selects and executes actions via tools, manages state across turns, and decides when to stop or escalate. For Claude-based systems it typically specifies an orchestration topology, a tool surface, loop control logic, a session strategy, and an enforcement layer.

What orchestration topology does Anthropic recommend for multi-agent Claude systems?

Anthropic's documentation and the CCA-F exam both emphasise the hub-and-spoke pattern as the primary multi-agent topology: a coordinator model decomposes goals and delegates to specialised subagents, then synthesises results. Fixed sequential pipelines suit deterministic workflows; dynamic adaptive decomposition suits tasks whose shape is unknown at design time.

How do I decide whether to use prompt-based or code-based enforcement in my agent?

Use programmatic (code-based) enforcement whenever the cost of a constraint violation is high or irreversible: PII redaction, irreversible actions such as payments or deletions, and regulatory compliance all require code. Prompt-based constraints are probabilistic and appropriate only for low-stakes formatting or style preferences where an occasional violation is recoverable.

How does MCP fit into a Claude agentic architecture?

The Model Context Protocol is the standardised transport layer between Claude agents and external tool servers. In the reference architecture, MCP sits between the orchestrator and the external world. Tools are scoped at global, project, or user level via the MCP scoping hierarchy, and secrets are kept out of config files using environment variable expansion.

How does the CCA-F exam test agentic architecture knowledge?

The CCA-F exam allocates 27% of its weight to Domain 1 (Agentic Architecture & Orchestration), the largest single domain. Scenario questions typically present a broken or suboptimal system and ask for the root cause and proportionate fix. The exam consistently rewards deterministic solutions over probabilistic ones when stakes are high.

When should a Claude agent escalate to a human rather than continue autonomously?

Three valid escalation triggers exist: a tool returns an unretryable error, the task requires an action the agent is not authorised to take, or the agent detects a contradiction it cannot resolve from available evidence. Neither uncertainty alone nor a low confidence score is a sufficient trigger; the agent should attempt resolution first.