Concept deep dive · 10 min read · 26 June 2026

Claude Memory Tool: Context Engineering for Long-Running Agents

Master the Claude memory tool and context engineering strategies: compaction, sub-agent isolation, external retrieval, and semantic selection for reliable Claude agents.

By Solomon Udoh · AI Architect & Certification Lead

Memory is the silent bottleneck in every production Claude agent. The Claude memory tool and the broader family of context engineering techniques exist to answer one question: how do you keep a long-running agent coherent when its context window is finite and every token costs something? This post maps the full design space, from in-window compaction to external retrieval, and ties each strategy to the CCA-F exam domains where it is tested.

What is the Claude memory tool and why does it matter?

The Claude memory tool, as the term is used in this post, is not a single API endpoint. It is a design pattern: any mechanism that allows a Claude agent to store, retrieve, and selectively load information across turns or across sessions. In a Claude agent's architecture, "memory" spans four distinct layers, each with different latency, cost, and reliability characteristics.

Memory layer	Where state lives	Typical use case
In-context (window)	Active prompt tokens	Short tasks, single-session reasoning
External key-value store	Database or vector store	Facts that outlive a session
Tool-retrieved (RAG)	Retrieved at query time	Large corpora, up-to-date knowledge
Summarised injection	Compressed prior context	Resuming long sessions cheaply

Domain 5 of the CCA-F exam, Context Management and Reliability, carries 15% of the total weight and tests exactly this taxonomy. Domain 1, Agentic Architecture and Orchestration, at 27%, tests how memory strategies interact with multi-agent designs. Together they account for 42% of the exam, so a weak mental model here is expensive.

How does in-context compaction prevent context pollution?

In-context compaction is the first line of defence. It keeps the context window from filling with low-signal tokens that dilute attention and degrade output quality. The two most practical compaction patterns are Head+Tail splitting and tool result clearing.

Head+Tail splitting preserves the system prompt and the most recent turns (the "tail") while compressing or discarding the middle. This directly counters the attention dilution problem, where tokens buried in the middle of a long context receive systematically lower attention weights than tokens at the edges.
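A minimal sketch of Head+Tail splitting over a plain message list may make the mechanics concrete. The function name `head_tail_compact`, the `head` and `tail` counts, and the `summarise` callback are all assumptions made for illustration; in a real agent `summarise` would be a model call and the system prompt may live outside the message list.

```python
from typing import Callable

Message = dict  # e.g. {"role": "user", "content": "..."}


def head_tail_compact(
    messages: list,
    head: int,
    tail: int,
    summarise: Callable[[list], str],
) -> list:
    """Keep the first `head` and last `tail` messages; replace the middle with one summary."""
    if len(messages) <= head + tail:
        # Nothing sits between head and tail, so there is nothing to compress
        return list(messages)
    split = len(messages) - tail
    middle = messages[head:split]
    summary = {
        "role": "user",
        "content": "[Summary of earlier turns] " + summarise(middle),
    }
    return messages[:head] + [summary] + messages[split:]
```

The summary is injected as a single message so the tail keeps its original wording, which is where the most recent instructions usually live.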

Tool result clearing removes verbose tool outputs once their key facts have been extracted and re-injected as a compact summary. A raw database response might consume 4,000 tokens; a structured summary of the same data might consume 200. The trade-off is one additional LLM call to produce the summary, which adds latency and cost. Whether that trade-off is favourable depends on how many subsequent turns will use the cleared context.

python

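# Sketch: tool result clearing after extraction (uses the `anthropic` SDK)
import anthropic

def compact_tool_result(raw_result: str, extractor_prompt: str, client: anthropic.Anthropic) -> str:
    """Summarise a verbose tool result before appending to context."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[
            {"role": "user", "content": f"{extractor_prompt}\n\n{raw_result}"}
        ],
    )
    # The summary is the text of the first content block in the response
    return response.content[0].text

def clear_last_tool_results(messages: list[dict], client: anthropic.Anthropic) -> None:
    """Replace raw tool results in the last message with compact versions, in place.

    Assumes messages[-1]["content"] is a list of content blocks.
    """
    for block in messages[-1]["content"]:
        if isinstance(block, dict) and block.get("type") == "tool_result":
            # Keep the block (and its tool_use_id); swap only the verbose payload
            block["content"] = compact_tool_result(
                raw_result=str(block["content"]),
                extractor_prompt="Extract only the fields needed for the next step as JSON.",
                client=client,
            )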

The cost calculus matters here. If a task runs for 20 turns and each turn re-reads 3,000 tokens of prior tool results, clearing those results after extraction saves up to 60,000 input tokens per session (20 × 3,000), minus whatever the compact summaries add back. The compaction call itself typically costs far less than the tokens it eliminates.
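The arithmetic can be checked with a few lines of Python. The function name and the optional `summary_tokens_per_turn` parameter are assumptions for illustration; the 200-token summary size is borrowed from the earlier database example rather than measured.

```python
def input_tokens_saved(
    turns: int,
    raw_tokens_per_turn: int,
    summary_tokens_per_turn: int = 0,
) -> int:
    """Input tokens saved over a session by reading summaries instead of raw tool results."""
    return turns * (raw_tokens_per_turn - summary_tokens_per_turn)


# Upper bound: the raw results are cleared entirely
upper_bound = input_tokens_saved(turns=20, raw_tokens_per_turn=3_000)

# More realistic: each turn still reads a 200-token summary (assumed size)
with_summaries = input_tokens_saved(
    turns=20, raw_tokens_per_turn=3_000, summary_tokens_per_turn=200
)
```

Here `upper_bound` is 60,000 tokens and `with_summaries` is 56,000, so the summaries themselves barely dent the saving.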

When should you isolate context with sub-agent architectures?

Compaction has limits. Once a task genuinely requires more distinct knowledge than can fit in one window, compressing harder just loses information. The correct response is architectural: move to a sub-agent context isolation model where each sub-agent receives only the context slice it needs.

In a hub-and-spoke architecture, a coordinator holds the global task state and delegates narrow sub-tasks to specialised agents. Each sub-agent starts with a fresh, minimal context. The coordinator synthesises results. This pattern is particularly effective for workloads in the 10,000 to 100,000 token range, where a single-agent approach would either hit window limits or incur prohibitive costs from repeated large-context calls.

The decision rule is straightforward:

Workload signal	Preferred strategy
Single coherent task, under ~30k tokens	In-context compaction
Multiple independent sub-tasks	Sub-agent isolation
Long-running session with resumption	Summarised injection into fresh session
Large corpus, up-to-date retrieval needed	External retrieval (RAG)
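The table can be read as a small routing function. This is only a sketch: the `Workload` fields, the order in which the signals are checked, and the fallback for a single task above roughly 30k tokens are assumptions, not rules stated by the exam or by Anthropic.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    estimated_tokens: int
    independent_subtasks: int = 1
    resumes_across_sessions: bool = False
    needs_large_or_fresh_corpus: bool = False


def choose_strategy(w: Workload) -> str:
    """Map workload signals to the preferred strategy from the table."""
    if w.needs_large_or_fresh_corpus:
        return "external retrieval (RAG)"
    if w.resumes_across_sessions:
        return "summarised injection into fresh session"
    if w.independent_subtasks > 1:
        return "sub-agent isolation"
    if w.estimated_tokens < 30_000:
        return "in-context compaction"
    # Assumed fallback: a single task too large to compact is split architecturally
    return "sub-agent isolation"
```

In practice these signals overlap, so a real agent may combine strategies rather than pick exactly one.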

Sub-agent isolation also improves reliability. A sub-agent that fails does not corrupt the coordinator's context. The coordinator can retry or reroute without replaying the entire conversation history. See multi-agent error handling and routing for the retry patterns the exam tests.
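The isolation property can be sketched without any model calls. Here `worker` is a hypothetical stand-in for whatever function runs a sub-agent, and `max_attempts` is an assumed retry budget; the point is that each sub-task sees only its own context slice and that a failure surfaces as `None` rather than as corrupted coordinator state.

```python
from typing import Callable, Optional


def run_subtask(
    task: str,
    context_slice: str,
    worker: Callable[[str, str], str],
    max_attempts: int = 2,
) -> Optional[str]:
    """Run one sub-agent on its own context slice, retrying on failure."""
    for _ in range(max_attempts):
        try:
            return worker(task, context_slice)
        except Exception:
            # The failure stays local; nothing is appended to coordinator state
            continue
    return None


def coordinate(subtasks: dict, worker: Callable[[str, str], str]) -> dict:
    """Delegate each sub-task with only its slice and collect results for synthesis."""
    return {task: run_subtask(task, ctx, worker) for task, ctx in subtasks.items()}
```

A coordinator can then decide per sub-task whether a `None` result should be rerouted, retried with a different slice, or reported upstream.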

How does external retrieval move memory outside the context window?

External retrieval, commonly called RAG (retrieval-augmented generation), is the right tool when the information an agent needs is too large to fit in context even after compaction, or when it changes frequently enough that baking it into a system prompt would become stale.

The pattern works as follows:

1. Chunk and embed the source corpus into a vector store at index time.
2. At query time, embed the agent's current query or sub-task description.
3. Retrieve the top-k chunks by cosine similarity.
4. Inject only those chunks into the context before the next model call.
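The four steps above can be sketched in plain Python. The `embed` function here is a toy bag-of-words counter standing in for a real embedding model, and the in-memory list stands in for a vector store; both are assumptions for illustration only.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks: list) -> list:
    """Index time: embed every chunk once."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list, query: str, k: int = 2) -> list:
    """Query time: embed the query and return the top-k chunks by cosine similarity."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Only the chunks returned by `retrieve` would be injected into the next model call; the rest of the corpus never enters the context window.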

The critical engineering decision is chunk selection quality. Retrieving low-signal chunks wastes tokens and can actively mislead the model. Semantic embeddings outperform keyword search for most natural-language corpora, but they are not infallible. Hybrid approaches that combine dense retrieval with sparse (BM25) retrieval, typically followed by a re-ranking step, often outperform either method alone.
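One common way to merge a dense ranking with a sparse one is reciprocal rank fusion (RRF). It is swapped in here purely as an illustration, since the post does not prescribe a fusion method; the function name is hypothetical and `k=60` is the conventional default constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each id scores 1 / (k + rank) per list it appears in, with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["a", "b", "c"]   # e.g. from embedding similarity
sparse_ranking = ["b", "c", "a"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

With these two rankings `fused` comes out as `["b", "a", "c"]`: chunk `b` wins because it ranks near the top of both lists, even though neither retriever placed it first and second in the same order.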

Retrieval quality is the single largest determinant of RAG system performance. The model cannot reason well over chunks it was not given.

Anthropic, Claude Documentation (Building with Claude, context and retrieval guidance)

Measuring retrieval quality requires two metrics: recall (did the retrieved set contain the answer?) and context adherence (did the model actually use the retrieved context rather than its parametric knowledge?). Both are testable with an LLM-as-judge eval pipeline.
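Recall is the easier of the two metrics to compute directly. A minimal sketch, assuming each query has a hand-labelled set of relevant chunk ids (the function name and the labelled data are hypothetical); context adherence is left out because it needs a judge model.

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(set(relevant_ids))


def mean_recall_at_k(examples: list, k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not examples:
        return 0.0
    return sum(recall_at_k(r, rel, k) for r, rel in examples) / len(examples)
```

When every query has exactly one relevant chunk, recall@k reduces to a hit-or-miss check, which matches the question "did the retrieved set contain the answer?".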

What is semantic token selection and how does it counter attention degradation?

Even when retrieval is good, the order and position of tokens inside the context window affect how much attention the model pays to them. The lost-in-the-middle effect is well-documented: models attend more strongly to tokens near the beginning and end of a long context than to tokens in the middle.

Semantic token selection addresses this by ranking candidate context chunks by relevance score and placing the highest-scoring chunks at the edges of the context, not in the middle. For a context window with a fixed budget, this means:

text

[System prompt] [Top-1 chunk] [Top-3 chunk] ... [Top-2 chunk] [Recent turns]

The middle positions receive lower-relevance material or are left empty. This is a low-cost intervention that can improve output quality on retrieval-heavy tasks.
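The edge-first ordering can be produced by alternating ranked chunks between the front and the back of the list. This is a sketch under the assumption that each chunk arrives with a relevance score; the function name is hypothetical.

```python
def place_at_edges(chunks_with_scores: list) -> list:
    """Order chunks so the most relevant sit at the edges and the least relevant in the middle.

    Input is a list of (chunk, score) pairs; output is a list of chunks.
    """
    ranked = sorted(chunks_with_scores, key=lambda cs: cs[1], reverse=True)
    # Ranks 1, 3, 5, ... fill the front; ranks 2, 4, 6, ... fill the back
    front = [chunk for i, (chunk, _) in enumerate(ranked) if i % 2 == 0]
    back = [chunk for i, (chunk, _) in enumerate(ranked) if i % 2 == 1]
    return front + back[::-1]


ordered = place_at_edges(
    [("c3", 0.5), ("c1", 0.9), ("c2", 0.8), ("c5", 0.1), ("c4", 0.3)]
)
```

Here `ordered` is `["c1", "c3", "c5", "c4", "c2"]`: the top-ranked chunk opens the block, the second-ranked chunk closes it, and the weakest chunk lands in the middle.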

The context management domain of the CCA-F exam includes scenarios where candidates must identify why an agent is ignoring provided context. Attention degradation from poor token placement is one of the four root causes tested.

How do MCP skills and rules enable dynamic context loading?

The Model Context Protocol (MCP) extends the memory design space beyond what is possible with raw API calls. An MCP server can expose resources (structured data catalogs), tools (callable functions), and prompts (reusable templates). An agent configured with the right MCP skills can load context dynamically at runtime rather than having it baked into a static system prompt.

A practical pattern for unsupervised agents:

json

{
  "mcpServers": {
    "knowledge-base": {
      "command": "npx",
      "args": ["-y", "@company/kb-mcp-server"],
      "env": {
        "KB_INDEX_URL": "${KB_INDEX_URL}",
        "KB_API_KEY": "${KB_API_KEY}"
      }
    }
  }
}

With this configuration, the agent calls the knowledge-base MCP server's retrieval tool when it needs domain-specific context, rather than receiving that context upfront. This keeps the initial context window small and avoids loading irrelevant material for tasks that do not need it.

The tool design and MCP integration domain (18% of the CCA-F exam) tests whether candidates understand how MCP resources differ from MCP tools, and when to use each. Resources are appropriate for large, stable content catalogs; tools are appropriate for dynamic queries where the retrieval parameters depend on the agent's current state.

Dynamic context loading can also reduce the risk of prompt injection. A static system prompt that includes large blocks of external content is a larger attack surface than an agent that retrieves only what is needed, only when it is needed.

What are the cost trade-offs between summarisation and token savings?

Every compaction strategy that involves an additional LLM call has a cost structure worth modelling explicitly. The break-even point depends on three variables: the cost per input token, the cost per output token, and the number of subsequent turns that will read the compacted context.

Strategy	Extra LLM calls	Token savings per subsequent turn	Break-even turns
Tool result clearing	1 per tool call	500 to 4,000 tokens	1 to 3
Session summarisation	1 per session resume	5,000 to 50,000 tokens	1
Sub-agent isolation	0 (architectural)	Full coordinator context	N/A
RAG retrieval	1 per query	Varies by corpus size	1

Session summarisation almost always pays for itself in a single turn because the saved tokens on the resumed session exceed the cost of the summarisation call. Tool result clearing pays for itself within one to three turns for verbose tool outputs. Sub-agent isolation has no extra LLM call cost but has coordination overhead.
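The break-even point follows directly from the three variables named above. In this sketch the function name is an assumption and the prices are hypothetical placeholders, not current API pricing; prices are expressed per million tokens, and the unit cancels in the ratio.

```python
import math


def break_even_turns(
    raw_tokens: int,
    summary_tokens: int,
    input_price: float,
    output_price: float,
) -> int:
    """Number of later turns after which one summarisation call has paid for itself."""
    # Cost of the extra call: read the raw result once, write the summary once
    call_cost = raw_tokens * input_price + summary_tokens * output_price
    # Saving on every later turn that reads the summary instead of the raw result
    saving_per_turn = (raw_tokens - summary_tokens) * input_price
    if saving_per_turn <= 0:
        raise ValueError("summary must be smaller than the raw result")
    return math.ceil(call_cost / saving_per_turn)


# Hypothetical prices: 3.0 per million input tokens, 15.0 per million output tokens
turns_needed = break_even_turns(4_000, 200, input_price=3.0, output_price=15.0)
```

With these placeholder prices a 4,000-token result summarised to 200 tokens breaks even after 2 turns. Smaller raw results take longer to pay back, because the output cost of writing the summary is a larger share of the total.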

Summary injection into a fresh session is the canonical CCA-F answer when a scenario describes a long-running agent that needs to resume after a session boundary. The exam consistently rewards solutions that trace the cost to the root cause and apply a proportionate fix.

How do you build deterministic safety nets around non-deterministic memory retrieval?

The final layer of a robust memory architecture is enforcement: ensuring that the agent actually uses the context it has been given, rather than falling back on parametric knowledge or, worse, hallucinating facts that were not retrieved.

Deterministic safety nets take two forms. The first is schema validation: if the agent is expected to produce structured output that references retrieved facts, validate the output against a schema that requires source attribution fields. An output that cannot be validated is rejected and the agent is prompted to retry with explicit instructions to cite its sources.
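A minimal sketch of the schema check, assuming the agent is asked to return a `claims` list in which every claim carries a `text` and a `source`. The field names and the function name are hypothetical; a production system might use a JSON Schema validator instead of hand-written checks.

```python
def validate_answer(output: dict, retrieved_sources: set) -> list:
    """Return a list of validation errors; an empty list means the output is accepted."""
    claims = output.get("claims")
    if not isinstance(claims, list) or not claims:
        return ["output must contain a non-empty 'claims' list"]
    errors = []
    for i, claim in enumerate(claims):
        if not isinstance(claim, dict):
            errors.append(f"claim {i}: must be an object")
            continue
        if not claim.get("text"):
            errors.append(f"claim {i}: missing 'text'")
        # Attribution must point at something that was actually retrieved
        if claim.get("source") not in retrieved_sources:
            errors.append(f"claim {i}: 'source' is not one of the retrieved sources")
    return errors
```

A non-empty error list is what triggers the rejection, and the same messages can be fed back to the agent in the retry prompt.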

The second is hook-based interception. Tool call interception hooks can inspect every tool call before execution and every tool result before it enters the context. A pre-tool hook can block calls that would retrieve from an unauthorised source. A post-tool hook can normalise retrieved content into a canonical format before it enters the context, preventing format-induced confusion.

python

# PostToolUse hook: normalise retrieved chunks before context injection
def normalise_retrieval_result(tool_result: dict) -> dict:
    """Ensure retrieved chunks have required provenance fields."""
    chunks = tool_result.get("chunks", [])
    normalised = []
    for chunk in chunks:
        if "source_url" not in chunk or "retrieved_at" not in chunk:
            raise ValueError(f"Chunk missing provenance: {chunk.get('id')}")
        normalised.append({
            "text": chunk["text"],
            "source": chunk["source_url"],
            "retrieved_at": chunk["retrieved_at"],
            "relevance_score": chunk.get("score", 0.0)
        })
    tool_result["chunks"] = normalised
    return tool_result

This pattern is tested in Domain 1 under PostToolUse hooks for data normalisation. The exam distinguishes between prompt-based enforcement (asking the model to behave correctly) and programmatic enforcement (making incorrect behaviour structurally impossible). For high-stakes memory operations, programmatic enforcement is the correct answer.
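The pre-tool side of programmatic enforcement can be sketched in the same spirit. The tool name `retrieve`, the `url` input field, and the allow-list below are assumptions for illustration and do not correspond to any specific hook API.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of hosts the agent may retrieve from
ALLOWED_HOSTS = {"kb.example.com"}


def pre_tool_check(tool_name: str, tool_input: dict) -> bool:
    """Return True if the call may proceed, False if it must be blocked."""
    if tool_name != "retrieve":
        # Only retrieval calls are policed in this sketch
        return True
    host = urlparse(tool_input.get("url", "")).hostname
    return host in ALLOWED_HOSTS
```

Because the check runs before execution, an unauthorised source never gets the chance to put content into the context, whatever the model was prompted to do.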

How does this map to the CCA-F exam?

The memory tool design space touches four of the five CCA-F domains. Understanding which pattern belongs to which domain prevents misclassification errors on scenario questions.

CCA-F domain	Weight	Memory-related concepts tested
Domain 1: Agentic Architecture	27%	Sub-agent isolation, hooks, session management
Domain 2: Tool Design and MCP	18%	MCP resources, dynamic context loading, tool descriptions
Domain 4: Prompt Engineering	20%	Summary injection, few-shot for retrieval tasks
Domain 5: Context Management	15%	Compaction, attention degradation, stale context

Our concept library at /concepts covers 174 atomic concepts mapped to all five domains and 30 task statements. The context management and agentic architecture clusters are the most directly relevant to memory tool design.

Agents should request only necessary permissions, avoid storing sensitive information beyond immediate needs, prefer reversible over irreversible actions, and err on the side of doing less and confirming with users when uncertain about intended scope.

Anthropic, Claude Documentation (Building effective agents)

That principle applies directly to memory design. An agent that retrieves and stores only what it needs for the current step is safer, cheaper, and easier to debug than one that pre-loads everything it might conceivably need.

The CCA-F exam launched on 12 March 2026 and as of 3 June 2026 has produced over 10,000 certified individuals. The 60-question format tests scenario judgement, not recall, which means understanding the trade-offs between memory strategies matters more than memorising their names.

Frequently asked questions

What is the Claude memory tool in the context of the CCA-F exam?

On the CCA-F exam, the Claude memory tool refers to the full family of context engineering patterns: in-context compaction, external retrieval (RAG), summarised session injection, and sub-agent isolation. The exam tests when to apply each pattern, not just whether you know the term. Domain 5 (Context Management, 15%) and Domain 1 (Agentic Architecture, 27%) carry the most memory-related questions.

How do I implement Head+Tail splitting for a Claude agent?

Head+Tail splitting preserves the system prompt (head) and the most recent N turns (tail) while compressing or discarding the middle. In practice, you maintain a message list, apply a summarisation call to the middle segment when total token count exceeds a threshold, replace the middle messages with the summary, and continue. The threshold is typically set at 70 to 80 percent of the model's context window.

When should I use RAG versus in-context compaction for a Claude agent?

Use in-context compaction when the information fits in the window after compression and does not change between sessions. Use RAG when the corpus is too large to fit even after compaction, when the data changes frequently, or when different queries need different subsets of a large knowledge base. RAG adds retrieval latency but keeps per-call token costs low for large corpora.

Does Claude have built-in persistent memory across sessions?

No. Claude does not persist memory between API sessions by default. Each API call starts with only the messages you provide. Persistent memory requires external infrastructure: a database, vector store, or file system that your application reads from and writes to, then injects into the context at the start of each session via a summary or retrieved chunks.
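A minimal sketch of that external infrastructure, using a JSON file as the store. The file layout, function names, and the wording of the injected preamble are all assumptions for illustration; a production system would more likely use a database or vector store.

```python
import json
from pathlib import Path


def _load(path: Path) -> dict:
    """Read the whole store, or return an empty one if the file does not exist yet."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_summary(path: Path, session_id: str, summary: str) -> None:
    """Persist a session summary so a later session can resume from it."""
    store = _load(path)
    store[session_id] = summary
    path.write_text(json.dumps(store, indent=2))


def resume_messages(path: Path, session_id: str, new_user_message: str) -> list:
    """Build the opening message list for a fresh session, injecting any saved summary."""
    summary = _load(path).get(session_id)
    if summary is None:
        content = new_user_message
    else:
        content = f"Context from the previous session:\n{summary}\n\n{new_user_message}"
    return [{"role": "user", "content": content}]
```

The application, not the model, owns both the write at the end of a session and the read at the start of the next one.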

How do MCP resources differ from MCP tools for memory use cases?

MCP resources are static or slowly-changing content catalogs that the agent can read, similar to files or database tables. MCP tools are callable functions that accept parameters and return dynamic results. For memory use cases, resources suit stable reference data (documentation, policy text) while tools suit dynamic retrieval where query parameters depend on the agent's current reasoning state.

What is the lost-in-the-middle effect and how does it affect Claude agents?

The lost-in-the-middle effect is the empirical finding that language models attend less strongly to tokens positioned in the middle of a long context than to tokens near the start or end. For Claude agents, this means that relevant retrieved chunks placed in the middle of a large context may be effectively ignored. The mitigation is to place the highest-relevance chunks at the edges of the context window.