Claude Memory Tool: Context Engineering for Long-Running Agents
Master the Claude memory tool and context engineering strategies: compaction, sub-agent isolation, external retrieval, and semantic selection for reliable Claude agents.
By Solomon Udoh · AI Architect & Certification Lead

Memory is the silent bottleneck in many production Claude agents. The Claude memory tool and the broader family of context engineering techniques exist to answer one question: how do you keep a long-running agent coherent when its context window is finite and every token costs something? This post maps the full design space, from in-window compaction to external retrieval, and ties each strategy to the CCA-F exam domain where it is tested.
What is the Claude memory tool and why does it matter?
As this post uses the term, the Claude memory tool is not a single API endpoint but a design pattern: any mechanism that allows a Claude agent to store, retrieve, and selectively load information across turns or across sessions. In a Claude agent, "memory" spans four distinct layers, each with different latency, cost, and reliability characteristics.
| Memory layer | Where state lives | Typical use case |
|---|---|---|
| In-context (window) | Active prompt tokens | Short tasks, single-session reasoning |
| External key-value store | Database or vector store | Facts that outlive a session |
| Tool-retrieved (RAG) | Retrieved at query time | Large corpora, up-to-date knowledge |
| Summarised injection | Compressed prior context | Resuming long sessions cheaply |
Domain 5 of the CCA-F exam, Context Management and Reliability, carries 15% of the total weight and tests exactly this taxonomy. Domain 1, Agentic Architecture and Orchestration, at 27%, tests how memory strategies interact with multi-agent designs. Together they account for 42% of the exam, so a weak mental model here is expensive.
How does in-context compaction prevent context pollution?
In-context compaction is the first line of defence. It keeps the context window from filling with low-signal tokens that dilute attention and degrade output quality. The two most practical compaction patterns are Head+Tail splitting and tool result clearing.
Head+Tail splitting preserves the system prompt and the most recent turns (the "tail") while compressing or discarding the middle. This directly counters the attention dilution problem, where tokens buried in the middle of a long context receive systematically lower attention weights than tokens at the edges.
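As a minimal sketch of Head+Tail splitting, the function below keeps the first few and last few messages and replaces the middle with a single placeholder. The function name, its parameters, and the placeholder text are illustrative assumptions, not part of any Claude SDK; a real agent would probably swap the placeholder for a summary of the dropped turns.

```python
from typing import Dict, List

Message = Dict[str, str]


def head_tail_split(
    messages: List[Message],
    head_turns: int = 1,
    tail_turns: int = 4,
    placeholder: str = "[earlier turns omitted]",
) -> List[Message]:
    """Keep the first `head_turns` and last `tail_turns` messages.

    Everything in between is collapsed into one placeholder message.
    (Hypothetical helper: names and defaults are assumptions.)
    """
    if len(messages) <= head_turns + tail_turns:
        return list(messages)  # short enough already, nothing to compact
    head = messages[:head_turns]
    tail = messages[-tail_turns:] if tail_turns > 0 else []
    dropped = len(messages) - head_turns - tail_turns
    middle = {"role": "user", "content": f"{placeholder} ({dropped} messages)"}
    return head + [middle] + tail


history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = head_tail_split(history, head_turns=1, tail_turns=3)
print(len(compacted))  # 5: one head message, one placeholder, three tail messages
```

The sketch ignores role alternation and token counting; both would need handling before sending the compacted list to a model.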
Tool result clearing removes verbose tool outputs once their key facts have been extracted and re-injected as a compact summary. A raw database response might consume 4,000 tokens; a structured summary of the same data might consume 200. The trade-off is one additional LLM call to produce the summary, which adds latency and cost. Whether that trade-off is favourable depends on how many subsequent turns will use the cleared context.
```python
# Sketch: tool result clearing after extraction
import anthropic


def compact_tool_result(raw_result: str, extractor_prompt: str, client: anthropic.Anthropic) -> str:
    """Summarise a verbose tool result before appending to context."""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[
            {"role": "user", "content": f"{extractor_prompt}\n\n{raw_result}"}
        ],
    )
    return response.content[0].text


# Replace the raw tool result in the message list with the compact version.
# `messages`, `tool_output`, and `client` come from the surrounding agent loop.
messages[-1]["content"] = compact_tool_result(
    raw_result=tool_output,
    extractor_prompt="Extract only the fields needed for the next step as JSON.",
    client=client,
)
```
The cost calculus matters here. If a task runs for 20 turns and each turn re-reads 3,000 tokens of prior tool results, clearing those results after extraction saves up to roughly 60,000 input tokens per session (20 × 3,000), less the tokens taken up by the compact summaries. The compaction call itself typically costs far less than the tokens it eliminates, provided the cleared result would otherwise be re-read on several later turns.
When should you isolate context with sub-agent architectures?
Compaction has limits. Once a task genuinely requires more distinct knowledge than can fit in one window, compressing harder just loses information. The correct response is architectural: move to a sub-agent context isolation model where each sub-agent receives only the context slice it needs.
In a hub-and-spoke architecture, a coordinator holds the global task state and delegates narrow sub-tasks to specialised agents. Each sub-agent starts with a fresh, minimal context. The coordinator synthesises results. This pattern is particularly effective for workloads in the 10,000 to 100,000 token range, where a single-agent approach would either hit window limits or incur prohibitive costs from repeated large-context calls.
The decision rule is straightforward:
| Workload signal | Preferred strategy |
|---|---|
| Single coherent task, under ~30k tokens | In-context compaction |
| Multiple independent sub-tasks | Sub-agent isolation |
| Long-running session with resumption | Summarised injection into fresh session |
| Large corpus, up-to-date retrieval needed | External retrieval (RAG) |
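The table above can be read as a small routing function. The sketch below is one possible encoding: the function name, the order in which signals are checked, and the fallback for a single task over ~30k tokens are all assumptions made for illustration, since the table does not rank the signals.

```python
def choose_memory_strategy(
    estimated_tokens: int,
    independent_subtasks: int = 1,
    resumes_across_sessions: bool = False,
    needs_large_or_fresh_corpus: bool = False,
) -> str:
    """Map workload signals to a preferred strategy (illustrative only)."""
    # Assumption: corpus and resumption signals take precedence over size.
    if needs_large_or_fresh_corpus:
        return "external retrieval (RAG)"
    if resumes_across_sessions:
        return "summarised injection into fresh session"
    if independent_subtasks > 1:
        return "sub-agent isolation"
    if estimated_tokens < 30_000:
        return "in-context compaction"
    # Assumption: a single large task that no longer fits comfortably
    # in one window is split architecturally rather than compressed harder.
    return "sub-agent isolation"


print(choose_memory_strategy(12_000))                          # in-context compaction
print(choose_memory_strategy(12_000, independent_subtasks=4))  # sub-agent isolation
```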
Sub-agent isolation also improves reliability. A sub-agent that fails does not corrupt the coordinator's context. The coordinator can retry or reroute without replaying the entire conversation history. See multi-agent error handling and routing for the retry patterns the exam tests.
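To make the reliability point concrete, here is a rough sketch of a coordinator loop that hands each sub-agent only its own context slice and retries failures locally. `run_with_isolation` and the `run_subagent` callable are hypothetical stand-ins for whatever actually invokes a sub-agent; nothing here is a real SDK call.

```python
from typing import Callable, Dict


def run_with_isolation(
    subtasks: Dict[str, str],
    run_subagent: Callable[[str], str],
    max_attempts: int = 2,
) -> Dict[str, str]:
    """Run each sub-task in its own minimal context and collect results.

    A failing sub-agent is retried without touching the other results,
    so the coordinator never replays the full conversation history.
    """
    results: Dict[str, str] = {}
    for name, context_slice in subtasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results[name] = run_subagent(context_slice)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Record the failure instead of aborting the whole task.
                    results[name] = f"FAILED: {exc}"
    return results


calls = {"n": 0}


def flaky_subagent(context_slice: str) -> str:
    """Fake sub-agent that fails on its very first call only."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient error")
    return context_slice.upper()


out = run_with_isolation(
    {"a": "summarise logs", "b": "check schema"}, flaky_subagent
)
print(out)  # {'a': 'SUMMARISE LOGS', 'b': 'CHECK SCHEMA'}
```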
How does external retrieval move memory outside the context window?
External retrieval, commonly called RAG (retrieval-augmented generation), is the right tool when the information an agent needs is too large to fit in context even after compaction, or when it changes frequently enough that baking it into a system prompt would become stale.
The pattern works as follows:
- Chunk and embed the source corpus into a vector store at index time.
- At query time, embed the agent's current query or sub-task description.
- Retrieve the top-k chunks by cosine similarity.
- Inject only those chunks into the context before the next model call.
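The retrieval step in the list above can be sketched in plain Python. The three-dimensional vectors below are toy stand-ins for real embeddings, and `top_k_chunks` is an illustrative name; a production system would use an embedding model and a vector store rather than a linear scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: (chunk text, made-up embedding)
index = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("warranty terms", [0.7, 0.2, 0.1]),
]
hits = top_k_chunks([1.0, 0.0, 0.0], index, k=2)
print([text for text, _ in hits])  # ['refund policy', 'warranty terms']
```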
The critical engineering decision is chunk selection quality. Retrieving low-signal chunks wastes tokens and can actively mislead the model. Semantic embeddings outperform keyword search for most natural-language corpora, but they are not infallible. Hybrid approaches that combine dense retrieval with sparse (BM25) scoring often outperform either method alone, particularly on queries that hinge on exact terms such as identifiers or product names.
Retrieval quality is the single largest determinant of RAG system performance. The model cannot reason well over chunks it was not given.
Measuring retrieval quality requires two metrics: recall (did the retrieved set contain the answer?) and context adherence (did the model actually use the retrieved context rather than its parametric knowledge?). Both are testable with an LLM-as-judge eval pipeline.
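The recall half of that measurement is simple enough to compute directly. The sketch below assumes each eval question comes with a hand-labelled set of relevant chunk IDs; `recall_at_k` is an illustrative helper, and context adherence is left out because it needs a judge model rather than set arithmetic.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # nothing to find; treated as zero by convention here
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Two relevant chunks are labelled; only one of them is in the top 3.
score = recall_at_k(["c1", "c7", "c3", "c9"], {"c3", "c9"}, k=3)
print(score)  # 0.5
```

Averaging this score over an eval set gives a single number to track as chunking or embedding choices change.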
What is semantic token selection and how does it counter attention degradation?
Even when retrieval is good, the order and position of tokens inside the context window affect how much attention the model pays to them. The lost-in-the-middle effect is well-documented: models attend more strongly to tokens near the beginning and end of a long context than to tokens in the middle.
Semantic token selection addresses this by ranking candidate context chunks by relevance score and placing the highest-scoring chunks at the edges of the context, not in the middle. For a context window with a fixed budget, this means:
[System prompt] [Top-1 chunk] [Top-3 chunk] ... [Top-2 chunk] [Recent turns]
The middle positions receive lower-relevance material or are left empty. This is a low-cost intervention that measurably improves output quality on retrieval-heavy tasks.
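One way to produce the layout shown above is to alternate ranked chunks between the front and the back of the list, so relevance falls off towards the middle. The sketch below is an assumption about how that ordering could be built; `order_for_edges` is an illustrative name.

```python
from typing import List, Tuple


def order_for_edges(chunks: List[Tuple[str, float]]) -> List[str]:
    """Place the highest-scoring chunks at the edges, the lowest in the middle.

    Ranks 1, 3, 5, ... fill the front in order; ranks 2, 4, 6, ... fill the
    back in reverse, so rank 2 ends up last.
    """
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front = [text for i, (text, _) in enumerate(ranked) if i % 2 == 0]
    back = [text for i, (text, _) in enumerate(ranked) if i % 2 == 1]
    return front + back[::-1]


scored = [("A", 0.9), ("B", 0.8), ("C", 0.7), ("D", 0.6), ("E", 0.5)]
print(order_for_edges(scored))  # ['A', 'C', 'E', 'D', 'B']
```

The ordered chunks would then sit between the system prompt and the recent turns, as in the layout above.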
The context management domain of the CCA-F exam includes scenarios where candidates must identify why an agent is ignoring provided context. Attention degradation from poor token placement is one of the four root causes tested.
How do MCP skills and rules enable dynamic context loading?
The Model Context Protocol (MCP) extends the memory design space beyond what is possible with raw API calls. An MCP server can expose resources (structured data catalogs), tools (callable functions), and prompts (reusable templates). An agent configured with the right MCP skills can load context dynamically at runtime rather than having it baked into a static system prompt.
A practical pattern for unsupervised agents:
```json
{
  "mcpServers": {
    "knowledge-base": {
      "command": "npx",
      "args": ["-y", "@company/kb-mcp-server"],
      "env": {
        "KB_INDEX_URL": "${KB_INDEX_URL}",
        "KB_API_KEY": "${KB_API_KEY}"
      }
    }
  }
}
```
With this configuration, the agent calls the knowledge-base MCP server's retrieval tool when it needs domain-specific context, rather than receiving that context upfront. This keeps the initial context window small and avoids loading irrelevant material for tasks that do not need it.
The tool design and MCP integration domain (18% of the CCA-F exam) tests whether candidates understand how MCP resources differ from MCP tools, and when to use each. Resources are appropriate for large, stable content catalogs; tools are appropriate for dynamic queries where the retrieval parameters depend on the agent's current state.
Dynamic context loading can also shrink the prompt-injection attack surface. A static system prompt that embeds large blocks of external content exposes more untrusted text on every call than a system prompt that retrieves only what is needed, only when it is needed.
What are the cost trade-offs between summarisation and token savings?
Every compaction strategy that involves an additional LLM call has a cost structure worth modelling explicitly. The break-even point depends on three variables: the cost per input token, the cost per output token, and the number of subsequent turns that will read the compacted context.
| Strategy | Extra LLM calls | Token savings per subsequent turn | Break-even turns |
|---|---|---|---|
| Tool result clearing | 1 per tool call | 500 to 4,000 tokens | 1 to 3 |
| Session summarisation | 1 per session resume | 5,000 to 50,000 tokens | 1 |
| Sub-agent isolation | 0 (architectural) | Full coordinator context | N/A |
| RAG retrieval | 1 per query | Varies by corpus size | 1 |
Session summarisation almost always pays for itself in a single turn because the saved tokens on the resumed session exceed the cost of the summarisation call. Tool result clearing pays for itself within one to three turns for verbose tool outputs. Sub-agent isolation has no extra summarisation call but does carry coordination overhead.
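The break-even point can be estimated with a few lines of arithmetic. In the sketch below the prices are placeholders chosen only to make the numbers concrete, not real API pricing, and the model ignores the extractor prompt's own tokens; `break_even_turns` is an illustrative helper.

```python
import math


def break_even_turns(
    raw_tokens: int,
    summary_tokens: int,
    input_price: float,
    output_price: float,
) -> int:
    """Turns needed before a compaction call has paid for itself.

    Cost of the call: read the raw result once, write the summary once.
    Saving per later turn: the input tokens no longer re-read.
    Prices may be in any unit as long as both use the same one.
    """
    call_cost = raw_tokens * input_price + summary_tokens * output_price
    saving_per_turn = (raw_tokens - summary_tokens) * input_price
    if saving_per_turn <= 0:
        raise ValueError("summary must be shorter than the raw result")
    return math.ceil(call_cost / saving_per_turn)


# 4,000-token raw result compacted to 200 tokens, using placeholder
# prices of 3 (input) and 15 (output) per million tokens.
turns = break_even_turns(4_000, 200, input_price=3.0, output_price=15.0)
print(turns)  # 2
```

With these placeholder prices the call costs about 1.3 turns' worth of savings, so it is paid back by the second subsequent turn.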
Summary injection into a fresh session is the canonical CCA-F answer when a scenario describes a long-running agent that needs to resume after a session boundary. The exam consistently rewards solutions that trace the cost to the root cause and apply a proportionate fix.
How do you build deterministic safety nets around non-deterministic memory retrieval?
The final layer of a robust memory architecture is enforcement: ensuring that the agent actually uses the context it has been given, rather than falling back on parametric knowledge or, worse, hallucinating facts that were not retrieved.
Deterministic safety nets take two forms. The first is schema validation: if the agent is expected to produce structured output that references retrieved facts, validate the output against a schema that requires source attribution fields. An output that cannot be validated is rejected and the agent is prompted to retry with explicit instructions to cite its sources.
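A schema check of this kind does not need a validation library to illustrate. The sketch below assumes a hypothetical output shape, a `claims` list whose items carry `statement` and `source` fields, and also checks each cited source against the set of IDs that were actually retrieved. The field names and that extra check are assumptions for illustration.

```python
from typing import Any, Dict, List, Set

REQUIRED_CLAIM_FIELDS = ("statement", "source")  # hypothetical schema


def validate_attributed_output(output: Dict[str, Any], retrieved_sources: Set[str]) -> List[str]:
    """Return a list of validation errors; an empty list means the output passes."""
    claims = output.get("claims")
    if not isinstance(claims, list) or not claims:
        return ["output must contain a non-empty 'claims' list"]
    errors: List[str] = []
    for i, claim in enumerate(claims):
        if not isinstance(claim, dict):
            errors.append(f"claims[{i}] must be an object")
            continue
        for field in REQUIRED_CLAIM_FIELDS:
            if not claim.get(field):
                errors.append(f"claims[{i}] missing '{field}'")
        source = claim.get("source")
        if source and source not in retrieved_sources:
            errors.append(f"claims[{i}] cites unknown source '{source}'")
    return errors


good = {"claims": [{"statement": "Refunds take 5 days.", "source": "doc-1"}]}
bad = {"claims": [{"statement": "Refunds take 5 days."}]}
print(validate_attributed_output(good, {"doc-1"}))  # []
print(validate_attributed_output(bad, {"doc-1"}))   # ["claims[0] missing 'source'"]
```

A non-empty error list is what would trigger the rejection and retry described above, with the errors fed back to the agent as explicit instructions.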
The second is hook-based interception. Tool call interception hooks can inspect every tool call before execution and every tool result before it enters the context. A pre-tool hook can block calls that would retrieve from an unauthorised source. A post-tool hook can normalise retrieved content into a canonical format before it enters the context, preventing format-induced confusion.
```python
# PostToolUse hook: normalise retrieved chunks before context injection
def normalise_retrieval_result(tool_result: dict) -> dict:
    """Ensure retrieved chunks have required provenance fields."""
    chunks = tool_result.get("chunks", [])
    normalised = []
    for chunk in chunks:
        if "source_url" not in chunk or "retrieved_at" not in chunk:
            raise ValueError(f"Chunk missing provenance: {chunk.get('id')}")
        normalised.append({
            "text": chunk["text"],
            "source": chunk["source_url"],
            "retrieved_at": chunk["retrieved_at"],
            "relevance_score": chunk.get("score", 0.0),
        })
    tool_result["chunks"] = normalised
    return tool_result
```
This pattern is tested in Domain 1 under PostToolUse hooks for data normalisation. The exam distinguishes between prompt-based enforcement (asking the model to behave correctly) and programmatic enforcement (making incorrect behaviour structurally impossible). For high-stakes memory operations, programmatic enforcement is the correct answer.
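The pre-tool half of hook-based interception can be sketched the same way. The example below assumes a hypothetical tool named `retrieve` that takes a `url` input, and a made-up allow-list; how the hook is registered depends on the agent framework and is not shown.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"kb.example.com"}  # hypothetical allow-list


def pre_tool_check(tool_name: str, tool_input: dict) -> None:
    """Block retrieval calls that target a host outside the allow-list.

    Raises PermissionError so the call never executes; other tools pass through.
    """
    if tool_name != "retrieve":  # assumed tool name
        return
    host = urlparse(tool_input.get("url", "")).hostname
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"Blocked retrieval from unauthorised host: {host!r}")


pre_tool_check("retrieve", {"url": "https://kb.example.com/articles/42"})  # allowed
try:
    pre_tool_check("retrieve", {"url": "https://evil.example.net/payload"})
except PermissionError as exc:
    print(exc)  # Blocked retrieval from unauthorised host: 'evil.example.net'
```

Because the check runs in code before the tool executes, it is an instance of programmatic rather than prompt-based enforcement.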
How does this map to the CCA-F exam?
The memory tool design space touches four of the five CCA-F domains. Understanding which pattern belongs to which domain prevents misclassification errors on scenario questions.
| CCA-F domain | Weight | Memory-related concepts tested |
|---|---|---|
| Domain 1: Agentic Architecture | 27% | Sub-agent isolation, hooks, session management |
| Domain 2: Tool Design and MCP | 18% | MCP resources, dynamic context loading, tool descriptions |
| Domain 4: Prompt Engineering | 20% | Summary injection, few-shot for retrieval tasks |
| Domain 5: Context Management | 15% | Compaction, attention degradation, stale context |
Our concept library at /concepts covers 174 atomic concepts mapped to all five domains and 30 task statements. The context management and agentic architecture clusters are the most directly relevant to memory tool design.
Agents should request only necessary permissions, avoid storing sensitive information beyond immediate needs, prefer reversible over irreversible actions, and err on the side of doing less and confirming with users when uncertain about intended scope.
That principle applies directly to memory design. An agent that retrieves and stores only what it needs for the current step is safer, cheaper, and easier to debug than one that pre-loads everything it might conceivably need.
The CCA-F exam launched on 12 March 2026 and as of 3 June 2026 has produced over 10,000 certified individuals. The 60-question format tests scenario judgement, not recall, which means understanding the trade-offs between memory strategies matters more than memorising their names.
About the author
Solomon Udoh is an AI Architect who designs and ships production agent systems on the Claude API and Claude Code. He built AI Skill Certs' adaptive engine and authored its 174-concept knowledge graph, mapping every Claude Certified Architect - Foundations objective to hands-on, exam-aligned practice.
- Designs production multi-agent systems on the Claude API and Agent SDK
- Author of the AI Skill Certs knowledge graph (174 mapped exam concepts)
- Builds with MCP, Claude Code, structured outputs, and agentic loops daily
- Reviews every concept page against the official Anthropic exam guide