Concept deep dive · 10 min read · 14 June 2026

Prompt Engineering vs Context Engineering for Claude Agents

Prompt engineering is table stakes. Context engineering is what separates reliable Claude agents from brittle ones. Here is how to architect context that scales.

By Solomon Udoh · AI Architect & Certification Lead

What is the difference between prompt engineering and context engineering?

Prompt engineering is the craft of writing instructions that reliably steer a model's output. Context engineering is the broader discipline of deciding what information enters the context window at all: which tools are exposed, which documents are retrieved, which rules apply, and in what order everything is assembled. If prompt engineering is copywriting, context engineering is information architecture.

The distinction matters because Claude's behaviour is determined not just by the words in your system prompt but by the entire payload the model receives at inference time. A perfectly worded instruction can still fail if it is surrounded by contradictory tool descriptions, stale retrieved documents, or a conversation history that has drifted far from the current task. The Prompt Engineering & Structured Output domain of the CCA-F exam accounts for 20% of the exam weight; the other four domains (among them Agentic Architecture, Tool Design, and Context Management) together account for the remaining 80%. That ratio reflects how much of reliable agent behaviour lives outside the prompt itself.

Why does context engineering matter more as agents scale?

Single-turn completions are forgiving. An agent that runs dozens of tool calls across a long session is not. Each tool result appended to the conversation, each subagent response folded back in, each retrieved document chunk added to the window compounds the risk of what Anthropic's documentation calls the attention dilution problem: as context grows, the model's effective attention to any individual instruction weakens.

The numbers make this concrete. The CCA-F exam covers five domains and 30 task statements. Domain 5, Context Management & Reliability, carries 15% of the exam weight on its own, and its task statements deal almost entirely with preventing context degradation: stale context, summary injection, session forking, and progressive summarisation traps. Domain 1, Agentic Architecture & Orchestration, carries 27%. Together, the two domains that are most directly about what goes into the context account for 42% of the exam. That is not an accident.

The goal of context engineering is to ensure the model has exactly the information it needs, no more and no less, at the moment it needs to act.

Anthropic, Claude Documentation (Model Context Protocol and agentic use-case guidance)

How do you architect context for a Claude-based agent?

A useful mental model is to treat context as a layered stack, assembled at request time from several distinct sources. Each layer has a different update frequency and a different owner.

| Layer | Contents | Update frequency | Owner |
| --- | --- | --- | --- |
| System prompt | Role, rules, output format, safety constraints | Per deployment | Developer |
| Tool definitions | Names, descriptions, input schemas | Per deployment or session | Developer |
| Retrieved documents | RAG chunks, knowledge-base excerpts | Per request | Retrieval system |
| Conversation history | Prior turns, tool calls, tool results | Per turn | Runtime |
| Injected state | Repo snapshot, git diff, dependency manifest | Per task | Orchestrator |
| Session summary | Compressed prior context | On compaction | Orchestrator |

The key architectural decision is which layers are hand-curated by developers and which are generated automatically. Hand-curation gives you precision; automation gives you scale. The right answer is almost always: curate the system prompt and tool definitions by hand, automate retrieval and state injection, and use programmatic compaction for conversation history.
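As a rough illustration of the layered stack, the sketch below assembles one request from the layers in the table. The `ContextLayers` container, the `assemble_request` function, and the section headings are hypothetical names chosen for this example, not part of any Anthropic SDK, and the ordering shown is one reasonable choice rather than a prescribed one.

```python
from dataclasses import dataclass


@dataclass
class ContextLayers:
    """Hypothetical container for the layers in the table above."""
    system_prompt: str           # hand-curated, per deployment
    tools: list[dict]            # hand-curated, per deployment or session
    retrieved_chunks: list[str]  # automated, per request
    injected_state: str          # automated, per task
    session_summary: str         # produced on compaction
    history: list[dict]          # prior turns, managed by the runtime


def assemble_request(layers: ContextLayers, user_input: str) -> dict:
    """Build one request payload from the layers, in a fixed order."""
    # Per-request material goes into the newest user turn so the
    # hand-curated system prompt stays static across calls.
    sections = []
    if layers.session_summary:
        sections.append(f"## Summary of earlier work\n{layers.session_summary}")
    if layers.injected_state:
        sections.append(f"## Current state\n{layers.injected_state}")
    if layers.retrieved_chunks:
        joined = "\n---\n".join(layers.retrieved_chunks)
        sections.append(f"## Retrieved documents\n{joined}")
    sections.append(f"## Task\n{user_input}")

    messages = list(layers.history)  # copy so the caller's history is untouched
    messages.append({"role": "user", "content": "\n\n".join(sections)})
    return {
        "system": layers.system_prompt,
        "tools": layers.tools,
        "messages": messages,
    }
```

Keeping assembly in one function gives a single place to audit what enters the window and in what order, which makes layer-by-layer debugging easier later on.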

What should go in the system prompt?

The system prompt is the highest-trust, lowest-noise layer. It should contain:

  1. A concise role statement (one to three sentences).
  2. Explicit output format requirements, including any JSON schema.
  3. Behavioural rules that must hold for every request (safety constraints, escalation triggers, tone).
  4. A short description of which tools are available and when to prefer each.

What it should not contain: large blocks of reference documentation, full file contents, or anything that changes per request. Those belong in retrieved or injected layers, not in a static prompt that bloats every call.

```text
You are a code-review agent for the Acme payments platform.
Your output is always a JSON object matching the ReviewResult schema below.
Escalate to a human reviewer whenever you detect a PCI-DSS scope change.
You have access to three tools: read_file, search_codebase, and post_comment.
Use search_codebase before read_file; read_file only when you need full file contents.
```

How should tool definitions be written to avoid context confusion?

Tool descriptions are not documentation for humans; they are selection signals for the model. A vague description causes misrouting. A description that overlaps with another tool causes the model to guess. The Tool Descriptions as Selection Mechanism concept captures this precisely: the model reads descriptions at inference time and routes accordingly, so every word in a description is a routing instruction.

Practical rules:

  1. State what the tool does in the first sentence.
  2. State what it does not do in the second sentence, if there is a common confusion case.
  3. Include the expected input format and any constraints.
  4. Keep descriptions under 150 words per tool.

When you have too many tools in scope, the model's ability to select the right one degrades. The Tool Overload Problem is a real failure mode: beyond roughly ten to fifteen tools, selection accuracy drops measurably. The fix is scoping: expose only the tools relevant to the current task or subagent role, not the full catalogue.

```json
{
  "name": "search_codebase",
  "description": "Full-text and semantic search across the repository index. Use this to locate files, functions, or patterns by keyword or concept. Do NOT use this to read a specific file by path; use read_file for that. Returns a ranked list of file paths and matching excerpts. Input: a natural-language or keyword query string.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string" }
    },
    "required": ["query"]
  }
}
```
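The scoping fix for tool overload can be sketched as a lookup from subagent role to an allowed subset of the catalogue. The role names, the `ROLE_TOOLS` mapping, and the `tools_for_role` helper are assumptions made for this example, and the tool definitions are shortened (a real definition would also carry its `input_schema`).

```python
# Hypothetical catalogue: tool name -> tool definition (shortened for brevity).
TOOL_CATALOGUE = {
    "search_codebase": {"name": "search_codebase", "description": "Search the repository index."},
    "read_file": {"name": "read_file", "description": "Read one file by path."},
    "post_comment": {"name": "post_comment", "description": "Post a review comment."},
}

# Hypothetical mapping from subagent role to the tools that role may see.
ROLE_TOOLS = {
    "explorer": ["search_codebase", "read_file"],
    "reviewer": ["search_codebase", "read_file", "post_comment"],
}


def tools_for_role(role: str) -> list[dict]:
    """Return only the tool definitions in scope for the given role."""
    if role not in ROLE_TOOLS:
        raise ValueError(f"unknown role: {role}")
    return [TOOL_CATALOGUE[name] for name in ROLE_TOOLS[role]]
```

The returned list is what would be passed as the tool set for that subagent's requests, so an explorer never sees `post_comment` at all.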

What are the best context packaging patterns for code agents?

Code agents are the hardest context engineering problem because the relevant state is large, heterogeneous, and changes with every commit. A well-packaged context for a code agent typically includes:

  • Repo state snapshot: a file tree (not full contents) so the model knows what exists.
  • Targeted file contents: only the files directly relevant to the current task, fetched via tool call rather than pre-loaded.
  • Git diff: the changes since the last stable commit, so the model understands what has changed.
  • Dependency manifest: package.json, pyproject.toml, or equivalent, so the model can reason about library versions.
  • Coding standards excerpt: the relevant section of your style guide, not the whole document.
  • Conversation history: trimmed or summarised once it exceeds a threshold.

The Incremental Codebase Understanding Pattern formalises this: start with structure, retrieve detail on demand, never pre-load what the model might not need. This keeps the context window lean and the model's attention focused.

```python
# Minimal context bootstrap for a code-review agent
def build_context(pr_diff: str, repo_tree: str, standards_excerpt: str) -> list[dict]:
    return [
        {
            "role": "user",
            "content": (
                f"## Repository structure\n{repo_tree}\n\n"
                f"## Coding standards (relevant excerpt)\n{standards_excerpt}\n\n"
                f"## Pull request diff\n{pr_diff}\n\n"
                "Review the diff against the standards. "
                "Use search_codebase if you need additional context."
            ),
        }
    ]
```
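The last item in the packaging list, trimming conversation history once it exceeds a threshold, might look something like the sketch below. The four-characters-per-token estimate is a crude assumption rather than a real tokenizer, `trim_history` is a hypothetical helper, and message content is assumed to be a plain string.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate (about four characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages that fit within max_tokens."""
    kept = []
    used = 0
    # Walk backwards from the newest message and stop once the budget is spent.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if kept and used + cost > max_tokens:
            break
        kept.append(message)  # the newest message is always kept
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

A production version would likely also need to keep each tool call paired with its result and to summarise the dropped turns rather than discard them outright.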

How do you prevent conflicting instructions and noisy inputs from degrading output?

Conflicting instructions are the most common source of silent agent failure. The model does not throw an error when two instructions contradict each other; it resolves the conflict probabilistically, and the resolution may not be the one you intended. Three patterns prevent this:

1. Single source of truth for each rule class. If your system prompt says "always respond in English" and a retrieved document says "respond in the user's language", the model will sometimes do one and sometimes the other. Audit your context layers for overlapping rule domains and consolidate.

2. Explicit priority ordering. When conflict is unavoidable (for example, a general rule and a task-specific override), state the priority explicitly in the system prompt: "Task-specific instructions in the user turn override these defaults."

3. Programmatic enforcement for high-stakes rules. The Hooks vs Prompts Decision Framework is clear on this: if a rule must hold without exception (compliance logging, PII redaction, cost caps), enforce it in code, not in a prompt. Prompts are probabilistic; hooks are deterministic.
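To make the third pattern concrete, here is a minimal sketch of a deterministic redaction step that runs in code before any tool result reaches the context. The regular expressions are deliberately simplified and would miss many real-world formats, and `pre_context_hook` is a hypothetical name, not an API of any particular agent framework.

```python
import re

# Simplified patterns for illustration only; real PII detection needs far more coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


def redact_pii(text: str) -> str:
    """Replace email addresses and card-like digit runs with placeholders."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return CARD.sub("[REDACTED_CARD]", text)


def pre_context_hook(tool_result: str) -> str:
    """Hypothetical hook: applied to every tool result before it is appended to the conversation."""
    return redact_pii(tool_result)
```

Because the hook runs on every result regardless of what the model was told, the rule holds even on requests where a prompt instruction would have been ignored.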

Noisy inputs are a separate problem. Retrieved documents that are outdated, off-topic, or internally inconsistent degrade output quality even when instructions are clean. Mitigations include:

  • Timestamp-filtering retrieved chunks (reject anything older than a configurable threshold).
  • Relevance-score thresholds (do not inject a chunk below a minimum cosine similarity).
  • Chunk deduplication before injection.
  • Explicit uncertainty signals in the prompt: "If the retrieved context does not answer the question, say so rather than guessing."
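The first three mitigations can be combined into a single filtering pass, sketched below. The `Chunk` fields, the 180-day age limit, and the 0.75 score threshold are placeholder assumptions for illustration; appropriate values depend on the corpus and the retriever.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Chunk:
    text: str
    score: float          # similarity score from the retriever, assumed to be in [0, 1]
    updated_at: datetime  # last-modified time of the source document


def filter_chunks(chunks, now, max_age=timedelta(days=180), min_score=0.75):
    """Drop stale, low-relevance, and duplicate chunks; best-scored chunks come first."""
    seen = set()
    kept = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if now - chunk.updated_at > max_age:
            continue  # stale
        if chunk.score < min_score:
            continue  # below the relevance threshold
        key = " ".join(chunk.text.lower().split())  # normalise case and whitespace
        if key in seen:
            continue  # duplicate of a chunk already kept
        seen.add(key)
        kept.append(chunk)
    return kept
```

Sorting by score first means that when two chunks are duplicates, the higher-scored copy is the one that survives.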

Inject only what the model needs to act correctly on this request. Every token that does not contribute to the decision is a token that dilutes the tokens that do.

Anthropic, Claude Documentation (context window and prompt design guidance)

How do you evaluate whether a context strategy is actually working?

Context engineering without evaluation is guesswork. The evaluation stack for a context strategy has three levels:

| Level | Method | What it catches |
| --- | --- | --- |
| Unit | Deterministic assertion on output fields | Schema violations, missing required fields |
| Regression | Fixed prompt/context pairs with expected outputs | Regressions introduced by context changes |
| Human-in-the-loop | Sampled review of live outputs | Subtle quality degradation, edge cases |

The CCA-F exam consistently rewards root-cause tracing over symptomatic fixes. If your agent's output quality degrades after a context change, the exam expects you to trace the failure to a specific layer (tool description, retrieved chunk, conversation history length) rather than tuning the system prompt blindly.

A practical regression harness for context changes:

```python
import anthropic

client = anthropic.Anthropic()


def run_context_regression(test_cases: list[dict]) -> dict:
    results = {"passed": 0, "failed": 0, "failures": []}
    for case in test_cases:
        # Replay the fixed system prompt and messages for this case.
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            system=case["system"],
            messages=case["messages"],
        )
        output = response.content[0].text
        # Each case supplies its own callable that returns True on a pass.
        if case["assertion"](output):
            results["passed"] += 1
        else:
            results["failed"] += 1
            results["failures"].append({
                "case_id": case["id"],
                "output": output,
            })
    return results
```
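A test case for this harness needs an `assertion` callable. The sketch below shows one possible unit-level check: a deterministic assertion that the output is a JSON object with the expected fields. The field names `verdict` and `comments` are hypothetical stand-ins, since the ReviewResult schema itself is not shown in this article.

```python
import json

# Hypothetical ReviewResult fields and their expected JSON types.
REQUIRED_FIELDS = {"verdict": str, "comments": list}


def matches_review_schema(output: str) -> bool:
    """Return True if output is a JSON object with the required, correctly typed fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(name), kind) for name, kind in REQUIRED_FIELDS.items())


# One fixed prompt/context pair in the shape the harness above expects.
test_cases = [
    {
        "id": "pr-diff-basic",
        "system": "You are a code-review agent. Respond with a JSON object only.",
        "messages": [{"role": "user", "content": "Review this diff: ..."}],
        "assertion": matches_review_schema,
    }
]
```

Passing `test_cases` to `run_context_regression` before and after a context change gives a like-for-like comparison of pass counts.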

For production agents, human-in-the-loop review should be triggered by structured escalation criteria, not by vague thresholds. Define what "degraded output" means before you deploy, not after you observe failures.

How does context engineering fit enterprise constraints?

Enterprise deployments add three constraints that pure prompt engineering does not address: compliance, safety, and cost.

Compliance requires that certain information never enters the context (PII, regulated data) and that certain rules always apply (audit logging, data residency). Programmatic hooks are the only reliable enforcement mechanism. A system prompt that says "do not log PII" is not a compliance control.

Safety in the CCA-F sense means preferring deterministic, reversible actions over probabilistic, irreversible ones when stakes are high. The exam's high-stakes enforcement decision rule is explicit: when an action cannot be undone, require explicit human confirmation before proceeding. This is a context engineering decision as much as a prompt engineering one: the context must include enough information about action reversibility for the model to apply the rule correctly.
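One way to express the confirmation rule in code is a gate around tool execution, sketched below. The `REVERSIBLE` registry, the `delete_branch` tool, and the `confirm` callback are all hypothetical; how confirmation is actually collected (a ticket, a chat prompt, an approval queue) is left open.

```python
from typing import Callable

# Hypothetical reversibility registry: tool name -> whether its effect can be undone.
REVERSIBLE = {"read_file": True, "post_comment": True, "delete_branch": False}


def execute_tool(name: str, run: Callable[[], str], confirm: Callable[[str], bool]) -> str:
    """Run a tool call, requiring explicit confirmation when it cannot be undone."""
    # Unknown tools are treated as irreversible, the conservative default.
    if not REVERSIBLE.get(name, False):
        if not confirm(name):
            return f"{name} blocked: human confirmation was not given"
    return run()
```

The same registry could also be rendered into the tool descriptions, so the model sees which actions are irreversible while the gate enforces the rule regardless.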

Cost is managed primarily through retrieval discipline. Every token in the context window costs money. Strategies that keep context lean (on-demand retrieval, progressive summarisation, session forking for divergent tasks) directly reduce per-request cost. The Session Management Options concept covers the trade-offs between resuming, forking, and starting fresh, each of which has a different cost and context-quality profile.

As of 3 June 2026, more than 10,000 individuals hold the Claude Certified Architect credential, and over 40,000 firms have applied to the Claude Partner Network. The organisations deploying Claude at scale are the ones where context engineering discipline is most visibly separating reliable production systems from brittle prototypes.

Is context engineering a durable discipline or a transitional phase?

The honest answer is: both, depending on which part you mean. The manual parts of context engineering (hand-writing retrieval queries, tuning chunk sizes, curating tool catalogues) will increasingly be automated as tooling matures. The architectural parts (deciding what information a model needs, in what form, at what time, with what priority ordering) are durable because they reflect the fundamental nature of how language models process information.

Prompt engineering taught us that how you phrase an instruction matters. Context engineering teaches us that what surrounds the instruction matters just as much. Neither lesson becomes obsolete as models improve; they become more important, because more capable models are deployed in more complex, higher-stakes settings where context errors have larger consequences.

For CCA-F candidates, the practical implication is clear: mastery of prompt engineering and structured output is necessary but not sufficient. The exam's domain weights signal that Anthropic expects architects to reason about the full context stack, from system prompt through tool definitions, retrieval, conversation history, and session management, as an integrated system.

Frequently asked questions

What is context engineering in simple terms?
Context engineering is the practice of deciding what information enters a language model's context window at inference time: which documents are retrieved, which tools are exposed, which rules apply, and how conversation history is managed. It goes beyond writing good prompts to designing the entire information environment the model operates in.
How much of the CCA-F exam covers prompt engineering versus context management?
The Prompt Engineering & Structured Output domain carries 20% of the CCA-F exam weight. Context Management & Reliability carries 15%, and Agentic Architecture & Orchestration carries 27%. Together, those two domains, the ones most directly concerned with what enters the context window, account for 42% of the exam, roughly double the prompt engineering share.
What is the attention dilution problem in Claude agents?
Attention dilution occurs when a growing context window causes the model to give weaker effective attention to any individual instruction. As tool results, retrieved documents, and conversation history accumulate, earlier or less prominent instructions become less reliably followed. The fix is disciplined context trimming, summarisation, and on-demand retrieval rather than pre-loading.
When should I use a hook instead of a prompt instruction for enforcing a rule?
Use a programmatic hook whenever a rule must hold without exception, such as compliance logging, PII redaction, or cost caps. Prompt instructions are probabilistic; the model may not follow them in every case. Hooks execute deterministically regardless of model output. The CCA-F exam consistently rewards this distinction for high-stakes enforcement scenarios.
How do I prevent tool misrouting in a multi-tool Claude agent?
Write tool descriptions that state both what the tool does and what it does not do. Avoid overlapping descriptions between tools. Limit the number of tools in scope for any given task or subagent role; beyond roughly ten to fifteen tools, selection accuracy degrades. Use the tool_choice parameter to constrain selection when the correct tool is known in advance.
How does AI Skill Certs help with context engineering and prompt engineering exam prep?
AI Skill Certs is an independent adaptive prep platform for the CCA-F exam. It covers 174 atomic concepts mapped to all five exam domains, including Prompt Engineering & Structured Output and Context Management & Reliability. The platform uses Bayesian Knowledge Tracing with a 0.90 mastery threshold and a Socratic tutor called Archie. AI Skill Certs is not affiliated with or endorsed by Anthropic.

People also ask

What is the difference between prompt engineering and RAG?
Prompt engineering shapes the instructions and format of a model's input. RAG (Retrieval-Augmented Generation) is a technique for injecting relevant external documents into the context at request time. They are complementary: RAG determines what knowledge the model sees; prompt engineering determines how the model is told to use it.
What is context engineering for AI agents?
Context engineering for AI agents is the discipline of assembling the right information, tools, and rules into the model's context window at the right time. It covers retrieval strategy, tool scoping, conversation history management, session summarisation, and instruction priority ordering, all aimed at keeping agent behaviour reliable as task complexity grows.
How do you manage context in a long-running Claude agent?
Manage long-running agent context by summarising and compacting conversation history before it bloats the window, retrieving documents on demand rather than pre-loading them, forking sessions when tasks diverge significantly, and injecting fresh state summaries at the start of new sessions. Avoid the progressive summarisation trap, where repeated compression loses critical detail.
How many tools can Claude handle before selection accuracy drops?
Selection accuracy degrades noticeably beyond roughly ten to fifteen tools in scope simultaneously. The recommended fix is to scope tool exposure to the current task or subagent role, exposing only the tools relevant to that specific step rather than the full catalogue. This is a core principle in the CCA-F Tool Design & MCP Integration domain.
Is prompt engineering still relevant in 2026?
Yes. Prompt engineering remains essential for defining model role, output format, and behavioural rules. What has changed is that it is now understood as one layer within the broader discipline of context engineering. Writing a good system prompt matters; so does everything else that surrounds it in the context window.

About the author

Solomon Udoh

AI Architect & Certification Lead

Solomon Udoh is an AI Architect who designs and ships production agent systems on the Claude API and Claude Code. He built AI Skill Certs' adaptive engine and authored its 174-concept knowledge graph, mapping every Claude Certified Architect - Foundations objective to hands-on, exam-aligned practice.

  • Designs production multi-agent systems on the Claude API and Agent SDK
  • Author of the AI Skill Certs knowledge graph (174 mapped exam concepts)
  • Builds with MCP, Claude Code, structured outputs, and agentic loops daily
  • Reviews every concept page against the official Anthropic exam guide
