Architecture · 7 min read · 25 June 2026

Claude Haiku vs Sonnet vs Opus: Pick the Right Model

A practical guide to choosing between Claude Haiku, Sonnet, and Opus for engineers building agents: capability tiers, cost trade-offs, and routing logic for production MCP workflows.

By Solomon Udoh · AI Architect & Certification Lead


Choosing between Claude Haiku, Sonnet, and Opus is not a question of prestige; it is a FinOps decision that compounds across every agentic loop you run. Get it wrong and you will either overspend on capability you do not need or under-provision on tasks where accuracy failures cost more than the token savings. This guide gives engineers a framework for routing intelligently from day one.

What are the three Claude model tiers and what does each do well?

Anthropic names its models in three tiers: Haiku (fastest, cheapest), Sonnet (balanced), and Opus (most capable, most expensive). The tiers have been in use since the Claude 3 family, although not every generation ships all three. Each tier is designed for a distinct cost-capability band, and Anthropic updates models within each tier over time, so the specific version you call matters less than understanding the tier's role in your architecture.

Tier	Primary strength	Typical use case
Haiku	Low latency, low cost	Classification, routing, short extraction
Sonnet	Balanced reasoning and cost	Code generation, summarisation, tool use
Opus	Highest reasoning depth	Complex multi-step planning, nuanced judgment

Models within each tier share a design philosophy: Haiku optimises for throughput, Sonnet for versatility, Opus for depth.

Anthropic, Claude model overview

The practical implication: a well-designed agent system rarely uses a single tier. It routes tasks to the cheapest model that can handle them reliably.

How do token costs differ across Haiku, Sonnet, and Opus?

Anthropic publishes per-token pricing on its pricing page. The ratios between tiers matter more than the absolute numbers, because those ratios determine the ROI of intelligent routing.

As a rule of thumb, Opus input tokens cost roughly 15x as much as Haiku input tokens, and Sonnet sits at roughly 3x Haiku; the exact ratios differ between model generations, so confirm them on the pricing page for the versions you call. Output tokens cost more than input tokens in every tier. At these ratios, a single Opus call that Haiku could have handled costs roughly an order of magnitude more than it needed to.

For agentic loops specifically, the cost gap widens in absolute terms. Agentic loops consume 4 to 15 times more tokens than single-turn chat interactions because each iteration re-sends tool results, conversation history, and system prompts. The per-token ratio between tiers does not change (at the rule-of-thumb prices above, Opus costs about 5x Sonnet), but at 10x token amplification that 5x premium is paid on ten times as many tokens, so a misrouted loop wastes ten times the money a misrouted single call would.
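The arithmetic is easy to check with a few lines of code. This is a sketch, assuming the rule-of-thumb input-price ratios above (Haiku 1, Sonnet 3, Opus 15, in relative units per token), a hypothetical 2,000-token single-turn task, and input tokens only:

```python
# Rule-of-thumb input-price ratios relative to Haiku (assumption; check the pricing page)
PRICE_RATIO = {"haiku": 1, "sonnet": 3, "opus": 15}


def task_cost(tier: str, base_tokens: int, amplification: float = 1.0) -> float:
    """Relative input cost of one task: tokens consumed times the tier's price ratio."""
    return base_tokens * amplification * PRICE_RATIO[tier]


single_turn = 2_000  # hypothetical tokens for one chat turn
for amp in (1, 10):
    sonnet = task_cost("sonnet", single_turn, amp)
    opus = task_cost("opus", single_turn, amp)
    print(f"{amp:>2}x amplification: Opus/Sonnet = {opus / sonnet:.0f}x, "
          f"wasted units = {opus - sonnet:,.0f}")
```

The Opus-to-Sonnet ratio stays at 5x in both cases; what grows with amplification is the absolute amount wasted per task.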

The CCA-F exam weights Domain 1 (Agentic Architecture and Orchestration) at 27%, the largest single domain. Cost-aware routing is a core competency tested there.

What tasks belong to each model tier in a production agent?

When should you use Haiku?

Use Haiku for any task where the input is short, the output is structured, and correctness can be validated programmatically. Good candidates:

Intent classification ("is this a billing query or a technical query?")
Slot extraction from a user utterance
Routing decisions in a hub-and-spoke orchestrator
Generating short, templated responses where a schema enforces correctness

Because Haiku is fast, it also works well as a pre-filter before an expensive Opus call. If Haiku can answer with high confidence, you never pay for Opus.
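One way to wire up the pre-filter is a two-stage cascade: accept Haiku's answer only when it passes a programmatic check, and escalate otherwise. The sketch below keeps the model calls behind plain callables so the control flow is visible; `call_haiku` and `call_opus` are hypothetical wrappers around `client.messages.create`, and the intent label set is an assumed example.

```python
import json
from typing import Callable, Optional

ALLOWED_INTENTS = {"billing", "technical", "account"}  # assumed label set


def parse_intent(raw: str) -> Optional[str]:
    """Return the intent if the output is valid JSON with an allowed string label."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    intent = data.get("intent") if isinstance(data, dict) else None
    return intent if isinstance(intent, str) and intent in ALLOWED_INTENTS else None


def classify_with_escalation(
    utterance: str,
    call_haiku: Callable[[str], str],  # hypothetical wrapper around a Haiku call
    call_opus: Callable[[str], str],   # hypothetical wrapper around an Opus call
) -> tuple[str, str]:
    """Try Haiku first; pay for Opus only when Haiku's output fails validation."""
    intent = parse_intent(call_haiku(utterance))
    if intent is not None:
        return intent, "haiku"
    intent = parse_intent(call_opus(utterance))
    return (intent or "unknown"), "opus"
```

Because the check is deterministic, you can log how often escalation fires and use that rate to decide whether the task really belongs on Haiku.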

When should you use Sonnet?

Sonnet is the workhorse tier for most production workloads. It handles multi-step reasoning, code generation, and tool-use chains without the latency or cost of Opus. In MCP integrations, Sonnet is usually the right default for tool-calling agents because it balances schema adherence with reasoning depth.

Sonnet is also the sensible choice for summarisation pipelines, document Q&A, and most prompt engineering tasks where you need nuanced output but not frontier-level reasoning.

When should you use Opus?

Opus earns its cost premium on tasks where:

The reasoning chain is long and interdependent (planning a multi-week project, auditing a complex codebase).
Errors are expensive to recover from (financial decisions, legal document review).
You need the model to catch subtle contradictions or edge cases that Sonnet misses.

The CCA-F exam consistently rewards deterministic, proportionate solutions. Routing every task to Opus because it is "safer" is not proportionate; it is a FinOps anti-pattern. Reserve Opus for the tasks where its incremental accuracy gain justifies the cost.

How do you build a cost-aware routing layer in practice?

A routing layer is a lightweight orchestrator that inspects each task and assigns it to the cheapest model tier expected to handle it reliably. Here is a minimal pattern:

python

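import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ROUTING_SYSTEM = """
You are a task router. Classify the incoming task as one of:
- haiku: short extraction, classification, or templated response
- sonnet: multi-step reasoning, code, tool use, summarisation
- opus: complex planning, high-stakes judgment, long reasoning chains

Respond with only a JSON object: {"tier": "<haiku|sonnet|opus>", "reason": "<one sentence>"}
"""

MODEL_MAP = {
    "haiku": "claude-haiku-4-5",
    "sonnet": "claude-sonnet-4-5",
    "opus": "claude-opus-4-5",
}

DEFAULT_TIER = "sonnet"  # used when the router's reply cannot be parsed or validated

def route_task(task_description: str) -> dict:
    response = client.messages.create(
        model=MODEL_MAP["haiku"],  # Use Haiku to route; never pay Opus to decide
        max_tokens=128,  # 64 can truncate the JSON once the "reason" sentence is added
        system=ROUTING_SYSTEM,
        messages=[{"role": "user", "content": task_description}],
    )
    try:
        routing = json.loads(response.content[0].text)
    except (IndexError, AttributeError, json.JSONDecodeError):
        routing = None
    # Validate before trusting the router: missing or unknown tiers fall back to the default
    tier = routing.get("tier") if isinstance(routing, dict) else None
    if not (isinstance(tier, str) and tier in MODEL_MAP):
        routing = {"tier": DEFAULT_TIER, "reason": "router output invalid; using default"}
    return routing

def execute_task(task_description: str, task_payload: str) -> str:
    routing = route_task(task_description)
    model = MODEL_MAP[routing["tier"]]  # always a valid key after validation
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": task_payload}],
    )
    return response.content[0].text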

Key point: the router itself runs on Haiku. You never pay Opus to decide whether to use Opus.

For context management in long-running agents, pair this router with a summarisation step so that the context passed to Opus is as compact as possible. The savings from routing disappear if every Opus call carries 50,000 tokens of stale history.
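A minimal sketch of that compaction step, assuming you keep the most recent turns verbatim and fold older turns into a summary that you pass in the system prompt. `summarise` is a hypothetical callable (for example, a thin wrapper around a Haiku call), and `keep_last=4` is an arbitrary default; the retained tail is widened, if needed, so it still opens with a user turn.

```python
from typing import Callable

Message = dict  # {"role": "user" | "assistant", "content": str}


def compact_history(
    messages: list[Message],
    summarise: Callable[[str], str],  # hypothetical, e.g. a Haiku summarisation call
    keep_last: int = 4,
) -> tuple[str, list[Message]]:
    """Return (summary of older turns, most recent turns kept verbatim)."""
    if len(messages) <= keep_last:
        return "", list(messages)
    cut = len(messages) - keep_last
    # Widen the tail if needed so the retained history opens with a user turn
    while cut > 0 and messages[cut]["role"] != "user":
        cut -= 1
    if cut == 0:
        return "", list(messages)
    older, recent = messages[:cut], messages[cut:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    return summarise(transcript), recent


history = [
    {"role": "user", "content": "Plan the Q3 migration."},
    {"role": "assistant", "content": "Step 1: inventory services..."},
    {"role": "user", "content": "Add a rollback plan."},
    {"role": "assistant", "content": "Rollback: keep the old cluster warm..."},
    {"role": "user", "content": "Now estimate effort."},
]
summary, recent = compact_history(
    history,
    summarise=lambda t: f"[{len(t.splitlines())} earlier turns summarised]",
    keep_last=2,
)
```

The Opus call then receives `recent` as its messages and the summary appended to its system prompt, so its input stays bounded however long the loop runs.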

How does model selection interact with MCP tool design?

Tool calling amplifies cost differences between tiers because each tool call adds a round-trip: the model emits a tool-use block, your server executes the tool, and the result re-enters the context. In a chain of five sequential tool calls, you pay for six model calls (five that request a tool plus one that writes the final answer), each re-reading a longer context.
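A back-of-the-envelope calculation shows how quickly the context grows. The numbers are hypothetical: a 2,000-token base context (system prompt, tool definitions, user request) and about 600 tokens added per step for each tool-use block and its result.

```python
def chain_input_tokens(base: int, per_step: int, n_tools: int) -> int:
    """Total input tokens billed across a sequential tool chain.

    There are n_tools + 1 model calls: one per tool-use turn plus a final answer.
    Call k (0-indexed) re-reads the base context plus k earlier tool round-trips.
    """
    return sum(base + k * per_step for k in range(n_tools + 1))


print(chain_input_tokens(2_000, 600, 0))  # a single call with no tools
print(chain_input_tokens(2_000, 600, 5))  # six calls over a growing context
```

Under these assumptions the five-tool chain bills 21,000 input tokens against 2,000 for a single call, roughly ten times as much before any output is counted, and that multiple is paid at whichever tier's rate you chose.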

The implication for tool design: narrow, well-described tools reduce the number of round-trips required. A tool that does one thing precisely lets Sonnet succeed where a vague, multi-purpose tool might require Opus to reason through ambiguity.

Concretely:

json

{
  "name": "get_invoice_status",
  "description": "Returns the payment status of a single invoice by invoice_id. Returns one of: paid, pending, overdue, void. Do NOT use for bulk invoice queries.",
  "input_schema": {
    "type": "object",
    "properties": {
      "invoice_id": {"type": "string", "description": "The unique invoice identifier, e.g. INV-20240312-001"}
    },
    "required": ["invoice_id"]
  }
}

A description this precise lets Sonnet select the right tool on the first attempt. Ambiguous descriptions force the model to hedge, sometimes calling the wrong tool and burning an extra round-trip at full context cost.
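A minimal sketch of the round-trip described earlier, using this tool with Sonnet through the Messages API. It assumes `client` is an `anthropic.Anthropic()` instance; `lookup_invoice_status` and its in-memory ledger are hypothetical stand-ins for your billing backend, and the cap of five round-trips is an arbitrary safety limit.

```python
INVOICE_TOOL = {
    "name": "get_invoice_status",
    "description": (
        "Returns the payment status of a single invoice by invoice_id. "
        "Returns one of: paid, pending, overdue, void. Do NOT use for bulk invoice queries."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string", "description": "The unique invoice identifier, e.g. INV-20240312-001"}
        },
        "required": ["invoice_id"],
    },
}

_LEDGER = {"INV-20240312-001": "paid"}  # hypothetical data standing in for a billing system


def lookup_invoice_status(invoice_id: str) -> str:
    return _LEDGER.get(invoice_id, "error: invoice not found")


def run_tool(name: str, tool_input: dict) -> str:
    if name == "get_invoice_status" and "invoice_id" in tool_input:
        return lookup_invoice_status(tool_input["invoice_id"])
    return f"error: unsupported tool call {name}"


def answer_with_tool(client, question: str, model: str = "claude-sonnet-4-5",
                     max_round_trips: int = 5) -> str:
    """client: an anthropic.Anthropic() instance (or anything with .messages.create)."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_round_trips):  # cap round-trips so a confused model cannot loop forever
        response = client.messages.create(
            model=model, max_tokens=512, tools=[INVOICE_TOOL], messages=messages
        )
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        # Echo the assistant turn, then answer every tool_use block with a tool_result
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": b.id, "content": run_tool(b.name, b.input)}
            for b in response.content if b.type == "tool_use"
        ]})
    raise RuntimeError("tool loop did not finish within the round-trip cap")
```

Every pass through the loop is a billed model call that re-sends the whole `messages` list, which is why a tool the model picks correctly on the first attempt saves real money.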

What does the CCA-F exam test about model selection?

The CCA-F exam covers model selection implicitly across several domains. Domain 4 (Prompt Engineering and Structured Output, 20%) tests whether you can design prompts that work reliably at a given capability tier. Domain 5 (Context Management and Reliability, 15%) tests whether you understand how context length and model tier interact.

Exam scenarios typically present a production constraint (latency budget, cost ceiling, accuracy requirement) and ask you to identify the appropriate model tier and architecture. The exam rewards proportionate solutions: if a task can be solved reliably with Haiku, choosing Opus is wrong even if Opus would also work.

The exam consistently rewards deterministic solutions over probabilistic ones when stakes are high, proportionate fixes, and root-cause tracing.

AI Skill Certs, CCA-F Facts Block (verified 11 June 2026)

Our concept library maps 174 atomic concepts to the five exam domains, including model selection trade-offs, routing patterns, and context management strategies. AI Skill Certs is independent of Anthropic and not endorsed by Anthropic.

How should you instrument cost observability in an agent?

Routing to the right tier is step one. Knowing whether your routing is working is step two. Build cost observability into your orchestrator from the start:

python

import anthropic
from dataclasses import dataclass

@dataclass
class AgentCallRecord:
    task_id: str
    model: str
    input_tokens: int
    output_tokens: int
    tier: str
    routed_by: str = "haiku-router"

def tracked_call(
    client: anthropic.Anthropic,
    task_id: str,
    model: str,
    tier: str,
    system: str,
    user_message: str,
    max_tokens: int = 1024,
) -> tuple[str, AgentCallRecord]:
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": user_message}]
    )
    record = AgentCallRecord(
        task_id=task_id,
        model=model,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        tier=tier,
    )
    return response.content[0].text, record

Aggregate AgentCallRecord objects per workflow run. If your Opus spend exceeds a threshold you set in advance, that is a signal your routing logic is misclassifying tasks, not a signal to raise the budget.
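A sketch of that aggregation and alert, assuming hypothetical relative prices (output priced at 5x input in each tier, tiers at the rule-of-thumb 1:3:15 ratios) and a hypothetical 20% ceiling on Opus's share of a run's spend. The record type is redeclared so the block runs on its own.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentCallRecord:
    # Same fields as the record above, redeclared so this sketch is self-contained
    task_id: str
    model: str
    input_tokens: int
    output_tokens: int
    tier: str
    routed_by: str = "haiku-router"


# Hypothetical relative prices per token as (input, output); substitute published rates
PRICES = {"haiku": (1, 5), "sonnet": (3, 15), "opus": (15, 75)}


def spend_by_tier(records: list[AgentCallRecord]) -> dict[str, float]:
    """Relative spend per tier for one workflow run."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        in_price, out_price = PRICES[r.tier]
        totals[r.tier] += r.input_tokens * in_price + r.output_tokens * out_price
    return dict(totals)


def opus_share_exceeded(records: list[AgentCallRecord], max_share: float = 0.2) -> bool:
    """True when Opus accounts for more than max_share (hypothetical ceiling) of spend."""
    totals = spend_by_tier(records)
    total = sum(totals.values())
    return total > 0 and totals.get("opus", 0.0) / total > max_share
```

Setting `max_share` before the run starts is what makes the alert meaningful: a breach points you at the router's misclassifications rather than at the budget.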

The agentic architecture domain of the CCA-F covers orchestrator design patterns in depth, including how coordinators should handle cost signals and when to escalate versus retry.

What is the right default model for a new agent project?

Start with Sonnet as your default. It handles the majority of production workloads reliably, and its cost is low enough that early-stage token waste is not catastrophic. As you instrument your agent and gather real task distributions, you will find a subset of tasks that Haiku handles correctly (move those down) and a subset where Sonnet fails in ways that matter (move those up to Opus).
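Once you have validation results per task type, moving tasks down or up can be a simple rule: pick the cheapest tier whose observed pass rate clears a bar you set. The sketch below assumes you have recorded pass/fail counts per tier from programmatic validation; the 95% bar, the 50-sample minimum, and the sample counts are all hypothetical.

```python
TIER_ORDER = ["haiku", "sonnet", "opus"]  # cheapest first


def cheapest_reliable_tier(
    results: dict[str, tuple[int, int]],  # tier -> (passed, attempted)
    min_pass_rate: float = 0.95,          # hypothetical quality bar
    min_samples: int = 50,                # hypothetical minimum evidence
    default: str = "sonnet",
) -> str:
    """Cheapest tier with enough samples and a pass rate at or above the bar."""
    for tier in TIER_ORDER:
        passed, attempted = results.get(tier, (0, 0))
        if attempted >= min_samples and passed / attempted >= min_pass_rate:
            return tier
    return default  # no tier has cleared the bar yet: stay on the default


# Hypothetical validation counts for two task types
invoice_extraction = {"haiku": (196, 200), "sonnet": (199, 200)}
contract_review = {"haiku": (40, 60), "sonnet": (52, 60), "opus": (58, 60)}

print(cheapest_reliable_tier(invoice_extraction))  # haiku
print(cheapest_reliable_tier(contract_review))     # opus
```

The minimum-sample check keeps a handful of lucky Haiku runs from demoting a task before you have real evidence.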

This iterative approach is more reliable than guessing upfront. It also gives you empirical data to justify model spend to stakeholders, which matters when you are scaling from pilot to production.

As of 3 June 2026, over 10,000 individuals have earned the CCA-F certification, and the exam tests exactly this kind of production-grade, cost-proportionate thinking. If you are preparing for the exam, the model selection trade-offs covered here map directly to Domain 1, Domain 4, and Domain 5 scenario questions.

Frequently asked questions

Can I switch between Haiku, Sonnet, and Opus mid-conversation in an agentic loop?

Yes. Each call to the Messages API specifies a model independently, so your orchestrator can route individual turns to different tiers. The conversation history you pass in the messages array is model-agnostic. The main risk is consistency: if Opus planned a task and Haiku executes a step, ensure the executing model receives enough context to follow the plan faithfully.

Does prompt caching work across all three Claude model tiers?

Anthropic's prompt caching feature is available on supported Claude models, including models in the Haiku, Sonnet, and Opus families. Check the current model documentation on anthropic.com for the exact list of cache-eligible models, as availability changes with model releases. Caching is especially valuable in agentic loops where the same system prompt is re-sent on every iteration.
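A minimal sketch of marking a reused system prompt as cacheable with a `cache_control` block. `build_cached_request` is a hypothetical helper that only assembles keyword arguments for `client.messages.create`; it assumes the system prompt is long enough to meet the model's minimum cacheable length, which you should confirm in the caching documentation.

```python
def build_cached_request(model: str, system_prompt: str, messages: list[dict],
                         max_tokens: int = 1024) -> dict:
    """Keyword arguments for client.messages.create with the system prompt cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the prompt prefix up to this block for reuse on later iterations
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }


# Usage (requires an API key):
#   response = client.messages.create(**build_cached_request("claude-haiku-4-5", SYSTEM, msgs))
#   response.usage.cache_creation_input_tokens  -> tokens written to the cache
#   response.usage.cache_read_input_tokens      -> tokens served from the cache
```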

How does the CCA-F exam test knowledge of model selection between Haiku, Sonnet, and Opus?

The exam presents scenario-based questions with a production constraint (latency, cost, accuracy) and asks you to identify the appropriate architecture. Model selection appears most often in Domain 1 (Agentic Architecture, 27%) and Domain 4 (Prompt Engineering, 20%). The exam rewards proportionate choices: selecting Opus for a task Haiku can handle reliably is marked incorrect.

Is there a latency difference between Haiku, Sonnet, and Opus that matters for real-time applications?

Yes, meaningfully so. Haiku is designed for low-latency use cases and typically returns first tokens significantly faster than Opus. For user-facing applications with a response-time SLA under two seconds, Haiku or Sonnet are the practical options. Opus is better suited to batch or background tasks where latency is less critical than reasoning depth.
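To check a latency SLA against your own traffic, measure time to first token rather than relying on general claims. The sketch below times any iterable of streamed text chunks; the commented usage assumes the Anthropic Python SDK's `client.messages.stream(...)` helper and its `text_stream` iterator, with the clock started before the request is sent.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_token(
    chunks: Iterable[str],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> tuple[Optional[float], str]:
    """Seconds from `start` until the first non-empty chunk, plus the full text."""
    first: Optional[float] = None
    parts = []
    for chunk in chunks:
        if chunk and first is None:
            first = clock() - start
        parts.append(chunk)
    return first, "".join(parts)


# Usage (requires an API key):
#   start = time.perf_counter()
#   with client.messages.stream(model="claude-haiku-4-5", max_tokens=256,
#                               messages=[{"role": "user", "content": "Hi"}]) as stream:
#       ttft, text = time_to_first_token(stream.text_stream, start)
```

Run the same prompt set against each tier and compare the distribution of `ttft` values, not a single sample, before committing to a tier for a user-facing path.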

Should I use Opus for all high-stakes production decisions to minimise risk?

Not automatically. The CCA-F exam and Anthropic's own guidance both emphasise proportionate solutions. For high-stakes decisions, the right answer is often a combination: Haiku or Sonnet for initial processing, with a programmatic validation layer and a targeted Opus call only when the validation fails or confidence is below a threshold. Using Opus universally inflates cost without a commensurate reliability gain.

How do I measure whether my model routing is actually saving money?

Instrument every API call with the model name, input token count, and output token count from the response usage field. Aggregate by tier per workflow run. Compare actual spend against a baseline of running every task on Sonnet. If your Haiku-routed tasks are completing correctly (validate outputs programmatically), the delta is your routing saving. Review Opus spend weekly; unexpected spikes indicate routing misclassification.
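A sketch of the baseline comparison, assuming hypothetical relative prices per token (the same rule-of-thumb ratios, with output at 5x input) and that each call would have used the same token counts on Sonnet, which is an approximation.

```python
# Hypothetical relative prices per token as (input, output); substitute published rates
PRICES = {"haiku": (1, 5), "sonnet": (3, 15), "opus": (15, 75)}


def call_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[tier]
    return input_tokens * in_price + output_tokens * out_price


def routing_savings(calls: list[tuple[str, int, int]]) -> float:
    """Baseline spend (every call on Sonnet) minus actual routed spend.

    calls holds (tier, input_tokens, output_tokens) taken from response.usage.
    """
    actual = sum(call_cost(tier, i, o) for tier, i, o in calls)
    baseline = sum(call_cost("sonnet", i, o) for _, i, o in calls)
    return baseline - actual


mostly_haiku = [("haiku", 1000, 100)] * 4
with_escalation = [("haiku", 1000, 100), ("haiku", 1000, 100), ("opus", 1000, 100)]
print(routing_savings(mostly_haiku))     # positive: routing is saving money
print(routing_savings(with_escalation))  # negative: Opus escalations outweigh the savings
```

A negative result is the signal to inspect which tasks are being escalated, which is the same misclassification check described for the weekly Opus review.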