Crash Recovery with a Manifest

In short: Crash recovery with a manifest is a pattern where each agent exports its structured state to a known file location, and on resume a coordinator loads those manifests and injects them back into the agents prompts. Because progress lives on disk rather than only in context, a crash or context reset loses little instead of everything.

What crash recovery with a manifest protects against

Crash recovery with a manifest answers a brutal question: if your long-running, multi-agent job dies at hour three, how much do you lose? Without persisted state the answer is everything, because all the progress lived in context windows that no longer exist. With a manifest the answer is almost nothing, because each agent has been continuously exporting its structured state to a file, and a coordinator can reload that file and pick up where the agent left off. This knowledge point turns a catastrophic failure into a recoverable one.

The pattern has three moving parts. Each agent exports structured state, what it has completed, what it decided, what remains, to a known file location, the manifest. On resume, the coordinator loads each manifest and injects the recovered state back into the relevant agent's prompt. And because the state is written continuously rather than only at the end, an interruption at any point leaves behind a usable checkpoint. It is the natural extension of the scratchpad file you met earlier, the same write-to-disk instinct, but now structured for recovery and coordinated across multiple agents.

Manifest: A structured state file an agent writes to a known location, recording its progress, decisions, and remaining work. On resume, a coordinator reads the manifest and re-injects its contents so the agent continues rather than restarting from zero.

Why context-only state is a trap

The reason this matters is that a context window is volatile. It exists only as long as the session does, and it can be reset at any moment, by a crash, a timeout, a deliberate restart, or a context overflow. Anthropic's memory-tool guidance bakes this assumption into the system prompt it ships: "ASSUME INTERRUPTION: Your context window might be reset at any moment, so you risk losing any progress that is not recorded in your memory directory." The instruction tells the agent to "record status / progress / thoughts" continuously, precisely so that an interruption is survivable. That single directive is the philosophy of crash recovery in one sentence.

Anthropic's memory tool also describes the concrete recovery shape under its multi-session pattern. An initialiser session sets up the state artefacts, a progress log of what has been done and what comes next, a checklist defining scope, before substantive work begins. Each subsequent session opens by reading those artefacts, which "recovers the full state of the project in seconds, without needing to re-explore the codebase or retrace earlier decisions." And before a session ends, it updates the log. Read that as a manifest protocol: bootstrap the state file, resume from it, keep it current.

Manifest-based crash recovery across agents

Loading diagram...

State flows continuously to disk during work; on crash the coordinator reloads the manifests and re-injects them, so the agents resume instead of restarting.

How a manifest differs from a plain scratchpad

A scratchpad and a manifest both write to disk, but the manifest adds two things a casual note lacks: structure and a recovery protocol. The structure means the state is organised for reloading, typically progress, decisions, and remaining work in a known schema and a known location, so a coordinator can find and parse it deterministically. The protocol means there is an agreed sequence: initialise the manifest, update it as work proceeds, and read it on resume. A scratchpad answers "what did I find?"; a manifest answers "where exactly am I, and how do I continue from here if everything else is gone?"

This is also why crash recovery is a multi-agent concern in the exam framing. When a single agent crashes, the loss is bounded. When a coordinator orchestrating several subagents crashes, the loss can be enormous unless each agent has its own manifest the coordinator can re-gather. Anthropic's long-running-agents guidance describes exactly this kind of infrastructure-as-state: agents maintain recovery artefacts such as a progress log and an initial git commit, and on resume "every coding agent is prompted to run through a series of steps to get its bearings," reading those files before continuing. The manifest is what makes a fleet of agents restartable rather than disposable.

Worked example

An overnight job runs a coordinator and three subagents auditing a large codebase for security issues; at 3 a.m. the machine reboots.

Designed naively, the reboot is a disaster: three subagents had each read hundreds of files and accumulated findings purely in context, and all of it vanishes with their windows. The next morning the team finds an empty result and a wasted night, and the audit must start over from zero.

Designed with manifests, the reboot is a speed bump. Throughout the night, each subagent exported its state to a known file, audit/auth.manifest, audit/data.manifest, audit/api.manifest, recording which files it had reviewed, the issues it had confirmed, and what remained. The coordinator kept its own manifest of overall progress. On restart, the coordinator loads the four manifests and injects each subagent's recovered state into its prompt. The auth subagent resumes at file 213 of 400 with its 9 confirmed findings intact; the others do the same. The night's work is preserved, and only the in-flight step is repeated.

The exam trap is the naive design: not persisting agent state, and thereby losing all progress on a crash. The correct apply-level response is to have each agent export structured state continuously to a manifest and to define a resume protocol that reloads and re-injects it. Note the dependency on the scratchpad skill, a manifest is a scratchpad with structure and a recovery contract, and the link to summary injection, since a manifest typically stores summarised state rather than full transcripts.

What belongs in a manifest

A manifest is only useful if its contents are sufficient to resume from, so its schema is part of the design. At minimum it records four things: what has been completed, the decisions taken along the way, the current position in the task, and what still remains. Anthropic's multi-session pattern describes exactly this shape, a progress log of what has been done and what comes next, plus a checklist defining the scope of work, bootstrapped by an initialiser session before substantive work begins. The harness guidance adds concrete artefacts from real long-running agents: a claude-progress.txt log, a feature list, an init.sh startup script, and an initial git commit, with git history itself acting as a recovery surface for reverting to a known-good state.

Two design rules keep a manifest trustworthy. First, store summarised state, not raw transcripts: a manifest that pastes whole file reads is slow to reload and recreates the bulk problem, so it should hold distilled conclusions and pointers to scratchpads where the depth lives. Second, only mark work complete after it is genuinely verified, not merely attempted. Anthropic stresses working one feature at a time and confirming it end-to-end before logging it done, because a progress log that records unfinished work as complete will resume into a false state. A manifest is a contract about reality; its value collapses the moment it starts lying about what is finished.

Designing the resume protocol

Persistence is half the pattern; the resume protocol is the other half, and the exam expects both. A manifest sitting on disk does nothing unless something reads it correctly on restart. The protocol has three steps: on resume, load every relevant manifest; inject each one back into the appropriate agent's prompt so it recovers its state; and verify that recovered state before charging ahead, because the crash may have happened mid-write or mid-action. Anthropic's harness work describes prompting every resuming agent to "run through a series of steps to get its bearings", reading its progress files and git log before continuing, which is precisely this load-inject-verify sequence.

The verification step is what separates robust recovery from optimistic recovery. If the auth subagent's manifest says it was reviewing file 213 when the machine died, the coordinator should not assume file 213 was fully processed; it should treat the last in-flight item as possibly incomplete and re-run it. This is why state is written continuously rather than only at clean checkpoints: the more recent the last durable write, the smaller the window of work that must be repeated. A well-designed protocol therefore minimises both loss (through frequent writes) and double-work (through verifying the boundary), and it does so deterministically, so recovery is a procedure rather than a guess.

How this is tested on the exam

Task 5.4 questions in this area describe a long-running or multi-agent job exposed to interruption and ask how to make it survivable. The answer is to persist structured state to a manifest and define a resume protocol that reloads and re-injects it. Strong answers stress that the state must be written continuously, not just at the end, because the interruption is unpredictable, and that recovery is the coordinator's job: gather the manifests, inject them back, resume.

The distractors usually present hopeful non-solutions: trusting that the session will not crash, relying on a larger context window to somehow retain state, or trying to reconstruct progress by re-exploring the codebase after the fact. None of these recover the decisions and confirmed findings that only existed in the lost windows. Recognising that durability requires writing state to a known location before the crash, and that resume means reloading it, is the judgement this knowledge point assesses. It is also a key input to the broader strategy-selection decision: manifests are the tool you reach for specifically when a job is long-running enough that a crash is a real risk.

Misconception

Crash recovery is handled automatically because the model can just reconstruct its progress from the codebase when it restarts.

What's actually true

Re-reading the codebase cannot recover the decisions, confirmed findings, and position-in-task that lived only in the lost context. Those must be exported to a manifest before the crash; resume then reloads them. Reconstruction from source is not recovery of state.

Misconception

A manifest only needs to be written when the agent finishes, as a final record of what it did.

What's actually true

An interruption is unpredictable, so a manifest written only at the end protects nothing if the crash happens mid-task. State must be recorded continuously as work proceeds, which is exactly why Anthropic's memory guidance says to assume interruption at any moment.

Check your understanding

An overnight job runs a coordinator plus three subagents auditing a large codebase, each holding findings only in context. The team wants the night's progress to survive an unexpected reboot. What is the correct design?

Watch and learn

Official Anthropic Academy lessons first, then hand-picked walkthroughs. Videos load only when you press play.

No videos curated for this concept yet

We are still curating the best official and community videos for this topic.

References & primary sources

Adaptive study

Master this concept with Archie

Practice it inside an adaptive study session. Archie, your Socratic AI tutor, tracks your mastery with Bayesian Knowledge Tracing and schedules the perfect next review.

Start studying

Crash Recovery with a Manifest for Multi-Agent Systems