In short
Schema design to prevent fabrication means shaping an extraction schema so the model can faithfully report missing or ambiguous information instead of inventing it. The core moves are making fields nullable when the source may omit them, adding an unclear enum value for ambiguous cases, and offering an other option with a freeform detail string for categories you did not anticipate.
What it means to prevent fabrication with JSON schema design
To prevent fabrication with JSON schema design is to shape your extraction contract so that telling the truth is always a legal move for the model. Fabrication, in the structured-output sense, is the model inventing a value for a field the source never supplied. It is not the model being dishonest; it is the model obeying a contract you wrote. If you declare a field required, you have told the API that a valid response must contain it, and when the document is silent the model resolves the conflict the only way the schema allows, by producing a confident guess.
That reframing is the whole knowledge point. The lever is not a sterner instruction like 'do not make things up', which competes weakly against a hard schema requirement. The lever is the schema itself. Industry guidance on structured extraction is blunt about this: if a field is marked required but the information is missing from the source, the model may hallucinate a value to satisfy the constraint, and the remedy is to allow null. You design the gap into the contract so the model never has to choose between the schema and the truth.
- Fabrication-resistant schema: an extraction schema whose field requirements, nullability, and enum values let the model faithfully represent missing or ambiguous source data, removing the structural pressure that otherwise pushes it to invent values for fields it must fill.
Schema patterns that prevent fabrication with JSON schema choices
Three concrete patterns do most of the work, and the exam expects you to reach for them by name.
- Nullable or optional fields for sometimes-absent data. Any field that does not appear in every document should accept null. A union type such as string-or-null gives the model an honest placeholder for 'I did not find this', which is precisely the value you want stored rather than a fabricated one. Reserve required for fields that genuinely appear every time.
- An unclear enum value for ambiguous classification. When a field is a category drawn from an enum, a document will eventually arrive that fits none of your values, or fits two. Without an escape hatch the model must pick the closest wrong answer. Adding unclear as an allowed enum value lets it flag the ambiguity instead of inventing certainty, and your pipeline can route those records to review.
- An other option plus a freeform detail string. Rigid category sets lose information at the edges. Pairing an other enum value with a short freeform other_detail string preserves the edge case rather than discarding it, so you learn what your taxonomy is missing instead of silently mislabelling it.
A fourth, quieter pattern matters too: many strict-schema setups also disallow unexpected keys, which stops the model from inventing extra fields beyond the ones you declared. The throughline is that every one of these moves widens the set of honest responses the schema permits.
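The patterns above can be sketched together in one schema. This is a minimal illustration written as a Python dict in JSON Schema style, not a definitive design: the field names (contact_email, document_type, other_detail) and the helper honest_values are hypothetical, and keeping every key required while expressing gaps as null is an assumption about how the target API handles strict schemas.

```python
# Hypothetical extraction schema; every field name here is an illustrative assumption.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        # Pattern 1: a nullable union for data the source may omit.
        "contact_email": {"type": ["string", "null"]},
        # Patterns 2 and 3: escape-hatch enum values for ambiguity and the unforeseen.
        "document_type": {
            "type": "string",
            "enum": ["invoice", "receipt", "contract", "unclear", "other"],
        },
        # Freeform detail, meaningful only when document_type is "other".
        "other_detail": {"type": ["string", "null"]},
    },
    "required": ["contact_email", "document_type", "other_detail"],
    # Quieter fourth pattern: no keys beyond the declared ones.
    "additionalProperties": False,
}


def honest_values(field: str) -> list:
    """List the escape-hatch values a field permits (null, unclear, other)."""
    spec = EXTRACTION_SCHEMA["properties"][field]
    types = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
    hatches = [None] if "null" in types else []
    hatches += [v for v in spec.get("enum", []) if v in ("unclear", "other")]
    return hatches
```

Calling honest_values on each field is a quick way to audit a schema: a field that can be missing in the source but returns an empty list has no honest way out.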
Schema shape is not value formatting
A subtle trap is assuming the schema can also normalise the format of values. It cannot. A schema says amount is a number; it does not say whether a date should be ISO-8601 or whether a currency should be stripped of its symbol. Those are formatting decisions, and they belong in the prompt as explicit normalisation rules that sit alongside the strict schema. Telling the model to render all dates as YYYY-MM-DD and all amounts as plain decimals complements the schema rather than duplicating it, and the exam treats the schema-plus-normalisation pairing as a single competent design.
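One way to picture the pairing is a prompt fragment that carries the formatting rules plus a small post-extraction check that confirms they were followed. The sketch below assumes the YYYY-MM-DD and plain-decimal rules mentioned above; NORMALISATION_RULES, is_normalised_date, and is_plain_amount are hypothetical names, not part of any library.

```python
import re
from datetime import date

# Hypothetical prompt fragment: formatting rules sit beside the schema, not inside it.
NORMALISATION_RULES = (
    "Render all dates as YYYY-MM-DD. "
    "Render all amounts as plain decimals with no currency symbol."
)

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_normalised_date(value) -> bool:
    """True for null (an honest gap) or a real calendar date in YYYY-MM-DD form."""
    if value is None:
        return True
    if not isinstance(value, str) or not DATE_RE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2024-02-30
    except ValueError:
        return False
    return True


def is_plain_amount(value) -> bool:
    """True for null or a bare number; strings like '$12.50' fail."""
    if value is None:
        return True
    return isinstance(value, (int, float)) and not isinstance(value, bool)
```

A check like this does not replace the prompt rules; it only tells you when the model ignored them, so the record can be reprocessed or reviewed.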
Redesigning a brittle extraction schema
Worked example
A team extracts research-paper metadata into a schema where title, doi, funding_source, and study_type are all required, and study_type is an enum of randomized_trial, cohort, and case_report.
In testing the schema looks tidy and the JSON always parses, but the stored data is quietly wrong. Many papers list no DOI, yet doi is required, so the model produces realistic-looking but invented identifiers. Older papers omit a funding statement, yet funding_source is required, so the model guesses a plausible agency. And review articles, which fit none of the three study_type values, get forced into cohort because the model must choose one of the allowed values.
The redesign is mechanical once the cause is clear. doi and funding_source become nullable, so absent values come back as null and a downstream check can flag them for a human rather than trusting a fabricated string. The study_type enum gains an unclear value for genuinely ambiguous designs and an other value paired with an other_detail string for designs outside the original three, so a systematic review is labelled honestly instead of being mislabelled as a cohort. Finally a normalisation rule is added to the prompt instructing the model to return titles in their original casing, because the schema alone never governed formatting.
After the change the parse rate is unchanged, since it was already perfect, but the trustworthiness of the data jumps. Nulls now mark real gaps, unclear marks real ambiguity, and the team can measure how often each occurs. They have not made the model smarter; they have stopped forcing it to lie.
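The redesign and the measurement step can be sketched as follows. The schema mirrors the worked example, but two details are assumptions: every key stays required with gaps expressed as null, and other_detail is itself nullable. The gap_report helper and the two sample records, including the placeholder DOI and agency name, are invented purely for illustration.

```python
from collections import Counter

# Redesigned paper-metadata schema, sketched as a Python dict in JSON Schema style.
PAPER_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "doi": {"type": ["string", "null"]},
        "funding_source": {"type": ["string", "null"]},
        "study_type": {
            "type": "string",
            "enum": ["randomized_trial", "cohort", "case_report", "unclear", "other"],
        },
        "other_detail": {"type": ["string", "null"]},
    },
    "required": ["title", "doi", "funding_source", "study_type", "other_detail"],
    "additionalProperties": False,
}


def gap_report(records: list[dict]) -> Counter:
    """Count how often each honest gap marker appears across extracted records."""
    counts = Counter()
    for rec in records:
        for field in ("doi", "funding_source"):
            # .get() treats a missing key like null; both are gaps for this report.
            if rec.get(field) is None:
                counts[f"{field}_null"] += 1
        if rec.get("study_type") in ("unclear", "other"):
            counts[f"study_type_{rec['study_type']}"] += 1
    return counts


# Made-up sample records (placeholder values, not real papers).
records = [
    {"title": "A", "doi": None, "funding_source": None,
     "study_type": "other", "other_detail": "systematic review"},
    {"title": "B", "doi": "10.1000/example", "funding_source": "Example Agency",
     "study_type": "cohort", "other_detail": None},
]
print(gap_report(records))
```

Tracking these counts over time shows whether the taxonomy needs a new study_type value or whether a particular source routinely omits DOIs.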
Common misconceptions to avoid
Misconception
Adding a prompt instruction like 'never invent values' is enough to stop fabrication.
What's actually true
Misconception
Making fields required produces cleaner, more complete data than allowing nulls.
What's actually true
Closing the door on invented fields
Nullability handles missing values, but there is a second flavour of fabrication worth designing against: invented fields. A model under pressure can add keys you never asked for, attaching a plausible-looking attribute that has no place in your contract. Strict schema handling closes this door by disallowing properties that were not declared, the JSON Schema equivalent of saying these are the only keys permitted. Combined with nullable values for the keys you do declare, you get a contract that is honest in both directions: it will not omit a structure you required, and it will not smuggle in one you did not ask for.
The lesson generalises. Every degree of freedom you leave in a schema is a degree of freedom the model can fill with invention when the source is thin. Designing a fabrication-resistant schema is partly about widening the honest options, with nulls and escape-hatch enums, and partly about narrowing the dishonest ones, by pinning down exactly which fields and which values are permitted. The two moves are complementary, and a strong design uses both rather than leaning on either alone.
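In JSON Schema the usual way to say "these are the only keys permitted" is "additionalProperties": false on the object. The sketch below is a hand-rolled check rather than a full validator, and the function name undeclared_keys and the sample schema are hypothetical; it only shows what the keyword buys you.

```python
def undeclared_keys(record: dict, schema: dict) -> set:
    """Return keys in the record that a closed schema never declared."""
    # JSON Schema allows extra keys unless additionalProperties is explicitly false.
    if schema.get("additionalProperties", True) is not False:
        return set()
    return set(record) - set(schema.get("properties", {}))


# Hypothetical closed schema with two declared keys.
closed_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "doi": {"type": ["string", "null"]},
    },
    "required": ["title", "doi"],
    "additionalProperties": False,
}

# "impact_factor" was never asked for, so it is flagged as an invented field.
extra = undeclared_keys(
    {"title": "A", "doi": None, "impact_factor": 4.2}, closed_schema
)
print(extra)
```

With the keyword omitted, the same record would pass silently, which is exactly the open door the paragraph above describes.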
Field order and reasoning fields
A less obvious lever is the order in which fields appear. Because the model generates the object from top to bottom, a field placed early can inform a field placed later. Practitioners exploit this by putting a short reasoning or analysis field before the conclusion it supports, so the model articulates its basis before committing to a verdict. Placing a risk_analysis string ahead of a risk boolean, for instance, tends to produce a more accurate boolean, because the model has effectively reasoned in writing before answering. The schema is not just a container; its layout subtly steers the model's attention.
This matters for fabrication because a conclusion reached without visible reasoning is easier for the model to assert confidently from nothing. Forcing an explanation field first gives you both a more reliable answer and an artifact you can audit, since a downstream reviewer can read the reasoning and judge whether the conclusion follows. It is a cheap addition that pays off in both accuracy and explainability, and it costs only a handful of tokens per record.
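A reasoning-first layout might look like the sketch below. It assumes the API emits properties in the order they are declared, which is worth confirming for whichever provider you use, since JSON objects are formally unordered. The helper reasoning_precedes is a hypothetical lint check, not a library function.

```python
import json

# The explanation field is declared before the verdict it supports.
RISK_SCHEMA = {
    "type": "object",
    "properties": {
        "risk_analysis": {"type": "string"},
        "risk": {"type": "boolean"},
    },
    "required": ["risk_analysis", "risk"],
    "additionalProperties": False,
}


def reasoning_precedes(schema: dict, reasoning: str, verdict: str) -> bool:
    """Check that the reasoning field is declared before the verdict field."""
    order = list(schema["properties"])  # Python dicts preserve insertion order
    return order.index(reasoning) < order.index(verdict)


# Serialising with json.dumps keeps the declared order intact.
serialised = json.dumps(RISK_SCHEMA)
print(reasoning_precedes(json.loads(serialised), "risk_analysis", "risk"))
```

A check like this can run in a test suite so that a later refactor does not quietly move the verdict ahead of its reasoning.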
When required is exactly right
It would be a misreading to conclude that everything should be nullable. Required fields are the right call whenever a value genuinely appears in every document and a downstream system depends on its presence. An order identifier the format guarantees, a timestamp the ingestion layer stamps, a primary key your database needs: these belong as required, and loosening them would weaken a contract for no benefit. The skill is discrimination: required for the genuinely universal, nullable for the merely usual. A schema that is reflexively all-optional is almost as careless as one that is reflexively all-required, because it throws away the guarantees you are actually entitled to make.
Nullable, optional, and absent are not the same
Precision in how you express a gap matters, because three schema choices that sound interchangeable behave differently. A property listed under properties but left out of the object-level required array is optional: when the model has nothing to report, the field is simply omitted and the key does not appear in the record at all. A nullable field is different. The correct JSON Schema syntax is a union type such as "type": ["string", "null"], and it means the key is present with an explicit null value. There is no "string?" shorthand; a union that includes "null" is how nullability is actually written.
The distinction is not pedantry, because your downstream code branches on it. An omitted property tells a consumer the field was never in play, while a present null tells it the field was in play and genuinely empty. For extraction, a present null is usually the more useful signal, since it records that you looked and found nothing rather than that you never considered the field. Either choice removes the fabrication pressure that a bare required string creates, but choosing deliberately between absence and explicit null is what separates a schema that merely avoids fabrication from one that also reports its gaps in a form your pipeline can detect and act on.
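Downstream code can make the branch explicit. The sketch below uses a sentinel so that an omitted key and an explicit null are never confused; gap_kind is a hypothetical helper name and the doi field is only an example.

```python
import json

_MISSING = object()  # sentinel that cannot collide with any JSON value


def gap_kind(record: dict, field: str) -> str:
    """Classify a field as 'absent' (key omitted), 'null', or 'present'."""
    value = record.get(field, _MISSING)
    if value is _MISSING:
        return "absent"
    if value is None:  # JSON null parses to Python None
        return "null"
    return "present"


print(gap_kind(json.loads('{"title": "A"}'), "doi"))               # absent
print(gap_kind(json.loads('{"title": "A", "doi": null}'), "doi"))  # null
```

A plain record.get("doi") would return None in both cases, which is why the sentinel matters when the pipeline needs to tell "never considered" from "looked and found nothing".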
How it shows up on the exam
In Domain 4 this is an apply-level skill, so Scenario 6 questions rarely ask for a definition; they hand you a schema or a failure and ask what to change. The signature symptom is data that is structurally perfect yet subtly false: invented identifiers, guessed categories, or fields that should have been blank. The credited answer almost always loosens the contract in a targeted way, making the right fields nullable, adding an unclear enum, or introducing an other option, rather than adding instructions or rewriting the prompt. When a stem stresses that every field is required and the documents vary, read that as a fabrication risk and look for the answer that lets the model say nothing was there.
One last framing helps under time pressure: read the words required and varied in the same stem as a contradiction the schema is forcing the model to resolve dishonestly. A required field is a promise that the value always exists, and a varied document set is a guarantee that sometimes it will not, so the two together describe a schema that must fabricate. The credited fix is almost never a new instruction or a different model setting; it is the structural change that dissolves the contradiction, by letting absent data be null, ambiguous data be unclear, and unforeseen data land in an other bucket. Train your eye to spot that required-meets-varied tension and the right option tends to announce itself.
An extraction schema classifies support tickets into an enum of billing, technical, and account. Reviewers notice that genuinely mixed or unclassifiable tickets are always shoehorned into technical, polluting the analytics. Which schema change most directly fixes the fabricated classifications?