AI drafts, curators decide: the case for three-layer state
How we separate AI output, curator review, and published snapshots — and why collapsing those layers is where most tools go wrong.
There is a tempting design for AI-assisted cataloging: a single Description field, the AI fills it, the curator edits it, the curator saves. One field, one source of truth, no ceremony.
It is wrong, and the wrongness is structural. The AI output is evidence, not truth. The curator edits are truth in progress. The published record is truth at a moment. Collapsing these into one field destroys the three things archivists most need: provenance, reversibility, and a defensible audit trail.
We separate them into three layers. This post covers what they are, how they are stored, and what behaviour falls out for free when you do it this way.
Layer one: the AI snapshot
Every time we run an extraction (OCR, transcription, summarisation, entity extraction), we store the raw output in a column called AiExtractionJson. Once written, that value is immutable. The curator never edits it. The UI never overwrites it. If we re-run the AI tomorrow with a better model, the new output goes to a new row and the old one is kept for diffing.
This is the same discipline we apply to source files. You would not let a curator overwrite an original scan; you would not let the next AI run overwrite the previous AI evidence. The snapshot is durable.
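A minimal in-memory sketch of that append-only discipline follows. Only the AiExtractionJson name comes from our schema; the row shape, the `AiSnapshotStore` class, and its method names are assumptions made for illustration, not the production code.

```typescript
// Hypothetical row shape: only the AiExtractionJson column is named in this post.
interface AiSnapshotRow {
  readonly id: number;
  readonly recordId: string;
  readonly model: string;            // assumed: which model produced the output
  readonly createdAt: string;        // ISO 8601 timestamp
  readonly aiExtractionJson: string; // raw AI output, stored verbatim
}

// Illustrative append-only store: there is no update or delete method at all.
class AiSnapshotStore {
  private rows: AiSnapshotRow[] = [];

  // Every extraction run appends a new frozen row.
  append(recordId: string, model: string, output: unknown, createdAt: string): AiSnapshotRow {
    const row: AiSnapshotRow = Object.freeze({
      id: this.rows.length + 1,
      recordId,
      model,
      createdAt,
      aiExtractionJson: JSON.stringify(output),
    });
    this.rows.push(row);
    return row;
  }

  // All runs for a record, oldest first, so any two can be diffed.
  history(recordId: string): readonly AiSnapshotRow[] {
    return this.rows.filter((r) => r.recordId === recordId);
  }

  // The most recent run for a record, if any.
  latest(recordId: string): AiSnapshotRow | undefined {
    const runs = this.history(recordId);
    return runs[runs.length - 1];
  }
}
```

Re-running a model is then just a second `append` call for the same record: `history` returns both rows and the first one is byte-for-byte what it was before.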
Layer two: the review draft
When the curator opens the record, the page loads with the AI output projected into editable fields. Internally we keep a pristine snapshot — the exact values the page mounted with — and a dirty state that tracks every edit. A "Save changes" button appears the moment dirty diverges from pristine. A "Revert" button restores pristine.
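The pristine/dirty mechanics can be sketched without any UI framework. The `TrackedDraft` class below is a hypothetical stand-in written for this post, assuming flat string fields; it is not the code behind the detail pages.

```typescript
type Fields = Record<string, string>;

// Illustrative tracker: pristine is frozen at mount, current takes the edits.
class TrackedDraft {
  private readonly pristine: Fields;
  private current: Fields;

  constructor(mountedWith: Fields) {
    this.pristine = Object.freeze({ ...mountedWith });
    this.current = { ...mountedWith };
  }

  // Record one field edit without touching pristine.
  edit(field: string, value: string): void {
    this.current = { ...this.current, [field]: value };
  }

  // Names of fields whose current value differs from the mounted value.
  changedFields(): string[] {
    return Object.keys({ ...this.pristine, ...this.current })
      .filter((k) => this.pristine[k] !== this.current[k])
      .sort();
  }

  // True when "Save changes" and "Revert" should be shown.
  get isDirty(): boolean {
    return this.changedFields().length > 0;
  }

  // Restore the exact values the page mounted with.
  revert(): void {
    this.current = { ...this.pristine };
  }

  values(): Fields {
    return { ...this.current };
  }
}
```

Because dirtiness is derived by comparison rather than set by a flag, typing a value back to its original text makes the draft clean again.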
This pattern is implemented in useTrackChanges and rendered by FloatingActionBar. We use it on every editable detail page in the product. It looks like a cosmetic UI choice; it is actually the layer that makes AI re-runs survivable.
Here is why. Imagine the curator has spent thirty minutes editing the description. The institution releases a new transcription model and someone re-runs extraction on the file. In a one-field design, the curator has just lost thirty minutes of work. In our design, a new AI snapshot row is written (the old one stays as it was) and the review draft is untouched. The curator sees a banner: "AI re-ran. View what changed?" and chooses whether to merge.
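One way to picture the merge step: compare the AI output the draft was mounted from with the new run, show the curator which fields moved, and copy across only the fields they accept. The function names and the flat string-field shape below are assumptions for illustration, and the sample values are made up.

```typescript
type Fields = Record<string, string>;

// Fields whose value differs between the AI run the draft was based on
// and the new run. A non-empty result is what would trigger the banner.
function changedByRerun(baseAi: Fields, newAi: Fields): string[] {
  return Object.keys({ ...baseAi, ...newAi })
    .filter((k) => baseAi[k] !== newAi[k])
    .sort();
}

// Copy only the curator-accepted fields from the new AI output into the
// draft. Every other field keeps the curator's current value.
function mergeAccepted(draft: Fields, newAi: Fields, accepted: string[]): Fields {
  const merged: Fields = { ...draft };
  for (const field of accepted) {
    const value = newAi[field];
    if (value !== undefined) {
      merged[field] = value;
    }
  }
  return merged;
}

// Made-up example values.
const baseAi: Fields = { title: "Letter", description: "A leter from 1912" };
const curatorDraft: Fields = { title: "Letter", description: "A letter from 1912, with enclosure" };
const rerunAi: Fields = { title: "Letter to the board", description: "A letter from 1912" };

const moved = changedByRerun(baseAi, rerunAi);
const afterMerge = mergeAccepted(curatorDraft, rerunAi, ["title"]);
```

Here the curator accepts the new title and keeps their own description. Neither function writes to the draft unless the curator names a field.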
Layer three: the publish snapshot
When the curator publishes, we serialise the entire reviewed state into a PublishedJson column, one new row per publish. Like the AI snapshot, it is immutable. The published state becomes the new pristine for later edits, and the next publish writes a new snapshot. The published JSON is what feeds the public portal, the OAI-PMH endpoint, and the EAD export.
This means at any time we can answer two questions: what does the public see right now (the most recent PublishedJson), and what did the public see when this DOI was cited last year (a historical PublishedJson). The answers are different rows.
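Both questions reduce to a lookup over an append-only list of publish rows. The sketch below assumes each row carries a version number, a UTC timestamp, and the publishing user's id alongside PublishedJson. Those extra columns and the `PublishHistory` class are hypothetical.

```typescript
// Hypothetical row shape: only the PublishedJson column is named in this post.
interface PublishRow {
  readonly version: number;
  readonly publishedAt: string;   // ISO 8601, UTC
  readonly publishedBy: string;   // assumed: id of the approving curator
  readonly publishedJson: string; // full reviewed state at publish time
}

class PublishHistory {
  private rows: PublishRow[] = [];

  // Each publish appends a frozen row; earlier rows are never rewritten.
  publish(reviewedState: unknown, userId: string, at: string): PublishRow {
    const row: PublishRow = Object.freeze({
      version: this.rows.length + 1,
      publishedAt: at,
      publishedBy: userId,
      publishedJson: JSON.stringify(reviewedState),
    });
    this.rows.push(row);
    return row;
  }

  // What the public sees right now.
  current(): PublishRow | undefined {
    return this.rows[this.rows.length - 1];
  }

  // What the public saw at a given moment: the last row published at or
  // before that time. Assumes rows were appended in chronological order.
  asOf(at: string): PublishRow | undefined {
    const cutoff = Date.parse(at);
    let match: PublishRow | undefined;
    for (const row of this.rows) {
      if (Date.parse(row.publishedAt) <= cutoff) {
        match = row;
      }
    }
    return match;
  }
}
```

With publishes in March 2023 and June 2024, `asOf("2023-12-31T00:00:00Z")` returns the first row while `current()` returns the second.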
What this gives you
Three layers, four behaviours that are difficult or impossible without them:
- AI is re-runnable without trampling curator work. Replacing the model is a routine operation, not a destructive one.
- Diffs are real. "What changed between publish 3 and publish 4?" returns a field-level answer, not a guess.
- Citation is stable. A scholar who cites your DOI is citing the state of the record at the moment of publish.
- Audit is defensible. When someone asks whether a human approved a description, we answer with a timestamp and a user id, not a vibe.
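As an illustration of the second point, a field-level diff between two publish snapshots is a few lines once both are stored as JSON. This sketch assumes each snapshot is a flat object of string fields. `diffPublished` and `FieldChange` are hypothetical names, and nested records would need a recursive walk.

```typescript
interface FieldChange {
  field: string;
  before: string | undefined; // undefined means the field did not exist yet
  after: string | undefined;  // undefined means the field was removed
}

// Compare two serialised publish snapshots field by field.
function diffPublished(beforeJson: string, afterJson: string): FieldChange[] {
  const before = JSON.parse(beforeJson) as Record<string, string>;
  const after = JSON.parse(afterJson) as Record<string, string>;
  const changes: FieldChange[] = [];
  // Union of both key sets, sorted so the output order is stable.
  for (const field of Object.keys({ ...before, ...after }).sort()) {
    if (before[field] !== after[field]) {
      changes.push({ field, before: before[field], after: after[field] });
    }
  }
  return changes;
}

// Made-up snapshots standing in for publish 3 and publish 4.
const publish3 = JSON.stringify({ title: "Letter", description: "Draft text" });
const publish4 = JSON.stringify({ title: "Letter", description: "Final text", date: "1912" });
const fieldChanges = diffPublished(publish3, publish4);
```

For these two snapshots the result lists `date` as added and `description` as changed, and leaves `title` out because it is identical in both.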
Why most tools do not do this
The objection is always the same: the schema is more complex, the storage doubles or triples, the developer has to think about three states instead of one.
The schema is more complex. Storage roughly doubles for descriptive fields. The developer does have to think about three states. We accept all three because the alternative is a tool that cannot honestly support the workflows archivists have always practised (review, approve, publish, revise) and have to practise doubly carefully when there is a machine in the loop.
When the machine writes the first draft, the archivist still owns every word that ships. The data model has to make that ownership legible.
AI does not replace the archivist. It accelerates one phase of their work. The data model has to know which phase that is, and it has to keep the phases separate enough that the archivist can intervene in any of them without losing the others.
See it on your own collection.
Upload a few records, run the AI, and publish a finding aid — before the next post lands.