From spreadsheet to finding aid: a six-week pilot
How a specialist library moved from a CSV catalog to a published finding aid, without hiring a developer.
A specialist library — under three thousand records, single curator, half-time — wanted to publish a finding aid. They had a spreadsheet. They had document scans on a network drive. They had no developer, no budget for one, and a board meeting in eight weeks where they had promised "a digital catalog".
We ran the pilot. It took six weeks. This is what happened in each week, and what we learned about which parts of the process are slow for reasons we cannot fix.
Week 1: import mapping
The spreadsheet had eighteen columns. Some were obvious — "Title", "Date", "Box Number". Some were not — "Notes", "Source", "Status (CHECK)". The AI-assisted import wizard read the first hundred rows and proposed a mapping: Title → Item.Title, Date → Item.DateExpression, Box Number → Item.ContainerReference, and so on. It got fifteen of the eighteen right. The curator corrected the other three in about twenty minutes.
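The proposed mapping can be pictured as a plain lookup from spreadsheet column to catalog field. The sketch below is illustrative only: the three target field names come from the mapping described above, while `COLUMN_MAP`, `map_row`, and the sample row are hypothetical and are not the platform's actual code.

```python
# Hypothetical column-to-field mapping. Only the three target field
# names are taken from the pilot; everything else is illustrative.
COLUMN_MAP = {
    "Title": "Item.Title",
    "Date": "Item.DateExpression",
    "Box Number": "Item.ContainerReference",
}


def map_row(row: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Split one spreadsheet row into mapped fields and unmapped columns."""
    mapped = {}
    unmapped = []
    for column, value in row.items():
        field = COLUMN_MAP.get(column.strip())
        if field is None:
            unmapped.append(column)  # left for the curator to decide
        else:
            mapped[field] = value.strip()
    return mapped, unmapped


# Made-up sample row, not taken from the pilot data.
sample = {"Title": " Letter to the board ", "Date": "c. 1923", "Status (CHECK)": "?"}
fields, needs_review = map_row(sample)
```

Columns without a known target are returned separately instead of being guessed at, which mirrors the step where the curator corrected the three columns the wizard got wrong.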
The folder structure on the network drive (Series A / Series A.3 / Box 12 / Item 42.tiff) became the fonds hierarchy. The wizard detected three kinds of level and asked the curator to confirm which was a "Series", which a "Box", and which an "Item". Five minutes.
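As a rough illustration of how a folder path can yield hierarchy levels, the sketch below splits a path like the one above into series, box, and item. It assumes folder names start with "Series " or "Box " and that the path uses plain forward slashes; `parse_scan_path` is a hypothetical helper, not the wizard's real logic.

```python
from pathlib import PurePosixPath


def parse_scan_path(path: str) -> dict[str, object]:
    """Read hierarchy levels out of 'Series A/Series A.3/Box 12/Item 42.tiff'."""
    *containers, filename = PurePosixPath(path).parts
    # Assumption: folder names carry their level as a prefix.
    series = [p for p in containers if p.startswith("Series ")]
    boxes = [p for p in containers if p.startswith("Box ")]
    return {
        "series": series,                      # outermost first, sub-series after
        "box": boxes[-1] if boxes else None,
        "item": PurePosixPath(filename).stem,  # drop the .tiff extension
    }


levels = parse_scan_path("Series A/Series A.3/Box 12/Item 42.tiff")
```

A prefix test this simple would misfile any folder that breaks the naming pattern, which is one reason a confirmation step by the curator is useful.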
The import ran overnight. By Monday morning, 2,847 item records existed in the system, parented to the right series, with the right reference codes. Three records failed (corrupt scans). The curator triaged those in the run drawer the next morning.
Week 2: AI extraction on scans
Each item had at least one document scan. Some had ten or twelve. We ran the OCR + summary + entity-extraction pipeline against all of them as a background job. The full run took two days of wall-clock time on a single tenant — most of that wait was rate limits, not processing.
What came back: for each item, an AI-drafted title (or confirmation of the spreadsheet title), an AI-drafted description, extracted person names, extracted place names, and an extracted date range. The names and places landed in suggestion queues, not directly in the authority tables. The curator approved them in the next week.
Week 3: curator review
This was the slow week. The curator opened items in batches of 50, reviewed the AI-drafted descriptions, and accepted or edited each one. Items that needed a real read took about 90 seconds when they were easy and five minutes when they were hard. Across 2,847 items, total curator time was about 47 hours over the week, which works out to roughly a minute per item, so a large share of the drafts must have been accepted almost as they stood. They did it in spurts, between other duties.
The track-changes UI made this survivable. The curator could see exactly what they had changed from the AI draft, revert per-field, and save without committing to publish. The "Save changes" button was clicked roughly 4,000 times during the week. Zero data was lost.
Week 4: authority deduplication
The AI had extracted around 400 unique person names across the items. Many of them were variants of the same person — "J. Smith", "John Smith", "J. M. Smith", "Smith, John". We ran an authority-matching pass that proposed merges; the curator approved or rejected each. After the merge pass, the catalog had 312 distinct Person records, each linked to between one and eighty-six items.
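One simple way to surface merge candidates like these is to reduce each name to a coarse key and group names that share it. The sketch below uses surname plus first initial as the key. This is an assumption for illustration: the matching method the platform actually uses is not described here, and `match_key` and `propose_merges` are hypothetical names.

```python
import re
from collections import defaultdict


def match_key(name: str) -> tuple[str, str]:
    """Reduce a personal name to (surname, first initial), lowercased."""
    name = name.strip()
    if "," in name:
        # Inverted form: "Smith, John"
        surname, _, forenames = name.partition(",")
    else:
        # Direct form: "J. M. Smith" -> forenames "J. M.", surname "Smith"
        forenames, _, surname = name.rpartition(" ")
    letters = re.findall(r"[A-Za-z]", forenames)
    first_initial = letters[0].lower() if letters else ""
    return surname.strip().lower(), first_initial


def propose_merges(names: list[str]) -> list[list[str]]:
    """Group names sharing a key; only groups with 2+ variants are proposals."""
    groups = defaultdict(list)
    for name in names:
        groups[match_key(name)].append(name)
    return [variants for variants in groups.values() if len(variants) > 1]


# The four Smith variants are from the text; "Jane Doe" is a made-up control.
names = ["J. Smith", "John Smith", "J. M. Smith", "Smith, John", "Jane Doe"]
candidates = propose_merges(names)
```

A key this coarse over-merges by design: "John Smith" and "Jane Smith" would land in the same group. That is why the output is a list of proposals for a person to approve or reject, not an automatic merge.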
The same pass ran on organisations and places. Fewer entities, more confidence, fewer manual decisions. About four hours of work for the curator.
Week 5: standards export
With the catalog clean, we generated an EAD 2002 export for the entire fonds. The XML file was 1.4 MB. We submitted it to a regional aggregator for validation, and it was accepted on the first attempt with no manual cleanup.
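For readers who have not seen EAD 2002, the sketch below builds a bare skeleton of such a file with Python's standard library. The element names (`ead`, `eadheader`, `eadid`, `archdesc`, `did`, `unitid`, `unittitle`, `dsc`, `c01`) are from EAD 2002. The function name, identifiers, and titles are made up, and the output has not been validated against the EAD DTD or any aggregator's profile, so treat it as a picture of the shape and not as the platform's exporter.

```python
import xml.etree.ElementTree as ET


def build_ead(fonds_title: str, fonds_id: str, series: list[tuple[str, str]]) -> str:
    """Return a minimal EAD 2002-style document as a string (unvalidated)."""
    ead = ET.Element("ead")

    # Header: identifier and title of the finding aid itself.
    header = ET.SubElement(ead, "eadheader")
    ET.SubElement(header, "eadid").text = fonds_id
    filedesc = ET.SubElement(header, "filedesc")
    titlestmt = ET.SubElement(filedesc, "titlestmt")
    ET.SubElement(titlestmt, "titleproper").text = fonds_title

    # Top-level description of the fonds.
    archdesc = ET.SubElement(ead, "archdesc", level="fonds")
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unitid").text = fonds_id
    ET.SubElement(did, "unittitle").text = fonds_title

    # One <c01> component per series; boxes and items would nest below.
    dsc = ET.SubElement(archdesc, "dsc")
    for unit_id, title in series:
        c01 = ET.SubElement(dsc, "c01", level="series")
        c_did = ET.SubElement(c01, "did")
        ET.SubElement(c_did, "unitid").text = unit_id
        ET.SubElement(c_did, "unittitle").text = title

    return ET.tostring(ead, encoding="unicode")


# Hypothetical identifiers and titles, for illustration only.
xml_text = build_ead("Example fonds", "EX-001", [("EX-001/A", "Series A")])
```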
We also enabled the OAI-PMH endpoint for the tenant. The tenant subdomain became a harvestable repository overnight. Two aggregators — one regional, one thematic — picked up the records within forty-eight hours of the endpoint going live.
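From the harvester's side, picking up records starts with a plain HTTP request. The sketch below builds a `ListRecords` request URL. `verb`, `metadataPrefix`, `set`, and the `oai_dc` prefix are defined by the OAI-PMH protocol, while the base URL is a placeholder and not the tenant's real endpoint.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, for illustration only.
url = list_records_url("https://tenant.example.org/oai")
```

A real harvester would also follow the `resumptionToken` returned with each partial response until the list is complete. That loop is omitted here.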
Week 6: public portal launch
The public portal was already running — every tenant gets one, configured from the same data. The curator picked a logo, set the colour palette, wrote two paragraphs of about-page copy, and turned the portal on for indexing. By the time the board meeting happened on Friday, the catalog was searchable at the institution's domain, with stable URLs for every record.
What we learned
Three things, in order of how surprising they were.
The AI is fastest at description, slowest at correctness. First-draft descriptions for 2,847 items took two days of wall-clock time. Curator review of those descriptions took the entire third week. The AI compresses a year of cataloging effort into days; it does not compress the human review pass. If the institution had budgeted for AI to do the whole job, they would have published wrong descriptions on time. They budgeted for AI to do the draft and a human to do the review, and they published right descriptions on time.
Standards are the unblocker. The first month produced a clean catalog inside our system. The fifth week produced an EAD export that an aggregator accepted on the first try. Most of the value to the researcher community came from that file, not from the catalog itself. Without the standards layer, the same work would have produced an internal-only catalog that nobody outside the institution could find.
No developer was hired. This was the pilot's explicit success criterion. The curator did the work, supported by the AI for description and by the platform for everything else. They did not write SQL, edit XML, or call an API; the platform did those jobs for them.
Six weeks from spreadsheet to indexed finding aid. The bottleneck was curator time. The bottleneck was always going to be curator time. The job of the platform is to make sure it stays the only bottleneck.