Vector search over archival prose: what actually worked
Our journey through pgvector, embedding choices, and the strange semantic texture of 19th-century correspondence.
The promise of vector search is that a researcher can type "milkmaid wages 1880" and find records that talk about dairy labour without using those exact three words. The promise is real. The implementation is harder than the demos suggest. This post covers what we tried, what worked, what failed, and what we have started to suspect about the texture of archival prose that makes it a particularly awkward target for embeddings.
Why archival prose is weird
Modern embedding models are trained on modern internet text: a few decades of journalism, fiction, technical writing, social media, and code. Archival prose is none of those things. It is:
- Period English, sometimes several centuries of it
- Idiosyncratic abbreviation ("recd.", "do.", "&c.")
- Bureaucratic register ("hereinafter referred to as", "in respect of")
- Personal and place names whose spelling (and, for places, boundaries) has changed multiple times
- Long, descriptive finding-aid sentences with a syntax all their own ("Series consists of correspondence, financial papers, and miscellaneous printed matter relating to...")
None of these is catastrophic on its own. Together, they push archival text further from the model's training distribution than the content in a typical search application, and the result is that what looks like "semantic search" in a SaaS demo behaves worse on a real catalog.
Model choice
We use text-embedding-3-small at 1536 dimensions. We considered text-embedding-3-large. We ran A/B tests on a sample of about 5,000 records from three pilot tenants — a colonial-era correspondence collection, a 20th-century institutional archive, and a recent born-digital photographic collection.
The accuracy delta between small and large was, on this sample, around 4 percentage points on a top-10 retrieval benchmark we built from curator-annotated relevance judgements. The cost delta was roughly 6x. For our content, on our budgets, small wins. If we were doing semantic search over modern legal or scientific text where the embedding quality matters more, the calculation would be different.
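A top-10 benchmark built from relevance judgements can be scored in several ways. The sketch below assumes recall@10 averaged over queries, which is only one plausible reading of the benchmark described above; the function names, record IDs, and queries are all hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the judged-relevant records that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    runs: Dict[str, List[str]],
    judgements: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Average recall@k over every query with at least one relevant record."""
    scores = [
        recall_at_k(runs.get(query, []), relevant, k)
        for query, relevant in judgements.items()
        if relevant
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical curator judgements and ranked results from two models.
judgements = {"dairy labour wages": {"r1", "r7"}, "land disputes": {"r3"}}
small_run = {"dairy labour wages": ["r1", "r2", "r9"], "land disputes": ["r3", "r4"]}
large_run = {"dairy labour wages": ["r1", "r7", "r2"], "land disputes": ["r3", "r4"]}

delta = mean_recall_at_k(large_run, judgements) - mean_recall_at_k(small_run, judgements)
print(f"recall@10 delta (large minus small): {delta:.2f}")
```

Scoring both models against the same judgement set is what makes the accuracy delta comparable to the cost delta.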
Storage
We use pgvector with an HNSW index. The SemanticVector column is added to three entity types: Item, Story, and FileExtraction. Embedding happens on save, inside the same transaction, but it is never allowed to block the write: if the embedding API fails, the save still succeeds and the vector is regenerated on the next background sweep.
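The save-then-sweep behaviour can be sketched without a database. This is a minimal in-memory illustration, not the production persistence code: `ItemStore`, `save`, and `sweep` are hypothetical names, and `embed` stands in for whatever client calls the embedding API.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


class ItemStore:
    """In-memory stand-in for a table with a nullable vector column (hypothetical)."""

    def __init__(self) -> None:
        self.rows: Dict[int, dict] = {}

    def save(self, item_id: int, text: str, embed: Callable[[str], Vector]) -> None:
        """Persist the record even when the embedding call fails."""
        vector: Optional[Vector]
        try:
            vector = embed(text)
        except Exception:
            # Embedding failed: keep the save and leave the vector empty for the sweep.
            vector = None
        self.rows[item_id] = {"text": text, "semantic_vector": vector}

    def sweep(self, embed: Callable[[str], Vector]) -> int:
        """Re-embed rows whose vector is missing; return how many were repaired."""
        repaired = 0
        for row in self.rows.values():
            if row["semantic_vector"] is None:
                try:
                    row["semantic_vector"] = embed(row["text"])
                    repaired += 1
                except Exception:
                    continue  # Still failing; try again on the next sweep.
        return repaired
```

The point of the design is that a flaky external API can degrade search quality for a record temporarily, but can never lose a curator's edit.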
HNSW parameters: m=16, ef_construction=64, which are the pgvector defaults. We tested higher values; the recall improvement was sub-percentage and the index build cost roughly doubled. The defaults are sane for our content size.
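For reference, an index with these parameters might be declared roughly as follows. This is a sketch that builds the DDL string only: the table and column names are hypothetical, and the cosine operator class (`vector_cosine_ops`) is an assumption, since the distance metric is not stated here.

```python
def hnsw_index_ddl(
    table: str,
    column: str,
    m: int = 16,
    ef_construction: int = 64,
) -> str:
    """Build a pgvector HNSW index statement (cosine distance assumed).

    Identifiers are interpolated directly, so they must be trusted constants,
    never user input.
    """
    return (
        f"CREATE INDEX {table}_{column}_hnsw ON {table} "
        f"USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


# Hypothetical names for the Item table and its vector column.
print(hnsw_index_ddl("item", "semantic_vector"))
```

Raising `m` or `ef_construction` here is the knob that traded a small recall gain for a much slower build in our tests.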
For a tenant with 200,000 items, the HNSW index is around 2.4GB on disk and queries return in 20–40ms cold, 5–10ms warm. The on-disk cost is the dominant operational concern — at ten tenants of that size we are at 24GB just for the item index. We will revisit at the next order of magnitude.
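The disk figures are easier to reason about next to the raw vector payload. The arithmetic below uses only the numbers quoted above plus the fact that pgvector stores single-precision floats; it ignores per-row overhead and uses decimal gigabytes, so treat it as a back-of-envelope check rather than a measurement.

```python
DIMENSIONS = 1536           # text-embedding-3-small
BYTES_PER_FLOAT = 4         # pgvector stores single-precision floats
ITEMS_PER_TENANT = 200_000
INDEX_GB_PER_TENANT = 2.4   # index size quoted above
TENANTS = 10

# Raw vector payload for one tenant, before any index structure is added.
raw_vector_gb = ITEMS_PER_TENANT * DIMENSIONS * BYTES_PER_FLOAT / 1e9

# How much larger the HNSW index is than the vectors it indexes.
index_to_raw_ratio = INDEX_GB_PER_TENANT / raw_vector_gb

# Index footprint across all tenants of this size.
fleet_index_gb = INDEX_GB_PER_TENANT * TENANTS

print(f"raw vectors per tenant: {raw_vector_gb:.2f} GB")
print(f"index / raw ratio:      {index_to_raw_ratio:.2f}")
print(f"index across tenants:   {fleet_index_gb:.1f} GB")
```

On these figures the index is roughly twice the size of the raw vectors, which is why dimension count is the first lever to look at if disk becomes the constraint.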
Hybrid with tsvector
Pure vector search is a worse experience than the demos suggest. The reason is the long tail of cases where the user is searching for a specific thing — a reference code, a date, an exact name — and semantic similarity is the wrong question. "ARC-1986-042" is not similar to anything; the user wants the document with that code, exactly.
We ship hybrid: a tsvector lexical query and a pgvector semantic query run in parallel, and the two result lists are merged with reciprocal rank fusion, which scores each record by its rank in each list rather than by raw scores that are not comparable. Lexical wins on exact-match queries; semantic wins on conceptual queries. The fusion handles the common case where the user does not know which kind of query they are typing.
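Reciprocal rank fusion is short enough to show in full. The sketch below is a generic version, not our exact code: the smoothing constant `k = 60` is the value commonly used in the literature and is an assumption here, as are the record IDs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def reciprocal_rank_fusion(result_lists: Iterable[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each record scores sum(1 / (k + rank)) over the lists it is in."""
    scores: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, record_id in enumerate(results, start=1):
            scores[record_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda record_id: (-scores[record_id], record_id))


# Hypothetical results for a reference-code query.
lexical = ["ARC-1986-042", "item-b"]
semantic = ["item-c", "item-b", "ARC-1986-042"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

A record that ranks well in either list surfaces near the top, and a record that appears in both lists beats one that appears in only one, which is the behaviour we want when the query type is unknown.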
tsvector is a stored generated column per entity with a GIN index. The expression varies per module: Item indexes title, description, reference code, and a concat of authority names; Person indexes the structured name fields and biography. We do not index every text column; the index grows with the volume of text indexed, and most fields are never searched.
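A per-module expression of this kind could be generated from a field list. The sketch below only builds the DDL strings: the column name `search_tsv`, the `english` text search configuration, and the field names are hypothetical. Because a generated column can only reference columns of its own row, it also assumes the authority names are already denormalized into a column on the same table.

```python
from typing import List


def tsvector_column_ddl(table: str, fields: List[str], config: str = "english") -> List[str]:
    """Build DDL for a stored generated tsvector column plus its GIN index.

    Identifiers are interpolated directly, so they must be trusted constants.
    """
    # coalesce() keeps a single NULL field from nulling the whole expression.
    concat = " || ' ' || ".join(f"coalesce({field}, '')" for field in fields)
    return [
        f"ALTER TABLE {table} ADD COLUMN search_tsv tsvector "
        f"GENERATED ALWAYS AS (to_tsvector('{config}', {concat})) STORED;",
        f"CREATE INDEX {table}_search_tsv_gin ON {table} USING gin (search_tsv);",
    ]


# Hypothetical field list for the Item module.
for statement in tsvector_column_ddl(
    "item", ["title", "description", "reference_code", "authority_names"]
):
    print(statement)
```

Passing the configuration explicitly to `to_tsvector` matters: the one-argument form depends on a session setting and is not accepted in a generated column.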
Where vectors fail
Three failure modes we now know to expect.
First, rare names. A person who appears in five records, with a name not seen in the embedding training data, has a vector representation that is essentially noise. The model places it near other rare-name vectors, not near the records it is supposed to be associated with. Lexical search recovers this case; vectors do not.
Second, period spelling variants. "Henrietta" and "Henretta" are the same person, but their embeddings can be far apart if the model has not seen the variant. We are trialling pg_trgm as a fallback specifically for this case — when both lexical and semantic return zero hits, run a fuzzy match and surface "Did you mean...?" hints.
Third, finding-aid prose. Long descriptive paragraphs in finding aids have a flat embedding signature — they all look like each other to the model, because they are all the same register. A query for "papers relating to land disputes" returns dozens of finding-aid passages that all loosely match the genre but do not specifically match the topic. The fix is to embed at a finer granularity (per item, not per fonds) and to weight against the genre-mean vector. We do the first; we have not yet done the second.
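One way to read "weight against the genre-mean vector" is mean-centering: subtract the average finding-aid embedding from each vector so that the shared register cancels out and only the topical differences remain. We have not built this, so the sketch below is an assumption about how it might work, using toy three-dimensional vectors in place of real embeddings.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def center(vector: Vector, mean: Vector) -> Vector:
    """Remove the shared (genre) component from a vector."""
    return [v - m for v, m in zip(vector, mean)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy vectors: the first component plays the role of the shared finding-aid register.
passages = [[0.9, 0.1, 0.0], [0.9, 0.0, 0.1], [0.9, -0.1, -0.1]]
genre_mean = mean_vector(passages)

before = cosine(passages[0], passages[1])
after = cosine(center(passages[0], genre_mean), center(passages[1], genre_mean))
print(f"similarity before centering: {before:.3f}, after: {after:.3f}")
```

In the toy data the two passages look almost identical before centering and unrelated after it. A real implementation would also have to subtract the same mean from the query vector and would need re-evaluating against the relevance judgements, since centering can remove signal as well as noise.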
The third leg of the stool
Lexical (tsvector) for exactness. Semantic (pgvector) for meaning. Fuzzy (pg_trgm) for typos and spelling variants. We have shipped the first two; the third is on the immediate roadmap.
The intent is fallback-then-merge: when the combined lexical + semantic result set is empty or thin, run a trigram similarity query against the same composite expression that tsvector uses, threshold at around 0.3, rank by similarity score. If anything comes back, surface a "Showing results for X" line and let the user accept or reject the correction.
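The fallback-then-merge flow can be prototyped in a few lines. The similarity function below approximates how pg_trgm scores two strings (lowercase, pad each word, compare trigram sets) but is not guaranteed to match it exactly; `search_with_fallback`, `min_hits`, and the candidate names are hypothetical.

```python
import re
from typing import List, Optional, Set, Tuple


def trigrams(text: str) -> Set[str]:
    """Approximate pg_trgm: lowercase, split into words, pad with two spaces and one."""
    grams: Set[str] = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def search_with_fallback(
    query: str,
    primary_hits: List[str],
    candidates: List[str],
    min_hits: int = 3,
    threshold: float = 0.3,
) -> Tuple[List[str], Optional[str]]:
    """Run the fuzzy pass only when the lexical + semantic results are thin."""
    if len(primary_hits) >= min_hits:
        return primary_hits, None
    scored = sorted(((similarity(query, c), c) for c in candidates), reverse=True)
    fuzzy = [c for score, c in scored if score >= threshold]
    suggestion = fuzzy[0] if fuzzy else None
    merged = primary_hits + [c for c in fuzzy if c not in primary_hits]
    return merged, suggestion


hits, suggestion = search_with_fallback("Henretta", [], ["Henrietta", "Harriet"])
print(hits, suggestion)
```

The returned suggestion is what would feed the "Showing results for X" line; the user's original query is kept so the correction can be rejected.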
This avoids running trigram on every query — pg_trgm is comparatively expensive, especially over long Description columns — while still rescuing the queries where the user has typed a name slightly wrong.
What we tell new tenants about semantic search
We tell them: it is the difference between finding the records you already know exist and discovering the ones you did not know to ask for. The first job belongs to lexical search and will until the heat death of the universe. The second job is where semantic earns its keep, and it earns it differently on different collections — best on collections with consistent descriptive depth, worst on collections that are mostly metadata stubs with little prose.
Meaning is not a substitute for keyword. Keyword is not a substitute for meaning. The search bar has to know which job it is doing, and the user should not have to.
This is the heart of what makes archival search a hard product problem. The user does not know whether they are asking a lexical question or a semantic one — they just know what they are looking for. The job of the system is to do both well enough that the user does not have to choose.