Engineering22 April 2026·10 min read

Mapping ISAD(G) to a relational schema without losing your mind

Fonds, series, items — and the many-to-many headaches that emerge when you try to store provenance faithfully.

Rafiq Hossain

ISAD(G) is hierarchical, many-to-many, and written for human readers. Postgres is flat, set-based, and written for indexes. Mapping one to the other is the foundational schema decision in any archival platform, and getting it wrong is expensive in a particular way: every join you forget to model is a join you have to retrofit later, with data already in the database, against curators who have already memorised the field names.

We made this decision early. This post is what we picked, what we considered, and the small handful of indexes that make it survivable at scale.

The hierarchy: closure table over recursive CTE

The naive approach is parent_id on every record and a recursive CTE to walk ancestors. This works fine until you have a million records and you want to load the full ancestor chain for a hundred items on a search results page. You end up with N+1 CTEs.

We use a closure table: a separate ItemAncestry table with one row per (descendant, ancestor) pair plus the depth. Ancestor lookups are an indexed equi-join, hierarchy moves are O(subtree size) maintenance, and recursive walks vanish from the hot path.

The tradeoff is write amplification on bulk imports — a fonds with 50k items writes 50k * average-depth rows to the closure table — and the import job is the only place this matters. We made it a batched insert and moved on.

Authorities: separate tables, link tables, no comma strings

ISAD(G) field 3.2.1 is the "Name of creator(s)". A flat catalog stores this as a string. We store it as a join through ItemPerson or ItemOrganization to the Person and Organization tables, with a role column.

The same shape applies to subjects (ItemSubject → Subject), places (ItemPlace → Place), and events. Every entity that ISAD(G) gestures at as "controlled" gets its own table.

The headache is the editor UI. When the curator types a creator name, the field has to be a typeahead against the authority table, not a free-text input. We invested in EntitySearchDialog so that every linked-authority surface in the product uses the same picker. It is one component used in nine places.

The 26 elements: group them, do not column them

ISAD(G) defines 26 descriptive elements. The naive schema is 26 columns. We grouped them into five conceptual areas — identity, context, content, access, allied materials — and within each area used a mix of dedicated columns (for fields with predictable shape) and a Notes collection (for the more freeform 4.x and 7.x sections).

A Note is its own entity: type, language, content, ordering. This means a record can have multiple notes of the same type, in multiple languages, and ordered. Trying to express that with 26 flat columns is how schemas end up with notes_en, notes_en_2, notes_fr over a few years of patches.

Tenant scoping at the schema level

Every catalog entity carries a TenantId. ABP enforces the filter at the repository level so we cannot accidentally read across tenants in application code. But the indexes have to know about this — every secondary index is composite with TenantId first.

The indexes that earn their keep:

(TenantId, FondsId) for fonds-scoped browses

(TenantId, ParentId) for tree expansion

(TenantId, WorkflowStage, ModifiedAt) for the curator dashboard

GIN (SearchVector) for tsvector full-text

HNSW (SemanticVector vector_l2_ops) for pgvector semantic search

These are not optimisations. They are the difference between a 30ms search and a 3-second one on a 200k-item catalog.

What we wish we had done differently

Two things. First, we modelled the closure table before we modelled soft-delete, and the interaction is awkward — a soft-deleted ancestor still has rows in the closure table, and we filter them out in application code. With hindsight we would have done the closure table inserts inside the same transaction as the soft-delete, with a flag.

Second, we used Guid primary keys everywhere. They are perfect for distributed systems and they are a small but real loss on index size and join cost compared with bigints. For a single-cluster system that was always going to be hosted, bigint would have been the right call. We will not migrate; the cost is bearable; future you, do the maths.

ISAD(G) is a description of how archival records relate. It is not a schema. The job of the developer is to design a schema that can express the description faithfully — not to copy the description into column names.

The spec gives you permission to design. Take it.

engineeringschemaISAD(G)postgres

← Back to blog

See it on your own collection.

Upload a few records, run the AI, and publish a finding aid — before the next post lands.

Start free trial Book a demo