Transcribing oral histories: word-level confidence matters
The difference between 'good enough for search' and 'good enough to cite' is visible at the word level. A tour of our transcription pipeline.
Oral history is the worst case for automatic speech recognition (ASR). Accents drift across continents and decades. Recordings are made on whatever was available: reel-to-reel, cassette, microcassette, mini-DV, smartphone. The vocabulary includes period slang, place names that have changed twice, and code-switching mid-sentence. The speaker pauses to cry, or to drink water, or because they have just remembered something they did not mean to say.
ASR systems handle this badly. They handle it less badly than they did three years ago, but still badly. The honest engineering move is not to pretend the transcript is correct — it is to surface where the model is uncertain so a human can intervene at the points that matter.
That surface is word-level confidence. This post covers what it looks like, how we store it, and the one counter-intuitive finding that changed how we ranked our review queue.
Sentence-level transcripts hide the problem
The default transcription output from most APIs is plain text with paragraph breaks. You get a readable document, you skim it, you decide it looks fine, you publish. Six months later a scholar cites a passage that the model actually hallucinated over a low-volume stretch of audio about a now-deceased relative. The text reads fluently. The fluency was the warning sign, but you had nothing to compare it against.
The fix is structural: stop accepting a plain-text transcript as the output of ASR. Accept a sequence of words, each with a start timestamp, an end timestamp, and a confidence score. Render the transcript by laying those words out as text, but keep the underlying object addressable per word.
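A minimal sketch of that shape, in Python. The class name, field names, and sample values are illustrative assumptions, not our production schema; the point is only that the readable text is derived from the word list rather than stored in place of it.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Field names are assumptions chosen for illustration.
    text: str
    start: float       # seconds from the start of the recording
    end: float         # seconds from the start of the recording
    confidence: float  # 0.0 (no confidence) to 1.0 (certain)


def render_text(words: list[Word]) -> str:
    """Lay the words out as readable text. The list itself stays addressable per word."""
    return " ".join(w.text for w in words)


# Made-up sample data: one low-confidence place name in an otherwise clean sentence.
words = [
    Word("We", 0.00, 0.18, 0.98),
    Word("left", 0.18, 0.52, 0.95),
    Word("Bombay", 0.60, 1.10, 0.41),
    Word("in", 1.15, 1.25, 0.97),
    Word("March.", 1.25, 1.80, 0.93),
]

print(render_text(words))
```

The rendered sentence reads fluently either way; only the per-word object shows that the third word is the one worth checking against the audio.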
What we store
For audio and video, gpt-4o-transcribe returns per-word logprob and timestamp data when we request it. We store the full word array as JSON on the FileExtraction row, alongside a derived plain-text view and a VTT track for video playback.
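As a rough sketch of how a per-word logprob could become the stored confidence score: exponentiating a log probability gives a probability between 0 and 1. Treating that value directly as the confidence is an assumption made here for illustration, and the raw field names ("word", "logprob") and stored field names are hypothetical rather than a documented response format.

```python
import json
import math


def logprob_to_confidence(logprob: float) -> float:
    """exp(logprob) is a probability; clamp to guard against tiny positive logprobs."""
    return min(1.0, math.exp(logprob))


def to_stored_words(raw_words: list[dict]) -> list[dict]:
    """Convert raw per-word records (hypothetical shape) into the stored word array."""
    return [
        {
            "text": r["word"],
            "start": r["start"],
            "end": r["end"],
            "confidence": round(logprob_to_confidence(r["logprob"]), 4),
        }
        for r in raw_words
    ]


# Made-up sample input.
raw = [
    {"word": "hello", "start": 0.0, "end": 0.4, "logprob": -0.05},
    {"word": "there", "start": 0.5, "end": 0.9, "logprob": -1.2},
]

# The JSON string is what would be written to the extraction row.
payload = json.dumps(to_stored_words(raw))
print(payload)
```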
The UI then has everything it needs. Words below a confidence threshold get a yellow tint. Hovering shows the confidence and lets you click-to-play that exact word from the audio. The "Accept" / "Reject" buttons on the proofreader operate at the word level, not the segment level.
When a curator accepts a low-confidence word, we mark it as human-verified in the JSON. From that point the word is no longer flagged, regardless of its original score. The confidence threshold is configurable per tenant; some tenants want everything below 0.85 reviewed, some only below 0.6.
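The flagging rule can be sketched in a few lines. The "verified" and "verified_by" keys and both function names are hypothetical; the behaviour follows the description above: a verified word is never flagged again, and everything else is compared against the tenant's threshold.

```python
def needs_review(word: dict, threshold: float) -> bool:
    """True if the word should be tinted in the proofreader."""
    if word.get("verified"):
        # Human sign-off overrides the original score.
        return False
    return word["confidence"] < threshold


def accept(word: dict, curator: str) -> dict:
    """Return a copy of the word marked as human-verified, recording who accepted it."""
    return {**word, "verified": True, "verified_by": curator}


# Made-up sample word with a middling score.
word = {"text": "Bombay", "start": 0.6, "end": 1.1, "confidence": 0.7}

print(needs_review(word, threshold=0.85))  # strict tenant
print(needs_review(word, threshold=0.6))   # lenient tenant
print(needs_review(accept(word, "curator-1"), threshold=0.85))
```

Keeping the original confidence on the accepted word, rather than overwriting it with 1.0, preserves the record of what the model thought before a human intervened.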
The counter-intuitive finding
We assumed confidence would correlate cleanly with correctness — high confidence, probably right; low confidence, probably wrong. Mostly true. Mostly is not all.
The exception we found: long pauses. When the speaker stops to think for ten seconds, the model fills the silence with the most probable next token, which is usually a filler word ("um", "you know", "right"). The confidence on these filler insertions is high — they are statistically natural completions — but they are also inventions. The audio at that timestamp is silence.
We added a second signal: gaps between consecutive word timestamps. Any word whose start time is more than two seconds after the previous word ended is flagged regardless of its confidence score. The threshold is empirical and probably wrong for languages we have not tested.
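A minimal sketch of the gap check, assuming words are dicts with "start" and "end" in seconds (hypothetical field names). The two-second default mirrors the threshold described above.

```python
def gap_flagged_indices(words: list[dict], max_gap: float = 2.0) -> list[int]:
    """Indices of words that start more than max_gap seconds after the previous word ended.

    The first word has no predecessor, so it is never flagged by this rule.
    """
    flagged = []
    for i in range(1, len(words)):
        gap = words[i]["start"] - words[i - 1]["end"]
        if gap > max_gap:
            flagged.append(i)
    return flagged


# Made-up sample: a confident filler word that follows ten seconds of silence.
words = [
    {"text": "and", "start": 0.0, "end": 0.5, "confidence": 0.96},
    {"text": "then", "start": 0.6, "end": 1.0, "confidence": 0.94},
    {"text": "um", "start": 11.0, "end": 11.3, "confidence": 0.97},
]

print(gap_flagged_indices(words))
```

This check runs independently of the confidence threshold, so a word can be flagged by either signal.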
The pipeline
For an uploaded video file the steps are:
- FFmpeg extracts the audio track at 16 kHz mono
- The audio is chunked at silence points (avoiding mid-word splits) into 10-minute segments
- Each segment goes to gpt-4o-transcribe with response_format=verbose_json and timestamp_granularities=word
- The per-word arrays are stitched back together with timestamp offsets
- The combined word array is stored as JSON; a VTT track is generated for the HTML5 video player; a plain-text view is derived for full-text search
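The stitching step in the list above can be sketched as follows. It assumes each segment's timestamps are relative to the start of that segment and that the segment's offset into the full recording is known from the chunking step; the function name and dict keys are hypothetical.

```python
def stitch(segments: list[tuple[float, list[dict]]]) -> list[dict]:
    """Merge per-segment word arrays into one array on the recording's timeline.

    Each item is (offset_seconds, words), where the words carry
    segment-relative "start" and "end" times.
    """
    combined = []
    # Sort by offset so the result is chronological even if segments finish out of order.
    for offset, words in sorted(segments, key=lambda seg: seg[0]):
        for w in words:
            combined.append({**w, "start": w["start"] + offset, "end": w["end"] + offset})
    return combined


# Made-up sample: the second segment begins 600 seconds into the recording.
segments = [
    (600.0, [{"text": "later", "start": 0.5, "end": 1.0, "confidence": 0.9}]),
    (0.0, [{"text": "first", "start": 0.25, "end": 0.75, "confidence": 0.95}]),
]

for w in stitch(segments):
    print(w["text"], w["start"], w["end"])
```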
The proofreader UI loads the VTT track via the standard TextTrack API, so the captions are real captions — they show up in fullscreen, they respect the OS caption styling, screen readers see them. The proofreader overlay is layered on top.
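For reference, a VTT track can be derived from the word array with very little code. This is a sketch under assumptions: grouping a fixed number of words per cue is an illustrative choice, not necessarily how our cues are segmented, and the function names are hypothetical. It also does not escape characters such as "&" or "<" in cue text, which a real generator would need to handle.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_vtt(words: list[dict], words_per_cue: int = 7) -> str:
    """Build a WebVTT document, one cue per fixed-size group of words."""
    lines = ["WEBVTT", ""]
    for i in range(0, len(words), words_per_cue):
        cue = words[i:i + words_per_cue]
        # A cue spans from its first word's start to its last word's end.
        lines.append(f"{vtt_timestamp(cue[0]['start'])} --> {vtt_timestamp(cue[-1]['end'])}")
        lines.append(" ".join(w["text"] for w in cue))
        lines.append("")  # a blank line terminates the cue
    return "\n".join(lines)


# Made-up sample data.
words = [
    {"text": "hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 1.0},
]

print(to_vtt(words, words_per_cue=2))
```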
Why this matters for citation
A researcher who quotes an oral history transcript is making a claim about what someone said. If the underlying transcript is a plain-text smear of the audio, the quote points to a passage, not to evidence. If the transcript carries word-level timestamps and confidence, the quote can point to a specific second of audio, with a record of which human accepted that word.
A transcript is a hypothesis until a human signs off on each word that carries weight. The system has to know which words carry weight; the human has to do the signing.
We are not the first to think this way. The oral history community has been arguing for word-level verifiability for two decades. What is new is that the tooling is now cheap enough to do this by default. There is no longer a reason to ship sentence-level transcripts as the final artefact.