The Post‑OCR Gazette

Diffusion vs Autoregression · June 2026

DiffusionGemma is an experimental language model that generates text by denoising 256 tokens in parallel rather than writing one token at a time. This demo uses it to correct noisy OCR from historical newspapers, head‑to‑head against a conventional autoregressive model (Gemma‑4‑E4B).

How to use it: pick a passage below (or paste your own), press Correct this text, and watch the correction emerge step by step. On a 75‑passage benchmark the diffusion model corrected more accurately than the autoregressive baseline and ran roughly 8× faster.

All experiments ran on Hugging Face Jobs; the benchmark scripts and README are in this bucket.

DiffusionGemma's intermediate output at each denoising step
Run a passage and watch DiffusionGemma refine all 256 tokens at once — from rough draft to corrected text in a handful of steps. The final step shows what changed, highlighted against the input.
⚠ No human transcription exists for this passage, so corrections cannot be verified — fluent output is not necessarily correct output. Models may invent plausible readings.

DiffusionGemma

26B‑A4B‑it · diffusion: denoises 256 tokens in parallel

Gemma‑4‑E4B

E4B‑it · autoregressive: one token at a time, greedy

The data. 75 passages from BLN600, a corpus of 600 excerpts of 19th‑century London newspapers (largely crime reporting) from the British Library's collections, each paired with both the original OCR and a careful human transcription. That human transcription is the “right answer” every number below is measured against. Passages longer than DiffusionGemma's 256‑token output block were trimmed at a point where OCR and transcription align, so the pairs stay parallel. (BLN600 is CC‑BY‑NC, so the passages themselves aren't republished here — only these metrics.)

The task. Both models got the identical instruction (fix recognition errors only, don't modernise or rephrase) and processed one passage at a time on the same A100 GPU. CER / WER (character / word error rate): how far the output remains from the human transcription, by character / by word; the “OCR input” row is the damage before any correction. Relative CER reduction: how much of that damage the model repaired. Over‑correction: how much text that was already right the model needlessly changed. Fix rate: how much of what was actually wrong it fixed.
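
For readers who want to see the arithmetic, here is a minimal sketch of CER, WER and relative CER reduction. It assumes the usual edit-distance definitions (Levenshtein distance divided by the length of the human transcription); the benchmark scripts may normalise text or aggregate across passages differently. The function names are hypothetical and the example strings are invented for illustration, not taken from BLN600.

```python
def levenshtein(a, b):
    """Fewest insertions, deletions and substitutions that turn a into b.

    Works on any two sequences: strings (characters) or lists (words).
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate of hypothesis against the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


def wer(hypothesis, reference):
    """Word error rate, splitting on whitespace (an assumed tokenisation)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(hypothesis.split(), ref_words) / len(ref_words)


def relative_cer_reduction(ocr, corrected, reference):
    """Share of the OCR input's character errors removed by the correction."""
    before = cer(ocr, reference)
    if before == 0:
        raise ValueError("OCR input already matches the reference")
    return (before - cer(corrected, reference)) / before


# Invented example: the OCR has two damaged words, the model fixes one.
reference = "the prisoner was remanded"
ocr = "tbe prisoner was rernanded"
corrected = "the prisoner was rernanded"

print(f"CER before: {cer(ocr, reference):.3f}")
print(f"CER after:  {cer(corrected, reference):.3f}")
print(f"WER before: {wer(ocr, reference):.3f}")
print(f"Relative CER reduction: {relative_cer_reduction(ocr, corrected, reference):.3f}")
```

A negative relative CER reduction would mean the model left the text further from the transcription than the raw OCR was. Over‑correction and fix rate need an alignment between OCR, output and transcription to decide which spans were already right, so they are not sketched here.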

The full record. Every experiment behind these numbers — scripts, configs, findings (including the negative results) — is logged in a public experiment-log bucket; all runs executed on Hugging Face Jobs.
