The Post-OCR Gazette — DiffusionGemma vs Gemma-4

DiffusionGemma's intermediate output at each denoising step

Run a passage and watch DiffusionGemma refine all 256 tokens at once — from rough draft to corrected text in a handful of steps. The final step shows what changed, highlighted against the input.

⚠ No human transcription exists for this passage, so corrections cannot be verified — fluent output is not necessarily correct output. Models may invent plausible readings.

DiffusionGemma

26B‑A4B‑it · diffusion: denoises 256 tokens in parallel

Gemma‑4‑E4B

E4B‑it · autoregressive: one token at a time, greedy

Highlight legend: changed, added, removed (all relative to the OCR input).
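For readers who want to reproduce this kind of highlighting offline, here is a minimal sketch of one way to label differences as changed, added or removed. It is not the page's own code: it assumes a word-level comparison on whitespace-split text using Python's standard `difflib`, and the function name `classify_changes` is hypothetical.

```python
from difflib import SequenceMatcher


def classify_changes(ocr_text: str, corrected_text: str):
    """Return (label, ocr_words, corrected_words) for every region that differs.

    Assumption: words are split on whitespace; the demo itself may diff
    at a different granularity (characters or model tokens).
    """
    ocr_words = ocr_text.split()
    new_words = corrected_text.split()
    matcher = SequenceMatcher(a=ocr_words, b=new_words, autojunk=False)

    # Map difflib's opcode tags onto the three legend categories.
    label_for = {"replace": "changed", "insert": "added", "delete": "removed"}

    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical regions are not highlighted
        spans.append((label_for[tag], ocr_words[i1:i2], new_words[j1:j2]))
    return spans
```

Each returned tuple holds the label, the affected words from the OCR input, and the words that replaced them in the model output; one of the two lists is empty for pure additions or removals.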

The data. 75 passages from BLN600, a corpus of 600 excerpts of 19th‑century London newspapers (largely crime reporting) from the British Library's collections, each paired with both the original OCR and a careful human transcription. That human transcription is the “right answer” every number below is measured against. Passages longer than DiffusionGemma's 256‑token output block were trimmed at a point where OCR and transcription align, so the pairs stay parallel. (BLN600 is CC‑BY‑NC, so the passages themselves aren't republished here — only these metrics.)
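The trimming step can be illustrated with a small sketch. The source does not say how the alignment point was found, so everything here is an assumption: whitespace-separated words stand in for model tokens, the budget is applied to both sides, alignment comes from `difflib.SequenceMatcher`, and `trim_pair` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def trim_pair(ocr_text: str, gold_text: str, max_words: int):
    """Cut both texts just after the last word they share within the budget.

    Assumption: words approximate tokens. A real pipeline would count
    tokens with the model's own tokenizer.
    """
    ocr = ocr_text.split()
    gold = gold_text.split()
    if len(ocr) <= max_words and len(gold) <= max_words:
        return ocr_text, gold_text  # already fits, nothing to trim

    matcher = SequenceMatcher(a=ocr, b=gold, autojunk=False)
    cut = (0, 0)  # stays (0, 0) if no aligned word fits the budget
    for block in matcher.get_matching_blocks():
        # Each block means ocr[a:a+size] == gold[b:b+size].
        # Take as many of its words as still fit the budget on both sides.
        room = min(block.size, max_words - block.a, max_words - block.b)
        if room > 0:
            cut = (block.a + room, block.b + room)

    return " ".join(ocr[:cut[0]]), " ".join(gold[:cut[1]])
```

Because the cut always falls directly after a word that both texts share, the trimmed OCR and the trimmed transcription end at the same place in the passage, which is what keeps the pair parallel.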

The task. Both models received the identical instruction (fix recognition errors only; don't modernise or rephrase) and processed one passage at a time on the same A100 GPU. CER / WER: the character / word error rate of the output measured against the human transcription; lower is better, and the “OCR input” row shows the damage before any correction. Relative CER reduction: the share of that initial damage the model repaired. Over‑correction: the share of text that was already right that the model needlessly changed. Fix rate: the share of what was actually wrong that the model fixed.
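As a rough illustration of the first three metrics, the sketch below uses the conventional definitions: edit distance divided by the length of the reference, at character level for CER and word level for WER. The exact normalisation used for the numbers on this page is not stated, so treat this as an assumption rather than the evaluation script; the function names are hypothetical. Over‑correction and fix rate need a three-way alignment between OCR input, model output and transcription, and are not sketched here.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # drop an item from the reference
                curr[j - 1] + 1,     # insert an item from the hypothesis
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate on whitespace-split words; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_cer_reduction(reference: str, ocr_input: str, model_output: str) -> float:
    """Share of the OCR damage that was repaired; assumes the OCR CER is above zero."""
    before = cer(reference, ocr_input)
    after = cer(reference, model_output)
    return (before - after) / before
```

Under these definitions a relative CER reduction of 1.0 means the output matches the transcription exactly, 0.0 means no net improvement, and a negative value means the model left the text further from the transcription than the raw OCR was.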

The full record. Every experiment behind these numbers — scripts, configs, findings (including the negative results) — is logged in a public experiment-log bucket; all runs executed on Hugging Face Jobs.
