OCR made the PDF searchable, but reading the text reveals 'rn' where 'm' should be, 'cl' instead of 'd', random spaces, and the occasional dropped letter.
What's actually going wrong
OCR is pattern recognition: it guesses each character from its shape. Common confusions are 'rn' vs 'm', 'cl' vs 'd', 'I' vs 'l' vs '1', and 'O' vs '0'. The quality of the source determines how many errors slip through.
High-resolution scans of clean text produce nearly perfect OCR. Faded or low-resolution sources produce more errors.
The quick fix
Run the PDF through Flint's Convert PDF to Word tool and open the Word output. Use Find and Replace to bulk-correct common OCR errors:
- Find 'rn', replace with 'm' (carefully: 'corn' shouldn't become 'com')
- Find ' cl ', replace with ' d '
- Find '0' that should be 'O' in proper nouns
A spell check in Word catches most of the remaining errors. Allow about twenty minutes for a long document; the result is clean text.
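If the text is exported to plain text, the caution about 'corn' in the first replacement can be applied automatically. The sketch below is one possible approach, not part of Flint or Word: it swaps 'rn' for 'm' only when the original word is unknown and the corrected word is known. The word list `KNOWN_WORDS` and the function name `fix_rn` are hypothetical and exist only for illustration; a real run would load a full dictionary.

```python
import re

# Hypothetical word list for illustration; load a full dictionary in practice.
KNOWN_WORDS = {"corn", "modern", "morning", "document", "summary", "the", "in"}

def fix_rn(text, known_words=KNOWN_WORDS):
    """Replace 'rn' with 'm' only when that turns an unknown word into a known one."""
    def repair(match):
        word = match.group(0)
        lower = word.lower()
        if "rn" not in lower or lower in known_words:
            return word  # already a real word, e.g. 'corn'
        candidate = lower.replace("rn", "m")
        if candidate in known_words:
            # Preserve a leading capital letter from the original word.
            return candidate.capitalize() if word[0].isupper() else candidate
        return word  # not sure, so leave it for manual review
    return re.sub(r"[A-Za-z]+", repair, text)

print(fix_rn("Docurnent surnrnary: corn in the morning"))
```

Words that the sketch cannot confirm either way are left untouched, so they still need a manual pass.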
If that didn't work
If errors are widespread, the source was probably too poor for reliable recognition. Either rescan the source at higher quality, or accept that the whole document needs manual proofreading.
For consistent errors (specific words always wrong), use Find and Replace All to fix them in bulk.
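Outside Word, the same bulk fix can be sketched as a lookup table of known misreadings applied in one pass. This is a minimal illustration that assumes the text has been saved as plain text; the entries in `CORRECTIONS` and the function name `replace_all` are made up for the example and should be replaced with the errors found in the actual document.

```python
import re

# Hypothetical corrections gathered while proofreading; replace with your own.
CORRECTIONS = {
    "cornpany": "company",
    "rnanager": "manager",
    "clocument": "document",
}

def replace_all(text, corrections=CORRECTIONS):
    """Replace each listed misreading wherever it appears as a whole word."""
    # \b limits matches to whole words so longer words are not altered.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, corrections)) + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)

print(replace_all("The cornpany rnanager signed the clocument."))
```

The match is case-sensitive, so a capitalised misreading such as 'Cornpany' would need its own entry.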
Prevent it next time
Source quality drives OCR accuracy. Scan at 300 dpi or higher and use clean, well-lit originals. For documents where accuracy matters, always proofread the OCR output before relying on it.
FAQ
How accurate is Flint's OCR?
Above 99% on crisp 300 dpi sources. Accuracy is lower on faded or low-resolution scans, and specialised content (handwriting, unusual fonts) is recognised less reliably.
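To put that figure in perspective, here is a rough calculation. It assumes the percentage is measured per character, which this page does not specify, and it uses an illustrative page length of 2,000 characters. The function name `expected_errors` is hypothetical.

```python
def expected_errors(char_count, accuracy):
    """Estimate misread characters, assuming accuracy is measured per character."""
    return round(char_count * (1 - accuracy))

# One page of about 2,000 characters at 99% accuracy.
print(expected_errors(2000, 0.99))
# A 50-page document of the same density.
print(expected_errors(50 * 2000, 0.99))
```

Under those assumptions, 99% still leaves about 20 misread characters per page, which is why proofreading remains worthwhile on important documents.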
Can fixing OCR errors be fully automated?
For common errors, yes: Find and Replace catches predictable mistakes. Unique errors need manual review, so plan for some manual cleanup on important documents.
Does spell check catch OCR mistakes?
Yes for nonsense words, but it misses errors that produce a real word ('cat' instead of 'oat'). Combine spell check with a manual proofread for important content.
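The limitation is easy to see in a toy dictionary check. This is a simplified stand-in for a spell checker, not how Word implements it; the `VOCABULARY` set and the function name `flag_unknown_words` are hypothetical.

```python
import re

# Hypothetical vocabulary; a real spell checker uses a full dictionary.
VOCABULARY = {"the", "cat", "oat", "field", "was", "harvested", "in", "autumn"}

def flag_unknown_words(text, vocabulary=VOCABULARY):
    """Return the words that are not in the vocabulary, in order of appearance."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() not in vocabulary]

# 'harvestecl' is flagged, but 'cat' (misread from 'oat') is a real word and passes.
print(flag_unknown_words("The cat field was harvestecl in autumn"))
```

Only a reader who knows the sentence should say 'oat field' can catch the second error.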
Why does OCR confuse certain characters?
Similar shapes: 'rn' renders almost identically to 'm' at small sizes. A higher-resolution source distinguishes them better, and proofreading catches what remains.
OCR cleanup happens in Word. Convert PDF to Word in Flint, clean the text, re-export to PDF.