I'm working on scanned documents (registers) that contain both French and Arabic text.
When I run Tesseract OCR with lang='fra', all the French text is extracted perfectly.
But when I use lang='ara+fra' to handle both languages in one pass, I start getting weird errors:
- French words are misread or replaced (e.g., SOCIETE becomes 50012175)
- Some French company names become random Arabic-like characters (e.g., ALPHA becomes حمناطاحم)
- Arabic words work fine, but French gets corrupted.
What I want:
I’d like Tesseract to prioritize French whenever the text looks French (Latin letters), and only fall back to Arabic if it’s actually Arabic text.
I tried doing OCR in two passes (fra and ara) and combining them manually, but it’s hard to align words correctly since Tesseract splits and orders text differently for Arabic and French.
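To make the two-pass idea concrete, here is a minimal sketch of merging by position on the page instead of by word order. It assumes each pass has already been turned into a list of word dicts with text, left, top, width, height and conf keys (for example, built from the output of pytesseract.image_to_data with output_type=Output.DICT). The function names, the 0.5 overlap threshold, and the "French wins ties" rule are my own assumptions for illustration, not anything Tesseract provides.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the smaller box's area (0.0 to 1.0)."""
    x1 = max(a["left"], b["left"])
    y1 = max(a["top"], b["top"])
    x2 = min(a["left"] + a["width"], b["left"] + b["width"])
    y2 = min(a["top"] + a["height"], b["top"] + b["height"])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    smaller = min(a["width"] * a["height"], b["width"] * b["height"])
    return inter / smaller if smaller else 0.0


def merge_by_position(fra_words, ara_words, min_overlap=0.5):
    """Merge two OCR passes, resolving overlapping boxes by confidence.

    A French word replaces every Arabic word it overlaps when its confidence
    is at least as high as theirs (ties go to French). Otherwise the French
    word is dropped and the Arabic reading is kept.
    """
    kept_ara = list(ara_words)
    kept_fra = []
    for fw in fra_words:
        rivals = [aw for aw in kept_ara if overlap_ratio(fw, aw) >= min_overlap]
        if all(fw["conf"] >= aw["conf"] for aw in rivals):
            kept_fra.append(fw)
            kept_ara = [aw for aw in kept_ara if aw not in rivals]
    # Simple top-to-bottom, left-to-right ordering; real reading order
    # (especially for right-to-left lines) would need more care.
    return sorted(kept_fra + kept_ara, key=lambda w: (w["top"], w["left"]))


# Hypothetical sample data, not real OCR output.
fra = [
    {"text": "ALPHA", "left": 10, "top": 10, "width": 50, "height": 12, "conf": 95},
    {"text": "aS .", "left": 201, "top": 10, "width": 38, "height": 12, "conf": 20},
]
ara = [
    {"text": "حمناطاحم", "left": 12, "top": 9, "width": 48, "height": 13, "conf": 40},
    {"text": "شركة", "left": 200, "top": 10, "width": 40, "height": 12, "conf": 90},
]
merged = merge_by_position(fra, ara)
print([w["text"] for w in merged])
```

One caveat with this sketch: confidence values from two different language models are not necessarily comparable, so the comparison may need a bias or a threshold tuned on real pages.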
Question:
How can I ensure French text is not misinterpreted as Arabic when using Tesseract on bilingual documents?
Is there a way to:
- Prioritize fra over ara during a single OCR run?
- Or post-process the result from lang='ara+fra' to correct misclassified French text?
- Or a smarter way to combine fra and ara outputs?
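For the post-processing option, this is roughly the kind of check I have in mind: classify each token of the lang='ara+fra' output by Unicode script, then flag non-Latin tokens that sit in a mostly Latin line so they can be re-read from a fra-only pass. It is only a sketch using the standard library. The helper names and the "majority Latin" rule are assumptions, and the rule would also flag genuine numbers in French lines.

```python
import unicodedata


def script_of(token):
    """Return 'latin', 'arabic', 'digits', 'mixed', or 'other' for one token."""
    scripts = set()
    for ch in token:
        if ch.isascii() and ch.isdigit():
            scripts.add("digits")
        elif ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("LATIN"):
                scripts.add("latin")
            elif name.startswith("ARABIC"):
                scripts.add("arabic")
            else:
                scripts.add("other")
        # Punctuation and whitespace are ignored.
    if not scripts:
        return "other"
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed"


def flag_suspects(tokens):
    """Indices of non-Latin tokens in a line where Latin tokens are the majority."""
    kinds = [script_of(t) for t in tokens]
    latin_count = sum(k == "latin" for k in kinds)
    if latin_count * 2 <= len(kinds):
        return []  # line is not mostly Latin, so leave it alone
    return [i for i, k in enumerate(kinds) if k in ("arabic", "digits", "mixed")]


# Hypothetical line from the mixed-language pass.
line = ["LA", "50012175", "ANONYME", "ALPHA"]
print(flag_suspects(line))
```

The flagged indices could then be matched against the fra-only pass (for example by bounding box) to pick the replacement text.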
Any tips, workarounds, or best practices would be appreciated.