python - How to prioritize French OCR over Arabic when using Tesseract (fra+ara) on bilingual documents?

I'm working on scanned documents (registers) that contain both French and Arabic text.

When I run Tesseract OCR with lang='fra', all the French text is extracted perfectly.

But when I use lang='ara+fra' to handle both languages in one pass, the French output degrades:

French words are misread or replaced (e.g., SOCIETE becomes 50012175)
Some French company names become random Arabic-like characters (e.g., ALPHA becomes حمناطاحم)
Arabic words work fine, but French gets corrupted.

What I want:

I’d like Tesseract to prioritize French whenever the text looks French (Latin letters), and only fall back to Arabic if it’s actually Arabic text.

I tried doing OCR in two passes (fra and ara) and combining them manually, but it’s hard to align words correctly since Tesseract splits and orders text differently for Arabic and French.
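For reference, here is a minimal sketch of one way the two passes could be aligned: by word bounding boxes rather than by word order. It assumes each pass is available as a dict shaped like the output of `pytesseract.image_to_data(img, lang=..., output_type=pytesseract.Output.DICT)` (keys `text`, `left`, `top`, `width`, `height`, `conf`). The names `merge_passes`, `min_overlap` and `fra_bias` are hypothetical, and the confidence comparison is only a heuristic, since scores from two different language models are not strictly comparable.

```python
def words_from_data(data):
    """Turn an image_to_data-style dict into a list of word records."""
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # conf may arrive as str or number
        if text.strip() and conf >= 0:  # conf of -1 marks non-word rows
            left, top = data["left"][i], data["top"][i]
            box = (left, top, left + data["width"][i], top + data["height"][i])
            words.append({"text": text, "box": box, "conf": conf})
    return words


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (w * h) / smaller if smaller else 0.0


def merge_passes(fra_data, ara_data, min_overlap=0.5, fra_bias=10.0):
    """Merge a fra pass and an ara pass, preferring French on overlaps.

    A French word is dropped only if an overlapping Arabic word beats its
    confidence by more than fra_bias (an arbitrary, hypothetical margin).
    """
    fra_words = words_from_data(fra_data)
    ara_words = words_from_data(ara_data)

    kept_fra = []
    for fw in fra_words:
        rivals = [aw["conf"] for aw in ara_words
                  if overlap_ratio(fw["box"], aw["box"]) >= min_overlap]
        if not rivals or max(rivals) <= fw["conf"] + fra_bias:
            kept_fra.append(fw)

    # Keep Arabic words only where no kept French word covers the same area.
    kept_ara = [aw for aw in ara_words
                if not any(overlap_ratio(aw["box"], fw["box"]) >= min_overlap
                           for fw in kept_fra)]

    # Crude top-to-bottom, left-to-right ordering; Arabic lines read
    # right-to-left, so a real pipeline would need per-line handling.
    return sorted(kept_fra + kept_ara, key=lambda w: (w["box"][1], w["box"][0]))
```

Because the decision is made per region, the differing word order of the two passes no longer matters; only the final sort determines the output order.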

Question:

How can I ensure French text is not misinterpreted as Arabic when using Tesseract on bilingual documents?

Is there a way to:

Prioritize fra over ara during a single OCR run?
Or post-process the result from lang='ara+fra' to correct misclassified French text?
Or a smarter way to combine fra and ara outputs?
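To make the post-processing option concrete, here is a rough sketch of how suspicious tokens from the lang='ara+fra' output could be flagged for a second look. It assumes the tokens and their confidences have already been collected as `(text, conf)` pairs; `classify_script`, `flag_for_recheck` and the threshold of 60 are hypothetical names and values chosen for illustration, not anything Tesseract provides.

```python
import unicodedata


def classify_script(token):
    """Label a token as 'latin', 'arabic', 'other', 'mixed', or 'none'."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits and punctuation carry no script information
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            scripts.add("arabic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "none"
    return scripts.pop() if len(scripts) == 1 else "mixed"


def flag_for_recheck(words, min_conf=60.0):
    """Return indices of tokens worth re-reading with lang='fra' alone.

    Flags tokens that mix scripts, plus any non-Latin token (Arabic,
    digit-only, etc.) whose confidence is below min_conf.
    """
    flagged = []
    for i, (text, conf) in enumerate(words):
        script = classify_script(text)
        if script == "mixed" or (script != "latin" and conf < min_conf):
            flagged.append(i)
    return flagged


# Example using the misreads described above (confidences are made up).
words = [("SOCIETE", 92), ("حمناطاحم", 35), ("50012175", 40)]
suspect = flag_for_recheck(words)
```

Each flagged token's image region could then be cropped and read again with lang='fra' only, keeping whichever result looks more plausible.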

Any tips, workarounds, or best practices would be appreciated.

python - How to prioritize French OCR over Arabic when using Tesseract (fra+ara) on bilingual documents? - Stack Overflow
