
python - How to prioritize French OCR over Arabic when using Tesseract (fra+ara) on bilingual documents? - Stack Overflow


I'm working on scanned documents (registers) that contain both French and Arabic text.

When I run Tesseract OCR with lang='fra', all the French text is extracted perfectly.

But when I use lang='ara+fra' to handle both languages in one pass, the French output degrades:

  • French words are misread or replaced (e.g., SOCIETE becomes 50012175)
  • Some French company names become random Arabic-like characters (e.g., ALPHA becomes حمناطاحم)
  • Arabic words work fine, but French gets corrupted.

What I want:

I’d like Tesseract to prioritize French whenever the text looks French (Latin letters), and only fall back to Arabic if it’s actually Arabic text.

I tried doing OCR in two passes (fra and ara) and combining them manually, but it’s hard to align words correctly since Tesseract splits and orders text differently for Arabic and French.
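For reference, the two-pass idea can be sketched by aligning on bounding boxes rather than on word order. This is only a minimal sketch under assumptions: the `Word` records are hypothetical and would be built from word-level output such as `pytesseract.image_to_data(..., output_type=Output.DICT)`, the thresholds `min_conf` and `min_overlap` are arbitrary guesses, and the final sort is a crude top-to-bottom, left-to-right ordering that ignores right-to-left reading order within Arabic lines.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    """One OCR word with its bounding box (hypothetical record type)."""
    text: str
    left: int
    top: int
    width: int
    height: int
    conf: float


# Basic Latin plus Latin-1 accented letters, and the main Arabic block.
LATIN = re.compile(r"[A-Za-zÀ-ÖØ-öø-ÿ]")
ARABIC = re.compile(r"[\u0600-\u06FF]")


def overlap_ratio(a: Word, b: Word) -> float:
    """Intersection area divided by the area of the smaller box."""
    x1 = max(a.left, b.left)
    y1 = max(a.top, b.top)
    x2 = min(a.left + a.width, b.left + b.width)
    y2 = min(a.top + a.height, b.top + b.height)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    smaller = min(a.width * a.height, b.width * b.height)
    return inter / smaller if smaller else 0.0


def merge_passes(fra_words, ara_words, min_conf=60.0, min_overlap=0.5):
    """Prefer confident Latin words from the fra pass; fill the rest from ara."""
    # Trust the French pass wherever it found confident Latin-script text.
    french = [w for w in fra_words
              if LATIN.search(w.text) and w.conf >= min_conf]
    kept = list(french)
    for w in ara_words:
        # Ignore non-Arabic output from the Arabic pass.
        if not ARABIC.search(w.text):
            continue
        # Skip Arabic words that sit on top of an accepted French word.
        if any(overlap_ratio(w, f) >= min_overlap for f in french):
            continue
        kept.append(w)
    # Crude ordering: top-to-bottom, then left-to-right.
    return sorted(kept, key=lambda w: (w.top, w.left))
```

With this approach the two passes never need to agree on word order, only on where each word sits on the page.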

Question:

How can I ensure French text is not misinterpreted as Arabic when using Tesseract on bilingual documents?

Is there a way to:

  • Prioritize fra over ara during a single OCR run?
  • Or post-process the result from lang='ara+fra' to correct misclassified French text?
  • Or a smarter way to combine fra and ara outputs?
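To make the post-processing option concrete, here is a rough sketch of what flagging suspect tokens in the lang='ara+fra' output might look like. It assumes each token comes with a confidence score (as word-level Tesseract output provides); the helper names `script_counts` and `needs_french_recheck` and the `min_conf` threshold are hypothetical. It only marks tokens for a second look, it does not recover the correct French text.

```python
import unicodedata


def script_counts(token):
    """Count Latin and Arabic letters in a token using Unicode character names."""
    counts = {"LATIN": 0, "ARABIC": 0}
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in counts:
            if name.startswith(script):
                counts[script] += 1
    return counts


def needs_french_recheck(token, conf, min_conf=60.0):
    """Flag tokens that mix scripts or that were recognized with low confidence."""
    counts = script_counts(token)
    mixed = counts["LATIN"] > 0 and counts["ARABIC"] > 0
    return mixed or conf < min_conf
```

The flagged boxes could then be cropped and re-run with lang='fra' alone, keeping whichever result has the higher confidence.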

Any tips, workarounds, or best practices would be appreciated.
