
tesseract - Arabic OCR from document - Stack Overflow


I am working on a tool that extracts information from Arabic documents (images). I detect each region with a bounding box and then crop it to perform OCR.

Image example (after detection and cropping):

  • The quality is not very high: I take a photo of an A4 document, then detect the region and crop it.

  • The line is made of dots, and the OCR I used sometimes confuses the dots of the line with the dots of Arabic letters (e.g. ماد is read as ضاد).

  • The OCR also sometimes skips the letters at the border of the cropped image (the first and last letters).
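To make the detect-then-crop step concrete, here is a minimal sketch of widening a detected bounding box before cropping, so that the first and last letters are not cut at the border. The helper name `pad_box`, the `(left, top, right, bottom)` box layout, and the 10% margin are assumptions for illustration, not part of my actual pipeline, and I have not verified that padding fixes the skipped letters.

```python
def pad_box(box, image_size, margin=0.1):
    """Expand a (left, top, right, bottom) box by a fraction of its own
    width and height, clamped to the image bounds.

    `pad_box` is a hypothetical helper; the margin value is a guess.
    """
    left, top, right, bottom = box
    width, height = image_size
    dx = round((right - left) * margin)  # horizontal padding in pixels
    dy = round((bottom - top) * margin)  # vertical padding in pixels
    return (
        max(0, left - dx),
        max(0, top - dy),
        min(width, right + dx),
        min(height, bottom + dy),
    )


# Example: a detected field near the right edge of a 320x400 image.
padded = pad_box((100, 50, 300, 90), (320, 400))
```

With Pillow, the padded box could be passed directly to `image.crop(pad_box(box, image.size))`, since `Image.crop` takes the same `(left, upper, right, lower)` order.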

I used PaddleOCR (the English version worked well for the numbers in the same image, but the Arabic version failed):

from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang='ar', ocr_version='PP-OCRv4', use_space_char=True)

Tesseract:

import pytesseract
text = pytesseract.image_to_string(image, lang="ara")

I also tried the Arabic small Nougat model. None of these extracts the text correctly. (I also used image binarization to remove the dotted line, but it did not perform well.) What do you suggest? If you think there is a better workflow for this project, or parameters I should change, please share.
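As one example of the kind of parameter change in question: the Tesseract call above uses the default page segmentation mode, while each crop here is a single field. A sketch of passing an explicit mode through pytesseract's `config` argument follows. `--psm 7` tells Tesseract to treat the image as a single text line; the helper names `tesseract_config` and `ocr_line` are hypothetical, and whether this mode actually improves the Arabic results is untested.

```python
def tesseract_config(psm=7, oem=None):
    """Build the extra command-line flags passed to Tesseract.

    --psm selects the page segmentation mode (7 = single text line);
    --oem selects the OCR engine mode and is left out unless given.
    """
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    return " ".join(parts)


def ocr_line(image, lang="ara", psm=7):
    """Run Tesseract on one cropped field (hypothetical wrapper)."""
    import pytesseract  # imported lazily so the config helper works without it

    return pytesseract.image_to_string(
        image, lang=lang, config=tesseract_config(psm)
    )
```

Calling `ocr_line(image)` would then replace the plain `image_to_string` call for each cropped region.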
