tesseract - Arabic OCR from document

I am working on an information extraction tool for Arabic documents (images). I detect each region with a bounding box, then crop it and run OCR on the crop.

Example image (after detection and cropping):

(The quality is not very high, since the image is a photo of an A4 document that is then run through detection and cropping.)
The fill-in line on the form is made of dots, and the OCR I used sometimes confuses the dots of that line with the dots of Arabic letters (e.g. ماد is read as ضاد).
The OCR also sometimes skips the letters at the border of the cropped image (the first and last letters).
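One thing that may be worth trying for the skipped border letters is cropping with some margin around the detected box, so the first and last letters do not touch the image edge. The sketch below is only an illustration: `expand_box`, `margin_ratio`, and `min_margin` are hypothetical names and values, and the box is assumed to be `(left, top, right, bottom)` in pixels.

```python
def expand_box(box, image_size, margin_ratio=0.1, min_margin=4):
    """Grow a (left, top, right, bottom) box by a margin, clamped to the image.

    margin_ratio and min_margin are illustrative defaults, not tuned values.
    """
    left, top, right, bottom = box
    width, height = image_size

    # Margin is a fraction of the box size, but never less than min_margin pixels.
    pad_x = max(min_margin, int(round((right - left) * margin_ratio)))
    pad_y = max(min_margin, int(round((bottom - top) * margin_ratio)))

    # Clamp so the expanded box never leaves the image.
    return (
        max(0, left - pad_x),
        max(0, top - pad_y),
        min(width, right + pad_x),
        min(height, bottom + pad_y),
    )


if __name__ == "__main__":
    # A detected field near the right edge of a 320x100 image.
    print(expand_box((100, 50, 300, 90), (320, 100)))
```

The returned tuple uses the same `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so it could be passed to the crop step directly.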

I used PaddleOCR (the English version worked well for the numbers in the same image, but the Arabic version failed):

from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang='ar', ocr_version='PP-OCRv4', use_space_char=True)

Tesseract:

import pytesseract
text = pytesseract.image_to_string(image, lang="ara")

I also tried Arabic small Nougat. None of these extracts the text correctly. (I also used image binarization to try to remove the dotted line, but it did not perform well.) What do you suggest? If you think there is a better workflow for this project, or parameter changes that would help, please share.
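For reference, the kind of dotted-line removal meant here can be sketched on an already binarized crop. This is a minimal illustration, not a tested solution. It assumes the crop is a list of rows with 1 for ink and 0 for background, and that a dotted line shows up as a row containing many very short ink runs. The names `short_runs` and `remove_dotted_rows` and the thresholds `max_dot` and `min_dots` are hypothetical and would need tuning. Letter dots that fall on the same rows as the line would be erased too, which is exactly the difficulty described above.

```python
def short_runs(row, max_len):
    """Return (start, end) for each horizontal ink run no longer than max_len."""
    runs, start = [], None
    # A trailing 0 acts as a sentinel so a run touching the right edge is closed.
    for x, ink in enumerate(list(row) + [0]):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            if x - start <= max_len:
                runs.append((start, x))
            start = None
    return runs


def remove_dotted_rows(grid, max_dot=3, min_dots=8):
    """Erase short ink runs in rows that look like part of a dotted line.

    A row is treated as dotted when it has at least min_dots runs of at most
    max_dot pixels. Longer runs (for example letter strokes) are left alone.
    """
    cleaned = [list(row) for row in grid]
    for row in cleaned:
        runs = short_runs(row, max_dot)
        if len(runs) >= min_dots:
            for start, end in runs:
                for x in range(start, end):
                    row[x] = 0
    return cleaned


if __name__ == "__main__":
    letter_row = [1, 1, 1, 1, 1] + [0] * 16       # one long stroke, kept
    dotted_row = [1, 0] * 8 + [1, 1, 1, 1, 1]     # 8 dots plus a crossing stroke
    for row in remove_dotted_rows([letter_row, dotted_row]):
        print("".join(str(v) for v in row))
```

Working row by row keeps strokes that cross the line, but any row through the text that happens to contain many thin strokes could also be hit, so restricting the check to the band where the line is expected may be safer.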
