I am working on an information-extraction tool for Arabic document images: I detect each region with a bounding box and then crop it to perform OCR.
Image example (after detection and cropping):
(The quality isn't that high, because I photograph an A4 document and then detect and crop the region.)
The line to fill in is drawn as a row of dots, and the OCR I used sometimes confuses the dots of that line with the dots of Arabic letters (e.g. ماد is read as ضاد).
The OCR also sometimes skips letters at the border of the cropped image (the first and last letters).
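For the border problem, padding the crop with a white margin before running OCR is one thing that might help. This is only a minimal sketch, assuming the crop is a NumPy array (grayscale or colour); the name `pad_crop` and the 16-pixel margin are illustrative choices, not values verified on these images.

```python
import numpy as np

def pad_crop(crop: np.ndarray, margin: int = 16, fill: int = 255) -> np.ndarray:
    """Return the crop surrounded by a constant-colour margin on all four sides."""
    # Pad only the two spatial axes; leave a colour-channel axis (if any) untouched.
    pad_width = [(margin, margin), (margin, margin)] + [(0, 0)] * (crop.ndim - 2)
    return np.pad(crop, pad_width, mode="constant", constant_values=fill)
```

The padded array can then be passed to the OCR call in place of the raw crop. Expanding the bounding box itself before cropping would be an alternative if the detector cuts the letters off.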
I tried PaddleOCR (the English version worked well for the numbers in the same image, but the Arabic version failed):
from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang='ar', ocr_version='PP-OCRv4', use_space_char=True)
Tesseract:
import pytesseract
text = pytesseract.image_to_string(image, lang="ara")
I also tried Arabic small Nougat. None of these extracts the text correctly. (I also tried image binarization to remove the dotted line, but it didn't perform well.) What do you suggest? Also, if you think there is a better workflow for this project, or parameters I should change, please share.
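To make the dotted-line removal concrete, here is a rough sketch of one possible heuristic, not the binarization I actually ran. It assumes a 2-D grayscale NumPy array with dark ink on a light background, and it treats any row containing many very short horizontal ink runs as part of the dotted line, whitening only those short runs. The function name `erase_dotted_line` and all three thresholds are hypothetical and would need tuning; thin vertical strokes that cross the dotted line would be cut as well.

```python
import numpy as np

def erase_dotted_line(gray: np.ndarray, ink_threshold: int = 128,
                      min_short_runs: int = 10, max_run_len: int = 3) -> np.ndarray:
    """Whiten short horizontal ink runs in rows that contain many of them."""
    ink = gray < ink_threshold            # True where the pixel is dark
    cleaned = gray.copy()
    for y in range(ink.shape[0]):
        row = ink[y].astype(np.int8)
        # Pad with zeros so runs touching the image edge get a start and an end.
        edges = np.diff(np.concatenate(([0], row, [0])))
        starts = np.flatnonzero(edges == 1)    # first column of each ink run
        ends = np.flatnonzero(edges == -1)     # one past the last column of each run
        short = (ends - starts) <= max_run_len
        # Letter dots give only a few short runs per row; a dotted line gives many.
        if short.sum() >= min_short_runs:
            for s, e in zip(starts[short], ends[short]):
                cleaned[y, s:e] = 255
    return cleaned
```

Because only the short runs are erased, longer letter strokes that sit on the same rows as the dotted line are left in place, and rows holding just a few letter dots are not touched.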