python - Extracting Images from a PDF using PyMuPDF gives broken output images

The code I am using to extract the images is:

import io
import os

from PIL import Image


def extract_images_from_pdfs(pdf_list):
    import fitz  # PyMuPDF
    
    output_dir = "C:/path_to_image"
    os.makedirs(output_dir, exist_ok=True)
    
    for pdf_path in pdf_list:
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
       
        # Open the PDF
        pdf_document = fitz.open(pdf_path)
        
        # Track the count of images extracted from this PDF
        image_count = 0
        
        for page_num, page in enumerate(pdf_document):
            # Get the images on this page
            image_list = page.get_images(full=True)
            
            if not image_list:
                print(f"No images found on page {page_num+1} of {pdf_name}")
                continue
            
            # Process each image
            for img_index, img in enumerate(image_list):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                
                if base_image:
                    image_bytes = base_image["image"]
                    image_ext = base_image["ext"]
                    
                    # Convert bytes to image
                    image = Image.open(io.BytesIO(image_bytes))
                    
                    # Save the image
                    image_name = f"{pdf_name}_image_{image_count}.{image_ext}"
                    image_path = os.path.join(output_dir, image_name)
                    
                    image.save(image_path)
                    
                    image_count += 1
        
        pdf_document.close()
        print(f"Extracted {image_count} images from {pdf_name}")

The input, pdf_list, is just a list containing the paths of all my PDFs.

Extracted image 1

Extracted image 2

Expected image:

Could it be that the images in the PDF are encrypted or otherwise inaccessible, and is there a workaround for this?

Any help is greatly appreciated.

The PDF is available at testingpdfexampaper.tiiny.site

asked Mar 31 at 19:17 by ShinyZack123, edited Apr 2 at 6:57 by cards

Can you post a link to the PDF? Have you tried the pdfimages command to see what it gets? – Tim Roberts Commented Mar 31 at 19:23
We don't have your PDF, so we can't check what might be wrong. – furas Commented Mar 31 at 19:40
So, to summarize the answer below, the "expected image" you show is not actually an image. It's 13 separate images (for the lines) plus 14 one-character strings. – Tim Roberts Commented Apr 2 at 7:03
@TimRoberts No, perhaps I was not clear: the images are parts of one word, and the one shown is the title O. The graphic has no images, but 14 lines as path outlines (glyphs) and 15 vectors as lines between those other lines; that's why they work as SVG lettering. – K J Commented Apr 2 at 9:57

1 Answer

The PDF has 78 very small pieces of imagery, of which the "largest" is the masking for the O on the first page:

 1    60 image      81    62  index   1   8  image  no       271  0   151   151 1996B  40%

Many are simply one single pixel.
They can be in any order, and the early ones of the 78 are generally parts of the R:

pdfimages -list chem.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image       4    26  cmyk    4   8  image  no       214  0   163   153   77B  19%
   1     1 image       2     2  cmyk    4   8  image  no       215  0   204   245   21B 131%
   1     2 image       7    59  index   1   8  image  no       226  0   306   303   53B  13%
   1     3 image      60    39  index   1   8  image  no       237  0   150   153  819B  35%
   1     4 image       1     1  cmyk    4   8  image  no       248  0   204   204   14B 350%
   1     5 image       9     4  cmyk    4   8  image  no       259  0   162   153   74B  51%
   1     6 image      58    31  index   1   8  image  no       270  0   150   154  526B  29%
   1     7 image       4     3  cmyk    4   8  image  no       281  0   153   153   38B  79%
   1     8 image       2     2  cmyk    4   8  image  no       290  0   153   175   24B 150%
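
A comparable size check can be done from PyMuPDF itself before saving anything. The following is only a sketch: the helper names and the 16-pixel threshold are assumptions made for illustration, and it relies on `page.get_images(full=True)` returning tuples that start with `(xref, smask, width, height, ...)`.

```python
def split_by_size(image_entries, min_side=16):
    """Separate get_images(full=True) tuples into usable images and tiny fragments.

    Each entry starts with (xref, smask, width, height, ...).
    min_side is an arbitrary illustrative threshold in pixels.
    """
    usable, fragments = [], []
    for entry in image_entries:
        xref, smask, width, height = entry[:4]
        record = {"xref": xref, "smask": smask, "width": width, "height": height}
        # Anything with a very short side is treated as a fragment, not a picture.
        (fragments if min(width, height) < min_side else usable).append(record)
    return usable, fragments


def report_image_sizes(pdf_path, min_side=16):
    """Print, per page, how many embedded images look like real pictures."""
    import fitz  # PyMuPDF, imported lazily as in the question's code

    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            usable, fragments = split_by_size(page.get_images(full=True), min_side)
            print(f"page {page_num}: {len(usable)} usable, {len(fragments)} fragments")
            for rec in usable:
                print(f"  xref {rec['xref']}: {rec['width']}x{rec['height']}, smask={rec['smask']}")
```

If nearly everything on a page lands in the fragments list, as the listing above suggests for this file, extracting the embedded images one by one is unlikely to reproduce what the page shows.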

NOTE: as is common with many PDF constructions, there is no "one to one" relationship between what you see on the page and the objects in the file.
One line of text can be placed in many pieces, and one visible line can be made of multiple paths too.

Thus image extraction is of no real value here: any whole page can be exported as a single image and then trimmed to the desired area, at any density/quality you wish.
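
As a rough sketch of that render-and-trim approach in PyMuPDF: the function names, the clip coordinates, and the 300 dpi default below are assumptions for illustration. It uses `page.get_pixmap` with its `clip` and `dpi` arguments (the `dpi` argument needs a reasonably recent PyMuPDF release).

```python
def expected_pixel_size(clip, dpi):
    """Approximate pixel size of a clip rectangle given in PDF points (1/72 inch)."""
    x0, y0, x1, y1 = clip
    return round((x1 - x0) * dpi / 72), round((y1 - y0) * dpi / 72)


def render_region(pdf_path, page_index, clip, out_path, dpi=300):
    """Render only the clip area of one page to an image file such as a PNG."""
    import fitz  # PyMuPDF, imported lazily as in the question's code

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]  # 0-based page index
        # Everything drawn inside clip (text, paths, image fragments) is rasterized together.
        pix = page.get_pixmap(dpi=dpi, clip=fitz.Rect(clip))
        pix.save(out_path)
        return pix.width, pix.height
```

For example, `render_region("chem.pdf", 0, (72, 72, 216, 144), "title.png")` would write a crop of roughly 600 by 300 pixels; the rectangle here is made up and would need to be replaced by the real position of the graphic.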

Python has PyMuPDF, which can "gather" "paths" and combine them into single graphical units. So if you select an area of inclusion (a Region of Interest), they can possibly be reused as vectors elsewhere.
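
One possible way to gather paths, sketched under assumptions: `page.get_drawings()` returns one dictionary per vector path with its bounding box under the `"rect"` key, and the helper names and the region of interest are hypothetical. The sketch only computes the combined bounding box of the paths that touch the region; it does not re-emit them as vectors.

```python
def union_boxes(boxes, roi):
    """Union all (x0, y0, x1, y1) boxes that overlap roi; return None if none do."""
    rx0, ry0, rx1, ry1 = roi
    hits = [b for b in boxes if b[0] < rx1 and b[2] > rx0 and b[1] < ry1 and b[3] > ry0]
    if not hits:
        return None
    return (
        min(b[0] for b in hits),
        min(b[1] for b in hits),
        max(b[2] for b in hits),
        max(b[3] for b in hits),
    )


def drawing_bounds_in_roi(pdf_path, page_index, roi):
    """Bounding box of all vector paths on a page that touch the region of interest."""
    import fitz  # PyMuPDF, imported lazily as in the question's code

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]  # 0-based page index
        # Each drawing dict describes one path; "rect" is its bounding rectangle.
        boxes = [tuple(d["rect"]) for d in page.get_drawings()]
    return union_boxes(boxes, roi)
```

The resulting box could then serve as the clip rectangle when rendering or exporting that part of the page.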

This is similar in effect to the way the MuPDF command line can, with a few well-chosen commands, export SVG areas for reuse.
