extract vector charts from PDF

I can't find a way to isolate or extract vector charts and graphs (that are not images) from a PDF.

I tried extracting them directly, but realized it is not that straightforward. I was using PyMuPDF. The script below extracts and saves only images, but I need to save charts that are not images. Apparently they are stored differently in a PDF.

import io
import os

import fitz  # PyMuPDF
from PIL import Image

pdf_path = 'path to pdf'
output_folder = 'your output folder'
os.makedirs(output_folder, exist_ok=True)

doc = fitz.open(pdf_path)
chart_count = 0

# Extract only the embedded raster images on the first page
page = doc.load_page(0)
img_list = page.get_images(full=True)
for img_index, img in enumerate(img_list):
    base_image = doc.extract_image(img[0])  # img[0] is the image xref
    image_bytes = base_image["image"]
    image = Image.open(io.BytesIO(image_bytes))
    image_path = os.path.join(output_folder, f"chart_{chart_count + 1}.png")
    image.save(image_path)
    chart_count += 1

This works well only for raster images in the PDF, not for vector charts. Do you have any suggestions or solutions?

Sample PDF file (where you can see that not all charts are extracted)


asked Feb 3 at 12:38 by ravshanovbek

1 Answer

As you have correctly observed, a PDF page is made up of different components. Some are areas of colour, others are text, and some may be JPEG images. When we strip the background paper colours, the first 6 pages match that description well.

On the chart-like pages there are floating images and floating text characters. Any page colours or linework are entirely separate sub-page objects.

Moving on to the ones you hoped would be handled differently: these are either images or simply parts of the page, so they are not independent graphics that can be extracted as such.

Thus, to extract the objects in an area, you must either gather them by their co-ordinates within your Region of Interest (ROI) or redact everything else from the page.
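One way to gather objects by co-ordinates is to read the bounding boxes of the page's vector paths and merge neighbouring boxes into candidate chart regions. The following is a minimal sketch rather than a definitive implementation: it assumes PyMuPDF's `page.get_drawings()`, and the helper names (`boxes_touch`, `merge_boxes`, `find_vector_regions`) as well as the `gap` and `min_size` thresholds are hypothetical choices made for illustration.

```python
def boxes_touch(a, b, gap=5.0):
    """True if boxes a and b (x0, y0, x1, y1) overlap or lie within `gap` points."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def merge_boxes(boxes, gap=5.0):
    """Repeatedly union boxes that touch until no further merge is possible."""
    merged = [tuple(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        result = []
        for box in merged:
            for i, other in enumerate(result):
                if boxes_touch(box, other, gap):
                    # Replace the stored box with the union of both boxes
                    result[i] = (min(box[0], other[0]), min(box[1], other[1]),
                                 max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged


def find_vector_regions(pdf_path, page_number=0, gap=5.0, min_size=50.0):
    """Return merged bounding boxes of vector drawings on one page.

    `gap` and `min_size` (in points) are illustrative guesses, not tuned values.
    """
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        # Each drawing dict carries the bounding rectangle of one vector path
        boxes = [tuple(d["rect"]) for d in page.get_drawings()]
    regions = merge_boxes(boxes, gap)
    # Drop tiny regions such as rules, bullets, or underlines
    return [r for r in regions
            if r[2] - r[0] >= min_size and r[3] - r[1] >= min_size]
```

A full-page background fill would absorb every other box during merging, so such boxes probably need to be filtered out first, in the same spirit as stripping the background paper colours.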

PyMuPDF is good at redaction, so trim away everything on the page outside the ROI using redaction boxes along X and Y.
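The trimming step could look roughly like the sketch below. It assumes PyMuPDF's `add_redact_annot()` and `apply_redactions()`; the function names `outside_boxes` and `redact_outside` and the `roi` tuple layout `(x0, y0, x1, y1)` are hypothetical. How much vector linework a redaction removes depends on the PyMuPDF version and its options, so the output should be checked by eye.

```python
def outside_boxes(page_box, roi):
    """Return up to four boxes (x0, y0, x1, y1) covering the page outside roi."""
    px0, py0, px1, py1 = page_box
    rx0, ry0, rx1, ry1 = roi
    candidates = [
        (px0, py0, px1, ry0),  # strip above the ROI (full page width)
        (px0, ry1, px1, py1),  # strip below the ROI (full page width)
        (px0, ry0, rx0, ry1),  # strip to the left of the ROI
        (rx1, ry0, px1, ry1),  # strip to the right of the ROI
    ]
    # Skip empty strips, e.g. when the ROI touches a page edge
    return [b for b in candidates if b[2] > b[0] and b[3] > b[1]]


def redact_outside(pdf_path, page_number, roi, out_path):
    """Redact everything outside roi on one page and save a copy of the PDF."""
    import fitz  # PyMuPDF; imported here so outside_boxes works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        for box in outside_boxes(tuple(page.rect), roi):
            page.add_redact_annot(fitz.Rect(box))
        page.apply_redactions()  # permanently removes the marked content
        doc.save(out_path)
```

The ROI passed in could come from inspecting the page in a viewer or from any automatic region detection; the co-ordinates are in PDF points with the origin at the top-left, as PyMuPDF reports them.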

Then, once all the surrounding data is deleted, make the remaining text a single colour for ease of viewing.

The end result of editing with MuPDF can thus be a single-page PDF containing only the retained and edited area.

Finally, you can reduce the page size to whatever dimensions you want.
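As a rough illustration of these last two steps, the sketch below copies one region into a new single-page PDF sized to that region, keeping the content as vectors, and can optionally render it to a PNG. It assumes PyMuPDF's `show_pdf_page()` and `get_pixmap()`; `pad_box`, `export_region`, the `margin` default, and the 3x zoom factor are hypothetical choices, not part of the original answer.

```python
def pad_box(roi, margin, page_box):
    """Grow roi (x0, y0, x1, y1) by margin on each side, clamped to page_box."""
    return (max(roi[0] - margin, page_box[0]),
            max(roi[1] - margin, page_box[1]),
            min(roi[2] + margin, page_box[2]),
            min(roi[3] + margin, page_box[3]))


def export_region(pdf_path, page_number, roi, out_pdf, out_png=None, margin=5.0):
    """Save one page region as a small vector PDF and, optionally, a PNG."""
    import fitz  # PyMuPDF; imported here so pad_box works without it

    with fitz.open(pdf_path) as src:
        page = src[page_number]
        clip = fitz.Rect(pad_box(roi, margin, tuple(page.rect)))

        out = fitz.open()  # new, empty PDF
        new_page = out.new_page(width=clip.width, height=clip.height)
        # Place only the clipped part of the source page onto the new page
        new_page.show_pdf_page(new_page.rect, src, page_number, clip=clip)
        out.save(out_pdf)
        out.close()

        if out_png:
            # Matrix(3, 3) renders at three times the default 72 dpi
            page.get_pixmap(clip=clip, matrix=fitz.Matrix(3, 3)).save(out_png)
```

Note that clipping only hides content outside the region in the new page; unlike redaction, it does not necessarily delete the underlying objects from the file.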

Writing a custom page editor for each page would take too much code, so I simply cut and paste using MuPDF's mutool and Notepad, which is far easier.


extract vector charts from PDF - Stack Overflow
