extract vector charts from PDF - Stack Overflow

I can't find a way to isolate or extract vector charts and graphs (ones that are not images) from a PDF.

I tried extracting them directly, but I realized it is not that straightforward. I was using PyMuPDF. This script extracts and saves only images, but I need to save charts that are not images. Apparently they are stored differently in a PDF.

import io
import os

import fitz  # PyMuPDF
from PIL import Image

pdf_path = 'path to pdf'
output_folder = 'your output folder'
os.makedirs(output_folder, exist_ok=True)

doc = fitz.open(pdf_path)
chart_count = 0

page = doc.load_page(0)
# get_images() lists only embedded raster images, not vector drawings
img_list = page.get_images(full=True)
for img_index, img in enumerate(img_list):
    base_image = doc.extract_image(img[0])  # img[0] is the image xref
    image_bytes = base_image["image"]
    image = Image.open(io.BytesIO(image_bytes))
    image_path = os.path.join(output_folder,
        f"chart_{chart_count+1}.png")
    image.save(image_path)
    chart_count += 1

This works well only for raster images in the PDF, not for vector charts. Do you have any suggestions or solutions?

Sample PDF file (where you can see that not all charts are being extracted)

Asked Feb 3 at 12:38 by ravshanovbek

1 Answer

You have described it correctly: a PDF page is made up of separate components. Some are areas of colour, others are text, and some may be JPEG images. When we strip the background paper colours, the first 6 pages match that description well.

The chart-like pages consist of floating images and floating text characters. Any page colours or linework are totally separate sub-page objects.

Moving on to the charts you hoped would be handled differently: these are either images or simply parts of a page, and thus not independent graphics that can be extracted.

Thus, to extract the objects in an area, you must either gather them by their co-ordinates within your Region of Interest (ROI) or redact everything else from the page.

PyMuPDF is good at redaction, so trim everything on the page outside the Region of Interest using redaction boxes defined by X and Y co-ordinates.

Then, once all the surrounding data is deleted, make sure the remaining text is one colour for ease of viewing.

The end result of editing with MuPDF can thus be a single-page PDF of the retained and edited area.

Finally, you can reduce the page size to whatever you want it to be.

The code for a custom editor for each page would be too large for me to write, so I simply cut and paste using mutool and Notepad, which is far easier.
