
python - How to Extract and Map JSON Values to PDF Coordinates - Stack Overflow


I'm working on a project where I need to extract data from a PDF and map it to a structured JSON format. Additionally, I want to map each JSON value to its corresponding position (bounding box coordinates) in the PDF for quality control purposes. This will allow users to easily reference where each value is located in the PDF.

Here's what I've done so far:

Converted PDF to JSON: I used Gemini to get JSON data from the PDF.

Extracted Text and Positions: I'm using PyMuPDF (fitz) to extract text along with its bounding box coordinates from the PDF.

Mapped JSON Values to Positions: I'm attempting to map the extracted text to my JSON values using a fuzzy matching algorithm (Levenshtein distance).

Integrated with JSON: I aim to include these positions in my JSON output.

However, I'm facing an issue where the positions are still coming out empty in the JSON output. I suspect the fuzzy matching isn't finding suitable matches between the extracted text and the JSON values.

Is there any other way to solve this issue? Can I directly prompt an LLM to return coordinates while it converts the PDF to JSON? Hand-written functions for extracting the coordinates have not worked so far.

asked Mar 27 at 7:04 by Mandvi Shukla
  • Looks like you forgot to add the code you're struggling with to your question. Please provide a Minimal Reproducible Example. – Adon Bilivit, commented Mar 27 at 9:36

1 Answer


To extract data from a PDF and map each value to its bounding box coordinates for JSON integration, follow these steps:

1. Extract Text with Bounding Boxes Using PyMuPDF:

Use PyMuPDF to extract each text span together with its bounding box coordinates:

import fitz  # PyMuPDF

# Open the PDF
doc = fitz.open('your_document.pdf')

# Iterate through each page
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    blocks = page.get_text('dict')['blocks']
    for block in blocks:
        for line in block.get('lines', []):
            for span in line.get('spans', []):
                text = span['text']
                bbox = span['bbox']  # (x0, y0, x1, y1)
                print(f'Text: {text}, BBox: {bbox}')

This script prints each text span with its bounding box (in PDF points, measured from the top-left corner of the page), which can then be used in the matching step.
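For the matching step it is usually more convenient to collect the spans into a flat list than to print them. The following is a minimal sketch, assuming the nested blocks/lines/spans shape that `page.get_text('dict')` returns. The helper name `flatten_spans` and the sample page data are hypothetical and hand-written, so the snippet runs without a PDF.

```python
def flatten_spans(page_dict, page_number):
    """Collect non-empty text spans from one page dict into flat records."""
    records = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so .get() skips them.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if text:
                    records.append({
                        "page": page_number,
                        "text": text,
                        "bbox": tuple(span["bbox"]),  # (x0, y0, x1, y1)
                    })
    return records


# Hand-written sample that mimics the structure of page.get_text('dict').
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "Invoice No:", "bbox": (72.0, 90.0, 130.0, 102.0)},
            {"text": " INV-1042 ", "bbox": (135.0, 90.0, 190.0, 102.0)},
        ]}]},
        {"type": 1, "bbox": (72.0, 200.0, 300.0, 400.0)},  # image block
    ]
}

spans = flatten_spans(sample_page, page_number=0)
for record in spans:
    print(record)
```

With a real document, the same function could be called once per page with `page.get_text('dict')` as the first argument. Keeping the page number in each record lets the final JSON point to the right page as well as the right rectangle.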

2. Enhance Fuzzy Matching:

To improve the accuracy of mapping extracted text to JSON values:

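  • Use a Dedicated Library: A library such as FuzzyWuzzy (now maintained as TheFuzz) offers several scoring strategies beyond raw Levenshtein distance.
  from fuzzywuzzy import process

  # Text spans collected from the PDF in step 1 (sample values)
  list_of_extracted_texts = ['Invoice No:', 'INV-1042', 'Total: 42.00']

  # Example: matching a single JSON value against the extracted texts
  match = process.extractOne('INV-1042', list_of_extracted_texts)
  print(match)  # (best_match, score)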

extractOne returns the best match along with a similarity score between 0 and 100.

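  • Set Similarity Thresholds: Define a similarity score threshold to filter out poor matches. If the threshold is too strict, every candidate is rejected and the positions stay empty, so log the best score for each value while tuning it.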

  • Incorporate Context: Use surrounding text or structural elements from the PDF to provide additional context for each value, aiding in disambiguation.
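The points above can be sketched with the standard library alone. This example swaps in `difflib.SequenceMatcher` for FuzzyWuzzy so that it runs without extra packages, and it rests on two assumptions that are not in the question: text is normalized (lowercased, whitespace collapsed) before scoring, and a JSON value may be split across several consecutive spans, so runs of up to `max_window` spans are joined and their boxes merged. The names `normalize`, `union_bbox`, `find_bbox` and the sample spans are hypothetical, and the spans are assumed to come from a single page in reading order.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting noise does not hurt the score."""
    return " ".join(text.lower().split())


def union_bbox(boxes):
    """Smallest rectangle covering all given (x0, y0, x1, y1) boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def find_bbox(value, spans, threshold=0.85, max_window=4):
    """Return (bbox, score) for the best run of consecutive spans, or None below the threshold."""
    target = normalize(str(value))
    best = None
    for start in range(len(spans)):
        for size in range(1, max_window + 1):
            window = spans[start:start + size]
            if len(window) < size:
                break  # ran past the end of the span list
            candidate = normalize(" ".join(s["text"] for s in window))
            score = SequenceMatcher(None, target, candidate).ratio()
            if best is None or score > best[1]:
                best = (union_bbox([s["bbox"] for s in window]), score)
    if best is not None and best[1] >= threshold:
        return best
    return None


spans = [
    {"text": "Customer:", "bbox": (72, 100, 120, 112)},
    {"text": "Jane", "bbox": (125, 100, 150, 112)},
    {"text": "Doe", "bbox": (153, 100, 175, 112)},
    {"text": "Total: 42.00", "bbox": (72, 130, 140, 142)},
]

print(find_bbox("Jane Doe", spans))   # value split across two spans
print(find_bbox("Acme Corp", spans))  # no close match, so None
```

Returning `None` instead of a weak match keeps bad coordinates out of the output. The 0.85 threshold is only a starting point and would need tuning against real documents.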

3. Integrate Coordinates into JSON:

Once accurate matches are established, integrate the bounding box coordinates into your JSON structure:

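import json

# Bounding box of the matched span (placeholder coordinates)
x0, y0, x1, y1 = 72.0, 90.0, 130.0, 102.0

# Example JSON structure
data = {
    'field_name': {
        'value': 'extracted_value',
        'bbox': [x0, y0, x1, y1]  # JSON has no tuple type, so use a list
    }
}

# Convert to JSON string
json_output = json.dumps(data, indent=4)
print(json_output)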
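When the JSON produced by the LLM is nested, the same structure can be built by walking it recursively. The sketch below is one possible approach, not a fixed recipe: `annotate` is a hypothetical helper, and `locate` is a stand-in for whatever matching function is used (here a small lookup table with made-up coordinates).

```python
import json


def annotate(node, locate):
    """Replace each leaf with {'value': ..., 'bbox': ...}, recursing through dicts and lists."""
    if isinstance(node, dict):
        return {key: annotate(child, locate) for key, child in node.items()}
    if isinstance(node, list):
        return [annotate(child, locate) for child in node]
    bbox = locate(node)
    # Unmatched values keep bbox = None so gaps stay visible during review.
    return {"value": node, "bbox": list(bbox) if bbox is not None else None}


# Hypothetical lookup table standing in for the real matching step.
known_positions = {"INV-1042": (135.0, 90.0, 190.0, 102.0)}


def locate(value):
    return known_positions.get(str(value))


extracted = {"invoice": {"number": "INV-1042", "items": ["Widget"]}}
annotated = annotate(extracted, locate)
print(json.dumps(annotated, indent=4))
```

Writing `null` for values that could not be located, rather than dropping the key, makes it easy to count how many fields still lack a position.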

4. Alternative Tools:

Consider other libraries like PDFMiner for text extraction with positional information.

Note: While LLMs like Gemini can assist in converting PDFs to JSON, they typically do not provide bounding box coordinates. Relying on specialized libraries like PyMuPDF remains the most effective approach for this task.
