
How to Convert a PDF Table with Thousands of Rows into JSON in React


I am working on a project where I need to convert a PDF containing a large table (thousands of rows) into a JSON array of objects. The PDF has a table with headers that should be used as keys in each JSON object, and the respective cell values should be the values. Each row of the table should be represented as one object in the JSON.

I have tried using libraries like pdf-parse and pdfjs-dist from npm, but they didn't meet my expectations for extracting the data correctly.

What is the best approach to extract the table data from the PDF in the format I need? Should I handle this processing on the frontend in React and then send the resulting JSON to the backend (which is built with Python), or should I send the PDF to the backend and handle the conversion there?


1 Answer


What is PDF

PDF is not a structured language, but instead a display-oriented format. In fact, it is even better described as a rendering engine programming language.

To render the three words "The lazy fox", the PDF-generating software can choose, among other things:

  • to draw the sentence in one stroke
  • or to draw the words one by one, with instructions to offset the starting point between each
  • it could even choose to draw it letter by letter, with no way for you (unless you are a rendering engine) to distinguish between two adjacent letters and space-separated letters
  • or have the sentence interspersed with noise, as "The lazy 36 fox", because it decided to write the page number in between
    (well, the PDF will contain The lazy <move to bottom right> 36 <come back to the page> fox)
  • did I mention it could choose not to embed the letters in the generated PDF file, but instead convert the text to curves, and curves are all you'll be able to extract from the PDF?
  • even if it uses real letters or words, it could choose to write them in reverse order, or first draw all the 'e's, then the 's's, and so on
  • hmm, if it went through a PDF editor, it could also contain the text The nice fox <move back to the start position of "nice"> <draw a white rectangle over the word "nice"> lazy

How you could deal with it

Ensure a stable input

Thus the ability to extract contents in a structured way from your PDF can vary greatly, depending on what produced the PDF.

Your first mission is to ensure you have only one stable source of PDFs.
Do not expect to build a general-purpose "any PDF containing tables to JSON" converter.

OK, let's say you accept that: you just have to squeeze the juice out of that specific PDF, and once done, you'll shelve the whole project and never touch it again (no "Manu, the engine you gave us in 2025 doesn't work anymore on the 2027 version of the PDF, can you fix it please?").

Determine the firepower needed

Your best bet then will be to try tools, starting from the simplest ones.

Level 1

First try PDF-to-text extractors (like pdf-parse; but please give an excerpt of its output!),
but don't count on them to output a pretty table;
instead try to find a pattern in the output:
if your output looks like:

col1
col2
col3
col1
col2
col3
pagenumber
col1
col2
col3

then you're good to go with some loops, parsing, detection and steering.
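
For instance, here is a minimal TypeScript sketch of those loops. It assumes the dump really does put one cell per line, that the header names are known up front, and that page numbers appear as bare integers on their own line; all of those are assumptions you would need to confirm against your own PDF, not a general recipe:

```typescript
// Minimal sketch: turn a line-per-cell text dump into an array of row objects.
// Assumes: headers are known, cells appear in column order, and page numbers
// are bare integers on their own line (adapt the filter to your PDF).
const headers = ['col1', 'col2', 'col3']; // hypothetical header names

function linesToRows(text: string): Record<string, string>[] {
  const lines = text
    .split('\n')
    .map((l) => l.trim())
    .filter((l) => l.length > 0)
    .filter((l) => !/^\d+$/.test(l)); // crude page-number filter

  const rows: Record<string, string>[] = [];
  for (let i = 0; i + headers.length <= lines.length; i += headers.length) {
    const row: Record<string, string> = {};
    headers.forEach((h, j) => {
      row[h] = lines[i + j];
    });
    rows.push(row);
  }
  return rows;
}

// Usage: feed it the raw text produced by a PDF-to-text extractor.
// const json = JSON.stringify(linesToRows(rawText), null, 2);
```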

Be warned that you may have some manual iterations to do,
for example if the table's data is hardly distinguishable from the page numbers or headers or footers,
or if the table contains multi-line cells:

col1
col2
second line of col2 that you could mistake for a col3
col3

Then this would be a cycle of "parse PDF to a .txt -> regex to JSON -> verify consistency -> if it fails, edit the .txt -> regex to JSON -> verify -> […]".
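
For the "verify consistency" step, here is a tiny sketch of the kind of check you could run on the parsed rows; the expected row count and the required keys are just whatever you happen to know about your table, so treat this purely as an illustration:

```typescript
// Minimal consistency check for the parsed rows (illustrative only).
// Assumes `rows` comes from the previous parsing step and that you know
// roughly how many rows the table should contain.
interface RowCheckOptions {
  expectedRowCount?: number; // e.g. a count you know from the PDF itself
  requiredKeys: string[];    // the table headers
}

function verifyRows(rows: Record<string, string>[], opts: RowCheckOptions): string[] {
  const problems: string[] = [];
  if (opts.expectedRowCount !== undefined && rows.length !== opts.expectedRowCount) {
    problems.push(`expected ${opts.expectedRowCount} rows, got ${rows.length}`);
  }
  rows.forEach((row, i) => {
    for (const key of opts.requiredKeys) {
      if (!(key in row) || row[key].trim() === '') {
        problems.push(`row ${i}: missing or empty "${key}"`);
      }
    }
  });
  return problems; // an empty array means the extraction looks consistent
}
```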

This would be the most efficient solution,
depending on what your PDF's guts look like, of course.

Level 2

Level 2 would be to parse the PDF instructions (pdfjs-dist may be good at it) to detect the "pen moves" between text tokens, and then place each token on a map, knowing that items at the same ordinate (y) with successive abscissas (x) are adjacent words, or cells.
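
A rough sketch of what that could look like with pdfjs-dist follows; the import path and the 1-point row tolerance are assumptions, so check them against the version you actually install:

```typescript
// Level 2 sketch: group pdfjs-dist text items into rows by their y coordinate.
// transform[4] / transform[5] are the x / y translation of each text item.
import { getDocument } from 'pdfjs-dist'; // import path may differ per version/bundler

async function extractRows(data: Uint8Array): Promise<string[][]> {
  const doc = await getDocument({ data }).promise;
  const rows: string[][] = [];

  for (let p = 1; p <= doc.numPages; p++) {
    const page = await doc.getPage(p);
    const content = await page.getTextContent();

    // Bucket items that share (roughly) the same y into one row.
    const buckets = new Map<number, { x: number; str: string }[]>();
    for (const item of content.items) {
      if (!('str' in item)) continue; // skip marked-content items
      const x = item.transform[4];
      const y = Math.round(item.transform[5]); // ~1pt tolerance, adjust as needed
      if (!buckets.has(y)) buckets.set(y, []);
      buckets.get(y)!.push({ x, str: item.str });
    }

    // The top of the page has the larger y, so sort rows descending,
    // then sort each row's cells left to right.
    const ys = [...buckets.keys()].sort((a, b) => b - a);
    for (const y of ys) {
      const cells = buckets.get(y)!.sort((a, b) => a.x - b.x).map((c) => c.str);
      rows.push(cells);
    }
  }
  return rows;
}
```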

But I'm not sure it's worth the effort, and then you could go to…

Level 3

If you need a fully automated workflow that Level 1 can't provide (for your specific PDF),
then you could use pdfjs-dist to render the PDF to images, pushing those images to table-aware OCR software that would output something better suited to the final "regex to JSON" step of Level 1.
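
A browser-side sketch of that rendering step could look like this; the /ocr endpoint is purely hypothetical, standing in for whatever table-aware OCR service you end up using:

```typescript
// Level 3 sketch: rasterize one PDF page with pdfjs-dist and ship the image
// to a table-aware OCR service. The "/ocr" endpoint is hypothetical.
import { getDocument } from 'pdfjs-dist'; // import path may differ per version/bundler

async function ocrPage(data: Uint8Array, pageNumber: number): Promise<unknown> {
  const doc = await getDocument({ data }).promise;
  const page = await doc.getPage(pageNumber);

  const viewport = page.getViewport({ scale: 2 }); // higher scale = better OCR input
  const canvas = document.createElement('canvas');
  canvas.width = viewport.width;
  canvas.height = viewport.height;
  const ctx = canvas.getContext('2d')!;

  await page.render({ canvasContext: ctx, viewport }).promise;

  const blob: Blob = await new Promise((resolve) =>
    canvas.toBlob((b) => resolve(b!), 'image/png'),
  );

  const form = new FormData();
  form.append('page', blob, `page-${pageNumber}.png`);
  const response = await fetch('/ocr', { method: 'POST', body: form }); // hypothetical endpoint
  return response.json(); // whatever structure the OCR service returns
}
```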
