
java - Preserve Empty Columns When Extracting Tables from PDF - Stack Overflow


I have 25–30 different types of PDF documents, each containing tables with varying structures. My ultimate goal is to extract the table data under specific headings (i.e., between certain titles) and convert it to JSON in Java, mapping each column to a specific field.

However, I'm running into a major issue with Apache PDFBox: if a table cell/column is empty, there’s no whitespace or placeholder in the PDF text. As a result, the extracted text merges adjacent columns, destroying the table structure. In other words, PDFBox doesn’t insert any spacing for empty columns, so I lose the column layout in the output text.

I’ve tried feeding the raw PDF into ChatGPT with the temperature set to zero, hoping it would parse the table into JSON consistently, but the results vary from run to run. Because of this inconsistency, I prefer a text-based approach: extract consistent text first, then feed that text into AI (or rule-based logic) to transform it into JSON. But if the text extraction tool doesn’t preserve columns for empty cells, it becomes nearly impossible to map columns reliably.

Here is an example of a row with missing columns. Logically, the row contains:

"TOPLAM missing_column missing_column missing_column 70,000,000.00 70,427,300.00 4.88"

It should be extracted as:

"TOPLAM       70,000,000.00 70,427,300.00 4.88"

or

TOPLAM | [empty] | [empty] | [empty] | 70,000,000.00 | 70,427,300.00 | 4.88
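
As a minimal sketch of how the second output format could be produced, the snippet below assigns text fragments to columns by their x coordinate and fills untouched columns with a placeholder. It assumes each fragment's left edge is already known (PDFBox exposes such coordinates through `TextPosition`, for example `getXDirAdj()` inside an overridden `PDFTextStripper.writeString`). The `Fragment` type, the `ColumnBucketer` class, and all coordinate values are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnBucketer {

    /** Hypothetical holder: a piece of text plus the x coordinate of its left edge. */
    static final class Fragment {
        final String text;
        final double x;

        Fragment(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    /**
     * Puts each fragment into the last column whose start is <= the fragment's x.
     * Columns that receive no fragment are reported as "[empty]".
     * columnStarts must be sorted in ascending order.
     */
    static List<String> toCells(List<Fragment> row, double[] columnStarts) {
        String[] cells = new String[columnStarts.length];
        for (Fragment f : row) {
            int col = 0;
            for (int i = 0; i < columnStarts.length; i++) {
                if (f.x >= columnStarts[i]) {
                    col = i;
                }
            }
            // Several fragments may land in the same cell; join them with a space.
            cells[col] = (cells[col] == null) ? f.text : cells[col] + " " + f.text;
        }
        List<String> result = new ArrayList<>();
        for (String cell : cells) {
            result.add(cell == null ? "[empty]" : cell);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed column start positions for a seven-column table.
        double[] columnStarts = {40, 120, 200, 280, 360, 450, 540};
        List<Fragment> row = List.of(
                new Fragment("TOPLAM", 42),
                new Fragment("70,000,000.00", 365),
                new Fragment("70,427,300.00", 455),
                new Fragment("4.88", 560));
        // prints: TOPLAM | [empty] | [empty] | [empty] | 70,000,000.00 | 70,427,300.00 | 4.88
        System.out.println(String.join(" | ", toCells(row, columnStarts)));
    }
}
```

Matching on the left edge is a simplification: right-aligned numeric columns may need the right edge or the center of the fragment instead.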

My questions are:

  • Is there a way to configure Apache PDFBox to preserve spacing or placeholders for empty columns/cells so the structure remains intact?
  • Are there alternative Java libraries or other tools and techniques (e.g., pdfplumber, Tabula, iText, OCR) that better preserve table columns even when some columns are empty? Do I need to rely on bounding box coordinates or a PDF-to-HTML approach to keep the layout consistent?
  • Has anyone dealt with similarly “invisible” empty columns, and how did you solve it? I want to avoid manual or per-PDF custom code if possible, because I have many PDF variations.
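
One way the bounding-box idea might avoid per-PDF code is to derive column boundaries from the table's own header row rather than hard-coding them. The sketch below is only an illustration under strong assumptions: every column has a header label, the header boxes are sorted left to right, and data cells sit roughly under their header. `Box`, `HeaderColumns`, and both method names are hypothetical, not part of any library.

```java
import java.util.List;

final class HeaderColumns {

    /** Hypothetical bounding box of a piece of text: left and right x coordinates. */
    static final class Box {
        final String text;
        final double left;
        final double right;

        Box(String text, double left, double right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Places one boundary in the middle of each gap between adjacent header cells.
     * For n header cells this yields n - 1 boundaries.
     */
    static double[] boundaries(List<Box> header) {
        double[] b = new double[header.size() - 1];
        for (int i = 0; i < b.length; i++) {
            b[i] = (header.get(i).right + header.get(i + 1).left) / 2.0;
        }
        return b;
    }

    /** Returns the zero-based column index whose range contains the cell's center. */
    static int columnOf(Box cell, double[] boundaries) {
        double center = (cell.left + cell.right) / 2.0;
        int col = 0;
        while (col < boundaries.length && center >= boundaries[col]) {
            col++;
        }
        return col;
    }
}
```

Using the cell's center rather than its left edge makes the assignment less sensitive to whether a column is left- or right-aligned, but multi-line or merged header cells would need extra handling.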

If that is not possible, should I just extract the relevant sections of the PDF and feed them to a fine-tuned AI?

Any advice on how to ensure reliable table extraction where empty columns don’t break the layout would be greatly appreciated.
