I have 25–30 different types of PDF documents, each containing tables with varying structures. My ultimate goal is to extract the table data that appears under specific headings (i.e., between certain titles) and, using Java, convert it into JSON, mapping each column to a specific field.
However, I'm running into a major issue with Apache PDFBox: if a table cell/column is empty, there’s no whitespace or placeholder in the PDF text. As a result, the extracted text merges adjacent columns, destroying the table structure. In other words, PDFBox doesn’t insert any spacing for empty columns, so I lose the column layout in the output text.
I’ve tried feeding the raw PDF into ChatGPT with the temperature set to zero, hoping it would parse the table into JSON consistently, but the results vary each time. Because of this inconsistency, I prefer a text-based approach: extract consistent text first, then feed that text into AI (or other rule-based logic) to transform it into JSON. But if the text extraction tool doesn’t preserve columns for empty cells, it becomes nearly impossible to map columns reliably.
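To make the rule-based step concrete, here is a minimal sketch of what I have in mind once each row has one slot per column. It assumes rows arrive as pipe-delimited text with an `[empty]` placeholder; the class name `RowToJson` and the field names in `main` are hypothetical, and the JSON is built by hand only to keep the sketch dependency-free (the quoting does not handle control characters).

```java
import java.util.StringJoiner;

/** Sketch: map one pipe-delimited row onto named JSON fields. */
final class RowToJson {

    /** fields and cells must line up one-to-one; "[empty]" or blank becomes null. */
    static String toJson(String[] fields, String row) {
        // Split on "|" and drop the whitespace around it; -1 keeps trailing empty cells.
        String[] cells = row.split("\\s*\\|\\s*", -1);
        if (cells.length != fields.length) {
            throw new IllegalArgumentException(
                "expected " + fields.length + " cells but got " + cells.length);
        }
        StringJoiner json = new StringJoiner(", ", "{", "}");
        for (int i = 0; i < fields.length; i++) {
            String cell = cells[i].trim();
            boolean missing = cell.isEmpty() || cell.equals("[empty]");
            json.add(quote(fields[i]) + ": " + (missing ? "null" : quote(cell)));
        }
        return json.toString();
    }

    /** Minimal string quoting: escapes backslashes and double quotes only. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // Hypothetical field names for a three-column table.
        String[] fields = {"label", "description", "rate"};
        System.out.println(toJson(fields, "TOPLAM | [empty] | 4.88"));
    }
}
```

This only works if the extraction step never drops a cell, which is exactly the part I am stuck on.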
The example above shows a row with missing columns. Logically, the row contains:
"TOPLAM missing_column missing_column missing_column 70,000,000.00 70,427,300.00 4.88"
It should be extracted as:
"TOPLAM 70,000,000.00 70,427,300.00 4.88"
or
TOPLAM | [empty] | [empty] | [empty] | 70,000,000.00 | 70,427,300.00 | 4.88
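For reference, this is roughly the coordinate-based logic I imagine would produce the pipe-delimited form. It is only a sketch under assumptions: the column left edges and fragment x-coordinates in `main` are made-up numbers, and `ColumnAssigner` is a hypothetical name. With PDFBox, I believe the x-coordinates could be collected by subclassing `PDFTextStripper`, overriding `writeString(String, List<TextPosition>)`, and reading `TextPosition.getXDirAdj()`, but that part is not shown here.

```java
import java.util.Arrays;

/** Sketch: place positioned text fragments of one row into fixed columns. */
final class ColumnAssigner {

    /**
     * @param leftEdges ascending left x-coordinate of each column region
     * @param texts     text fragments found on one row
     * @param xs        starting x-coordinate of each fragment
     * @return one cell per column, with "[empty]" where no fragment landed
     */
    static String[] assign(double[] leftEdges, String[] texts, double[] xs) {
        String[] cells = new String[leftEdges.length];
        for (int i = 0; i < texts.length; i++) {
            int col = 0;
            // Pick the last column whose left edge is at or before the fragment.
            for (int c = 0; c < leftEdges.length; c++) {
                if (xs[i] >= leftEdges[c]) {
                    col = c;
                }
            }
            // Fragments that land in the same column are joined with a space.
            cells[col] = (cells[col] == null) ? texts[i] : cells[col] + " " + texts[i];
        }
        for (int c = 0; c < cells.length; c++) {
            if (cells[c] == null) {
                cells[c] = "[empty]";
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up coordinates for a seven-column table.
        double[] leftEdges = {40, 120, 200, 280, 360, 440, 520};
        String[] texts = {"TOPLAM", "70,000,000.00", "70,427,300.00", "4.88"};
        double[] xs = {42, 365, 447, 531};
        System.out.println(String.join(" | ", assign(leftEdges, texts, xs)));
        System.out.println(Arrays.toString(leftEdges));
    }
}
```

The weak point is where the column edges come from, since they differ across my 25–30 document types.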
My questions are:
- Is there a way to configure Apache PDFBox to preserve spacing or placeholders for empty columns/cells so the structure remains intact?
- Are there alternative Java libraries or techniques (e.g., PDFPlumber, Tabula, iText, or OCR) that can better preserve table columns even if some columns are empty? Do I need to rely on bounding box coordinates or a PDF-to-HTML approach to keep the layout consistent?
- Has anyone dealt with similarly “invisible” empty columns, and how did you solve it? I want to avoid manual or per-PDF custom code if possible, because I have many PDF variations.
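One idea I have for avoiding per-PDF configuration is to derive the column edges from a fully populated row, such as the header row, instead of hard-coding them. The sketch below assumes the start and end x-coordinates of each header cell are already known; `ColumnEdges` and the numbers in `main` are hypothetical, and I have not verified this against real documents.

```java
import java.util.Arrays;

/** Sketch: derive column left edges from the x-ranges of a fully populated row. */
final class ColumnEdges {

    /**
     * @param starts left x-coordinate of each header cell, left to right
     * @param ends   right x-coordinate of each header cell, left to right
     * @return left edge of each column: the first header start, then the
     *         midpoint of each gap between neighbouring header cells
     */
    static double[] fromHeader(double[] starts, double[] ends) {
        if (starts.length == 0 || starts.length != ends.length) {
            throw new IllegalArgumentException("need matching, non-empty start/end arrays");
        }
        double[] edges = new double[starts.length];
        edges[0] = starts[0];
        for (int i = 1; i < starts.length; i++) {
            // Split the gap between the previous cell's end and this cell's start.
            edges[i] = (ends[i - 1] + starts[i]) / 2.0;
        }
        return edges;
    }

    public static void main(String[] args) {
        // Made-up header cell ranges for a three-column table.
        double[] starts = {40, 130, 250};
        double[] ends = {90, 210, 300};
        System.out.println(Arrays.toString(fromHeader(starts, ends)));
    }
}
```

I am not sure how well this holds up when headers span multiple lines or when numeric columns are right-aligned, so feedback on that is welcome.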
If that is not possible, should I just extract the relevant sections of the PDF and feed them to a fine-tuned AI?
Any advice on how to ensure reliable table extraction where empty columns don’t break the layout would be greatly appreciated.