itext - Trying to Maintain PDF Formatting When Replacing Text?

Summary: I've been working extensively on automating text replacement in PDF documents using Java. My goal is to preserve the original formatting and structure of the PDFs. Despite trying multiple approaches, I haven't been able to achieve the desired results. I'm seeking advice on what I might be missing in my implementation.

Approaches I Have Tried

Approach 1: Using PDFBox to Extract, Edit, and Reinsert Text

    package PDFbox;
    
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    
    public class ExtractText {
        public static void main(String[] args) throws IOException {
            if (args.length < 2) {
                System.err.println("Usage: java ExtractText <pdfFilePath> <outputTextFilePath>");
                return;
            }
    
            String pdfFilePath = args[0];
            String outputTextFilePath = args[1];
    
            try (PDDocument document = PDDocument.load(new File(pdfFilePath));
                 FileWriter writer = new FileWriter(outputTextFilePath)) {
                PDFTextStripper textStripper = new PDFTextStripper();
                String text = textStripper.getText(document);
                writer.write(text);
                System.out.println("Text content extracted and saved to " + outputTextFilePath);
            }
        }
    }

Edited the Text:

Manually edited the extracted text in a text file.

Attempted to Re-insert the Text:

    package PDFbox;
    
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.font.PDType1Font;
    
    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    
    public class InsertText {
        public static void main(String[] args) throws IOException {
            if (args.length < 3) {
                System.err.println("Usage: java InsertText <pdfFilePath> <textFilePath> <outputPdfFilePath>");
                return;
            }
    
            String pdfFilePath = args[0];
            String textFilePath = args[1];
            String outputPdfFilePath = args[2];
    
            // Load the text content
            String editedText = new String(Files.readAllBytes(Paths.get(textFilePath)));
    
            try (PDDocument document = PDDocument.load(new File(pdfFilePath))) {
                PDPage page = document.getPage(0);
    
                // Modify the page content
                try (PDPageContentStream contentStream = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                    contentStream.beginText();
                    contentStream.setFont(PDType1Font.HELVETICA, 12);
                    contentStream.newLineAtOffset(25, 750);
    
                    // Split text into lines to handle line breaks
                    String[] lines = editedText.split("\n");
                    for (String line : lines) {
                        contentStream.showText(line);
                        contentStream.newLineAtOffset(0, -15); // Move to the next line
                    }
    
                    contentStream.endText();
                }
    
                // Save the updated PDF
                document.save(outputPdfFilePath);
                System.out.println("Edited text inserted and PDF saved to " + outputPdfFilePath);
            }
        }
    }
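
One detail of the loop in InsertText: `split("\n")` leaves a trailing carriage return on each line when the edited text file uses Windows line endings, and PDFBox's `showText` is known to reject control characters that the selected font cannot encode. Below is a minimal sketch of a more defensive split using only the standard library. The class and method names are illustrative and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class LineSplitter {
    /**
     * Splits text on any line break (\n, \r\n, \r, ...) via the regex \R and
     * replaces leftover ASCII control characters (for example tabs) with spaces.
     */
    public static List<String> toPrintableLines(String text) {
        List<String> result = new ArrayList<>();
        for (String line : text.split("\\R")) {
            result.add(line.replaceAll("\\p{Cntrl}", " "));
        }
        return result;
    }

    public static void main(String[] args) {
        // Mixed line endings and a tab, as might come from a hand-edited text file.
        String edited = "first line\r\nsecond\tline\nthird line";
        for (String line : toPrintableLines(edited)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Each returned line could then be passed to `contentStream.showText(line)` in place of the raw split result. This only cleans the input; it does not restore the original fonts or positions.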

Challenges:

The formatting and layout of the PDF are significantly altered after editing.
Issues with text alignment, fonts, and page breaks.
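
The page-break problem can be traced to the fixed offsets in InsertText: the text starts at y = 750 and every line moves down by 15 points, with no check against the bottom of the page. Below is a minimal sketch of the arithmetic, assuming a bottom margin of 25 points. The margin value and the class and method names are illustrative and do not come from the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class LinePagination {
    // Values taken from InsertText: first baseline at y = 750, 15 points per line.
    static final float START_Y = 750f;
    static final float LEADING = 15f;
    // Assumption: keep text at least 25 points above the bottom edge of the page.
    static final float BOTTOM_MARGIN = 25f;

    /** Number of baselines that fit between START_Y and BOTTOM_MARGIN, inclusive. */
    public static int linesPerPage() {
        return (int) Math.floor((START_Y - BOTTOM_MARGIN) / LEADING) + 1;
    }

    /** Splits the lines into per-page chunks so that no chunk runs past the margin. */
    public static List<List<String>> paginate(List<String> lines) {
        int perPage = linesPerPage();
        List<List<String>> pages = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += perPage) {
            int end = Math.min(start + perPage, lines.size());
            pages.add(new ArrayList<>(lines.subList(start, end)));
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            lines.add("line " + i);
        }
        System.out.println("Lines per page: " + linesPerPage());
        System.out.println("Pages needed for 100 lines: " + paginate(lines).size());
    }
}
```

Each chunk would then be written to its own page. This addresses overflow only, not fonts or alignment.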

Approach 2: Using iText to Extract, Edit, and Reinsert Text

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;
    
    public class ExtractTextUsingIText {
        public static void main(String[] args) {
            try {
                PdfReader reader = new PdfReader("path/to/pdf");
                String text = PdfTextExtractor.getTextFromPage(reader, 1);
                System.out.println(text);
                reader.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Edited the Text:

Manually edited the extracted text in a text file.

Attempted to Re-insert the Text:

    import com.itextpdf.text.Document;
    import com.itextpdf.text.DocumentException;
    import com.itextpdf.text.Paragraph;
    import com.itextpdf.text.pdf.PdfWriter;
    
    import java.io.FileOutputStream;
    import java.io.IOException;
    
    public class InsertTextUsingIText {
        public static void main(String[] args) {
            Document document = new Document();
            try {
                PdfWriter.getInstance(document, new FileOutputStream("path/to/edited_pdf"));
                document.open();
                document.add(new Paragraph("Edited text goes here"));
                document.close();
                System.out.println("Edited text inserted and PDF saved");
            } catch (DocumentException | IOException e) {
                e.printStackTrace();
            }
        }
    }

Challenges:

The PDF structure is disrupted after editing.
Loss of original formatting and layout.

Approach 3: Editing Binary Data

Exported PDF Content to Binary:

    package PDFbox;
    
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    
    public class ExportPDFAsBinary {
        public static void main(String[] args) throws IOException {
            if (args.length < 2) {
                System.err.println("Usage: java ExportPDFAsBinary <sourcePdfPath> <outputBinaryFilePath>");
                return;
            }
    
            String sourcePdfPath = args[0];
            String outputBinaryFilePath = args[1];
    
            try (FileInputStream fis = new FileInputStream(new File(sourcePdfPath));
                 FileOutputStream fos = new FileOutputStream(new File(outputBinaryFilePath))) {
                byte[] buffer = new byte[1024];
                int bytesRead;
                while ((bytesRead = fis.read(buffer)) != -1) {
                    fos.write(buffer, 0, bytesRead);
                }
    
                System.out.println("PDF content exported as binary data to " + outputBinaryFilePath);
            }
        }
    }
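
Note that ExportPDFAsBinary does not transform the PDF in any way: it copies the bytes unchanged, so the .binary file is the same PDF under another name. Below is a small sketch that checks this round trip with `java.nio.file.Files`, using stand-in bytes instead of a real PDF. The class and method names are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

public class BinaryRoundTrip {
    /**
     * Writes the data to a temporary ".pdf" file, copies it to ".binary" and back,
     * and reports whether the final file holds exactly the original bytes.
     */
    public static boolean roundTripPreservesBytes(byte[] data) {
        try {
            Path pdf = Files.createTempFile("sample", ".pdf");
            Path binary = Files.createTempFile("sample", ".binary");
            Path recreated = Files.createTempFile("recreated", ".pdf");
            try {
                Files.write(pdf, data);
                // Both steps are plain byte-for-byte copies.
                Files.copy(pdf, binary, StandardCopyOption.REPLACE_EXISTING);
                Files.copy(binary, recreated, StandardCopyOption.REPLACE_EXISTING);
                return Arrays.equals(data, Files.readAllBytes(recreated));
            } finally {
                Files.deleteIfExists(pdf);
                Files.deleteIfExists(binary);
                Files.deleteIfExists(recreated);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in content; a real check would read an actual PDF file instead.
        byte[] data = "%PDF-1.4 stand-in bytes".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Round trip identical: " + roundTripPreservesBytes(data));
    }
}
```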

Edited Binary Data:

    package PDFbox;
    
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    
    public class EditBinaryPDF {
        public static void main(String[] args) throws IOException {
            if (args.length < 4) {
                System.err.println("Usage: java EditBinaryPDF <binaryFilePath> <outputBinaryFilePath> <searchString> <replaceString>");
                return;
            }
    
            String binaryFilePath = args[0];
            String outputBinaryFilePath = args[1];
            String searchString = args[2];
            String replaceString = args[3];
    
            // Ensure search and replace strings are of the same length
            if (searchString.length() != replaceString.length()) {
                System.err.println("Search and replace strings must be of the same length");
                return;
            }
    
            // Read the binary file into a byte array
            byte[] binaryData = readBinaryFile(binaryFilePath);
    
            // Convert search and replace strings to byte arrays
            byte[] searchBytes = searchString.getBytes(StandardCharsets.ISO_8859_1);
            byte[] replaceBytes = replaceString.getBytes(StandardCharsets.ISO_8859_1);
    
            // Edit the binary data
            binaryData = replaceTextInBinaryData(binaryData, searchBytes, replaceBytes);
    
            // Save the edited binary data to the output file
            try (FileOutputStream fos = new FileOutputStream(new File(outputBinaryFilePath))) {
                fos.write(binaryData);
            }
    
            System.out.println("Edited binary data saved to " + outputBinaryFilePath);
        }
    
        private static byte[] readBinaryFile(String filePath) throws IOException {
            // Files.readAllBytes reads the whole file; a single InputStream.read call may return fewer bytes
            return java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(filePath));
        }
    
        private static byte[] replaceTextInBinaryData(byte[] binaryData, byte[] searchBytes, byte[] replaceBytes) {
            for (int i = 0; i <= binaryData.length - searchBytes.length; i++) {
                boolean match = true;
                for (int j = 0; j < searchBytes.length; j++) {
                    if (binaryData[i + j] != searchBytes[j]) {
                        match = false;
                        break;
                    }
                }
                if (match) {
                    System.arraycopy(replaceBytes, 0, binaryData, i, replaceBytes.length);
                    i += searchBytes.length - 1;  // Move past the replaced text
                }
            }
            return binaryData;
        }
    }

Recreated the PDF from the .binary file:

    package PDFbox;
    
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    
    public class RecreatePDFFromBinary {
        public static void main(String[] args) throws IOException {
            if (args.length < 2) {
                System.err.println("Usage: java RecreatePDFFromBinary <inputBinaryFilePath> <outputPdfPath>");
                return;
            }
    
            String inputBinaryFilePath = args[0];
            String outputPdfPath = args[1];
    
            try (FileInputStream fis = new FileInputStream(new File(inputBinaryFilePath));
                 FileOutputStream fos = new FileOutputStream(new File(outputPdfPath))) {
    
                byte[] buffer = new byte[1024];
                int bytesRead;
                while ((bytesRead = fis.read(buffer)) != -1) {
                    fos.write(buffer, 0, bytesRead);
                }
    
                System.out.println("PDF recreated from binary data at " + outputPdfPath);
            }
        }
    }

ExportPDFAsBinary: the output file was given a .binary extension, and this step worked without any problems.

RecreatePDFFromBinary also worked correctly on an unedited .binary file, but whenever the binary file had been edited, the output PDF was completely corrupted.
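
A possible contributing factor, stated as an assumption about the input files: page content in a PDF is commonly stored in compressed streams (the Flate filter, which is zlib/deflate), so the visible text does not appear as plain bytes in the file, and the cross-reference table records byte offsets that have to stay valid after any edit. The sketch below imitates such a stream with `java.util.zip` and runs the same kind of naive byte search on it. It is not a PDF parser, and the sample content string and class name are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class FlateStreamDemo {
    /** Compresses bytes with zlib/deflate, the algorithm behind PDF's Flate filter. */
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses zlib data; invalid input is reported as IllegalArgumentException. */
    public static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Naive byte search like the one in EditBinaryPDF; returns the first index or -1. */
    public static int indexOf(byte[] data, byte[] pattern) {
        outer:
        for (int i = 0; i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up content stream text that shows the same string three times.
        byte[] content = "BT /F1 12 Tf 72 720 Td (Hello World Hello World Hello World) Tj ET"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] needle = "Hello World".getBytes(StandardCharsets.ISO_8859_1);
        byte[] compressed = deflate(content);

        System.out.println("Found in plain content: " + (indexOf(content, needle) >= 0));
        System.out.println("Found in compressed bytes: " + (indexOf(compressed, needle) >= 0));
        System.out.println("Found after inflating: " + (indexOf(inflate(compressed), needle) >= 0));
    }
}
```

When a stream is compressed like this, the text only becomes searchable after decoding, which is what a PDF library does before it exposes page content.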

Any help or breakthrough would be highly appreciated. Note: I also tried the JavaScript feature in Adobe Acrobat Pro, but that did not work either. From reading the documentation, I came to the conclusion that Adobe discourages automation.

itext - Trying to Maintain PDF Formatting When Replacing Text? - Stack Overflow
