pdfbox - Why is PDFBox's PDFStreamEngine.processPage giving a wrong result? - Stack Overflow

I am trying to extract the marked content on a page, but PDFBox does not return the correctly mapped marked content for it. Here is the sample file. The formula content is marked as shown below.

MC0 is not a COSDictionary. For this reason, PDFBox cannot read the formula-related content while extracting the marked content of the page. Here is the relevant code:

public void process(Operator operator, List<COSBase> arguments) throws IOException {
    COSName tag = null;
    COSDictionary properties = null;
    Iterator var5 = arguments.iterator();

    while(var5.hasNext()) {
        COSBase argument = (COSBase)var5.next();
        if (argument instanceof COSName) {
            tag = (COSName)argument;
        } else if (argument instanceof COSDictionary) {
            properties = (COSDictionary)argument;
        }
    }

    this.context.beginMarkedContentSequence(tag, properties);
}

However, I found that MC0 is an indirect reference that resolves to nothing but MCID -34, as shown in the figure below.

How can I get the figure-related marked content when I run the code below?

PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
extractor.processPage(page);

Map<Integer, PDMarkedContent> theseMarkedContents = new HashMap<>();
markedContents.put(page, theseMarkedContents);
for (PDMarkedContent markedContent : extractor.getMarkedContents()) {
    addToMap(theseMarkedContents, markedContent);
    num++;
}

Asked Apr 2 at 4:16 by fascinating coder
  • What exactly is the "wrong result" you claim that the stream engine gives you? – mkl, commented Apr 2 at 8:00

1 Answer

It took some time to understand what the actual issue is here and what the given pieces of information in the question refer to. But indeed, there is a bug in the PDFBox BeginMarkedContentSequenceWithProperties operator processor.

The process method the OP quoted in the question turns out to be the process method of BeginMarkedContentSequenceWithProperties:

    public void process(Operator operator, List<COSBase> arguments) throws IOException
    {
        COSName tag = null;
        COSDictionary properties = null;
        for (COSBase argument : arguments)
        {
            if (argument instanceof COSName)
            {
                tag = (COSName) argument;
            }
            else if (argument instanceof COSDictionary)
            {
                properties = (COSDictionary) argument;
            }
        }
        getContext().beginMarkedContentSequence(tag, properties);
    }

The issue is that this method implicitly assumes that there is at most one name parameter and one dictionary parameter of interest to the BDC operation. This is wrong! Actually this operation is specified as tag properties BDC to

Begin a marked-content sequence with an associated property list, terminated by a balancing EMC operator. tag shall be a name object indicating the role or significance of the sequence. properties shall be either an inline dictionary representing the property list or a name object associated with it in the Properties subdictionary of the current resource dictionary (see 14.6.2, "Property lists").

(ISO 32000-2, Table 352 — Marked-content operators)

Thus, the property dictionary argument can also be given by a name. In that case the BDC operation has two name parameters of interest!

For example in the case of the OP's file:

/Formula /MC0 BDC

In this case the process implementation above drops the Formula name and instead puts the MC0 into the tag variable. Correctly, though, it should have put Formula into tag and looked up MC0 in the Properties resources to put the dictionary from there into the properties variable.
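
To make the corrected operand handling concrete, here is a minimal sketch of the resolution logic. It deliberately does not use the PDFBox classes: a String stands in for a name object (COSName), a Map stands in for a dictionary (COSDictionary), and the class name BdcOperandResolver, the method resolve, and the MCID value 34 are all made up for illustration. It is a model of the rule in the specification, not the actual PDFBox fix.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in model, not the PDFBox API: a String plays the role of a name
// object (COSName) and a Map plays the role of a dictionary (COSDictionary).
class BdcOperandResolver {

    /** Holds the resolved tag and property list of one BDC operation. */
    static final class MarkedContentStart {
        final String tag;
        final Map<String, Object> properties; // null if the name is not in Properties

        MarkedContentStart(String tag, Map<String, Object> properties) {
            this.tag = tag;
            this.properties = properties;
        }
    }

    /**
     * Resolves the operands of "tag properties BDC" by position.
     * propertiesResources models the Properties subdictionary of the
     * current resource dictionary.
     */
    static MarkedContentStart resolve(List<Object> operands,
                                      Map<String, Map<String, Object>> propertiesResources) {
        // The first operand must be the tag name; anything else is malformed.
        if (operands.size() < 2 || !(operands.get(0) instanceof String)) {
            return null;
        }
        String tag = (String) operands.get(0);
        Object second = operands.get(1);
        Map<String, Object> properties = null;
        if (second instanceof Map) {
            // Inline dictionary: use it directly.
            @SuppressWarnings("unchecked")
            Map<String, Object> inline = (Map<String, Object>) second;
            properties = inline;
        } else if (second instanceof String) {
            // Name: look the property list up in the Properties resources.
            properties = propertiesResources.get(second);
        }
        return new MarkedContentStart(tag, properties);
    }

    public static void main(String[] args) {
        Map<String, Object> mc0 = new HashMap<>();
        mc0.put("MCID", 34); // illustrative value only
        Map<String, Map<String, Object>> propertiesResources = new HashMap<>();
        propertiesResources.put("MC0", mc0);

        // Models "/Formula /MC0 BDC": the tag stays Formula, MC0 is looked up.
        MarkedContentStart start =
                resolve(Arrays.<Object>asList("Formula", "MC0"), propertiesResources);
        System.out.println(start.tag + " -> " + start.properties);
    }
}
```

The key design point is that the operands are interpreted by position rather than by type alone, so a second name operand can no longer overwrite the tag.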
