java - Apache Tika upgrade from 2.6.0 to 3.0.0 content extraction includes document information

I have a project that is using Apache Tika 2.6.0 and want to upgrade to 3.0.0 for performance improvements.

The upgrade itself was simple: I did not have to change or refactor any code, and everything still runs. However, content extraction behaves differently in 3.0.0 than in 2.6.0: the output now includes information about the document itself, namely the names of the files inside the document package (for example word/document.xml), alongside the text. I have tried several approaches to parsing the document content, and each one produces the same result. For context, I am testing with a very simple Word document.

2.6.0 Parse Result

This is a word document with some nonsensical text that makes no sense. Simple Table

Text Here

Why Not

· A very important point

· Another important point

· No one cares about this point

3.0.0 Parse Result

[Content_Types].xml

_rels/.rels

word/document.xml This is a word document with some nonsensical text that makes no sense. Simple Table Text Here Why Not A very important point Another important point No one cares about this point

word/_rels/document.xml.rels

word/theme/theme1.xml

word/settings.xml

word/numbering.xml

word/styles.xml

word/webSettings.xml

word/fontTable.xml

docProps/core.xml

docProps/app.xml
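
For comparison: a .docx file is a zip container, and the names in the 3.0.0 output ([Content_Types].xml, word/document.xml and so on) are the usual entry names of such a container. Below is a minimal sketch, using only java.util.zip, that lists a container's entries so they can be checked against the output above. The class name DocxEntryLister and both helper methods are hypothetical names made up for this illustration, and the sample archive is built in memory as a stand-in for the real test document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika: lists the entries of a zip container.
public class DocxEntryLister {

    // Returns the entry names of a zip stream in the order they are stored.
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Builds a tiny in-memory zip with the given entry names.
    // Assumption: this stands in for a real .docx, which would be read from disk instead.
    public static byte[] sampleArchive(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = sampleArchive("[Content_Types].xml", "_rels/.rels", "word/document.xml");
        for (String name : listEntries(new ByteArrayInputStream(archive))) {
            System.out.println(name);
        }
    }
}
```

Running the same listEntries call on the real test document (for example with a FileInputStream) would show whether its entry list matches the names printed in the 3.0.0 result.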

Implementation

Here is the code I am using. It has not changed between the two versions.

String content;
// parser and bodyContentHandler are created elsewhere; the call is identical on 2.6.0 and 3.0.0
try (InputStream in = inputStream)
{
   parser.parse(in, bodyContentHandler, new Metadata(), new ParseContext());
   content = bodyContentHandler.toString();
}

I have also tried other ways of parsing, such as new Tika().parseToString(inputStream, new Metadata()), but as mentioned above, I get the same result.

Has something changed between these two versions, or is this a known issue with a workaround? Any help or tips are appreciated.

Packages and Versions Being Used

tika-core: 3.0.0

tika-parsers-standard-package: 3.0.0
