java - Apache Tika upgrade from 2.6.0 to 3.0.0 content extraction includes document information - Stack Overflow

I have a project that uses Apache Tika 2.6.0, and I want to upgrade to 3.0.0 for the performance improvements.

The upgrade itself was simple: I did not have to change or refactor any code, and everything works as is. However, content extraction behaves differently in 3.0.0 than in 2.6.0: the output now includes the names of the document's internal package files (for example [Content_Types].xml and word/document.xml) alongside the text. I have tried several approaches to parsing the document content, and each one produces the same result. For context, I am testing with a very simple Word document.

2.6.0 Parse Result

This is a word document with some nonsensical text that makes no sense. Simple Table

Text Here

Why Not

· A very important point

· Another important point

· No one cares about this point

3.0.0 Parse Result

[Content_Types].xml

_rels/.rels

word/document.xml This is a word document with some nonsensical text that makes no sense. Simple Table Text Here Why Not A very important point Another important point No one cares about this point

word/_rels/document.xml.rels

word/theme/theme1.xml

word/settings.xml

word/numbering.xml

word/styles.xml

word/webSettings.xml

word/fontTable.xml

docProps/core.xml

docProps/app.xml
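
For comparison, the file names in the 3.0.0 result match the parts of the .docx file's underlying ZIP package, so the output looks as if the document were walked as a plain archive. Below is a minimal sketch, using only java.util.zip and no Tika classes, that lists those parts. DocxEntryLister and listEntries are hypothetical names chosen for illustration, and the sketch only shows the container layout; it does not explain why Tika behaves this way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika: lists the parts of a .docx package.
class DocxEntryLister {

    // Reads the stream as a ZIP archive and returns the entry names in archive order.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path to the test Word document.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            listEntries(in).forEach(System.out::println);
        }
    }
}
```

Running this against the same test document and comparing its output with the 3.0.0 result above would show whether the extra lines are simply the package's entry names.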

Implementation

Here is the code I use to run the extraction; it is unchanged between the two versions.

String content;
// try-with-resources closes the stream even if parsing throws
try (InputStream in = inputStream)
{
   parser.parse(in, bodyContentHandler, new Metadata(), new ParseContext());
   content = bodyContentHandler.toString();
}

I have also tried other parsing options, such as new Tika().parseToString(inputStream, new Metadata()), but, as mentioned, I get the same result.

Has something changed between these versions, or is this known behavior with a workaround? Any help or tips are appreciated.

Packages and Versions Being Used

tika-core: 3.0.0

tika-parsers-standard-package: 3.0.0
