I have a project that uses Apache Tika 2.6.0, and I want to upgrade to 3.0.0 for the performance improvements.
The upgrade itself was simple: I did not have to change or refactor any code, and everything runs as before. However, content extraction behaves differently in 3.0.0 than in 2.6.0: the output now includes the internal file names of the document package (such as word/document.xml) alongside the text. I have tried several approaches to parsing the document content, and each one produces the same result. For context, I am testing with a very simple Word document.
2.6.0 Parse Result
This is a word document with some nonsensical text that makes no sense. Simple Table
Text Here
Why Not
· A very important point
· Another important point
· No one cares about this point
3.0.0 Parse Result
[Content_Types].xml
_rels/.rels
word/document.xml This is a word document with some nonsensical text that makes no sense. Simple Table Text Here Why Not A very important point Another important point No one cares about this point
word/_rels/document.xml.rels
word/theme/theme1.xml
word/settings.xml
word/numbering.xml
word/styles.xml
word/webSettings.xml
word/fontTable.xml
docProps/core.xml
docProps/app.xml
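For comparison, the names in the 3.0.0 output look like the entries of the ZIP package that a .docx file is stored as, which suggests the file may be getting handled as a generic archive rather than as a Word document. The following is a minimal sketch that lists such entries using only java.util.zip, with no Tika involved. The class name DocxEntryLister and the in-memory sample package are hypothetical and exist only for illustration; the sample is not a valid Word file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika: lists the entries of a ZIP-based package.
public class DocxEntryLister {

    /** Returns the names of all entries in a ZIP stream, in stored order. */
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Builds a tiny stand-in package in memory (illustrative only, not a real .docx). */
    public static byte[] samplePackage() throws IOException {
        String[] parts = {"[Content_Types].xml", "_rels/.rels", "word/document.xml"};
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : parts) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Print one entry name per line, similar in shape to the 3.0.0 output above.
        for (String name : listEntries(new ByteArrayInputStream(samplePackage()))) {
            System.out.println(name);
        }
    }
}
```

Passing a FileInputStream for the real test document to listEntries instead of the sample bytes would show whether its entry names match the list above.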
Implementation
Here is the code I use for extraction; it is unchanged between the two versions.
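String content;
// parser and bodyContentHandler are created elsewhere (not shown)
try
{
    parser.parse(inputStream, bodyContentHandler, new Metadata(), new ParseContext());
    content = bodyContentHandler.toString();
}
finally
{
    // close the stream even if parsing fails
    inputStream.close();
}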
I have also tried other parsing options, such as new Tika().parseToString(inputStream, new Metadata()), but, as mentioned, I get the same result.
Has something changed between these versions, or is this a known issue with a workaround? Any help or tips are appreciated.
Packages and Versions Being Used
tika-core: 3.0.0
tika-parsers-standard-package: 3.0.0