I am searching for a JavaScript library that can read .doc and .docx files. The focus is only on the text content. I am not interested in pictures, formulas, or other special structures in the MS Word file.
It would be great if the library worked with the JavaScript FileReader, as shown in the code below.
function readExcel(currfile) {
    var reader = new FileReader();
    reader.onload = (function (_file) {
        return function (e) {
            //here should the magic happen
        };
    })(currfile);
    reader.onabort = function (e) {
        alert('File read canceled');
    };
    reader.readAsBinaryString(currfile);
}
I searched through the internet, but I could not get what I was looking for.
Asked Jun 22, 2017 at 12:01 by Torben; edited Aug 4, 2018 at 23:45 by halfer.
I'm not aware of any JS libraries that can display doc/docx contents on the front end only. But if you fetch these files from a backend, you can extract the text content of doc/docx files in the backend before sending it to the front end, for example with Apache Tika's Tika#parseToString() method. – Dat Nguyen, Jun 22, 2017 at 12:14
Thanks for your reply, but my backend is Microsoft Dynamics NAV, so your solution sadly does not work for me. As further information, it has to be a JS add-in for NAV. – Torben, Jun 22, 2017 at 13:21
2 Answers
You can use docxtemplater for this (even though it is normally used for templating, it can also just get the text of the document):
// `content` is the binary content of the .docx file,
// e.g. e.target.result in the FileReader onload handler
const zip = new PizZip(content);
// This will parse the template, and will throw an error if the template is
// invalid, for example, if the template is "{user" (no closing tag)
const doc = new Docxtemplater(zip, {
    paragraphLoop: true,
    linebreaks: true,
});
const text = doc.getFullText();
See the documentation for installation information (I'm the maintainer of this project).
However, it only handles .docx, not .doc.
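Since only .docx is handled here, it can help to check which container format a file actually uses before parsing it. The following is a minimal sketch, not part of docxtemplater: the function names `startsWith` and `detectWordFormat` are hypothetical. It relies on the fact that a .docx file is a ZIP archive (starting with the bytes "PK\x03\x04"), while a legacy .doc file is an OLE2 compound file with its own 8-byte signature. Other Office formats share these containers, so the result is only a hint.

```javascript
// Leading bytes ("magic numbers") of the two container formats.
// ZIP local file header, used by .docx (and also .xlsx, .pptx, plain .zip).
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04];
// OLE2 compound file header, used by legacy .doc (and also .xls, .ppt).
const OLE_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// Returns true if `bytes` (array or Uint8Array) begins with `signature`.
function startsWith(bytes, signature) {
    if (bytes.length < signature.length) return false;
    return signature.every(function (value, i) {
        return bytes[i] === value;
    });
}

// Hypothetical helper: guesses the Word container format from the first bytes.
// Returns 'docx' for ZIP-based files, 'doc' for OLE2 files, else 'unknown'.
function detectWordFormat(bytes) {
    if (startsWith(bytes, ZIP_SIGNATURE)) return 'docx';
    if (startsWith(bytes, OLE_SIGNATURE)) return 'doc';
    return 'unknown';
}
```

In the browser, the bytes could come from a FileReader that uses readAsArrayBuffer, by wrapping the result as new Uint8Array(e.target.result) inside the onload handler; only a file detected as 'docx' would then be passed on to the docx parser.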
You can now extract the text content from doc/docx files without installing external dependencies.
You can use the Node library called any-text.
Besides doc and docx, it currently supports a number of other file types, such as PDF, XLSX, XLS, and CSV.
Usage is very simple:
- Install the library as a dependency (or dev-dependency):
npm i -D any-text
- Use the getText method to read the text content:
var reader = require('any-text');
reader.getText(`path-to-file`).then(function (data) {
    console.log(data);
});
- You can also use the async/await notation:
var reader = require('any-text');
// await is only valid inside an async function (or an ES module)
(async function () {
    const text = await reader.getText(`path-to-file`);
    console.log(text);
})();
Sample test:
var reader = require('any-text');
const chai = require('chai');
const expect = chai.expect;
describe('file reader checks', () => {
    it('check docx file content', async () => {
        expect(
            await reader.getText(`${process.cwd()}/test/files/dummy.doc`)
        ).to.contains('Lorem ipsum');
    });
});
I hope it will help!