I want a function that extracts the user-readable text from an HTML string, but for each character in the output text string, I want to know its matching index in the initial HTML string. The end goal is to generate text fragment URLs based on the content of target web pages, in a web scraper.
For example:
const html = "lorem <em>ipsum</em> dolor";
const { text, indexMap } = flattenHtml(html);
// text = "lorem ipsum dolor"
// l o r e, m i p s ...
// indexMap = [ 0, 1, 2, 3, 4, 5, 10, 11, 12 ...]
Is there any module readily available, or that would allow for a simple integration / extension?
An initial approach I had was to use an ontext event in an htmlparser2 parser, but I find myself reinventing the wheel by trying to add line breaks for block tags, trimming spaces, etc.
2 Answers
The following extracts all text content from an HTML document, while keeping track of the string offset for each text content item.
Each yielded text item is trimmed in both directions, and has all whitespace sequences replaced with a single space.
const htmlTextOffsets = html => {
  // Logic to recursively get text offsets
  const offsetItems = function*(doc, offset = 0) {
    // Text nodes are our base case
    if (doc.nodeType === Node.TEXT_NODE) {
      const rawText = doc.textContent;
      const trimText = rawText.trim();
      if (!trimText) return; // Don't yield whitespace-only strings
      // Offset by leading whitespace (which is omitted from results)
      const trimLength = rawText.length - rawText.trimStart().length;
      yield {
        offset: offset + trimLength,
        text: trimText.replace(/\s+/g, ' ')
      };
      return;
    }
    // Skip comments and other non-element nodes, which have no outerHTML
    if (doc.nodeType !== Node.ELEMENT_NODE) return;
    // We're dealing with a container - increment offset by the length
    // of the opening tag, and recurse on children. The opening tag length
    // is computed by subtraction, because splitting outerHTML on innerHTML
    // breaks when the inner text also occurs inside the tag (e.g. <p>p</p>)
    const closeTag = `</${doc.tagName}>`;
    offset += doc.outerHTML.length - doc.innerHTML.length - closeTag.length;
    for (const child of doc.childNodes) {
      yield* offsetItems(child, offset);
      // For each child, increment the offset by the child's full length
      offset += (child.outerHTML || child.textContent).length;
    }
  };
  // Parse the supplied string as html (xml) and send it to the recursive logic
  const parsed = new DOMParser().parseFromString(html, 'text/xml');
  return [ ...offsetItems(parsed.documentElement) ];
};
// Example of calling `htmlTextOffsets`:
const html = `
<div class="abc">
<p>Hello1</p>
<p>
Hello2<br/>
Hello3
<ol>
<li>Item 1</li>
<li>Item 2</li>
<li>
This is an extended block of multiline text.
May you thrive and achieve all your dreams. You are beautiful.
Everyone loves you. The world is better because of you.
</li>
</ol>
</p>
</div>
`.trim();
console.log('Text with offsets:', htmlTextOffsets(html));
console.log('Flattened text:\n' + htmlTextOffsets(html).map(item => item.text).join('\n'));
An example of one of the yielded items is:
{ offset: 108, text: 'Item 2' }
This reflects the fact that what precedes the text "Item 2" is:
<div class="abc">
<p>Hello1</p>
<p>
Hello2<br/>
Hello3
<ol>
<li>Item 1</li>
<li>
which is exactly 108 characters (including whitespace).
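The items above carry one offset per text run, while the question asks for one index per output character. As a rough sketch of how that gap could be bridged, the helper below collapses whitespace the same way (trim both ends, squeeze runs to a single space) but records the source index of every character it keeps. The name collapseWithIndexMap and its baseOffset parameter are hypothetical, not part of the answer's code. It also assumes the raw text is identical to the source substring, which does not hold when the markup contains entities such as &amp;.

```javascript
// Hypothetical helper: collapse whitespace in one raw text run while
// remembering, for each kept character, its index in the original string.
// baseOffset is the position of rawText[0] in the full HTML source.
function collapseWithIndexMap(rawText, baseOffset = 0) {
  let text = '';
  const indexMap = [];
  let pendingSpaceIndex = -1; // first index of an unflushed whitespace run

  for (let i = 0; i < rawText.length; i++) {
    const ch = rawText[i];
    if (/\s/.test(ch)) {
      // Leading whitespace is dropped: only track a run once text exists
      if (text.length > 0 && pendingSpaceIndex === -1) pendingSpaceIndex = i;
      continue;
    }
    if (pendingSpaceIndex !== -1) {
      // Emit a single space, mapped to the first character of the run
      text += ' ';
      indexMap.push(baseOffset + pendingSpaceIndex);
      pendingSpaceIndex = -1;
    }
    text += ch;
    indexMap.push(baseOffset + i);
  }
  // A whitespace run still pending here is trailing, so it is never emitted
  return { text, indexMap };
}

const run = collapseWithIndexMap('  Hello2\n   Hello3 ', 100);
console.log(run.text);     // "Hello2 Hello3"
console.log(run.indexMap); // 102..107, 108, 112..117
```

Mapping the collapsed space to the first character of its run is an arbitrary choice; any index inside the run would work equally well for locating the text.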
I think this code will provide the required functionality:
function getIndexCharacterMap(str) {
  const indexMap = new Map();
  for (let i = 0; i < str.length; i++) {
    if (str[i] === ' ') continue;
    indexMap.set(i, str[i]);
  }
  return indexMap;
}
const html = "<style>body { color: red; }</style>Some content";
const text = html.replace(/<\/?[^>]+>/g, match => ' '.repeat(match.length));
console.log('Text with spaces inserted:\n' + text);
console.log('Mapped values:', [...getIndexCharacterMap(text).keys()].join(', '));
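For reference, here is a minimal sketch of how the same tag-stripping idea could produce the { text, indexMap } shape shown in the question. It reuses the regex from above, so it shares its limits: the contents of style and script elements are kept as text, entities are not decoded, and a ">" inside an attribute value would confuse it. The name flattenHtml comes from the question; the rest is illustrative only.

```javascript
// Sketch: copy every character that is not part of a tag, recording the
// index it had in the original HTML string.
function flattenHtml(html) {
  const tagPattern = /<\/?[^>]+>/g; // same pattern as in the snippet above
  let text = '';
  const indexMap = [];
  let cursor = 0;

  // Copy html[from..to) into the output, one character at a time
  const copyRange = (from, to) => {
    for (let i = from; i < to; i++) {
      text += html[i];
      indexMap.push(i);
    }
  };

  for (const match of html.matchAll(tagPattern)) {
    copyRange(cursor, match.index);         // text before this tag
    cursor = match.index + match[0].length; // skip over the tag itself
  }
  copyRange(cursor, html.length);           // text after the last tag

  return { text, indexMap };
}

const sample = flattenHtml('lorem <em>ipsum</em> dolor');
console.log(sample.text);                    // "lorem ipsum dolor"
console.log(sample.indexMap.slice(0, 9));    // [0, 1, 2, 3, 4, 5, 10, 11, 12]
```

Unlike padding tags with spaces, this keeps the output text contiguous, at the cost of storing one array entry per character.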
HTMLElement type so that you can retrieve the innerText value as a string and then break apart to a character array? Some of the answers here might aid you: stackoverflow/questions/57551589/… – majixin Commented Feb 3 at 17:25