node.js - Extract the readable text content from an HTML document, with a reverse "character index map"

I want a function that extracts the user-readable text from an HTML string, but for each character in the output text string, I want to know its matching index in the initial HTML string. The end goal is to generate text fragment URLs based on the content of target web pages, in a web scraper.

For example:

const html = "lorem <em>ipsum</em> dolor";
const { text, indexMap } = flattenHtml(html);

// text = "lorem ipsum dolor"

//              l  o  r  e  m     i   p   s   ...
// indexMap = [ 0, 1, 2, 3, 4, 5, 10, 11, 12, ...]

Is there any module readily available, or that would allow for a simple integration / extension?

My initial approach was to use the ontext event of an htmlparser2 parser, but I found myself reinventing the wheel: adding line breaks for block tags, trimming spaces, etc.

asked Feb 3 at 17:09 by AlexAngc, edited Feb 4 at 10:17

Have you considered using HTMLElement type so that you can retrieve the innerText value as a string and then break apart to a character array? Some of the answer here might aid you: stackoverflow/questions/57551589/… – majixin Commented Feb 3 at 17:25
How is an "html string" not already "plain text"? Do you wish to extract only the text content? – Gershom Maes Commented Feb 3 at 19:18
"Do you wish to extract only the text content?" yes, that's what I meant. I've updated the title for clarity – AlexAngc Commented Feb 4 at 10:17
"Have you considered using HTMLElement type so that you can retrieve the innerText value as a string and then break apart to a character array?" I'm not sure how this will provide the index map? – AlexAngc Commented Feb 4 at 10:20

2 Answers

The following extracts all text content from an HTML document while keeping track of the string offset of each text item.

Each yielded text item is trimmed in both directions, and has all whitespace sequences replaced with a single space.

const htmlTextOffsets = html => {
  
  // Logic to recursively get text offsets
  const offsetItems = function*(doc, offset = 0) {
    
    // Text nodes are our base case
    if (doc.nodeType === Node.TEXT_NODE) {
      const rawText = doc.textContent;
      const trimText = rawText.trim();
      if (!trimText) return; // Don't yield whitespace-only strings
      
      // Offset by leading whitespace (which is omitted from results)
      const trimLength = (rawText.length - rawText.trimStart().length);
      
      return yield {
        offset: offset + trimLength,
        text: trimText.replace(/[\s]+/g, ' ')
      };
    }
    
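    // We're dealing with a container - increment offset by the length
    // of the opening tag, and recurse on children. The opening tag length is
    // derived from the closing tag; splitting outerHTML on innerHTML breaks
    // when the inner markup also occurs in the tag, e.g. <p class="a">a</p>
    const closeTagLength = `</${doc.tagName}>`.length;
    offset += doc.outerHTML.length - doc.innerHTML.length - closeTagLength;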
    
    for (const child of doc.childNodes) {
      yield* offsetItems(child, offset);
      
      // For each child, increment the offset by the child's full length
      offset += (child.outerHTML || child.textContent).length;
    }
    
  };
  
  // Parse the supplied string as html (xml) and send it to the recursive logic
  const parsed = new DOMParser().parseFromString(html, 'text/xml');
  return [ ...offsetItems(parsed.documentElement) ];
    
};

// Example of calling `htmlTextOffsets`:
const html = `
<div class="abc">
  <p>Hello1</p>
  <p>
    Hello2<br/>
    Hello3
    <ol>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>
        This is an extended block of multiline text.
        May you thrive and achieve all your dreams. You are beautiful.
        Everyone loves you. The world is better because of you.
      </li>
    </ol>
  </p>
</div>
`.trim();
console.log('Text with offsets:', htmlTextOffsets(html));
console.log('Flattened text:\n' + htmlTextOffsets(html).map(item => item.text).join('\n'));

An example of one of the yielded items is:

{ offset: 108, text: 'Item 2' }

This reflects the fact that what precedes the text "Item 2" is:

<div class="abc">
  <p>Hello1</p>
  <p>
    Hello2<br/>
    Hello3
    <ol>
      <li>Item 1</li>
      <li>

which is exactly 108 characters (including whitespace).
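
If a per-character map is needed rather than one offset per item, the items can be expanded against the original string. The following is a minimal sketch, not part of the code above: expandToIndexMap is a hypothetical helper, it assumes each item's text matches the source at its offset apart from collapsed whitespace (so no entities such as &amp; in the text), and it marks the newline separators between items with -1 because they have no position in the source.

```javascript
// Hypothetical helper: expands { offset, text } items into a per-character map.
// Assumption: item.text equals the source text starting at item.offset, with
// every whitespace run collapsed to a single space.
function expandToIndexMap(source, items) {
  let text = '';
  const indexMap = [];

  items.forEach((item, n) => {
    if (n > 0) {
      // The separator between items has no counterpart in the source
      text += '\n';
      indexMap.push(-1);
    }

    let pos = item.offset;
    for (let i = 0; i < item.text.length; i++) {
      const ch = item.text[i];
      text += ch;
      indexMap.push(pos);

      if (ch === ' ') {
        // One output space stands for a whole whitespace run in the source
        while (pos < source.length && /\s/.test(source[pos])) pos++;
      } else {
        pos++;
      }
    }
  });

  return { text, indexMap };
}

// Usage with hand-written items in the shape yielded by htmlTextOffsets
const source = '<p>Hello   world</p><p>Bye</p>';
const expanded = expandToIndexMap(source, [
  { offset: 3, text: 'Hello world' },
  { offset: 23, text: 'Bye' },
]);
console.log(expanded.text);
console.log(expanded.indexMap);
```

Each space in the output maps to the first character of the whitespace run it replaces.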

I think this code will provide the required functionality:

function getIndexCharacterMap(str) {
  const indexMap = new Map();
  for (let i = 0; i < str.length; i++) {
    // Skip spaces (both tag padding and ordinary spaces); only non-space characters are mapped
    if (str[i] === ' ') continue;
    indexMap.set(i, str[i]);
  }
  return indexMap;
}

const html = "<style>body { color: red; }</style>Some content";
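// Blank out <script>/<style> elements (including their content) and all other
// tags with spaces of equal length, so every remaining character keeps its index
const text = html.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>|<\/?[^>]+>/gi, match => ' '.repeat(match.length));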

console.log('Text with spaces inserted:\n' + text);
console.log('Mapped values:', [...getIndexCharacterMap(text).keys()].join(', '));
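
To get the { text, indexMap } shape from the question, the tags can also be skipped while scanning instead of being padded with spaces. This is a rough sketch under stated assumptions: flattenHtml here is only an illustrative implementation, its regex tag matcher is not a real HTML parser, entities are left undecoded, and block tags do not produce line breaks.

```javascript
// Sketch only: regex-based tag skipping, not a full HTML parser.
function flattenHtml(html) {
  // Matches <script>/<style> elements with their content, comments, or a single tag
  const skip = /<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>|<!--[\s\S]*?-->|<\/?[a-zA-Z][^>]*>/gi;
  let text = '';
  const indexMap = [];
  let pos = 0;

  // Copies html[pos..end) into the output, collapsing whitespace runs
  const emitText = (end) => {
    for (; pos < end; pos++) {
      const ch = html[pos];
      if (/\s/.test(ch)) {
        // Drop leading whitespace and repeated whitespace
        if (text === '' || text.endsWith(' ')) continue;
        text += ' ';
      } else {
        text += ch;
      }
      indexMap.push(pos);
    }
  };

  for (const match of html.matchAll(skip)) {
    emitText(match.index);                 // text before the skipped span
    pos = match.index + match[0].length;   // jump over the skipped span
  }
  emitText(html.length);                   // text after the last skipped span

  // Drop a trailing collapsed space, if any
  if (text.endsWith(' ')) {
    text = text.slice(0, -1);
    indexMap.pop();
  }
  return { text, indexMap };
}

// Usage with the example from the question
const result = flattenHtml('lorem <em>ipsum</em> dolor');
console.log(result.text);
console.log(result.indexMap);
```

Because no padding is inserted, adjacent inline elements such as a<b>b</b> do not gain a space that is absent from the source.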
