javascript - Cheerio - Get text with html tags replaced by white spaces

We're currently using Cheerio, and in particular its .text() method, to extract text from HTML input.

But when the HTML is

<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>

Visually on the page, the div after the word "By" ensures there is a space or a line break. But when we apply Cheerio's .text(), the result is wrong:

ByJohn Smith, which is wrong because we need a white space between "By" and "John".

Generally speaking, is it possible to get the text in such a way that ANY HTML tag is replaced by a white space? (I'm OK with collapsing multiple white spaces afterwards.)

We'd like the output to be: By John Smith

Asked Jul 10, 2019 at 9:22 by Mathieu; edited Jul 10, 2019 at 9:28

Perhaps not relevant to the problem but your html example is invalid as the divs enclosing John Smith are both closing tags. – cYrixmorten Commented Jul 10, 2019 at 9:27
Sure, not relevant to the real issue. Thanks, corrected the typo. – Mathieu Commented Jul 10, 2019 at 9:28
Looks to me you're just not applying the right selector. Take the one you already use and add ` h2` to get the content of the header separately. – Trace Commented Dec 22, 2021 at 23:39
@Mathieu You have to use cheerio? – Maik Lowrey Commented Dec 23, 2021 at 13:22

7 Answers

You could use the following regex to replace all HTML tags with a space:

/<\/?[a-zA-Z0-9=" ]*>/g

Replacing the tags this way may leave several consecutive spaces. In that case you can use replace(/\s\s+/g, ' ') to collapse each run of whitespace into a single space.

See the result:

console.log(document.querySelector('div').innerHTML.replaceAll(/<\/?[a-zA-Z0-9=" ]*>/g, ' ').replace(/\s\s+/g, ' ').trim())

<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>
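
For use outside the browser, for example on an HTML string in Node before or instead of loading it into Cheerio, the same idea could be sketched as a small helper. This is only a sketch: the name tagsToSpaces is made up for illustration, and /<[^>]*>/g is a swapped-in, looser pattern than the one above so that attributes containing hyphens or single quotes also match. Like any regex approach it assumes reasonably simple markup and can be fooled by a literal ">" inside an attribute value, by comments, or by script and style content.

```javascript
// Hypothetical helper (not part of Cheerio): replace every tag with a space,
// then collapse whitespace runs. Assumes no literal ">" inside attribute values.
function tagsToSpaces(html) {
  return html
    .replace(/<[^>]*>/g, ' ') // each tag becomes one space
    .replace(/\s+/g, ' ')     // collapse spaces, tabs and newlines
    .trim();
}

const html = `<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>`;

console.log(tagsToSpaces(html)); // By John Smith
```

Collapsing with /\s+/ rather than only spaces also removes the newlines that the indentation in the markup leaves behind.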

You can use plain JavaScript for this task when the HTML is rendered in a browser, because innerText reflects the rendered line breaks.

const parent = document.querySelector('div');
console.log(parent.innerText.replace(/(\r\n|\n|\r)/gm, " "))

<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>

Generally speaking, is it possible to get the text in such a way that ANY HTML tag is replaced by a white space? (I'm OK with collapsing multiple white spaces afterwards.)

Just add ' ' before and after all the tags:

$("*").each(function (index) {
    $(this).prepend(' ');
    $(this).append(' ');
});

Then deal with multiple spaces:

$.text().replace(/\s{2,}/g, ' ').trim();
//=> "By John Smith"

Since Cheerio implements a subset of jQuery for Node.js, you might find answers written for jQuery useful as well.

Working example:

const cheerio = require('cheerio');
const $ = cheerio.load(`
    <div>
        By<div><h2 class="authorh2">John Smith</h2></div>
    </div>
`);

$("*").each(function (index) {
    $(this).prepend(' ');
    $(this).append(' ');
});

let raw = $.text();
//=> "        By  John Smith" (duplicate spaces)

let trimmed = raw.replace(/\s{2,}/g, ' ').trim();
//=> "By John Smith"
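
A similar result can be had without modifying the document, by walking the parsed tree and joining the text nodes with a space. The sketch below is an illustration under assumptions: collectText is a made-up name, and the tree is a hand-built stand-in made of plain objects with type ("text" or "tag"), data and children properties. If the nodes Cheerio hands back have that shape, passing an element such as $("body")[0] should behave the same way, but that is not verified here.

```javascript
// Hypothetical helper: gather the text of every text node under `node`,
// in document order. Assumes nodes look like { type, data, children }.
function collectText(node, parts = []) {
  if (node.type === 'text') {
    parts.push(node.data);
  }
  for (const child of node.children || []) {
    collectText(child, parts);
  }
  return parts;
}

// Hand-built stand-in for: <div>By<div><h2>John Smith</h2></div></div>
const tree = {
  type: 'tag', name: 'div', children: [
    { type: 'text', data: '\n  By' },
    { type: 'tag', name: 'div', children: [
      { type: 'tag', name: 'h2', children: [
        { type: 'text', data: 'John Smith' },
      ] },
    ] },
    { type: 'text', data: '\n' },
  ],
};

// Joining with a space puts whitespace at every boundary between text nodes.
const text = collectText(tree).join(' ').replace(/\s+/g, ' ').trim();
console.log(text); // By John Smith
```

Note that this also separates text split by inline tags, so "Jo<b>hn</b>" would come out as "Jo hn", which matches the "any tag becomes a space" requirement in the question but may not suit every use.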

Instead of cheerio, you could use htmlparser2. It lets you define callback methods for each time it encounters an opening tag, text, or a closing tag while parsing HTML.

This code results in the output string you want:

const htmlparser = require('htmlparser2');

let markup = `<div>
By<div><h2 class="authorh2">John Smith</h2></div>
</div>`;

var parts = [];
var parser = new htmlparser.Parser({
    onopentag: function(name, attributes){
        parts.push(' ');
    },
    ontext: function(text){
        parts.push(text);
    },
    onclosetag: function(tagName){
    // no-op
    }
}, {decodeEntities: true});

parser.write(markup);
parser.end();

// Join the parts, collapse every run of whitespace (spaces and
// newlines) into a single space, and trim both ends.
const result = parts.join('').replace(/\s+/g, ' ').trim();

console.log(result); // By John Smith

Here's another example of how to use it: https://runkit.com/jfahrenkrug/htmlparser2-demo/1.0.0

Cheerio's text() method is mainly meant for getting clean text when scraping. As you have already found, that is a little different from converting an HTML page to plain text. Using regex replacements to add a space will work if you only need the text for indexing. For other scenarios, such as converting to audio, it won't always work, because you need to distinguish between a space and a new line.

My suggestion would be to use a library for converting HTML to markdown. One option would be turndown.

var TurndownService = require('turndown')

var turndownService = new TurndownService()
var markdown = turndownService.turndown('<div>\nBy<div><h2>John Smith</h2></div></div>')

The value of markdown is then:

'By\n\nJohn Smith\n----------'

The last line is there because of the H2 header. Markdown is far easier to clean; you probably only need to remove URLs and images. The text is also easier for humans to read.

If you want a clean text representation of the content, I would recommend using lynx (used by Project Gutenberg) or pandoc. Both can be installed and then called from Node using spawn. These will provide a cleaner text representation than running puppeteer and using textContent or innerText.
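
Calling pandoc from Node could look roughly like the sketch below. It assumes pandoc is installed and on the PATH; htmlToPlainText is a made-up name, and the -f html -t plain flags select pandoc's HTML reader and plain-text writer. It uses spawnSync rather than spawn only to keep the example short. The function is defined here but not called.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical wrapper: pipe an HTML string through pandoc and return
// its plain-text rendering. Assumes `pandoc` is installed and on the PATH.
function htmlToPlainText(html) {
  const result = spawnSync('pandoc', ['-f', 'html', '-t', 'plain'], {
    input: html,      // sent to pandoc's stdin
    encoding: 'utf8', // get stdout and stderr back as strings
  });
  if (result.error) {
    throw result.error; // e.g. the pandoc binary was not found
  }
  if (result.status !== 0) {
    throw new Error(`pandoc exited with code ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}
```

The exact line breaks and wrapping in the returned text depend on the pandoc version and options, so treat the output as something to inspect rather than a fixed format.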

You could also try walking the DOM and adding new lines depending on the node type.

import * as cheerio from "cheerio";

const NODE_TYPES = {
  TEXT: "text",
  ELEMENT: "tag"
};

const INLINE_ELEMENTS = [
  "a",
  "abbr",
  "acronym",
  "audio",
  "b",
  "bdi",
  "bdo",
  "big",
  "br",
  "button",
  "canvas",
  "cite",
  "code",
  "data",
  "datalist",
  "del",
  "dfn",
  "em",
  "embed",
  "i",
  "iframe",
  "img",
  "input",
  "ins",
  "kbd",
  "label",
  "map",
  "mark",
  "meter",
  "noscript",
  "object",
  "output",
  "picture",
  "progress",
  "q",
  "ruby",
  "s",
  "samp",
  "script",
  "select",
  "slot",
  "small",
  "span",
  "strong",
  "sub",
  "sup",
  "svg",
  "template",
  "textarea",
  "time",
  "u",
  "tt",
  "var",
  "video",
  "wbr"
];

const content = `
<div>
  By March
  <div>
    <h2 class="authorh2">John Smith</h2>
    <div>line1</div>line2
         line3
    <ul>
      <li>test</li>
      <li>test2</li>
      <li>test3</li>
    </ul>
  </div>
</div>
`;

const isInline = (element) => INLINE_ELEMENTS.includes(element.name);
const isBlock = (element) => isInline(element) === false;
const walkTree = (node, callback, index = 0, level = 0) => {
  callback(node, index, level);
  for (let i = 0; i < (node.children || []).length; i++) {
    walkTree(node.children[i], callback, i, level + 1);
  }
};

const docFragText = [];
const cheerioFn = cheerio.load(content);
const docFrag = cheerioFn("body")[0];

walkTree(docFrag, (element) => {
  if (element.name === "body") {
    return;
  }

  if (element.type === NODE_TYPES.TEXT) {
    const parentElement = element.parent || {};
    const previousElement = element.prev || {};

    let textContent = element.data
      .split("\n")
      .map((nodeText, index) => (/\w/.test(nodeText) ? nodeText + "\n" : ""))
      .join("");

    if (textContent) {
      if (isInline(parentElement) || isBlock(previousElement)) {
        textContent = `${textContent}`;
      } else {
        textContent = `\n${textContent}`;
      }
      docFragText.push(textContent);
    }
  }
});

console.log(docFragText.join(""));
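
The core idea of that walker, that block-level boundaries become line breaks while inline tags leave the text run intact, can be shown in isolation without a parser. The following is a rough regex-based sketch and not the code above: blockAwareText and the short INLINE set are made up for illustration, and the regex is a swapped-in shortcut that assumes simple, well-formed markup.

```javascript
// Abbreviated, illustrative list; the answer above has a much longer one.
const INLINE = new Set(['a', 'b', 'em', 'i', 'span', 'strong']);

// Hypothetical helper: inline tags are dropped, every other tag becomes a
// newline, then blank lines and surrounding whitespace are cleaned up.
function blockAwareText(html) {
  return html
    .replace(/<\/?([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g, (tag, name) =>
      INLINE.has(name.toLowerCase()) ? '' : '\n')
    .split('\n')
    .map((line) => line.replace(/\s+/g, ' ').trim())
    .filter((line) => line !== '')
    .join('\n');
}

const sample = '<div>By <b>March</b><div><h2>John Smith</h2></div></div>';
console.log(blockAwareText(sample));
// By March
// John Smith
```

Here "By March" stays on one line because b is treated as inline, while the nested div and h2 start a new line.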

Existing answers use regex or other libraries, but neither is necessary. The trick to dealing with text nodes in Cheerio is to use .contents():

const cheerio = require("cheerio"); // 1.0.0-rc.12

const html = `
<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>`;

const $ = cheerio.load(html);
console.log($("div").contents().first().text().trim()); // => By

If you're not certain the text node will always be the first child, you can grab the first text node among all children as follows:

const text = $(
  [...$("div").contents()].find(e => e.type === "text")
)
  .text()
  .trim();
console.log(text); // => By

Hopefully needless to say, but the "John Smith" part is standard Cheerio:

const name = $("div").find("h2").text().trim();
console.log(name); // => John Smith
