javascript - Extract only text content from a web page - Stack Overflow


I need to extract all the text content from a web page. I have used document.body.textContent, but the result also includes the JavaScript code. How do I make sure I get only the readable text content?

function myFunction() {
  var str = document.body.textContent;
  alert(str);
}

<html>
<title>Test Page for Text extraction</title>

<head>I hope this works</head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>

<body>
  <p>Test on this content to change the 5th word to a link
    <p>
      <button onclick="myFunction()">Try it</button>
</body>
</html>

Asked Sep 28, 2015 at 14:49 by vjravi

2 Answers

Just remove the tags you don't want to be read before reading document.body.textContent.

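function myFunction() {
  // Remove every <script> element inside <body> so its source is not included.
  var bodyScripts = document.querySelectorAll("body script");
  for (var i = 0; i < bodyScripts.length; i++) {
    bodyScripts[i].remove();
  }
  // With the scripts gone, textContent holds only the remaining text.
  var str = document.body.textContent;
  // Show the result in a <pre>; setting textContent keeps it from being re-parsed as HTML.
  var pre = document.createElement('pre');
  pre.textContent = str;
  document.body.innerHTML = '';
  document.body.appendChild(pre);
}

<html>
<title>Test Page for Text extraction</title>

<head>I hope this works</head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>

<body>
  <p>Test on this content to change the 5th word to a link
    <p>
      <button onclick="myFunction()">Try it</button>
</body>
</html>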
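
If deleting nodes from the live page is not acceptable, a similar result can probably be reached without touching the DOM by walking the tree and skipping the unwanted elements. The following is only a sketch and not part of the original answer: getReadableText and the SKIPPED_TAGS list (SCRIPT, STYLE, NOSCRIPT) are hypothetical names and choices made for illustration.

```javascript
// Hypothetical helper (not from the answer above): collect the text of a DOM
// subtree while skipping elements whose content is not meant to be read.
var SKIPPED_TAGS = ['SCRIPT', 'STYLE', 'NOSCRIPT']; // assumed list, adjust as needed

function getReadableText(node) {
  var ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE
  var TEXT_NODE = 3;    // same value as Node.TEXT_NODE

  // Text nodes contribute their text directly.
  if (node.nodeType === TEXT_NODE) {
    return node.nodeValue;
  }
  // Comments and other non-element nodes, and skipped elements, contribute nothing.
  if (node.nodeType !== ELEMENT_NODE ||
      SKIPPED_TAGS.indexOf(node.nodeName.toUpperCase()) !== -1) {
    return '';
  }
  // For any other element, concatenate the text of its children in order.
  var text = '';
  for (var i = 0; i < node.childNodes.length; i++) {
    text += getReadableText(node.childNodes[i]);
  }
  return text;
}
```

In the question's page this would be called as alert(getReadableText(document.body)). Unlike the snippet above, it leaves the scripts in place, so the page keeps working after the text has been read.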

Try document.body.innerText.

This MDN article describes the differences between textContent and innerText:

Don't get confused by the differences between Node.textContent and HTMLElement.innerText. Although the names seem similar, there are important differences:

  • textContent gets the content of all elements, including <script> and <style> elements. In contrast, innerText only shows "human-readable" elements.
  • textContent returns every element in the node. In contrast, innerText is aware of styling and won't return the text of "hidden" elements.
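
Because innerText is defined on HTMLElement rather than on every Node, some nodes (text nodes or non-HTML elements, for example) do not expose it. A defensive wrapper along the following lines may help; getVisibleText is a hypothetical name used only for this sketch, and falling back to textContent is an assumption about what is acceptable when innerText is missing.

```javascript
// Hypothetical helper: prefer the rendering-aware innerText and fall back to
// textContent for nodes that do not provide it.
function getVisibleText(node) {
  if (typeof node.innerText === 'string') {
    return node.innerText;
  }
  // textContent is null for some node types (e.g. the document itself).
  return node.textContent || '';
}
```

For the page in the question this would be used as alert(getVisibleText(document.body)). Note that reading innerText requires up-to-date layout information, so it can be slower than textContent on large pages.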