javascript - How to get an Array of all words used on a page - Stack Overflow

So I'm trying to get an array of all the words used in my web page.

Should be easy, right?

The problem I run into is that $("body").text().split(" ") returns an array where the words at the beginning of one element and end of another are joined as one.

For example:

<div id="1">Hello
    <div id="2">World</div>
</div>

returns ["HelloWorld"] when I want it to return ["Hello", "World"].

I also tried:

wordArr = [];

function getText(target)
{    
    if($(this).children())
    {
        $(this).children(function(){getText(this)});
    }
    else
    {
        var testArr = $(this).text().split(" ");
        for(var i =0; i < testArr.length; i++)
            wordArr.push(testArr[i]);
    }

}

getText("body");

but $(node).children() is truthy for any node in the DOM that exists, so that didn't work.

I'm sure I'm missing something obvious, so I'd appreciate an extra set of eyes.

For what it's worth, I don't need unique words, just every word in the body of the document as an element in the array. I'm trying to use it to generate context and lexical co-occurrence with another set of words, so duplicates just up the contextual importance of a given word.

Thanks in advance for any ideas.

See Fiddle

Asked Jun 3, 2013 at 21:51 by Jason Nichols; edited Jun 4, 2013 at 15:10.

Comment:
  • See stackoverflow.com/questions/298750/… which might be helpful to you. – PSR, Jun 3, 2013 at 21:58

4 Answers

How about something like this?

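// Visit the child nodes of body and of every element inside it,
// keep the non-empty text nodes, and join them into one string.
var res = $('body, body *').contents().map(function () {
    if (this.nodeType === 3 && this.nodeValue.trim() !== "")
        return this.nodeValue.trim();
}).get().join(" ");
console.log(res);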

Demo

Get the array of words:

var res = $('body, body *').contents().map(function () {
    if (this.nodeType === 3 && this.nodeValue.trim() !== "") // text nodes only, ignoring empty ones
        return this.nodeValue.trim().split(/\W+/);  // split the node value into words; .map() flattens the returned array
}).get().filter(function (w) { return w !== ""; }); // plain array of words, without the empty strings split() can leave

console.log(res);

Demo
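One caveat with split(/\W+/): it leaves an empty string when the text starts or ends with punctuation, and it also breaks words at apostrophes and at non-ASCII letters. A minimal sketch of an alternative that matches words instead of splitting on separators. `toWords` is a hypothetical helper name, and treating apostrophes and hyphens inside a word as part of the word is an assumption about what should count as a word here:

```javascript
// Hypothetical helper: extract words by matching them rather than splitting.
// Assumption: a word is a run of letters/digits, optionally joined by
// single apostrophes or hyphens (e.g. "don't", "un-remarkable").
function toWords(text) {
  var matches = text.match(/[\p{L}\p{N}]+(?:['’-][\p{L}\p{N}]+)*/gu);
  return matches || []; // match() returns null when nothing is found
}

console.log("Hello, World!".split(/\W+/)); // ["Hello", "World", ""]
console.log(toWords("Hello, World!"));     // ["Hello", "World"]
```

The `\p{L}` and `\p{N}` classes need the `u` flag, so this only works in engines that support Unicode property escapes; in older browsers a plain `[A-Za-z0-9]` class would have to stand in.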

function getText(target) {
    var wordArr = [];
    $('*',target).add(target).each(function(k,v) {
        var words  = $('*',v.cloneNode(true)).remove().end().text().split(/(\s+|\n)/);
        wordArr = wordArr.concat(words.filter(function(n){return n.trim()}));
    });
    return wordArr;
}

FIDDLE

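You can do this:

var words = [];   // collects the result; must be declared before getwords() runs

function getwords(e) {
    e.contents().each(function () {
        if ($(this).children().length > 0) {
            // element with child elements: recurse into it
            getwords($(this));
        }
        else if ($.trim($(this).text()) != "") {
            // text node or leaf element: split its text into words
            words = words.concat($.trim($(this).text()).split(/\W+/));
        }
    });
}

getwords($("body"));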

http://jsfiddle.net/R55eM/

The question assumes that words are not internally separated by elements. If you simply create an array of words separated by white space and elements, you will end up with:

Fr<b>e</b>d

being read as

['Fr', 'e', 'd']; 

Another thing to consider is punctuation. How do you deal with: "There were three of them: Mark, Sue and Tom. They were un-remarkable. One—the red head—was in the middle." Do you remove all punctuation? Or replace it with white space before trimming? How do you re-join words that are split by markup, or by characters that might be either inter-word or intra-word punctuation? Note that while it is popular to write a dash between words with a space on either side, "correct" punctuation uses an em dash with no spaces.

Not so simple…
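To make the trade-offs concrete, here is a minimal sketch of one possible policy applied to the sentence above. The rules are assumptions, not the only reasonable choice: em and en dashes are treated as word breaks, hyphens and apostrophes inside a word are kept, and any other punctuation is trimmed from the edges of each token. `splitSentence` is a hypothetical helper name.

```javascript
// Hypothetical helper implementing one assumed punctuation policy.
function splitSentence(text) {
  return text
    .replace(/[\u2013\u2014]/g, " ")   // en/em dash -> word break
    .split(/\s+/)
    .map(function (token) {
      // strip punctuation from both ends, keep what is inside the word
      return token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, "");
    })
    .filter(function (token) { return token.length > 0; });
}

var sample = "There were three of them: Mark, Sue and Tom. " +
  "They were un-remarkable. One\u2014the red head\u2014was in the middle.";

console.log(splitSentence(sample));
// "un-remarkable" stays one word; "One" and "the" are separated at the em dash
```

Under this policy an en dash used inside a compound would also split the word, which may or may not be what you want.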

Anyhow, an approach that just splits on spaces and elements using recursion and works in any browser in use without any library support is:

function getWords(element) {
  element = element || document.body;
  var node, nodes = element.childNodes;
  var words = [];
  var text, i = 0;

  while (node = nodes[i++]) {

    if (node.nodeType == 1) {
      // element node: recurse and append its words
      words = words.concat(getWords(node));

    } else if (node.nodeType == 3) {
      // text node: trim, collapse runs of white space, then split
      text = node.data.replace(/^\s+|\s+$/g,'').replace(/\s+/g,' ');
      words = !text.length ? words : words.concat(text.split(/\s/));
    }
  }
  return words;
}

but it does not deal with the issues above.
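One way to handle the Fr<b>e</b>d case, sketched under assumptions: treat a short list of inline tags as never implying a word break, and put white space around every other element so adjacent blocks do not fuse. The `INLINE` list is illustrative rather than exhaustive, and `collectText`, `getWordsInlineAware`, `el` and `txt` are hypothetical names. The plain-object nodes only stand in for DOM nodes so the sketch runs outside a browser; in a page you would pass `document.body`.

```javascript
// Assumption: these tags are "inline" and never imply a word break.
var INLINE = { b: true, i: true, em: true, strong: true, span: true, a: true };

// Collect text, padding every non-inline element with spaces so that
// neighbouring blocks stay separate while inline markup stays inside its word.
function collectText(node) {
  if (node.nodeType === 3) return node.data;   // text node
  if (node.nodeType !== 1) return "";          // comments and other node types
  var inner = "";
  for (var i = 0; i < node.childNodes.length; i++) {
    inner += collectText(node.childNodes[i]);
  }
  var tag = node.tagName.toLowerCase();
  var isInline = Object.prototype.hasOwnProperty.call(INLINE, tag);
  return isInline ? inner : " " + inner + " ";
}

function getWordsInlineAware(element) {
  var text = collectText(element).replace(/^\s+|\s+$/g, "");
  return text ? text.split(/\s+/) : [];
}

// Plain-object stand-ins for DOM nodes (hypothetical test helpers).
function el(tag, children) {
  return { nodeType: 1, tagName: tag.toUpperCase(), childNodes: children };
}
function txt(data) { return { nodeType: 3, data: data }; }

var tree = el("div", [
  txt("Hello\n    "),
  el("div", [txt("World")]),
  el("p", [txt("Fr"), el("b", [txt("e")]), txt("d")])
]);

console.log(getWordsInlineAware(tree)); // ["Hello", "World", "Fred"]
```

CSS can turn any element into a block or inline box, so a tag list is only an approximation of what the reader actually sees as one word.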

Edit

To avoid script elements, change:

    if (node.nodeType == 1) {

to

    if (node.nodeType == 1 && node.tagName.toLowerCase() != 'script') {

Any element that should be avoided can be added to the condition. If a number of element types should be avoided, you can do:

var elementsToAvoid = {script:'script', button:'button'};
...
    if (node.nodeType == 1 && node.tagName && !(node.tagName.toLowerCase() in elementsToAvoid)) {
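Put together, the function with the exclusion check might look like the sketch below. Two things here are assumptions rather than part of the answer above: the lookup object is created with `Object.create(null)` so that `in` cannot match inherited keys such as `constructor`, and `style` is added to the list. The `body` object is a plain-object stand-in for a DOM tree so the sketch runs outside a browser.

```javascript
// Tags to skip, stored without a prototype so `in` only sees these keys.
var elementsToAvoid = Object.create(null);
elementsToAvoid.script = true;
elementsToAvoid.style = true;   // assumption: style text is not wanted either
elementsToAvoid.button = true;

function getWords(element) {
  var words = [];
  var nodes = element.childNodes;
  for (var i = 0; i < nodes.length; i++) {
    var node = nodes[i];
    if (node.nodeType == 1) {
      // element: recurse unless its tag is on the skip list
      if (!(node.tagName.toLowerCase() in elementsToAvoid)) {
        words = words.concat(getWords(node));
      }
    } else if (node.nodeType == 3) {
      // text node: trim, then split on runs of white space
      var text = node.data.replace(/^\s+|\s+$/g, "");
      if (text.length) words = words.concat(text.split(/\s+/));
    }
  }
  return words;
}

// Plain-object stand-in for a small document (hypothetical test data).
var body = { nodeType: 1, tagName: "BODY", childNodes: [
  { nodeType: 3, data: "  Hello  " },
  { nodeType: 1, tagName: "SCRIPT", childNodes: [{ nodeType: 3, data: "var x = 1;" }] },
  { nodeType: 1, tagName: "DIV", childNodes: [{ nodeType: 3, data: "big  World" }] }
]};

console.log(getWords(body)); // ["Hello", "big", "World"]
```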