arrays - Tokenize in JavaScript - Stack Overflow

If I have a string, how can I split it into an array of words and filter out some stopwords? I only want words of length 2 or greater.

If my string is

var text = "This is a short text about StackOverflow.";

I can split it with

var words = text.split(/\W+/);

But using split(/\W+/), I get all words. I could check if the words have a length of at least 2 with

function validate(token) {
  return /\w{2,}/.test(token);
}

but I guess I could do this smarter/faster with regexp.

I also have an array var stopwords = ['has', 'have', ...] whose entries shouldn't be allowed in the resulting array.

Actually, if I can find a way to filter out stopwords, I could just add all letters a, b, c, ..., z to the stopwords array to only accept words with at least 2 characters.

asked Aug 24, 2015 at 17:47 by Jamgreen
  • This can easily be done using arrays and filter methods. Are you looking to do all this with regex instead? – juvian Commented Aug 24, 2015 at 17:50
  • I don't think there's anything wrong with text.split(/\W+/).filter(validate). No need to write an overcomplicated regex. – Bergi Commented Aug 24, 2015 at 17:50
  • You can get rid of non-word symbols and all words that are shorter than 2 characters with text.split(/\W+|\b\w\b/). – Wiktor Stribiżew Commented Aug 24, 2015 at 17:50

5 Answers

I would do what you started: split by /\W+/ and then validate each token (length and stopwords) in the array by using .filter().

var text = "This is a short text about StackOverflow.";
var stopwords = ['this'];

var words = text.split(/\W+/).filter(function(token) {
    token = token.toLowerCase();
    return token.length >= 2 && stopwords.indexOf(token) == -1;
});

console.log(words); // ["is", "short", "text", "about", "StackOverflow"]

You could easily tweak a regex to look for words >= 2 characters, but there's no point if you're already going to need to post-process to remove stopwords (token.length will be faster than any fancy regex you write).
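For comparison, the regex tweak mentioned above might look like the sketch below. Using match with /\w{2,}/g collects runs of two or more word characters directly, so empty strings and one-letter tokens never reach the filter. The function name tokenize and the use of a Set for the stopwords are assumptions made for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: extract tokens of 2+ word characters, then drop stopwords.
function tokenize(text, stopwords) {
    // A Set gives constant-time lookups; stopwords are assumed to be lowercase.
    var stopSet = new Set(stopwords);
    // match() returns null when nothing matches, so fall back to an empty array.
    var tokens = text.match(/\w{2,}/g) || [];
    return tokens.filter(function (token) {
        return !stopSet.has(token.toLowerCase());
    });
}

var text = "This is a short text about StackOverflow.";
console.log(tokenize(text, ['this']));
// ["is", "short", "text", "about", "StackOverflow"]
```

The stopword check still happens after tokenizing, so this is mostly a matter of taste compared with split plus a length test.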

Easy with Ramda:

var text       = "This is a short text about how StackOverflow has gas.";
var stopWords  = ['have', 'has'];
var isLongWord = R.compose(R.gt(R.__, 2), R.length);
var isGoWord   = R.compose(R.not, R.contains(R.__, stopWords));
var tokenize   = R.compose(R.filter(isGoWord), R.filter(isLongWord), R.split(' '));

tokenize(text); // ["This", "short", "text", "about", "how", "StackOverflow", "gas."]

http://bit.ly/1V5bVrP

What about splitting on something like this if you want to use a pure regex approach:

\W+|\b\w{1,2}\b

https://regex101.com/r/rB4cJ4/1
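A rough sketch of how this pattern behaves with split, using the sample sentence from the question. Two things seem worth noting: {1,2} also removes two-letter words such as "is" (\b\w\b would remove only single characters), and split leaves empty strings wherever two matches are adjacent, so a filter(Boolean) pass is still needed. The variable names are illustrative only.

```javascript
var sample = "This is a short text about StackOverflow.";

// The pattern above: removes non-word runs and words of 1-2 characters.
// filter(Boolean) discards the empty strings split leaves behind.
var dropUpToTwo = sample.split(/\W+|\b\w{1,2}\b/).filter(Boolean);
console.log(dropUpToTwo);
// ["This", "short", "text", "about", "StackOverflow"]

// Variant that removes only single-character words, keeping "is".
var dropSingles = sample.split(/\W+|\b\w\b/).filter(Boolean);
console.log(dropSingles);
// ["This", "is", "short", "text", "about", "StackOverflow"]
```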

Something like this?

function filterArray(a, num_words, stop_words) {
    var b = []; // declared locally instead of leaking a global
    for (var ct = 0; ct < a.length; ct++) {
        // Keep words longer than num_words characters that are not stopwords.
        if (a[ct].length > num_words && !ArrayContains(a[ct], stop_words)) {
            b.push(a[ct]);
        }
    }
    return b;
}

function ArrayContains(word, a) {
    for (var ct = 0; ct < a.length; ct++) {
        if (word == a[ct]) {
            return true;
        }
    }
    // Report "not found" only after every element has been checked.
    return false;
}

var words = "He walks the dog";
var stops = ["dog"];
var a = words.split(" ");
var f = filterArray(a, 2, stops); // ["walks", "the"]

This should help:

(?:\b\W*\w\W*\b)+|\W+

output:

ThisisashorttextaboutStackOverflow. A..Zabc..xyz.

where is matched string.
