javascript - Finding keywords in texts - Stack Overflow

I have an array of incidents that have happened. They are written in free text and therefore don't follow a pattern, except for some keywords, e.g. "robbery", "murderer", "housebreaking", "car accident" etc. Those keywords can be anywhere in the text, and I want to find them and assign the incidents to categories, e.g. "Robberies".

In the end, when I have checked all the incidents I want to have a list of categories like this:

Robberies: 14
Murder attempts: 2
Car accidents: 5
...

The array elements can look like this:

incidents[0] = "There was a robbery on Amest Ave last night...";
incidents[1] = "There has been a report of a murder attempt...";
incidents[2] = "Last night there was a housebreaking in...";
...

I guess the best here is to use regular expressions to find the keywords in the texts, but I really suck at regexp and therefore need some help here.

The regular expressions below are not correct, but I guess this structure would work? Is there a better way of doing this that avoids repetition (stays DRY)?

var trafficAccidents = 0,
    robberies = 0,
    ...

function FindIncident(incident) {
    if (incident.match(/car accident/g)) {
        trafficAccidents += 1;
    }
    else if (incident.match(/robbery/g)) {
        robberies += 1;
    }
    ...
}

Thanks a lot in advance!

Asked Jan 9, 2013 at 23:35 by holyredbeard; edited Jan 10, 2013 at 5:32.
  • This sounds a little off-place for Javascript, but your method is on the right path for what you're trying to do. – caiosm1005, Jan 9, 2013 at 23:44

7 Answers

The following code shows an approach you can take.

// Each incident type maps to a case-insensitive, global regex
var INCIDENT_MATCHES = {
  trafficAccidents: /(traffic|car) accidents?/ig,
  robberies: /robbery|robberies/ig,
  murder: /murders?/ig
};

function FindIncidents(incidentReports) {
  var incidentCounts = {};
  var incidentTypes = Object.keys(INCIDENT_MATCHES);
  incidentReports.forEach(function(incident) {
    incidentTypes.forEach(function(type) {
      if (typeof incidentCounts[type] === 'undefined') {
        incidentCounts[type] = 0;
      }
      // With the g flag, match() returns every occurrence (or null)
      var matchFound = incident.match(INCIDENT_MATCHES[type]);
      if (matchFound) {
        incidentCounts[type] += matchFound.length;
      }
    });
  });

  return incidentCounts;
}

Regular expressions make sense, since you'll have a number of strings that meet your 'match' criteria, even if you only consider the differences in plural and singular forms of 'robbery'. You also want to ensure that your matching is case-insensitive.

You need to use the 'global' modifier on your regexes so that you match strings like "Murder, Murder, murder" and increment your count by 3 instead of just 1.
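To make the effect of the global flag concrete, here is a small sketch. The helper name countMatches and the sample string are made up for illustration; the only thing it relies on is that String.prototype.match returns every match when the regex has the g flag and only the first match otherwise.

```javascript
// Hypothetical helper: counts how many matches match() reports for a regex.
function countMatches(text, regex) {
  var found = text.match(regex); // null when nothing matches
  return found ? found.length : 0;
}

var sample = "Murder, Murder, murder";
var withGlobal = countMatches(sample, /murder/ig);   // every occurrence is returned
var withoutGlobal = countMatches(sample, /murder/i); // only the first match is returned

console.log(withGlobal, withoutGlobal); // 3 1
```

Whether you want 3 or 1 here depends on whether you are counting mentions or counting incidents.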

This allows you to keep the relationship between your match criteria and incident counters together. It also avoids the need for global counters (granted, INCIDENT_MATCHES is a global variable here, but you can readily move it elsewhere and take it out of the global scope).

Actually, I would kind of disagree with you here . . . I think string functions like indexOf will work perfectly fine.

I would use JavaScript's indexOf method which takes 2 inputs:

string.indexOf(value,startPos);

So one thing you can do is define a simple temporary variable as your cursor as such . . .

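function FindIncident(phrase, word) {
    if (!word) return 0; // an empty word would never advance the cursor
    var cursor = 0;
    var wordCount = 0;
    // indexOf returns -1 once there is no further occurrence
    while ((cursor = phrase.indexOf(word, cursor)) > -1) {
        ++wordCount;
        cursor += word.length; // move past this hit so the loop terminates
    }
    return wordCount;
}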

I have not tested the code but hopefully you get the idea . . .

Be particularly careful of the starting position if you do use it.
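As a rough illustration of why the starting position matters, the sketch below counts occurrences with indexOf and lets you choose how far the cursor moves after each hit. The function name countOccurrences and the allowOverlap parameter are assumptions made for this example, not part of any API.

```javascript
// Hypothetical helper: counts occurrences of `word` in `phrase` using indexOf.
// `allowOverlap` decides whether the next search starts one character after
// the hit (overlapping matches) or right after the whole word.
function countOccurrences(phrase, word, allowOverlap) {
  if (word.length === 0) return 0; // guard: an empty word would never advance
  var count = 0;
  var pos = phrase.indexOf(word, 0);
  while (pos !== -1) {
    count++;
    // Restarting the search at `pos` itself would find the same hit forever.
    pos = phrase.indexOf(word, pos + (allowOverlap ? 1 : word.length));
  }
  return count;
}

console.log(countOccurrences("robbery after robbery", "robbery")); // 2
console.log(countOccurrences("aaaa", "aa"));                       // 2 (non-overlapping)
console.log(countOccurrences("aaaa", "aa", true));                 // 3 (overlapping)
```

For ordinary keywords the two modes give the same answer; the difference only shows up when a word can overlap with itself.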

RegEx makes my head hurt too. ;) If you're looking for exact matches and aren't worried about typos and misspellings, I'd search the incident strings for substrings containing the keywords you're looking for.

incident = incident.toLowerCase();
// search() returns -1 when nothing is found; 0 is a valid match position
if (incident.search("car accident") !== -1) {
    trafficAccidents += 1;
}
else if (incident.search("robbery") !== -1) {
    robberies += 1;
}
...

Use an array of objects to store all the different categories you're searching for, complete with an appropriate regular expression and a count member, and you can write the whole thing in four lines.

var categories = [
    {
        regexp: /\brobbery\b/i
        , display: "Robberies"
        , count: 0
    }
    , {
        regexp: /\bcar accidents?\b/i
        , display: "Car Accidents"
        , count: 0
    }
    , {
        regexp: /\bmurder\b/i
        , display: "Murders"
        , count: 0
    }
];

var incidents = [ 
    "There was a robbery on Amest Ave last night..."
    , "There has been a report of a murder attempt..."
    , "Last night there was a housebreaking in..."
];

for(var x = 0; x<incidents.length; x++)
    for(var y = 0; y<categories.length; y++)
        if (incidents[x].match(categories[y].regexp))
            categories[y].count++;

Now, no matter what you need, you can simply edit one section of code, and it will propagate through your code.

This code has the potential to categorize each incident in multiple categories. To prevent that, just add a 'break' statement to the if block.
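A minimal sketch of that variant follows, using made-up sample data (sampleCategories and sampleIncidents are illustrative names, and the first sample text is deliberately written to match two categories). Note that the loops need braces once the if block holds two statements.

```javascript
// Sample data for illustration only; the category shape mirrors the one above.
var sampleCategories = [
  { regexp: /\brobbery\b/i, display: "Robberies", count: 0 },
  { regexp: /\bcar accidents?\b/i, display: "Car Accidents", count: 0 }
];
var sampleIncidents = [
  "A robbery led to a car accident downtown", // matches both categories
  "Car accident on the highway"
];

for (var x = 0; x < sampleIncidents.length; x++) {
  for (var y = 0; y < sampleCategories.length; y++) {
    if (sampleCategories[y].regexp.test(sampleIncidents[x])) {
      sampleCategories[y].count++;
      break; // stop at the first matching category for this incident
    }
  }
}

sampleCategories.forEach(function (c) {
  console.log(c.display + ": " + c.count);
});
```

With the break in place, the order of the categories array acts as a priority list: the first incident is counted only as a robbery.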

You could do something like this which will grab all words found on each item in the array and it will return an object with the count:

var words = ['robbery', 'murderer', 'housebreaking', 'car accident'];

function getAllIncidents( incidents ) {
  var re = new RegExp('('+ words.join('|') +')', 'i')
    , result = {};
  incidents.forEach(function( txt ) {
    var match = ( re.exec( txt ) || [,0] )[1];
    match && (result[ match ] = ++result[ match ] || 1);
  });
  return result;
}

console.log( getAllIncidents( incidents ) );
//^= { housebreaking: 1, car accident: 2, robbery: 1, murderer: 2 }

This is more of a quick prototype, but it could be improved with plurals and multiple keywords.
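One possible way to add plurals and multiple keywords is sketched below. The table keywordTable, its pattern fragments, and the function countByCategory are all assumptions made for this example; it counts each incident at most once per category.

```javascript
// Hypothetical keyword table: category name -> list of regex fragments.
var keywordTable = {
  "Robberies": ["robber(?:y|ies)"],
  "Murders": ["murder(?:s|er|ers)?"],
  "Car accidents": ["(?:car|traffic) accidents?"]
};

function countByCategory(incidentTexts) {
  var result = {};
  Object.keys(keywordTable).forEach(function (category) {
    // Join all fragments of a category into one case-insensitive regex
    var re = new RegExp("(?:" + keywordTable[category].join("|") + ")", "i");
    result[category] = incidentTexts.filter(function (txt) {
      return re.test(txt);
    }).length;
  });
  return result;
}

var counts = countByCategory([
  "Two robberies were reported on Amest Ave",
  "A murderer was arrested after a robbery",
  "Traffic accident near the bridge"
]);
console.log(counts);
```

Grouping keywords under a category name also means the result keys are the display names rather than whatever spelling happened to match.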

Demo: http://jsbin.com/idesoc/1/edit

Use an object to store your data.

var events = [
    { exp : /\b(?:robbery|robberies)\b/i,
    //       \b                            word boundary
    //         (?:robbery|robberies)       singular or plural
    //                              \b     word boundary
    //                                /i   case insensitive
      name : "robbery",
      count: 0
    },
    // other objects here
];

var i = events.length;
while( i-- ) {

    var j = incidents.length;
    while( j-- ) {

        // only checks that a particular event exists in the incident, not the number of occurrences
        if( events[i].exp.test( incidents[j] ) ) {
            events[i].count++;
        }
    }
}

Yes, that's one way to do it, although matching plain-words with regex is a bit of overkill — in which case, you should be using indexOf as rbtLong suggested.

You can refine it further by:

  • appending the i flag (match lowercase and uppercase characters).
  • adding possible word variations to your expression. robbery could be translated into robber(y|ies), thus matching both singular and plural variations of the word. car accident could be (car|truck|vehicle|traffic) accident.

Word boundaries \b

Don't use this. It requires non-word characters (or the start or end of the string) around the matched word and will prevent matching typos. You should make your queries as broad as possible.
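To see the trade-off, here is a small sketch comparing a pattern with and without word boundaries. The sample sentences are invented for illustration.

```javascript
var withBoundary = /\bmurder\b/i;
var withoutBoundary = /murder/i;

var samples = [
  "A murder was reported",
  "The murderer fled",
  "Two murders last week"
];

// For each sample, record whether each pattern matches
var boundaryResults = samples.map(function (s) {
  return { text: s, bounded: withBoundary.test(s), unbounded: withoutBoundary.test(s) };
});

boundaryResults.forEach(function (r) {
  console.log(r.text + " | with \\b: " + r.bounded + " | without: " + r.unbounded);
});
```

The bounded pattern misses "murderer" and "murders" because the keyword is followed by another letter, while the unbounded one catches all three. The flip side is that an unbounded pattern will also match the keyword inside unrelated longer words.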


if (incident.match(/(car|truck|vehicle|traffic) accident/i)) {
    trafficAccidents += 1;
}
else if (incident.match(/robber(y|ies)/i)) {
    robberies += 1;
}

Notice how I discarded the g flag; it stands for "global match" and makes the parser continue searching the string after the first match. This seems unnecessary as just one confirmed occurrence is enough for your needs.
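A related point, offered here as a side note rather than something from the answer itself: a regex object with the g flag keeps its position in lastIndex between calls to test(), so reusing one for yes/no checks can give surprising results. A minimal sketch:

```javascript
var globalRe = /robbery/g;
var plainRe = /robbery/;
var text = "robbery";

var first = globalRe.test(text);  // true, lastIndex moves to the end of the match
var second = globalRe.test(text); // false, the search resumed after the previous match
var third = plainRe.test(text);   // true
var fourth = plainRe.test(text);  // true, no state is kept without the g flag

console.log(first, second, third, fourth); // true false true true
```

String.prototype.match does not have this problem, but it is one more reason to leave g off when a single confirmed occurrence is all you need.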

This website offers an excellent introduction to regular expressions

http://www.regular-expressions.info/tutorial.html
