I have a Node.js script that reads in a file and counts word frequencies. I currently feed each line into a function:
function getWords(line) {
  return line.match(/\b\w+\b/g);
}
This matches almost everything, except that it splits contractions:
getWords("I'm") -> {"I", "m"}
However, I cannot just include apostrophes, as I would want matched apostrophes to be word boundaries:
getWords("hey'there'") -> {"hey", "there"}
Is there a way to capture contractions while still treating other apostrophes as word boundaries?
asked Dec 31, 2014 at 2:59 by Ehryk
- How can you tell that "I'm" should be split but "hey'there'" should not? Sounds like this might require a dictionary? – Aaron Dufour Commented Dec 31, 2014 at 3:31
- Will "hey'there'" really appear like that, or will it have a space like "hey 'there'"? – Wesley Smith Commented Dec 31, 2014 at 3:38
- What if the input is "I'm Steve O'Conner's 'friend'"? How would you know that "O'Conner's" is actually one word, not three? Or what if the matched apostrophes you mention contain a contraction with another apostrophe? – nnnnnn Commented Dec 31, 2014 at 3:39
- @nnnnnn my answer below seems to cover that case but it could use more testing – Wesley Smith Commented Dec 31, 2014 at 3:46
- My question is, for the record, neither a joke nor rhetorical. You're going to have a hard time getting the answer you want unless you provide actual criteria for making the determination. @DelightedD0D's answer is good, but it drops the apostrophe from words like "'twas" and "'ow", which are also contractions, and it's not clear whether that's important to you. – Aaron Dufour Commented Dec 31, 2014 at 4:15
2 Answers
The closest I believe you could get with regex would be line.match(/(?!'.*')\b[\w']+\b/g), but be aware that if there is no space between a word and a ', it will get treated as a contraction.
As Aaron Dufour mentioned, there would be no way for the regex by itself to know that I'm is a contraction but hey'there isn't.
See below:
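As a rough sketch of how that pattern behaves on the inputs from the question: the helper name `getWordsLookahead` is made up for illustration, and the `|| []` fallback is an added assumption so that a line with no matches yields an empty array instead of `null`.

```javascript
// Hypothetical helper name; wraps the regex suggested in this answer.
function getWordsLookahead(line) {
  // String.prototype.match returns null when nothing matches,
  // so fall back to an empty array (an assumption, not in the answer).
  return line.match(/(?!'.*')\b[\w']+\b/g) || [];
}

console.log(getWordsLookahead("I'm"));         // -> ["I'm"]
console.log(getWordsLookahead("hey 'there'")); // -> ["hey", "there"]
console.log(getWordsLookahead("hey'there'"));  // -> ["hey'there"]
```

The last call shows the caveat described above: with no space before the apostrophe, hey'there is kept together as if it were a contraction.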
You can match letters and a possible apostrophe followed by letters.
line.match(/[A-Za-z]+('[A-Za-z]+)?/g)
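A minimal sketch of this pattern feeding a word-frequency count like the one in the question. The names `getWordsLetters` and `countWords` are hypothetical, and lowercasing words before counting is an assumption the question does not specify.

```javascript
// Hypothetical helper name; wraps the letters-plus-optional-apostrophe regex.
function getWordsLetters(line) {
  // With the g flag, match returns only the full matches (the group is ignored),
  // or null when nothing matches, hence the [] fallback.
  return line.match(/[A-Za-z]+('[A-Za-z]+)?/g) || [];
}

// Hypothetical counter: tallies words across an array of lines.
function countWords(lines) {
  const counts = new Map();
  for (const line of lines) {
    for (const word of getWordsLetters(line)) {
      const key = word.toLowerCase(); // assumption: counting is case-insensitive
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  return counts;
}

console.log(getWordsLetters("I'm Steve O'Conner's 'friend'"));
// -> ["I'm", "Steve", "O'Conner", "s", "friend"]
console.log(countWords(["I'm here", "i'm there"]).get("i'm")); // -> 2
```

Because the pattern allows only one apostrophe per word, O'Conner's comes out as O'Conner and s; digits and underscores, which \w matched in the original function, are also no longer counted as word characters.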