Given the string:
© 2010 Women’s Flat Track Derby Association (WFTDA)
I want:
2010 -- Women's -- Flat
Women's -- Flat -- Track
Track -- Derby -- Association
I'm using regex:
([a-zA-Z]+)\s([A-Z][a-z]*)\s([a-zA-Z]+)
It's only returning:
s -- Flat -- Track
edited Nov 16, 2010 at 22:11 by Bart Kiers
asked Nov 16, 2010 at 22:09 by Caveatrob
- Sorry - it's ultraedit JS, so probably javascript would work. – Caveatrob Commented Nov 16, 2010 at 22:10
2 Answers
This problem isn't straightforward, but to understand why, you need to understand how the regular expression engine operates on your string.
Let's consider the pattern `[a-z]{3}` (match 3 successive characters between a and z) on the target string `abcdef`. The engine starts from the left side of the string (before the `a`), and sees that `a` matches `[a-z]`, so it advances one position. Then, it sees that `b` matches `[a-z]` and advances again. Finally, it sees that `c` matches, advances again (to before `d`) and returns `abc` as a match.
If the engine is set up to return multiple matches, it will now try to match again, but it keeps its positional information (so, like above, it'll match and return `def`).
Because the engine has already moved past the `b` while matching `abc`, `bcd` will never be considered as a match. For this same reason, in your expression, once a group of words is matched, the engine will never consider words within the first match to be a part of the next one.
In order to get around this, you need to use capturing groups inside of lookaheads to collect matching words that appear later in the string:
var str = "2010 Women's Flat Track Derby Association",
regex = /([a-z0-9']+)(?=\s+([a-z0-9']+)\s+([a-z0-9']+))/ig,
match;
while (match = regex.exec(str))
{
var group1 = match[1], group2 = match[2], group3 = match[3];
console.log("Found match: " + group1 + " -- " + group2 + " -- " + group3);
}
This results in:
2010 -- Women's -- Flat
Women's -- Flat -- Track
Flat -- Track -- Derby
Track -- Derby -- Association
See this in action at http://jsfiddle.net/jRgXm/.
The regular expression searches for what you seem to be defining as a word, `([a-z0-9']+)`, and captures it into subgroup 1. It then uses a lookahead (a zero-width assertion, so it doesn't advance the engine's cursor) that captures the next two words into subgroups 2 and 3.
However, if you are using the actual JavaScript engine, you must use `RegExp.exec` and loop over the results (see this question for a discussion of why) or use the newer `matchAll` method (ES2020). I don't know how UltraEdit's engine is implemented, but hopefully it can do a global search and also collect subgroups.
Just for completeness, here's the example above using ES2020's `matchAll` (the first element in each returned array is the total match, then the subsequent elements are the capture groups):
const str = "2010 Women's Flat Track Derby Association";
const regex = /([a-z0-9']+)(?=\s+([a-z0-9']+)\s+([a-z0-9']+))/ig;
console.log([...str.matchAll(regex)]);
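If the window size needs to vary, the same lookahead idea can be wrapped in a function. This is only a sketch: `wordWindows` and its `size` parameter are hypothetical names, and the character class additionally accepts the curly apostrophe (`’`) from the original string, which is an assumption beyond the pattern shown above.

```javascript
// Hypothetical helper: return every run of `size` consecutive words,
// joined with " -- ", using one consumed word plus a lookahead for the rest.
function wordWindows(text, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  // Assumption: a "word" is letters, digits, and straight or curly apostrophes.
  const word = "[a-z0-9'’]+";
  // One "\s+(word)" per additional word; all of them live inside the lookahead,
  // so only the first word is consumed and the next match can start at word 2.
  const ahead = `\\s+(${word})`.repeat(size - 1);
  const regex = new RegExp(`(${word})(?=${ahead})`, "gi");
  // m[0] is the consumed first word; m.slice(1) holds all captured words.
  return [...text.matchAll(regex)].map((m) => m.slice(1).join(" -- "));
}

const original = "© 2010 Women’s Flat Track Derby Association (WFTDA)";
console.log(wordWindows(original, 3).join("\n"));
// 2010 -- Women’s -- Flat
// Women’s -- Flat -- Track
// Flat -- Track -- Derby
// Track -- Derby -- Association
```

`(WFTDA)` never appears in a window because the parenthesis is not in the character class, so no word can be captured across it.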
I'm using some generic regex tester, so I can't guarantee it will work for you but...
([A-Z0-9][\w’]+)\s([A-Z][\w]+)\s([A-Z][\w]+)
Three words starting with a number or capital letter followed by letters/numbers or that funky apostrophe, separated by spaces. Works for me.
Edit: I assume you can loop through in JS, repeating the matcher; I've never used it.
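For what it's worth, looping this pattern in JavaScript still produces non-overlapping matches, because each match consumes all three words. A rough sketch of one way around that, assuming it is acceptable to move the second and third words into a lookahead and to allow the curly apostrophe in them as well (both are changes to the pattern as posted); `collectGroups` is a hypothetical helper:

```javascript
// Hypothetical helper: run a global regex and return the capture groups
// of every match as arrays of strings.
function collectGroups(regex, text) {
  return [...text.matchAll(regex)].map((m) => m.slice(1));
}

const input = "© 2010 Women’s Flat Track Derby Association (WFTDA)";

// Pattern as posted, with the g flag added so it can be looped.
const posted = /([A-Z0-9][\w’]+)\s([A-Z][\w]+)\s([A-Z][\w]+)/g;
console.log(collectGroups(posted, input));
// [ [ 'Women’s', 'Flat', 'Track' ] ]

// Assumed variant: words two and three are only looked at, not consumed.
const overlapping = /([A-Z0-9][\w’]+)(?=\s([A-Z][\w’]+)\s([A-Z][\w’]+))/g;
console.log(collectGroups(overlapping, input).map((g) => g.join(" -- ")));
// [
//   '2010 -- Women’s -- Flat',
//   'Women’s -- Flat -- Track',
//   'Flat -- Track -- Derby',
//   'Track -- Derby -- Association'
// ]
```

Note that this variant only accepts words that start with a capital letter or digit, so it would skip lowercase words in other inputs.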