javascript - Why does string.replace(/\W*/g,'_') prepend all characters?

I've been learning regexp in JS and encountered a situation that I didn't understand.

I ran a test of the replace function with the following regexp:

/\W*/g

I expected it to prepend an underscore to the beginning of the string and then replace all non-word characters, so that

The Number is (123)(234)

would become:

_The_Number_is__123___234_

That is, an underscore would be prepended because the start of the string contains at least zero non-word characters, and then all spaces and other non-word characters would be replaced.

Instead, it prepended an underscore to every character and replaced all non-word characters.

_T_h_e__N_u_m_b_e_r__i_s__1_2_3__2_3_4__

Why did it do this?

asked Mar 3, 2017 at 21:31 (edited Mar 3, 2017 at 21:45) by Judd Franklin

Why do you expect _ at the start of the string? If you fix the regex and use /\W/g you would get The_Number_is__123__234_. Isn't that the expected result? – Wiktor Stribiżew Commented Mar 3, 2017 at 21:34
Since I used * I assumed that it would start with an underscore since there will automatically be at least zero instances of non-word characters. – Judd Franklin Commented Mar 3, 2017 at 21:38
So, what are the actual requirements? – Wiktor Stribiżew Commented Mar 3, 2017 at 21:41
Ok, you need .replace(/\W|^/g, '_') then – Wiktor Stribiżew Commented Mar 3, 2017 at 21:44
Thanks for the working answer. It looks like you are matching start of input and then a global search for all non-word characters. I think I just didn't understand the mechanics of how the search is conducted. When a match is found, the replace function does a replacement and then moves to the next character, starting the search over. Right? – Judd Franklin Commented Mar 3, 2017 at 21:55


6 Answers


The problem is the meaning of \W*. It means "0 or more non-word characters". This means that the empty string "" would match, given that it is indeed 0 non-word characters.

So the regex matches before every character in the string and at the end, hence why all the replacements are done.
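
One way to see this, as a minimal sketch (the variable names are only illustrative and not from the question), is to list every match of /\W*/g with String.prototype.matchAll and count how many of them are empty:

```javascript
// Illustrative sketch: list every match /\W*/g produces for the question's input.
const input = "The Number is (123)(234)";

// matchAll requires the g flag; each result carries the matched text and its index.
const matches = [...input.matchAll(/\W*/g)].map((m) => ({
  index: m.index,
  text: m[0],
}));

const emptyMatches = matches.filter((m) => m.text === "");
const nonEmptyMatches = matches.filter((m) => m.text !== "");

console.log(matches.length);                     // 23 matches in total
console.log(emptyMatches.length);                // 18 of them are empty
console.log(nonEmptyMatches.map((m) => m.text)); // [ ' ', ' ', ' (', ')(', ')' ]
```

Each of those 23 matches is replaced by one underscore, which accounts for the 23 underscores in the output shown in the question.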

You want either /\W/g (replacing each individual non-word character) or /\W+/g (replacing each set of consecutive non-word characters).

"The Number is (123)(234)".replace(/\W/g, '_')  // "The_Number_is__123__234_"
"The Number is (123)(234)".replace(/\W+/g, '_') // "The_Number_is_123_234_"

TL;DR

Never use a pattern that can match an empty string in a regex replace method if your aim is to replace text rather than insert it.

To replace each separate occurrence of a non-word char in a string, use .replace(/\W/g, '_') (that is, remove the * quantifier, which matches zero or more occurrences of the quantified subpattern).

To replace each chunk of consecutive non-word chars in a string with a single replacement, use .replace(/\W+/g, '_') (that is, replace the * quantifier with +, which matches one or more occurrences of the quantified subpattern).

Note: the solution below is tailored to the OP's much more specific requirements.

A string is parsed by the JS regex engine as a sequence of chars and locations in between them. See the following diagram where I marked locations with hyphens:

  -T-h-e- -N-u-m-b-e-r- -i-s- -(-1-2-3-)-(-2-3-4-)-
                                               |
  ||Location between T and h, etc. .............  |
  |1st symbol                                     |
start                     ->                     end

All these positions can be analyzed and matched with a regex.

Since /\W*/g is a regex matching all non-overlapping occurrences (due to the g modifier) of 0 or more (due to the * quantifier) non-word chars, every position before a word char is matched as well, and so is the end of the string. For example, the location between T and h is tested with the regex, and as there is no non-word char there (h is a word char), an empty match is returned (as \W* can match an empty string).

So, you need to replace the start of the string and each non-word char with a _. A naive approach is to use .replace(/\W|^/g, '_'). However, there is a caveat: if a string starts with a non-word character, no extra _ will be added at the start of the string:

console.log("Hi there.".replace(/\W|^/g, '_'));  // _Hi_there_
console.log(" Hi there.".replace(/\W|^/g, '_')); // _Hi_there_

Note that here, \W comes first in the alternation and "wins" when matching at the beginning of the string: the space is matched and then no start position is found at the next match iteration.

You may now think you can match with /^|\W/g. Look here:

console.log("Hi there.".replace(/^|\W/g, '_'));  // _Hi_there_
console.log(" Hi there.".replace(/^|\W/g, '_')); // _ Hi_there_

The second result, _ Hi_there_, shows how the JS regex engine handles zero-width matches during a replace operation: once a zero-width match (here, the position at the start of the string) is found, the replacement occurs and the regex's lastIndex property is incremented, so the search resumes at the position after the first character. That is why the first space is preserved and is no longer matched with \W.
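
The advance after a zero-width match can be imitated by hand. The following is only a simplified model of what replace does with a global regex, not the engine's actual implementation: replaceGlobally is a hypothetical helper, the replacement is treated as literal text (no $1 expansion), and flags other than g are ignored.

```javascript
// Simplified model of a global replace; replaceGlobally is a hypothetical helper.
// Assumptions: literal replacement text (no $1 expansion), only the g flag is used.
function replaceGlobally(input, pattern, replacement) {
  const re = new RegExp(pattern.source, "g"); // fresh copy, lastIndex starts at 0
  let output = "";
  let copiedUpTo = 0; // end of the part of input that has already been handled
  let match;
  while ((match = re.exec(input)) !== null) {
    // Copy the untouched text before the match, then the replacement.
    output += input.slice(copiedUpTo, match.index) + replacement;
    copiedUpTo = match.index + match[0].length;
    // After a zero-width match, step one position forward so the loop
    // does not find the same empty match again.
    if (match[0] === "") {
      re.lastIndex += 1;
    }
  }
  return output + input.slice(copiedUpTo);
}

console.log(replaceGlobally(" Hi there.", /^|\W/, "_")); // _ Hi_there_
console.log(replaceGlobally(" Hi there.", /\W|^/, "_")); // _Hi_there_
```

In the first call the empty match at index 0 moves lastIndex to 1, so the leading space is never offered to \W and is copied through unchanged.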

A solution is to let the start-of-string branch consume a leading non-word character when there is one, and to choose the replacement in a callback:

console.log("Hi there.".replace(/^(\W?)|\W/g, function($0,$1) { return $1 ? "__" : "_"; }));  // _Hi_there_
console.log(" Hi there.".replace(/^(\W?)|\W/g, function($0,$1) { return $1 ? "__" : "_"; })); // __Hi_there_

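You can use the RegExp /(^\W*){1}|\W(?!=\w)/g to match the start of the string, together with any leading non-word characters, or any later \W. Note that (?!=\w) is a negative lookahead for a literal = followed by a word character, not a test for "\W not followed by \w", so it excludes nothing for this input, and {1} is redundant.

var str = "The Number is (123)(234)";
var res = str.replace(/(^\W*){1}|\W(?!=\w)/g, "_");
console.log(res); // _The_Number_is__123__234_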

You should have used /\W+/g instead.

"*" means "zero or more of the preceding token", so \W* can also match nothing at all.

It's because you're using the * operator, which matches zero or more occurrences. So the empty position between every pair of characters matches as well. If you replace the expression with /\W+/g, only runs of one or more non-word characters are replaced.

This should work for you

Find: (?=.)(?:^\W|\W$|\W|^|(.)$)
Replace: $1_

Cases explained:

 (?= . )       # Must be at least 1 char
 (?:           # Ordered Cases:
      ^ \W          # BOS + non-word (consumes bos)
   |  \W $          # Non-word + EOS (consumes eos)
   |  \W            # Non-word
   |  ^             # BOS
   |  ( . )         # (1), Any char + EOS
      $ 
 )

Note this could have been done without the lookahead via
(?:^\W|\W$|\W|^$)

But, this will insert a single _ on an empty string.
So, it ends up being more elaborate.
All in all though, it's a simple replacement.
Unlike Stribiżew's solution, no callback logic is required
on the replace side.
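
As a rough sketch, the Find/Replace pair above could be written as a JavaScript call like this. The names edgePattern and padWithUnderscores are hypothetical and used only for illustration; the whitespace and comments of the explained form are dropped because JavaScript regex literals do not support them.

```javascript
// Sketch: the Find/Replace pair written as a JavaScript replace call.
// edgePattern and padWithUnderscores are hypothetical names for this example.
const edgePattern = /(?=.)(?:^\W|\W$|\W|^|(.)$)/g;

function padWithUnderscores(text) {
  // When group 1 did not take part in the match, "$1" expands to "".
  return text.replace(edgePattern, "$1_");
}

console.log(padWithUnderscores("The Number is (123)(234)")); // _The_Number_is__123__234_
console.log(padWithUnderscores(" Hi there."));               // _Hi_there_
console.log(padWithUnderscores("Hi"));                       // _Hi_
console.log(padWithUnderscores(""));                         // prints an empty line
```

Note that the (.)$ branch also appends an underscore after a trailing word character, as the third call shows.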
