java - Why does \B work but not \b? - Stack Overflow

Wanted to match a word that ends with # like

hi hello# world#

I tried to use boundary

\b\w+#\b

and it doesn't match. I thought \b is a non-word boundary, but that doesn't seem to be so in this case.


Surprisingly

\b\w+#\B

matches!

So why does \B work here and not \b? Also, why doesn't \b work in this case?


NOTE: Yes we can use \b\w+#(?=\s|$) but I want to know why \B works in this case!

Asked May 18, 2013 at 10:22 by Anirudha; edited May 18, 2013 at 10:29.
  • Read http://www.regular-expressions.info/wordboundaries.html – jlordo Commented May 18, 2013 at 10:27
  • @Anirudh I think that's because of the space after the first #. – Maroun Commented May 18, 2013 at 10:31
  • @MarounMaroun yes indeed and that space should have been matched by \b – Anirudha Commented May 18, 2013 at 10:33
  • It has nothing to do with the space... everything to do with the #. – Ayman Safadi Commented May 18, 2013 at 10:36
  • @AymanSafadi It does have something to do with the space, because the pattern does match the string hi hello#world#. – tom Commented May 18, 2013 at 10:59

3 Answers

Definition of word boundary \b

Defining a word boundary in words is imprecise. Let me define it with look-ahead, look-behind, and the shorthand word character class \w.

A word boundary \b is equivalent to:

(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))

Which means:

  • Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).

    OR

  • Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).

(Note how similar this is to the expansion of XOR into conjunction and disjunction)
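As a quick sanity check of that equivalence, here is a minimal Java sketch that lists every index where \b matches in the question's sample string and compares it with the look-around form. The class name WordBoundaryPositions and the helper positions are made up for this illustration, and the input is plain ASCII, so Java's partial Unicode awareness for \b should not come into play.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class WordBoundaryPositions {
    // Collects the start index of every (zero-width) match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "hi hello# world#";
        // Built-in word boundary.
        System.out.println(positions("\\b", input)); // prints [0, 2, 3, 8, 10, 15]
        // Look-around expansion given above.
        System.out.println(positions("(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))", input)); // prints [0, 2, 3, 8, 10, 15]
    }
}
```

Note that index 9 (right after the first #) and index 16 (the end of the string) are missing from both lists.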

A non-word boundary \B is equivalent to:

(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))

Which means:

  • Right ahead and right behind, we cannot find any word character. Note that an empty string is considered a non-word boundary under this definition.

    OR

  • Right ahead and right behind, both sides are word characters. Note that this branch requires 2 characters, i.e. cannot occur at the beginning or the end of a non-empty string.

(Note how similar this is to the expansion of XNOR into conjunction and disjunction).
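The same kind of check can be sketched for \B in Java. Again, the class name NonWordBoundaryPositions and the helper positions are only illustrative names, and the sample input is the ASCII string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class NonWordBoundaryPositions {
    // Collects the start index of every (zero-width) match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "hi hello# world#";
        List<Integer> builtIn = positions("\\B", input);
        List<Integer> lookAround = positions("(?:(?<!\\w)(?!\\w)|(?<=\\w)(?=\\w))", input);
        System.out.println(builtIn);
        System.out.println(lookAround);
        // Both lists should be identical on this input.
        System.out.println(builtIn.equals(lookAround));
        // The empty string has exactly one position, and \B matches there.
        System.out.println(positions("\\B", ""));
    }
}
```

On this input both lists should contain indices 9 and 16, the two positions right after a #, which is exactly where the pattern in the question needs \B to succeed.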

Definition of word character \w

Since the definition of \b and \B depends on the definition of \w [1], you need to consult the documentation of your regex flavor to know exactly what \w matches.

[1] Most regex flavors define \b based on \w. The exception is Java [Point 9], where in default mode \w is ASCII-only and \b is partially Unicode-aware.

  • In JavaScript, it would be [A-Za-z0-9_] in default mode.

  • In .NET, \w by default would match [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}\p{Nd}\p{Pc}], and it behaves the same as JavaScript if the ECMAScript option is specified. About the Pc category, you only have to know that space (ASCII 32) is not included.

Answer to the question

With the definition above, answering the question becomes easy:

"hi hello# world#"

In hello#, the character after # is a space (U+0020, in the Zs category), which is not a word character, and # itself is not a word character (in Unicode, it is in the Po category). Therefore, \B can match here. The branch (?<!\w)(?!\w) is used in this case.

In world#, the # is followed by the end of the string. Since # is not a word character, and we cannot find any word character ahead (there is nothing there), \B can match the empty string just after #. The branch (?<!\w)(?!\w) is also used in this case.
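To see this end to end, here is a small Java sketch that runs the patterns from the question against the sample string. HashWordMatch and matches are hypothetical names chosen for this example. The last two lines use the string from tom's comment, where the first # is directly followed by a word character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class HashWordMatch {
    // Returns the text of every match of regex in input.
    static List<String> matches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "hi hello# world#";
        // \b after # fails: # is followed by a space or by the end of the string.
        System.out.println(matches("\\b\\w+#\\b", input));        // prints []
        // \B after # succeeds: neither side is a word character.
        System.out.println(matches("\\b\\w+#\\B", input));        // prints [hello#, world#]
        // The look-ahead alternative mentioned in the question.
        System.out.println(matches("\\b\\w+#(?=\\s|$)", input));  // prints [hello#, world#]

        String noSpace = "hi hello#world#";
        // Here the first # is followed by 'w', so \b does match after it.
        System.out.println(matches("\\b\\w+#\\b", noSpace));      // prints [hello#]
        System.out.println(matches("\\b\\w+#\\B", noSpace));      // prints [world#]
    }
}
```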

Addendum

Alan Moore gives quite a good summary in the comment:

I think the key point to remember is that regexes can't read. That is, they don't deal in words, only in characters. When we say \b matches the beginning or end of a word, we don't mean it identifies a word and then seeks out its endpoints, like a human would. All it can see is the character before the current position and the character after the current position. Thus, \b only indicates that the current position could be a word boundary. It's up to you to make sure the characters on either side are what they should be.

The pound symbol # is not considered a "word character".

\b\w+#\b doesn't work because the text matched by \w+# does not end in a word character, so there is no word boundary after the # and the pattern will not match world#.
\b\w+6\b, on the other hand, ends in a word character, so it will match world6.
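A minimal Java sketch of that contrast follows. The class name TrailingCharDemo and the helper found are invented for this example, and both test strings are taken on their own, so the trailing character sits at the end of the input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class TrailingCharDemo {
    // True if regex matches somewhere inside input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // '6' is a word character, so the end of the string after it is a word boundary.
        System.out.println(found("\\b\\w+6\\b", "world6")); // prints true
        // '#' is not a word character, so there is no word boundary after it.
        System.out.println(found("\\b\\w+#\\b", "world#")); // prints false
    }
}
```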

"Word Characters" are defined by: [A-Za-z0-9_].

Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".

— http://www.regular-expressions.info/wordboundaries.html

The # and the space are both non-word characters, so the invisible boundary between them is not a word boundary. Therefore \b will not match it and \B will match it.
