
javascript - meaning of (\/?) in regex; is (\w+)([^>]*?) a redundancy? - Stack Overflow


I think this regular expression should match an HTML start tag.

var results = html.match(/<(\/?)(\w+)([^>]*?)>/);

I see that it first matches the <, but I am confused about what the capture (\/?) accomplishes. Am I correct in reasoning that ([^>]*?)> matches any character except >, zero or more times? If so, why is the (\w+) capture necessary? Doesn't it fall within the purview of ([^>]*?)?


asked Jul 3, 2013 at 16:40 by 1252748
  • It finds end tags too, i.e. </b> as well as <b>. The (\w+) captures the tag name as its own group, so it can be used in a replacement instead of being bundled with the attribute section. For a plain match you don't need it, but it helps if the regex is reused in a replace(). – dandavis Commented Jul 3, 2013 at 16:43
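The replace() use mentioned in the comment could look roughly like the sketch below. The helper name renameTag, its parameters, and the sample markup are made up for illustration, and the g flag is added here (it is not in the original regex) so that every tag is visited.

```javascript
// Same pattern as in the question, plus the g flag (an assumption of this
// sketch) so that replace() visits every tag instead of only the first.
const replacePattern = /<(\/?)(\w+)([^>]*?)>/g;

// Hypothetical helper: renames one tag name to another, keeping the
// optional slash (group 1) and the attribute text (group 3) untouched.
function renameTag(html, from, to) {
  return html.replace(replacePattern, (whole, slash, name, rest) =>
    name === from ? `<${slash}${to}${rest}>` : whole);
}

const renamed = renameTag('<p>a <b class="x">bold</b> word</p>', 'b', 'strong');
console.log(renamed); // <p>a <strong class="x">bold</strong> word</p>
```

Because the slash and the tag name are separate captures, the same callback handles both <b ...> and </b>.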

5 Answers


Take it token by token:

  • / begin regex literal
  • < match a literal <
  • (\/?) match 0 or 1 (?) literal /, which is escaped by the \
  • (\w+) match one or more "word characters"
  • ([^>]*?) lazily* match zero or more (*?) of anything that is not a >
  • > match a literal >
  • / end regex literal

lazily* - adding "?" after a repetition quantifier makes it lazy, meaning the regex will match the preceding token as few times as possible while still letting the overall match succeed. See the documentation.
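A small sketch of what laziness changes, using a made-up sample string and variable names. Because [^>] can never consume a >, the lazy and greedy forms happen to match the same text in this particular pattern; the difference only shows up with a token such as . that can run past the first >.

```javascript
// The pattern from the question (lazy) and a greedy variant for comparison.
const lazyTag = /<(\/?)(\w+)([^>]*?)>/;
const greedyTag = /<(\/?)(\w+)([^>]*)>/;

const sample = '<a href="#">link</a>';

// [^>] cannot cross a ">", so both variants stop at the first ">".
const lazyMatch = sample.match(lazyTag)[0];
const greedyMatch = sample.match(greedyTag)[0];
console.log(lazyMatch === greedyMatch); // true

// With "." instead of [^>], laziness does matter.
const dotLazy = sample.match(/<.*?>/)[0];  // stops at the first ">"
const dotGreedy = sample.match(/<.*>/)[0]; // runs to the last ">"
console.log(dotLazy);   // <a href="#">
console.log(dotGreedy); // <a href="#">link</a>
```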

So essentially this regular expression will match "<", optionally followed by a "/", followed by one or more letters, digits, or underscores, followed by zero or more characters that are not ">", and finally followed by a ">".
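To make the summary concrete, here is a minimal sketch that puts the three captures into named fields. The helper parseTag and the field names closing, name, and rest are hypothetical, chosen only for this example.

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// Hypothetical helper: returns the three captures as named fields,
// or null when the text contains no tag the pattern can match.
function parseTag(text) {
  const m = text.match(tagPattern);
  if (m === null) return null;
  return { closing: m[1] === '/', name: m[2], rest: m[3] };
}

const openTag = parseTag('<a href="#">');
const closeTag = parseTag('</strong>');

console.log(openTag);  // { closing: false, name: 'a', rest: ' href="#"' }
console.log(closeTag); // { closing: true, name: 'strong', rest: '' }
```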

That being said, the token (\w+) is not redundant, as it ensures there is at least one word character directly after the < (or </), and it captures the tag name separately from whatever follows.
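One way to check this is to compare the pattern with a variant that drops (\w+). The variant and the sample inputs below are made up for the comparison.

```javascript
const withName = /<(\/?)(\w+)([^>]*?)>/;
// Hypothetical variant with the (\w+) group removed.
const withoutName = /<(\/?)([^>]*?)>/;

const inputs = ['<>', '< >', '<=3>', '<b>'];

// For each input, record whether each pattern finds a match.
const comparison = inputs.map((input) => ({
  input,
  withName: withName.test(input),
  withoutName: withoutName.test(input),
}));

console.log(comparison);
// Only '<b>' passes withName; all four inputs pass withoutName.
```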

Please be aware that attempting to parse HTML with regular expressions is generally a bad idea.

Using the power of Debuggex to generate an image for you :)

<(\/?)(\w+)([^>]*?)>

[Debuggex diagram showing how the regex is evaluated]

As you can see, it matches HTML-tags (opening and closing tags). The regex contains three capture groups, capturing the following:

  1. (\/?) existence of / (it's a closing tag, if present)
  2. (\w+) name of the tag
  3. ([^>]*?) everything else until the tag closes (e.g. attributes)

This way it matches <a href="#">. Interestingly, it does not match <a data-fun="fun>nofun"> correctly, because it stops at the > within the data-fun attribute, although (I think) > is valid in an attribute value.
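The early stop is easy to reproduce. The variable names in this sketch are arbitrary; the input is the example tag from the paragraph above.

```javascript
const attrPattern = /<(\/?)(\w+)([^>]*?)>/;
const tricky = '<a data-fun="fun>nofun">';

const trickyMatch = tricky.match(attrPattern);

// The match ends at the first ">", which sits inside the attribute value.
console.log(trickyMatch[0]); // <a data-fun="fun>
console.log(trickyMatch[3]); //  data-fun="fun   (with a leading space)
```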

Another funny thing is that the tag-name capture does not capture all theoretically valid XHTML tag names. XHTML allows Letter | Digit | '.' | '-' | '_' | ':' | .. (source: XHTML spec). (\w+), however, does not match ., -, or :. An imaginary <.foobar> tag will not be matched by this regex. This should not have any real-life impact, though.
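A quick sketch of how this plays out, with made-up tag names. One detail worth noticing: when the non-word character comes after at least one word character, the regex still matches, but the name capture is cut short and the remainder lands in the third group.

```javascript
const namePattern = /<(\/?)(\w+)([^>]*?)>/;

// A name starting with "." cannot satisfy (\w+), so there is no match.
const dotTag = '<.foobar>'.match(namePattern);
console.log(dotTag); // null

// A hyphenated name matches, but group 2 stops at the "-".
const hyphenTag = '<my-element id="x">'.match(namePattern);
console.log(hyphenTag[2]); // my
console.log(hyphenTag[3]); // -element id="x"

// The same happens with a namespaced name.
const colonTag = '<svg:rect>'.match(namePattern);
console.log(colonTag[2]); // svg
console.log(colonTag[3]); // :rect
```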

You see that parsing HTML using regexes is a risky thing. You might be better off with an HTML parser.

(\/?) matches and captures the slash of any closing tag, such as </i> or </strong>.

Another thing to note is that \w is really the character class [a-zA-Z_\d], so other characters like =, ", etc. are not matched by it; they are, however, matched by [^>]. And yes, you are correct about that bit.

To answer your last question, (\w+) and ([^>]*?) are not redundant. They both serve important functions in the expression.

This expression finds start or end tags.

(\/?) matches a /, but the ? makes it optional.

(\w+) matches word characters, intended to match the tag name here.

([^>]*?) is intended to match attributes.

So if you had the string <div class="text">, the (\w+) in the expression would match div and the ([^>]*?) would match class="text" (including the space in front of it).

Demo (in Ruby, not JavaScript, but it makes no difference for this pattern): http://www.rubular.com/r/bhw2O28qUr
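A rough JavaScript counterpart to the demo might look like this. The sample markup and the field names kind, name, and attrs are invented for the sketch, and the g flag is added so that matchAll can walk through every tag.

```javascript
// The g flag is required by matchAll; it is an addition for this sketch.
const allTags = /<(\/?)(\w+)([^>]*?)>/g;
const markup = '<div class="text">Hello <b>world</b></div>';

// Collect every tag, labelling it as a start or end tag via group 1.
const tags = [...markup.matchAll(allTags)].map(([, slash, name, attrs]) => ({
  kind: slash ? 'end' : 'start',
  name,
  attrs: attrs.trim(), // drop the space that separates name and attributes
}));

console.log(tags);
// start div (class="text"), start b, end b, end div
```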

To summarise, the (\/?) is there to capture end tags.
