regex - Python's regular expression: Non greedy optional group followed by another optional group - Stack Overflow

I am using the following regular expression in Python: ^( .+?)?( Com:.*)?$

(This regex might look dumb but it's actually part of a bigger more complex string, I have just extracted the problematic part.)

And I have these two strings (each starts with a space):

  1. " abc Com: 123"
  2. " Com: 123"

With the first string, group 1 matches only " abc" and group 2 matches " Com: 123". This is what I expect since group 1 is non greedy.

However with the second string, I would expect group 1 to match nothing and group 2 to match again " Com: 123". But no, group 1 matches the entire string and group 2 nothing. I don't get it.

What's going on here? Could someone explain?

Here is the link to the regex: https://regex101.com/r/E4idh8/2

Thanks in advance

asked Feb 6 at 22:10 by Serbenet

4 Answers

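Your first group as a whole isn't non-greedy; only part of it is (the ., which gets the +? quantifier). The group itself has a plain ?, as does group 2, and a plain ? is greedy: the engine first tries to match the group and skips it only if that fails.

Changing this first ? to ?? makes the group itself lazy (at least lazier than the second one), so the engine first tries to skip it: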

^( .+?)??( Com:.*)?$

(here with the modified test)
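The difference can be checked in Python with the standard re module. This is a minimal sketch, assuming both test strings begin with a space (as the captures quoted in the question imply); the helper name groups_for is made up for illustration.

```python
import re

ORIGINAL = r"^( .+?)?( Com:.*)?$"   # plain ?: the engine tries to use group 1 first
LAZY_OPT = r"^( .+?)??( Com:.*)?$"  # ??: the engine tries to skip group 1 first


def groups_for(pattern, text):
    """Return (group 1, group 2) for a match, or None if there is no match."""
    m = re.match(pattern, text)
    return m.groups() if m else None


for text in (" abc Com: 123", " Com: 123"):
    # A group that did not take part in the match is reported as None.
    print(repr(text), groups_for(ORIGINAL, text), groups_for(LAZY_OPT, text))
```

Both patterns split " abc Com: 123" the same way. For " Com: 123", the original puts the whole string in group 1 and leaves group 2 as None, while the ?? version leaves group 1 as None and puts the string in group 2.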

Try:

^((?! Com) (?:(?! Com).)+)?( Com:.*)?$

See: regex101


Explanation

  • ^ ... $: Anchor the match to the whole line.
  • ( ... )?: Optionally match into group 1:
    • (?! Com) followed by a space: a space that is not followed by "Com"
    • (?: ... )+: then, one or more times,
      • (?! Com).: match a character only if the text starting at it is not " Com". See the tempered greedy token technique.
  • ( Com:.*)?: Optionally match " Com: ..." into group 2, as you already did.
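The breakdown above can be tried out in Python. This is a small sketch, assuming the input strings start with a space; TEMPERED and split_line are illustrative names, not part of the answer.

```python
import re

# Group 1: a space that does not begin " Com", then one or more characters,
# each checked by a lookahead so that none of them begins " Com" either.
TEMPERED = re.compile(r"^((?! Com) (?:(?! Com).)+)?( Com:.*)?$")


def split_line(text):
    """Return (group 1, group 2), or None if the line does not match."""
    m = TEMPERED.match(text)
    return m.groups() if m else None


print(split_line(" abc Com: 123"))  # group 1 stops right before " Com"
print(split_line(" Com: 123"))      # the lookahead rejects group 1 at position 0
print(split_line(" abc"))           # only group 1 takes part
```

Because the lookahead forbids " Com" anywhere inside group 1, the first group can never swallow the part meant for group 2, regardless of how greedy or lazy the quantifiers are.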

The issue that I see is with the space and the + in the first group, i.e. the minimum capture requirement in the first of two optional groups.

This is why the first group, even though it is lazy, can and will capture the Com: 123 at the beginning of the line.

The first capture group ( .+?)?:

  • Is immediately after ^, the beginning of the line.
  • Is lazy inside (.+?).
  • Is optional.
  • Requires a minimum of two characters to match (two characters is the laziest option):
    • a space and
    • one or more characters (.+).
  • Is located before the second group (reading from left to right): it gets to try to match before the second optional group gets a shot.

The second capture group ( Com:.*)?:

  • Is also optional.
  • Is located after the first group (reading from left to right): it has an opportunity to match only after the first group has tried.

This is why, on these inputs, your pattern behaves like ^( .+?)( Com:.*)?$.

When Com: 123 is at the beginning of the line, the first group will attempt to grab the first two characters, the space and the C (matched by .), which are its minimum requirement. This is the laziest it can get. The group cannot match an empty string, and because its ? is greedy, skipping the group is only tried as a last resort. After matching the minimum " C" there is only om: 123 left. This no longer matches the second group, so the first lazy group has to keep munching all the way to the end $.
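The walk-through above can be imitated by hand in Python. This is a simplified simulation, not how the regex engine is implemented: it only covers the branch where group 1 takes part in the match, assumes single-line input, and the names TAIL and first_lazy_split are hypothetical.

```python
import re

TAIL = re.compile(r"( Com:.*)?")  # what has to match between group 1 and the end


def first_lazy_split(text):
    """Try the shortest candidate for ( .+?) first, then grow it one character at a time."""
    if not text.startswith(" "):
        return None  # group 1 needs a leading space
    for end in range(2, len(text) + 1):  # minimum is a space plus one character
        head, rest = text[:end], text[end:]
        if TAIL.fullmatch(rest):  # can the remainder be consumed up to the end?
            return head, rest
    return None


print(first_lazy_split(" abc Com: 123"))
print(first_lazy_split(" Com: 123"))
```

For " Com: 123" every short candidate leaves a remainder such as "om: 123" that the tail cannot consume, so the loop only succeeds once the candidate is the whole string and the remainder is empty.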

The "super lazy" solution by @Guillaume Outters is elegant and perfect, because it allows you to keep the requirement for a space followed by one character as the minimum match for the first group.

However, to demonstrate the space-plus issue (i.e. the minimum requirement for the first of two optional patterns) with the pattern you had, here is a solution that would get you close:

^(.*?)?( Com:.*)?$

You would remove the space from the first group, because the period . matches spaces as well. You would also change .+ to .* so that the lazy quantifier does not have to capture anything. This way, because the first group is lazy and optional with no minimum capture requirement, when it sees Com: 123 ahead it stops right there and captures nothing (an empty string). More importantly, it does not consume the first space and another character, allowing the second group to capture the entire Com: 123.

There is a problem with this solution though. Although it still captures the leading space at the start of each group, it also matches any string that does not have a space at the beginning of the line. This can definitely be a problem.
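Both the benefit and the drawback can be seen in a short Python sketch. It assumes the question's strings start with a space; the helper captures is made up for illustration and deliberately reports a skipped group and an empty capture the same way.

```python
import re

ORIGINAL = re.compile(r"^( .+?)?( Com:.*)?$")
RELAXED = re.compile(r"^(.*?)?( Com:.*)?$")  # no space, and .* instead of .+


def captures(pattern, text):
    """Return both groups with None normalized to '', or None if there is no match."""
    m = pattern.match(text)
    if m is None:
        return None
    return tuple(g or "" for g in m.groups())


for text in (" abc Com: 123", " Com: 123", "abc"):
    print(repr(text), captures(ORIGINAL, text), captures(RELAXED, text))
```

The relaxed pattern leaves group 1 empty for " Com: 123", but it also accepts "abc", which has no leading space and which the original pattern rejects.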

Link: https://regex101.com/r/nISB75/1

This is why the solution by @Guillaume Outters is the right one to guarantee the desired outcome.

For comparison, @Guillaume Outters's solution ^( .+?)??( Com:.*)?$ with additional test strings: https://regex101.com/r/MobsDN/2


Great Cheat Sheet on Quantifiers: https://www.rexegg.com/regex-quantifiers.php#cheat_sheet

Group 2 of your regex ^( .+?)?( Com:.*)?$ is followed by a ?, so it is optional. Group 1 can therefore match the whole line while group 2 matches nothing, because group 1 gets to match first. In this case, the whole line Com: 123 is matched as group 1, with nothing left for group 2.

I suggest using ^((?!Com:).*?)( Com:.*)$ to match this case.
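Here is a quick Python check of that suggestion, again assuming the test strings start with a space; SUGGESTED and parts are illustrative names. Note that in this pattern neither group is optional, so a line without a Com: part does not match at all.

```python
import re

# Group 1 is lazy and may be empty; group 2 is mandatory in this pattern.
SUGGESTED = re.compile(r"^((?!Com:).*?)( Com:.*)$")


def parts(text):
    """Return (group 1, group 2), or None if the line does not match."""
    m = SUGGESTED.match(text)
    return m.groups() if m else None


print(parts(" abc Com: 123"))
print(parts(" Com: 123"))  # group 1 is an empty string here, not None
print(parts(" abc"))       # no " Com:" part, so there is no match
```

If lines without a Com: part also have to match, this pattern is not a drop-in replacement for the original.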
