
python - Regex matching with negative look-behind assertion - Stack Overflow


I am parsing C source files. I want to match all variables (in snake-case format) that end in _VALUE and do not begin with CANA_, CANB_, ..., CANF_. I need to match the whole variable name for later substitution.

This is my current setup in Python:

import re

def signal_ending_VALUE_updater(match: re.Match) -> str:
    groups = match.groupdict()
    return some_operation_on(groups["SIGNAL_NAME"])

REGEX=r"(?<!CAN[A-F]_)\b(?P<SIGNAL_NAME>\w+_VALUE)\b"

with open(file_path,'r') as f:
   content = f.read()
   content_new = re.sub(REGEX,signal_ending_VALUE_updater,content)

Unfortunately this regex does not always work. For example, with this test case:

test="        shared->option.mem = ((canAGetScuHmiVehReqLiftModBtnSt() == CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE) ||"
re.search(REGEX, test)

This returns the variable (CANA_SCU_HMI...) that I don't want to match. What am I not considering in the regex?

The idea behind the regex is:

  • (?<!CAN[A-F]_): a negative lookbehind meant to ensure the match does not start with CAN followed by one of the letters A to F and an underscore (_).
  • \b: a word boundary, ensuring that we match whole words and not part of a word.
  • (?P<SIGNAL_NAME>\w+_VALUE):
    • (?P<SIGNAL_NAME>...): a capturing group named SIGNAL_NAME.
    • \w: same as [a-zA-Z0-9_] for ASCII input; matches the characters of snake-case variable names.
    • +: matches one or more of the preceding token.
    • _VALUE: matches the literal string _VALUE at the end of the variable name.
  • \b: again a word boundary, ensuring the match ends right after the variable name.

Asked 2 days ago by Jhonathan Asimov
  • Did you mean a negative lookahead after the word boundary, \b(?!, as in \b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b? regex101/r/ZLVEEn/1 – The fourth bird, commented 2 days ago
  • No, I meant a lookbehind, but your solution works, thank you! I tried the negative lookahead before switching to the lookbehind but could not make it work. What I missed was the \b at the beginning; without it I get incorrect matches (like this one: regex101/r/ZLVEEn/3). Why does that \b fix it, and why do I get the strange behaviour shown in the link I provided? – Jhonathan Asimov, commented 2 days ago
  • If you don't use that \b, the engine moves one character forward, and from that position there is a match because the assertion is true and there is no other "rule", such as a word boundary, to prevent it. – The fourth bird, commented 2 days ago
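
The behaviour described in the last comment can be reproduced with a small sketch. It assumes the pattern behind the second regex101 link is the lookahead version with the leading \b removed; that is a guess, since the link contents are not shown here. The variable names are illustrative only.

```python
import re

name = "CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE"

# Assumed pattern: negative lookahead with no leading word boundary.
without_boundary = re.compile(r"(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b")
# Pattern suggested in the comments: word boundary first, then the lookahead.
with_boundary = re.compile(r"\b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b")

# At index 0 the lookahead sees "CANA_" and fails, so the engine retries at
# index 1, where the text starts with "ANA_" and the lookahead passes.
partial = without_boundary.search(name)
print(partial.start(), partial.group("SIGNAL_NAME"))

# With \b, index 0 is the only word boundary before the name ends, and the
# lookahead rejects it, so nothing matches.
print(with_boundary.search(name))
```

Without the boundary the match starts one character late and captures the name minus its leading C; with the boundary there is no position inside the identifier where a match may begin.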

1 Answer

This part of your regex, (?<!CAN[A-F]_)\b, asserts that the pattern CAN[A-F]_ does not occur directly to the left of the current position, and then requires a word boundary at that same position.

You get a match for CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE because the assertion is evaluated at the beginning of that text, where it is true: the characters directly to the left are ") == ", not CAN[A-F]_.
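
To see this concretely, here is a minimal sketch that runs the pattern from the question on the test line and prints the five characters the lookbehind actually inspects. The names LOOKBEHIND and left_context are illustrative and not part of the original code.

```python
import re

# Pattern from the question: negative lookbehind placed before the word boundary.
LOOKBEHIND = re.compile(r"(?<!CAN[A-F]_)\b(?P<SIGNAL_NAME>\w+_VALUE)\b")

test = "        shared->option.mem = ((canAGetScuHmiVehReqLiftModBtnSt() == CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE) ||"

match = LOOKBEHIND.search(test)
start = match.start()

# CAN[A-F]_ is five characters long, so the lookbehind checks the five
# characters immediately to the left of the position where the match starts.
left_context = test[start - 5:start]

print(repr(left_context))
print(match.group("SIGNAL_NAME"))
```

The lookbehind looks at the text before the identifier, not at the identifier's own prefix, so it cannot exclude names that begin with CANA_.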

What you can do instead is start with a word boundary, and then assert that what is directly to the right does not match the pattern CAN[A-F]_:

\b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b

See a regex 101 demo
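
As a rough end-to-end sketch, the lookahead pattern can be plugged into re.sub with a replacement function. The function rename_signal, the UPDATED(...) wrapper, and the shortened identifiers in source are hypothetical stand-ins for the question's some_operation_on and real C code.

```python
import re

# Pattern from the answer: word boundary first, then a negative lookahead.
REGEX = re.compile(r"\b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b")


def rename_signal(match: re.Match) -> str:
    # Hypothetical replacement: wrap the matched name so the change is visible.
    return f"UPDATED({match.group('SIGNAL_NAME')})"


# Made-up C-like line with one excluded name and one name that should change.
source = "x = (a == CANA_BTN_ST_PRESSED_VALUE) || (b == SCU_MODE_DEFAULT_VALUE);"

result = REGEX.sub(rename_signal, source)
print(result)
```

Only SCU_MODE_DEFAULT_VALUE is rewritten; the CANA_ name is left untouched because the lookahead rejects it at the single word boundary where a match could start.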
