python - Regex matching with negative look-behind assertion

I am parsing C source files. I want to match all the variables (in snake-case format) that end in _VALUE and don't begin with CANA_, CANB_, ..., CANF_. I need to match the whole variable name for later substitution.

This is my current setup in Python:

import re

def signal_ending_VALUE_updater(match: re.Match) -> str:
    groups = match.groupdict()
    # some_operation_on is a placeholder for the actual transformation
    return some_operation_on(groups["SIGNAL_NAME"])

REGEX = r"(?<!CAN[A-F]_)\b(?P<SIGNAL_NAME>\w+_VALUE)\b"

# file_path is the path of the C source file being processed
with open(file_path, 'r') as f:
    content = f.read()
    content_new = re.sub(REGEX, signal_ending_VALUE_updater, content)

Unfortunately, this regex doesn't always work. For example, with this test case:

test="        shared->option.mem = ((canAGetScuHmiVehReqLiftModBtnSt() == CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE) ||"
re.search(REGEX, test)

This returns the variable (CANA_SCU_HMI...) that I don't want to match. What am I not considering in the regex?

The idea behind the regex is:

(?<!CAN[A-F]_): a negative look-behind, meant to ensure that the match does not start with CAN followed by one of the letters A, B, C, D, E, or F, and an underscore (_).
\b: a word boundary, ensuring that we match whole words and not part of a word.
(?P<SIGNAL_NAME>\w+_VALUE):
- (?P<SIGNAL_NAME>...): a capturing group with the name SIGNAL_NAME
- \w: for ASCII input, the same as [a-zA-Z0-9_]; matches the characters of snake-case variable names
- +: matches one or more of the preceding token
- _VALUE: matches the literal string _VALUE at the end of the variable name.
\b: again a word boundary, ensuring that the match ends right after the variable name.

asked 2 days ago by Jhonathan Asimov

4 Did you mean a negative lookahead after the word boundary, \b(?!, as in \b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b ? regex101/r/ZLVEEn/1 – The fourth bird Commented 2 days ago
No, I meant look-behind, but your solution works, thank you! I tried the negative look-ahead before moving to the look-behind but could not make it work. What I missed was the \b at the beginning; without it I get incorrect matches (like this one: regex101/r/ZLVEEn/3). Why does that \b fix it, and why do I get the strange behaviour shown in the link I provided without it? – Jhonathan Asimov Commented 2 days ago
1 If you don't use that \b, the engine moves one character forward after the first attempt fails, and from that position there is a match, because the assertion is true there and there is no other "rule" like a word boundary to prevent it. – The fourth bird Commented 2 days ago
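
To illustrate the comment above, here is a minimal sketch comparing the look-ahead with and without the leading \b. The shortened test string and the variable names (no_boundary, with_boundary, partial, anchored) are only illustrative assumptions, not part of the original post.

```python
import re

test = "(canAGetScuHmiVehReqLiftModBtnSt() == CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE) ||"

# Look-ahead without a leading \b: the attempt starting at "C" is rejected,
# but the attempt one character later (at "A") passes the assertion.
no_boundary = re.compile(r"(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b")

# Look-ahead after \b: a match may only start at a word boundary,
# so the engine cannot begin in the middle of the identifier.
with_boundary = re.compile(r"\b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b")

partial = no_boundary.search(test)
anchored = with_boundary.search(test)

print(partial.group("SIGNAL_NAME"))  # the name with its first letter cut off
print(anchored)                      # None: the CANA_ variable is skipped
```

Without \b the match starts at the second character of the identifier, which is the "strange" partial match; with \b no match is found in this string at all.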

1 Answer

This part of your regex, (?<!CAN[A-F]_)\b, asserts that the pattern CAN[A-F]_ does not occur directly to the left of the current position, and that the current position is a word boundary.

You get a match for the text CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE because at the beginning of that text the assertion is true: the look-behind inspects the characters before the C, not the start of the variable name itself.
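
As a rough sketch of why the assertion holds there, the snippet below prints the five characters that the look-behind actually inspects (CAN[A-F]_ is five characters wide). The names match, start and left_context are illustrative assumptions.

```python
import re

REGEX = r"(?<!CAN[A-F]_)\b(?P<SIGNAL_NAME>\w+_VALUE)\b"
test = "        shared->option.mem = ((canAGetScuHmiVehReqLiftModBtnSt() == CANA_SCU_HMI_VEH_REQ_LIFT_MOD_BTN_ST_PRESSED_VALUE) ||"

match = re.search(REGEX, test)
start = match.start()

# The look-behind checks the 5 characters to the LEFT of the match start,
# i.e. the text before the variable, not the variable's own prefix.
left_context = test[start - 5:start]

print(repr(left_context))            # what the look-behind sees
print(match.group("SIGNAL_NAME"))    # the unwanted CANA_... match
```

The inspected text is the tail of the comparison operator and its surrounding spaces, so the negative look-behind succeeds and the whole CANA_ name is matched.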

What you can do instead is start with a word boundary, and then assert that what is directly to the right does not match the pattern CAN[A-F]_:

\b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b

See a regex 101 demo
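
A minimal sketch of the substitution from the question using this pattern might look as follows. The sample content string and the "NEW_" prefix are hypothetical stand-ins for the real file contents and for some_operation_on, which the question does not define.

```python
import re

REGEX = r"\b(?!CAN[A-F]_)(?P<SIGNAL_NAME>\w+_VALUE)\b"

def signal_ending_VALUE_updater(match: re.Match) -> str:
    # Hypothetical operation: prefix the matched name with "NEW_".
    return "NEW_" + match.group("SIGNAL_NAME")

# Hypothetical sample input standing in for the contents of a C source file.
content = "a = CANA_SPEED_VALUE + MOTOR_SPEED_VALUE + CANG_TEMP_VALUE;"
content_new = re.sub(REGEX, signal_ending_VALUE_updater, content)

print(content_new)
```

In this sample, CANA_SPEED_VALUE is left untouched, while MOTOR_SPEED_VALUE and CANG_TEMP_VALUE are rewritten, since G is outside the A-F range.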
