RegEx match multiline blocks of text and return those that contain specific string - lookaround - Stack Overflow

This might seem like a common query, but there is a nuance that doesn't seem to be addressed in the other examples I've found (other than this one: How to write a REGEX search that will search through blocks of multiline patterns and only match on a block that contains a specified string? - although that question has a more readily defined start and end of blocks).

I have a file of data with header blocks; due to the flexibility of the header format they can include a variety of alternative identifiers.

I'm trying to identify which header blocks don't include a specific identifier and want to capture blocks that do with RegEx - and then remove them.

The nub of my issue (I think) is understanding which string the lookahead or lookbehind will examine. I've read through https://www.rexegg.com/regex-lookarounds.php, https://formulashq.com/lookbehind-regular-expressions-regex-explained/ and https://www.regular-expressions.info/lookaround.html, but I'm not sure how to force (encourage?) the lookarounds to examine the required text block. For instance, do they apply to capture groups?

Here's my sample text (note: empty lines between blocks for clarity; block always starts with #ZRXP and #LAYOUT is always a separate line, other fields are variable and may be on an arbitrary number of lines and in an arbitrary order):

#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#TSPATH/1/335582/RS/HDay.CmdTotal.P|*|CUNITmm|*|RINVAL-777|*|SNAMEDeskry Shiel|*|
#SANR335582|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|

#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#REXCHANGE849056.RS.DayTotal|*|TSPATH/1/115220/RS/HDay.CmdTotal.P|*|CUNITmm|*|
#RINVAL-777|*|SNAMEDunecht House|*|SANR115220|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|

#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#REXCHANGE758274.RS.DayTotal|*|TSPATH/2/115348/RS/HDay.CmdTotal.P|*|CUNITmm|*|
#RINVAL-777|*|SNAMEGreenland|*|SANR115348|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|

#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#TSPATH/2/724176/RS/HDay.CmdTotal.P|*|CUNITmm|*|RINVAL-777|*|
#SNAMEHarlosh (Dunvegan Skye)|*|SANR724176|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|

I want to return (and then remove, i.e. replace with "") just the blocks that include the #REXCHANGE identifier, i.e. blocks #2 and #3.

I can capture the blocks easily enough with ^#ZRXPVERSION.*#LAYOUT\(timestamp,value\)\|\*\|$

But I can't work out how to apply either (?=#REXCH) or (?<=#REXCH) to select only the 2nd and 3rd block in my sample.

None of these work (tested as PCRE with the gmsU flags; the final target is Notepad++, which uses the Boost regular expression library v1.85):

^#ZRXPVERSION((?=#REXCH).*#LAYOUT\(timestamp,value\)\|\*\|$)
^#ZRXPVERSION(.*#LAYOUT\(timestamp,value\)\|\*\|$)(?<#REXCH)
^#ZRXPVERSION(.*#LAYOUT\(timestamp,value\)\|\*\|)(?<#REXCH)$
^#ZRXPVERSION(?!#LAYOUT\(timestamp,value\)\|\*\|$)#REXCH(.*?)(#LAYOUT\(timestamp,value\)\|\*\|)$

I think what I'm fundamentally not grasping is what the scope of a lookahead/lookbehind is, and also how to work around the limitations on using .* inside a lookbehind.

Any suggestions for both a successful RegEx (including alternative approach) or a resource that explains the aspect of lookarounds that I'm missing welcome.

TIA - JS

Asked Feb 6 at 16:36 by jack_sprat (edited 5 hours ago)
  • Please specify the host language you are using for your regular expressions. – Booboo Commented Feb 6 at 17:31
  • Does each block consist of 4 lines with newlines between them, or is that just the way you formatted it? I ask because the dot character (.) does not normally match a newline. It would also make your question clearer if you showed explicitly what the expected output should be. – Booboo Commented Feb 6 at 17:43
  • @Booboo: Host language: PCRE and Boost (as per OP) re. formatting: you're correct, the spaces between blocks are just my formatting, for clarity; however, I already have RegEx to remove data and empty lines to leave just header blocks, plus the header block is spread over 4 lines and hence includes 3 \n. – jack_sprat Commented Feb 6 at 19:08
  • You say, "I want to return (and then remove, i.e. replace with "") just the blocks that include the #REXCHANGE identifier, i.e. blocks #2 and #3." If you really wanted to return just those blocks after they have been replaced with "", you would be returning the empty string. The point is that this language, at least to me, is very unclear. That is why I asked you to update the question with what you actually expect the final results to be. – Booboo Commented Feb 6 at 21:52
  • Hmmm, not really - I want the RegEx to return, i.e. identify, the blocks containing #REXCH; then I can do whatever I like with them, in this case replace them with "" (the empty string), i.e. delete them. I'm trying to keep the solution general to make it applicable for others, which is the point of SO. – jack_sprat Commented Feb 7 at 9:15

4 Answers

Since the start and end of each block are known, you just have to examine all the characters inside
the block. You need a lookahead to make sure that the start of a new block doesn't occur
before the inner keyword #REXCH is found.
This uses multiline mode, so that ^ can anchor the start line and the end line of the block.
It also uses the dot-match-newline modifier (?s).

(?ms)^\#ZRXP(?:(?!\#ZRXP).)*?\#REXCH(?:(?!\#ZRXP).)*?^\#LAYOUT[^\r\n]*(?:\r?\n){0,2}

https://regex101.com/r/Bt8o7h/1
https://regex101.com/r/g8aXIg/1

(?ms)
^ \#ZRXP 
(?: (?! \#ZRXP ) . )*?
\#REXCH 
(?: (?! \#ZRXP ) . )*?
^ \#LAYOUT [^\r\n]* 
(?: \r? \n ){0,2}
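
As a rough illustration of the removal step, here is a minimal Python sketch built on the pattern above. It assumes Python's re module accepts the pattern unchanged (the constructs used are common to PCRE and re). The sample is blocks 1, 2 and 4 from the question with no blank lines between them, and the names SAMPLE, BLOCK_WITH_REXCH and remove_rexchange_blocks are made up for this example.

```python
import re

# Blocks 1, 2 and 4 from the question, joined without blank lines.
SAMPLE = (
    "#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|\n"
    "#TSPATH/1/335582/RS/HDay.CmdTotal.P|*|CUNITmm|*|RINVAL-777|*|SNAMEDeskry Shiel|*|\n"
    "#SANR335582|*|CNAMERS|*|\n"
    "#LAYOUT(timestamp,value)|*|\n"
    "#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|\n"
    "#REXCHANGE849056.RS.DayTotal|*|TSPATH/1/115220/RS/HDay.CmdTotal.P|*|CUNITmm|*|\n"
    "#RINVAL-777|*|SNAMEDunecht House|*|SANR115220|*|CNAMERS|*|\n"
    "#LAYOUT(timestamp,value)|*|\n"
    "#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|\n"
    "#TSPATH/2/724176/RS/HDay.CmdTotal.P|*|CUNITmm|*|RINVAL-777|*|\n"
    "#SNAMEHarlosh (Dunvegan Skye)|*|SANR724176|*|CNAMERS|*|\n"
    "#LAYOUT(timestamp,value)|*|\n"
)

# The pattern from this answer, copied verbatim.
BLOCK_WITH_REXCH = re.compile(
    r"(?ms)^\#ZRXP(?:(?!\#ZRXP).)*?\#REXCH(?:(?!\#ZRXP).)*?^\#LAYOUT[^\r\n]*(?:\r?\n){0,2}"
)


def remove_rexchange_blocks(text: str) -> str:
    """Delete every header block that contains #REXCH (hypothetical helper)."""
    return BLOCK_WITH_REXCH.sub("", text)


cleaned = remove_rexchange_blocks(SAMPLE)
print(cleaned)
```

The trailing (?:\r?\n){0,2} also removes the line break after the deleted block, so the remaining blocks stay contiguous.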

The regex I would use, which assumes nothing about where in the block the string '#REXCHANGE' appears, is:

(?msx)                     # multiline, . matches newline, verbose (whitespace and comments ignored)
^\#ZRXPVERSION             # Look for this at the start of a line
(?:(?!^\#ZRXPVERSION).)*   # Eat characters one at a time without stepping over next block
\#REXCHANGE                # Match this
.*?                        # Non-greedy match shortest string until:
\(timestamp,value\)\|\*\|$ # We match the end of the block

The heart of the regex is (?:(?!^\#ZRXPVERSION).)*. The group (?:(?!^\#ZRXPVERSION).) contains a negative lookahead, which says that we can match the next character only if the text at the current position is not '#ZRXPVERSION' at the start of a line. The final * says that we can match 0 or more such characters.
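
To see what the lookahead guard buys, here is a small Python sketch comparing it with a plain lazy .*?. The two blocks are abbreviated from the question's sample (only the second contains #REXCHANGE), and the variable names are illustrative only.

```python
import re

# Two abbreviated blocks; only the second one contains #REXCHANGE.
text = (
    "#ZRXPVERSION2209.265|*|\n"
    "#SANR335582|*|CNAMERS|*|\n"
    "#LAYOUT(timestamp,value)|*|\n"
    "#ZRXPVERSION2209.265|*|\n"
    "#REXCHANGE849056.RS.DayTotal|*|\n"
    "#LAYOUT(timestamp,value)|*|\n"
)

END = r"\#LAYOUT\(timestamp,value\)\|\*\|$"

# Plain lazy dot: nothing stops it from running past the end of block 1.
naive = re.compile(r"(?ms)^\#ZRXPVERSION.*?\#REXCHANGE.*?" + END)

# Guarded dot: each character is consumed only if a new block does not start there.
tempered = re.compile(
    r"(?ms)^\#ZRXPVERSION(?:(?!^\#ZRXPVERSION).)*\#REXCHANGE.*?" + END
)

naive_match = naive.search(text).group(0)
tempered_match = tempered.search(text).group(0)

print(naive_match.count("#ZRXPVERSION"))     # number of block starts swallowed by the naive match
print(tempered_match.count("#ZRXPVERSION"))  # number of block starts in the guarded match
```

The naive pattern starts at the first block and keeps going until it finds #REXCHANGE in the second, so it spans both blocks. The guarded pattern cannot cross a block start, so the attempt beginning at block 1 fails and only block 2 is matched.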

See regex demo.

Python Program

import re

s = '''#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#TSPATH/1/335582/RS/HDay.CmdTotal.P|*|CUNITmm|*|RINVAL-777|*|SNAMEDeskry Shiel|*|
#SANR335582|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|

#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#REXCHANGE849056.RS.DayTotal|*|TSPATH/1/115220/RS/HDay.CmdTotal.P|*|CUNITmm|*|
#RINVAL-777|*|SNAMEDunecht House|*|SANR115220|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|

#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#REXCHANGE758274.RS.DayTotal|*|TSPATH/2/115348/RS/HDay.CmdTotal.P|*|CUNITmm|*|
#RINVAL-777|*|SNAMEGreenland|*|SANR115348|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|

#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#TSPATH/2/724176/RS/HDay.CmdTotal.P|*|CUNITmm|*|RINVAL-777|*|
#SNAMEHarlosh (Dunvegan Skye)|*|SANR724176|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|'''

rex = r'''(?msx)           # multiline, match newline with .
^\#ZRXPVERSION             # Look for this at the start of a line
(?:(?!^\#ZRXPVERSION).)*   # Eat characters one at a time without stepping over next block
\#REXCHANGE                # Match this
.*?                        # Non-greedy match shortest string until:
\(timestamp,value\)\|\*\|$ # We match the end of the block
'''

for block in re.findall(rex, s):
    print(block)
    print('-' * 81)

Prints:

#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#REXCHANGE849056.RS.DayTotal|*|TSPATH/1/115220/RS/HDay.CmdTotal.P|*|CUNITmm|*|
#RINVAL-777|*|SNAMEDunecht House|*|SANR115220|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|
---------------------------------------------------------------------------------
#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|ZRXPCREATORZRXP-Fileexport|*|TZUTC-0|*|
#REXCHANGE758274.RS.DayTotal|*|TSPATH/2/115348/RS/HDay.CmdTotal.P|*|CUNITmm|*|
#RINVAL-777|*|SNAMEGreenland|*|SANR115348|*|CNAMERS|*|
#LAYOUT(timestamp,value)|*|
---------------------------------------------------------------------------------

EDIT: This doesn't solve the OP's problem.

The OP does not have newlines in between their blocks, so my answer doesn't apply. Original answer below:


I think the core of the problem here is that, with the s flag present, .* cannot tell one block from the next (i.e. where \n\n occurs), so when the engine tries to match up to #REXCHANGE it will run straight across block boundaries. To solve that, we need an expression that allows \n but not \n\n.

This expression is something like ([^\n]+\n)+ when the s dot all flag is not present. So we can remove the s flag, leaving behind the gmU flags and use this expression. This matches against the blocks in your sample text with 2 capturing groups on regexr.com.

^#ZRXPVERSION([^\n]+\n)+#REXCHANGE([^\n]+\n)+#LAYOUT.*\|\*\|$

Note that when a capturing group is repeated multiple times, only the last capture is returned (at least on Python), so if capturing group results are important you might need to change the expression to fit your needs.
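
A quick way to check the repeated-group behaviour in Python, using the same ([^\n]+\n)+ construct on some made-up lines:

```python
import re

lines = "line one\nline two\nline three\n"

# The group is repeated three times, but group(1) keeps only the last repetition.
m = re.match(r"([^\n]+\n)+", lines)
whole_match = m.group(0)
last_repetition = m.group(1)

# If every line is needed, one option is to re-scan the overall match.
every_line = re.findall(r"[^\n]+\n", whole_match)

print(repr(last_repetition))
print(every_line)
```

Here group(0) still covers all three lines; only the per-group capture is overwritten on each repetition.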

In many regex engines, lookbehind usage is restricted to fixed width strings, so I don't think you can use that. As for lookaheads, that involves not including the expression in the resulting match, and I don't think that applies here either. Lookaheads try to match something without "consuming" the match. Try running this expression on your sample text using a web regex editor to see if it helps you understand.

#ZRXPVERSION(?=2209)2209
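
The same experiment can be sketched in Python. It shows that a lookahead is tested at the current position only and consumes nothing; the variable names are illustrative.

```python
import re

line = "#ZRXPVERSION2209.265|*|ZRXPMODEStandard|*|"

# The lookahead checks that "2209" comes next but does not add it to the match.
peek_only = re.match(r"#ZRXPVERSION(?=2209)", line)

# Because nothing was consumed, "2209" can still be matched afterwards.
peek_then_consume = re.match(r"#ZRXPVERSION(?=2209)2209", line)

# A lookahead is anchored where the engine currently stands: ZRXPMODE is
# further along the line, so this fails...
wrong_position = re.match(r"#ZRXPVERSION(?=ZRXPMODE)", line)

# ...unless the lookahead itself is allowed to scan forward with .*
scanning = re.match(r"#ZRXPVERSION(?=.*ZRXPMODE)", line)

print(peek_only.group(0))
print(peek_then_consume.group(0))
print(wrong_position)
print(scanning.group(0))
```

So the "scope" of a lookahead is whatever its own sub-pattern can reach starting from the current position, independent of any capture groups around it.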

RegEx101

Using PCRE 2:

/^(?=.*?#ZRXPVERSION).*\s*?(?=.*?#REXCHANGE)[\s\S]*?timestamp,value\)\|\*\|$/gm
Segment: ^(?=.*?#ZRXPVERSION).*\s*?
Explanation: Begins with a positive lookahead: the line must contain 0 or more of any character followed by the literal #ZRXPVERSION. The segment then matches 0 or more of any character, followed by 0 or more whitespace characters.

Segment: (?=.*?#REXCHANGE)[\s\S]*?
Explanation: Another positive lookahead: there must be 0 or more of any character followed by the literal #REXCHANGE. The segment then matches 0 or more of anything, including newlines.

Segment: timestamp,value\)\|\*\|$
Explanation: The literal timestamp,value)|*| and then end of line.
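
A hedged Python port of this pattern, for anyone wanting to probe how far the second lookahead can see. It assumes re.M is an adequate stand-in for the /m flag. Both test blocks are abbreviated, and the second is a hypothetical rearrangement with #REXCHANGE moved to the third line. Since the s flag is not set, . does not cross newlines, so each lookahead only inspects the rest of the line it starts on.

```python
import re

# Port of the PCRE pattern above; re.M plays the role of the /m flag.
pattern = re.compile(
    r"^(?=.*?#ZRXPVERSION).*\s*?(?=.*?#REXCHANGE)[\s\S]*?timestamp,value\)\|\*\|$",
    re.M,
)

# #REXCHANGE on the second line, as in the question's sample blocks.
rexchange_on_line_2 = (
    "#ZRXPVERSION2209.265|*|TZUTC-0|*|\n"
    "#REXCHANGE849056.RS.DayTotal|*|CUNITmm|*|\n"
    "#RINVAL-777|*|SANR115220|*|\n"
    "#LAYOUT(timestamp,value)|*|\n"
)

# Hypothetical variant: the same fields, with #REXCHANGE moved to the third line.
rexchange_on_line_3 = (
    "#ZRXPVERSION2209.265|*|TZUTC-0|*|\n"
    "#RINVAL-777|*|SANR115220|*|\n"
    "#REXCHANGE849056.RS.DayTotal|*|CUNITmm|*|\n"
    "#LAYOUT(timestamp,value)|*|\n"
)

found_on_line_2 = pattern.search(rexchange_on_line_2) is not None
found_on_line_3 = pattern.search(rexchange_on_line_3) is not None

print(found_on_line_2, found_on_line_3)
```

With Python's re, the block with #REXCHANGE on its second line is found and the rearranged one is not, which is worth keeping in mind if the identifier can appear on any line of the header.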