python 3.x - linux sed expression to select text between markers - Stack Overflow

Here is a challenge for regex gurus. Need a very simple sed expression to select text between markers.

Here is an example text. Please note that it can contain any special characters, tabs, and whitespace, even though this example doesn't show every possible combination.

^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~
  1. Select the text between the first occurrence of the start marker M1 and the last occurrence of the end marker M3. For the example, the text to select is

bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh

  2. Select the text between the last occurrence of the start marker M1 and the first occurrence of the end marker M3. For the example, the text to select is

ccc[$cM2ddddM2eeeee

I tried this, but it selects from the last start marker to the last end marker:

echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"|sed -E "s|.*M1(.*)M3.*$|\1|g"

ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh

How is this possible? A single sed regex would be best. By a single regex I mean one for each of the two requirements above, i.e. two regexes in total. I also need the equivalent Python re expressions.

asked Mar 27 at 1:38 by Ramanan T
  • The regular expression is probably the same for any regex engine. The key here is the greediness of the partial matches. – LMC, Mar 27 at 1:52
  • Is it guaranteed that the wanted occurrence of the start marker will appear before the wanted occurrence of the end marker? – John Bollinger, Mar 27 at 3:15
  • Yes, that's right: the wanted occurrence of the start marker will appear before the wanted occurrence of the end marker. – Ramanan T, Mar 27 at 3:43

9 Answers

If you don't mind a compound sed script ... one approach for non-greedy matches is to change a delimiter to a string that does not occur in the input and then drive the match off of that replacement character.

While OP has stated the input could contain any special characters I'm going to assume the input cannot contain the NUL character (0x00).

Tweaking OP's current sed script ...

For OP's longer match:

$ echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~" | sed -E "s|M1|\x00|1; s|.*\x00(.*)M3.*$|\1|g"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh

Where:

  • s|M1|\x00|1 - convert the first M1 to NUL
  • s|.*\x00(.*)M3.*$|\1|g - search for string between NUL and M3
  • since we've removed the NUL there's no need to convert a NUL back to M1

For OP's shorter match:

$ echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~" | sed -E "s|M3|\x00|1; s|.*M1(.*)\x00.*$|\1|g"
ccc[$cM2ddddM2eeeee

Where:

  • s|M3|\x00|1 - convert the first M3 to NUL
  • s|.*M1(.*)\x00.*$|\1|g - search for string between M1 and NUL
  • since we've removed the NUL there's no need to convert a NUL back to M3

NOTE: I don't work with Python, but since these are relatively simple sed regexes, I'm guessing they should be fairly easy to port to Python.
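A rough Python port of the same sentinel idea might look like the sketch below. It keeps the answer's assumption that NUL never occurs in the input; the function names are made up for illustration.

```python
import re

text = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
NUL = "\x00"  # assumption: this character never appears in the input


def first_m1_to_last_m3(s):
    # Like s|M1|\x00|1: turn only the first M1 into NUL ...
    marked = s.replace("M1", NUL, 1)
    # ... then capture greedily from the NUL up to the last M3.
    return re.sub(r".*\x00(.*)M3.*$", r"\1", marked)


def last_m1_to_first_m3(s):
    # Like s|M3|\x00|1: turn only the first M3 into NUL ...
    marked = s.replace("M3", NUL, 1)
    # ... then the greedy leading .* skips ahead to the last M1 before it.
    return re.sub(r".*M1(.*)\x00.*$", r"\1", marked)


print(first_m1_to_last_m3(text))
print(last_m1_to_first_m3(text))
```

As in the sed version, input without the markers is not handled specially here.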

The second case is easy, even with sed:

$ a='^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~'
$ sed -E 's/.*M1|M3.*//g' <<< "$a"
ccc[$cM2ddddM2eeeee

The first case is more complex because sed regexes are greedy. If you can use python or perl instead of sed, you can harness their non-greedy .*? quantifier together with a negative lookahead. The leading ^ keeps the first alternative from also deleting later M1 occurrences, and M3(?!.*M3) matches only the last M3:

$ python -c 'import sys,re; print("".join(re.sub(r"^.*?M1|M3(?!.*M3).*$","",l) for l in sys.stdin),end="")' <<< "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
$ perl -pe 's/^.*?M1|M3(?!.*M3).*$//g' <<< "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh

A bit shorter with python if you have only one line of text to process and if we pass it as an argument:

$ python -c 'import sys,re; print(re.sub(r"^.*?M1|M3(?!.*M3).*$","",sys.argv[1]))' "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh

With sed, one possibility is to first insert separator characters that do not appear in the input string, for instance newlines, and then keep only what appears between them. If your sed supports \n for newline in the replacement string of the substitute command:

$ sed -E 's/M1(.*)M3/\n\1\n/;s/.*\n(.*)\n.*/\1/' <<< "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh

Else, with any sed:

$ sed -E 's/M1(.*)M3/\
\1\
/;s/.*\n(.*)\n.*/\1/' <<< "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh

Note: as your shell is bash, if you absolutely want a one-liner you can use a $'...' (ANSI-C quoted) string:

$ sed -E $'s/M1(.*)M3/\\\n\\1\\\n/;s/.*\\n(.*)\\n.*/\\1/' <<< "$a"

"How it is possible? single sed regex expression would be the best."

Presumably you mean you would prefer a single s command, as that's what you present in your example. A regex is one part of an s command, but the whole command is more than just a regex.

Unlike some other regex dialects, POSIX regular expressions (the kind used by sed) have no non-greedy, arbitrary-count quantifiers. That makes it a bit tricky to satisfy a requirement to match up to the first appearance of a multi-character substring when there may be more than one appearance, but it can be done. The trick revolves around explicitly matching leading partial matches to the marker. For example, this POSIX extended regular expression will match everything up to and including the first appearance of substring "M1" in the input, or it will not match if "M1" does not appear at all:

    ^[^M]*(M([^1M][^M]*)?)*M1

Breaking that down,

  • ^[^M]* matches a leading sequence of any number (including zero) of characters other than M
  • (M([^1M][^M]*)?)* matches any number of subsequences, each starting with an M and
    • ([^1M][^M]*)? optionally continuing with

      • [^1M] a character that is neither 1 nor M
      • [^M]* and then any number of characters other than M

      This provides for Ms that are not part of an M1, including runs of Ms of any length, including such runs immediately preceding (but not including) the first M1.

  • M1 matches itself

That's not too bad for a marker consisting of just two characters, especially when they are different from each other, but it gets nasty very quickly as the number of characters in the marker grows. Fortunately for you, your two problems feature only two-character markers.

That approach can help you get a single s command for each question, depending on to which end of the pattern you apply it. They will be a bit complex, but not outrageously so. I decline to do your homework for you in toto, however, so I leave the details as an exercise.
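To check the prefix pattern above without a sed at hand, it can be tried in Python, whose re module accepts the same expression unchanged since it uses no lazy quantifiers. This is only a sketch that tests the "up to the first M1" building block; the helper name and sample strings are invented here, and the full s commands remain the exercise the answer describes.

```python
import re

# The ERE from the text, used as-is (no non-greedy quantifiers involved).
UP_TO_FIRST_M1 = re.compile(r"^[^M]*(M([^1M][^M]*)?)*M1")


def end_of_first_m1(s):
    """Return the index just past the first "M1", or -1 if there is none."""
    m = UP_TO_FIRST_M1.match(s)
    return m.end() if m else -1


# Hypothetical samples: plain case, a run of Ms, an M that is not part of M1, no marker.
for sample in ["aM1bb bbbM1ccc", "MMM1x", "M2M1M1", "no marker M"]:
    print(repr(sample), end_of_first_m1(sample))
```

In each sample the match ends right after the first M1 even though later M1 occurrences exist, which is the property a greedy .*M1 cannot give.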

If you can use sed then you can use awk, the other mandatory POSIX text processing tool, so here's a simple solution using any awk:

$ awk 'match($0,/M1.*M3/) { print substr($0,RSTART+2,RLENGTH-4) }' file
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
$ awk 'sub(/.*M1/,"") && sub(/M3.*/,"")' file
ccc[$cM2ddddM2eeeee

If you really wanted to use sed for some reason, then using any sed that allows \n to mean newline in the regexp and replacement, e.g. GNU sed:

$ sed 's/M1\(.*\)M3.*/\n\1/; s/[^\n]*\n//' file
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
$ sed 's/.*M1//; s/M3.*//' file
ccc[$cM2ddddM2eeeee

I expect you can do the same as any of the above in python.
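For what it's worth, a Python rendering of the two awk commands could look roughly like this. It is a sketch for a single line of input; the variable names are illustrative and None stands in for awk printing nothing.

```python
import re

line = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"

# Counterpart of match()/substr(): the greedy match spans first M1 .. last M3,
# so dropping two characters at each end leaves the text in between.
m = re.search(r"M1.*M3", line)
first_to_last = line[m.start() + 2 : m.end() - 2] if m else None

# Counterpart of the two sub() calls: strip through the last M1,
# then strip from the first remaining M3 to the end.
tmp, n1 = re.subn(r".*M1", "", line, count=1)
last_to_first, n2 = re.subn(r"M3.*", "", tmp, count=1)
if not (n1 and n2):
    last_to_first = None  # mirrors awk printing nothing when a sub() fails

print(first_to_last)
print(last_to_first)
```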

You can do what you want easily and efficiently with any POSIX-compliant shell, including Bash. No external utilities (sed, perl, python, awk) are required. When run with any non-prehistoric sh this code

string='^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~'

first_to_last=${string#*M1}
first_to_last=${first_to_last%M3*}

last_to_first=${string##*M1}
last_to_first=${last_to_first%%M3*}

printf '%s\n' "$first_to_last"
printf '%s\n' "$last_to_first"

outputs

bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
ccc[$cM2ddddM2eeeee
  • See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for explanations of ${string#*M1} etc.
  • Also see What is meant by "Now you have two problems"?.
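Since the question also asks for Python, the same regex-free idea can be sketched there with str.partition and str.rpartition, which split at the first and last occurrence of a separator. The function names and default arguments below are illustrative only.

```python
def first_to_last(s, start="M1", end="M3"):
    # Like ${s#*M1}: drop everything through the first start marker.
    _, _, rest = s.partition(start)
    # Like ${rest%M3*}: drop everything from the last end marker on.
    body, _, _ = rest.rpartition(end)
    return body


def last_to_first(s, start="M1", end="M3"):
    # Like ${s##*M1}: drop everything through the last start marker.
    _, _, rest = s.rpartition(start)
    # Like ${rest%%M3*}: drop everything from the first end marker on.
    body, _, _ = rest.partition(end)
    return body


string = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
print(first_to_last(string))
print(last_to_first(string))
```

One difference from the shell version: when a marker is absent, partition and rpartition do not leave the string unchanged the way the parameter expansions do, so that case would need its own check.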

I am posting the solution that I am currently using. Thanks to @markp-fuso and @renaud-pacalet for their ideas, I was able to solve it.

Python script

import re
text = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"

# Extract the text between the first 'M1' and the last 'M3'
first_to_last = re.search(r'M1(.*)M3.*$', text).group(1)
print(first_to_last)

# Extract the text between the last 'M1' and the first 'M3'
last_to_first = re.sub(r'.*M1|M3.*', '', text)
print(last_to_first)

The sed way of extracting is:

# Extract the text between the first M1 and the last 'M3'
echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff f:M3ggggggg M3:hhhhh hhM3:kkkkk~" | sed -E "s/M1(.*)M3.*$/\n\1/;s/.*\n(.*)/\1/"

# Extract the text between the last M1 and the first 'M3'
echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff f:M3ggggggg M3:hhhhh hhM3:kkkkk~" | sed -E "s/.*M1|M3.*//g"

My patterns will not work with sed, because they rely on lookarounds, but they should work with perl. Since I do not have access to it, I cannot provide example code here, but I do have a Python example.

import re

long_pattern=re.compile(r"(?<=M1).+(?=M3)")
short_pattern=re.compile(r"(?<=M1)(?:(?!M1|M3).)+(?=M3)")

s=r"^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"

print(
    long_pattern.findall(s),
    short_pattern.findall(s)
)

Long Pattern:

  • (?<=M1): Assert that "M1" is immediately to the left of the current position
  • .+: and match as much as possible
  • (?=M3): but ensure the end is followed by "M3"

Short Pattern:

  • (?<=M1): Assert that "M1" is immediately to the left of the current position
  • (?: ... )+: and match as much as possible
  • (?!M1|M3).: while making sure you never step over either "M1" or "M3"
  • (?=M3): but ensure the end is followed by "M3"
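One consequence of the short pattern is worth seeing on a different input: because it refuses to step over either marker, findall returns every innermost M1..M3 pair and skips any M1 that is followed by another M1 first. The sample string below is made up for illustration.

```python
import re

short_pattern = re.compile(r"(?<=M1)(?:(?!M1|M3).)+(?=M3)")

# Hypothetical input: one complete pair, then an M1 that is "restarted" by another M1.
sample = "xM1aM3yM1bM1cM3z"

matches = short_pattern.findall(sample)
print(matches)  # → ['a', 'c']
```

The "b" segment is not reported because the text after it is M1, not M3, so the trailing lookahead fails there.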

In Python this is much easier, since its regex engine is more capable and supports non-greedy quantifiers.

import re
txt="[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
print( "longest  : " + re.search(r'M1(.*)M3',    txt).group(1) )
print( "shortest : " + re.search(r'.*M1(.*?)M3', txt).group(1) )

Executed:

$: python file.py
longest  : bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
shortest : ccc[$cM2ddddM2eeeee

Using sed is a little trickier, since it doesn't have as many tools. You say

"can contain any special chars"

If that's true, I'm not sure a sed regex can help you with absolute certainty without relying on other factors such as fixed-width fields.

If there is some character you know won't be in the text, such as the newline or null used in other solutions, then those are good.

Otherwise you're going to have to rely on statistics. If you make a specific series of characters as your replacement marker, the length and content of it determines the odds of it happening accidentally in your input stream. A sufficiently unlikely combination is almost as good as an impossibility - you have to decide what constitutes "sufficient".
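In a language like Python you do not have to rely on the odds alone, because a candidate marker can be checked against the input before it is used. The sketch below is one way to do that; pick_sentinel and the bracketed hex format are inventions for this example, not part of any library.

```python
import re
import secrets


def pick_sentinel(text, nbytes=8):
    """Return a random token that is verified not to occur in text."""
    while True:
        # Angle brackets appear only at the ends, so the token cannot
        # partially overlap a copy of itself once it is inserted.
        candidate = "<" + secrets.token_hex(nbytes) + ">"
        if candidate not in text:
            return candidate


def first_m1_to_last_m3(text):
    sentinel = pick_sentinel(text)
    # Replace only the first M1, then match greedily up to the last M3.
    marked = text.replace("M1", sentinel, 1)
    m = re.search(re.escape(sentinel) + r"(.*)M3", marked)
    return m.group(1) if m else None


txt = "[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
print(first_m1_to_last_m3(txt))
```

The retry loop almost never runs more than once; it is there so that the "sufficiently unlikely" judgment becomes an explicit check.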

This still isn't one regex, but in sed it won't matter functionally.

$: cat file

txt='[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~'
M=" SPECIFIC and UNLIKELY Marker that does NOT have suspicious metacharacters "
printf  "longest  : "
sed -E "s/M1/$M/; s/^.*$M(.*)M3.*$/\\1/" <<< "$txt"

printf "shortest : "
sed -E 's/.*M1//; s/M3.*//;' <<< "$txt"

$: ./file
longest  : bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
shortest : ccc[$cM2ddddM2eeeee

With Raku/Sparrow you can do it like this by zooming in:

# try to find within outer capture
within: "M1" (.*) "M3"
  # then if succeeded
  # zoom in into the capture:
  regexp: "M1" (.*?) "M3"
end:

# dump captured data
code: <<RAKU
!raku
say capture()[0] if capture();
RAKU

It all works because regexes are greedy by default, and Sparrow's zoom-in feature lets you start from the outermost M1 .. M3 capture and then zoom in to the innermost M1 .. M3 capture.
