Here is a challenge for regex gurus. Need a very simple sed expression to select text between markers.
Here is an example text. Please mind it can contain any special chars, TABS and white spaces even though this example doesn't depict all possible combinations.
^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~
- Select text between first matched start of marker M1 to last matched end of marker M3. The text to select from example is
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
- Select text between last matched start of marker M1 to first matched end of marker M3. The text to select from example is
ccc[$cM2ddddM2eeeee
I tried this, but it selects from the last start marker to the last end marker:
echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"|sed -E "s|.*M1(.*)M3.*$|\1|g"
ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
How it is possible? single sed regex expression would be the best. What I mean by a single regex is one for each of the two requirements above, i.e. two regexes. I also need the equivalent Python re expression.
asked Mar 27 at 1:38 by Ramanan T
- The regular expression is probably the same for any regex engine. The key here is the greediness of the partial matches. – LMC Commented Mar 27 at 1:52
- Is it guaranteed that the wanted appearance of the start marker will appear before the wanted appearance of the end marker? – John Bollinger Commented Mar 27 at 3:15
- Yes, that's right: the wanted appearance of the start marker will appear before the wanted appearance of the end marker. – Ramanan T Commented Mar 27 at 3:43
9 Answers
If you don't mind a compound sed script, one approach for non-greedy matches is to change a delimiter to a string that does not occur in the input and then drive the match off of that replacement character.
While OP has stated the input could contain any special characters, I'm going to assume the input cannot contain the NUL character (0x00).
Tweaking OP's current sed script ...
For OP's longer match:
$ echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~" | sed -E "s|M1|\x00|1; s|.*\x00(.*)M3.*$|\1|g"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
Where:
- s|M1|\x00|1 - convert the first M1 to NUL
- s|.*\x00(.*)M3.*$|\1|g - search for the string between NUL and M3
- since we've removed the NUL there's no need to convert a NUL back to M1
For OP's shorter match:
$ echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~" | sed -E "s|M3|\x00|1; s|.*M1(.*)\x00.*$|\1|g"
ccc[$cM2ddddM2eeeee
Where:
- s|M3|\x00|1 - convert the first M3 to NUL
- s|.*M1(.*)\x00.*$|\1|g - search for the string between M1 and NUL
- since we've removed the NUL there's no need to convert a NUL back to M3
NOTE: I don't work with python but since these are (relatively) simple sed regexes I'm guessing they should be (relatively) easy to migrate to python ... ???
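As a rough answer to the question above, the same mark-then-match idea can be sketched in Python. This is only an illustration, not the answerer's code: the helper names longer_match and shorter_match are made up here, and the sketch keeps the assumption that the input never contains NUL.

```python
import re

SENTINEL = "\x00"  # assumption carried over from the sed version: NUL never occurs in the input

def longer_match(text):
    """First M1 .. last M3: turn the first M1 into NUL, then match greedily."""
    if SENTINEL in text:
        raise ValueError("input already contains the sentinel character")
    marked = text.replace("M1", SENTINEL, 1)      # like: s|M1|\x00|1
    m = re.search(r"\x00(.*)M3", marked)          # greedy (.*) runs to the last M3
    return m.group(1) if m else None

def shorter_match(text):
    """Last M1 .. first M3: turn the first M3 into NUL, then match greedily."""
    if SENTINEL in text:
        raise ValueError("input already contains the sentinel character")
    marked = text.replace("M3", SENTINEL, 1)      # like: s|M3|\x00|1
    m = re.search(r".*M1(.*)\x00", marked)        # greedy .* skips to the last M1
    return m.group(1) if m else None

text = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
print(longer_match(text))
print(shorter_match(text))
```

Like sed, this works on one line at a time, because . does not match a newline unless re.DOTALL is passed.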
The second case is easy, even with sed:
$ a='^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~'
$ sed -E 's/.*M1|M3.*//g' <<< "$a"
ccc[$cM2ddddM2eeeee
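The first case is more complex because of the greediness of sed regexes. If you can use python or perl, instead of sed, you can harness their non-greedy .*? operator. The head is anchored with ^ so that only the text up to the first M1 is removed (the substitution is global, so an unanchored .*?M1 would also eat everything up to the second M1), and the tail only matches an M3 that is followed by no further M3, that is, the last one:
$ python -c 'import sys,re; print("\n".join(re.sub(r"^.*?M1|M3(?:(?!M3).)*$","",l) for l in sys.stdin),end="")' <<< "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
$ perl -pe 's/^.*?M1|M3(?:(?!M3).)*$//g' <<< "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh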
A bit shorter with python if you have only one line of text to process and if we pass it as an argument:
$ python -c 'import sys,re; print(re.sub(r"^.*?M1|M3(?:(?!M3).)*$","",sys.argv[1]))' "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
With sed, one possibility consists in first inserting separator characters that do not appear in the input string, for instance newlines, and then keeping only what appears between them. If your sed supports \n for newline in the replacement string of the substitute command:
$ sed -E 's/M1(.*)M3/\n\1\n/;s/.*\n(.*)\n.*/\1/' <<< "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
Else, with any sed:
$ sed -E 's/M1(.*)M3/\
\1\
/;s/.*\n(.*)\n.*/\1/' <<< "$a"
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
Note: as your shell is bash, if you absolutely want a one-liner you can use a $'...' (ANSI-C quoting) string:
$ sed -E $'s/M1(.*)M3/\\\n\\1\\\n/;s/.*\\n(.*)\\n.*/\\1/' <<< "$a"
How it is possible? single sed regex expression would be the best.
Presumably you mean you would prefer a single s command, as that's what you present in your example. A regex is one part of an s command, but the whole command is more than just a regex.
Unlike some other regex dialects, POSIX regular expressions (the kind used by sed) have no non-greedy, arbitrary-count quantifiers. That makes it a bit tricky to satisfy a requirement to match up to the first appearance of a multi-character substring when there may be more than one appearance, but it can be done. The trick revolves around explicitly matching leading partial matches to the marker. For example, this POSIX extended regular expression will match everything up to and including the first appearance of substring "M1" in the input, or it will not match if "M1" does not appear at all:
^[^M]*(M([^1M][^M]*)?)*M1
Breaking that down,
- ^[^M]* matches a leading sequence of any number (including zero) of characters other than M
- (M([^1M][^M]*)?)* matches any number of subsequences, each starting with an M and optionally continuing with ([^1M][^M]*)?, that is:
  - [^1M] a character that is neither 1 nor M
  - [^M]* and then any number of characters other than M
  This provides for Ms that are not part of an M1, including runs of Ms of any length, including such runs immediately preceding (but not including) the first M1.
- M1 matches itself
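One way to convince yourself of the breakdown above is to run the pattern against a few awkward inputs. The sketch below uses Python's re module, which accepts the same syntax for this particular expression; the helper name end_of_first_m1 and the sample strings are invented for the check and are not part of the answer.

```python
import re

# The extended regular expression from above, unchanged.
UP_TO_FIRST_M1 = re.compile(r"^[^M]*(M([^1M][^M]*)?)*M1")

def end_of_first_m1(text):
    """Return the index just past the matched prefix, or None if there is no match."""
    m = UP_TO_FIRST_M1.match(text)
    return m.end() if m else None

samples = [
    "aM1bb bbbM1ccc",   # two markers: only the first should be consumed
    "MMM1xM1",          # a run of Ms directly before the first M1
    "M2M3MxM1tail",     # Ms that begin other two-character tokens
    "no marker here",   # no M1 at all: the pattern must not match
]

for s in samples:
    # Independent reference: where the first M1 ends according to str.find.
    expected = s.find("M1") + 2 if "M1" in s else None
    print(repr(s), end_of_first_m1(s), expected)
```

For each sample the two printed numbers should agree, which is what "matches up to and including the first M1" means in practice.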
That's not too bad for a marker consisting of just two characters, especially when they are different from each other, but it gets nasty very quickly as the number of characters in the marker grows. Fortunately for you, your two problems feature only two-character markers.
That approach can help you get a single s command for each question, depending on to which end of the pattern you apply it. They will be a bit complex, but not outrageously so. I decline to do your homework for you in toto, however, so I leave the details as an exercise.
If you can use sed then you can use awk, the other mandatory POSIX text processing tool, so here's a simple solution using any awk:
$ awk 'match($0,/M1.*M3/) { print substr($0,RSTART+2,RLENGTH-4) }' file
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
$ awk 'sub(/.*M1/,"") && sub(/M3.*/,"")' file
ccc[$cM2ddddM2eeeee
If you really wanted to use sed for some reason, then using any sed that allows \n to mean newline in the regexp and replacement, e.g. GNU sed:
$ sed 's/M1\(.*\)M3.*/\n\1/; s/[^\n]*\n//' file
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
$ sed 's/.*M1//; s/M3.*//' file
ccc[$cM2ddddM2eeeee
I expect you can do the same as any of the above in python.
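For what it's worth, a direct Python transcription of the two awk commands could look like the sketch below. The function names are placeholders chosen for this example, and returning None when a marker is missing is an assumption made to mirror awk printing nothing in that case.

```python
import re

def awk_first_to_last(line):
    # awk: match($0, /M1.*M3/) { print substr($0, RSTART+2, RLENGTH-4) }
    m = re.search(r"M1.*M3", line)
    if m is None:
        return None
    # Slice off the two-character markers at both ends of the match.
    return line[m.start() + 2 : m.end() - 2]

def awk_last_to_first(line):
    # awk: sub(/.*M1/, "") && sub(/M3.*/, "")
    head_removed, n1 = re.subn(r".*M1", "", line, count=1)
    if n1 == 0:
        return None
    both_removed, n2 = re.subn(r"M3.*", "", head_removed, count=1)
    return both_removed if n2 else None

line = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
print(awk_first_to_last(line))
print(awk_last_to_first(line))
```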
You can do what you want easily and efficiently with any POSIX-compliant shell, including Bash. No external utilities (sed, perl, python, awk) are required. When run with any non-prehistoric sh this code
string='^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~'
first_to_last=${string#*M1}
first_to_last=${first_to_last%M3*}
last_to_first=${string##*M1}
last_to_first=${last_to_first%%M3*}
printf '%s\n' "$first_to_last"
printf '%s\n' "$last_to_first"
outputs
bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
ccc[$cM2ddddM2eeeee
- See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for explanations of ${string#*M1} etc.
- Also see What is meant by "Now you have two problems"?.
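Since the question also asks for Python, the same regex-free idea maps fairly naturally onto str.partition and str.rpartition. The sketch below is only an approximation of the shell expansions: the function names are invented here, and it assumes both markers are present, because the results differ from the shell's when a marker is missing.

```python
def expand_first_to_last(s, start="M1", end="M3"):
    # ${s#*M1}: drop the shortest prefix that ends with the start marker
    _, _, rest = s.partition(start)
    # ${rest%M3*}: drop the shortest suffix that begins with the end marker
    body, _, _ = rest.rpartition(end)
    return body

def expand_last_to_first(s, start="M1", end="M3"):
    # ${s##*M1}: drop the longest prefix that ends with the start marker
    _, _, rest = s.rpartition(start)
    # ${rest%%M3*}: drop the longest suffix that begins with the end marker
    body, _, _ = rest.partition(end)
    return body

string = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
print(expand_first_to_last(string))
print(expand_last_to_first(string))
```

Because no regular expression is involved, the markers may contain any characters without escaping.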
I am posting the solution that I am currently using. Thanks to @markp-fuso and @renaud-pacalet for the ideas input. I am able to solve it.
Python script
import re
text = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
# Extract the text between the first M1 and the last 'M3'
result = re.search(r'M1(.*)M3.*$', text).group(1)
print(result)
# Extract the text between the last M1 and the first 'M3'
result = re.sub(r'.*M1|M3.*','',text)
print(result)
The sed way of extracting is:
# Extract the text between the first M1 and the last 'M3'
echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff f:M3ggggggg M3:hhhhh hhM3:kkkkk~" | sed -E "s/M1(.*)M3.*$/\n\1/;s/.*\n(.*)/\1/"
# Extract the text between the last M1 and the first 'M3'
echo "^[[200~a^[[200~aaa aM1bb bbbM1ccc[\$cM2ddddM2eeeeeM3ffffff f:M3ggggggg M3:hhhhh hhM3:kkkkk~" | sed -E "s/.*M1|M3.*//g"
My patterns will not work with sed, but they will with perl. Since I do not have access to it, I cannot provide example code here, but I do have a python example.
import re
long_pattern=re.compile(r"(?<=M1).+(?=M3)")
short_pattern=re.compile(r"(?<=M1)(?:(?!M1|M3).)+(?=M3)")
s=r"^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
print(
long_pattern.findall(s),
short_pattern.findall(s)
)
Long Pattern:
- (?<=M1) : Assert that "M1" is immediately to your left
- .+ : and match as much as possible
- (?=M3) : but ensure the end is followed by "M3"
Short Pattern:
- (?<=M1) : Assert that "M1" is immediately to your left
- (?: ... )+ : and match as much as possible
- (?!M1|M3). : while making sure you never step over either "M1" or "M3"
- (?=M3) : but ensure the end is followed by "M3"
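Because the question says the text and markers may contain special characters, here is a small sketch of how the two patterns above might be built for arbitrary marker strings. The function build_patterns and the "[$" / "$]" markers are made up for illustration; re.escape is what keeps regex metacharacters in a marker from being interpreted.

```python
import re

def build_patterns(start, end):
    """Return (long_pattern, short_pattern) for the given literal markers."""
    s, e = re.escape(start), re.escape(end)
    long_pattern = re.compile(f"(?<={s}).+(?={e})")
    short_pattern = re.compile(f"(?<={s})(?:(?!{s}|{e}).)+(?={e})")
    return long_pattern, short_pattern

# Same markers as in the question.
text = "^[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
long_p, short_p = build_patterns("M1", "M3")
print(long_p.findall(text), short_p.findall(text))

# Hypothetical markers made of regex metacharacters.
long_q, short_q = build_patterns("[$", "$]")
print(long_q.findall("x[$a[$b$]c$]y"), short_q.findall("x[$a[$b$]c$]y"))
```

Note that the short pattern returns every innermost start/end pair it finds, so on input with several such pairs findall yields more than one result.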
In python this is much easier, since it has richer regex operations such as non-greedy quantifiers.
import re
txt="[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"
print( "longest : " + re.search(r'M1(.*)M3', txt).group(1) )
print( "shortest : " + re.search(r'.*M1(.*?)M3', txt).group(1) )
Executed:
$: python file.py
longest : bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
shortest : ccc[$cM2ddddM2eeeee
Using sed is a little trickier, since it doesn't have as many tools.
You say
can contain any special chars
If that's true, I'm not sure a sed regex can help you with absolute certainty without relying on other factors such as fixed-width fields.
If there is some character you know won't be in the text, such as the newline or null used in other solutions, then those are good.
Otherwise you're going to have to rely on statistics. If you make a specific series of characters as your replacement marker, the length and content of it determines the odds of it happening accidentally in your input stream. A sufficiently unlikely combination is almost as good as an impossibility - you have to decide what constitutes "sufficient".
This still isn't one regex, but in sed it won't matter functionally.
$: cat file
txt='[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~'
M=" SPECIFIC and UNLIKELY Marker that does NOT have suspicious metacharacters "
printf "longest : "
sed -E "s/M1/$M/; s/^.*$M(.*)M3.*$/\\1/" <<< "$txt"
printf "shortest : "
sed -E 's/.*M1//; s/M3.*//;' <<< "$txt"
$: ./file
longest : bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hh
shortest : ccc[$cM2ddddM2eeeee
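If the whole text can be inspected before it is processed, the "sufficiently unlikely" marker can be swapped for one that is checked to be absent. The following is a minimal Python sketch of that idea, not part of the script above; pick_sentinel is a hypothetical helper, and limiting the search to code points below U+D800 is an arbitrary choice to stay clear of surrogates.

```python
def pick_sentinel(text):
    """Return the first character below U+D800 that does not occur in text."""
    present = set(text)
    for code_point in range(0xD800):
        candidate = chr(code_point)
        if candidate not in present:
            return candidate
    raise ValueError("no unused character available")

txt = "[[200~a^[[200~aaa aM1bb bbbM1ccc[$cM2ddddM2eeeeeM3ffffff fM3ggggggg M3hhhhh hhM3kkkkk~"

sentinel = pick_sentinel(txt)
# Same two steps as the sed script: mark the first M1, then cut at the last M3.
marked = txt.replace("M1", sentinel, 1)
longest = marked.split(sentinel, 1)[1].rsplit("M3", 1)[0]
print("longest : " + longest)
```

This trades the statistical argument for a guarantee, at the cost of reading the input once before processing it.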
With Raku/Sparrow you can do this by zooming in:
# try to find within outer capture
within: "M1" (.*) "M3"
# then if succeeded
# zoom in into the capture:
regexp: "M1" (.*?) "M3"
end:
# dump captured data
code: <<RAKU
!raku
say capture()[0] if capture();
RAKU
It all works because regexes are greedy by default, and Sparrow's zoom-in feature lets you start from the outermost M1 .. M3 capture and then zoom in to the innermost M1 .. M3 capture group.