I need to replace all instances of a character (period in my case) in 1+ portions/segments/ranges of a string. I'm using Bash on Linux. Ideally the solution is in Bash, but if it's either not possible or terribly complex I can call any app commonly found on Linux (sed, Python, etc).
Example:
Starting String: "<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark> ..."
Needed transformation: Replace all periods "." between <mark> and </mark> with the string "<wbr />".
Desired Result: "<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>"
EDITS:
The starting string will never contain <mark> or </mark> within a set of them (i.e. the range markers are never nested).
I'm asking for help with some built-in Bash capability to perform this. The obvious mechanism is to find <mark> and </mark>, and then perform the substitution on the content between them. I know Bash can do offset finding (in an indirect way) and substitution. But can the substitution be performed on a subset of the string?
For the comments regarding parsing this as XML: I did not say this is XML so you should not assume it. Ultimately it's irrelevant to my question; the range markers can be anything.
Here's something I got working. It's not pure Bash, but it's simple.
while $(echo "${my_str}" | grep -E '<mark>[^.]*\.[^<]*</mark>' >/dev/null 2>&1) ; do
    my_str=$(echo "${my_str}" | sed -E -e 's,(<mark>[^.]*)\.([^<]*</mark>),\1<wbr />\2,g')
done
asked Mar 24 at 17:02, edited Mar 26 at 11:43, by codesniffer
5 Answers
Setup:
string='<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark>'
One bash solution:
regex='(<mark>[^<]*</mark>)'    # assumes no "<" between "<mark>" and "</mark>" tags
unset prev_string               # used to test for a change to 'string'

# while we have a match and a change has been made to 'string' ...
while [[ "${string}" =~ ${regex} && "${prev_string}" != "${string}" ]]
do
    # typeset -p BASH_REMATCH   # uncomment to see contents of the BASH_REMATCH[] array
    prev_string="${string}"
    # use nested parameter substitutions to make replacement
    string="${string/${BASH_REMATCH[1]}/${BASH_REMATCH[1]//\./<wbr \/>}}"
done
NOTE: "${prev_string}" != "${string}" was added as a quick hack to ensure we don't go into an infinite loop in the case where no modifications are made to string (e.g., no periods between the tags).
A variation on the above which adds a few cpu cycles while making the parameter substitutions easier to read and understand:
regex='(<mark>[^<]*</mark>)'
unset prev_string

while [[ "${string}" =~ ${regex} && "${prev_string}" != "${string}" ]]
do
    old="${BASH_REMATCH[1]}"            # copy the match; makes follow-on commands a bit cleaner
    new="${old//\./<wbr \/>}"           # replace all periods with "<wbr />"
    prev_string="${string}"
    string="${string/${old}/${new}}"    # update "string" by replacing "${old}" with "${new}"
done
These both generate:
$ typeset -p string
declare -- string="<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>"
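One limitation of both loops is worth noting: the regex always finds the leftmost pair that contains no "<", so if that pair has no periods the string is unchanged, the prev_string guard ends the loop, and any later pairs are left unprocessed. The following is a minimal sketch of a left-to-right variant that avoids this by consuming the string piece by piece with parameter expansion. The function name replace_in_marks and its local variable names are hypothetical, chosen only for this illustration; the sketch assumes the markers are never nested, as stated in the question.

```shell
#!/usr/bin/env bash
# Sketch only: replace_in_marks is a hypothetical helper name, not part of the answer.
# It walks the string left to right, so a pair without periods cannot stall it.
replace_in_marks() {
    local rest=$1 out='' before inner
    local open='<mark>' close='</mark>' repl='<wbr />'

    # Loop while an opening marker is still followed by a closing marker.
    while [[ $rest == *"$open"*"$close"* ]]; do
        before=${rest%%"$open"*}   # text ahead of the first opening marker
        rest=${rest#*"$open"}      # drop that text and the opening marker
        inner=${rest%%"$close"*}   # text up to the first closing marker
        rest=${rest#*"$close"}     # drop that text and the closing marker
        out+=$before$open${inner//./$repl}$close
    done

    printf '%s\n' "$out$rest"      # text after the last pair is left untouched
}

string='<mark>no periods</mark> blah. <mark>abc.def.ghi</mark> blah.'
string=$(replace_in_marks "$string")
printf '%s\n' "$string"
```

Because the markers are held in ordinary variables, the same approach should carry over to other range markers by changing the three assignments, provided the replacement text contains no "&" (which newer Bash versions may treat specially in pattern substitution).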
Feed Perl from stdin or append a file name:
perl -pe 's%(<mark>.*?</mark>)% $1 =~ s|\.|<wbr />|gr %eg'
Output:
<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>
Source: https://unix.stackexchange/a/152623/74329
This is probably very inefficient, but it uses only a single regex to search and replace, with no loop needed. I am no expert in shell scripts, so I will not provide one, but this should work inside a Perl call.
Try matching:
([^.]+|\G)\.(?=(?:(?!<mark>).)+<\/mark>)
and replacing with:
$1<wbr />
See: regex101
Explanation
MATCH:
- Match all . characters:
  - ( ... ): Capture to group 1 either
    - [^.]+: anything but a dot
    - |\G: or the end of the last match
  - \.: then match a dot
- Ensure the dot is inside <mark> ... </mark> tags:
  - (?= ... ): Look ahead and assert
    - (?: ... )+: that you match anything
      - (?!<mark>).: but it cannot be <mark>
    - <\/mark>: Find </mark>, ensuring that you must be inside the tag
REPLACE:
- $1: Keep the first group (everything before a dot, but inside tag)
- <wbr />: and replace the dots with <wbr />
Using any awk in any shell on all Unix boxes:
$ awk '
BEGIN {
    FS = OFS = "</mark>"
}
{
    for (i = 1; i <= NF; i++) {
        if ( match($i, /<mark>.*/) ) {
            tgt = substr($i, RSTART, RLENGTH)
            gsub(/\./, "<wbr />", tgt)
            $i = substr($i, 1, RSTART - 1) tgt
        }
    }
    print
}
' file
<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>
This Shellcheck-clean pure Bash code updates the value of the variable my_str:
tmp=$my_str
my_str=
while [[ $tmp =~ ^(.*)(\<mark\>.*\</mark\>)(.*)$ ]]; do
    tmp=${BASH_REMATCH[1]}
    my_str=${BASH_REMATCH[2]//./<wbr />}${BASH_REMATCH[3]}${my_str}
done
my_str=${tmp}${my_str}
- The code makes no assumptions about characters between <mark> and </mark>. (E.g. < is OK.)
- <mark>...</mark> substrings are processed right-to-left within the input string to work around the fact that matching of regular expressions in Bash is always greedy.
- See mkelement0's excellent answer to How do I use a regex in a shell script? for information about regular expressions in Bash.
- See Substituting part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of the expansion mechanism (${var//old/new}) used in ${BASH_REMATCH[2]//./<wbr />}.
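The greedy behaviour mentioned in the notes above can be seen with a short experiment. This is only an illustrative sketch: the sample string and the variable names (s, greedy_re, anchored_re, greedy, last) are made up for the demonstration, and the regexes are kept in variables so that "<" and ">" need no escaping inside [[ ]].

```shell
#!/usr/bin/env bash
s='<mark>a.b</mark> x. <mark>c.d</mark>'

greedy_re='<mark>.*</mark>'
anchored_re='^(.*)(<mark>.*</mark>)(.*)$'

# Unanchored: the match runs from the first <mark> to the last </mark>,
# swallowing the text between the two pairs.
[[ $s =~ $greedy_re ]] && greedy=${BASH_REMATCH[0]}

# Anchored with a leading (.*): that group takes as much as it can,
# which leaves only the last pair for the second group.
[[ $s =~ $anchored_re ]] && last=${BASH_REMATCH[2]}

printf 'greedy: %s\nlast:   %s\n' "$greedy" "$last"
```

This is why the loop peels pairs off the right-hand end: the leading (.*) isolates the final pair on each pass, and the remaining prefix is processed on the next one.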
tmp=$string; newstr=; while [[ $tmp == *'<mark>'*'</mark>'* ]]; do tmp2=${tmp#*<mark>*</mark>}; tmp3=${tmp%"$tmp2"}; tmp=$tmp2; tmp4=${tmp3%%<mark>*</mark>}; tmp5=${tmp3#"$tmp4"}; tmp5=${tmp5//./'<wbr />'}; tmp5="<begin>${tmp5#<mark>}"; tmp5="${tmp5%</mark>}</end>"; newstr+=$tmp4$tmp5; done; newstr+=$tmp; printf '%s\n' "$newstr" – pjh Commented Mar 24 at 19:25
Consider replacing while $(echo ... | grep ...); do with while grep -q -E '<mark>[^.]*\.[^<]*</mark>' <<< "${my_str}"; do to eliminate two subshell calls on each pass through the loop; the $(echo ... | sed ...) could be replaced with $(sed ... <<< "${my_str}") to eliminate another subshell, while this last subshell could be replaced with some creative parameter substitutions; though I'd look into how to compare ${my_str} to a regex and how that populates the BASH_REMATCH[] array, then the BASH_REMATCH[] results can be used to formulate the parameter substitution – markp-fuso Commented Mar 24 at 21:33
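Applied to the loop from the question, the first two suggestions in that comment might look like the sketch below. The regexes are copied unchanged from the question; only the plumbing differs (grep -q instead of redirecting output, and here-strings instead of echo pipelines). It assumes a grep and sed that accept -E.

```shell
#!/usr/bin/env bash
my_str='<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark>'

# grep -q only sets the exit status, so no command substitution or
# output redirection is needed for the loop test.
while grep -q -E '<mark>[^.]*\.[^<]*</mark>' <<< "${my_str}"; do
    # Each pass replaces one period per <mark>...</mark> pair.
    my_str=$(sed -E -e 's,(<mark>[^.]*)\.([^<]*</mark>),\1<wbr />\2,g' <<< "${my_str}")
done

printf '%s\n' "${my_str}"
```

The remaining $(sed ...) still starts one external process per pass; removing it entirely means moving to [[ ... =~ ... ]] and BASH_REMATCH, as the comment goes on to suggest.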