linux - Replace all instances of character in portion of string in bash

I need to replace all instances of a character (period in my case) in 1+ portions/segments/ranges of a string. I'm using Bash on Linux. Ideally the solution is in Bash, but if it's either not possible or terribly complex I can call any app commonly found on Linux (sed, Python, etc).

Example:

Starting String: "<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark> ..."

Needed transformation: Replace all periods "." between <mark> and </mark> with the string "<wbr />".

Desired Result: "<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>"

EDITS:

The starting string will never contain <mark> or </mark> within a set of them (i.e. the range markers are never nested).

I'm asking for help with some built-in Bash capability to perform this. The obvious mechanism is to find <mark> and </mark>, and then perform substitution in the content between them. I know Bash can do offset finding (in an indirect way), and substitution. But can substitution be performed on a subset of the string?

For the comments regarding parsing this as XML: I did not say this is XML so you should not assume it. Ultimately it's irrelevant to my question; the range markers can be anything.

Here's something I got working. It's not pure Bash, but it's simple.

while $(echo "${my_str}" | grep -E '<mark>[^.]*\.[^<]*</mark>' >/dev/null 2>&1) ; do
    my_str=$(echo "${my_str}" | sed -E -e 's,(<mark>[^.]*)\.([^<]*</mark>),\1<wbr />\2,g')
done

asked Mar 24 at 17:02, edited Mar 26 at 11:43, by codesniffer

2 Do not parse XML with regex. Use an XML parser. – Léa Gris Commented Mar 24 at 18:26
2 Post valid XML in your question. – Cyrus Commented Mar 24 at 18:34
1 This quick hack (which absolutely will not work for general XML strings) may help to get you started on a pure Bash solution: tmp=$string; newstr=; while [[ $tmp == *''*''* ]]; do tmp2=${tmp#**}; tmp3=${tmp%"$tmp2"}; tmp=$tmp2; tmp4=${tmp3%%*}; tmp5=${tmp3#"$tmp4"}; tmp5=${tmp5//./''}; tmp5="<begin>${tmp5#}"; tmp5="${tmp5%}</end>"; newstr+=$tmp4$tmp5; done; newstr+=$tmp; printf '%s\n' "$newstr" – pjh Commented Mar 24 at 19:25
@Shawn - good observation! I changed the tags mid-edit and missed some. I've corrected the Desired Result. – codesniffer Commented Mar 24 at 20:33
1 You could start by replacing the while $(echo ... | grep ...); do with while grep -q -E '<mark>[^.]*\.[^<]*</mark>' <<< "${my_str}"; do to eliminate two subshell calls on each pass through the loop. The $(echo ... | sed ...) could be replaced with $(sed ... <<< "${my_str}") to eliminate another subshell, and this last subshell could be replaced with some creative parameter substitutions. I'd look into how to compare ${my_str} to a regex and how that populates the BASH_REMATCH[] array; the BASH_REMATCH[] results can then be used to formulate the parameter substitution. – markp-fuso Commented Mar 24 at 21:33
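
The first half of that suggestion might look like the sketch below. It assumes the <mark> markers and the <wbr /> replacement that the answers use; the variable name pattern is introduced here only for illustration. grep -q and here-strings replace the echo pipelines, while sed still performs the substitution, one period per segment on each pass.

```bash
#!/usr/bin/env bash
# Sketch only: same loop as in the question, with here-strings instead of echo pipelines.
my_str='<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark>'

# "pattern" is a hypothetical helper variable: a <mark> segment that still holds a period.
pattern='<mark>[^.]*\.[^<]*</mark>'

# grep -q only sets the exit status, so no output needs to be discarded.
while grep -q -E "$pattern" <<< "$my_str"; do
    # Each pass replaces the first remaining period in every matching segment.
    my_str=$(sed -E -e 's,(<mark>[^.]*)\.([^<]*</mark>),\1<wbr />\2,g' <<< "$my_str")
done

printf '%s\n' "$my_str"
```

The remaining command substitution around sed is the part the comment proposes to replace with parameter expansion and BASH_REMATCH.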

5 Answers

Setup:

string='<mark>foo.bar.baz</mark> blah. blah. blah. <mark>abc.def.ghi</mark>'

One bash solution:

regex='(<mark>[^<]*</mark>)'           # assumes no "<" between "<mark>" and "</mark>" tags
unset prev_string                      # used to test for a change to 'string'

# while we have a match and a change has been made to 'string' ...

while [[ "${string}" =~ ${regex} && "${prev_string}" != "${string}" ]]
do
    # typeset -p BASH_REMATCH          # uncomment to see contents of the BASH_REMATCH[] array

    prev_string="${string}"

    # use nested parameter substitutions to make replacement

    string="${string/${BASH_REMATCH[1]}/${BASH_REMATCH[1]//\./<wbr \/>}}"
done

NOTE: the "${prev_string}" != "${string}" test was added as a quick hack to ensure we don't go into an infinite loop when a pass makes no modification to string (e.g., no periods between the tags).

A variation on the above that costs a few extra CPU cycles but makes the parameter substitutions easier to read and understand:

regex='(<mark>[^<]*</mark>)'
unset prev_string

while [[ "${string}" =~ ${regex} && "${prev_string}" != "${string}" ]]
do
    old="${BASH_REMATCH[1]}"           # copy the match; makes follow-on commands a bit cleaner
    new="${old//\./<wbr \/>}"          # replace all periods with "<wbr />"

    prev_string="${string}"
    string="${string/${old}/${new}}"   # update "string" by replacing "${old}" with "${new}"
done

These both generate:

$ typeset -p string
declare -- string="<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>"
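
Because the change test ends the loop as soon as a matched segment contains no periods, any later segments would be left untouched. The following is a sketch of one possible variant, not the code above: it walks the string from left to right and moves each processed segment into an output variable, so no change test is needed. The function name replace_in_marks and its local variable names are hypothetical, and the same "no < between the tags" assumption applies.

```bash
#!/usr/bin/env bash
# Hypothetical helper: replace periods inside each <mark>...</mark> segment,
# consuming the input from left to right.
replace_in_marks() {
    local rest=$1 out='' seg
    local regex='<mark>[^<]*</mark>'   # assumes no "<" between the tags
    local repl='<wbr />'

    while [[ $rest =~ $regex ]]; do
        seg=${BASH_REMATCH[0]}         # leftmost matching segment
        out+=${rest%%"$seg"*}          # text before the segment, copied unchanged
        out+=${seg//./$repl}           # the segment with every period replaced
        rest=${rest#*"$seg"}           # continue with the text after the segment
    done

    printf '%s\n' "$out$rest"          # whatever is left has no further segments
}

replace_in_marks '<mark>abc</mark> x. <mark>d.e</mark> y.'
```

Quoting "$seg" inside the expansions makes Bash treat the segment as literal text rather than as a glob pattern.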

Feed Perl from stdin or append a file name:

perl -pe 's%(<mark>.*?</mark>)% $1 =~ s|\.|<wbr />|gr %eg'

Output:

<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>

Source: https://unix.stackexchange.com/a/152623/74329

This is probably very inefficient, but it uses only a single regex to search and replace, with no loop needed. I am no expert in shell scripts, so I will not provide one, but the regex should work inside a Perl call.

Try matching:

([^.]+|\G)\.(?=(?:(?!<mark>).)+<\/mark>)

and replacing with:

$1<wbr />

See: regex101

Explanation

MATCH:

Match all .:

( ... ): Capture to group 1 either
- [^.]+: anything but a dot
- |\G: or the end of the last match
\.: then match a dot

Ensure the dot is inside <mark> ... </mark> tags:

(?= ... ): Look ahead and assert
- (?: ... )+: that you match any characters
 - (?!<mark>).: but none of them may start a <mark>
- <\/mark>: then find </mark>, which ensures that you are inside the tag

REPLACE:

$1: Keep the first group (everything before the dot, but inside the tag)
<wbr />: and replace the dot with <wbr />

Using any awk in any shell on all Unix boxes:

$ awk '
BEGIN {
    FS = OFS = "</mark>"
}
{
    for (i = 1; i <= NF; i++) {
        if ( match($i, /<mark>.*/) ) {
            tgt = substr($i, RSTART, RLENGTH)
            gsub(/\./, "<wbr />", tgt)
            $i = substr($i, 1, RSTART - 1) tgt
        }
    }
    print
}
' file
<mark>foo<wbr />bar<wbr />baz</mark> blah. blah. blah. <mark>abc<wbr />def<wbr />ghi</mark>

This Shellcheck-clean pure Bash code updates the value of the variable my_str:

tmp=$my_str
my_str=
while [[ $tmp =~ ^(.*)(\<mark\>.*\</mark\>)(.*)$ ]]; do
    tmp=${BASH_REMATCH[1]}
    my_str=${BASH_REMATCH[2]//./<wbr />}${BASH_REMATCH[3]}${my_str}
done
my_str=${tmp}${my_str}

The code makes no assumptions about characters between <mark> and </mark>. (E.g. < is OK.)
<mark>...</mark> substrings are processed right-to-left within the input string to work around the fact that matching of regular expressions in Bash is always greedy.
See mklement0's excellent answer to How do I use a regex in a shell script? for information about regular expressions in Bash.
See Substituting part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of the expansion mechanism (${var//old/new}) used in ${BASH_REMATCH[2]//./<wbr />}.
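
To see why the segments come out right-to-left, it may help to look at what a single match captures. The sketch below uses a made-up input and hypothetical variable names (g1, g2, g3); it runs the same regex once and prints the three groups.

```bash
#!/usr/bin/env bash
tmp='<mark>a.b</mark> x. <mark>c.d</mark>'

if [[ $tmp =~ ^(.*)(\<mark\>.*\</mark\>)(.*)$ ]]; then
    g1=${BASH_REMATCH[1]}   # greedy prefix: everything before the LAST segment
    g2=${BASH_REMATCH[2]}   # the last <mark>...</mark> segment
    g3=${BASH_REMATCH[3]}   # text after that segment (empty for this input)
    printf 'prefix:  [%s]\n' "$g1"
    printf 'segment: [%s]\n' "$g2"
    printf 'suffix:  [%s]\n' "$g3"
fi
```

Since the first group swallows as much as it can, the loop assigns that prefix back to tmp and finds the next segment further to the left on the following pass.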
