最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

sed - How can I do a large number recursive word substitutions in a large file using bash in fast time? - Stack Overflow

programmeradmin2浏览0评论

This question is based on How do I fix sed commands becoming extremely slow when load is high? with the advice of @markp-fuso, @jhnc, and @Jetchisel to avoid a chameleon question as many of the answers used hashing and maps for optimization.

I have the following bash script

INPUT=`cat $INPUT_FILE`
while read line; do
   PROP_NAME="$(echo $line | cut -f1 -d'=')"
   PROP_VALUE="$(echo $line | cut -f2- -d'=' | sed 's/\$/\\\$/g' | sed 's/\&/\\\&/g')"
   INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")"
done < "$PROPERTIES_FILE"
echo "$INPUT"

This script takes a properties file with format that supports recursion and special characters:

$foo=$barname bar
$barname=Tom&Jerry
$hello=world

And uses it to substitute into a text file with no set format. So

I went to the $foo and said hello to the $hello and they fined me $5.

becomes

I went to the Tom&Jerry bar and said hello to the world and they fined me $5.

The properties file and the text file are hundreds of lines long so performance is important, and the naive implementation results in many minutes or even over an hour of processing time depending on system load. Also important to note is use of \b in the sed, which means that all references are terminated with a punctuation mark or whitespace.

The script cannot do infinite recursion because it only makes one pass through the properties file, which also causes order of properties to matter when recursion is being used.

This question is based on How do I fix sed commands becoming extremely slow when load is high? with the advice of @markp-fuso, @jhnc, and @Jetchisel to avoid a chameleon question as many of the answers used hashing and maps for optimization.

I have the following bash script

INPUT=`cat $INPUT_FILE`
while read line; do
   PROP_NAME="$(echo $line | cut -f1 -d'=')"
   PROP_VALUE="$(echo $line | cut -f2- -d'=' | sed 's/\$/\\\$/g' | sed 's/\&/\\\&/g')"
   INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")"
done < "$PROPERTIES_FILE"
echo "$INPUT"

This script takes a properties file with format that supports recursion and special characters:

$foo=$barname bar
$barname=Tom&Jerry
$hello=world

And uses it to substitute into a text file with no set format. So

I went to the $foo and said hello to the $hello and they fined me $5.

becomes

I went to the Tom&Jerry bar and said hello to the world and they fined me $5.

The properties file and the text file are hundreds of lines long so performance is important, and the naive implementation results in many minutes or even over an hour of processing time depending on system load. Also important to note is use of \b in the sed, which means that all references are terminated with a punctuation mark or whitespace.

The script cannot do infinite recursion because it only makes one pass through the properties file, which also causes order of properties to matter when recursion is being used.

Share Improve this question edited yesterday Ed Morton 205k18 gold badges87 silver badges206 bronze badges asked yesterday Bryan TanBryan Tan 3312 silver badges17 bronze badges 24
  • 1 This question is similar to: How do I fix sed commands becoming extremely slow when load is high?. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. – Paul Hodges Commented yesterday
  • 1 How do the suggestions on the previous copy of this question not answer your problem? Close this and go edit that question to use those suggestions, and show how they don't solve your needs. – Paul Hodges Commented yesterday
  • 1 @PaulHodges they did and were told that changing the question after receiving 7 answers was inappropriate and to create a new one with the extra criteria (which I think are still too loosely outlined here) – jhnc Commented yesterday
  • 1 please clarify if the recursion is only in the properties or if non-properties in the input can turn into properties after some substitutions (eg. props: $k1=1, $k2=3, $k3=v input: $k$k$k1 could become $k$k1 or v) – jhnc Commented yesterday
  • 1 I also see now that your tags all start with $, you no longer have tags that start with ^ or other characters. That can turn this into a different question as we could now search for alpha-numeric strings that start with $ and then test to see if they exist in the list of tags instead of having to search for each of the specific tags which is time consuming+trickier. Do all your tags now start with $? Are they all alphanumeric or can they contain _ or other characters? Can they start with numbers or do they always start with a letter? Any other uniquely identifying features of tags? – Ed Morton Commented yesterday
 |  Show 19 more comments

4 Answers 4

Reset to default 0

Given the the the following data/files:

  • properties.txt
^hello=goodbye
$key2=bar
$key1=foo_$key2
$foo=$barname bar
$barname=Tom&Jerry
$hello=world
  • sample_input.txt
This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

Leave first 2 matches alone: $foobar $hellow ^hello

I  $key2 am the great $key1

I went to the $foo and said hello to the $hello and they fined me $5.

The script.

#!/usr/bin/env bash

while IFS='=' read -ru "$raw" raw_key raw_value; do
  while IFS='=' read -ru "$not" key value; do
    if [[ "$raw_key" == "^"* ]]; then
      raw_key="\\${raw_key}"
    fi
    # If the raw_value contains the key as a substring, perform the replacement.
    if [[ "$raw_value" == *"$key" ]]; then
      raw_value="${raw_value//"$key"/$value}"
    fi
  done {not}< properties.txt
  input_value+=("s/${raw_key}\\b/${raw_value//\&/\\&}/g")
done {raw}< properties.txt

sed_input="$(IFS=';'; printf '%s' "${input_value[*]}")"

sed "$sed_input" sample_input.txt
  • The output
This is a story about world. It starts at a Tom&Jerry bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye

I  bar am the great foo_bar

I went to the Tom&Jerry bar and said hello to the world and they fined me $5.


As per the comment section from @jhnc, if order of the properties.txt matters then consider sorting it, change (sort has a lot of options/flag)

< properties.txt

to

< <(sort properties.txt)

  • Not tested on a large set of data/files, which will surely be very slow, then again if speed is the key then shell scripts is not the best tool for the job.

  • See How-to-replace-variable-content

Holding the entire input file in memory looks crazy to me. The standard, efficient way to process it is as a stream, filtering through a suitable sed program.

Then the question becomes: how do we transform that substitutions file into a sed program? The obvious approach is to use sed itself to transform each line into a s/// command - something like this (untested):

h
# First, quote metachars on the replacement side
s/[^=]*=//
s/[&\\/]/\\&/g
x
# Next, quote the pattern side
s/=.*//
s/[][.*\\/^$]/\\&/g
# Put them together
s,.*,s/&/,
G
s/\n//
s,$,/,

To wrap that into a loop, we just need to add a label at the start and a jump at the end:

1i\
:a
$a\
ta

We don't need to store the generated program in a file, as we can use a process substitution along the lines of sed -f <(sed -f key_value_to_sed properties.txt) <"$INPUT_FILE"

From the comments on the question, it seems you could preprocess the property file to expand property keys that appear in values.

As noted in my answer to the previous question, the original code takes O(m.n) time - m properties looked for in text of size n. Preprocessing the properties and making use of Perl's ability to search literal string alternations in constant time can drop this to O(m+n) - one pass over the properties and one pass over the text:

perl -e '
    # load properties from first file
    while ( ($k,$v) = split "=",<<>>,2 ) {
        chomp $v;
        $k2v{$k} = $v;         # hash for value lookup
        unshift @propkeys, $k; # array for insertion order
        last if eof;
    }

    # build single regex from all keys
    # \Q escapes regex metacharacters
    $re = join "|", map qr/\Q$_\E/, @propkeys;

    # walk properties (in reverse), expanding values as we go
    for $k (@propkeys) {
        $k2v{$k} =~ s/($re)\b/ $seen{$1} ? $k2v{$1} : $1 /ge;
        $seen{$k} = 1;
    }

    # load input from second file
    undef $/;
    $_ = <<>>;

    # convert all properties simultaneously
    s/($re)\b/ $k2v{$1} /ge;

    # output the result
    print;

' propfile textfile

The method I use for preprocessing produces property values that match the behaviour of the question code where order of listing properties affects result:

$key1=foo_$key2
$key2=bar

(expands $key1 in textfile to foo_bar)

vs

$key2=bar
$key1=foo_$key2

(expands $key1 in textfile to foo_$key2)

To expand property values until they no longer contain any keys, the code can become:


perl -e '
    while ( ($k,$v) = split "=",<<>>,2 ) {
        chomp $v;
        $k2v{$k} = $v;
        unshift @propkeys, $k;
        last if eof;
    }
    $re = join "|", map qr/\Q$_\E/, @propkeys;

    # loop trying to expand value of each property key in list
    # remove key from list once its value contains no key references
    while (@propkeys = grep $k2v{$_} =~ s/($re)\b/ $k2v{$1} /ge, @propkeys) {
        die "recursion depth exceeded (@propkeys)\n" if ++$ct > 10;
    }

    undef $/;
    $_ = <<>>;
    s/($re)\b/ $k2v{$1} /ge;
    print;

' propfile textfile


Depending on details not provided in the question, it may be possible and useful to cache or memoise the expanded properties for later reuse without having to recompute the values each time.

Given your new input and some assumptions (see my comments under the question), this might be what you want, using any POSIX awk:

$ cat tst1.sh
#!/usr/bin/env bash

awk '
    NR == FNR {
        pos = index($0, "=")
        tag = substr($0, 1, pos - 1)
        val = substr($0, pos + 1)
        tags2vals[tag] = val
        next
    }
    {
        matching = 1
        numIters = 0
        while ( matching && (numIters++ < 10) ) {
            matching = 0
            head = ""
            tail = $0
            while ( match(tail, /[$][[:alpha:]_][[:alnum:]_]*/) ) {
                tag = val = substr(tail, RSTART, RLENGTH)
                if ( tag in tags2vals ) {
                    matching = 1
                    val = tags2vals[tag]
                }
                head = head substr(tail, 1, RSTART-1) val
                tail = substr(tail, RSTART+RLENGTH)
            }
            $0 = head tail
        }
        print
    }
' "$@"
$ ./tst1.sh props input
I went to the Tom&Jerry bar and said hello to the world and they fined me $5.

The above was run against these input files from the question:

$ head props input
==> props <==
$foo=$barname bar
$barname=Tom&Jerry
$hello=world

==> input <==
I went to the $foo and said hello to the $hello and they fined me $5.

Now let's try it with input that'll test some of the rainy day cases discussed in various comments here and in the previous question. Here is the new sample input:

$ head props2 input2
==> props2 <==
$key2=bar
$key1=foo_$key2
$rec=$ursive
$ursive=$rec
$foo=has=equals   and spaces
$bar=has&backref
$this=has\escape\tchars
$that=foo

==> input2 <==
my $rec line $foo.with.$this#and $stuff $that now $key1 plus $$that with $bar too

and the output which the OP can decide if it's as expected or not:

$ ./tst1.sh props2 input2
my $rec line has=equals   and spaces.with.has\escape\tchars#and $stuff foo now foo_bar plus has=equals   and spaces with has&backref too

Alternatively something like this which expands your mappings up front rather than iterating over the input after every set of changes might be more what you want:

$ cat tst2.sh
#!/usr/bin/env bash

awk '
    NR == FNR {
        pos = index($0, "=")
        tag = substr($0, 1, pos - 1)
        val = substr($0, pos + 1)
        tags2vals[tag] = val
        next
    }
    FNR == 1 {
        matching = 1
        numIters = 0
        while ( matching && (numIters++ < 10) ) {
            matching = 0
            for ( tag in tags2vals ) {
                val = tags2vals[tag]
                if ( val in tags2vals ) {
                    matching = 1
                    tags2vals[tag] = tags2vals[val]
                }
            }
        }
    }
    {
        head = ""
        tail = $0
        while ( match(tail, /[$][[:alpha:]_][[:alnum:]_]*/) ) {
            tag = val = substr(tail, RSTART, RLENGTH)
            if ( tag in tags2vals ) {
                matching = 1
                val = tags2vals[tag]
            }
            head = head substr(tail, 1, RSTART-1) val
            tail = substr(tail, RSTART+RLENGTH)
        }
        $0 = head tail
        print
    }
' "$@"
$ ./tst2.sh props2 input2
my $ursive line has=equals   and spaces.with.has\escape\tchars#and $stuff foo now foo_$key2 plus $foo with has&backref too

The above scripts assume that all of your tags start with $ and follow the same rules as C variable names, i.e. are word-constituent chars starting with a letter or underscore, and adds a safety net where it won't do more than 10 rounds of replacement on the input to guard against infinite recursion such as $rec=$ursive and $ursive=$rec would cause.

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论