This question is based on How do I fix sed commands becoming extremely slow when load is high? with the advice of @markp-fuso, @jhnc, and @Jetchisel to avoid a chameleon question as many of the answers used hashing and maps for optimization.
I have the following bash script
INPUT=`cat $INPUT_FILE`
while read line; do
PROP_NAME="$(echo $line | cut -f1 -d'=')"
PROP_VALUE="$(echo $line | cut -f2- -d'=' | sed 's/\$/\\\$/g' | sed 's/\&/\\\&/g')"
INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")"
done < "$PROPERTIES_FILE"
echo "$INPUT"
This script takes a properties file with format that supports recursion and special characters:
$foo=$barname bar
$barname=Tom&Jerry
$hello=world
And uses it to substitute into a text file with no set format. So
I went to the $foo and said hello to the $hello and they fined me $5.
becomes
I went to the Tom&Jerry bar and said hello to the world and they fined me $5.
The properties file and the text file are hundreds of lines long so performance is important, and the naive implementation results in many minutes or even over an hour of processing time depending on system load. Also important to note is use of \b in the sed, which means that all references are terminated with a punctuation mark or whitespace.
The script cannot do infinite recursion because it only makes one pass through the properties file, which also causes order of properties to matter when recursion is being used.
This question is based on How do I fix sed commands becoming extremely slow when load is high? with the advice of @markp-fuso, @jhnc, and @Jetchisel to avoid a chameleon question as many of the answers used hashing and maps for optimization.
I have the following bash script
INPUT=`cat $INPUT_FILE`
while read line; do
PROP_NAME="$(echo $line | cut -f1 -d'=')"
PROP_VALUE="$(echo $line | cut -f2- -d'=' | sed 's/\$/\\\$/g' | sed 's/\&/\\\&/g')"
INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")"
done < "$PROPERTIES_FILE"
echo "$INPUT"
This script takes a properties file with format that supports recursion and special characters:
$foo=$barname bar
$barname=Tom&Jerry
$hello=world
And uses it to substitute into a text file with no set format. So
I went to the $foo and said hello to the $hello and they fined me $5.
becomes
I went to the Tom&Jerry bar and said hello to the world and they fined me $5.
The properties file and the text file are hundreds of lines long so performance is important, and the naive implementation results in many minutes or even over an hour of processing time depending on system load. Also important to note is use of \b in the sed, which means that all references are terminated with a punctuation mark or whitespace.
The script cannot do infinite recursion because it only makes one pass through the properties file, which also causes order of properties to matter when recursion is being used.
Share Improve this question edited yesterday Ed Morton 205k18 gold badges87 silver badges206 bronze badges asked yesterday Bryan TanBryan Tan 3312 silver badges17 bronze badges 24 | Show 19 more comments4 Answers
Reset to default 0Given the the the following data/files:
- properties.txt
^hello=goodbye
$key2=bar
$key1=foo_$key2
$foo=$barname bar
$barname=Tom&Jerry
$hello=world
- sample_input.txt
This is a story about $hello. It starts at a $foo and ends in a park.
Bob said to Sally "^hello, see you soon"
Leave first 2 matches alone: $foobar $hellow ^hello
I $key2 am the great $key1
I went to the $foo and said hello to the $hello and they fined me $5.
The script.
#!/usr/bin/env bash
while IFS='=' read -ru "$raw" raw_key raw_value; do
while IFS='=' read -ru "$not" key value; do
if [[ "$raw_key" == "^"* ]]; then
raw_key="\\${raw_key}"
fi
# If the raw_value contains the key as a substring, perform the replacement.
if [[ "$raw_value" == *"$key" ]]; then
raw_value="${raw_value//"$key"/$value}"
fi
done {not}< properties.txt
input_value+=("s/${raw_key}\\b/${raw_value//\&/\\&}/g")
done {raw}< properties.txt
sed_input="$(IFS=';'; printf '%s' "${input_value[*]}")"
sed "$sed_input" sample_input.txt
- The output
This is a story about world. It starts at a Tom&Jerry bar and ends in a park.
Bob said to Sally "goodbye, see you soon"
Leave first 2 matches alone: $foobar $hellow goodbye
I bar am the great foo_bar
I went to the Tom&Jerry bar and said hello to the world and they fined me $5.
As per the comment section from @jhnc, if order of the properties.txt
matters then consider sorting it, change (sort
has a lot of options/flag)
< properties.txt
to
< <(sort properties.txt)
Not tested on a large set of data/files, which will surely be very slow, then again if speed is the key then shell scripts is not the best tool for the job.
See How-to-replace-variable-content
Holding the entire input file in memory looks crazy to me. The standard, efficient way to process it is as a stream, filtering through a suitable sed
program.
Then the question becomes: how do we transform that substitutions file into a sed
program? The obvious approach is to use sed
itself to transform each line into a s///
command - something like this (untested):
h
# First, quote metachars on the replacement side
s/[^=]*=//
s/[&\\/]/\\&/g
x
# Next, quote the pattern side
s/=.*//
s/[][.*\\/^$]/\\&/g
# Put them together
s,.*,s/&/,
G
s/\n//
s,$,/,
To wrap that into a loop, we just need to add a label at the start and a jump at the end:
1i\
:a
$a\
ta
We don't need to store the generated program in a file, as we can use a process substitution along the lines of sed -f <(sed -f key_value_to_sed properties.txt) <"$INPUT_FILE"
From the comments on the question, it seems you could preprocess the property file to expand property keys that appear in values.
As noted in my answer to the previous question, the original code takes O(m.n) time - m
properties looked for in text of size n
. Preprocessing the properties and making use of Perl's ability to search literal string alternations in constant time can drop this to O(m+n) - one pass over the properties and one pass over the text:
perl -e '
# load properties from first file
while ( ($k,$v) = split "=",<<>>,2 ) {
chomp $v;
$k2v{$k} = $v; # hash for value lookup
unshift @propkeys, $k; # array for insertion order
last if eof;
}
# build single regex from all keys
# \Q escapes regex metacharacters
$re = join "|", map qr/\Q$_\E/, @propkeys;
# walk properties (in reverse), expanding values as we go
for $k (@propkeys) {
$k2v{$k} =~ s/($re)\b/ $seen{$1} ? $k2v{$1} : $1 /ge;
$seen{$k} = 1;
}
# load input from second file
undef $/;
$_ = <<>>;
# convert all properties simultaneously
s/($re)\b/ $k2v{$1} /ge;
# output the result
print;
' propfile textfile
The method I use for preprocessing produces property values that match the behaviour of the question code where order of listing properties affects result:
$key1=foo_$key2
$key2=bar
(expands $key1
in textfile to foo_bar
)
vs
$key2=bar
$key1=foo_$key2
(expands $key1
in textfile to foo_$key2
)
To expand property values until they no longer contain any keys, the code can become:
perl -e '
while ( ($k,$v) = split "=",<<>>,2 ) {
chomp $v;
$k2v{$k} = $v;
unshift @propkeys, $k;
last if eof;
}
$re = join "|", map qr/\Q$_\E/, @propkeys;
# loop trying to expand value of each property key in list
# remove key from list once its value contains no key references
while (@propkeys = grep $k2v{$_} =~ s/($re)\b/ $k2v{$1} /ge, @propkeys) {
die "recursion depth exceeded (@propkeys)\n" if ++$ct > 10;
}
undef $/;
$_ = <<>>;
s/($re)\b/ $k2v{$1} /ge;
print;
' propfile textfile
Depending on details not provided in the question, it may be possible and useful to cache or memoise the expanded properties for later reuse without having to recompute the values each time.
Given your new input and some assumptions (see my comments under the question), this might be what you want, using any POSIX awk:
$ cat tst1.sh
#!/usr/bin/env bash
awk '
NR == FNR {
pos = index($0, "=")
tag = substr($0, 1, pos - 1)
val = substr($0, pos + 1)
tags2vals[tag] = val
next
}
{
matching = 1
numIters = 0
while ( matching && (numIters++ < 10) ) {
matching = 0
head = ""
tail = $0
while ( match(tail, /[$][[:alpha:]_][[:alnum:]_]*/) ) {
tag = val = substr(tail, RSTART, RLENGTH)
if ( tag in tags2vals ) {
matching = 1
val = tags2vals[tag]
}
head = head substr(tail, 1, RSTART-1) val
tail = substr(tail, RSTART+RLENGTH)
}
$0 = head tail
}
print
}
' "$@"
$ ./tst1.sh props input
I went to the Tom&Jerry bar and said hello to the world and they fined me $5.
The above was run against these input files from the question:
$ head props input
==> props <==
$foo=$barname bar
$barname=Tom&Jerry
$hello=world
==> input <==
I went to the $foo and said hello to the $hello and they fined me $5.
Now let's try it with input that'll test some of the rainy day cases discussed in various comments here and in the previous question. Here is the new sample input:
$ head props2 input2
==> props2 <==
$key2=bar
$key1=foo_$key2
$rec=$ursive
$ursive=$rec
$foo=has=equals and spaces
$bar=has&backref
$this=has\escape\tchars
$that=foo
==> input2 <==
my $rec line $foo.with.$this#and $stuff $that now $key1 plus $$that with $bar too
and the output which the OP can decide if it's as expected or not:
$ ./tst1.sh props2 input2
my $rec line has=equals and spaces.with.has\escape\tchars#and $stuff foo now foo_bar plus has=equals and spaces with has&backref too
Alternatively something like this which expands your mappings up front rather than iterating over the input after every set of changes might be more what you want:
$ cat tst2.sh
#!/usr/bin/env bash
awk '
NR == FNR {
pos = index($0, "=")
tag = substr($0, 1, pos - 1)
val = substr($0, pos + 1)
tags2vals[tag] = val
next
}
FNR == 1 {
matching = 1
numIters = 0
while ( matching && (numIters++ < 10) ) {
matching = 0
for ( tag in tags2vals ) {
val = tags2vals[tag]
if ( val in tags2vals ) {
matching = 1
tags2vals[tag] = tags2vals[val]
}
}
}
}
{
head = ""
tail = $0
while ( match(tail, /[$][[:alpha:]_][[:alnum:]_]*/) ) {
tag = val = substr(tail, RSTART, RLENGTH)
if ( tag in tags2vals ) {
matching = 1
val = tags2vals[tag]
}
head = head substr(tail, 1, RSTART-1) val
tail = substr(tail, RSTART+RLENGTH)
}
$0 = head tail
print
}
' "$@"
$ ./tst2.sh props2 input2
my $ursive line has=equals and spaces.with.has\escape\tchars#and $stuff foo now foo_$key2 plus $foo with has&backref too
The above scripts assume that all of your tags start with $
and follow the same rules as C variable names, i.e. are word-constituent chars starting with a letter or underscore, and adds a safety net where it won't do more than 10 rounds of replacement on the input to guard against infinite recursion such as $rec=$ursive
and $ursive=$rec
would cause.
$k1=1, $k2=3, $k3=v
input:$k$k$k1
could become$k$k1
orv
) – jhnc Commented yesterday$
, you no longer have tags that start with^
or other characters. That can turn this into a different question as we could now search for alpha-numeric strings that start with$
and then test to see if they exist in the list of tags instead of having to search for each of the specific tags which is time consuming+trickier. Do all your tags now start with$
? Are they all alphanumeric or can they contain_
or other characters? Can they start with numbers or do they always start with a letter? Any other uniquely identifying features of tags? – Ed Morton Commented yesterday