I have a file, file1, which contains a single string per line. I want to check whether each string exists in the second column of file2. Each line of file2 contains two strings separated by a single space, but some lines have leading spaces before the first column.
I want to use only grep and/or cut to perform the match and output matching lines from file2 to newFile.txt, ensuring whole word matching (-w).
I've tried
grep -wF -f file1 file2 > newFile.txt
but because of the file size it seems to run forever in the terminal.
I've also tried
grep -wF -f <(cut -d ' ' -f 2 file2) | grep -wF -f - file2 > newFile.txt
This only works for some lines in file2, because some lines have multiple spaces before the two strings, even though the strings themselves are separated by a single space.
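The effect is easy to reproduce on a single line. The sketch below uses a made-up line with three leading spaces (the real number of spaces in file2 is an assumption) and shows what cut returns for field 2, with and without trimming first. The sed step is an extra tool, not part of the original attempt.

```shell
# Hypothetical line with three leading spaces, like the problem rows of file2.
line='   b bbb'

# cut treats every single space as a delimiter, so the leading spaces create
# empty fields and field 2 comes out empty instead of "bbb".
with_cut=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

# Stripping the leading spaces first makes field 2 line up again.
with_trim=$(printf '%s\n' "$line" | sed 's/^ *//' | cut -d ' ' -f 2)

printf 'cut alone: [%s]\n' "$with_cut"
printf 'cut after trimming: [%s]\n' "$with_trim"
```

With this input the first line prints empty brackets and the second prints [bbb].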
File1:
aaa
bbb
ccc
File2:
a aaa (should match)
b bbb (should match and does but the former spaces throws off cut -d, resulting in incomplete output of line in the newFile.txt)
c cc (should not match)
Question: How can I efficiently extract and match whole words in the second column of file2.txt, while handling inconsistent leading spaces? I prefer using grep and/or cut, but I'm open to small modifications.
Asked 2 days ago by Chuepapiii
2 Answers
Asking for help to do this efficiently with grep and cut is like asking for help constructing a garden fence with a kitchen fork and a paperclip. They're simply not the right tools for the job and so they cannot be used efficiently for this, nor can they be used robustly (or portably) without adding yet more tools to the mix to help them out. An awk-only solution, by contrast, would be trivial, efficient, and portable, e.g. the following will work using any POSIX awk:
$ awk 'NR == FNR{ tgts[$1]; next } $2 in tgts' file1 file2
a aaa (should match)
b bbb (should match and does but the former spaces throws off cut -d, resulting in incomplete output of line in the newFile.txt)
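To try the one-liner end to end, here is a minimal sketch that recreates small sample files in a temporary directory and writes the matches to newFile.txt as the question asks. The sample contents, the `workdir` variable, and the three leading spaces on the second line are assumptions for illustration only.

```shell
# Hypothetical sample data modelled on the question.
workdir=$(mktemp -d)
printf '%s\n' aaa bbb ccc > "$workdir/file1"
printf '%s\n' 'a aaa' '   b bbb' 'c cc' > "$workdir/file2"

# While reading file1 (NR == FNR) store each string as an array key.
# While reading file2, print the line if its second field is one of those keys.
# awk's default field splitting ignores leading blanks, so $2 is always the
# second string no matter how many spaces precede the first one.
awk 'NR == FNR { tgts[$1]; next } $2 in tgts' \
    "$workdir/file1" "$workdir/file2" > "$workdir/newFile.txt"

cat "$workdir/newFile.txt"
```

The matching lines are printed unchanged, leading spaces included, because awk prints the original record rather than the split fields.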
Original answer, written before I noticed the OP said "I want to check if each string exists in the second column of file2. file2 contains two single space-separated strings per line", when I thought they wanted to match any "word" in file2:
$ cat tst.awk
NR == FNR {
    tgts[$1]
    next
}
{
    split($0, words, /[^[:alnum:]_]+/)
    for ( i in words ) {
        if ( words[i] in tgts ) {
            print
            next
        }
    }
}
$ awk -f tst.awk file1 file2
a aaa (should match)
b bbb (should match and does but the former spaces throws off cut -d, resulting in incomplete output of line in the newFile.txt)
If you have more characters than just alpha-numerics and _ that you consider part of a "word" then just change [^[:alnum:]_] to include them, e.g. if a "word" can contain . and - then change it to [^[:alnum:]_.-]
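As a rough illustration of that change, the sketch below runs the same script with `.` and `-` added to the bracket expression. The file contents and the `workdir` and `out.txt` names are made up for the example.

```shell
# Hypothetical targets that contain "." and "-".
workdir=$(mktemp -d)
printf '%s\n' 'foo.bar' 'x-y' > "$workdir/file1"
printf '%s\n' 'a foo.bar' 'b foo' 'c x-y z' > "$workdir/file2"

awk '
NR == FNR { tgts[$1]; next }
{
    # "." and "-" now count as word characters, so foo.bar and x-y
    # are each kept as one word instead of being split apart.
    split($0, words, /[^[:alnum:]_.-]+/)
    for ( i in words ) {
        if ( words[i] in tgts ) {
            print
            next
        }
    }
}
' "$workdir/file1" "$workdir/file2" > "$workdir/out.txt"

cat "$workdir/out.txt"
```

Here "a foo.bar" and "c x-y z" are printed, while "b foo" is not, because "foo" alone is not one of the targets. The `-` is placed last in the bracket expression so that it is taken literally rather than as a range.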
I would also have leaned towards awk, but in this case I've used the paste command. I guess you can use something like this regex:
pattern=$(paste -sd'|' file1)
grep -E "^[[:space:]]*\S+[[:space:]]+($pattern)\b" file2 > newFile.txt
If I understand you correctly, this should extract the lines from file2 whose second column exactly matches one of the file1 entries, even with leading spaces.
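For reference, here is a sketch of the same idea on small sample files, written with POSIX character classes only, since `\S` and `\b` are not part of POSIX extended regular expressions. The sample data, the extra "d aaab" line, and the `workdir` name are assumptions for illustration.

```shell
# Hypothetical sample data modelled on the question, plus one near-miss line.
workdir=$(mktemp -d)
printf '%s\n' aaa bbb ccc > "$workdir/file1"
printf '%s\n' 'a aaa' '   b bbb' 'c cc' 'd aaab' > "$workdir/file2"

# Join the file1 entries into one alternation: aaa|bbb|ccc
pattern=$(paste -sd'|' "$workdir/file1")

# Optional leading blanks, first column, blanks, then one of the entries
# followed by whitespace or end of line so that "aaab" does not match "aaa".
grep -E "^[[:space:]]*[^[:space:]]+[[:space:]]+($pattern)([[:space:]]|\$)" \
    "$workdir/file2" > "$workdir/newFile.txt"

cat "$workdir/newFile.txt"
```

Two caveats apply to this approach: the entries of file1 are interpreted as regular expressions, so any metacharacters in them would need escaping, and a very large file1 produces a very large pattern on the command line.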
awk? It's built for searching, and knows about columns. – Gordon Davisson Commented 2 days ago
file1 size. One thing that you could try is splitting file1 in smaller chunks (e.g. with split), process them independently, and concatenate the outputs. You could measure the time T(1) it takes with a 1-line file1, progressively increase the number of lines, and find the n value that minimizes T(n) / n. – Renaud Pacalet Commented 2 days ago
wc file1 file2 should be sufficient – markp-fuso Commented 2 days ago
grep ... | grep ... script is invalid syntax and is missing any reference to file1 – markp-fuso Commented 2 days ago