bash - How to match patterns from one file against a specific column in another file using grep?

I have file1, which contains a single string per line. I want to check whether each string exists in the second column of file2. file2 contains two strings per line, separated by a single space, but some lines have leading spaces before the first column.

I want to use only grep and/or cut to perform the match and output matching lines from file2 to newFile.txt, ensuring whole word matching (-w).

I've tried

grep -wF -f file1 file2 > newFile.txt

but due to the file size terminal runs infinitely.

I've also tried

grep -wF -f <(cut -d ' ' -f 2 file2) | grep -wF -f - file2 > newFile.txt

This only works for some lines in file2, because some lines have multiple spaces before the two strings, although the two strings themselves are separated by only a single space.

File1:

 aaa
 bbb
 ccc

File2:

 a aaa (should match) 
     b bbb (should match and does but the former spaces throws off cut -d, resulting in incomplete output of line in the newFile.txt)
 c cc (should not match)

Question: How can I efficiently extract and match whole words in the second column of file2.txt, while handling inconsistent leading spaces? I prefer using grep and/or cut, but I'm open to small modifications.

asked 2 days ago by Chuepapiii

Are you sure you wouldn't rather use awk? It's built for searching, and knows about columns. – Gordon Davisson Commented 2 days ago
awk works, but I want to know how grep/cut would work – Chuepapiii Commented 2 days ago
"but due to the file size terminal runs infinitely": if it is really a matter of file size, then it is likely the size of file1. One thing you could try is splitting file1 into smaller chunks (e.g. with split), processing them independently, and concatenating the outputs. You could measure the time T(n) it takes with an n-line file1, starting at n = 1 and progressively increasing n, and find the value of n that minimizes T(n) / n. – Renaud Pacalet Commented 2 days ago
please update the question with the size of both files; the complete output from wc file1 file2 should be sufficient – markp-fuso Commented 2 days ago
fwiw, the grep ... | grep ... script is invalid syntax and is missing any reference to file1 – markp-fuso Commented 2 days ago
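The chunking idea from the comments could be sketched roughly as follows. The sample data, the `chunk.` prefix and the chunk size of 2 are illustrative assumptions, not values from the question. Note that this still matches whole words anywhere on the line, like the original `grep -wF -f file1 file2`, not only in the second column.

```shell
#!/usr/bin/env bash
# Sketch: split file1 into chunks, run grep once per chunk, merge the results.
# File names, sample data and chunk size are assumptions for illustration.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' aaa bbb ccc > file1
printf '%s\n' ' a aaa' '     b bbb' ' c cc' > file2

# 2 patterns per chunk here; a real run would use a far larger number.
split -l 2 file1 chunk.

# -n prefixes every hit with its line number in file2, so the merged output
# can be de-duplicated (a line may match in several chunks) and restored to
# file2 order before the prefix is cut off again.
for f in chunk.*; do
    grep -nwF -f "$f" file2 || true   # grep exits 1 when a chunk matches nothing
done | sort -t: -k1,1n -u | cut -d: -f2- > newFile.txt

cat newFile.txt
```

Simply concatenating the per-chunk outputs would change the line order and could repeat lines, which is why the sketch carries the line numbers through `sort -u`.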

2 Answers

Asking for help to do this efficiently with grep and cut is like asking for help constructing a garden fence with a kitchen fork and a paperclip. They're simply not the right tools for the job and so they cannot be used efficiently for this, nor can they be used robustly (or portably) without adding yet more tools to the mix to help them out. An awk-only solution, by contrast, would be trivial, efficient, and portable, e.g. the following will work using any POSIX awk:

$ awk 'NR == FNR{ tgts[$1]; next } $2 in tgts' file1 file2
 a aaa (should match)
     b bbb (should match and does but the former spaces throws off cut -d, resulting in incomplete output of line in the newFile.txt)

Original answer before I noticed the OP said "I want to check if each string exists in the second column of file2. file2 contains two single space-separated strings per line" and thought they wanted to match all "words" in file2:

$ cat tst.awk
NR == FNR {
    tgts[$1]
    next
}
{
    split($0, words, /[^[:alnum:]_]+/)
    for ( i in words ) {
        if ( words[i] in tgts ) {
            print
            next
        }
    }
}

$ awk -f tst.awk file1 file2
 a aaa (should match)
     b bbb (should match and does but the former spaces throws off cut -d, resulting in incomplete output of line in the newFile.txt)

If you consider more characters than just alphanumerics and _ to be part of a "word", then change [^[:alnum:]_] to include them, e.g. if a "word" can contain . and - then change it to [^[:alnum:]_.-].
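As a rough illustration of that change, the sketch below runs the same logic with `.` and `-` treated as word characters. The sample data is invented for the example, and the script is inlined rather than stored in tst.awk.

```shell
#!/usr/bin/env bash
# Sketch: word splitting where "." and "-" count as part of a word.
# The sample data is an assumption made up for this example.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' 'foo-bar' 'v1.2' > file1
printf '%s\n' 'x foo-bar' 'y foo' 'z v1.2 extra' 'w v1' > file2

awk '
NR == FNR { tgts[$1]; next }          # first file: remember every target word
{
    # Split on runs of characters that are NOT word characters; the "-" is
    # last in the bracket expression so that it is taken literally.
    n = split($0, words, /[^[:alnum:]_.-]+/)
    for (i = 1; i <= n; i++) {
        if (words[i] in tgts) {
            print
            next
        }
    }
}' file1 file2 > newFile.txt

cat newFile.txt
```

With the original `[^[:alnum:]_]` separator, `foo-bar` would be split into `foo` and `bar`, so neither `x foo-bar` nor `z v1.2 extra` would be printed for this file1.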

I would have leaned towards using awk as well. In this case I've used the paste command to join the file1 entries into one alternation; I guess you can use something like this regex:

pattern=$(paste -sd'|' file1)
grep -E "^[[:space:]]*\S+[[:space:]]+($pattern)\b" file2 > newFile.txt

If I understand you correctly, this should extract the lines from file2 whose second column exactly matches one of the file1 entries, even when the line has leading spaces.
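A variant of the same idea that avoids the `\S` and `\b` shorthands (which are GNU extensions) and passes the patterns through a file instead of the command line might look like the sketch below. It assumes the file1 entries contain no regex metacharacters, as in the question's sample; `patterns.txt` is a hypothetical file name. It is not expected to be faster than the original attempt when file1 is very large.

```shell
#!/usr/bin/env bash
# Sketch: build one anchored ERE per file1 entry, then match with grep -E -f.
# Assumes file1 entries contain no regex metacharacters; patterns.txt is a
# made-up name for the generated pattern file.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' ' aaa' ' bbb' ' ccc' > file1
printf '%s\n' ' a aaa' '     b bbb' ' c cc' 'aaa c' > file2

# 1. trim blanks around each entry, 2. drop empty lines,
# 3. wrap the entry so it can only match as the whole second column:
#    optional leading blanks, first column, blanks, entry, then blank or end.
sed -e 's/^[[:space:]]*//' \
    -e 's/[[:space:]]*$//' \
    -e '/^$/d' \
    -e 's/.*/^[[:space:]]*[^[:space:]]+[[:space:]]+&([[:space:]]|$)/' \
    file1 > patterns.txt

grep -E -f patterns.txt file2 > newFile.txt

cat newFile.txt
```

Reading the patterns with `-f` sidesteps the argument-length limit that a single `$pattern` string built from a large file1 could hit.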
