I have a command that reads a giant (92 GB, 100M+ lines) file in parallel. The file grows over time - it is a log file.
The command itself:
parallel -a "2025-03-18.log" -k --pipepart "grep processor | grep -e TxnError | jq .msg"
I have 16 CPUs and 32 GB RAM, but it still takes 20-25 minutes to parse the file, even parallelized. So I would like to start the parallelized read from a specific line. How can I do that - any thoughts?
Comment from Philippe: stdbuf -oL tail -n +100 "2025-03-18.log" | parallel -k ..., to start with line number 100?

1 Answer
As I am sure you know, --pipepart is way more efficient than --pipe: each job reads its own part of the file directly, instead of a single process reading everything and distributing it.
However, you use the default block size, and in your case 1 MB is way too small: a new job is started for every 1 MB of the file.
I like using --block -1. If you have 16 job slots, it will split the input file into 16 blocks of equal size, computed on the fly. In total this starts only 16 jobs, each reading a different part of your big file. (And yes: it only splits at \n - GNU Parallel is not braindead.)
parallel --block -1 -a "2025-03-18.log" -k --pipepart "grep processor | grep -e TxnError | jq .msg"
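If you want to be explicit about the number of job slots rather than rely on the autodetected CPU count, -j pins it (a minor variation on the command above; not part of the original answer):

# 16 job slots, so --block -1 splits the file into 16 equal blocks
parallel -j 16 --block -1 -a "2025-03-18.log" -k --pipepart "grep processor | grep -e TxnError | jq .msg"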
If you insist, you can start reading from a specific line:
# This is slow - not recommended
tail -n +100000 file | parallel --pipe ...
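For completeness, here is that approach filled in with the pipeline from the question (100000 is only a placeholder line number; tail still has to read and discard everything before that line):

tail -n +100000 "2025-03-18.log" | parallel --pipe -k "grep processor | grep -e TxnError | jq .msg"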
but then you cannot use --pipepart, because there is no easy way to find a given line number without reading the file up to that point.
But I have the feeling you do not really need to know the line number. You just want to skip the first 62 GB of the file and treat the rest as a 30 GB file.
And that can be done:
# Requires root access
losetup -o 66572539904 /dev/loop0 2025-03-18.log
Now you use /dev/loop0 as your file:
parallel -a /dev/loop0 -k --block -1 --pipepart "grep processor | grep -e TxnError | jq .msg"
The very first line will likely be a partial line, but I think you will be OK with that.
(I am not 100% sure how /dev/loop deals with a growing file.)
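The offset passed to losetup has to come from somewhere. Two possible ways to derive it, as a rough sketch (the variable names and the timestamp are illustrative, not from the original answer):

# Option 1: keep only the last 30 GiB, based on the current file size
size=$(stat -c %s "2025-03-18.log")
offset=$(( size - 30 * 1024 * 1024 * 1024 ))

# Option 2: byte offset of the first line matching a given timestamp
# (one sequential scan, but it stops at the first match)
offset=$(grep -b -m1 '2025-03-18T12:00:00' "2025-03-18.log" | cut -d: -f1)

sudo losetup -o "$offset" /dev/loop0 "2025-03-18.log"
# When done, detach with: sudo losetup -d /dev/loop0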
If you do not have losetup, you can force GNU Parallel to split into more parts and simply ignore the first ones. Here I split into 1 GB blocks, then ignore the first 62 of those:
parallel --pipepart --block 1G -a big.log 'true {= if(seq() <= 62){skip()} =}; do_stuff'
(true is just a dummy command that gives the perl code somewhere to run.)
Or you can do it in a more fine-grained way with 100 MB blocks:
parallel --pipepart --block 0.1G -a big.log 'true {= if(seq() <= 620){skip()} =}; do_stuff'
Or, if you want to keep all your CPU cores busy: --block -3 splits the file into 3 * (number of job slots) = 48 parts. Skipping the first 32 of those leaves 16 jobs that process the last 1/3 of the file:
parallel --pipepart --block -3 -a big.log 'true {= if(seq() <= 32){skip()} =}; do_stuff'
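Putting that last variant together with the pipeline from the question (a sketch; skipping 32 of 48 parts assumes the last third of the file is what you want):

parallel --pipepart --block -3 -a "2025-03-18.log" -k 'true {= if(seq() <= 32){skip()} =}; grep processor | grep -e TxnError | jq .msg'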