
GNU parallel: is it possible to force a parallel read of a file starting from a specific line?


I have a command that reads a giant (92 GB, 100M+ lines) file in parallel. The file grows over time; it is a log file.

Command itself:

parallel -a "2025-03-18.log" -k --pipepart "grep processor | grep -e TxnError | jq .msg"

I have 16 CPUs and 32 GB of RAM, but it still takes 20-25 minutes to parse the file, even parallelized. So I decided to start the parallelized read from a specific line. How can I do that, any thoughts?


Asked Mar 19 at 18:18 by Azii
  • Have you tried stdbuf -oL tail -n +100 "2025-03-18.log" | parallel -k ..., to start with line number 100? – Philippe Commented Mar 19 at 20:12
  • Madness to have logfiles of this size (imho); look into using logrotate with the copytruncate option to get the file size under control. Meanwhile, Ole Tange's feedback contains lots of reasonable options wrt parallel. – ticktalk Commented Mar 23 at 9:36

1 Answer


As I am sure you know, --pipepart is way more efficient than --pipe.

However, you use the default block size, and in your case 1 MB is way too small: you start a process for each 1 MB.

I like using --block -1. If you have 16 jobslots, it will split the input file on the fly into 16 blocks of equal size, so in total only 16 jobs are started, each reading a different part of your big file. (And yes: it will only split at \n; GNU Parallel is not braindead.)

parallel --block -1 -a "2025-03-18.log" -k --pipepart "grep processor | grep -e TxnError | jq .msg"
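
If you want the 16-way split to be explicit instead of relying on the detected core count, you can pin the number of jobslots with -j (just a variation of the command above, not required):

parallel -j16 --block -1 -a "2025-03-18.log" -k --pipepart "grep processor | grep -e TxnError | jq .msg"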

If you insist, you can start reading from a specific line:

# This is slow - not recommended
cat file | tail -n +100000 | parallel --pipe ...
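
Spelled out with the filter from the question, that would look roughly like this (still slow: a single tail process has to stream everything before the starting line):

tail -n +100000 "2025-03-18.log" |
  parallel -k --pipe "grep processor | grep -e TxnError | jq .msg"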

but then you cannot use --pipepart. This is because there is no easy way to count the number of lines without reading the file.
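
If the starting point is identifiable by content rather than by line number (say, a timestamp), one workaround is to translate it into a byte offset with grep -b. This is only a sketch; the pattern below is a hypothetical placeholder for whatever your log lines actually contain:

# Find the byte offset of the first line matching the (hypothetical) pattern
offset=$(grep -b -m 1 '2025-03-18T14:00' "2025-03-18.log" | cut -d: -f1)

That byte offset can then be used with the losetup trick below.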

But I have the feeling you do not really need to know the line number. You just want to skip the first 62GB of the file, and see the rest of the file as a 30GB file.

And that can be done:

# Requires root access
losetup -o 66572539904 /dev/loop0 2025-03-18.log
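
In case it is not obvious where a number like 66572539904 comes from: it is simply a byte offset roughly 62 GiB into the file. A minimal sketch for computing it yourself (assuming GNU stat, and that you want roughly the last 30 GiB of whatever the file holds right now):

size=$(stat -c %s "2025-03-18.log")             # current size in bytes
offset=$(( size - 30 * 1024 * 1024 * 1024 ))    # skip everything but the last ~30 GiB
losetup -o "$offset" /dev/loop0 "2025-03-18.log"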

Now you use /dev/loop0 as your file:

parallel -a /dev/loop0 -k --block -1 --pipepart "grep processor | grep -e TxnError | jq .msg"

The very first line will likely be a partial line, but I think you will be OK with that.

(I am not 100% sure how /dev/loop deals with a growing file.)

If you do not have losetup, you can force GNU Parallel to split into more parts and simply ignore the first ones. Here I split into 1 GB blocks and ignore the first 62 of those:

parallel --pipepart --block 1G -a big.log 'true {= if(seq() <= 62){skip()} =}; do_stuff'

(true is just a dummy call to be able to run some perl code).
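
With the filter from the question substituted for do_stuff, that would look something like:

parallel --pipepart --block 1G -a "2025-03-18.log" \
  'true {= if(seq() <= 62){skip()} =}; grep processor | grep -e TxnError | jq .msg'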

Or you can do that in a more fine-grained way and use 100 MB blocks:

parallel --pipepart --block 0.1G -a big.log 'true {= if(seq() <= 620){skip()} =}; do_stuff'

Or, if you want to keep all your CPU cores busy: --block -3 splits the file into 3 * number of jobslots = 48 parts. Skipping the first 32 of those leaves 16 jobs that process the last 1/3 of the file:

parallel --pipepart --block -3 -a big.log 'true {= if(seq() <= 32){skip()} =}; do_stuff'
