
Exclude lines with duplicate values in awk - Stack Overflow


I have a tsv file like this:

chr1    28932   29543   chr1    29159   29422   RNAPOLII_T1_pos_1_q05_peak_1    114 .   5.55679 14.5827 11.4511 119
chr1    199425  200055  .   -1  -1  .   .   .   .   .   .   .
chr1    206917  207235  .   -1  -1  .   .   .   .   .   .   .
chr1    629342  630035  chr1    629392  629981  RNAPOLII_T1_pos_1_q05_peak_2    89  .   1.53473 11.9814 8.95881 434
chr1    630824  631475  chr1    630904  631286  RNAPOLII_T1_pos_1_q05_peak_3    110 .   1.66136 14.1185 11.0065 34
chr1    631947  632282  .   -1  -1  .   .   .   .   .   .   .
chr1    632546  632864  chr1    632596  632814  RNAPOLII_T1_pos_1_q05_peak_4    53  .   1.45791 8.17161 5.34813 45
chr1    633792  634430  chr1    634016  634206  RNAPOLII_T1_pos_1_q05_peak_5    42  .   1.40136 6.99814 4.24691 25
chr1    634453  634840  chr1    634503  634790  RNAPOLII_T1_pos_1_q05_peak_6    68  .   1.68267 9.80195 6.88384 32
chr1    778082  779111  chr1    778407  778997  RNAPOLII_T1_pos_1_q05_peak_7    290 .   8.3336  32.7328 29.0707 207
chr1    827049  827851  chr1    827150  827773  RNAPOLII_T1_pos_1_q05_peak_8    43  .   3.42454 7.13586 4.37707 251
chr1    941573  941926  chr1    941623  941876  RNAPOLII_T1_pos_1_q05_peak_9    48  .   3.83227 7.61827 4.82768 136
chr1    989375  989734  .   -1  -1  .   .   .   .   .   .   .
chr1    990673  991342  .   -1  -1  .   .   .   .   .   .   .
chr1    991736  992432  chr1    991990  992382  RNAPOLII_T1_pos_1_q05_peak_10   58  .   4.33261 8.71042 5.8516  205
chr1    992407  994252  chr1    992698  993311  RNAPOLII_T1_pos_1_q05_peak_11   62  .   3.89152 9.08737 6.20787 479
chr1    992407  994252  chr1    993534  994152  RNAPOLII_T1_pos_1_q05_peak_12   60  .   3.39559 8.88015 6.01409 170
chr1    994237  998788  chr1    994346  998738  RNAPOLII_T1_pos_1_q05_peak_13   633 .   13.9139 67.4929 63.32   2194
chr1    998775  1002233 chr1    998825  1002089 RNAPOLII_T1_pos_1_q05_peak_14   850 .   19.1217 89.4139 85.0549 1234
chr1    1004118 1004538 .   -1  -1  .   .   .   .   .   .   .
chr1    1005008 1006499 chr1    1005058 1005522 RNAPOLII_T1_pos_1_q05_peak_15   55  .   4.46653 8.38165 5.54531 345
chr1    1019994 1020390 .   -1  -1  .   .   .   .   .   .   .
chr1    1020344 1020662 .   -1  -1  .   .   .   .   .   .   .
chr1    1078905 1080785 chr1    1079111 1079300 RNAPOLII_T1_pos_1_q05_peak_16   48  .   3.07279 7.6091  4.82217 93
chr1    1078905 1080785 chr1    1079358 1079899 RNAPOLII_T1_pos_1_q05_peak_17   90  .   4.56426 12.0559 9.03203 158
chr1    1157419 1158008 chr1    1157469 1157958 RNAPOLII_T1_pos_1_q05_peak_18   113 .   5.84903 14.4751 11.3505 128
chr1    1216203 1216549 .   -1  -1  .   .   .   .   .   .   .
chr1    1216526 1216931 .   -1  -1  .   .   .   .   .   .   .
chr1    1231559 1232418 chr1    1231766 1232368 RNAPOLII_T1_pos_1_q05_peak_19   175 .   7.74351 20.8689 17.5159 180
chr1    1248702 1249624 .   -1  -1  .   .   .   .   .   .   .

I want to use awk to select the data in column 5, but only for unique values in column 2. For example, at lines 16/17 the value 992407 is repeated; I only want to keep the first column-5 value for those coordinates, 992698. Duplicates will always be immediately one after the other, so I wrote this awk line to filter the file:

awk 'BEGIN {$prev=-1} { if($2 != $prev){ print $5; $prev=$2 }}'

which, as I understand it, should exclude any line whose value in column 2 is identical to the value in column 2 on the line just before it. However, no lines are filtered when I apply it. What am I missing?

  • awk variables are not shell variables. Do not use $var to expand them, use just var. In your case awk '!seen[$2]++ {print $5}' should do what you want. – Renaud Pacalet Commented Mar 14 at 17:00
  • replace $prev with prev and see if your awk script now generates the desired result; if you're still not seeing the desired result then update the question with your latest coding attempt AND the expected output (for the provided sample input) – markp-fuso Commented Mar 14 at 17:46
  • Variables don't need to be initialized, variables don't have a dollar sign, the conditional can be put outside the block. Maybe this works for you awk 'val == $2{next} {val = $2}1' file. Also this is not a csv file, maybe a tsv from genome browser? – Andre Wildberg Commented Mar 14 at 17:57
  • Can you add your expected output for the given input data? – anubhava Commented Mar 14 at 18:37
  • The sample input you show is not CSV, please edit your question to show the actual CSV file or tell us what separates the values in the file you posted (blanks, tabs, fixed width or something else). – Ed Morton Commented Mar 16 at 15:10
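
The !seen[$2]++ one-liner suggested in the first comment is also worth noting: it drops every repeated column-2 value, not just consecutive ones, at the cost of keeping one array entry per distinct value. A minimal sketch of how it reads (the filename peaks.tsv is hypothetical):

# seen[$2] is 0 (false) the first time a column-2 value appears, so !seen[$2] is true
# and $5 is printed; the ++ then increments the counter, so later rows carrying the
# same column-2 value fail the test and are skipped.
awk '!seen[$2]++ { print $5 }' peaks.tsv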

1 Answer


In GNU AWK, $i means the i-th field of the current record. If the variable i has never been set, it evaluates to 0, so $i is $0, the whole record. Therefore

awk 'BEGIN {$prev=-1} { if($2 != $prev){ print $5; $prev=$2 }}'

is the same as doing

awk 'BEGIN {$0=-1} { if($2 != $0){ print $5; $0=$2 }}'

The condition inside the if is therefore always true for your data (or any other multi-column data), because $0 denotes the whole line in GNU AWK: $2 never equals the full record, so every line prints its fifth field and nothing gets filtered.
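
A quick way to see this pitfall in isolation (illustrative one-liners, not part of the original answer):

# with prev unset it evaluates to 0, so $prev is $0, the whole record
echo 'a b c' | awk '{ print $prev }'             # prints: a b c
# once prev holds a number, $prev is that field
echo 'a b c' | awk '{ prev = 2; print $prev }'   # prints: b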

You should assign to and compare against the variable prev, without the $. You can also use the condition as a pattern and drop the if altogether:

awk 'BEGIN {prev=-1}($2 != prev){ print $5; prev=$2 }'
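
As a usage sketch, assuming the sample data is saved as peaks.tsv (hypothetical filename):

# prints one column-5 value per run of identical column-2 values; for the repeated
# coordinate 992407 only the first row's 992698 is printed and 993534 is skipped
awk 'BEGIN {prev=-1} ($2 != prev){ print $5; prev=$2 }' peaks.tsv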