I have a file with a header row containing the file path as the name for each column and I'd like to extract and print out just the file name. There are over 100 columns.
E.g. Input header row:
AAF2Y7VM5-8/cnv/F04_reads.tsv AAF2Y7VM5-7/cnv/D04_reads.tsv AAF2Y7VM5-6/cnv/E04_reads.tsv
Goal output header row:
F04_reads.tsv D04_reads.tsv E04_reads.tsv
I have:
awk -F '[/|\t]' '{if (NR==1) {for(i=1;i<=NF;i++) printf $i"\t"}}' ZScores.txt
That outputs all three delimited values for every column, but I want just the third value, i.e. the file name, for each column in this row. Awk, bash, or sed solutions appreciated!
asked Feb 6 at 19:36 by user2293045; edited Feb 7 at 3:33 by RavinderSingh13
- Just FYI, I suspect you might be interested in our sister site: Bioinformatics. – terdon, commented Feb 7 at 16:38
- What about the remainder of the file beyond the header? Skip or print? – dawg, commented Feb 7 at 22:15
7 Answers
Using any awk if your fields are tab-separated as they appear to be:
$ awk 'NR==1{gsub("[^\t]+/","")} 1' file
F04_reads.tsv D04_reads.tsv E04_reads.tsv
Otherwise, using any POSIX awk:
$ awk 'NR==1{gsub("[^[:space:]]+/","")} 1' file
F04_reads.tsv D04_reads.tsv E04_reads.tsv
Change [^[:space:]] to [^ \t] if you don't have a POSIX awk, but get a new awk.
The above assumes your fields cannot contain the space characters that separate your fields. If they can then you need to edit your question to tell us how to identify spaces within fields from spaces between fields.
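To make that assumption concrete, here is a small sketch comparing the two bracket expressions on a made-up header whose first path contains a space inside a directory name. The header text and the variable names are hypothetical, and the second command needs an awk that understands [:space:].

```shell
# Hypothetical header: tab-separated paths, the first with a space in a directory name.
header=$(printf 'my dir/cnv/F04_reads.tsv\tother/cnv/D04_reads.tsv')

# Tab-only class: everything up to the last "/" inside each tab-separated field is removed.
tab_only=$(printf '%s\n' "$header" | awk 'NR==1{gsub("[^\t]+/","")} 1')

# Whitespace class: the match cannot cross the space, so "my" is left behind.
any_space=$(printf '%s\n' "$header" | awk 'NR==1{gsub("[^[:space:]]+/","")} 1')

printf '%s\n' "$tab_only" "$any_space"
```

The first result is the clean pair of file names; the second still starts with "my ", which is why the whitespace version is only safe when fields cannot contain spaces.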
Tweaking OP's current code to print every 3rd field:
$ awk -F '[/|\t]' '{if (NR==1) {for(i=3;i<=NF;i+=3) printf $i"\t"}}' ZScores.txt
F04_reads.tsv D04_reads.tsv E04_reads.tsv
NOTE: there's a trailing \t on that output; also, the line does not end with a \n
Removing the trailing \t, adding a trailing \n, and skipping processing of the rest of the file:
$ awk -F '[/|\t]' 'NR==1 { for (i=3;i<=NF;i+=3) { printf "%s%s", sep, $i; sep="\t" }; print ""; exit }' ZScores.txt
F04_reads.tsv D04_reads.tsv E04_reads.tsv
Where:
- sep is blank for the first pass through the loop, then set to \t for the remaining passes through the loop
- print "" terminates the printf line of output with a \n (default output record separator)
- exit keeps awk from reading (and in this case ignoring) the rest of the file
NOTE: OP's code places a tab (\t) between output values but the expected output shows a single space between values; if OP wishes to separate the output with single spaces then replace sep="\t" with sep=" "
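If the paths might not always have exactly three components, one possible variation (a sketch, not part of OP's code) is to split only on tabs and take the last "/"-separated piece of each field. The sample file, the `parts` array and the `header` variable below are made up for illustration; the `sep` idiom is the same one described above.

```shell
# Build a small tab-separated sample file (hypothetical data mirroring the question).
tmp=$(mktemp)
printf 'AAF2Y7VM5-8/cnv/F04_reads.tsv\tAAF2Y7VM5-7/cnv/D04_reads.tsv\tAAF2Y7VM5-6/cnv/E04_reads.tsv\n' > "$tmp"
printf '0.1\t0.2\t0.3\n' >> "$tmp"

# Header only: split each tab-separated field on "/" and keep the last piece.
header=$(awk -F '\t' 'NR==1 {
    for (i = 1; i <= NF; i++) {
        n = split($i, parts, "/")      # n = number of path components in this field
        printf "%s%s", sep, parts[n]   # sep is empty before the first field
        sep = "\t"
    }
    print ""                           # finish the line with a newline
    exit                               # do not read the rest of the file
}' "$tmp")

printf '%s\n' "$header"
rm -f "$tmp"
```

This prints the three file names separated by tabs, with no trailing tab.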
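1st solution: With your shown samples, please try the following. It is written for GNU awk, since the three-argument form of match() is a gawk extension. The NR==1 condition limits the rewrite to the header row; without it every later line would also be replaced by the header names.
awk '
NR==1{
  # collect the file name from each /dir/name.tsv match, left to right
  while(match($0,/(\/[^\/]*\/)([^.]*\.tsv)/,arr)){
    val=(val?val OFS:"") arr[2]
    $0=substr($0,RSTART+RLENGTH)
  }
  $0=val
}
1
' Input_file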
2nd solution: If a Perl one-liner is acceptable:
perl -nle 'print join(" ", /([^\/]+_reads\.tsv)/g)' Input_file
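A non-awk solution:
$ sed 1q file | tr -s ' ' '\n' | cut -d/ -f3 | paste -sd' '
This extracts the first row, transposes it to a column, cuts the 3rd field, and serializes the result back to a row. Note that \n must be quoted, otherwise the shell passes a plain n to tr.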
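To see what each stage of that pipeline contributes, here is a sketch that runs the stages one at a time on a made-up two-line file. The variable names are hypothetical; I also let tr split on tabs as well as spaces and give paste an explicit - operand, which are my own additions for portability rather than part of the one-liner.

```shell
# Hypothetical sample file: space-separated header plus one data row.
tmp=$(mktemp)
printf 'AAF2Y7VM5-8/cnv/F04_reads.tsv AAF2Y7VM5-7/cnv/D04_reads.tsv AAF2Y7VM5-6/cnv/E04_reads.tsv\n' > "$tmp"
printf '1 2 3\n' >> "$tmp"

row=$(sed 1q "$tmp")                                  # stage 1: header row only
column=$(printf '%s\n' "$row" | tr -s ' \t' '\n\n')   # stage 2: one path per line
names=$(printf '%s\n' "$column" | cut -d/ -f3)        # stage 3: third "/"-separated field
result=$(printf '%s\n' "$names" | paste -sd' ' -)     # stage 4: join lines with spaces

printf '%s\n' "$result"
rm -f "$tmp"
```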
KISS:
$ echo $(head -n1 file | tr ' ' '\n' | cut -d/ -f3)
F04_reads.tsv D04_reads.tsv E04_reads.tsv
or
$ echo $(head -n1 file | tr ' ' '\n' | awk -F/ 'NF{printf "%s " ,$3}')
F04_reads.tsv D04_reads.tsv E04_reads.tsv
To just extract first line:
Bash (replace tabs):
( IFS=$'\t' read -ra cols <file; echo "${cols[@]##*/}" )
- load first line of file into array, columns delimited by (any number of) tabs
- print array after stripping longest prefix that ends with a slash from each element
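The prefix-stripping expansion ##*/ is not specific to Bash arrays. As a rough sketch, the same idea in plain POSIX sh, looping over the tab-separated fields of a made-up header string (the `header`, `out` and `path` names are hypothetical):

```shell
# Hypothetical header line with tab-separated paths, as in the question.
header=$(printf 'AAF2Y7VM5-8/cnv/F04_reads.tsv\tAAF2Y7VM5-7/cnv/D04_reads.tsv\tAAF2Y7VM5-6/cnv/E04_reads.tsv')

out=
old_ifs=$IFS
IFS=$(printf '\t')        # split the unquoted expansion on tabs only
set -f                    # disable globbing while $header is expanded unquoted
for path in $header; do
    # ${path##*/} removes the longest prefix ending in "/", leaving the file name
    out="${out:+$out }${path##*/}"
done
set +f
IFS=$old_ifs

printf '%s\n' "$out"
```

Because the longest matching prefix is removed, this works however many directory levels each path has.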
Bash (retain tabs):
(
shopt -s extglob
IFS= read -r cols
echo "${cols//+([!$'\t'])\/}"
) <file
Sed (replace tabs):
sed -E 's|[^\t]+/||g; y|\t| |; q' file
Sed (retain tabs):
sed -E 's|[^\t]+/||g; q' file
If the intention is to also retain the whole file as tsv:
Bash: append cat after echo in the "retain tabs" version:
(
shopt -s extglob
IFS= read -r cols
echo "${cols//+([!$'\t'])\/}"
cat
) <file
Sed: prefix the s command with 1 and elide the q from the "retain tabs" version:
sed -E '1s|[^\t]+/||g' file
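One caveat: treating \t inside a bracket expression as a tab is a GNU sed behaviour, and other seds may read it as a backslash and a t. A possible workaround, sketched here on a made-up file, is to put a literal tab into a shell variable (`tab` is my own name) and interpolate it into the expression:

```shell
# Hypothetical tab-separated sample: header row plus a data row that contains a slash.
tmp=$(mktemp)
printf 'AAF2Y7VM5-8/cnv/F04_reads.tsv\tAAF2Y7VM5-7/cnv/D04_reads.tsv\n' > "$tmp"
printf '0.5\t1/2\n' >> "$tmp"

tab=$(printf '\t')                              # a literal tab character
cleaned=$(sed -E "1s|[^${tab}]+/||g" "$tmp")    # rewrite line 1 only, keep the rest as-is

printf '%s\n' "$cleaned"
rm -f "$tmp"
```

The header comes out as the bare file names with tabs retained, and the data row (including its 1/2) is untouched because the substitution is restricted to line 1.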
I would exploit GNU AWK for this task in the following way. Let file.txt be a TAB-separated file with the following content:
AAF2Y7VM5-8/cnv/F04_reads.tsv AAF2Y7VM5-7/cnv/D04_reads.tsv AAF2Y7VM5-6/cnv/E04_reads.tsv
something something something
something something something
Then
awk 'BEGIN{FS="/";RS="[\t\n]";ORS="\t"}{print $3}RT=="\n"{exit}' file.txt
gives output
F04_reads.tsv D04_reads.tsv E04_reads.tsv
Explanation: I inform GNU AWK that records are separated by a TAB or newline character, that fields are separated by /, and that each print value should be suffixed with \t rather than a newline. I instruct GNU AWK to print the 3rd field, and if the record terminator (RT) is a newline I instruct GNU AWK to stop (exit). The output will have a trailing TAB and no newline, which is consistent with your original code.
(tested in GNU Awk 5.3.1)