Trying to figure out why bash code with awk doesn't work correctly - Stack Overflow

I'm trying to figure out why the code below doesn't work correctly. I want to use it to extract every "time" string from the file 1.html:

<td align="center"><p>02:04:37.472</p></td>
<td align="center"><p>02:04:38.208</p></td>
<td align="center"><p>02:04:38.242</p></td>

I'm stuck, and I can't use the GNU version of awk. I would be grateful for help fixing the code. Thank you.

#!/bin/bash

_subtitles_getSubtitlesForUrl() {
    local awkCode2=

    read -r -d "" awkCode2 << 'SEARCHFORIDSAWKEOF'
BEGIN {
    fileSize = 0
    fps = 0
    time = 0
}
/td align="center"><p>0/ {
    isTimeMatched = match($0, /td align="center"><p>0[^0-9]*([0-9\.]+)/)
    if (isTimeMatched) {
        time = substr($0, RSTART + 22, RLENGTH - 16)
    }
}
/Rozmiar pliku/ {
    isFileSizeMatched = match($0, /Rozmiar pliku:[^0-9]*([0-9\.]+)/)
    if (isFileSizeMatched) {
        fileSize = substr($0, RSTART + 18, RLENGTH - 14)
    }
}
/Video FPS/ {
    isFpsMatched = match($0, /Video FPS:[^0-9]*([0-9\.]+)/)
    if (isFpsMatched) {
        fps = substr($0, RSTART + 15, RLENGTH - 15)
    }
}
/napiprojekt:/ {
    isHrefMatched = match($0, /href="(napiprojekt:[^"]*)"/)
    if (isHrefMatched) {
        printf("%7s | fps: %6s |%10s | %s\n",time ,fps ,fileSize , substr($0, RSTART + 6, RLENGTH - 7))
    }
}
SEARCHFORIDSAWKEOF

    cat 1.html | busybox awk "$awkCode2"

}

_subtitles_getSubtitlesForUrl

File 1.html:

<tr title="<b>Autor:</b> Brak (dodał: macieju6)<br><b>Rozmiar pliku:</b> 23.48 GiB (25211536687 bajtów)<br><b>Ogólne bitrate pliku:</b> 27.0 Mbps<br><br><b>Video FPS:</b> 23.976<br><b>Video kodek:</b> MPEG-4<br><b>Video bitrate:</b> 25.0 Mbps<br><b>Video rozdzielczość:</b> 3840x1608<br><b>Video rozmiar:</b> 21.8 GiB (93%)<br><b>Video proporcje obrazu:</b> 2.40:1<br><br><b>Audio format:</b> E-AC-3 (Audio Coding 3)<br><b>Audio bitrate:</b> 960 Kbps<br><b>Audio liczba kanałów:</b> 6<br><b>Audio sampling rate:</b> 48.0 KHz<br><b>Audio resolution:</b> 16 bits<br><b>Audio rozmiar:</b> 855 MiB (4%)<br>" valign="middle">
    <td align="left"><p class="blue indent">                                 <a class="tableA" href="napiprojekt:1fb0f00ceb78e10cfe89af3dbdccdd19">Godzilla Minus One.mp4</a></p></td>
    <td align="center"><p>23.48 GiB</p> </td>
    <td align="center"><p>23.976 </p></td>
    <td align="center"><p>02:04:37.472</p></td>
    <td align="center"><p>Brak</p></td>
    <td align="center"><p>2025-01-17</p></td>
    <td align="center"><p>10</p></td>

</tr>


<tr title="<b>Autor:</b> Victor Delacroix<br><b>Rozmiar pliku:</b> 24.93 GiB (26772498718 bajtów)<br><b>Ogólne bitrate pliku:</b> 28.6 Mbps<br><br><b>Video FPS:</b> 23.976<br><b>Video kodek:</b> Matroska<br><b>Video bitrate:</b> <br><b>Video rozdzielczość:</b> 1920x1080<br><b>Video rozmiar:</b> <br><b>Video proporcje obrazu:</b> 16:9<br><br><b>Audio format:</b> TrueHD<br><b>Audio bitrate:</b> <br><b>Audio liczba kanałów:</b> 8<br><b>Audio sampling rate:</b> 48.0 KHz<br><b>Audio resolution:</b> <br><b>Audio rozmiar:</b> <br>" valign="middle">
    <td align="left"><p class="blue indent">                                 <a class="tableA" href="napiprojekt:ee2581c8ed39680c0851ac340f868d61">Godzilla Minus One.mkv</a></p></td>
    <td align="center"><p>24.93 GiB</p> </td>
    <td align="center"><p>23.976 </p></td>
    <td align="center"><p>02:04:38.208</p></td>
    <td align="center"><p>Victor Delacroix</p></td>
    <td align="center"><p>2024-08-18</p></td>
    <td align="center"><p>22</p></td>

</tr>


<tr title="<b>Autor:</b> Brak (dodał: kossa88)<br><b>Rozmiar pliku:</b> 951.9 MiB (998147235 bajtów)<br><b>Ogólne bitrate pliku:</b> 1 068 Kbps<br><br><b>Video FPS:</b> 23.976<br><b>Video kodek:</b> MPEG-4<br><b>Video bitrate:</b> 900 Kbps<br><b>Video rozdzielczość:</b> 720x300<br><b>Video rozmiar:</b> 806 MiB (85%)<br><b>Video proporcje obrazu:</b> 2.40:1<br><br><b>Audio format:</b> AAC (Advanced Audio Codec)<br><b>Audio bitrate:</b> 160 Kbps<br><b>Audio liczba kanałów:</b> 2<br><b>Audio sampling rate:</b> 48.0 KHz<br><b>Audio resolution:</b> <br><b>Audio rozmiar:</b> 143 MiB (15%)<br>" valign="middle">
    <td align="left"><p class="blue indent">                                 <a class="tableA" href="napiprojekt:ab8edd9c56debfa9b66be98fabff8968">Godzilla Minus One.mp4</a></p></td>
    <td align="center"><p>951.9 MiB</p> </td>
    <td align="center"><p>23.976 </p></td>
    <td align="center"><p>02:04:38.242</p></td>
    <td align="center"><p>Brak</p></td>
    <td align="center"><p>2024-07-10</p></td>
    <td align="center"><p>4</p></td>

</tr>

The result I get is:

$ ./napi_test.sh 
      0 | fps: 23.976 | 23.48 GiB | napiprojekt:1fb0f00ceb78e10cfe89af3dbdccdd19
2:04:37 | fps: 23.976 | 24.93 GiB | napiprojekt:ee2581c8ed39680c0851ac340f868d61
2:04:38 | fps: 23.976 | 951.9 MiB | napiprojekt:ab8edd9c56debfa9b66be98fabff8968

As you can see, the time values are shifted one row down.

I want:

2:04:37 | fps: 23.976 | 23.48 GiB | napiprojekt:1fb0f00ceb78e10cfe89af3dbdccdd19
2:04:38 | fps: 23.976 | 24.93 GiB | napiprojekt:ee2581c8ed39680c0851ac340f868d61
2:04:38 | fps: 23.976 | 951.9 MiB | napiprojekt:ab8edd9c56debfa9b66be98fabff8968

Asked Feb 1 at 23:40 by D S; edited Feb 2 at 3:53 by Barmar.
  • 5 I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Commented Feb 2 at 0:15
  • Why do you only match times beginning with 0? If that's the hour part of the time, can't it go up to 23? – Barmar Commented Feb 2 at 3:55
  • Without 0 It will be fine..., it's the length of the movies, no movie is longer than 9 hours :P – D S Commented Feb 2 at 6:40
  • 2 Useless use of cat (?): cat 1.html | busybox awk "$awkCode2" --> busybox awk "$awkCode2" 1.html – Itération 122442 Commented Feb 2 at 9:34

2 Answers

You generate output when you match napiprojekt:, but at that point you have not yet matched the corresponding td align="center"><p>0 line. The net result is that each printf displays the time captured in the previous block (i.e., time is shifted by one block), and the first row shows the initial value 0.

Consider capturing the napiprojekt: data in a variable and generating the output when you match td align="center"><p>0 instead:

######### /td align="center"><p>0/
#
# replace this:

time = substr($0, RSTART + 22, RLENGTH - 16)

# with this:

printf("%7s | fps: %6s |%10s | %s\n",substr($0, RSTART + 22, RLENGTH - 16) ,fps ,fileSize , napi_data)

######### /napiprojekt:/
#
# replace this:

printf("%7s | fps: %6s |%10s | %s\n",time ,fps ,fileSize , substr($0, RSTART + 6, RLENGTH - 7))

# with this:

napi_data = substr($0, RSTART + 6, RLENGTH - 7)

After making these two sets of changes the code generates:

2:04:37 | fps: 23.976 | 23.48 GiB | napiprojekt:1fb0f00ceb78e10cfe89af3dbdccdd19
2:04:38 | fps: 23.976 | 24.93 GiB | napiprojekt:ee2581c8ed39680c0851ac340f868d61
2:04:38 | fps: 23.976 | 951.9 MiB | napiprojekt:ab8edd9c56debfa9b66be98fabff8968
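
For reference, here is a sketch of how the whole awk program might look with both changes applied. The function name print_subtitle_rows is made up for illustration, and plain awk is assumed in place of busybox awk; only match(), substr(), RSTART and RLENGTH are used, so swapping in busybox awk should not change the behavior. The magic offsets are kept exactly as in the question.

```shell
#!/bin/sh
# Sketch only: print_subtitle_rows is a hypothetical name, and plain awk
# stands in for "busybox awk" from the question.
print_subtitle_rows() {
    awk '
    BEGIN { fileSize = 0; fps = 0; napi_data = "" }

    # The <tr title=...> line carries both the file size and the FPS.
    /Rozmiar pliku/ {
        if (match($0, /Rozmiar pliku:[^0-9]*[0-9.]+/))
            fileSize = substr($0, RSTART + 18, RLENGTH - 14)
    }
    /Video FPS/ {
        if (match($0, /Video FPS:[^0-9]*[0-9.]+/))
            fps = substr($0, RSTART + 15, RLENGTH - 15)
    }

    # Remember the link instead of printing right away.
    /napiprojekt:/ {
        if (match($0, /href="napiprojekt:[^"]*"/))
            napi_data = substr($0, RSTART + 6, RLENGTH - 7)
    }

    # The time cell follows the link within each block, so print here.
    /td align="center"><p>0/ {
        if (match($0, /td align="center"><p>0[^0-9]*[0-9.]+/))
            printf("%7s | fps: %6s |%10s | %s\n", substr($0, RSTART + 22, RLENGTH - 16), fps, fileSize, napi_data)
    }
    ' "$@"
}
```

It would be called as print_subtitle_rows 1.html. Passing the file name as an argument also removes the need for cat.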

Regarding "I'm stuck and can't use the gnu version of awk":

I experimented with busybox awk (version BusyBox v1.30.1 (Ubuntu 1:1.30.1-7ubuntu3.1) multi-call binary) and found that it supports a multi-character RS (record separator). This lets you declare </tr> as the separator between records. Consider the following simplified example. Let the content of 1.html be

<tr title="<b>Autor:</b> Brak (dodał: macieju6)<br><b>Rozmiar pliku:</b> 23.48 GiB (25211536687 bajtów)<br><b>Ogólne bitrate pliku:</b> 27.0 Mbps<br><br><b>Video FPS:</b> 23.976<br><b>Video kodek:</b> MPEG-4<br><b>Video bitrate:</b> 25.0 Mbps<br><b>Video rozdzielczość:</b> 3840x1608<br><b>Video rozmiar:</b> 21.8 GiB (93%)<br><b>Video proporcje obrazu:</b> 2.40:1<br><br><b>Audio format:</b> E-AC-3 (Audio Coding 3)<br><b>Audio bitrate:</b> 960 Kbps<br><b>Audio liczba kanałów:</b> 6<br><b>Audio sampling rate:</b> 48.0 KHz<br><b>Audio resolution:</b> 16 bits<br><b>Audio rozmiar:</b> 855 MiB (4%)<br>" valign="middle">
    <td align="left"><p class="blue indent">                                 <a class="tableA" href="napiprojekt:1fb0f00ceb78e10cfe89af3dbdccdd19">Godzilla Minus One.mp4</a></p></td>
    <td align="center"><p>23.48 GiB</p> </td>
    <td align="center"><p>23.976 </p></td>
    <td align="center"><p>02:04:37.472</p></td>
    <td align="center"><p>Brak</p></td>
    <td align="center"><p>2025-01-17</p></td>
    <td align="center"><p>10</p></td>

</tr>


<tr title="<b>Autor:</b> Victor Delacroix<br><b>Rozmiar pliku:</b> 24.93 GiB (26772498718 bajtów)<br><b>Ogólne bitrate pliku:</b> 28.6 Mbps<br><br><b>Video FPS:</b> 23.976<br><b>Video kodek:</b> Matroska<br><b>Video bitrate:</b> <br><b>Video rozdzielczość:</b> 1920x1080<br><b>Video rozmiar:</b> <br><b>Video proporcje obrazu:</b> 16:9<br><br><b>Audio format:</b> TrueHD<br><b>Audio bitrate:</b> <br><b>Audio liczba kanałów:</b> 8<br><b>Audio sampling rate:</b> 48.0 KHz<br><b>Audio resolution:</b> <br><b>Audio rozmiar:</b> <br>" valign="middle">
    <td align="left"><p class="blue indent">                                 <a class="tableA" href="napiprojekt:ee2581c8ed39680c0851ac340f868d61">Godzilla Minus One.mkv</a></p></td>
    <td align="center"><p>24.93 GiB</p> </td>
    <td align="center"><p>23.976 </p></td>
    <td align="center"><p>02:04:38.208</p></td>
    <td align="center"><p>Victor Delacroix</p></td>
    <td align="center"><p>2024-08-18</p></td>
    <td align="center"><p>22</p></td>

</tr>


<tr title="<b>Autor:</b> Brak (dodał: kossa88)<br><b>Rozmiar pliku:</b> 951.9 MiB (998147235 bajtów)<br><b>Ogólne bitrate pliku:</b> 1 068 Kbps<br><br><b>Video FPS:</b> 23.976<br><b>Video kodek:</b> MPEG-4<br><b>Video bitrate:</b> 900 Kbps<br><b>Video rozdzielczość:</b> 720x300<br><b>Video rozmiar:</b> 806 MiB (85%)<br><b>Video proporcje obrazu:</b> 2.40:1<br><br><b>Audio format:</b> AAC (Advanced Audio Codec)<br><b>Audio bitrate:</b> 160 Kbps<br><b>Audio liczba kanałów:</b> 2<br><b>Audio sampling rate:</b> 48.0 KHz<br><b>Audio resolution:</b> <br><b>Audio rozmiar:</b> 143 MiB (15%)<br>" valign="middle">
    <td align="left"><p class="blue indent">                                 <a class="tableA" href="napiprojekt:ab8edd9c56debfa9b66be98fabff8968">Godzilla Minus One.mp4</a></p></td>
    <td align="center"><p>951.9 MiB</p> </td>
    <td align="center"><p>23.976 </p></td>
    <td align="center"><p>02:04:38.242</p></td>
    <td align="center"><p>Brak</p></td>
    <td align="center"><p>2024-07-10</p></td>
    <td align="center"><p>4</p></td>

</tr>

then

busybox awk 'BEGIN{RS="</tr>"}
match($0,/napiprojekt:[0-9a-f]*/){id=substr($0,RSTART,RLENGTH)}
match($0,/[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9]+/){time=substr($0,RSTART,RLENGTH)}
{print id, time}' 1.html

gives output

napiprojekt:1fb0f00ceb78e10cfe89af3dbdccdd19 02:04:37.472
napiprojekt:ee2581c8ed39680c0851ac340f868d61 02:04:38.208
napiprojekt:ab8edd9c56debfa9b66be98fabff8968 02:04:38.242
napiprojekt:ab8edd9c56debfa9b66be98fabff8968 02:04:38.242

Explanation: once busybox awk is told that </tr> is the record separator, each <tr>...</tr> block is processed as a single record, so the fields can be searched for in any order.

EDIT, in response to the comment "why is the last one duplicated?":

The reason is that whatever follows the last </tr> (here, only whitespace) forms one more record. It matches neither pattern, so the print action runs with the last known values. To avoid this, add a /<tr/ pattern to the print action:

busybox awk 'BEGIN{RS="</tr>"}
match($0,/napiprojekt:[0-9a-f]*/){id=substr($0,RSTART,RLENGTH)}
match($0,/[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9]+/){time=substr($0,RSTART,RLENGTH)}
/<tr/{print id, time}' 1.html
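
The same record-per-</tr> idea can be stretched to print all four columns from the question. What follows is only a sketch under a few assumptions: print_rows_by_record is a made-up name, plain awk stands in for busybox awk, and the awk in use must accept a multi-character RS (POSIX leaves that case unspecified, although the experiment above shows busybox awk handles it). Instead of fixed offsets, each value is matched together with its label and the label is then stripped with sub().

```shell
#!/bin/sh
# Sketch only: print_rows_by_record is a hypothetical name; plain awk is
# assumed to accept a multi-character RS, as busybox awk does.
print_rows_by_record() {
    awk '
    BEGIN { RS = "</tr>" }

    # Only records that contain a table row; this skips the trailing remainder.
    /<tr/ {
        # Reset per record so a missing field cannot leak from the previous row.
        time = fps = size = id = ""

        if (match($0, /napiprojekt:[0-9a-f]+/))
            id = substr($0, RSTART, RLENGTH)

        if (match($0, /[0-9][0-9]:[0-9][0-9]:[0-9][0-9][.][0-9]+/)) {
            time = substr($0, RSTART, 8)   # keep hh:mm:ss, drop the fraction
            sub(/^0/, "", time)            # 02:04:37 becomes 2:04:37
        }

        if (match($0, /Video FPS:[^0-9]*[0-9.]+/)) {
            fps = substr($0, RSTART, RLENGTH)
            sub(/^[^0-9]*/, "", fps)       # strip the label and markup
        }

        if (match($0, /Rozmiar pliku:[^0-9]*[0-9.]+ [A-Za-z]+/)) {
            size = substr($0, RSTART, RLENGTH)
            sub(/^[^0-9]*/, "", size)      # leaves value and unit
        }

        printf("%7s | fps: %6s | %9s | %s\n", time, fps, size, id)
    }
    ' "$@"
}
```

Called as print_rows_by_record 1.html, this should produce rows in the format the question asks for. Because each row is one record, the order of the cells inside it no longer matters.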