xpath - Selecting multiple values from an XML document in Linux shell - Stack Overflow

XML: https://raw.githubusercontent/dp247/Freeview-EPG/master/epg.xml

Example

<tv>
  <programme channel="VirginRadio.uk" start="20250319220000 +0000" stop="20250320010000 +0000">
    <title lang="en">Olivia Jones</title>
    <desc lang="en">It may be late in the day but Olivia Jones still has a stack full of the best songs around to soundtrack your night.</desc>
    <icon src="https://images.metadata.sky/pd-image/e776e89e-47b9-4529-8eec-09589a7bb782/cover"/>
  </programme>
</tv>

I want to print the channel, start time, and title of matching programmes.

I came up with this:

tv / programme / title [ contains(text(), "Olivia") ] / parent::*/concat(@channel, "_", @start, "_", title)

which works at https://www.freeformatter/xpath-tester.html

However, it doesn't work with xq or with xmlstarlet.

Can this be done in the shell? What are the other options?

Asked Mar 17 at 18:01 by Richard Barraclough; edited Mar 17 at 19:04 by Christoph Rackwitz.
  • Aside: with XPath you can use brackets inside brackets so the parent::* is superfluous. Also, the text() isn't really needed: tv/programme[title[contains(.,"Olivia")]]/concat(@channel,"_",@start,"_",title) – Fravadona Commented Mar 18 at 11:14
  • @ClosingVotes "Using a tool" in the shell is somewhat equivalent to calling a library function in other languages. Would it be right to close a question about, e.g., a numpy function just because it isn't part of Python's standard library? – Fravadona Commented Mar 18 at 14:54
4 Answers

xmlstarlet select --template \
  --match "//tv/programme[title[contains(text(),'Olivia')]]" \
  --value-of "concat(@channel,'_',@start,'_',title)" -n file.xml

Output:

VirginRadio.uk_20250319220000 +0000_Olivia Jones
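
The command above works in two steps: --match selects the programme elements, and --value-of evaluates a string expression against each match. That split matters because xmlstarlet implements XPath 1.0, where a function call such as concat() cannot be the last step of a path (this is allowed from XPath 2.0 onward), which is the likely reason the expression from the question fails. The same two-step idea can be sketched in Python with only the standard library. The function name matching_lines and the second "News" programme are hypothetical, added for illustration only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question, plus one hypothetical
# non-matching programme ("Other.uk" / "News") to show the filtering.
SAMPLE = """<tv>
  <programme channel="VirginRadio.uk" start="20250319220000 +0000" stop="20250320010000 +0000">
    <title lang="en">Olivia Jones</title>
  </programme>
  <programme channel="Other.uk" start="20250319230000 +0000" stop="20250320000000 +0000">
    <title lang="en">News</title>
  </programme>
</tv>"""


def matching_lines(xml_text, needle):
    """Return 'channel_start_title' for each programme whose title contains needle."""
    root = ET.fromstring(xml_text)  # root is the <tv> element
    lines = []
    # Step 1 (like --match): walk the programme children of <tv>.
    for programme in root.findall("programme"):
        title = programme.findtext("title", default="")
        if needle in title:
            # Step 2 (like --value-of): build the output string per match.
            lines.append("_".join([
                programme.get("channel", ""),
                programme.get("start", ""),
                title,
            ]))
    return lines


if __name__ == "__main__":
    print("\n".join(matching_lines(SAMPLE, "Olivia")))
```

The substring test is done in Python rather than in the path expression because ElementTree's limited XPath support has no contains() function.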

With xmllint, this can be done using the --shell feature, but it needs some extra text processing:

bpath='tv/programme[title[ contains(text(), "Olivia")]]'
printf "%s\n" "cat $bpath/@channel | $bpath/@start | $bpath/title/text()" "bye" |\
xmllint --shell tmp2.xml | tr -d '"' |\
gawk 'BEGIN{RS=" channel="; FS="\n -+\n( start=)?|\n[/] > "; OFS="|"}{ if(NR > 1) print $1, $2, $3}'

Result

VirginRadio.uk|20250319220000 +0000|Olivia Jones
fm203.uk|20250219220000 +0000|Olivia Newton

Raw output from xmllint

printf "%s\n" "cat $bpath/@channel | $bpath/@start | $bpath/title/text()" "bye" | xmllint --shell tmp2.xml

/ > cat tv/programme[title[ contains(text(), "Olivia")]]/@channel | tv/programme[title[ contains(text(), "Olivia")]]/@start | tv/programme[title[ contains(text(), "Olivia")]]/title/text()
 -------
 channel="VirginRadio.uk"
 -------
 start="20250319220000 +0000"
 -------
Olivia Jones
 -------
 channel="fm203.uk"
 -------
 start="20250219220000 +0000"
 -------
Olivia Newton
/ > bye
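
As an alternative to the tr and gawk post-processing, the raw output shown above could also be turned into records with a few lines of Python. This is only a sketch that assumes the output looks exactly like the listing above (separator lines of dashes, attributes printed as name="value", and the title text arriving last for each programme); other xmllint versions may format it differently. The function name parse_xmllint_shell is hypothetical, and the echoed command on the first line of RAW is abbreviated.

```python
import re

# Raw `xmllint --shell` output as shown above (command echo abbreviated).
RAW = '''/ > cat ...
 -------
 channel="VirginRadio.uk"
 -------
 start="20250319220000 +0000"
 -------
Olivia Jones
 -------
 channel="fm203.uk"
 -------
 start="20250219220000 +0000"
 -------
Olivia Newton
/ > bye
'''

# Matches attribute lines such as: channel="VirginRadio.uk"
ATTR = re.compile(r'^\s*(\w+)="(.*)"$')


def parse_xmllint_shell(raw):
    """Group attribute lines and the following text line into one dict per programme."""
    records, current = [], {}
    for line in raw.splitlines():
        stripped = line.strip()
        # Skip blank lines, shell prompts, and the dashed separator lines.
        if not stripped or line.startswith("/ >") or set(stripped) == {"-"}:
            continue
        match = ATTR.match(line)
        if match:
            current[match.group(1)] = match.group(2)
        else:
            # Assumption: the title text is the last item for each programme.
            current["title"] = stripped
            records.append(current)
            current = {}
    return records


if __name__ == "__main__":
    for rec in parse_xmllint_shell(RAW):
        print("|".join([rec.get("channel", ""), rec.get("start", ""), rec["title"]]))
```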

xq (from kislyuk/yq) uses jq under the hood, so descend into .tv.programme[] (use --xml-force-list programme to make sure .programme is an array, even if there's just one child item), then select by condition .title."#text" | contains($q) (with $q being your search query defined with --arg q "Olivia"), and compose the output by concatenating ."@channel", ."@start", and .title."#text", with an underscore joined in between. The -r flag decodes the result into a raw string.

xq --arg q "Olivia" -r --xml-force-list programme '
  .tv.programme[] | select(.title."#text" | contains($q))
  | [."@channel", ."@start", .title."#text"] | join("_")
'

The same can be achieved with yq (from mikefarah/yq) by applying a few tweaks to the jq approach above: import values (the query) through the environment and retrieve them using strenv, follow the encoding of text nodes as +content (instead of #text), and account for the additional + that also precedes the @ in attribute names. Then manually make .programme iterable by addressing both alternatives (using the alternative operator //), and, if needed (depending on the file extensions used), explicitly define the input and output formats (with the -px and -roy flags).

q="Olivia" yq -px -roy '
  .tv.programme | (select(type == "!!seq") | .[]) // .
  | select(.title.+content | contains(strenv(q)))
  | [.+@channel, .+@start, .title.+content] | join("_")
'

Output from both using the input sample:

VirginRadio.uk_20250319220000 +0000_Olivia Jones
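
Both filters operate on a dictionary-like view of the XML, and both have to cope with .programme being a single object when there is only one child and an array when there are several. The sketch below illustrates that logic in Python on a hand-written dict. The DOC literal is an assumption about how xq would convert the trimmed sample (attributes prefixed with @, text stored under #text); it is not output captured from the tool, and the helper names as_list and select_lines are hypothetical.

```python
# Assumed shape of the converted sample (trimmed to the fields used here).
DOC = {
    "tv": {
        "programme": {
            "@channel": "VirginRadio.uk",
            "@start": "20250319220000 +0000",
            "@stop": "20250320010000 +0000",
            "title": {"@lang": "en", "#text": "Olivia Jones"},
        }
    }
}


def as_list(value):
    """Normalise a child that may be missing, a single item, or a list of items."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def select_lines(doc, query):
    """Mimic the filter: iterate programmes, select by title, join three fields."""
    lines = []
    for prog in as_list(doc.get("tv", {}).get("programme")):
        title = prog.get("title", {})
        # Assumption: a title without attributes may arrive as a bare string.
        text = title.get("#text", "") if isinstance(title, dict) else (title or "")
        if query in text:
            lines.append("_".join([prog.get("@channel", ""), prog.get("@start", ""), text]))
    return lines


if __name__ == "__main__":
    print("\n".join(select_lines(DOC, "Olivia")))
```

The as_list helper plays the role of --xml-force-list programme in the xq command and of the (select(type == "!!seq") | .[]) // . expression in the yq command.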

I see an answer has already been found. As an addition, here is an approach with Python:

import requests
import xml.etree.ElementTree as ET
import io
import pandas as pd

url = "https://raw.githubusercontent/dp247/Freeview-EPG/master/epg.xml"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early on HTTP errors

# Wrap the downloaded bytes in a file-like object for incremental parsing
xml_stream = io.BytesIO(response.content)
context = ET.iterparse(xml_stream, events=("start", "end"))
data = []  # List to store extracted data

for event, elem in context:
    if event == "start" and elem.tag == "programme":
        # Remember the attributes of the programme currently being parsed
        channel = elem.attrib.get("channel", "")
        start_time = elem.attrib.get("start", "")
    elif event == "end" and elem.tag == "title":
        # elem.text is None for an empty <title/>, so guard before searching
        if elem.text and "Olivia" in elem.text:
            data.append({"Title": elem.text, "Channel": channel, "Time": start_time})
    elif event == "end" and elem.tag == "programme":
        elem.clear()  # Free memory

df = pd.DataFrame(data)
print(df.to_string())
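
A possible variant of the same streaming idea: by listening only for the end event of each programme, all of its children are already parsed when the element is handled, so no state has to be carried between events. The sketch below uses only the standard library and an in-memory copy of the trimmed sample instead of the download; the function name find_programmes is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question, used instead of a download.
SAMPLE = b"""<tv>
  <programme channel="VirginRadio.uk" start="20250319220000 +0000" stop="20250320010000 +0000">
    <title lang="en">Olivia Jones</title>
  </programme>
</tv>"""


def find_programmes(stream, needle):
    """Stream through the XML and collect programmes whose title contains needle."""
    rows = []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "programme":
            continue
        # At the end event, the programme's children are fully available.
        title = elem.findtext("title", default="")
        if needle in title:
            rows.append({
                "Title": title,
                "Channel": elem.get("channel", ""),
                "Time": elem.get("start", ""),
            })
        # Drop this programme's children and attributes to reduce memory use.
        elem.clear()
    return rows


if __name__ == "__main__":
    for row in find_programmes(io.BytesIO(SAMPLE), "Olivia"):
        print(row)
```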