XML: https://raw.githubusercontent.com/dp247/Freeview-EPG/master/epg.xml
Example
<tv>
<programme channel="VirginRadio.uk" start="20250319220000 +0000" stop="20250320010000 +0000">
<title lang="en">Olivia Jones</title>
<desc lang="en">It may be late in the day but Olivia Jones still has a stack full of the best songs around to soundtrack your night.</desc>
<icon src="https://images.metadata.sky/pd-image/e776e89e-47b9-4529-8eec-09589a7bb782/cover"/>
</programme>
</tv>
I want to print the channel, start time, and title of matching programmes.
I came up with this:
tv / programme / title [ contains(text(), "Olivia") ] / parent::*/concat(@channel, "_", @start, "_", title)
which works at https://www.freeformatter.com/xpath-tester.html
However, it doesn't work with xq or with xmlstarlet.
Can this be done in the shell? What are the other options?
asked Mar 17 at 18:01 by Richard Barraclough, edited Mar 17 at 19:04 by Christoph Rackwitz
4 Answers
xmlstarlet select --template \
--match "//tv/programme[title[contains(text(),'Olivia')]]" \
--value-of "concat(@channel,'_',@start,'_',title)" -n file.xml
Output:
VirginRadio.uk_20250319220000 +0000_Olivia Jones
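For comparison, the same two-step shape (first match the programme elements, then build one string per match) can be sketched in Python with only the standard library. This is an illustration rather than part of the answer above: `SAMPLE` and `matching_lines` are hypothetical names, and the sample is a trimmed copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question (desc and icon omitted).
SAMPLE = """<tv>
<programme channel="VirginRadio.uk" start="20250319220000 +0000" stop="20250320010000 +0000">
<title lang="en">Olivia Jones</title>
</programme>
</tv>"""


def matching_lines(xml_text, needle):
    """Return 'channel_start_title' for each programme whose title contains needle."""
    root = ET.fromstring(xml_text)
    lines = []
    # Step 1: select the candidate nodes (the role of --match).
    for prog in root.findall("programme"):
        title = prog.findtext("title", default="")
        if needle in title:
            # Step 2: format each matched node (the role of --value-of).
            lines.append("_".join([prog.get("channel", ""), prog.get("start", ""), title]))
    return lines


print("\n".join(matching_lines(SAMPLE, "Olivia")))
```

Splitting selection from formatting avoids ending a path with a function call such as `.../concat(...)`, which is XPath 2.0 syntax and is not accepted by XPath 1.0 engines.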
With xmllint, this can be done using the --shell feature, but it needs some extra text processing:
bpath='tv/programme[title[ contains(text(), "Olivia")]]'
printf "%s\n" "cat $bpath/@channel | $bpath/@start | $bpath/title/text()" "bye" |\
xmllint --shell tmp2.xml | tr -d '"' |\
gawk 'BEGIN{RS=" channel="; FS="\n -+\n( start=)?|\n[/] > "; OFS="|"}{ if(NR > 1) print $1, $2, $3}'
Result
VirginRadio.uk|20250319220000 +0000|Olivia Jones
fm203.uk|20250219220000 +0000|Olivia Newton
Raw output from xmllint
printf "%s\n" "cat $bpath/@channel | $bpath/@start | $bpath/title/text()" "bye" | xmllint --shell tmp2.xml
/ > cat tv/programme[title[ contains(text(), "Olivia")]]/@channel | tv/programme[title[ contains(text(), "Olivia")]]/@start | tv/programme[title[ contains(text(), "Olivia")]]/title/text()
-------
channel="VirginRadio.uk"
-------
start="20250319220000 +0000"
-------
Olivia Jones
-------
channel="fm203.uk"
-------
start="20250219220000 +0000"
-------
Olivia Newton
/ > bye
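If gawk is not at hand, the post-processing of that raw output could also be sketched in Python. The function below is a hypothetical helper, not something xmllint provides. It assumes that every value fits on one line, that each value follows a dashed separator line, and that values arrive in channel, start, title order for each programme, as in the raw output above.

```python
import re

# Raw text as printed by xmllint --shell for the two matching programmes.
RAW = '''/ > cat tv/programme[title[ contains(text(), "Olivia")]]/@channel | ...
 -------
 channel="VirginRadio.uk"
 -------
 start="20250319220000 +0000"
 -------
Olivia Jones
 -------
 channel="fm203.uk"
 -------
 start="20250219220000 +0000"
 -------
Olivia Newton
/ > bye
'''


def parse_xmllint_cat(raw):
    """Group the values printed by 'cat' into (channel, start, title) tuples."""
    values = []
    take_next = False
    for line in raw.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"-+", stripped):
            take_next = True  # the line after a separator holds one value
            continue
        if take_next:
            # Attribute nodes print as name="value"; text nodes print bare.
            m = re.fullmatch(r'[\w:.-]+="(.*)"', stripped)
            values.append(m.group(1) if m else stripped)
            take_next = False
    usable = len(values) - len(values) % 3  # ignore an incomplete trailing group
    return [tuple(values[i:i + 3]) for i in range(0, usable, 3)]


for record in parse_xmllint_cat(RAW):
    print("|".join(record))
```

Only the line directly after a separator is read, so the echoed command and the `/ > bye` prompt are skipped without special handling. A title that itself looks like `name="value"` would be misread as an attribute, which is one reason to prefer a real XML parser where possible.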
xq (from kislyuk/yq) uses jq under the hood, so descend into .tv.programme[] (use --xml-force-list programme to make sure .programme is an array, even if there's just one child item), then select by the condition .title."#text" | contains($q) (with $q being your search query, defined with --arg q "Olivia"), and compose the output by concatenating ."@channel", ."@start", and .title."#text", with an underscore joined in between. The -r flag decodes the result into a raw string.
xq --arg q "Olivia" -r --xml-force-list programme '
.tv.programme[] | select(.title."#text" | contains($q))
| [."@channel", ."@start", .title."#text"] | join("_")
'
The same can be achieved with yq (from mikefarah/yq) by applying a few tweaks to the jq approach from above: import values (the query) through the environment and retrieve them using strenv, follow the encoding of text nodes as +content (instead of #text), and add the + that also precedes the @ in attribute names. Then, manually induce the iterability of .programme by addressing both alternatives (using the alternative operator //), and, if needed (depending on the file extensions used), explicitly define the input and output formatting (with the -px and -roy flags).
q="Olivia" yq -px -roy '
.tv.programme | (select(type == "!!seq") | .[]) // .
| select(.title.+content | contains(strenv(q)))
| [.+@channel, .+@start, .title.+content] | join("_")
'
Output from both using the input sample:
VirginRadio.uk_20250319220000 +0000_Olivia Jones
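The reason for forcing a list (or handling both alternatives) is easier to see on the converted data itself. The sketch below works on a hand-written dict that is assumed to mirror what xq produces for the sample (attributes as "@name" keys, text as "#text"). It was not generated by running xq, and `as_list`, `select_programmes`, and `DOC` are hypothetical names.

```python
def as_list(value):
    """Mimic --xml-force-list: wrap a lone item so callers can always iterate."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def select_programmes(doc, query):
    """Return 'channel_start_title' for programmes whose title contains query."""
    out = []
    for prog in as_list(doc.get("tv", {}).get("programme")):
        title = prog.get("title", {})
        # Assumption: a title without attributes is a plain string, not a dict.
        text = title.get("#text", "") if isinstance(title, dict) else (title or "")
        if query in text:
            out.append("_".join([prog.get("@channel", ""), prog.get("@start", ""), text]))
    return out


# Assumed shape for the single-programme sample: an object, not a list.
DOC = {
    "tv": {
        "programme": {
            "@channel": "VirginRadio.uk",
            "@start": "20250319220000 +0000",
            "@stop": "20250320010000 +0000",
            "title": {"@lang": "en", "#text": "Olivia Jones"},
        }
    }
}

print(select_programmes(DOC, "Olivia"))
```

With a single programme the converter yields one object, and with several it yields a list. Normalizing up front keeps the selection logic identical in both cases.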
I see an answer has already been found. As an addition, here is a version in Python:
import io
import xml.etree.ElementTree as ET

import pandas as pd
import requests

url = "https://raw.githubusercontent.com/dp247/Freeview-EPG/master/epg.xml"
response = requests.get(url)
response.raise_for_status()  # Fail early on HTTP errors
xml_stream = io.BytesIO(response.content)

# Walk the document incrementally instead of building the whole tree.
context = ET.iterparse(xml_stream, events=("start", "end"))
data = []  # List to store extracted data
channel = start_time = ""
for event, elem in context:
    if event == "start" and elem.tag == "programme":
        # Attributes are already available on the start event.
        channel = elem.attrib.get("channel", "")
        start_time = elem.attrib.get("start", "")
    elif event == "end" and elem.tag == "title":
        # elem.text is None for an empty title, so guard before searching.
        if elem.text and "Olivia" in elem.text:
            data.append({"Title": elem.text, "Channel": channel, "Time": start_time})
    elif event == "end" and elem.tag == "programme":
        elem.clear()  # Free memory

df = pd.DataFrame(data)
print(df.to_string())
parent::* is superfluous. Also, the text() isn't really needed: tv/programme[title[contains(.,"Olivia")]]/concat(@channel,"_",@start,"_",title) – Fravadona Commented Mar 18 at 11:14
… numpy function just because it isn't part of Python's standard library? – Fravadona Commented Mar 18 at 14:54