I have a Bash script that is supposed to convert an XML file into CSV format. However, the script is not fully universal, and when I check the content of the output file, it appears to be empty.
Expected Output
The script should generate two CSV files (faculty.csv and students.csv) with the following headers and values:
faculty.csv:
name,building,room,capacity
"Faculty of Art","B59","R1000","240"
"Faculty of Art","B59","R1001","360"
"Faculty of Art","B59","R1002","210"
...
students.csv:
student_name,student_id,student_email,programme,year,address,contact,module_id,module_name,module_leader,lecturer1,lecturer2,name,building,room,exam_mark,coursework1,coursework2,coursework3
"Michael Jennings","273644","[email protected]","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"
...
The Script
#!/bin/bash
# Check for input file
if [[ -z "$1" ]]; then
echo "Usage: $0 <input.xml> [output.csv]"
exit 1
fi
input_xml="$1"
output_csv="${2:-${input_xml%.xml}.csv}"
# Extract unique tags (excluding XML declaration and root tags)
tags=$(grep -oP '<\K[^/>]+(?=>[^<]+</[^>]+>)' "$input_xml" | grep -vP '^(faculties|students|\?xml)' | sort | uniq | tr '\n' ',' | sed 's/,$//')
# Create CSV header
echo "$tags" > "$output_csv"
# Process data records
awk -v tags="$tags" '
BEGIN {
FS="[<>]"
OFS=","
split(tags, header, /,/)
gsub(/ /, "", header[1]) # Remove any accidental spaces
}
/<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/ {
if ($2 in header) values[$2] = $3
if ($0 ~ /<\/faculty>|<\/student>/) {
for (i=1; i<=length(header); i++) {
printf "%s%s", (header[i] in values ? values[header[i]] : ""), (i<length(header) ? OFS : ORS)
}
delete values
}
}
' "$input_xml" >> "$output_csv"
echo "CSV generated: $output_csv"
XML File Structure
faculty.xml:
<?xml version="1.0" ?>
<faculties>
<faculty>
<faculty>Faculty of Art</faculty>
<building>B59</building>
<room>R1000</room>
<capacity>240</capacity>
</faculty>
...
</faculties>
students.xml:
<?xml version="1.0" ?>
<students>
<student>
<student_name>Michael Jennings</student_name>
<student_id>273644</student_id>
<student_email>[email protected]</student_email>
<programme>Fine Art</programme>
<year>4</year>
<address>08832 Kim Groves Suite 335 North Mathewview, RI 43333</address>
<contact>3976680297</contact>
<module_id>FINE4401</module_id>
<module_name>Theoretical Sketching</module_name>
<module_leader>Maria Tyler</module_leader>
<lecturer1/>
<lecturer2/>
<faculty>Faculty of Art</faculty>
<building>B81</building>
<room>R1002</room>
<exam_mark/>
<coursework1>33</coursework1>
<coursework2>7</coursework2>
<coursework3>26</coursework3>
</student>
...
</students>
Problem
- The script is not fully dynamic. It should handle different XML structures without hardcoding tags.
- The output CSV files are empty or not in the expected format.
Additional Information
I am trying to avoid using external tools like xmllint. The script should ideally be universal, but a semi-generic solution would also be acceptable.
4 Answers
It's not possible to create a "universal" solution without external tools such as xmllint, and you'd have to define your requirements for "semi-generic" to be able to get a solution that can handle whatever subset of XML you find acceptable. A fixed hierarchy of non-overlapping, non-nested tags without comments or strings containing tag-like substrings, such as you show, would be possible to parse with an awk script or similar, but that's a long way from the "universal" case you want and may not cover the "semi-generic" cases you have in mind.
Having said that, try the following, which uses GNU awk for a few extensions (notably the three-argument match()). It's far from a "universal" solution for parsing XML, but it might be adequate for your needs if all of your XML files look like the two examples you provided:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { OFS="," }
match($0, /^\s*<([^<>]+)(\/|>(.*)<\/[^<>]+)>\s*$/, a) {
    # The above regexp matches either of these formats of lines:
    #    <tag>value</tag>
    #    <tag/>
    numFlds++
    tags[numFlds] = a[1]
    vals[numFlds] = a[3]
}
/^\s*<\/[^<>]+>\s*$/ {
    # The above regexp matches this format of lines:
    #    </tag>
    if ( !doneHdr++ ) {
        prtCsv(tags)
    }
    prtCsv(vals)
    numFlds = 0
}
function prtCsv(arr, i, out) {
    for ( i=1; i<=numFlds; i++ ) {
        out = arr[i]
        gsub(/"/, "\"\"", out)
        printf "\"%s\"%s", out, (i<numFlds ? OFS : ORS)
    }
}
' "${@:--}"
$ ./tst.sh faculty.xml
"faculty","building","room","capacity"
"Faculty of Art","B59","R1000","240"
$ ./tst.sh students.xml
"student_name","student_id","student_email","programme","year","address","contact","module_id","module_name","module_leader","lecturer1","lecturer2","faculty","building","room","exam_mark","coursework1","coursework2","coursework3"
"Michael Jennings","273644","[email protected]","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"
As you probably have Python installed, instead of using regexes you can do the following:
#! /usr/bin/env python3
import pandas as pd
import sys
import xml.etree.ElementTree as ET
r = ET.parse(sys.argv[1]).getroot()
print(pd.DataFrame(([{i.tag: i.text for i in e} for e in r.findall(r[0].tag)])).to_csv(index=False))
which parses the XML with etree and then converts it to a pandas dataframe to simplify the conversion to CSV (which you could do by other means if you can't install pandas).
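For example, a minimal standard-library-only sketch (an illustration, assuming the same flat two-level record structure as your samples, and quoting every field like your expected output):
#! /usr/bin/env python3
# Standard-library variant: xml.etree for parsing, the csv module for quoting.
import csv
import sys
import xml.etree.ElementTree as ET

root = ET.parse(sys.argv[1]).getroot()
records = [{child.tag: (child.text or "") for child in rec} for rec in root]
fieldnames = list(records[0]) if records else []

writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(records)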
Or, using just pandas (with etree internally):
#! /usr/bin/env python3
import pandas as pd
import sys
print(pd.read_xml(sys.argv[1], parser="etree").to_csv(index=False))
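For instance, saved as xml2csv.py (a name used here only for illustration) and made executable, both conversions would be:
./xml2csv.py faculty.xml > faculty.csv
./xml2csv.py students.xml > students.csv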
Addressing issues with OP's current awk code ...
As currently coded the following is invalid in awk:
/<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/
In (GNU) awk this should generate an error similar to:
awk: cmd. line:1: /<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/
awk: cmd. line:1: ^ syntax error
The general issue is that range patterns do not combine with other patterns (see GNU awk: range patterns for details).
One workaround consists of processing the ranges separately while encapsulating the common action/code into a function, eg:
function process_range() {
    if ($2 in header) values[$2] = $3
    if ($0 ~ /<\/faculty>|<\/student>/) {
        for (i=1; i<=length(header); i++) {
            printf "%s%s", (header[i] in values ? values[header[i]] : ""), (i<length(header) ? OFS : ORS)
        }
        delete values
    }
}
/<student>/,/<\/student>/ { process_range() }
/<faculty>/,/<\/faculty>/ { process_range() }
While it's (somewhat) clear that the dual range matches are OP's attempt to process two different files, OP has failed (so far) to notice that there are common tags between the two files. In particular, the faculty tag shows up in both sample input files, which in turn means the faculty range match (and processing) is fired while processing both files.
One workaround would be to determine the file type based on a tag unique to a given file and then use this file type to conditionally process a range, eg:
/<students>/ { type="students" }
/<faculties>/ { type="faculties" }
/<student>/,/<\/student>/ { if (type=="students") process_range() }
/<faculty>/,/<\/faculty>/ { if (type=="faculties") process_range() }
The next issue OP has missed (so far) is the dual use of the faculty tag in faculty.xml:
<faculty>
<faculty>Faculty of Art</faculty>
....
</faculty>
This dual use of the same tag is going to wreak havoc with OP's current logic in that it will short-circuit the code that's supposed to generate the new csv-formatted lines.
While it would be possible to modify the awk script to handle the dual faculty tags, said modifications would be a bit messy, so for the sake of this discussion I'm going to retag the internal entry with the department tag:
<faculty>
<department>Faculty of Art</department>
....
</faculty>
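(As an aside: one possible way to keep the dual faculty tag without retagging would be to treat only the bare <faculty> / </faculty> lines as record delimiters. The sketch below is just an illustration of that idea, assumes the same FS="[<>]" setup, and uses emit_row() as a hypothetical stand-in for the print loop, eg:)
/^[[:space:]]*<faculty>[[:space:]]*$/      { in_rec=1; next }                          # bare <faculty> opens a record
/^[[:space:]]*<\/faculty>[[:space:]]*$/    { if (in_rec) emit_row(); in_rec=0; next }  # bare </faculty> closes it
in_rec && ($2 in header)                   { values[$2] = $3 }                         # one-line <tag>value</tag> data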
The next issue concerns awk arrays ...
The current code (split(tags, header, /,/)) generates array entries that look like header[1]="address", header[2]="building". While this is a valid array structure, the issue arises with this follow-on bit of code:
if ($2 in header)
This is asking if $2 is an index of the header[] array.
If $2=="address" then this conditional is looking for an array entry defined like header[address]=.... Obviously header[address] is not going to match header[1]=..., so the conditional will fail.
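A quick standalone demo of the problem (nothing here depends on the input files):
$ awk 'BEGIN {
    split("address,building", header, /,/)      # header[1]="address"; header[2]="building"
    if ("address" in header) print "found"
    else                     print "not found"
}'
not found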
While there are a couple of approaches to fix this, we'll replace OP's current split() call with the following:
header_cnt = split(tags, flds, /,/)
for (i in flds)
    header[flds[i]]
header_cnt gives us the number of split fields, which is the same as the length of the array. We'll use header_cnt to replace follow-on references to length(header).
The split() generates an array of the format flds[1]="address", flds[2]="building", ...
We then use the for loop to build the associative array that looks like header[address]="", header[building]="", ...
NOTE: header[flds[i]] is comparable to header[flds[i]]="".
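Repeating the earlier demo with this construction shows the membership test now succeeding:
$ awk 'BEGIN {
    split("address,building", flds, /,/)
    for (i in flds) header[flds[i]]              # header["address"]; header["building"]
    if ("address" in header) print "found"
    else                     print "not found"
}'
found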
With these changes OP's current conditional - if ($2 in header) - does not need to be changed, but we will need to modify the for / printf loop to make use of the flds[] array:
for (i=1; i<=header_cnt; i++) {
    printf "%s%s", (flds[i] in values ? values[flds[i]] : ""), (i<header_cnt ? OFS : ORS)
}
Modifying OP's current awk code to incorporate these changes:
awk -v tags="$tags" '
BEGIN {
    FS="[<>]"
    OFS=","
    header_cnt=split(tags, flds, /,/)
    for (i in flds)
        header[flds[i]]
}
function process_range() {
    if ($2 in header) values[$2] = $3
    if ($0 ~ /<\/faculty>|<\/student>/) {
        for (i=1; i<=header_cnt; i++) {
            printf "%s%s", (flds[i] in values ? values[flds[i]] : ""), (i<header_cnt ? OFS : ORS)
        }
        delete values
    }
}
/<students>/              { type="students" }
/<faculties>/             { type="faculties" }
/<student>/,/<\/student>/ { if (type=="students")  process_range() }
/<faculty>/,/<\/faculty>/ { if (type=="faculties") process_range() }
' "$input_xml" >> "$output_csv"
NOTE: I'm not 100% sure what OP's plan was for the gsub(/ /, "", header[1]) call, so I've removed it; if OP's objective is to remove all white space, or perhaps remove leading/trailing white space but leave internal white space alone, the appropriate gsub() call could be added back into the code to modify the flds[] and header[] arrays.
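For instance, if trimming leading/trailing white space is indeed the intent (an assumption on my part), the loop that builds the arrays could become:
for (i in flds) {
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", flds[i])    # trim the header name
    header[flds[i]]
}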
Taking the new code for a test spin:
$ ./xml_to_csv students.xml ; cat students.csv
CSV generated: students.csv
address,building,contact,coursework1,coursework2,coursework3,faculty,module_id,module_leader,module_name,programme,room,student_email,student_id,student_name,year
08832 Kim Groves Suite 335 North Mathewview, RI 43333,,3976680297,,,,Faculty of Art,FINE4401,Maria Tyler,Theoretical Sketching,Fine Art,,[email protected],273644,Michael Jennings,4
$ ./xml_to_csv faculty.xml ; cat faculty.csv
CSV generated: faculty.csv
building,capacity,department,room
B59,240,Faculty of Art,R1000
Modifying your XML example to:
<?xml version="1.0" ?>
<students>
<student>
<student_name>Michael Jennings</student_name>
<student_id>273644</student_id>
<student_email>[email protected]</student_email>
<programme>Fine Art</programme>
<year>4</year>
<address>08832 Kim Groves Suite 335 North Mathewview, RI 43333</address>
<contact>3976680297</contact>
<module_id>FINE4401</module_id>
<module_name>Theoretical Sketching</module_name>
<module_leader>Maria Tyler</module_leader>
<lecturer1/>
<lecturer2/>
<faculty>Faculty of Art</faculty>
<building>B81</building>
<room>R1002</room>
<exam_mark/>
<coursework1>33</coursework1>
<coursework2>7</coursework2>
<coursework3>26</coursework3>
</student>
<student>
<student_name>Sarah Connor</student_name>
<student_id>285912</student_id>
<student_email>[email protected]</student_email>
<programme>Graphic Design</programme>
<year>3</year>
<address>12345 Elm Street, Springfield, IL 62701</address>
<contact>5551234567</contact>
<module_id>GRAP3302</module_id>
<module_name>Digital Illustration</module_name>
<module_leader>John Smith</module_leader>
<lecturer1>Emily Davis</lecturer1>
<lecturer2/>
<faculty>Faculty of Design</faculty>
<building>D12</building>
<room>R305</room>
<exam_mark>78</exam_mark>
<coursework1>85</coursework1>
<coursework2>92</coursework2>
<coursework3>80</coursework3>
</student>
</students>
(i.e., more than one student)
Make the following assumptions:
Valid XML files must have only one root;
Valid CSV is 'rectangular', meaning each record has the same number of fields as the header (although a field can be blank);
CSV is usually limited to 0 or 1 data elements per field;
You are usually better off using a proper CSV package to read and generate quoted CSV;
You are far better off using a proper XML parser versus a regex approach on XML!
Given that, you can extract valid CSV data from a valid XML file with a simple XPath statement: /*/*/* if your data are children of the root, or [path_to_root_of_data]/*/*/* if elsewhere.
Here is a Ruby script that demonstrates:
ruby -e '
require "rexml/document"
require "csv"
doc = REXML::Document.new(File.read(ARGV[0]))
data=Hash.new {|h, k| h[k] = [] }
doc.elements.each("/*/*/*"){ |x| data[x.name]<<x.text }
puts CSV.generate{|csv|
csv<<data.keys
data.values.transpose.each{|row| csv<<row}
}' students.xml
Prints:
student_name,student_id,student_email,programme,year,address,contact,module_id,module_name,module_leader,lecturer1,lecturer2,faculty,building,room,exam_mark,coursework1,coursework2,coursework3
Michael Jennings,273644,[email protected],Fine Art,4,"08832 Kim Groves Suite 335 North Mathewview, RI 43333",3976680297,FINE4401,Theoretical Sketching,Maria Tyler,,,Faculty of Art,B81,R1002,,33,7,26
Sarah Connor,285912,[email protected],Graphic Design,3,"12345 Elm Street, Springfield, IL 62701",5551234567,GRAP3302,Digital Illustration,John Smith,Emily Davis,,Faculty of Design,D12,R305,78,85,92,80
And with your faculty.xml example it prints:
faculty,building,room,capacity
Faculty of Art,B59,R1000,240
If you want all the fields quoted, you can replace the CSV generation line with:
CSV.generate(**{:force_quotes=>true}){|csv| ...
Comments:
You can use xq from kislyuk/yq for that. This "universal" approach first inspects the tags, and produces the CSV based on that, so you can use it for both conversions: xq -r '.[][] | arrays // [.] | (first | keys_unsorted) as $keys | $keys, (.[] | [.[$keys[]]]) | @csv' faculty.xml > faculty.csv and xq -r '…the same script…' students.xml > students.csv – pmf, Mar 13
The expected faculty.csv shows 3 rooms but only 1 room is provided in faculty.xml – markp-fuso, Mar 13