最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

database - Bash Script to Convert XML to CSV Not Fully Dynamic and Outputs Empty Content - Stack Overflow

programmeradmin1浏览0评论

I have a Bash script that is supposed to convert an XML file into CSV format. However, the script is not fully universal, and when I check the content of the output file, it appears to be empty.

Expected Output

The script should generate two CSV files (faculty.csv and students.csv) with the following headers and values:

faculty.csv:

name,building,room,capacity
"Faculty of Art","B59","R1000","240"
"Faculty of Art","B59","R1001","360"
"Faculty of Art","B59","R1002","210"
...

students.csv:

student_name,student_id,student_email,programme,year,address,contact,module_id,module_name,module_leader,lecturer1,lecturer2,name,building,room,exam_mark,coursework1,coursework2,coursework3
"Michael Jennings","273644","[email protected]","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"
...

The Script

#!/bin/bash

# Check for input file
if [[ -z "$1" ]]; then
    echo "Usage: $0 <input.xml> [output.csv]"
    exit 1
fi

input_xml="$1"
output_csv="${2:-${input_xml%.xml}.csv}"

# Extract unique tags (excluding XML declaration and root tags)
tags=$(grep -oP '<\K[^/>]+(?=>[^<]+</[^>]+>)' "$input_xml" | grep -vP '^(faculties|students|\?xml)' | sort | uniq | tr '\n' ',' | sed 's/,$//')

# Create CSV header
echo "$tags" > "$output_csv"

# Process data records
awk -v tags="$tags" '
    BEGIN {
        FS="[<>]"
        OFS=","
        split(tags, header, /,/)
        gsub(/ /, "", header[1])  # Remove any accidental spaces
    }
    /<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/ {
        if ($2 in header) values[$2] = $3
        if ($0 ~ /<\/faculty>|<\/student>/) {
            for (i=1; i<=length(header); i++) {
                printf "%s%s", (header[i] in values ? values[header[i]] : ""), (i<length(header) ? OFS : ORS)
            }
            delete values
        }
    }
' "$input_xml" >> "$output_csv"

echo "CSV generated: $output_csv"

XML File Structure

faculty.xml:

<?xml version="1.0" ?>
<faculties>
  <faculty>
    <faculty>Faculty of Art</faculty>
    <building>B59</building>
    <room>R1000</room>
    <capacity>240</capacity>
  </faculty>
  ...
</faculties>

students.xml:

<?xml version="1.0" ?>
<students>
  <student>
    <student_name>Michael Jennings</student_name>
    <student_id>273644</student_id>
    <student_email>[email protected]</student_email>
    <programme>Fine Art</programme>
    <year>4</year>
    <address>08832 Kim Groves Suite 335 North Mathewview, RI 43333</address>
    <contact>3976680297</contact>
    <module_id>FINE4401</module_id>
    <module_name>Theoretical Sketching</module_name>
    <module_leader>Maria Tyler</module_leader>
    <lecturer1/>
    <lecturer2/>
    <faculty>Faculty of Art</faculty>
    <building>B81</building>
    <room>R1002</room>
    <exam_mark/>
    <coursework1>33</coursework1>
    <coursework2>7</coursework2>
    <coursework3>26</coursework3>
  </student>
  ...
</students>

Problem

  1. The script is not fully dynamic. It should handle different XML structures without hardcoding tags.
  2. The output CSV files are empty or not in the expected format.

Additional Information

I am trying to avoid using external tools like xmllint. The script should ideally be universal, but a semi-generic solution would also be acceptable.

I have a Bash script that is supposed to convert an XML file into CSV format. However, the script is not fully universal, and when I check the content of the output file, it appears to be empty.

Expected Output

The script should generate two CSV files (faculty.csv and students.csv) with the following headers and values:

faculty.csv:

name,building,room,capacity
"Faculty of Art","B59","R1000","240"
"Faculty of Art","B59","R1001","360"
"Faculty of Art","B59","R1002","210"
...

students.csv:

student_name,student_id,student_email,programme,year,address,contact,module_id,module_name,module_leader,lecturer1,lecturer2,name,building,room,exam_mark,coursework1,coursework2,coursework3
"Michael Jennings","273644","[email protected]","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"
...

The Script

#!/bin/bash

# Check for input file
if [[ -z "$1" ]]; then
    echo "Usage: $0 <input.xml> [output.csv]"
    exit 1
fi

input_xml="$1"
output_csv="${2:-${input_xml%.xml}.csv}"

# Extract unique tags (excluding XML declaration and root tags)
tags=$(grep -oP '<\K[^/>]+(?=>[^<]+</[^>]+>)' "$input_xml" | grep -vP '^(faculties|students|\?xml)' | sort | uniq | tr '\n' ',' | sed 's/,$//')

# Create CSV header
echo "$tags" > "$output_csv"

# Process data records
awk -v tags="$tags" '
    BEGIN {
        FS="[<>]"
        OFS=","
        split(tags, header, /,/)
        gsub(/ /, "", header[1])  # Remove any accidental spaces
    }
    /<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/ {
        if ($2 in header) values[$2] = $3
        if ($0 ~ /<\/faculty>|<\/student>/) {
            for (i=1; i<=length(header); i++) {
                printf "%s%s", (header[i] in values ? values[header[i]] : ""), (i<length(header) ? OFS : ORS)
            }
            delete values
        }
    }
' "$input_xml" >> "$output_csv"

echo "CSV generated: $output_csv"

XML File Structure

faculty.xml:

<?xml version="1.0" ?>
<faculties>
  <faculty>
    <faculty>Faculty of Art</faculty>
    <building>B59</building>
    <room>R1000</room>
    <capacity>240</capacity>
  </faculty>
  ...
</faculties>

students.xml:

<?xml version="1.0" ?>
<students>
  <student>
    <student_name>Michael Jennings</student_name>
    <student_id>273644</student_id>
    <student_email>[email protected]</student_email>
    <programme>Fine Art</programme>
    <year>4</year>
    <address>08832 Kim Groves Suite 335 North Mathewview, RI 43333</address>
    <contact>3976680297</contact>
    <module_id>FINE4401</module_id>
    <module_name>Theoretical Sketching</module_name>
    <module_leader>Maria Tyler</module_leader>
    <lecturer1/>
    <lecturer2/>
    <faculty>Faculty of Art</faculty>
    <building>B81</building>
    <room>R1002</room>
    <exam_mark/>
    <coursework1>33</coursework1>
    <coursework2>7</coursework2>
    <coursework3>26</coursework3>
  </student>
  ...
</students>

Problem

  1. The script is not fully dynamic. It should handle different XML structures without hardcoding tags.
  2. The output CSV files are empty or not in the expected format.

Additional Information

I am trying to avoid using external tools like xmllint. The script should ideally be universal, but a semi-generic solution would also be acceptable.

Share Improve this question asked Mar 13 at 9:11 VelvetVelvet 411 silver badge3 bronze badges 6
  • 8 What do you mean by universal? To put it in other words: How can I be universal, if you don't use an XML parser? – user1934428 Commented Mar 13 at 10:05
  • 4 Don't parse XML/HTML with regex, use a proper XML/HTML parser. stackoverflow/questions/49352243/… – Yitzhak Khabinsky Commented Mar 13 at 12:48
  • Also, check it out here XSLT based solution: stackoverflow/questions/79489144/… – Yitzhak Khabinsky Commented Mar 13 at 12:52
  • The Q was closed, so here's my 2c as a comment: I'd use xq from kislyuk/yq for that. This "universal" approach first inspects the tags, and produces the CSV based on that, so you can use it for both conversions: xq -r '.[][] | arrays // [.] | (first | keys_unsorted) as $keys | $keys, (.[] | [.[$keys[]]]) | @csv' faculty.xml > faculty.csv and xq -r '…the same script…' students.xml > students.csv – pmf Commented Mar 13 at 14:16
  • please insure inputs and expected outputs match otherwise (generally speaking) we're left guessing as to how the inputs and outputs 'match'; in this case the expected out of faculty.csv shows 3 rooms but only 1 room is provided in faculty.xml – markp-fuso Commented Mar 13 at 16:39
 |  Show 1 more comment

4 Answers 4

Reset to default 4

It's not possible to create a "universal" solution without external tools such as xmllint, and you'd have to define your requirements for "semi-generic" to be able to get a solution that can handle whatever subset of XML you find acceptable. A fixed hierarchy of non-overlapping, non-nested tags without comments or strings containing tag-like substrings such as you show would be possible to parse with an awk script or similar, but that's a long way from the "universal" case you want and may not cover the "semi-generic" cases you have in mind.

Having said that, try this using GNU awk for various extensions. It's far from a "universal" solution for parsing XML but it might be adequate for your needs if all of your XML files look like the 2 examples you provided:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    BEGIN { OFS="," }
    match($0, /^\s*<([^<>]+)(\/|>(.*)<\/[^<>]+)>\s*$/, a) {
        # The above regexp matches either of these formats of lines:
        #    <tag>value</tag>
        #    <tag/>
        numFlds++
        tags[numFlds] = a[1]
        vals[numFlds] = a[3]
    }
    /^\s*<\/[^<>]+>\s*$/ {
        # The above regexp matches this format of lines:
        #    </tag>
        if ( !doneHdr++ ) {
            prtCsv(tags)
        }
        prtCsv(vals)
        numFlds = 0
    }
    function prtCsv(arr,    i, out) {
        for ( i=1; i<=numFlds; i++ ) {
            out = arr[i]
            gsub(/"/, "\"\"", out)
            printf "\"%s\"%s", out, (i<numFlds ? OFS : ORS)
        }
    }
' "${@:--}"
$ ./tst.sh faculty.xml
"faculty","building","room","capacity"
"Faculty of Art","B59","R1000","240"
$ ./tst.sh students.xml
"student_name","student_id","student_email","programme","year","address","contact","module_id","module_name","module_leader","lecturer1","lecturer2","faculty","building","room","exam_mark","coursework1","coursework2","coursework3"
"Michael Jennings","273644","[email protected]","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"

As you can probably have python installed, instead of using regexs you can do

#! /usr/bin/env python3

import pandas as pd 
import sys
import xml.etree.ElementTree as ET 
  
r = ET.parse(sys.argv[1]).getroot()
print(pd.DataFrame(([{i.tag: i.text for i in e} for e in r.findall(r[0].tag)])).to_csv(index=False))

which is parsing using etree and then converting to a pandas dataframe to simplify the conversion to csv (which you could do by other means if you can't install it).

Or, using just pandas and etree internally

#! /usr/bin/env python3

import pandas as pd 
import sys

print(pd.read_xml(sys.argv[1], parser="etree").to_csv(index=False))

Addressing issues with OP's current awk code ...


As currently coded the following is invalid in awk:

/<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/

In (GNU) awk this should be generating an error similar to:

awk: cmd. line:1: /<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/
awk: cmd. line:1:                                         ^ syntax error

The general issue is that range patterns do not combine with other patterns (see GNU awk : range patterns for details)

One workaround consists of processing the ranges separately while encapsulating the common action/code into a function, eg:

function process_range() {
    if ($2 in header) values[$2] = $3
    if ($0 ~ /<\/faculty>|<\/student>/) {
        for (i=1; i<=length(header); i++) {
            printf "%s%s", (header[i] in values ? values[header[i]] : ""), (i<length(header) ? OFS : ORS)
        }
        delete values
    }
}

/<student>/,/<\/student>/ { process_range() }
/<faculty>/,/<\/faculty>/ { process_range() }

While it's (somewhat) clear that the dual range matches are OP's attempt to process two different files, OP has failed (so far) to notice that there are common tags between the two files. In particular, the faculty tag shows up in both sample input files which in turn means the faculty range match (and processing) is fired while processing both files.

One workaround would be to determine the file type based on a tag unique to a given file and then use this file type to conditionally process a range, eg:

/<students>/              { type="students"  }
/<faculties>/             { type="faculties" }
/<student>/,/<\/student>/ { if (type=="students")  process_range() }
/<faculty>/,/<\/faculty>/ { if (type=="faculties") process_range() }

The next issue OP has missed (so far) is the dual use of the faculty tag in faculties.xml:

 <faculty>
   <faculty>Faculty of Art</faculty>
 ....
 </faculty>

This dual use of the same tag is going to wreak havoc with OP's current logic in that it will short circuit the code that's supposed to generate the new csv-formatted lines.

While it would be possible to modify the awk script to handle the dual faculty tags said modifications would be a bit messy so for the sake of this discussion I'm going to retag the internal entry with the department tag:

<faculty>
  <department>Faculty of Art</department>
  ....
</faculty>

The next issue concerns awk arrays ...

The current code (split(tags, header, /,/)) generates array entries that look like: header[1]="address", header[2]="building". While this is a valid array structure the issue arises with this follow-on bit of code:

if ($2 in header)

The is asking if $2 is an index of the header[] array.

If $2=="address" then this conditional is looking for an array entry defined like such: header[address]=.... Obviously header[address] is not going to match on header[1]=... so the conditional will fail.

While there are a couple approaches to fix this we'll replace OP's current split() call with the following:

header_cnt = split(tags, flds, /,/)
for (i in flds)
    header[flds[i]]

header_cnt gives us the number of split fields which is the same as the length of the array. We'll use header_cnt to replace follow-on references to `length(header).

The split() generates an array of the format: flds[1]="address", flds[2]="building, ...

We then use the for loop to build the associative array that looks like: header[address]="", header[building]="", ...

NOTE: header[flds[i]] is comparable to header[flds[i]]=""

With these changes OP's current conditional - if ($2 in header) - does not need to be changed but we will need to modify the for / printf loop to make use of the flds[] array:

for (i=1; i<=header_cnt; i++) {
    printf "%s%s", (flds[i] in values ? values[flds[i]] : ""), (i<header_cnt ? OFS : ORS)
}

Modifying OP's current awk code to incorporate these changes:

awk -v tags="$tags" '
    BEGIN {
        FS="[<>]"
        OFS=","
        header_cnt=split(tags, flds, /,/)
        for (i in flds)
            header[flds[i]]
    }

function process_range() {
        if ($2 in header) values[$2] = $3
        if ($0 ~ /<\/faculty>|<\/student>/) {
            for (i=1; i<=header_cnt; i++) {
                printf "%s%s", (flds[i] in values ? values[flds[i]] : ""), (i<header_cnt ? OFS : ORS)
            }
            delete values
        }
}
    /<students>/              { type="students"  }
    /<faculties>/             { type="faculties" }
    /<student>/,/<\/student>/ { if (type=="students")  process_range() }
    /<faculty>/,/<\/faculty>/ { if (type=="faculties") process_range() }

' "$input_xml" >> "$output_csv"

NOTE: I'm not 100% sure what OP's plan was for the gsub(/ /, "", header[1]) call so I've removed it; if OP's objective is to remove all white space, or perhaps remove leading/trailing white space but leave internal white space alone, the appropriate gsub() call could be added back into the code to modify the flds[] and header[] arrays


Taking the new code for a test spin:

$ ./xml_to_csv students.xml ; cat students.csv
CSV generated: students.csv
address,building,contact,coursework1,coursework2,coursework3,faculty,module_id,module_leader,module_name,programme,room,student_email,student_id,student_name,year
08832 Kim Groves Suite 335 North Mathewview, RI 43333,,3976680297,,,,Faculty of Art,FINE4401,Maria Tyler,Theoretical Sketching,Fine Art,,[email protected],273644,Michael Jennings,4

$ ./xml_to_csv faculty.xml ; cat faculty.csv
CSV generated: faculty.csv
building,capacity,department,room
B59,240,Faculty of Art,R1000

Modifying your XML example to:

<?xml version="1.0" ?>
<students>
  <student>
    <student_name>Michael Jennings</student_name>
    <student_id>273644</student_id>
    <student_email>[email protected]</student_email>
    <programme>Fine Art</programme>
    <year>4</year>
    <address>08832 Kim Groves Suite 335 North Mathewview, RI 43333</address>
    <contact>3976680297</contact>
    <module_id>FINE4401</module_id>
    <module_name>Theoretical Sketching</module_name>
    <module_leader>Maria Tyler</module_leader>
    <lecturer1/>
    <lecturer2/>
    <faculty>Faculty of Art</faculty>
    <building>B81</building>
    <room>R1002</room>
    <exam_mark/>
    <coursework1>33</coursework1>
    <coursework2>7</coursework2>
    <coursework3>26</coursework3>
  </student>
  <student>
    <student_name>Sarah Connor</student_name>
    <student_id>285912</student_id>
    <student_email>[email protected]</student_email>
    <programme>Graphic Design</programme>
    <year>3</year>
    <address>12345 Elm Street, Springfield, IL 62701</address>
    <contact>5551234567</contact>
    <module_id>GRAP3302</module_id>
    <module_name>Digital Illustration</module_name>
    <module_leader>John Smith</module_leader>
    <lecturer1>Emily Davis</lecturer1>
    <lecturer2/>
    <faculty>Faculty of Design</faculty>
    <building>D12</building>
    <room>R305</room>
    <exam_mark>78</exam_mark>
    <coursework1>85</coursework1>
    <coursework2>92</coursework2>
    <coursework3>80</coursework3>
  </student>
</students>

(ie, more than one student)

Make the following assumptions:

  1. Valid XML files must have only one root;

  2. Valid CSV is 'rectangle' meaning there are the same number of header fields as each number of record has (although a field can be blank);

  3. CSV is usually limited to 0 or 1 data elements per field;

  4. You are usually better off using a proper CSV package to read and generate quoted CSV;

  5. You are far better off using a proper XML parser versus a regex approach on XML!

Given that, you can extract valid CSV data from a valid XML file with a simple xpath statement: /*/*/* if you data are children of the root or [path_to_root_of_data]/*/*/* if elsewhere.

Here is a Ruby that demonstrates:

ruby -e '
   require "rexml/document"
   require "csv"
   doc = REXML::Document.new(File.read(ARGV[0]))

   data=Hash.new {|h, k| h[k] = [] }
   doc.elements.each("/*/*/*"){ |x| data[x.name]<<x.text }

   puts CSV.generate{|csv|     
      csv<<data.keys
      data.values.transpose.each{|row| csv<<row}
 }' students.xml

Prints:

student_name,student_id,student_email,programme,year,address,contact,module_id,module_name,module_leader,lecturer1,lecturer2,faculty,building,room,exam_mark,coursework1,coursework2,coursework3
Michael Jennings,273644,[email protected],Fine Art,4,"08832 Kim Groves Suite 335 North Mathewview, RI 43333",3976680297,FINE4401,Theoretical Sketching,Maria Tyler,,,Faculty of Art,B81,R1002,,33,7,26
Sarah Connor,285912,[email protected],Graphic Design,3,"12345 Elm Street, Springfield, IL 62701",5551234567,GRAP3302,Digital Illustration,John Smith,Emily Davis,,Faculty of Design,D12,R305,78,85,92,80

And with you faculty.xml example prints:

faculty,building,room,capacity
Faculty of Art,B59,R1000,240

If you want all the fields quoted, you can replace the line of the CSV generation with:

CSV.generate(**{:force_quotes=>true}){|csv| ... 
发布评论

评论列表(0)

  1. 暂无评论