javascript - Parsing complex HTML tables - Stack Overflow

I'm trying to parse the class schedule provided by my university in order to import the information into some kind of calendar. An example of the schedule can be seen here:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIA13-7.%20Block.html

The auto-generated HTML content is, in my opinion, a mess and very hard to grasp. For example, the tables are mainly built with rowspans and colspans, and the positions of cells in the code, compared with their actual visual position in the browser, seem partially arbitrary.

What I've already tried:

  1. Asking the university's administration office to provide a simpler, easier to read file separately. Of course this wasn't possible, after all it would mean one minute of additional effort.
  2. Researching the original tool used to generate the HTML. It is called "sked Stundenplan Software". I couldn't find any hints or tools to "reverse" the generation process.
  3. Looking for an existing solution, at which point I found some tools (e.g. http://code.google.com/p/skd-schedule-parser/) that do not work for my schedule. After studying the code of these tools I concluded that they must have been designed for another or outdated version of sked.
  4. Parsing the HTML with PHP (mostly using DOMDocument). That worked sometimes, but was way too unreliable. The exceptions to take into account seem endless.

Right now I don't think that conventional HTML parsing will get me far, at least not within an acceptable development time. What I am looking for are other methods to fetch information from complex HTML tables, something like YQL, or maybe utilities that can normalize such tables with col-/rowspans. Because I don't have anything concrete in mind, I am mainly asking for some tips or hints for another approach.

Are there other, more suitable methods to parse such tables or am I stuck with conventional HTML parsing?

Edit:

By request, I'll paste an example of the raw code...

This week:

Results from this code:
http://pastebin.com/BJduUVtU

Edit 2:
Because of some discussion about parsing, I'll also add my PHP code. It's my first time with PHP, so it's not very sophisticated; it is meant to show how far I've come with parsing the tables in theory. The actual work happens in the function parseSkedTable(), so please concentrate on that one. Also, I would like to point out the term "double courses" appearing in the comments, which describes two different courses happening at the same time (the class is split at such moments). An example of these courses can be found here in week two:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIB14-4.%20Block.html

It looks like this:

The corresponding HTML-code of that week can also be accessed here:
http://pastebin.com/gLTWz5KU

And now the PHP code (I had a hard time translating the comments since I already struggled to express them in my first language; I hope they are still helpful):
http://pastebin.com/Nzi8m2v8

Update

So far, there have been some solutions to my parsing problem, each of them using JavaScript. Since JavaScript (being especially powerful here because of the ability to use browser-rendered data) seems to be the only efficient way to retrieve reliable information from the HTML, I am now looking for a way to run some kind of headless browser or rendering engine on my free server at x10hosting.com. Sadly, I can neither install software other than what Softaculous provides nor use PHP's exec() command.
Any idea would be appreciated!

For the sake of completeness, I'll post both solutions that exist so far:

  1. jQuery parser by Pierre Dubois:

    (function ($) { $(document).ready(function() {

        var _pe = window.pe || {
            fn : {}
        };
    
        var tblNumber = 0; // An incremental number to identify the schedule item with the table
    
        // For each table
        $('table').each(function () {
    
            $('#output').append('Parsing the table number: ' + tblNumber + '<br>');
            // console.log('Parsing the table number: ' + tblNumber);
            tblNumber += 1;
    
            var currentTable = this;
    
    
            // Parse the complex table
            _pe.fn.parsertable.parse($(currentTable));
    
            // Retrieve the parsed data
            var parsedData = $(currentTable).data().tblparser;
    
            //
            // Information about the column structure, nice that is consistent
            //
    
            // Day: Cell index position (0 based)
            // Mo: 3
            // Di: 7
            // Mi: 11
            // Do: 15
            // Fr: 19
            // Sa: 23
    
            // Title Location at Row index position "0"
    
            // "i" represent the middle column position
            for (var i = 3; i < 24; i += 4) {
    
                var currentDay;
    
                // Get the day
                currentDay = $(parsedData.row[0].cell[i].elem).text();
    
                $('#output').append('  Day: ' + currentDay + '<br>');
                // console.log('Day: ' + currentDay);
    
                // Get all the events for that day, excluding the first row and the last row
                for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) {
    
                    // First column 
                    if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) {
    
                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i - 1].cell[j].elem).text().trim();
    
                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
    
                    // Second Column
                    if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid &&
                        parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) {
    
                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i].cell[j].elem).text().trim();
    
                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
    
                    // Third Column
                    if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid &&
                        parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) {
    
                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i + 1].cell[j].elem).text().trim();
    
                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
                } 
    
            }
    
        });
    
    
    });
    

    }(jQuery));

  2. JS parser using positional information, written by me, implementing rambo coder's idea

asked Oct 20, 2012 at 13:23 by MCL; edited Jun 30, 2017 at 7:01 by Brian Tompsett - 汤莱恩
  • 1 Oh brother. I'm not sure whether a tool exists to parse that kind of a mess. ("Mess" from a HTML point of view. Visually, it looks pretty neat!) – Pekka Commented Oct 20, 2012 at 13:28
  • Side note: on Stack Overflow, we aim for self-contained questions that are valid even when external sites go down. Could you post an example of the kind of HTML code that you need to parse in here? Use the {} button to format it as code. Thanks! – Pekka Commented Oct 20, 2012 at 13:32
  • Why do you care about the rowspans when the time is in the cells? Why don't you just create a hash/array of dates pulled from the header and then only worry about the columns and parse the text for the time? Looks pretty simple using any of the standard XML parsing methods and basic regex. – Matt Whipple Commented Oct 20, 2012 at 13:32
  • @Matt look at the table. It's a total, utter catastrophe in terms of programmatically interpreting anything. I bet a 3-hour course would be marked by a cell that has an increased rowspan... good luck parsing that. – Pekka Commented Oct 20, 2012 at 13:33
  • 2 It looks like the <td> elements that you're interested in all have "class='v'", so you could probably just parse those, and ignore everything else. – GreyBeardedGeek Commented Oct 20, 2012 at 13:44

6 Answers

You could make use of a browser's rendering/layout engine here.

Use http://phantomjs.org/ to get access to a headless browser that lets you execute JavaScript on a webpage's DOM.

A dash of jQuery would make the remaining pseudocode easy to implement:

foreach (td.t as dateElement) {
    //parse date from element text
    //use pixel position + dimensions to calc pixel coord of center
    // save this center in a list along with the date
}

foreach (td.v as calendarEntryElement) {
    //parse time + other stuff from element text
    //use pixel position to find the closest date element in that list(it must be the closest one above)
}

I feel positional information would be very reliable here, because everything is a nested rectangle and it's all done via tables.
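The pseudocode above could be sketched roughly as follows. This is only an illustration under assumptions: the rectangles are plain objects with made-up pixel values so the matching logic runs without a browser, and the names centerX, assignEntriesToDates, sampleHeaders and sampleEntries are hypothetical. In a real page the rectangles could come from getBoundingClientRect() on the td.t and td.v cells.

```javascript
// Sketch only: rectangles are plain {left, top, width, height} objects.
// In a browser they could be read via element.getBoundingClientRect().

// Horizontal centre of a rectangle.
function centerX(rect) {
    return rect.left + rect.width / 2;
}

// For each entry, pick the date header that is not below the entry and
// whose horizontal centre is closest; ties go to the nearest header above.
function assignEntriesToDates(headers, entries) {
    return entries.map(function (entry) {
        var best = null;
        var bestDx = Infinity;
        var bestDy = Infinity;

        headers.forEach(function (header) {
            var dy = entry.rect.top - header.rect.top;
            if (dy < 0) {
                return; // header lies below the entry, so it cannot own it
            }
            var dx = Math.abs(centerX(header.rect) - centerX(entry.rect));
            if (dx < bestDx || (dx === bestDx && dy < bestDy)) {
                best = header;
                bestDx = dx;
                bestDy = dy;
            }
        });

        return { date: best ? best.text : null, entry: entry.text };
    });
}

// Made-up sample layout: two day columns in week one, one header in week two.
var sampleHeaders = [
    { text: 'Mo, 22.10.2012', rect: { left: 100, top: 0, width: 200, height: 20 } },
    { text: 'Di, 23.10.2012', rect: { left: 300, top: 0, width: 200, height: 20 } },
    { text: 'Mo, 29.10.2012', rect: { left: 100, top: 500, width: 200, height: 20 } }
];
var sampleEntries = [
    { text: 'DATMOD Gr. 1', rect: { left: 100, top: 120, width: 100, height: 60 } },
    { text: 'PROG III Gr. 2', rect: { left: 200, top: 120, width: 100, height: 60 } },
    { text: 'DATMOD Gr. 2', rect: { left: 300, top: 40, width: 200, height: 60 } },
    { text: 'IuF', rect: { left: 100, top: 620, width: 200, height: 60 } }
];

console.log(assignEntriesToDates(sampleHeaders, sampleEntries));
```

The first two sample entries sit side by side under the same day, which is the "double courses" situation from the question; both resolve to the Monday header because its centre is the nearest one.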

You don't need to use PhantomJS; you could just as easily launch a browser manually and let it send a request to a local server to collect the results.

Some shell command roughly like

firefox file://foo123.html

Where you've appended some custom <script> to the end of one of their webpages and saved it.


I study at the same university, and a few weeks ago I faced the same problem of parsing this timetable and converting it to an ICS file. I eventually found my own solution and generalized the code so that students from other universities that use the Sked software, and who have much more complex timetables, can import theirs too.
I also created a website where students can sign up and configure the URLs of the timetables they want to subscribe to. A cron job runs in the background to ensure that the subscribed calendars are always up to date. You can find the result of the project on my website:
http://calendar.pineappledeveloper.com/
(it is only available in German).
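For readers who want to see what the ICS side of such a conversion might look like, here is a minimal sketch. It is not the code behind the website above. The function names (pad2, icsDateTime, escapeIcsText, toIcs), the PRODID and the UID scheme are all assumptions made up for illustration; times are written as floating local times, and line folding for long lines is left out.

```javascript
// Sketch only: serializes already-parsed events into a minimal iCalendar text.

// Zero-pad a number to two digits.
function pad2(n) {
    return (n < 10 ? '0' : '') + n;
}

// date: {year, month, day}; time: 'H:MM' or 'HH:MM' -> 'YYYYMMDDTHHMMSS'
function icsDateTime(date, time) {
    var parts = time.split(':');
    return String(date.year) + pad2(date.month) + pad2(date.day) +
        'T' + pad2(Number(parts[0])) + pad2(Number(parts[1])) + '00';
}

// Escape backslash, semicolon, comma and newlines in TEXT values.
function escapeIcsText(text) {
    return text
        .replace(/\\/g, '\\\\')
        .replace(/;/g, '\\;')
        .replace(/,/g, '\\,')
        .replace(/\r?\n/g, '\\n');
}

// events: array of {date, start, end, summary}; stampUtc: e.g. '20121020T132300Z'
function toIcs(events, stampUtc) {
    var lines = [
        'BEGIN:VCALENDAR',
        'VERSION:2.0',
        'PRODID:-//example//sked-sketch//EN' // hypothetical product id
    ];

    events.forEach(function (ev, index) {
        var start = icsDateTime(ev.date, ev.start);
        lines.push(
            'BEGIN:VEVENT',
            'UID:' + index + '-' + start + '@example.invalid', // made-up UID scheme
            'DTSTAMP:' + stampUtc,
            'DTSTART:' + start,
            'DTEND:' + icsDateTime(ev.date, ev.end),
            'SUMMARY:' + escapeIcsText(ev.summary),
            'END:VEVENT'
        );
    });

    lines.push('END:VCALENDAR');
    return lines.join('\r\n') + '\r\n'; // iCalendar content lines end with CRLF
}

var sampleIcs = toIcs(
    [{ date: { year: 2012, month: 10, day: 23 }, start: '9:00', end: '10:30', summary: 'DATMOD Gr. 2' }],
    '20121020T132300Z'
);
console.log(sampleIcs);
```

A real subscription feed would probably also want a time zone and stable UIDs, so that calendar clients update existing events instead of duplicating them.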

Many developers use HTML Agility Pack to parse HTML:

http://htmlagilitypack.codeplex.com/

I have some hope for your problem, since the HTML is well formed and XHTML compliant.

The following Java program parses it successfully, but without extracting any information.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public final class AgendaParser {
   public static void main( String[] args ) throws Throwable {
      long atStart = System.currentTimeMillis();
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      DocumentBuilder db = dbf.newDocumentBuilder();
      Document doc = db.parse( "WIA13-7. Block.html" );
      System.err.println( "Elapsed time: " + ( System.currentTimeMillis() - atStart ) / 1000L + " seconds" );
      System.err.println( doc.getElementsByTagName( "table" ));
   }
}

It outputs:

Elapsed time: 118 seconds (very long, no?)
com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@7faea002

I think a couple of hours of work would give you 80% of the extraction.

Interesting complex table: a mix between a layout table and a data table.

This may help you. The solution is in jQuery and uses the same complex table parser that is available in the Web Experience Toolkit project. With that table parser, you will be able to retrieve your schedule data. The only thing left to do is to parse the schedule item content for import into your calendar apps.

This solution uses the columns instead of the rows to retrieve the schedule items.
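To give an idea of what a parser for such a table has to do, here is a minimal sketch of expanding rowspans and colspans into a rectangular grid. It is not the Web Experience Toolkit parser, only an illustration under assumptions: the input is a plain array of rows, each cell being an object with text and optional rowspan/colspan, and the names expandSpans, gridTexts and sampleRows are hypothetical. Edge cases such as rowspan="0" or overlapping cells are ignored.

```javascript
// Sketch only: expands spanning cells so every grid slot points to its cell.
function expandSpans(rows) {
    var grid = [];

    rows.forEach(function (cells, r) {
        grid[r] = grid[r] || [];
        var c = 0;

        cells.forEach(function (cell) {
            // Skip slots already claimed by a rowspan from an earlier row.
            while (grid[r][c] !== undefined) {
                c += 1;
            }

            var rowspan = cell.rowspan || 1;
            var colspan = cell.colspan || 1;

            // Put the same cell object into every slot it covers.
            for (var dr = 0; dr < rowspan; dr += 1) {
                grid[r + dr] = grid[r + dr] || [];
                for (var dc = 0; dc < colspan; dc += 1) {
                    grid[r + dr][c + dc] = cell;
                }
            }
            c += colspan;
        });
    });

    return grid;
}

// Helper for display: replace each cell object by its text.
function gridTexts(grid) {
    return grid.map(function (row) {
        return row.map(function (cell) {
            return cell.text;
        });
    });
}

// Made-up sample: a header spanning two columns and a cell spanning two rows.
var sampleRows = [
    [{ text: 'Mo', colspan: 2 }, { text: 'Di' }],
    [{ text: 'A', rowspan: 2 }, { text: 'B' }, { text: 'C' }],
    [{ text: 'D' }, { text: 'E' }]
];

console.log(gridTexts(expandSpans(sampleRows)));
```

Because a spanning cell is the same object in every slot it covers, comparing a slot with the one above it (grid[r][c] !== grid[r - 1][c]) tells you where a new cell starts. That plays the same role as the uid comparisons in the code further down.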

Working example: http://jsfiddle.net/3t2A8/2/

Here is the JavaScript code that is used to extract and show the schedule items:

(function ($) {
    $(document).ready(function() {

        var _pe = window.pe || {
            fn : {}
        };

        var tblNumber = 0; // An incremental number to identify the schedule item with the table

        // For each table
        $('table').each(function () {

            $('#output').append('Parsing the table number: ' + tblNumber + '<br>');
            // console.log('Parsing the table number: ' + tblNumber);
            tblNumber += 1;

            var currentTable = this;


            // Parse the complex table
            _pe.fn.parsertable.parse($(currentTable));

            // Retrieve the parsed data
            var parsedData = $(currentTable).data().tblparser;

            //
            // Information about the column structure, nice that is consistent
            //

            // Day: Cell index position (0 based)
            // Mo: 3
            // Di: 7
            // Mi: 11
            // Do: 15
            // Fr: 19
            // Sa: 23

            // Title Location at Row index position "0"

            // "i" represent the middle column position
            for (var i = 3; i < 24; i += 4) {

                var currentDay;

                // Get the day
                currentDay = $(parsedData.row[0].cell[i].elem).text();

                $('#output').append('  Day: ' + currentDay + '<br>');
                // console.log('Day: ' + currentDay);

                // Get all the events for that day, excluding the first row and the last row
                for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) {

                    // First column 
                    if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) {

                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i - 1].cell[j].elem).text().trim();

                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }

                    // Second Column
                    if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid &&
                        parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) {

                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i].cell[j].elem).text().trim();

                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }

                    // Third Column
                    if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid &&
                        parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) {

                        // Get the content of that cell and remove ending space
                        var event = $(parsedData.col[i + 1].cell[j].elem).text().trim();

                        if (event.length > 0) {
                            $('#output').append('  + Event: ' + event + '<br>');
                            // console.log('Event: ' + event);
                        }
                    }
                } 

            }

        });


    });

}(jQuery));

For the "double courses" table, it would output:

Parsing the table number: 0
  Day: Mo, 22.10.2012
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      DATMOD Gr. 1
      HG: 13
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      PROG III Gr. 2
      HG: 15
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      DATMOD Gr. 1
      HG: 13
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      PROG III Gr. 2
      HG: 15
  + Event: 16:15 - 17:45 Uhr
      Vorlesung
      DATMOD Gr. 2
      HG: 13
  + Event: 16:15 - 17:45 Uhr
      Vorlesung
      PROG III Gr. 1
      HG: 15
  + Event: 18:00 - 19:30 Uhr
      Vorlesung
      DATMOD Gr. 2
      HG: 13
  + Event: 18:00 - 19:30 Uhr
      Vorlesung
      PROG III Gr. 1
      HG: 15
  Day: Di, 23.10.2012
  + Event: 9:00 - 10:30 Uhr
      Vorlesung
      DATMOD Gr. 2
      HG: 13
  + Event: 10:45 - 12:15 Uhr
      Vorlesung
      DATMOD Gr. 2
      HG: 13
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      DATMOD Gr. 1
      HG: 13
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      DATMOD Gr. 1
      HG: 13
  + Event: 16:15 - 17:45 Uhr
      Vorlesung
      PROG III Gr. 1
      HG: 15
  + Event: 18:00 - 19:30 Uhr
      Vorlesung
      PROG III Gr. 1
      HG: 15
  Day: Mi, 24.10.2012
  + Event: 9:00 - 10:30 Uhr
      Vorlesung
      DATMOD
      HG: 09
  + Event: 10:45 - 12:15 Uhr
      Vorlesung
      DATMOD
      HG: 09
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      IuF
      HG: 09
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      IuF
      HG: 09
  Day: Do, 25.10.2012
  + Event: 12:45 - 14:15 Uhr
      Vorlesung
      PROG III Gr. 2
      HG: 15
  + Event: 14:30 - 16:00 Uhr
      Vorlesung
      PROG III Gr. 2
      HG: 15
  + Event: 16:15 - 17:45 Uhr
      Vorlesung
      Linux Gr. 2
      HG: 15
  + Event: 18:00 - 19:30 Uhr
      Vorlesung
      Linux Gr. 2
      HG: 15
  Day: Fr, 26.10.2012
  Day: Sa, 27.10.2012
  + Event: 9:00 - 11:15 Uhr
      Klausur
      MP INT RW
      HG: 14

:-)
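The remaining step, parsing the schedule item content, could look roughly like this. It is a sketch under assumptions: it takes the day and event strings in the shape shown in the output above (a date like "Mo, 22.10.2012", and an event whose first line is the time range, followed by the kind of event and the course name on separate lines), and the names parseDay and parseEvent are hypothetical. Any further lines, such as "HG: 13", are kept as-is because their meaning is not stated here.

```javascript
// Sketch only: turns the day and event strings into plain objects.

// 'Mo, 22.10.2012' -> {year: 2012, month: 10, day: 22}, or null if no date found
function parseDay(dayText) {
    var m = /(\d{1,2})\.(\d{1,2})\.(\d{4})/.exec(dayText);
    if (!m) {
        return null;
    }
    return { year: Number(m[3]), month: Number(m[2]), day: Number(m[1]) };
}

// Multi-line event text -> {start, end, kind, course, details}, or null
function parseEvent(eventText) {
    // Split into non-empty, trimmed lines.
    var lines = eventText.split('\n')
        .map(function (line) { return line.trim(); })
        .filter(function (line) { return line.length > 0; });

    // First line is expected to look like '12:45 - 14:15 Uhr'.
    var m = /^(\d{1,2}:\d{2})\s*-\s*(\d{1,2}:\d{2})\s*Uhr$/.exec(lines[0] || '');
    if (!m) {
        return null;
    }

    return {
        start: m[1],
        end: m[2],
        kind: lines[1] || '',     // e.g. 'Vorlesung' or 'Klausur'
        course: lines[2] || '',   // e.g. 'DATMOD Gr. 1'
        details: lines.slice(3)   // anything else, kept unparsed
    };
}

var sampleEventText = '12:45 - 14:15 Uhr\n      Vorlesung\n      DATMOD Gr. 1\n      HG: 13';
console.log(parseDay('Mo, 22.10.2012'), parseEvent(sampleEventText));
```

Returning null for text that does not start with a time range makes it easy to skip cells that are not schedule items.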

pseudoPHP

class ScheduleTableParser {
  buildTimetableFromTable() {
    // Trivial
    Parse day rowspans, day names and dates into $this->days;
    e.g. $this->days[0]['rowspan'] = 4 for Monday 22/10/2012

    // Extract lessons
    $tr = 0;
    foreach tr {
      $td = 0;
      foreach td {
        if (td.class == 'v') {
          parseClass($td, $tr, $tdDOMObject);
        }
        $td++;
      }
      $tr++;
    }
  }
  parseClass($td, $tr, $tdDOMObject) {
    // Trivial
    Get the class name etc.

    // Calculate time
    $time = 9:00 + 5mins * $tr;

    // Find the day: subtract each day's rowspan until the row offset is used up
    $tr = $tr - 2;
    $i = 0;
    while ($tr > 0) {
      $tr -= $this->days[$i]['rowspan'];
      $day = $this->days[$i]['name'];
      $date = $this->days[$i]['date'];
      $i++;
    }
  }
}