I'm trying to parse the class schedule provided by my university in order to import the information into some kind of calendar.
An example of the schedule can be seen here:
/WIA13-7.%20Block.html
The auto-generated HTML-content is, in my opinion, a mess and very hard to grasp. E.g. the tables are mainly built with rowspans and colspans (the positions of cells in the code pared with their actual visual position in the browser seem partially arbitrary).
What I've already tried:
- Asking the university's administration office to provide a simpler, easier to read file separately. Of course this wasn't possible, after all it would mean one minute of additional effort.
- Researching the original tool used to generate the HTML. It is called "sked Stundenplan Software". I couldn't find any hints or tools to "reverse" the generation process.
- Looking for an existing solution, at which point I found some tools (e.g. /) that do not work for my schedule. After studying the codes of these tools I concluded that they must have been designed for an other/outdated version of sked.
- Parsing the HTML with PHP (mostly using DOMDocument). That worked sometimes, but was way too unreliable...The exceptions to take into account seem indefinite.
Right now I don't think that conventional HTML parsing will get me far, at least not in an acceptable developing time. What I am looking for are other methods to fetch information from plex HTML tables, something like YQL, or maybe utilities that can normalize such tables with col-/rowspans. Because I don't have anything concrete in mind, I am mainly asking for some tips or hints for another approach.
Are there other, more suitable methods to parse such tables or am I stuck with conventional HTML parsing?
Edit:
On behalf of a request, I'll paste an example of raw code...
This week:
Results from this code:
Edit 2:
Because of some parsing discussions I'll also add my PHP code. It's my first time with PHP so it's not very sophisticated. It should rather give an insight on how far I've e with parsing the tables in theory. The actual work happens in the function parseSkedTable(), please concentrate on this one. Also, I would like to point out the term "double courses" appearing in the ments, which describes two different courses happening at the same time (the class would be split in such moments). An example of these courses can be found here in week two:
/WIB14-4.%20Block.html
It looks like this:
The corresponding HTML-code of that week can also be accessed here:
And now the PHP-code (I had a hard time translating the ments since I already struggled expressing them in my first language...I hope they may still be helpful):
Update
So far, there have been some solutions to my parsing problem, each of them using JavaScript. Since JavaScript (being especially powerful here because of the ability to use browser-rendered data) seems to be the only efficient way to retrieve reliable information from the HTML, I am now looking for a way to implement some kind of headless browser or rendering engine on my free server at x10hosting. Sadly, I am neither able to install software other than provided by softaculous nor allowed to use PHP's exec() mand.
Any idea would be appreciated!
For the sake of pleteness, I'll post both solutions, existing until now:
jQuery parser by Pierre Dubois:
(function ($) { $(document).ready(function() {
var _pe = window.pe || { fn : {} }; var tblNumber = 0; // Just a incremental number to identify the schedule item with the table // For each table $('table').each(function () { $('#output').append('Parsing the table number: ' + tblNumber + '<br>'); // console.log('Parsing the table number: ' + tblNumber); tblNumber += 1; var currentTable = this; // Parser the plex table _pe.fn.parsertable.parse($(currentTable)); // Retrieve the parsed data var parsedData = $(currentTable).data().tblparser; // // Information about the column structure, nice that is consistent // // Day: Cell index position (0 based) // Mo: 3 // Di: 7 // Mi: 11 // Do: 15 // Fr: 19 // Sa: 23 // Title Location at Row index position "0" // "i" represent the middle column position for (var i = 3; i < 24; i += 4) { var currentDay; // Get the day currentDay = $(parsedData.row[0].cell[i].elem).text(); $('#output').append(' Day: ' + currentDay + '<br>'); // console.log('Day: ' + currentDay); // Get all the events for that day, excluding the first row and the last row for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) { // First column if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) { // Get the content of that cell and remove ending space var event = $(parsedData.col[i - 1].cell[j].elem).text().trim(); if (event.length > 0) { $('#output').append(' + Event: ' + event + '<br>'); // console.log('Event: ' + event); } } // Second Column if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid && parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) { // Get the content of that cell and remove ending space var event = $(parsedData.col[i].cell[j].elem).text().trim(); if (event.length > 0) { $('#output').append(' + Event: ' + event + '<br>'); // console.log('Event: ' + event); } } // Third Column if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid && parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) { // Get the content of that cell and remove ending space var event = $(parsedData.col[i + 1].cell[j].elem).text().trim(); if (event.length > 0) { $('#output').append(' + Event: ' + event + '<br>'); // console.log('Event: ' + event); } } } } }); });
}(jQuery));
JS parser using positional information by me, realizing rambo coder's idea
I'm trying to parse the class schedule provided by my university in order to import the information into some kind of calendar.
An example of the schedule can be seen here:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIA13-7.%20Block.html
The auto-generated HTML-content is, in my opinion, a mess and very hard to grasp. E.g. the tables are mainly built with rowspans and colspans (the positions of cells in the code pared with their actual visual position in the browser seem partially arbitrary).
What I've already tried:
- Asking the university's administration office to provide a simpler, easier to read file separately. Of course this wasn't possible, after all it would mean one minute of additional effort.
- Researching the original tool used to generate the HTML. It is called "sked Stundenplan Software". I couldn't find any hints or tools to "reverse" the generation process.
- Looking for an existing solution, at which point I found some tools (e.g. http://code.google./p/skd-schedule-parser/) that do not work for my schedule. After studying the codes of these tools I concluded that they must have been designed for an other/outdated version of sked.
- Parsing the HTML with PHP (mostly using DOMDocument). That worked sometimes, but was way too unreliable...The exceptions to take into account seem indefinite.
Right now I don't think that conventional HTML parsing will get me far, at least not in an acceptable developing time. What I am looking for are other methods to fetch information from plex HTML tables, something like YQL, or maybe utilities that can normalize such tables with col-/rowspans. Because I don't have anything concrete in mind, I am mainly asking for some tips or hints for another approach.
Are there other, more suitable methods to parse such tables or am I stuck with conventional HTML parsing?
Edit:
On behalf of a request, I'll paste an example of raw code...
This week:
Results from this code:
http://pastebin./BJduUVtU
Edit 2:
Because of some parsing discussions I'll also add my PHP code. It's my first time with PHP so it's not very sophisticated. It should rather give an insight on how far I've e with parsing the tables in theory. The actual work happens in the function parseSkedTable(), please concentrate on this one. Also, I would like to point out the term "double courses" appearing in the ments, which describes two different courses happening at the same time (the class would be split in such moments). An example of these courses can be found here in week two:
http://www.asw-berufsakademie.de/fileadmin/download/download/Sked%20Stundenplan/WIB14-4.%20Block.html
It looks like this:
The corresponding HTML-code of that week can also be accessed here:
http://pastebin./gLTWz5KU
And now the PHP-code (I had a hard time translating the ments since I already struggled expressing them in my first language...I hope they may still be helpful):
http://pastebin./Nzi8m2v8
Update
So far, there have been some solutions to my parsing problem, each of them using JavaScript. Since JavaScript (being especially powerful here because of the ability to use browser-rendered data) seems to be the only efficient way to retrieve reliable information from the HTML, I am now looking for a way to implement some kind of headless browser or rendering engine on my free server at x10hosting.. Sadly, I am neither able to install software other than provided by softaculous nor allowed to use PHP's exec() mand.
Any idea would be appreciated!
For the sake of pleteness, I'll post both solutions, existing until now:
jQuery parser by Pierre Dubois:
(function ($) { $(document).ready(function() {
var _pe = window.pe || { fn : {} }; var tblNumber = 0; // Just a incremental number to identify the schedule item with the table // For each table $('table').each(function () { $('#output').append('Parsing the table number: ' + tblNumber + '<br>'); // console.log('Parsing the table number: ' + tblNumber); tblNumber += 1; var currentTable = this; // Parser the plex table _pe.fn.parsertable.parse($(currentTable)); // Retrieve the parsed data var parsedData = $(currentTable).data().tblparser; // // Information about the column structure, nice that is consistent // // Day: Cell index position (0 based) // Mo: 3 // Di: 7 // Mi: 11 // Do: 15 // Fr: 19 // Sa: 23 // Title Location at Row index position "0" // "i" represent the middle column position for (var i = 3; i < 24; i += 4) { var currentDay; // Get the day currentDay = $(parsedData.row[0].cell[i].elem).text(); $('#output').append(' Day: ' + currentDay + '<br>'); // console.log('Day: ' + currentDay); // Get all the events for that day, excluding the first row and the last row for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) { // First column if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) { // Get the content of that cell and remove ending space var event = $(parsedData.col[i - 1].cell[j].elem).text().trim(); if (event.length > 0) { $('#output').append(' + Event: ' + event + '<br>'); // console.log('Event: ' + event); } } // Second Column if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid && parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) { // Get the content of that cell and remove ending space var event = $(parsedData.col[i].cell[j].elem).text().trim(); if (event.length > 0) { $('#output').append(' + Event: ' + event + '<br>'); // console.log('Event: ' + event); } } // Third Column if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid && parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) { // Get the content of that cell and remove ending space var event = $(parsedData.col[i + 1].cell[j].elem).text().trim(); if (event.length > 0) { $('#output').append(' + Event: ' + event + '<br>'); // console.log('Event: ' + event); } } } } }); });
}(jQuery));
JS parser using positional information by me, realizing rambo coder's idea
- 1 Oh brother. I'm not sure whether a tool exists to parse that kind of a mess. ("Mess" from a HTML point of view. Visually, it looks pretty neat!) – Pekka Commented Oct 20, 2012 at 13:28
-
Side note: on Stack Overflow, we aim for self-contained questions that are valid even when external sites go down. Could you post an example of the kind of HTML code that you need to parse in here? Use the
{}
button to format it as code. Thanks! – Pekka Commented Oct 20, 2012 at 13:32 - Why do you care about the rowspans when the time is in the cells? Why don't you just create a hash/array of dates pulled from the header and then only worry about the columns and parse the text for the time? Looks pretty simple using any of the standard XML parsing methods and basic regex. – Matt Whipple Commented Oct 20, 2012 at 13:32
-
@Matt look at the table. It's a total, utter catastrophe in terms of programmatically interpreting anything. I bet a 3-hour course would be marked by a cell that has an increased
rowspan
... good luck parsing that. – Pekka Commented Oct 20, 2012 at 13:33 - 2 It looks like the <td> elements that you're interested in all have "class='v'", so you could probably just parse those, and ignore everything else. – GreyBeardedGeek Commented Oct 20, 2012 at 13:44
6 Answers
Reset to default 1You could make use of a browsers rendering/layout engine here.
Use http://phantomjs/ to get access to a headless browser that lets you execute javascript on a webpage's dom.
A dash of jquery would make the remaining pseudocode easy to implement:
foreach (td.t as dateElement) {
//parse date from element text
//use pixel position + dimensions to calc pixel coord of center
// save this center in a list along with the date
}
foreach (td.v as calendarEntryElement) {
//parse time + other stuff from element text
//use pixel position to find the closest date element in that list(it must be the closest one above)
}
I feel positional information would be very reliable here, because everything is a nested rectangle and its all done via tables.
You don't need to use phantomjs, you could just as easily execute a browser manually, and let it send a request to a local server to collect the results.
Some shell mand roughly like
firefox file://foo123.html
Where you've appended some custom <script>
to the end of one of their webpages and saved it.
I study at the same university and a few weeks ago I faced the same problem to parse this time table and convert it to an ICS file. Finally I found my own solution and generalized the code, so that students from other universities, using the Sked software and have a much more plex time table, can import their time table too.
I also created a website, where students can sign up and configure the urls to the time tables which they want to subscribe. In the background runs a cronjob which ensures, that the subscribed calendars are always up to date.
You can find the result of the project on my website:
http://calendar.pineappledeveloper./
(it is only in German available).
Many developers use HTML Agility Pack to parse HTML:
http://htmlagilitypack.codeplex./
I have some hope for your problem since the html is well formed and is x-html pliant.
The following Java program parse it successfully but without extracting information.
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
public final class AgendaParser {
public static void main( String[] args ) throws Throwable {
long atStart = System.currentTimeMillis();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse( "WIA13-7. Block.html" );
System.err.println( "Elapsed time: " + ( System.currentTimeMillis() - atStart ) / 1000L + " seconds" );
System.err.println( doc.getElementsByTagName( "table" ));
}
}
it outputs:
Elapsed time: 118 seconds (very long, no?)
.sun.apache.xerces.internal.dom.DeepNodeListImpl@7faea002
I thing a couple of hours of work gives you 80% of the extraction.
Interresting plex table. A mix between a layout table and a data table.
This may help you, that solution is in jQuery and use the same plex table parser available in the Web Experience Toolkit Project. With that table parser, you will be able to retreive your schedule data. The only thing that need to be done would be to parse the schedule item content for the importation to your calendar apps.
This solution use the column instead of the row to retrieve the schedule items
Working example : http://jsfiddle/3t2A8/2/
Here the javascript code that is used to extract and show the schedule items
(function ($) {
$(document).ready(function() {
var _pe = window.pe || {
fn : {}
};
var tblNumber = 0; // Just a incremental number to identify the schedule item with the table
// For each table
$('table').each(function () {
$('#output').append('Parsing the table number: ' + tblNumber + '<br>');
// console.log('Parsing the table number: ' + tblNumber);
tblNumber += 1;
var currentTable = this;
// Parser the plex table
_pe.fn.parsertable.parse($(currentTable));
// Retrieve the parsed data
var parsedData = $(currentTable).data().tblparser;
//
// Information about the column structure, nice that is consistent
//
// Day: Cell index position (0 based)
// Mo: 3
// Di: 7
// Mi: 11
// Do: 15
// Fr: 19
// Sa: 23
// Title Location at Row index position "0"
// "i" represent the middle column position
for (var i = 3; i < 24; i += 4) {
var currentDay;
// Get the day
currentDay = $(parsedData.row[0].cell[i].elem).text();
$('#output').append(' Day: ' + currentDay + '<br>');
// console.log('Day: ' + currentDay);
// Get all the events for that day, excluding the first row and the last row
for (var j = 1; j < parsedData.col[i].cell.length - 2; j += 1) {
// First column
if (parsedData.col[i - 1].cell[j - 1].uid !== parsedData.col[i - 1].cell[j].uid ) {
// Get the content of that cell and remove ending space
var event = $(parsedData.col[i - 1].cell[j].elem).text().trim();
if (event.length > 0) {
$('#output').append(' + Event: ' + event + '<br>');
// console.log('Event: ' + event);
}
}
// Second Column
if (parsedData.col[i].cell[j - 1].uid !== parsedData.col[i].cell[j].uid &&
parsedData.col[i - 1].cell[j].uid !== parsedData.col[i].cell[j].uid) {
// Get the content of that cell and remove ending space
var event = $(parsedData.col[i].cell[j].elem).text().trim();
if (event.length > 0) {
$('#output').append(' + Event: ' + event + '<br>');
// console.log('Event: ' + event);
}
}
// Third Column
if (parsedData.col[i + 1].cell[j - 1].uid !== parsedData.col[i + 1].cell[j].uid &&
parsedData.col[i].cell[j].uid !== parsedData.col[i + 1].cell[j].uid) {
// Get the content of that cell and remove ending space
var event = $(parsedData.col[i + 1].cell[j].elem).text().trim();
if (event.length > 0) {
$('#output').append(' + Event: ' + event + '<br>');
// console.log('Event: ' + event);
}
}
}
}
});
});
}(jQuery));
it would output for the "double courses" table
Parsing the table number: 0
Day: Mo, 22.10.2012
+ Event: 12:45 - 14:15 Uhr
Vorlesung
DATMOD Gr. 1
HG: 13
+ Event: 12:45 - 14:15 Uhr
Vorlesung
PROG III Gr. 2
HG: 15
+ Event: 14:30 - 16:00 Uhr
Vorlesung
DATMOD Gr. 1
HG: 13
+ Event: 14:30 - 16:00 Uhr
Vorlesung
PROG III Gr. 2
HG: 15
+ Event: 16:15 - 17:45 Uhr
Vorlesung
DATMOD Gr. 2
HG: 13
+ Event: 16:15 - 17:45 Uhr
Vorlesung
PROG III Gr. 1
HG: 15
+ Event: 18:00 - 19:30 Uhr
Vorlesung
DATMOD Gr. 2
HG: 13
+ Event: 18:00 - 19:30 Uhr
Vorlesung
PROG III Gr. 1
HG: 15
Day: Di, 23.10.2012
+ Event: 9:00 - 10:30 Uhr
Vorlesung
DATMOD Gr. 2
HG: 13
+ Event: 10:45 - 12:15 Uhr
Vorlesung
DATMOD Gr. 2
HG: 13
+ Event: 12:45 - 14:15 Uhr
Vorlesung
DATMOD Gr. 1
HG: 13
+ Event: 14:30 - 16:00 Uhr
Vorlesung
DATMOD Gr. 1
HG: 13
+ Event: 16:15 - 17:45 Uhr
Vorlesung
PROG III Gr. 1
HG: 15
+ Event: 18:00 - 19:30 Uhr
Vorlesung
PROG III Gr. 1
HG: 15
Day: Mi, 24.10.2012
+ Event: 9:00 - 10:30 Uhr
Vorlesung
DATMOD
HG: 09
+ Event: 10:45 - 12:15 Uhr
Vorlesung
DATMOD
HG: 09
+ Event: 12:45 - 14:15 Uhr
Vorlesung
IuF
HG: 09
+ Event: 14:30 - 16:00 Uhr
Vorlesung
IuF
HG: 09
Day: Do, 25.10.2012
+ Event: 12:45 - 14:15 Uhr
Vorlesung
PROG III Gr. 2
HG: 15
+ Event: 14:30 - 16:00 Uhr
Vorlesung
PROG III Gr. 2
HG: 15
+ Event: 16:15 - 17:45 Uhr
Vorlesung
Linux Gr. 2
HG: 15
+ Event: 18:00 - 19:30 Uhr
Vorlesung
Linux Gr. 2
HG: 15
Day: Fr, 26.10.2012
Day: Sa, 27.10.2012
+ Event: 9:00 - 11:15 Uhr
Klausur
MP INT RW
HG: 14
:-)
pseudoPHP
class ScheduleTableParser {
buildTimetableFromTable() {
//Trivial
Parse Day Rowspans, Day Names, Dates into $this->days;
e.g. $days[0]['rowspan'] = 4 for Monday 22/10/2012
//Extract Lessons
$tr = 0;
foreach tr {
$td = 0;
foreach td{
if(td.class = 'v') {
parseClass($td,$tr,$tdDOMObject);
}
$td++;
}
}
}
parseClass($td,$tr,$tdDOMObject) {
//Trivial
Get the Class Name Etc
//Calculate Time
$time = 9:00 + 5mins*tr;
$tr = $tr - 2;
$i = 0;
while($tr > 0) {
$tr - $this->days[$i]['rowspan'];
$day = $this->days[$i]['name'];
$date = $this->days[$i]['date'];
$i++;
}
}
}