I am trying to monitor day-to-day prices from an online catalogue. The site uses HTTPS and generates the catalogue pages with JavaScript. How can I interface with the site and make it generate the pages I need?
I have done this with other sites where the HTML can easily be accessed, and I have no problem parsing the HTML once it is generated.
I only know Python and Java.
Thanks in advance.
asked Apr 6, 2011 at 5:41 by jsj
3 Answers
Take a look at HtmlUnit - a headless Java browser that can be fully controlled by your code. A simple example can be seen here: http://htmlunit.sourceforge.net/gettingStarted.html
(obligatory warning: by screen-scraping the site, you may be breaking its ToS and possibly opening yourself up to lawsuits; check whether you are allowed to do it before you start)
If they've created a Web API that their JavaScript interfaces with, you might be able to scrape that directly, rather than trying to go the HTML route.
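As a rough sketch of that Web API route, something like the following might work, using only the Python standard library. The endpoint URL and the JSON layout (an `items` list whose entries have `name` and `price`) are pure assumptions for illustration; the real ones would have to be found by watching the requests the site's JavaScript makes in the browser's network tab.

```python
import json
import urllib.request

# Hypothetical endpoint, not the real site's; replace with whatever the
# page's JavaScript actually requests.
CATALOGUE_URL = "https://example.com/api/catalogue?page=1"


def fetch_json(url, timeout=10):
    """Download a URL (HTTPS is handled by urllib) and decode its JSON body."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "price-monitor/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def extract_prices(payload):
    """Map item name to price.

    Assumes a payload shaped like {"items": [{"name": ..., "price": ...}]};
    this shape is invented for the example.
    """
    return {
        item["name"]: float(item["price"])
        for item in payload.get("items", [])
    }
```

Calling `extract_prices(fetch_json(CATALOGUE_URL))` once a day and storing the result would then give the price history, provided the site really does expose such an endpoint.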
If they've obfuscated it or that option isn't available for some other reason, you'll basically need a Web browser to evaluate the JavaScript and then scrape the browser's DOM. Perhaps write a browser plugin?
I use WebKit through its Python bindings for scraping JavaScript content. See here for example.