I can stick a jQuery JavaScript link in the header of a web page via Firebug. Then, I can run a script to scrape it and the pages it links to.
How do I begin writing this script in jQuery or JavaScript in general? Is there an interface in either jQuery/JavaScript with which I can use XPath to access the elements on a page (and on the pages it links to)?
3 Answers
First, you'll need a JavaScript runtime outside of the browser. The most common is Node.js. Next, you'll need a way to create the DOM client-side. This is typically done using jsdom.
So, your script should:
- download the HTML page (jsdom does this for you, but you can use request)
- create a client-side DOM
- parse it using jQuery
Here is a sample Node.js script:
var jsdom = require("jsdom");
jsdom.env("http://nodejs/dist/", [
'http://code.jquery./jquery-1.5.min.js'
], function(errors, window) {
console.log("there have been", window.$("a").length, "nodejs releases!");
});
You would run it like so:
$ node scrape.js
Don't forget to install jsdom first:
$ npm install --production jsdom
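If you'd rather download the page yourself (for example with the request module mentioned above), you can pass the downloaded HTML string to jsdom.env instead of a URL. A minimal sketch along the same lines, assuming the same jsdom and jQuery versions as above:

var request = require("request");
var jsdom = require("jsdom");

// Fetch the HTML ourselves, then build the DOM from the string.
request("http://nodejs.org/dist/", function(error, response, body) {
  if (error) throw error;
  jsdom.env(body, [
    'http://code.jquery.com/jquery-1.5.min.js'
  ], function(errors, window) {
    // Same scrape as before, but on HTML we downloaded with request.
    console.log("there have been", window.$("a").length, "nodejs releases!");
  });
});

You'd need to npm install request as well.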
You can get the HTML of a page quickly with:
var html = document.documentElement.innerHTML;
Note that this returns only the contents of the root element as a string; the <html> tag itself is not captured.
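If you want the root element included as well, outerHTML captures it (the doctype still isn't part of either):

// outerHTML includes the <html> tag itself, unlike innerHTML above.
var html = document.documentElement.outerHTML;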
You might be interested in pjscrape, a web-scraping library built for exactly this purpose (disclaimer: this is my project). It's based on PhantomJS, a headless WebKit implementation you can run from the command line, and it has a really simple syntax for scraping data from multiple pages and finding additional URLs to spider and scrape.
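A rough sketch of what a pjscrape config file looks like, based on the project's documented style (the URL and selectors here are placeholders):

// my_config.js -- placeholder URL and selectors
pjs.addSuite({
    url: 'http://example.com/',
    moreUrls: function() {
        // return additional URLs to spider from the current page
        return _pjs.getAnchorUrls('a');
    },
    scraper: function() {
        // runs inside the page with jQuery injected; return the scraped data
        return $('h1').first().text();
    }
});

You would then run it through PhantomJS, along the lines of:

$ phantomjs pjscrape.js my_config.js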