javascript - How can one scrape a web page with jQuery and XPath? - Stack Overflow

I can stick a jQuery javascript link in the header of a web page via Firebug. Then, I can run a script to scrape it and the pages it links to.

How do I begin writing this script in jQuery or javascript in general? Is there an interface in either jQuery/Javascript with which I can use XPath to access the elements on a page (and on the pages it links to)?

asked Mar 8, 2012 at 15:32 by dangerChihuahua007 (edited Mar 9, 2012 at 7:58)

3 Answers

First, you'll need a JavaScript runtime outside of the browser. The most common is Node.js. Next, you'll need a way to create the client-side DOM in that runtime. This is typically done using jsdom.

So, your script should:

  1. download the HTML page (jsdom does this for you, but you can also use the request module)
  2. create a client-side DOM
  3. parse using jQuery

Here is a sample Node.js script:

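// Uses the callback-style jsdom.env() API from older jsdom releases.
var jsdom = require("jsdom");

// Load the page, inject jQuery into it, then query the resulting DOM.
jsdom.env("http://nodejs.org/dist/", [
    'http://code.jquery.com/jquery-1.5.min.js'
  ], function(errors, window) {
  if (errors) {
    console.error("could not load the page:", errors);
    return;
  }
  console.log("there have been", window.$("a").length, "nodejs releases!");
});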

You would run it, like so:

$ node scrape.js

Don't forget to install jsdom first:

$ npm install --production jsdom
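
To scrape "the pages it links to", as the question puts it, the href values collected from the page have to be turned into absolute URLs before they can be fetched. Here is a minimal sketch of that step, assuming a Node.js version that ships the built-in URL class (newer than what existed when this answer was written). The helper name resolveLinks, the example.com base URL, and the sample hrefs are made up for illustration.

```javascript
// Hypothetical helper: turn raw href values into absolute, de-duplicated URLs.
function resolveLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, baseUrl); // resolves relative paths against the page URL
    } catch (e) {
      continue; // skip values that cannot be parsed as a URL
    }
    // Keep only pages that can be fetched over HTTP(S); drops mailto:, javascript:, etc.
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // "#fragment" variants point to the same page
    seen.add(url.href);
  }
  return Array.from(seen);
}

// Sample input, made up for illustration.
const links = resolveLinks(
  ["v0.6.0/", "/dist/latest/", "mailto:someone@example.com", "v0.6.0/#top"],
  "http://example.com/dist/"
);
console.log(links);
```

In a real script the hrefs would come from the jQuery selection, for example by reading the href attribute of each element matched by window.$("a"), and each resolved URL would then be loaded the same way as the first page.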

You can get the HTML of the page quickly with:

var html = document.documentElement.innerHTML;

This returns the markup as a string. It does not include the root <html> element itself, only its contents.
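
On the XPath part of the question: jQuery selectors use CSS syntax, but browsers expose XPath directly through the standard document.evaluate method, so no extra library is needed when the script runs inside the page (for example from the Firebug console). Below is a minimal sketch; the helper name xpathAll and the "//a[@href]" expression are only illustrations, not something taken from the answers here.

```javascript
// Hypothetical helper: evaluate an XPath expression and return the matches as an array.
// `doc` is a DOM Document, e.g. the global `document` in a browser console.
function xpathAll(doc, expression, contextNode) {
  // Same numeric value as XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
  const ORDERED_NODE_SNAPSHOT_TYPE = 7;
  const result = doc.evaluate(
    expression,
    contextNode || doc, // search the whole document unless a context node is given
    null,               // no namespace resolver
    ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// Only runs where a DOM is present (browser console, Firebug, and so on).
if (typeof document !== "undefined") {
  const hrefs = xpathAll(document, "//a[@href]").map(function (a) {
    return a.href;
  });
  console.log(hrefs.length + " links found");
}
```

A snapshot result is used here because it stays valid even if the page changes while you loop over it. The returned nodes are plain DOM elements, so they can be wrapped with jQuery afterwards if that is more convenient.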

You might be interested in pjscrape, a web-scraping library built for exactly this purpose (disclaimer: this is my project). It's based on PhantomJS, a headless WebKit implementation you can run from the command line, and it has a really simple syntax for scraping data from multiple pages and finding additional URLs to spider and scrape.
