
javascript - Problems with web site scraping using zombie.js - Stack Overflow


I need to do some web scraping. After playing around with different web testing frameworks, most of which were either too slow (Selenium) or too buggy for my needs (env.js), I decided that zombie.js looks most promising, as it uses a solid set of libraries for HTML parsing and DOM manipulation. However, it seems to me that it doesn't even support basic event-based JavaScript code like in the following web page:

<html>
  <head>
    <title>test</title>
    <script type="text/javascript">

      console.log("test script executing...");
      console.log("registering callback for event DOMContentLoaded on " + document);

      document.addEventListener('DOMContentLoaded', function(){
        console.log("DOMContentLoaded triggered");
      }, false);

      function loaded() {
        console.log("onload triggered");
      }

    </script>
  </head>

  <body onload="loaded();">
    <h1>Test</h1>
  </body>
</html>

I then decided to trigger those events manually like this:

var zombie = require("zombie");

zombie.visit("http://localhost:4567/", { debug: true }, function (err, browser, status) {

  // zombie does not fire DOMContentLoaded by itself, so dispatch it manually
  var doc = browser.document;
  console.log("firing DOMContentLoaded on " + doc);
  browser.fire("DOMContentLoaded", doc, function (err, browser, status) {

    // likewise fire load on <body>, which runs the inline onload handler
    var body = browser.querySelector("body");
    console.log("firing load on " + body);
    browser.fire("load", body, function (err, browser, status) {

      console.log(browser.html());

    });
  });

});

Which works for this particular test page. My problem is a more general one, though: I want to be able to scrape more complex, AJAX-based sites like a friends list on Facebook (something like http://www.facebook.com/profile.php?id=100000028174850&sk=friends&v=friends). It is no problem to log into the site using zombie, but some content, like those lists, seems to be loaded completely dynamically using AJAX, and I don't know how to trigger the event handlers that initiate the loading.
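The obvious next step would be to let zombie process pending scripts itself. A minimal sketch using zombie's browser.wait, which is documented to block until queued events, timers, and in-flight AJAX requests have finished (I am assuming here that the zombie version in use actually supports this, and I have not verified it against the Facebook page):

var zombie = require("zombie");

zombie.visit("http://localhost:4567/", function (err, browser, status) {

  // wait() is documented to yield once queued events, timers and
  // pending XHR requests have completed, so dynamically inserted
  // content should be present in the DOM afterwards
  browser.wait(function (err, browser) {
    console.log(browser.html());
  });

});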

There are several questions I have regarding this problem:

  • Has somebody already implemented a similarly complex scraper without using a browser remote-controlling solution like Selenium?
  • Is there some reference on the loading process of a complex JavaScript-based page?
  • Can somebody provide advice on how to debug a real browser to see what I might need to execute to trigger the Facebook event handlers?
  • Any other ideas about this topic?

Again, please do not point me to solutions involving controlling a real browser like Selenium, as I know about those. What is welcome, however, are suggestions for a real in-memory renderer like WebKit accessible from the Ruby scripting language, preferably with the possibility to set cookies and preferably also to load raw HTML instead of triggering real HTTP requests.

asked Sep 7, 2011 at 15:50 by Niklas B. (edited Sep 7, 2011 at 15:56)

  • Are you looking for a JavaScript test framework, or a web data-extraction tool? If you're just looking for a screen-scraping tool, it's possible to scrape most sites without executing their JavaScript, even AJAX-heavy ones. – jches, Sep 7, 2011 at 16:34
  • The question is about web scraping. You are right, it is indeed often possible to do this without executing JS, e.g. by issuing REST requests manually. In the case of Facebook, scraping the mobile version of the site is quite possible using only HTTP and HTML parsing. But I am interested in a generic solution that understands JavaScript and does not require a real browser instance. This seems to be possible, as env.js and zombie.js show, but it appears to be a tricky problem. – Niklas B., Sep 7, 2011 at 17:02

1 Answer


For purposes of data extraction, running a "headless browser" and triggering JavaScript events manually is not going to be the easiest thing to do. While not impossible, there are simpler ways to do it.

Most sites, even AJAX-heavy ones, can be scraped without executing a single line of their JavaScript code. In fact, it's usually easier than trying to figure out a site's JavaScript, which is often obfuscated, minified, and difficult to debug. If you have a solid understanding of HTTP you will understand why: (almost) all interactions with the server are encoded as HTTP requests, so whether they are initiated by JavaScript, by the user clicking a link, or by custom code in a bot program, there is no difference to the server. (I say almost because when Flash or applets get involved there's no telling what data is flying where; those can be application-specific. But anything done in JavaScript will go over HTTP.)

That being said, it is possible to mimic a user on any website using custom software. First you have to be able to see the raw HTTP requests being sent to the server. You can use a proxy server to record the requests made by a real browser to the target website. There are many, many tools you can use for this: Charles and Fiddler are handy, most dedicated screen-scraping tools have a basic proxy built in, and the Firebug extension for Firefox and Chrome offers similar tools for viewing AJAX requests... you get the idea.
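If you want to see exactly what is being logged without installing one of those tools, a few lines of Node are enough for plain HTTP traffic (a sketch only; the port is arbitrary, and intercepting HTTPS requires a dedicated tool like Charles or Fiddler):

var http = require("http");
var url = require("url");

// Minimal logging proxy: point the browser's proxy setting at
// localhost:8080 and watch the requests scroll by.
http.createServer(function (req, res) {
  // log the request line and headers -- this is the data to mimic later
  console.log(req.method + " " + req.url);
  console.log(JSON.stringify(req.headers));

  // forward the request unchanged to the real server
  var target = url.parse(req.url);
  var proxyReq = http.request({
    host: target.hostname,
    port: target.port || 80,
    path: target.path,
    method: req.method,
    headers: req.headers
  }, function (proxyRes) {
    res.writeHead(proxyRes.statusCode, proxyRes.headers);
    proxyRes.pipe(res);
  });
  req.pipe(proxyReq);
}).listen(8080);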

Once you can see the HTTP requests that are made as a result of a particular action on the website, it is easy to write a program to mimic them: just send the same requests to the server, and it will treat your program exactly like a browser in which that action has been performed.
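For example, a minimal sketch in Node of replaying a recorded AJAX request (the host, path, and cookie value below are placeholders standing in for whatever your proxy captured, not real endpoints):

var http = require("http");

var options = {
  host: "www.example.com",           // host from the recorded request (placeholder)
  path: "/ajax/friends?start=0",     // path + query string as recorded (placeholder)
  headers: {
    "Cookie": "session=PLACEHOLDER", // session cookie captured after logging in
    "X-Requested-With": "XMLHttpRequest" // many sites use this to tag AJAX calls
  }
};

http.get(options, function (res) {
  res.setEncoding("utf8");
  var body = "";
  res.on("data", function (chunk) { body += chunk; });
  res.on("end", function () {
    // the server answers exactly as it would answer the browser's XHR,
    // typically with JSON or an HTML fragment you can parse directly
    console.log(body);
  });
});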

Different libraries for different languages offer different capabilities here. For Ruby, I have seen a lot of people using mechanize.

If data extraction is your only goal, then you'll almost always be able to get what you need by mimicking HTTP requests this way. No JavaScript required.

Note - Since you mentioned Facebook, I should mention that scraping Facebook specifically can be exceptionally difficult (although not impossible), because Facebook has measures in place to detect automated access (they use more than just captchas); they will disable an account if they see suspicious activity coming from it. It is, after all, against their terms of service (section 3.2).
