javascript - Node js requests and cheerio wait for page to fully load - Stack Overflow

I'm trying to scrape images off a page, but the page returns a placeholder src attribute if it isn't fully loaded (it takes about 0.5 seconds to load fully). How would I make request wait?

I tried this:

function findCommonMovies(movie, callback){

    request('http://www.imdb.com/find?ref_=nv_sr_fn&q='+ movie +'&s=all', function (error, response, body) {
      if (error){
          return
      }else{
          var $ = cheerio.load(body);
          var title = $(".result_text").first().text().split("(")[0].split(" ").join('')
          var commonMovies = []
          // var endurl = $("a[name=tt] .result_text a").attr("href")
          var endurl = $('a[name=tt]').parent().parent().find(".findSection .findList .findResult .result_text a").attr("href");


          request('http://www.imdb.com' + endurl, function (err, response, body) {

              if (err){
                  console.log(err)
              }else{

                  setInterval(function(){var $ = cheerio.load(body)}, 2000)

                  $(".rec_page .rec_item a img").each(function(){


                    var title = $(this).attr("title")
                    var image = $(this).attr("src")

                    commonMovies.push({title: title, image: image})
                  });
              }
              callback(commonMovies)
          });
      }
    });

}
findCommonMovies("Gotham", function(common){
  console.log(common)
})

Asked Oct 5, 2017 at 22:45 by user8708917; edited Oct 6, 2017 at 0:15.

4 Answers

Cheerio is not a web browser; it is just an HTML parser. That means the JavaScript on the page that makes the async requests is never executed.

So you can't do what you want unless you use something that acts as a web browser. Selenium, for example, provides an API for driving many web browsers.

You need to download the Selenium client and keep it running for as long as you want to make requests to sites that load content asynchronously.

Also, you are going to need a wrapper for the language you are using and for the webdriver you want. The webdriver is what adds support for different web browsers.

I assume you are using Node.js or something similar based on JS, so here you go.

And be sure to check the API.

Hope to be of some help.

You could also check PhantomJS.

You can set a timeout:

var options = {
    url : 'http://www.imdb.com/find?ref_=nv_sr_fn&q='+ movie +'&s=all',
    timeout: 10000 // wait up to 10 seconds for the response (value is in milliseconds)
};
request(options, function(err, response, body){
    if (err) {
      console.log(err);
      return;
    }
    // do what you want here
});

setTimeout(function, milliseconds to wait) schedules the function to run once after the delay you choose, for example: setTimeout(function(){var $ = cheerio.load(body)}, 2000)
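
A timer does not pause the surrounding code, though. The sketch below is a minimal illustration of what happens when the parsing step is moved into a timer callback. It uses only Node's built-in timers, and the strings are purely illustrative stand-ins for the parsed pages (they are not from the question):

```javascript
const order = [];
const parsed = 'first page'; // stands in for the outer `$`

setTimeout(function () {
  // A new variable, local to this callback; the outer `parsed` is unchanged.
  const parsed = 'second page';
  order.push('timer fired: ' + parsed);
}, 10);

// This line runs immediately, long before the timer fires.
order.push('after timer call: ' + parsed);
console.log(order); // [ 'after timer call: first page' ]
```

The code after the timer call still sees the old value, so a selector run there would query the first page rather than the second. With setInterval, as in the question, the callback also repeats every 2 seconds until it is cleared, which keeps the process alive.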

It appears to me like your callback is located in the wrong place and there should be no need for any timer. When request() calls its callback, the whole response is ready so no need for a timer.

Here's the code with the callback in the right place and also changed so that it has an error argument so the caller can propagate and detect errors:

var request = require('request');
var cheerio = require('cheerio');

function findCommonMovies(movie, callback){
    request('http://www.imdb.com/find?ref_=nv_sr_fn&q='+ movie +'&s=all', function (error, response, body) {
      if (error) {
          callback(error);
          return;
      } else {
          var $ = cheerio.load(body);
          var title = $(".result_text").first().text().split("(")[0].split(" ").join('')
          var commonMovies = [];
          // var endurl = $("a[name=tt] .result_text a").attr("href")
          var endurl = $('a[name=tt]').parent().parent().find(".findSection .findList .findResult .result_text a").attr("href");
          request('http://www.imdb.com' + endurl, function (err, response, body) {
              if (err) {
                  console.log(err)
                  callback(err); 
              } else {
                  var $ = cheerio.load(body);
                  $(".rec_page .rec_item a img").each(function(){
                    var title = $(this).attr("title");
                    var image = $(this).attr("src");
                    commonMovies.push({title, image});
                  });
                  callback(null, commonMovies);
              }
          });
       }
    });
}

findCommonMovies("Gotham", function(err, common) {
  if (err) {
     console.log(err);
  } else {
     console.log(common)
  }
});

Note: This will access ONLY the HTML markup served by the server for the URLs you request. If those pages have content that is inserted by browser JavaScript, that content will not be present in what you get here, and no delay will make it appear. That's because cheerio does not run browser JavaScript; it JUST parses the HTML that the server originally sends. To run browser JavaScript, you need a more complete browser engine than cheerio provides, such as PhantomJS, which will actually run the page's JavaScript.
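
To make the point concrete, here is a rough sketch that uses only Node built-ins. The response body, the `data-src` attribute, and the `readImageAttributes` helper are all hypothetical, and the regular expression is a deliberately simple stand-in for a real parser such as cheerio, not a description of how cheerio works. It shows that reading the markup never runs the page's script, so the placeholder stays in `src` however long you wait:

```javascript
// Hypothetical response body: the server sends a placeholder image, and a
// script (which only a browser would run) swaps in the real URL later.
const body = `
<div class="rec_item">
  <a href="/title/tt0000001/">
    <img title="Example Movie" src="/placeholder.gif" data-src="/real-poster.jpg">
  </a>
</div>
<script>
  document.querySelectorAll('img[data-src]').forEach(function (img) {
    img.src = img.dataset.src;
  });
</script>`;

// Simple stand-in for an HTML parser: collects the attributes of every
// <img> tag. It reads text only and never executes <script> content.
function readImageAttributes(html) {
  const images = [];
  for (const tag of html.match(/<img\b[^>]*>/g) || []) {
    const attrs = {};
    for (const m of tag.matchAll(/([\w-]+)="([^"]*)"/g)) {
      attrs[m[1]] = m[2];
    }
    images.push(attrs);
  }
  return images;
}

const images = readImageAttributes(body);
console.log(images[0].src);         // /placeholder.gif
console.log(images[0]['data-src']); // /real-poster.jpg
```

If the served HTML happens to carry the real URL in another attribute, as the hypothetical `data-src` does here, reading that attribute avoids the need for a browser engine. Whether a given site does this has to be checked in its raw markup.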
