
javascript - Puppeteer infinite scrolling - Stack Overflow


I have a website to scrape and what I need to scrape is inside a div that has an id left_container_scroll that contains multiple a tags. This div has the infinite scroll on it and I can't make it work. I am trying to make the program scroll in that div.

I have tried to do something like this, but I get an error: Evaluation failed: ReferenceError: elem is not defined

htmlTag = '#left_container_scroll';

//I think I am doing something wrong here
let elem = await page.evaluate((htmlTag)=> {
    return document.querySelector(htmlTag);
})

previousHeight =  await page.evaluate("elem.scrollHeight");
await page.evaluate("window.scrollTo(0,elem.scrollHeight)");
await page.waitForFunction(`elem.scrollHeight > ${previousHeight}`);

asked Aug 26, 2019 at 11:34 by alice

3 Answers


Some of this JavaScript code runs inside the browser, some inside the Node.js runtime, and they can't see each other's variables.

For example, page.evaluate("elem.scrollHeight") cannot see the elem variable you've set above, since the variable lives in the Node.js runtime and the code elem.scrollHeight is run inside the browser (there is a similar issue with htmlTag earlier: it is declared as a parameter of the function but never passed in).
To pass values from Node.js to the browser, you would usually give additional arguments to page.evaluate.
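
As a minimal sketch of that argument passing (getScrollHeight is a made-up helper name, not part of Puppeteer), the selector string stays in Node.js and is handed to the browser-side function as its parameter. Only a plain number comes back, which avoids trying to return a DOM element across the boundary:

```javascript
// Illustrative helper; `getScrollHeight` is a hypothetical name.
async function getScrollHeight(page, selector) {
  // `selector` lives in Node.js. Puppeteer serializes it and passes it
  // to the browser-side function as `sel`.
  return page.evaluate(sel => {
    const el = document.querySelector(sel);
    // Return a plain value; null if nothing matches the selector.
    return el ? el.scrollHeight : null;
  }, selector);
}
```

Usage would be something like `const h = await getScrollHeight(page, '#left_container_scroll')`.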

Something like this might work (I haven't tested whether the scrolling works as intended, but at least Puppeteer runs the code):

// returns a Puppeteer ElementHandle (not browser DOM element)
let elem = await page.$(htmlTag)
// passes the ElementHandle back to the browser code (Puppeteer converts it back to DOM element)
let previousHeight = await page.evaluate(e => e.scrollHeight, elem)
// again, pass ElementHandle
await page.evaluate(e => window.scrollTo(0, e.scrollHeight), elem)
// pass both ElementHandle and previousHeight to the browser side
await page.waitForFunction((e, ph) => e.scrollHeight > ph, {}, elem, previousHeight)   
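
Since the question is about scrolling inside the div rather than the whole window, a variant that sets the element's own scrollTop may be closer to what is needed. This is an untested sketch under that assumption; scrollContainerOnce and timeoutMs are hypothetical names, and it treats any waitForFunction failure (including a timeout) as "no new content":

```javascript
// Sketch: scroll a container element (not the window) and wait for it to grow.
// `scrollContainerOnce` and `timeoutMs` are illustrative names, not Puppeteer APIs.
async function scrollContainerOnce(page, selector, timeoutMs = 5000) {
  const elem = await page.$(selector); // ElementHandle, or null if not found
  if (!elem) {
    throw new Error(`No element matches ${selector}`);
  }
  const previousHeight = await page.evaluate(e => e.scrollHeight, elem);
  // Scroll the element itself to its bottom.
  await page.evaluate(e => { e.scrollTop = e.scrollHeight; }, elem);
  try {
    await page.waitForFunction(
      (e, ph) => e.scrollHeight > ph,
      { timeout: timeoutMs },
      elem,
      previousHeight
    );
    return true; // the container grew, so more content was loaded
  } catch (err) {
    return false; // assumption: no growth means the end of the list (or a slow load)
  }
}
```

Calling it in a loop, e.g. `while (await scrollContainerOnce(page, '#left_container_scroll')) {}`, would keep scrolling until the container stops growing.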

I made a fairly simple solution the last time I was web scraping; hopefully it helps!

let lastHeight = await page.evaluate('document.body.scrollHeight');

while (true) {
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
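    // sleep a bit (page.waitForTimeout is deprecated and missing from recent Puppeteer releases)
    await new Promise(resolve => setTimeout(resolve, 2000));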
    let newHeight = await page.evaluate('document.body.scrollHeight');
    if (newHeight === lastHeight) {
        break;
    }
    lastHeight = newHeight;
}

I would take into consideration the elements you want to pull; I assume that by using infinite scrolling you are looking to get more elements. I would keep a base count of the elements you want to pull, then run a loop that checks whether the previous element count equals the new element count. When they match, nothing new has loaded, so you can break out of the loop and extract the data you want. In my case, I'd add another check against an element_limit, e.g. 100, that breaks the loop regardless of whether loading has finished. You may also want to consider random timeouts of 1-5 seconds between scrolls to give the page time to load; remember that not all pages are created equal, and the network connection is also a concern.
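
A rough sketch of that idea, assuming the items can be counted with a CSS selector. The names scrollUntilStable, randomDelay, elementLimit, minDelayMs and maxDelayMs are made up for illustration; page.$$eval and page.$eval are the Puppeteer calls doing the work:

```javascript
// Resolve after a random delay between minMs and maxMs (inclusive).
function randomDelay(minMs, maxMs) {
  const ms = minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Scroll the container until the item count stops changing or hits the limit.
// All option names here are hypothetical, chosen for this example.
async function scrollUntilStable(page, containerSelector, itemSelector, options = {}) {
  const { elementLimit = 100, minDelayMs = 1000, maxDelayMs = 5000 } = options;
  let previousCount = -1;
  while (true) {
    // Count the matching elements currently in the page.
    const count = await page.$$eval(itemSelector, items => items.length);
    // Stop when nothing new appeared since the last scroll, or the limit is reached.
    if (count === previousCount || count >= elementLimit) {
      return count;
    }
    previousCount = count;
    // Scroll the container itself to its bottom to trigger the next load.
    await page.$eval(containerSelector, e => { e.scrollTop = e.scrollHeight; });
    await randomDelay(minDelayMs, maxDelayMs);
  }
}
```

For the page in the question this might be called as `await scrollUntilStable(page, '#left_container_scroll', '#left_container_scroll a')`, after which the a tags can be extracted. A single unchanged count is treated as the end of the list, so a slow network could stop it early; a longer delay range is the simple remedy.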
