I have a website to scrape, and the content I need is inside a div with the id left_container_scroll that contains multiple a tags. This div uses infinite scroll and I can't make it work. I am trying to make the program scroll inside that div.
I have tried something like this, but I get an error: Evaluation failed: ReferenceError: elem is not defined
htmlTag = '#left_container_scroll';
//I think I am doing something wrong here
let elem = await page.evaluate((htmlTag)=> {
return document.querySelector(htmlTag);
})
previousHeight = await page.evaluate("elem.scrollHeight");
await page.evaluate("window.scrollTo(0,elem.scrollHeight)");
await page.waitForFunction(`elem.scrollHeight > ${previousHeight}`);
3 Answers
Some of this JavaScript code runs inside the browser and some inside the Node.js runtime, and the two can't see each other's variables.
For example, page.evaluate("elem.scrollHeight") cannot see the elem variable you set above, because that variable lives in the Node.js runtime while the code elem.scrollHeight is run inside the browser (there is a similar issue with htmlTag earlier).
To pass values from Node.js to the browser, you would usually give additional arguments to page.evaluate.
Something like this might work (I haven't tested whether the scrolling behaves as intended, but at least Puppeteer runs the code):
// returns a Puppeteer ElementHandle (not browser DOM element)
let elem = await page.$(htmlTag)
// passes the ElementHandle back to the browser code (Puppeteer converts it back to DOM element)
let previousHeight = await page.evaluate(e => e.scrollHeight, elem)
// again, pass the ElementHandle; scroll the div itself, since window.scrollTo would move the page rather than the container
await page.evaluate(e => { e.scrollTop = e.scrollHeight }, elem)
// pass both ElementHandle and previousHeight to the browser side
await page.waitForFunction((e, ph) => e.scrollHeight > ph, {}, elem, previousHeight)
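If you need to keep scrolling until nothing more loads, the steps above can be wrapped in a loop. This is only a rough sketch: the helper name scrollContainerToEnd and the maxRounds and timeout options are names I made up for illustration, and it assumes the div itself (not the window) is the scrolling container, so it sets scrollTop on the element. It also treats any rejection from page.waitForFunction as "no more content", which is a simplification.

```javascript
// Sketch: scroll a container until its scrollHeight stops growing.
// scrollContainerToEnd, maxRounds and timeout are illustrative names, not Puppeteer API.
async function scrollContainerToEnd(page, selector, { maxRounds = 50, timeout = 5000 } = {}) {
  // ElementHandle for the container (null if nothing matches)
  const handle = await page.$(selector);
  if (!handle) {
    throw new Error(`No element matches ${selector}`);
  }

  let rounds = 0;
  while (rounds < maxRounds) {
    const previousHeight = await page.evaluate(e => e.scrollHeight, handle);
    // Assumption: the div is the scroll container, so move its own scrollTop
    await page.evaluate(e => { e.scrollTop = e.scrollHeight; }, handle);
    try {
      // Wait until new content has made the container taller
      await page.waitForFunction(
        (e, ph) => e.scrollHeight > ph,
        { timeout },
        handle,
        previousHeight
      );
    } catch (err) {
      // Simplification: any rejection (normally a timeout) ends the loop
      break;
    }
    rounds++;
  }
  // Number of times new content appeared
  return rounds;
}
```

Usage would be something like await scrollContainerToEnd(page, '#left_container_scroll') before extracting the a tags. The maxRounds cap is there so a truly endless feed cannot keep the script running forever.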
I made a fairly simple solution the last time I was web scraping; hopefully it helps. Note that it scrolls the whole page (document.body) rather than a specific div:
let lastHeight = await page.evaluate('document.body.scrollHeight');
while (true) {
await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
await new Promise(resolve => setTimeout(resolve, 2000)); // sleep a bit (page.waitForTimeout was removed in newer Puppeteer versions)
let newHeight = await page.evaluate('document.body.scrollHeight');
if (newHeight === lastHeight) {
break;
}
lastHeight = newHeight;
}
I would take into consideration the elements you want to pull; I assume that with infinite scrolling you are looking to get more of them. I would keep a counter of the elements you want to pull, then run a loop that checks whether the element count after scrolling equals the previous count. When the count stops changing, you can break out of the loop and extract the data you want. In my case, I'd add another check against an element_limit, e.g. 100, so the loop breaks once that many elements are present, whether or not the page has finished loading more.
You may also want to use random timeouts of 1 to 5 seconds between scrolls. This gives the page time to load; remember that not all pages are created equal, and the network connection is also a concern.
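The counting approach described above could be sketched roughly as follows. The function name scrollUntilCountStable and the options limit, minDelayMs and maxDelayMs are hypothetical names chosen for this example; it assumes the container scrolls via its own scrollTop and that the items can be matched with a selector such as '#left_container_scroll a'.

```javascript
// Small promise-based sleep helper
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Sketch: scroll until the number of matching items stops growing
// or reaches `limit`. All option names here are illustrative.
async function scrollUntilCountStable(
  page,
  containerSelector,
  itemSelector,
  { limit = 100, minDelayMs = 1000, maxDelayMs = 5000 } = {}
) {
  // Count the items currently in the DOM
  const countItems = () => page.$$eval(itemSelector, els => els.length);

  let count = await countItems();
  while (count < limit) {
    // Assumption: the container scrolls itself, so set its scrollTop
    await page.$eval(containerSelector, e => { e.scrollTop = e.scrollHeight; });
    // Random pause between minDelayMs and maxDelayMs so content can load
    await sleep(minDelayMs + Math.random() * (maxDelayMs - minDelayMs));

    const newCount = await countItems();
    if (newCount === count) {
      break; // nothing new appeared, assume the end was reached
    }
    count = newCount;
  }
  return count;
}
```

A call might look like await scrollUntilCountStable(page, '#left_container_scroll', '#left_container_scroll a', { limit: 100 }). One caveat with this design: a slow response that takes longer than the random pause looks the same as the end of the list, so a larger maxDelayMs may be safer on slow connections.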