I want to use Puppeteer to respond to page updates. The page shows items and when I leave the page open new items can appear over time. E.g. every 10 seconds a new item is added.
I can use the following to wait for an item on the initial load of the page:
await page.waitForSelector(".item");
console.log("the initial items have been loaded");
How can I wait for / catch future items? I would like to achieve something like this (pseudo code):
await page.goto('http://mysite');
await page.waitForSelector(".item");
// check items (=these initial items)
// event when receiving new items:
// check item(s) (= the additional [or all] items)
edited Mar 21, 2021 at 18:36
hardkoded
asked Jan 9, 2019 at 11:27
wivku
3 Answers
You can use page.exposeFunction to make a local (Node-side) function callable from inside the page:
await page.exposeFunction('getItem', function(a) {
  console.log(a);
});
Then you can use page.evaluate to create a MutationObserver inside the page that listens for new nodes created under a parent node and passes them to the exposed function.
This example (just an idea, not finished work) scrapes the Python chat room on Stack Overflow and prints new items as they are created in that chat.
var baseurl = 'https://chat.stackoverflow./rooms/6/python';
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto(baseurl);
// Make a Node-side function callable from inside the page.
await page.exposeFunction('getItem', function(a) {
  console.log(a);
});
await page.evaluate(() => {
  var observer = new MutationObserver((mutations) => {
    for (var mutation of mutations) {
      // Report every added node, not only the first one in each record.
      for (var node of mutation.addedNodes) {
        getItem(node.textContent);
      }
    }
  });
  observer.observe(document.getElementById("chat"), { attributes: false, childList: true, subtree: true });
});
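To tie this back to the question's .item selector, here is a minimal sketch of the same idea that only reports added elements matching a selector. The names installItemObserver, watchItems and onNewItem are made up for this example, and observing document.body is an assumption; a narrower parent node is usually cheaper.

```javascript
// Runs inside the page: reports every added element matching `selector`
// to the exposed function `onNewItem` (a hypothetical name).
function installItemObserver(selector) {
  const observer = new MutationObserver(mutations => {
    for (const mutation of mutations) {
      for (const node of mutation.addedNodes) {
        if (node.nodeType !== 1) continue; // skip text and comment nodes
        // The added node may itself be an item or may wrap several items.
        const items = node.matches(selector)
          ? [node]
          : [...node.querySelectorAll(selector)];
        for (const item of items) onNewItem(item.textContent);
      }
    }
  });
  // Assumption: items can appear anywhere under <body>.
  observer.observe(document.body, {childList: true, subtree: true});
}

// Runs in Node: exposes `callback` to the page, then installs the observer.
async function watchItems(page, selector, callback) {
  await page.exposeFunction("onNewItem", callback);
  await page.evaluate(installItemObserver, selector);
}
```

Usage would look something like await watchItems(page, ".item", text => console.log(text)) after the initial page.goto.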
As an alternative to the excellent current answer, which injects a MutationObserver using evaluate and forwards the data to an exposed Node function, Puppeteer offers a higher-level function called page.waitForFunction that blocks on an arbitrary predicate and uses either a MutationObserver or requestAnimationFrame under the hood to determine when to re-evaluate the predicate.
Calling page.waitForFunction in a loop might add overhead, since each new call involves registering a fresh observer or RAF. You'd have to profile for your use case. This isn't something I'd worry much about prematurely, though.
That said, the RAF option may provide tighter latency than MO, at the cost of some extra CPU cycles to poll constantly.
Here's a minimal example on the following site that offers a periodically updating feed:
const wait = ms => new Promise(r => setTimeout(r, ms));
const r = (lo, hi) => ~~(Math.random() * (hi - lo) + lo);
const randomString = n =>
  [...Array(n)].map(() => String.fromCharCode(r(97, 123))).join("");
(async () => {
  for (let i = 0; i < 500; i++) {
    const el = document.createElement("div");
    document.body.appendChild(el);
    el.innerText = randomString(r(5, 15));
    await wait(r(1000, 5000));
  }
})();
const puppeteer = require("puppeteer");
const html = `<!DOCTYPE html>
<html><body><div class="container"></div><script>
const wait = ms => new Promise(r => setTimeout(r, ms));
const r = (lo, hi) => ~~(Math.random() * (hi - lo) + lo);
const randomString = n =>
  [...Array(n)].map(() => String.fromCharCode(r(97, 123))).join("")
;
(async () => {
  for (;;) {
    const el = document.createElement("div");
    document.querySelector(".container").appendChild(el);
    el.innerText = randomString(r(5, 15));
    await wait(r(1000, 5000));
  }
})();
</script></body></html>`;
let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  await page.setContent(html);
  for (;;) {
    await page.waitForFunction((el, oldLength) =>
      el.children.length > oldLength, // predicate
      {polling: "mutation" /* or: "raf" */, timeout: 10**8}, // wFF options
      await page.$(".container"), // elem to watch
      await page.$eval(".container", el => el.children.length), // oldLength
    );
    const selMostRecent = ".container div:last-child";
    console.log(await page.$eval(selMostRecent, el => el.textContent));
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Note that this example is contrived; if multiple items are added to the feed at once, an item can be skipped. It'd be safer to grab all items beyond the oldLength. You'll almost certainly need to adjust this code to match your feed's specific behavior.
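As a rough sketch of grabbing everything beyond the old length, the loop could track how many items it has already seen and emit the rest after each wake-up. This assumes the feed is append-only and that every direct child of the container is an item; watchFeed and onItem are names invented for this example.

```javascript
// Calls `onItem` with the text of every item added to the container,
// including items that arrive together in one batch.
async function watchFeed(page, containerSelector, onItem) {
  const itemSelector = `${containerSelector} > *`;
  let seen = 0; // number of items already reported
  for (;;) {
    // Block until the page holds more items than we have reported.
    await page.waitForFunction(
      (sel, n) => document.querySelectorAll(sel).length > n,
      {polling: "mutation", timeout: 0}, // timeout: 0 disables the timeout
      itemSelector,
      seen
    );
    const texts = await page.$$eval(itemSelector, els =>
      els.map(el => el.textContent)
    );
    // Everything past the old length is new, however many items that is.
    for (const text of texts.slice(seen)) onItem(text);
    seen = texts.length;
  }
}
```

In the script above, this would replace the inner for (;;) loop with something like await watchFeed(page, ".container", console.log). The loop only ends when waitForFunction rejects, for example when the page closes.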
See also:
- Pass a function inside page.waitForFunction() with puppeteer, which shows a generic waitForTextChange helper function that wraps page.waitForFunction.
- Realtime scrape a chat using Nodejs, which aptly suggests the alternative approach of intercepting API responses as they populate the feed, when possible.
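For the response-interception approach, a minimal sketch might look like the following. It assumes the feed is populated by JSON responses whose URL contains a known fragment; onFeedResponse and the urlPart parameter are hypothetical, and you'd find the real endpoint in your browser's network tab.

```javascript
// Calls `callback` with the parsed JSON body of every successful response
// whose URL contains `urlPart`. Both arguments are placeholders.
function onFeedResponse(page, urlPart, callback) {
  page.on("response", async response => {
    if (!response.url().includes(urlPart) || !response.ok()) return;
    let data;
    try {
      data = await response.json();
    } catch (err) {
      return; // body was not JSON or could not be read; skip it
    }
    callback(data);
  });
}
```

Register the listener before page.goto so early responses aren't missed, e.g. onFeedResponse(page, "/api/items", items => console.log(items)), where "/api/items" is a made-up path.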
A simpler idea is to wait for text to change: you can use the :last-child selector to wait for the text of the last item to change:
await page.evaluate(sel => {
  let originalText = document.querySelector(sel).innerText
  return new Promise(resolve => {
    // Re-check the last item's text every 500 ms until it differs.
    let interval = setInterval(() => {
      if (originalText !== document.querySelector(sel).innerText) {
        clearInterval(interval)
        resolve()
      }
    }, 500)
  })
}, '.item:last-child') // class selector, matching the question's ".item"
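The same polling idea can also be written with page.waitForFunction, which accepts a numeric polling interval in milliseconds. This is only a sketch: waitForNewText is a made-up helper name, and it assumes an element matching the selector already exists when it is called.

```javascript
// Resolves with the new text once the element matching `selector`
// shows text different from what it showed when this was called.
async function waitForNewText(page, selector, options = {polling: 500, timeout: 0}) {
  const original = await page.$eval(selector, el => el.innerText);
  await page.waitForFunction(
    (sel, old) => {
      const el = document.querySelector(sel);
      return el !== null && el.innerText !== old;
    },
    options, // polling: 500 re-checks every 500 ms; timeout: 0 waits forever
    selector,
    original
  );
  return page.$eval(selector, el => el.innerText);
}
```

Called in a loop, e.g. console.log(await waitForNewText(page, ".item:last-child")), this logs each newly appended item, though it shares the limitation that items added in the same interval (or a new item with identical text) can be missed.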