
javascript - puppeteer wait for page/DOM updates - respond to new items that are added after initial loading - Stack Overflow


I want to use Puppeteer to respond to page updates. The page shows items and when I leave the page open new items can appear over time. E.g. every 10 seconds a new item is added.

I can use the following to wait for an item on the initial load of the page:

await page.waitForSelector(".item"); // page.waitFor was deprecated and later removed from Puppeteer
console.log("the initial items have been loaded")

How can I wait for / catch future items? I would like to achieve something like this (pseudo code):

await page.goto('http://mysite');
await page.waitForSelector(".item");
// check items (=these initial items)

// event when receiving new items:
// check item(s) (= the additional [or all] items)

Asked Jan 9, 2019 at 11:27 by wivku; edited Mar 21, 2021 at 18:36 by hardkoded.

3 Answers


You can use exposeFunction to expose a local function:

await page.exposeFunction('getItem', function(a) {
    console.log(a);
});

Then you can use page.evaluate to create an observer and listen to new nodes created inside a parent node.

This example (a rough idea, not finished work) scrapes the Python chat room on Stack Overflow and prints new items as they are created in that chat.

const baseurl = 'https://chat.stackoverflow.com/rooms/6/python';
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto(baseurl);

await page.exposeFunction('getItem', function(a) {
    console.log(a);
});

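// Runs in the browser: watch #chat and report the first node added by each mutation.
await page.evaluate(() => {
    const observer = new MutationObserver((mutations) => {
        for (const mutation of mutations) {
            if (mutation.addedNodes.length) {
                // getItem is the Node-side function exposed above
                getItem(mutation.addedNodes[0].innerText);
            }
        }
    });
    // childList + subtree: fire for nodes added anywhere below #chat
    observer.observe(document.getElementById("chat"), { attributes: false, childList: true, subtree: true });
});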
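
The observer above forwards only the first node of each mutation record, so items added in a batch, or nested inside a wrapper element, can be missed. A minimal sketch of a variant that forwards every added element matching a selector follows. The names installObserver, watchItems and onNewItem, and the two selector parameters, are illustrative and not part of Puppeteer or of the code above.

```javascript
// Page-side function. Puppeteer serializes it, so it must not rely on
// variables from the Node script. All names here are illustrative.
function installObserver(containerSel, itemSel) {
  const observer = new MutationObserver(mutations => {
    for (const mutation of mutations) {
      for (const node of mutation.addedNodes) {
        if (node.nodeType !== Node.ELEMENT_NODE) continue; // skip text/comment nodes
        // The added node may be an item itself or a wrapper that contains items.
        const items = node.matches(itemSel)
          ? [node]
          : [...node.querySelectorAll(itemSel)];
        for (const item of items) {
          window.onNewItem(item.innerText); // exposed by watchItems below
        }
      }
    }
  });
  observer.observe(document.querySelector(containerSel), {
    childList: true,
    subtree: true,
  });
}

// Node-side helper: expose the handler, then install the observer in the page.
async function watchItems(page, containerSel, itemSel, handler) {
  await page.exposeFunction("onNewItem", handler);
  await page.evaluate(installObserver, containerSel, itemSel);
}
```

A call might look like `await watchItems(page, "#chat", ".message", text => console.log(text));`, where `.message` is a guess at the item selector and would need to be checked against the real page.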

The excellent current answer injects a MutationObserver using evaluate and forwards the data to an exposed Node function. As an alternative, Puppeteer offers a higher-level function called page.waitForFunction that blocks on an arbitrary predicate. Under the hood it uses either a MutationObserver or requestAnimationFrame (RAF) to determine when to re-evaluate the predicate.

Calling page.waitForFunction in a loop might add overhead since each new call involves registering a fresh observer or RAF. You'd have to profile for your use case. This isn't something I'd worry much about prematurely, though.

That said, the RAF option may provide tighter latency than MO at the cost of some extra CPU cycles to poll constantly.

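Here's a minimal example. The first snippet is the page script, which simulates a periodically updating feed. The second is the Puppeteer script, which embeds an equivalent page via setContent and logs each new item: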

const wait = ms => new Promise(r => setTimeout(r, ms));
const r = (lo, hi) => ~~(Math.random() * (hi - lo) + lo);

const randomString = n =>
  [...Array(n)].map(() => String.fromCharCode(r(97, 123))).join("");

(async () => {
  for (let i = 0; i < 500; i++) {
    const el = document.createElement("div");
    document.body.appendChild(el);
    el.innerText = randomString(r(5, 15));
    await wait(r(1000, 5000));
  }
})();

const puppeteer = require("puppeteer");

const html = `<!DOCTYPE html>
<html><body><div class="container"></div><script>
const wait = ms => new Promise(r => setTimeout(r, ms));
const r = (lo, hi) => ~~(Math.random() * (hi - lo) + lo);
const randomString = n =>
  [...Array(n)].map(() => String.fromCharCode(r(97, 123))).join("")
;
(async () => {
  for (;;) {
    const el = document.createElement("div");
    document.querySelector(".container").appendChild(el);
    el.innerText = randomString(r(5, 15));
    await wait(r(1000, 5000));
  }
})();
</script></body></html>`;

let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  await page.setContent(html);
  
  for (;;) {
    await page.waitForFunction((el, oldLength) =>
      el.children.length > oldLength,                           // predicate
      {polling: "mutation" /* or: "raf" */, timeout: 10**8},    // wFF options
      await page.$(".container"),                               // elem to watch
      await page.$eval(".container", el => el.children.length), // oldLength
    );
    const selMostRecent = ".container div:last-child";
    console.log(await page.$eval(selMostRecent, el => el.textContent));
  }
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Note that this example is contrived; if multiple items are added to the feed at once, an item can be skipped. It'd be safer to grab all items beyond the oldLength. You'll almost certainly need to adjust this code to match your feed's specific behavior.
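
To avoid skipping items when several arrive at once, one option is to read every child beyond the previously seen count. The sketch below assumes items are only appended to the container and never removed or reordered. watchFeed, onItems and maxRounds are hypothetical names.

```javascript
// Calls onItems with an array of the texts added since the previous round.
// Assumes children are append-only; maxRounds exists mainly to bound the loop.
async function watchFeed(page, containerSel, onItems, maxRounds = Infinity) {
  let seen = await page.$eval(containerSel, el => el.children.length);

  for (let round = 0; round < maxRounds; round++) {
    // Block until the container has more children than we have processed.
    await page.waitForFunction(
      (sel, oldLength) =>
        document.querySelector(sel).children.length > oldLength,
      { polling: "mutation", timeout: 0 }, // timeout: 0 disables the timeout
      containerSel,
      seen
    );

    // Read all child texts and keep only the ones past the old count.
    const texts = await page.$eval(containerSel, el =>
      [...el.children].map(child => child.textContent)
    );
    const fresh = texts.slice(seen);
    seen = texts.length;
    onItems(fresh);
  }
}
```

With the example page above, `await watchFeed(page, ".container", fresh => console.log(fresh));` would log each batch of new items as an array.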

See also:

  • Pass a function inside page.waitForFunction() with puppeteer which shows a generic waitForTextChange helper function that wraps page.waitForFunction.
  • Realtime scrape a chat using Nodejs which aptly suggests the alternative approach of intercepting API responses as they populate the feed, when possible.
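
When the feed is populated by ordinary HTTP requests, the response-interception approach mentioned in the second link might look roughly like this. The helper name watchResponses and the URL fragment in the usage line are assumptions. A feed delivered over WebSockets would need a different hook.

```javascript
// Calls onData with the parsed JSON body of every response whose URL
// contains urlPart. watchResponses is an illustrative helper name.
function watchResponses(page, urlPart, onData) {
  page.on("response", async response => {
    if (!response.url().includes(urlPart)) return; // not the feed endpoint
    try {
      onData(await response.json());
    } catch {
      // Body was not JSON (or was unavailable); ignore it.
    }
  });
}
```

For example, `watchResponses(page, "/api/feed", data => console.log(data));` registered before `page.goto(...)`, where `/api/feed` stands in for whatever endpoint the site actually calls.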

A simpler idea is to wait for text to change: use the :last-child selector and wait until the text of the last item differs from its original value:

await page.evaluate(sel => {
  // Remember the last item's text, then poll every 500 ms until it differs.
  const originalText = document.querySelector(sel).innerText
  return new Promise(resolve => {
    const interval = setInterval(() => {
      if (originalText !== document.querySelector(sel).innerText) {
        clearInterval(interval)
        resolve()
      }
    }, 500)
  })
}, '.item:last-child') // class selector, matching the question's .item elements
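
The same check can also be written with page.waitForFunction, which avoids the hand-rolled interval and supports a timeout. A sketch, where waitForTextChange is an illustrative helper rather than a Puppeteer method:

```javascript
// Resolves once the text of the element matching `selector` differs from
// what it was when the helper was called. `options` is passed through to
// page.waitForFunction (e.g. {timeout: 30000, polling: "mutation"}).
async function waitForTextChange(page, selector, options = {}) {
  const original = await page.$eval(selector, el => el.innerText);
  return page.waitForFunction(
    (sel, text) => {
      const el = document.querySelector(sel);
      return el !== null && el.innerText !== text;
    },
    options,
    selector,
    original
  );
}
```

Usage would be along the lines of `await waitForTextChange(page, ".item:last-child", {timeout: 30000});`, which throws if nothing changes within 30 seconds.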

  
