
javascript - Can't find button using playwright (or puppeteer) for web scraping - Stack Overflow



There are many similar questions (like this one: Scraping Websites With Playwright), yet I did not find any solution to this:

I have this url: https://www.derstandard.at/search?n=&fd=2025-02-17&td=2025-03-06&s=score&query=apple

Which leads to a page that looks like this.

I'm interested in the number 71 (circled in red). I think much of the content is server-rendered or fetched after load. I first tried RSelenium, as I'm more familiar with R. Yet, on my ARM Mac, I could not connect to the server on localhost...

I am now using Playwright with Node to get this number, but I am still failing. My script looks like this:

const { firefox } = require("playwright");

(async () => {
  // Launch Firefox with a visible window (headless: false)
  const browser = await firefox.launch({ headless: false });
  const page = await browser.newPage();

  // Navigate to the website
  const url =
    "https://www.derstandard.at/search?n=&fd=2025-02-17&td=2025-03-06&s=score&query=ukraine";
  await page.goto(url, { waitUntil: "domcontentloaded" });

  // Check if the button exists before trying to click
  const buttonSelector = ".message-component";
  if (await page.$(buttonSelector)) {
    console.log("Clicking the button...");
    await page.click(buttonSelector);
    await page.waitForTimeout(2000); // Wait a bit for content to update
  } else {
    console.log("Button not found, continuing...");
  }

  // Extract all <h1> elements
  const h1s = await page.evaluate(() =>
    Array.from(document.querySelectorAll("h1")).map((el) => el.innerText.trim())
  );

  console.log("Extracted <h1> elements:", h1s);

  // Close the browser
  await browser.close();
})();

I am first prompted with this consent page:

I have to click the left button to get to the search results. However, I can't access this button either.

If anyone has any idea how I can get this number programmatically, that would be very much appreciated!

asked Mar 10 at 19:47 by Lenn

2 Answers

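const { firefox } = require("playwright");

// URL taken from the question
const url =
  "https://www.derstandard.at/search?n=&fd=2025-02-17&td=2025-03-06&s=score&query=apple";

(async () => {
  const browser = await firefox.launch();
  const page = await browser.newPage();
  // Initial load, which shows the consent popup
  await page.goto(url);
  // Close the banner; the button lives inside an iframe
  await page.frameLocator('iframe[title="SP Consent Message"]').getByRole("button", { name: "Einverstanden" }).click();
  // There's a redirect after the Einverstanden click, so open the page again
  await page.goto(url);
  // Get the root h1 text
  const rootH1Text = await page.locator("section > h1").textContent();
  // Cut the extra text after the count
  console.log(rootH1Text.trim().split(" Ergebnisse")[0]);
  await browser.close();
})();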

That consent button is in an iframe. After clicking, wait for a redirect, and grab the data from one of a few places, such as the title:

const puppeteer = require("puppeteer"); // ^24.4.0

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const iframeElement = await page.waitForSelector("iframe[id^='sp_']");
  const iframe = await iframeElement.contentFrame();
  await iframe.locator("::-p-text(Einverstanden)").click();
  await page.waitForFunction(
    "!window.location.href.includes('consent') && document.title"
  );
  console.log(await page.title()); // => 71 Ergebnisse für „apple“ von 17. Februar 2025 bis 6. März 2025 nach Relevanz sortiert - derStandard.at
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
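
Since the title comes back as one long string, a small parsing step may be handy to isolate the number. The following is a minimal sketch, not part of the original answer: `parseResultCount` is a hypothetical helper, and the handling of "." as a thousands separator is an assumption about how the site formats larger counts.

```javascript
// Hypothetical helper: extract the leading result count from a title or h1
// string such as "71 Ergebnisse für ...". Returns null when no count is found.
function parseResultCount(text) {
  // Digits at the start, optionally grouped with "." (assumed German-style
  // thousands separator), followed by whitespace and "Ergebnis(se)".
  const match = /^\s*(\d{1,3}(?:\.\d{3})+|\d+)(?=\s+Ergebnis)/.exec(text);
  if (!match) return null;
  // Drop the grouping dots before converting to a number.
  return Number(match[1].replace(/\./g, ""));
}
```

It could be called as `parseResultCount(await page.title())` in the Puppeteer script, or on the h1 text in the Playwright one.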

It's also good to block unnecessary resources, so there is room for improvement here.
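
One way to block unnecessary resources in Puppeteer is request interception. This is a rough sketch under assumptions: `blockUnneededResources` and `shouldBlock` are hypothetical names, and the list of blocked types is a guess at what this page can do without. Scripts and subframes are deliberately left alone, because the consent banner is loaded in an iframe.

```javascript
// Resource types assumed to be unnecessary for reading the result count.
const BLOCKED_TYPES = new Set(["image", "media", "font"]);

// Hypothetical helper: decide whether a request of this type should be aborted.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Hypothetical helper: enable interception on a Puppeteer page and abort
// requests for the blocked types, letting everything else through.
async function blockUnneededResources(page) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (shouldBlock(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
```

It would be called once, before `page.goto(url)`, as `await blockUnneededResources(page)`. Whether the page still works with these types blocked would need to be checked against the live site.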
