There are many similar questions (like this: Scraping Websites With Playwright), yet I did not find a solution to this:
I have this url: https://www.derstandard.at/search?n=&fd=2025-02-17&td=2025-03-06&s=score&query=apple
Which leads to a page that looks like this.
I'm interested in the number 71 (circled in red). I think much of the content is server-rendered or fetched dynamically. I first tried RSelenium, as I'm more familiar with R, but on my ARM Mac I could not connect to the server on localhost.
I am now using Playwright with Node to get this number, but I am still failing. My script looks like this:
const { firefox } = require("playwright");

(async () => {
  // Launch Firefox with a visible window (headless is disabled)
  const browser = await firefox.launch({ headless: false });
  const page = await browser.newPage();

  // Navigate to the website
  const url =
    "https://www.derstandard.at/search?n=&fd=2025-02-17&td=2025-03-06&s=score&query=ukraine";
  await page.goto(url, { waitUntil: "domcontentloaded" });

  // Check if the button exists before trying to click
  const buttonSelector = ".message-component";
  if (await page.$(buttonSelector)) {
    console.log("Clicking the button...");
    await page.click(buttonSelector);
    await page.waitForTimeout(2000); // Wait a bit for content to update
  } else {
    console.log("Button not found, continuing...");
  }

  // Extract all <h1> elements
  const h1s = await page.evaluate(() =>
    Array.from(document.querySelectorAll("h1")).map((el) => el.innerText.trim())
  );
  console.log("Extracted <h1> elements:", h1s);

  // Close the browser
  await browser.close();
})();
The script fails because I am prompted with this consent page first:
I have to click the left button to get to the search results. However, I can't access this button either.
If anyone has an idea how I can get this number programmatically, that would be very much appreciated!
Asked Mar 10 at 19:47 by Lenn
2 Answers
// Assumes `page` and `url` are set up as in the question's script.
// Initial load with popup
await page.goto(url);
// Close the consent banner, which lives inside an iframe
await page.frameLocator('iframe[title="SP Consent Message"]').getByRole("button", { name: "Einverstanden" }).click();
// There's a redirect after the Einverstanden click, so open the page again.
await page.goto(url);
// Get the root h1 text
const rootH1Text = await page.locator("section > h1").textContent();
// Cut the extra text after the number
console.log(rootH1Text.trim().split(" Ergebnisse")[0]);
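If the count is needed as a number rather than a string, the leading digits can be parsed out of the heading or title text. This is only a minimal sketch: `parseResultCount` is a hypothetical helper (not part of Playwright or Puppeteer), and treating `.` as a thousands separator is an assumption about how the site formats larger counts.

```javascript
// Hypothetical helper: extract the leading result count from text such as
// "71 Ergebnisse für „apple“ ...". Returns null if no count is found.
function parseResultCount(text) {
  // Match digits at the start, optionally grouped with "." (assumed German
  // thousands separator), followed by "Ergebnis" or "Ergebnisse".
  const match = /^\s*(\d{1,3}(?:\.\d{3})+|\d+)\s+Ergebnis/.exec(text);
  if (!match) return null;
  // Drop the grouping dots before converting to a number.
  return Number(match[1].replace(/\./g, ""));
}

console.log(parseResultCount("71 Ergebnisse für „apple“")); // 71
```

The same helper should work on the `h1` text or on the page title, since both start with the count.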
That consent button is in an iframe. After clicking, wait for a redirect, and grab the data from one of a few places, such as the title:
const puppeteer = require("puppeteer"); // ^24.4.0

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
  await page.setUserAgent(ua);
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const iframeElement = await page.waitForSelector("iframe[id^='sp_']");
  const iframe = await iframeElement.contentFrame();
  await iframe.locator("::-p-text(Einverstanden)").click();
  await page.waitForFunction(
    "!window.location.href.includes('consent') && document.title"
  );
  console.log(await page.title()); // => 71 Ergebnisse für „apple“ von 17. Februar 2025 bis 6. März 2025 nach Relevanz sortiert - derStandard.at
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
It's also good to block unnecessary resources, so there is room for improvement here.
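As a rough sketch of what blocking resources could look like with Puppeteer's request interception: the helper names below (`shouldBlock`, `blockUnnecessaryResources`) and the list of blocked types are assumptions for illustration. Scripts and documents are deliberately left alone here, on the assumption that the consent iframe needs them to render and redirect.

```javascript
// Resource types assumed to be unnecessary for reading the page title.
const BLOCKED_TYPES = new Set(["image", "media", "font", "stylesheet"]);

// Hypothetical helper: true if a request of this resource type should be aborted.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Hypothetical helper: turn on request interception for a Puppeteer page and
// abort the blocked resource types, letting everything else through.
async function blockUnnecessaryResources(page) {
  await page.setRequestInterception(true);
  page.on("request", (request) => {
    if (shouldBlock(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
```

In the script above, this would be called as `await blockUnnecessaryResources(page);` before `page.goto(url, ...)`. If the consent flow stops working, the blocked list is the first thing to loosen.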