
javascript - Puppeteer is not scraping the last page - Stack Overflow


I'm scraping news articles for my project using Puppeteer, but I'm unable to scrape the last page. There are 10 pages and each page has 10 links (100 articles in total). However, I'm noticing that sometimes it only scrapes 39 articles and other times it scrapes around 90. I'm not sure why this happens.

Here's the code I'm using:

await page.goto(url, { timeout: 90000 })

    const result: Result[] = []

    await page.waitForSelector('div.gsc-cursor-page', { timeout: 90000 })

    let pageElements = await page.$$("div.gsc-cursor-page")

    await page.waitForSelector("div.gsc-resultsbox-visible", { timeout: 90000 })
    
    for(let i = 0; i < pageElements.length; i++){
        const pageElement = pageElements[i]

        // Click page element only if it's not the first page

        if(i !== 0){
            await page.evaluate((el) => {
                el.click()
            }, pageElement)

           // Wait for content to load after page navigation
            
            await page.waitForSelector("div.gsc-resultsbox-visible", { timeout: 90000 }).catch(err => {
                return
            })
        }


        // Re-fetch the page elements after navigation

        pageElements = await page.$$("div.gsc-cursor-page")   

        // Extract article links from the page

        let elements: ElementHandle<HTMLAnchorElement>[] = await page.$$("div.gsc-resultsbox-visible > div > div div > div.gsc-thumbnail-inside > div > a")

        for (const element of elements) {
            try {
                const link = await page.evaluate((el: HTMLAnchorElement) => el.href, element)
    

                // Open article page and scrape data

                const articlePage = await browser.newPage()
                await articlePage.goto(link, { waitUntil: 'load', timeout: 90000 })
                await articlePage.waitForSelector("h1.title", { timeout: 90000 }).catch(err => {
                    return
                })
    
                const title = await articlePage.$eval("h1.title", (element) => element.textContent.trim())
                const body = await articlePage.$$eval("div.articlebodycontent p", (elements) =>
                    elements.map((p) => p.textContent.replace(/\n/g, " ").replace(/\s+/g, " "))
                )
                result.push({ title, content: body.join(" ") })
                await articlePage.close()
            }
            catch (error) {
                console.log("Error extracting article:", error)
            }
        }
    }

    return { searchQuery, length: result.length, result }
}

How can I fix this issue?


Comment from the asker (Preethi): const url = `https://www.thehindu.com/search/#gsc.tab=0&gsc.q=${searchQuery}&gsc.sort=` (this is the URL being scraped).

2 Answers


If you examine the site's network requests, you'll see it's making a call to a third-party Google search API.

If you make the same request the website does, you can avoid the pain of automating the page entirely:

// searchQuery is assumed to be defined in the enclosing scope, as in the question's code
const makeUrl = offset =>
  `https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=20&start=${offset}&hl=en&source=gcsc&cselibv=75c56d121cde450a&cx=264d7caeb1ba04bfc&q=${searchQuery}&safe=active&cse_tok=AB-tC_51w3gnpTUdkduvVczddH5_%3A1743226475620&lr=&cr=&gl=&filter=0&sort=&as_oq=&as_sitesearch=&exp=cc&callback=google.search.cse.api15447&rurl=https%3A%2F%2Fwww.thehindu.com%2Fsearch%2F%23gsc.tab%3D0%26gsc.q%3D${searchQuery}%26gsc.sort%3D`;

(async () => {
  const results = [];
  const userAgent =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36";

  for (let page = 0; page < 10; page++) {
    const res = await fetch(makeUrl(page * 20), {
      headers: {"User-Agent": userAgent},
    });

    if (!res.ok) {
      break;
    }

    const text = await res.text();
    // The response is JSONP (wrapped in the callback named in the URL's callback
    // parameter), so slice out the JSON object between the first "{" and the trailing ")"
    const startIndex = text.indexOf("{");
    const endIndex = text.lastIndexOf(")");
    const json = text.slice(startIndex, endIndex);
    results.push(...JSON.parse(json).results);
  }

  console.log(results.map(e => e.title));
})();

But be careful with this, since Google can block you pretty easily when you go this route; you'll probably want to add a user agent, use a proxy, or throttle your requests (a throttling sketch follows the next example). If detected, Puppeteer becomes useful again, but you can still keep the philosophy of avoiding the DOM and focus on intercepting those responses:

const fs = require("node:fs/promises");
const puppeteer = require("puppeteer"); // ^24.4.0

let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
  await page.setUserAgent(ua);
  const results = [];

  for (let pg = 1; pg <= 10; pg++) {
    try {
      const responseArrived = page.waitForResponse(async res =>
        res.request().url().includes("google.search.cse") &&
        (await res.text()).includes("total_results")
      );
      await page.goto(
        `https://www.thehindu.com/search/#gsc.tab=0&gsc.q=${searchQuery}&gsc.sort=&gsc.page=${pg}`,
        {waitUntil: "domcontentloaded"}
      );
      const response = await responseArrived;
      const text = await response.text();
      const startIndex = text.indexOf("{");
      const endIndex = text.lastIndexOf(")");
      const json = text.slice(startIndex, endIndex);
      results.push(...JSON.parse(json).results);
    }
    catch {
      break;
    }
  }

  await fs.writeFile(
    "results.json",
    JSON.stringify(results, null, 2)
  );
  console.log(results.map(e => e.title));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
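
As a side note on the throttling suggestion above, here's a minimal sketch of a fixed-delay helper that could be dropped into the fetch loop from the first snippet. The 1-second pause is an arbitrary assumption, and makeUrl and userAgent are the same names used there:

// Simple fixed-delay throttle: pause between requests to reduce the chance of being blocked
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

(async () => {
  for (let page = 0; page < 10; page++) {
    const res = await fetch(makeUrl(page * 20), {
      headers: {"User-Agent": userAgent},
    });

    if (!res.ok) {
      break;
    }

    // ...parse the JSONP response exactly as in the first snippet...

    await sleep(1000); // arbitrary 1s pause before the next request
  }
})();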

I resolved the issue by adding a setTimeout-based delay after the page navigation:

await new Promise(r => setTimeout(r, 3000))
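
For context, here is a sketch of where such a delay could sit in the question's loop; placing it right after the pagination click is my assumption, since the answer only gives the one-liner:

if (i !== 0) {
    await page.evaluate((el) => el.click(), pageElement)

    // Assumed placement: give the search widget time to render the new page of results
    await new Promise(r => setTimeout(r, 3000))

    await page.waitForSelector("div.gsc-resultsbox-visible", { timeout: 90000 }).catch(() => {})
}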
