I'm scraping news articles for my project using Puppeteer. I'm unable to scrape the last page. There are 10 pages and each page has 10 links (100 articles in total). However, I'm noticing that sometimes it only scrapes data from 39 articles and other times it scrapes around 90. I'm not sure why this happens.
Here's the code I'm using:
await page.goto(url, { timeout: 90000 })
const result: Result[] = []
await page.waitForSelector('div.gsc-cursor-page', { timeout: 90000 })
let pageElements = await page.$$("div.gsc-cursor-page")
await page.waitForSelector("div.gsc-resultsbox-visible", { timeout: 90000 })

for (let i = 0; i < pageElements.length; i++) {
  const pageElement = pageElements[i]

  // Click page element only if it's not the first page
  if (i !== 0) {
    await page.evaluate((el) => {
      el.click()
    }, pageElement)

    // Wait for content to load after page navigation
    await page.waitForSelector("div.gsc-resultsbox-visible", { timeout: 90000 }).catch(err => {
      return
    })
  }

  // Re-fetch the page elements after navigation
  pageElements = await page.$$("div.gsc-cursor-page")

  // Extract article links from the page
  let elements: ElementHandle<HTMLAnchorElement>[] = await page.$$("div.gsc-resultsbox-visible > div > div div > div.gsc-thumbnail-inside > div > a")

  for (const element of elements) {
    try {
      const link = await page.evaluate((el: HTMLAnchorElement) => el.href, element)

      // Open article page and scrape data
      const articlePage = await browser.newPage()
      await articlePage.goto(link, { waitUntil: 'load', timeout: 90000 })
      await articlePage.waitForSelector("h1.title", { timeout: 90000 }).catch(err => {
        return
      })

      const title = await articlePage.$eval("h1.title", (element) => element.textContent.trim())
      const body = await articlePage.$$eval("div.articlebodycontent p", (elements) =>
        elements.map((p) => p.textContent.replace(/\n/g, " ").replace(/\s+/g, " "))
      )
      result.push({ title, content: body.join(" ") })
      await articlePage.close()
    }
    catch (error) {
      console.log("Error extracting article:", error)
    }
  }
}
return { searchQuery, length: result.length, result }
}
How can I fix this issue?
2 Answers
If you examine the site's network requests, it's making a call to a third-party Google search API. If you make that same request as the website does, you can avoid the pain of automating the page entirely:
const makeUrl = offset =>
  `https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=20&start=${offset}&hl=en&source=gcsc&cselibv=75c56d121cde450a&cx=264d7caeb1ba04bfc&q=${searchQuery}&safe=active&cse_tok=AB-tC_51w3gnpTUdkduvVczddH5_%3A1743226475620&lr=&cr=&gl=&filter=0&sort=&as_oq=&as_sitesearch=&exp=cc&callback=google.search.cse.api15447&rurl=https%3A%2F%2Fwww.thehindu.com%2Fsearch%2F%23gsc.tab%3D0%26gsc.q%3D${searchQuery}%26gsc.sort%3D`;

(async () => {
  const results = [];
  const userAgent =
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36";

  for (let page = 0; page < 10; page++) {
    const res = await fetch(makeUrl(page * 20), {
      headers: {"User-Agent": userAgent},
    });
    if (!res.ok) {
      break;
    }

    // Strip the JSONP wrapper (callback name and trailing paren) to get plain JSON
    const text = await res.text();
    const startIndex = text.indexOf("{");
    const endIndex = text.lastIndexOf(")");
    const json = text.slice(startIndex, endIndex);
    results.push(...JSON.parse(json).results);
  }

  console.log(results.map(e => e.title));
})();
But be careful with this, since Google can block you pretty easily; you'll probably want to add a user agent, proxy, or throttled requests (a simple throttle sketch follows the interception example below). If detected, Puppeteer becomes useful again, but you can still keep the philosophy of avoiding the DOM and focus on intercepting those responses:
const fs = require("node:fs/promises");
const puppeteer = require("puppeteer"); // ^24.4.0

let browser;
(async () => {
  browser = await puppeteer.launch({headless: false});
  const [page] = await browser.pages();
  const ua =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";
  await page.setUserAgent(ua);
  const results = [];

  for (let pg = 1; pg <= 10; pg++) {
    try {
      // Start listening for the CSE response before navigating so it isn't missed
      const responseArrived = page.waitForResponse(async res =>
        res.request().url().includes("google.search.cse") &&
        (await res.text()).includes("total_results")
      );
      await page.goto(
        `https://www.thehindu.com/search/#gsc.tab=0&gsc.q=${searchQuery}&gsc.sort=&gsc.page=${pg}`,
        {waitUntil: "domcontentloaded"}
      );
      const response = await responseArrived;

      // Strip the JSONP wrapper to get plain JSON
      const text = await response.text();
      const startIndex = text.indexOf("{");
      const endIndex = text.lastIndexOf(")");
      const json = text.slice(startIndex, endIndex);
      results.push(...JSON.parse(json).results);
    }
    catch {
      break;
    }
  }

  await fs.writeFile(
    "results.json",
    JSON.stringify(results, null, 2)
  );
  console.log(results.map(e => e.title));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
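On the throttling point, here's a minimal sketch of the same fetch loop with a fixed pause between requests. It reuses makeUrl and userAgent from the first snippet; the 2000 ms delay is an arbitrary, tunable value, not a documented rate limit.

// Minimal throttled variant of the fetch loop above.
// Assumes makeUrl and userAgent are defined as in the first snippet;
// the 2000 ms pause is an arbitrary, tunable value.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

(async () => {
  const results = [];
  for (let page = 0; page < 10; page++) {
    const res = await fetch(makeUrl(page * 20), {
      headers: {"User-Agent": userAgent},
    });
    if (!res.ok) {
      break;
    }
    const text = await res.text();
    const json = text.slice(text.indexOf("{"), text.lastIndexOf(")"));
    results.push(...JSON.parse(json).results);
    await sleep(2000); // pause before the next request to reduce the chance of being blocked
  }
  console.log(results.length);
})();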
I resolved the issue by adding a setTimeout delay after the page navigation:

await new Promise(r => setTimeout(r, 3000))
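For context, this is roughly where that delay sits in the question's pagination loop (a sketch reusing page, pageElements, and pageElement from the code above); a condition-based wait such as page.waitForFunction on the results container would be a less fragile alternative to a hard-coded timeout.

// Sketch only, reusing page / pageElements / pageElement from the question's loop
if (i !== 0) {
  await page.evaluate((el) => el.click(), pageElement)

  // Give the search widget time to re-render the result list before scraping;
  // a fixed delay works, but waiting on a condition (e.g. page.waitForFunction
  // observing the results container) would be less fragile than a fixed 3000 ms.
  await new Promise(r => setTimeout(r, 3000))
}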
Comment from the asker: this is the URL being scraped – https://www.thehindu.com/search/#gsc.tab=0&gsc.q=${searchQuery}&gsc.sort=