
javascript - Scraping Google search result links with Puppeteer - Stack Overflow


Below is the code I'm trying to use to scrape Google. When I pass in a search query, it doesn't return the list of links. I don't understand what's causing this. Can someone help, please?

const puppeteer = require("puppeteer");

const searchGoogle = async (searchQuery) => {
  /** by default, Puppeteer's launch method has headless set to true */
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.goto("/");
  await page.type('input[aria-label="Search"]', searchQuery);
  await page.keyboard.press("Enter");

  /** wait while the page loads, otherwise the evaluate call will fail */
  await page.waitFor(5000);
  const list = await page.evaluate(() => {
    let data = [];
    /** this selector can be changed for other websites */
    const list = document.querySelectorAll(".rc .r");
    for (const a of list) {
      data.push({
        title: a
          .querySelector(".LC20lb")
          .innerText.trim()
          .replace(/(\r\n|\n|\r)/gm, " "),
        link: a.querySelector("a").href,
      });
    }
    return data;
  });

  await browser.close();
};
module.exports = searchGoogle;

asked May 13, 2021 by addy · edited Oct 15, 2022 by ggorlen

2 Answers


Note: As of 2025, it seems Google has implemented heavy anti-bot measures, so this answer does not appear to work. I'll leave it up for posterity and will update it when I get the chance.


await page.waitFor(5000); in this context causes a race condition. If the page doesn't load in 5 seconds, you can get a false negative. If the page loads quicker than 5 seconds, you've wasted time without reason. Only pick arbitrary delays as a last resort or if it's an intended part of the application logic.

A better approach is to use page.waitForSelector or page.waitForNavigation.
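
For example, here's a minimal sketch of the asker's flow using these instead of the fixed delay (the .rc .r selector comes from the question and may well be outdated by now):

// instead of an arbitrary sleep, wait for the navigation triggered by Enter
await Promise.all([
  page.keyboard.press("Enter"),
  page.waitForNavigation({ waitUntil: "domcontentloaded" }),
]);

// or wait until the expected result selector appears in the DOM
await page.waitForSelector(".rc .r");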

Secondly, I don't see results for the selector .rc .r. Google selectors aren't stable (as with many sites), and the page they show can depend on your user agent, so the answers below will probably require adjustment going forward. At the moment, this seems to work:

import puppeteer from "puppeteer"; // ^22.10.0

let browser;
(async () => {
  const searchQuery = "stack overflow";

  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setRequestInterception(true);
  await page.setJavaScriptEnabled(false);
  page.on("request", request => {
    request.resourceType() === "document"
      ? request.continue()
      : request.abort();
  });
  await page.goto("https://www.google./", {
    waitUntil: "domcontentloaded",
  });
  await page.type("textarea", searchQuery);
  await page.$eval('[aria-label="Google Search"]', el => el.click());
  const sel = ".Gx5Zad";
  await page.waitForSelector(sel);
  const searchResults = await page.$$eval(sel, els =>
    els
      .map(e => ({
        title: e.querySelector("h3")?.textContent,
        link: e.querySelector("a").href,
      }))
      .filter(e => e.title)
  );
  console.log(searchResults);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output (yours may be different depending on what Google shows by the time you run the script):

[
  {
    title: 'Stack Overflow - Where Developers Learn, Share, & Build ...',
    link: 'https://stackoverflow.com/'
  },
  {
    title: 'Stack Overflow - Wikipedia',
    link: 'https://en.wikipedia.org/wiki/Stack_Overflow'
  },
  {
    title: 'Stack Overflow Blog - Essays, opinions, and advice on the act ...',
    link: 'https://stackoverflow.blog/'
  },
  {
    title: 'The Stack Overflow Podcast - Stack Overflow Blog',
    link: 'https://stackoverflow.blog/podcast/'
  },
  {
    title: 'Stack Overflow | LinkedIn',
    link: 'https://www.linkedin.com/company/stack-overflow'
  }
]

Another approach is to encode your search terms as a URL query parameter and navigate directly to https://www.google.com/search?q=your+query+here, saving a navigation and a potential selector mishap.
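
Here's a minimal sketch of that variant, assuming the same (possibly outdated) .Gx5Zad selector as the script above:

import puppeteer from "puppeteer";

const searchQuery = "stack overflow";
const url = `https://www.google.com/search?q=${encodeURIComponent(searchQuery)}`;

const browser = await puppeteer.launch();
try {
  const [page] = await browser.pages();
  // navigate straight to the results page, skipping the homepage and the typing step
  await page.goto(url, { waitUntil: "domcontentloaded" });
  const sel = ".Gx5Zad"; // selector borrowed from the script above; adjust as needed
  await page.waitForSelector(sel);
  const searchResults = await page.$$eval(sel, els =>
    els
      .map(e => ({
        title: e.querySelector("h3")?.textContent,
        link: e.querySelector("a")?.href,
      }))
      .filter(e => e.title)
  );
  console.log(searchResults);
} finally {
  await browser.close();
}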

As with many scraping tasks, since the goal is to grab simple hrefs from the document, you might try switching to fetch/cheerio and working with the static HTML. On my machine, the following script runs about 5x faster than Puppeteer with two navigations and about 3x faster than Puppeteer navigating directly to the search results.

import cheerio from "cheerio"; // 1.0.0-rc.12

const query = "stack overflow";
const url = `https://www.google.com/search?q=${encodeURIComponent(query)}`;
const ua =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";

fetch(url, {
  headers: {
    "User-Agent": ua,
  },
})
  .then(res => res.text())
  .then(html => {
    const $ = cheerio.load(html);
    const searchResults = [...$(".LC20lb")].map(e => ({
      title: $(e).text().trim(),
      link: e.parentNode.attribs.href,
    }));
    console.log(searchResults);
  });

See also Click an element on first Google search result using Puppeteer.

I recommend you use a simple GET request to scrape the Google search results instead of using Puppeteer, which is very CPU-intensive.

Here is the code, which can help you scrape links from the search results page:

const unirest = require("unirest");
const cheerio = require("cheerio");

const getOrganicData = () => {
  return unirest
    .get("https://www.google./search?q=javascript&gl=us&hl=en")
    .headers({
      "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    })
    .then((response) => {
      let $ = cheerio.load(response.body);

      let titles = [];
      let links = [];
      let snippets = [];
      let displayedLinks = [];

      $(".yuRUbf > a > h3").each((i, el) => {
        titles[i] = $(el).text();
      });
      $(".yuRUbf > a").each((i, el) => {
        links[i] = $(el).attr("href");
      });
      $(".g .VwiC3b ").each((i, el) => {
        snippets[i] = $(el).text();
      });
      $(".g .yuRUbf .NJjxre .tjvcx").each((i, el) => {
        displayedLinks[i] = $(el).text();
      });

      const organicResults = [];

      for (let i = 0; i < titles.length; i++) {
        organicResults[i] = {
          title: titles[i],
          links: links[i],
          snippet: snippets[i],
          displayedLink: displayedLinks[i],
        };
      }
      console.log(organicResults);
    });
};

getOrganicData();

If you want a complete explanation of this code, I have written a complete blog post on it: How to scrape Google Organic Search Results

Alternative:

You can use the Google Search API by Serpdog. Serpdog also offers 100 free credits on first signup.

Scraping can be time-consuming, but this pre-cooked structured JSON data makes your work easier, and you don't have to maintain the Google CSS selectors over time, which is a big headache.

How to use:

const axios = require('axios');

axios.get('https://api.serpdog.io/search?api_key=APIKEY&q=coffee&gl=us')
  .then(response => {
    console.log(response.data);
  })
  .catch(error => {
    console.log(error);
  });

Results:

 "organic_results": [
{
  "title": "9 Health Benefits of Coffee, Based on Science - Healthline",
  "link": "https://www.healthline./nutrition/top-evidence-based-health-benefits-of-coffee",
  "displayed_link": "https://www.healthline. › Wellness Topics › Nutrition",
  "snippet": "Coffee is a popular beverage that researchers have studied extensively for its many health benefits, including its ability to increase energy levels, promote ...",
  "rank": 1
},
{
  "title": "The Coffee Bean & Tea Leaf | CBTL",
  "link": "https://www.coffeebean./",
  "displayed_link": "https://www.coffeebean.",
  "snippet": "Born and brewed in Southern California since 1963, The Coffee Bean & Tea Leaf® is passionate about connecting loyal customers with carefully handcrafted ...",
  "rank": 2
},
.......

Disclaimer: I am the founder of serpdog.io.
