javascript - Scraping Amazon prices with Puppeteer - Stack Overflow

I tried to scrape an Amazon page to get the price of a product, but the scraped price differs from the one shown in the actual browser. I checked many times but couldn't get the right result: the script returns $89.99, while the product costs $58.95 on the actual site. Does Amazon confuse web scrapers and crawlers intentionally or is it my fault? I used Puppeteer and JSDom in Node.js.

Node.js code:

const puppeteer = require('puppeteer');
const jsdom = require('jsdom');
const { JSDOM } = jsdom;

const url = 'https://www.amazon.com/Razer-DeathAdder-Chroma-Multi-Color-Comfortable/dp/B00MYTSDU4/ref=sr_1_2?dchild=1&keywords=Deathadder%2BChroma&qid=1625425444&sr=8-2&th=1';

async function configureBrowser() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    return page;
}

async function pageContent() {
    let page = await configureBrowser();
    // await page.reload();
    let html = await page.evaluate(() => document.body.innerHTML);
    await page.close();

    console.log(new JSDOM(html).window.document.querySelector('#priceblock_ourprice').textContent);

    // return new JSDOM(html).window.document.querySelector('#priceblock_ourprice').textContent;
}

module.exports = pageContent;

Asked Jul 4, 2021 at 20:48 by KoboldMines; edited Jun 9, 2022 at 16:12 by ggorlen.
2 Answers

It's odd to combine JSDom with Puppeteer. Puppeteer already has a full suite of selectors and works on the actual, real-time DOM inside the webpage, so dumping the entire HTML and re-parsing it with a simulated DOM like JSDom is an unnecessary layer of indirection that can lead to confusion.

When the page is injecting the content dynamically, just use Puppeteer alone:

const puppeteer = require("puppeteer"); // ^23.0.1

const url = "<your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: "new"});
  const [page] = await browser.pages();
  await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
  await page.setJavaScriptEnabled(false);
  await page.setRequestInterception(true);
  page.on("request", req => req.url() === url ? req.continue() : req.abort());
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const el = await page.waitForSelector(".a-price .a-offscreen");
  const price = await el.evaluate(el => el.innerText);
  console.log(price);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close()); // browser is undefined if launch() itself failed
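
The request handler above allows only the request whose URL exactly matches `url`. One possible variation, sketched here under the assumption that filtering by resource type is acceptable for your page, is to allow every request of type `document` and abort the rest, which may be more forgiving if the server redirects to a slightly different URL. `makeRequestFilter` is a hypothetical helper name, not part of Puppeteer; it relies only on the `resourceType()`, `continue()` and `abort()` methods of the request object.

```javascript
// Hypothetical helper: build a "request" event handler that lets through
// only the given resource types and aborts everything else.
function makeRequestFilter(allowedTypes = ["document"]) {
  const allowed = new Set(allowedTypes);
  return (req) =>
    allowed.has(req.resourceType()) ? req.continue() : req.abort();
}
```

It would be registered with `page.on("request", makeRequestFilter())` after `page.setRequestInterception(true)`. Note that iframes also load as `document` requests, so this lets through more than the exact-URL check does.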

Since the price you want appears to be baked into the static HTML in this case, I've disabled JS and resource requests. The relevant markup looks like this:

<span class="a-price a-text-price a-size-medium apexPriceToPay" data-a-size="b" data-a-color="price">
  <span class="a-offscreen">$53.00</span>
  <span aria-hidden="true">$53.00</span>
</span>

You can potentially go a step further, skip Puppeteer entirely, and use JSDom along with a basic HTTP request to get the data:

const axios = require("axios"); // ^1.6.8
const {JSDOM} = require("jsdom"); // ^24.0.0

const url = "<your URL>";

(async () => {
  const {data: html} = await axios.get(url, {
    headers: { // https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja#full-set-of-headers
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
      "Accept-Encoding": "gzip, deflate, br", 
      "Accept-Language": "en-US,en;q=0.9", 
      "Sec-Ch-Ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"", 
      "Sec-Ch-Ua-Mobile": "?0", 
      "Sec-Fetch-Dest": "document", 
      "Sec-Fetch-Mode": "navigate", 
      "Sec-Fetch-Site": "none", 
      "Sec-Fetch-User": "?1", 
      "Upgrade-Insecure-Requests": "1", 
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36", 
      "X-Amzn-Trace-Id": "Root=1-60ff12bb-55defac340ac48081d670f9d"
    }
  });
  const price = new JSDOM(html)
    .window
    .document
    .querySelector(".a-price .a-offscreen")
    .textContent;
  console.log(price);
})()
  .catch(err => console.error(err));
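
Before dropping the browser, it may be worth confirming that the price really is present in the raw response. The following is a rough sanity check, not a substitute for a real parser: it scans an HTML string for `a-offscreen` spans like the one shown earlier and returns their text. `findStaticPrices` is a hypothetical name, and the regular expression assumes the simple markup shape from the snippet in this answer.

```javascript
// Rough check (regex, not a real HTML parser): collect the text of every
// <span> whose class list includes "a-offscreen".
function findStaticPrices(html) {
  const pattern =
    /<span[^>]*class="[^"]*\ba-offscreen\b[^"]*"[^>]*>([^<]*)<\/span>/g;
  return [...html.matchAll(pattern)]
    .map((match) => match[1].trim())
    .filter(Boolean);
}
```

If this returns an empty array for the downloaded HTML, the price is probably injected by JavaScript and a browser is still needed. If it returns several values, the first match is not necessarily the price you see in the browser, so it is worth narrowing the selector.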

Does Amazon confuse web scrapers and crawlers intentionally?

It's possible that you're offered a different price based on location or other factors, such as running the script multiple times, but some of these changes occur even when visiting the page as a normal user.

Amazon changes selectors often and may take more intensive measures to block scraping, so some of the code here will require tweaks and updates to work in the future.
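
Because selectors change, one way to make a script a little less brittle is to try several candidates in order and then normalize whatever text comes back. This is only a sketch: `firstMatchingText`, `parsePrice` and `PRICE_SELECTORS` are hypothetical names, the selector list simply reuses the two selectors that appear in this thread, and the parser assumes US-style formatting such as "$1,058.95".

```javascript
// Candidate selectors, most preferred first (taken from this thread).
const PRICE_SELECTORS = [".a-price .a-offscreen", "#priceblock_ourprice"];

// Try each selector in order and return the first non-empty text found.
// `queryText` is any sync or async function mapping a selector to the
// element's text, or null/undefined when nothing matched.
async function firstMatchingText(selectors, queryText) {
  for (const selector of selectors) {
    const text = await queryText(selector);
    if (text && text.trim()) {
      return { selector, text: text.trim() };
    }
  }
  return null;
}

// Convert a US-style price string to a number.
// Assumes "." is the decimal separator and "," groups thousands.
function parsePrice(text) {
  const match = /(\d[\d,]*(?:\.\d+)?)/.exec(text);
  return match ? Number(match[1].replace(/,/g, "")) : NaN;
}
```

With Puppeteer, `queryText` could be `(s) => page.$eval(s, (el) => el.textContent).catch(() => null)`; with JSDom, `(s) => document.querySelector(s)?.textContent ?? null`. Logging which selector matched makes it easier to notice when the page layout has changed.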

If ggorlen's answer didn't help, you could try this approach, which uses only Puppeteer.

const puppeteer = require("puppeteer");

const scrape = async (url) => {
  let browser, page;

  try {
    console.log('opening browser');
    browser = await puppeteer.launch();
    page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });

    await page.waitForSelector('#priceblock_ourprice', { visible: true });

    // Read the price text directly from the live DOM.
    const price = await page.$eval('#priceblock_ourprice', (el) => el.innerText);

    console.log({ price });
    return { price };

  } catch (error) {
    console.log('scrape error', error.message);
  } finally {
    if (browser) {
      await browser.close();
      console.log('closing browser');
    }
  }
}

scrape('https://www.amazon.com/Razer-DeathAdder-Chroma-Multi-Color-Comfortable/dp/B00MYTSDU4/ref=sr_1_2?dchild=1&keywords=Deathadder%2BChroma&qid=1625425444&sr=8-2&th=1');