javascript - Scraping Amazon prices with Puppeteer

I tried to scrape an Amazon product page to get the price, but the scraped result differs from the amount shown in an actual browser. I checked many times but couldn't get the right result: the script gives me $89.99, while the product costs $58.95 on the actual site. Does Amazon confuse web scrapers and crawlers intentionally, or is it my fault? I used Puppeteer and JSDom in Node.js.

NodeJS code:

const puppeteer = require('puppeteer');
const jsdom = require('jsdom');
const { JSDOM } = jsdom;

const url = 'https://www.amazon.com/Razer-DeathAdder-Chroma-Multi-Color-Comfortable/dp/B00MYTSDU4/ref=sr_1_2?dchild=1&keywords=Deathadder%2BChroma&qid=1625425444&sr=8-2&th=1';

async function configureBrowser() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    return page;
}

async function pageContent() {
    let page = await configureBrowser();
    // await page.reload();
    let html = await page.evaluate(() => document.body.innerHTML);
    await page.close();

    console.log(new JSDOM(html).window.document.querySelector('#priceblock_ourprice').textContent);

    // return new JSDOM(html).window.document.querySelector('#priceblock_ourprice').textContent;
}

module.exports = pageContent;

asked Jul 4, 2021 at 20:48 by KoboldMines; edited Jun 9, 2022 at 16:12 by ggorlen


2 Answers


It's odd to combine JSDom with Puppeteer. Puppeteer already has a full suite of selectors and works on the actual, realtime DOM inside the webpage, so dumping and re-parsing the entire HTML with a simulated DOM like JSDom is an unnecessary layer of indirection that can lead to confusion.

When the page is injecting the content dynamically, just use Puppeteer alone:

const puppeteer = require("puppeteer"); // ^23.0.1

const url = "<your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch({headless: "new"});
  const [page] = await browser.pages();
  await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
  await page.setJavaScriptEnabled(false);
  await page.setRequestInterception(true);
  page.on("request", req => req.url() === url ? req.continue() : req.abort());
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const el = await page.waitForSelector(".a-price .a-offscreen");
  const price = await el.evaluate(el => el.innerText);
  console.log(price);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close()); // browser is undefined if launch() itself failed

Since the price you want appears to be baked into the static HTML in this case (the relevant markup is shown below), I've disabled JS and aborted every request other than the main document. You can potentially go a step further, skip Puppeteer entirely, and use a basic HTTP request along with JSDom to get the data:

<span class="a-price a-text-price a-size-medium apexPriceToPay" data-a-size="b" data-a-color="price">
  <span class="a-offscreen">$53.00</span>
  <span aria-hidden="true">$53.00</span>
</span>

const axios = require("axios"); // ^1.6.8
const {JSDOM} = require("jsdom"); // ^24.0.0

const url = "<your URL>";

(async () => {
  const {data: html} = await axios.get(url, {
    headers: { // https://www.zenrows.com/blog/stealth-web-scraping-in-python-avoid-blocking-like-a-ninja#full-set-of-headers
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 
      "Accept-Encoding": "gzip, deflate, br", 
      "Accept-Language": "en-US,en;q=0.9", 
      "Sec-Ch-Ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"", 
      "Sec-Ch-Ua-Mobile": "?0", 
      "Sec-Fetch-Dest": "document", 
      "Sec-Fetch-Mode": "navigate", 
      "Sec-Fetch-Site": "none", 
      "Sec-Fetch-User": "?1", 
      "Upgrade-Insecure-Requests": "1", 
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36", 
      "X-Amzn-Trace-Id": "Root=1-60ff12bb-55defac340ac48081d670f9d"
    }
  });
  const price = new JSDOM(html)
    .window
    .document
    .querySelector(".a-price .a-offscreen")
    .textContent;
  console.log(price);
})()
  .catch(err => console.error(err));

Does Amazon confuse web scrapers and crawlers intentionally?

It's possible that you're offered a different price based on location or other factors, such as running the script multiple times, but some of these changes occur even when visiting the page as a normal user.
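To check whether the price really differs between runs, rather than comparing strings by eye, it can help to normalize the scraped text into a number first. This is only a minimal sketch: `parsePrice` is a hypothetical helper (not part of Puppeteer or JSDom), and it assumes US-style formatting with a comma for thousands and a dot for decimals.

```javascript
// Hypothetical helper: convert a scraped price string such as "$1,058.95"
// into a number. Assumes US-style formatting; returns null if no number is found.
function parsePrice(text) {
  const match = String(text).replace(/,/g, "").match(/\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Compare the value seen in the browser with the value the script returned
// (both strings are the ones reported in the question).
const seenInBrowser = parsePrice("$58.95");
const scraped = parsePrice("$89.99");
console.log({seenInBrowser, scraped, same: seenInBrowser === scraped});
```

Logging the parsed value on each run, together with the time and the selector used, makes it easier to tell a genuinely different offer apart from a scraping mistake.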

Amazon changes selectors often and may take more intensive measures to block scraping, so some of the code here will require tweaks and updates to work in the future.
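One way to soften the impact of selector changes is to try several candidate selectors in order and report which one matched. The sketch below is an assumption-laden illustration, not a tested recipe for Amazon: `firstMatchingText` and `PRICE_SELECTORS` are hypothetical names, and the two selectors are simply the ones that appear in this thread.

```javascript
// Hypothetical helper: return the trimmed text of the first selector that matches.
// `root` is anything with a querySelector method, for example a JSDOM document
// or `document` inside page.evaluate().
function firstMatchingText(root, selectors) {
  for (const selector of selectors) {
    const el = root.querySelector(selector);
    const text = el?.textContent?.trim();
    if (text) {
      return {selector, text};
    }
  }
  return null; // nothing matched; the selector list probably needs updating
}

// Candidate selectors taken from this thread; Amazon may have changed them since.
const PRICE_SELECTORS = [".a-price .a-offscreen", "#priceblock_ourprice"];
```

With the axios and JSDom version above, this could be called as `firstMatchingText(new JSDOM(html).window.document, PRICE_SELECTORS)`. Returning the matching selector alongside the text shows which element the price came from, which is useful when a page contains more than one price.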

If ggorlen's answer didn't help, you could try this approach, which uses only Puppeteer.

const puppeteer = require("puppeteer");

const scrape = async (url) => {
  let browser, page;

  try {
    console.log('opening browser');
    browser = await puppeteer.launch();
    page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });

    await page.waitForSelector('#priceblock_ourprice', { visible: true });

    // Read the element's text directly; no JSON round trip is needed for a string.
    const price = await page.$eval('#priceblock_ourprice', el => el.innerText);

    console.log({ price });
    return { price };

  } catch (error) {
    console.log('scrape error', error.message);
  } finally {
    if (browser) {
      await browser.close();
      console.log('closing browser');
    }
  }
}

scrape('https://www.amazon.com/Razer-DeathAdder-Chroma-Multi-Color-Comfortable/dp/B00MYTSDU4/ref=sr_1_2?dchild=1&keywords=Deathadder%2BChroma&qid=1625425444&sr=8-2&th=1');
