node.js - Bypassing Cloudflare with Puppeteer and FlareSolverr

For the past few weeks, we have been working on web scraping on /. Initially we used only Puppeteer, but the browser frequently landed on a Cloudflare challenge page displaying the message:

"Waiting for the website to respond."

To overcome this, we tried several alternative approaches:

Puppeteer + rotating proxy
puppeteer-extra-plugin-stealth
Puppeteer-real-browser + rotating proxy
Puppeteer-real-browser + FlareSolverr pre-request + rotating proxy

Current Approach

To address the issue, we decided to make a pre-request to the site using FlareSolverr. We then extracted the obtained cookies and user agent and passed them to Puppeteer for browser navigation.
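The cookie hand-off described above can be sketched as a small conversion function. This is a minimal sketch, not FlareSolverr's or Puppeteer's own code: `SolverCookie`, `BrowserCookie`, and `toPuppeteerCookie` are hypothetical names, and it assumes (as the snippet in this question does) that FlareSolverr reports the expiry time in an `expiry` field.

```typescript
// Hypothetical shape of a cookie in FlareSolverr's `solution.cookies`
// (assumption: expiry arrives as `expiry`, in seconds since the UNIX epoch).
interface SolverCookie {
  name: string;
  value: string;
  domain: string;
  path?: string;
  expiry?: number;
  httpOnly?: boolean;
  secure?: boolean;
}

// Subset of the cookie fields passed on to Puppeteer.
interface BrowserCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires?: number;
  httpOnly: boolean;
  secure: boolean;
}

function toPuppeteerCookie(cookie: SolverCookie): BrowserCookie {
  const converted: BrowserCookie = {
    name: cookie.name,
    value: cookie.value,
    domain: cookie.domain,
    path: cookie.path ?? "/",
    httpOnly: cookie.httpOnly ?? false,
    secure: cookie.secure ?? false,
  };
  // Only set `expires` when an expiry was reported; leaving it unset
  // keeps the cookie a session cookie instead of dating it to the epoch.
  if (cookie.expiry !== undefined) {
    converted.expires = cookie.expiry;
  }
  return converted;
}
```

With a helper like this, the cookies could be passed on as `browser.setCookie(...cookies.map(toPuppeteerCookie))`.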

However, we encountered two key issues:

1. FlareSolverr fails to solve the Cloudflare challenge

When FlareSolverr detects a Cloudflare challenge, it fails to bypass it, logging the error:

Error solving the challenge. Timeout after X seconds.

2. Puppeteer still hits the challenge after a successful pre-request

When FlareSolverr does not detect a challenge and successfully completes the pre-request, we extract the cookies and user agent, set them in Puppeteer, and navigate to the target page.

However, Puppeteer still encounters the Cloudflare challenge page. This suggests that FlareSolverr might not be detecting the challenge properly and, therefore, does not retrieve the necessary cookies.
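One way to test that suspicion is to check whether the pre-request actually returned a `cf_clearance` cookie, which Cloudflare issues once a challenge has been passed. The following is a minimal sketch; `NamedCookie` and `hasClearanceCookie` are hypothetical names, not part of FlareSolverr or Puppeteer.

```typescript
// Minimal cookie shape needed for the check.
interface NamedCookie {
  name: string;
  value: string;
}

// Returns true when the cookie list contains a non-empty `cf_clearance`
// cookie, i.e. the pre-request went through a solved challenge.
function hasClearanceCookie(cookies: readonly NamedCookie[]): boolean {
  return cookies.some(
    (cookie) => cookie.name === "cf_clearance" && cookie.value.length > 0
  );
}
```

If this returns false for `flaresolverrData.solution.cookies`, the pre-request never passed a challenge, so there is no clearance for Puppeteer to reuse.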

Question

What are we doing wrong? It seems that FlareSolverr reduces the likelihood of hitting the challenge but fails when it actually encounters one.

What would be the best approach to ensure Puppeteer can bypass Cloudflare protection?

Code Snippet (TypeScript)

let flaresolverrData: any;
let attempts = 0;
const maxAttempts = 5;

// Pre-request to FlareSolverr; retry up to maxAttempts times if it fails.
while (attempts < maxAttempts) {
  try {
    const response = await fetch("http://localhost:8191/v1", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        cmd: "request.get",
        url: actionParams.url,
        maxTimeout: 5000, // milliseconds (FlareSolverr's documented default is 60000)
        session: flaresolverrSessionId,
      }),
    });

    flaresolverrData = await response.json();

    if (flaresolverrData?.status === "error") {
      throw new Error(flaresolverrData?.message ?? "FlareSolverr error");
    }

    break;
  } catch (error: any) {
    attempts++;
    if (attempts === maxAttempts) {
      throw new Error(`Failed to fetch after ${attempts} attempts: ${error.message}`);
    }
    await new Promise((resolve) => setTimeout(resolve, 1000)); // Wait 1 second between retries
  }
}

if (!flaresolverrData) throw new Error("FlareSolverr data not found");

const cookies = flaresolverrData.solution.cookies;
const userAgent = flaresolverrData.solution.userAgent;

if (!userAgent) throw new Error("User agent not found");

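if (cookies.length !== 0) {
  await browser.setCookie(
    // Rename `expiry` to Puppeteer's `expires`; leave it unset for session
    // cookies instead of forcing 0, which is the UNIX epoch.
    ...cookies.map(({ expiry, ...rest }: any) =>
      expiry == null ? rest : { ...rest, expires: expiry }
    )
  );
}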

await browserPage.setUserAgent(userAgent);

// Random pause of 1 to 11 seconds before navigating.
await delay(Math.random() * 10000 + 1000);
await browserPage.goto(actionParams.url, { waitUntil: "networkidle0" });

const content = await browserPage.content();
return content;
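Because `browserPage.content()` returns whatever page was loaded, the snippet above can hand back the challenge page itself without noticing. A minimal sketch of a guard follows; `looksLikeChallengePage` and `CHALLENGE_MARKERS` are hypothetical names. The first marker is the message quoted in this question, and the second is an assumption about the title such interstitial pages commonly show.

```typescript
// Strings whose presence suggests the HTML is a challenge page.
const CHALLENGE_MARKERS: readonly string[] = [
  "Waiting for the website to respond", // message quoted in this question
  "Just a moment...", // assumption: title commonly shown on the interstitial
];

// Case-insensitive check for any marker in the page HTML.
function looksLikeChallengePage(
  html: string,
  markers: readonly string[] = CHALLENGE_MARKERS
): boolean {
  const haystack = html.toLowerCase();
  return markers.some((marker) => haystack.includes(marker.toLowerCase()));
}
```

A caller could throw or retry when `looksLikeChallengePage(content)` is true, rather than returning the challenge HTML as if it were the target page.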
