I'm trying to scrape a table from this website using the bs4 and requests libraries, but I couldn't find any relevant information in the XHR or JS sections of Chrome's inspect tool, nor any JSON file to pull the data from.
I was hoping to find an API endpoint in the site's XHR or JS requests, but I didn't manage to find anything there, so I decided to scrape the data from each page using this code (written with the help of ChatGPT), which extracts the data from each page and saves it into a .csv file:
async def scrape_page(session, page):
    """Scrape data from a single page."""
    url = BASE_URL + str(page)
    html = await fetch_page(session, url, page)
    if not html:
        logging.error(f"❌ Failed to fetch page {page}. Skipping...")
        return [], None  # Skip if page fails

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": lambda x: x and x.startswith("guid-")})
    if not table:
        logging.warning(f"⚠️ No table found on page {page}. Check if the structure has changed!")
        return [], None

    titles = [th.text.strip() for th in table.find_all("th")]
    rows = table.find_all("tr")[1:]  # Skip first row (headers)
    data = [[td.text.strip() for td in row.find_all("td")] for row in rows]

    print(f"✅ Scraped {len(data)} records from page {page}")  # DEBUG PRINT
    return data, titles
The problem I've run into is that there are a lot of pages (approximately 20,000 pages with 15 rows of data on each), and it's taking too long to scrape all of it. Any suggestions on how I could optimize this process?
1 Answer
I examined the target website you're scraping, and since the data isn't loaded via a visible API, your current page-by-page method is necessary. However, to improve efficiency and speed, consider an asynchronous, concurrent approach. Here's how:
- Use Scrapy – it's optimized for fast, scalable scraping and handles concurrency, retries, and CSV export for you (see the spider sketch below).
- Process in batches – instead of scraping pages one by one, fetch around 100 at a time to reduce overhead (see the asyncio sketch after this list).
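Here's a minimal Scrapy spider sketch for the first suggestion. The URL pattern, spider name, and settings are illustrative placeholders; adapt them to the real site and its table structure:

import scrapy

class TableSpider(scrapy.Spider):
    name = "table_spider"
    # Hypothetical paginated URL -- substitute the real one from your BASE_URL.
    start_urls = [f"https://example.com/table?page={p}" for p in range(1, 20001)]
    custom_settings = {
        "CONCURRENT_REQUESTS": 32,  # Scrapy schedules requests concurrently and retries failures
        "DOWNLOAD_DELAY": 0.1,      # small delay to stay polite to the server
    }

    def parse(self, response):
        table = response.xpath('//table[starts-with(@id, "guid-")]')
        headers = [th.xpath("string(.)").get("").strip() for th in table.xpath(".//th")]
        for row in table.xpath(".//tr[position() > 1]"):
            cells = [td.xpath("string(.)").get("").strip() for td in row.xpath("./td")]
            yield dict(zip(headers, cells))

Run it with scrapy runspider table_spider.py -o table.csv and Scrapy takes care of the CSV export for you.

If you'd rather keep your aiohttp/BeautifulSoup code, here's a rough sketch of the batching idea that reuses the scrape_page coroutine from your question. BASE_URL, TOTAL_PAGES, the batch size, and the connection limit are placeholders, and it assumes your scrape_page (and its bs4 imports) are defined in the same file:

import asyncio
import csv
import logging

import aiohttp

BASE_URL = "https://example.com/table?page="  # placeholder -- use your real paginated URL
TOTAL_PAGES = 20000   # roughly the page count you mentioned
BATCH_SIZE = 100      # pages launched per batch (illustrative)

async def fetch_page(session, url, page):
    """Fetch one page's HTML, returning None on failure."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            resp.raise_for_status()
            return await resp.text()
    except Exception as exc:
        logging.error(f"Request for page {page} failed: {exc}")
        return None

async def main():
    all_rows, titles = [], None
    # The connector limit caps simultaneous connections so the server isn't overwhelmed.
    connector = aiohttp.TCPConnector(limit=25)
    async with aiohttp.ClientSession(connector=connector) as session:
        for start in range(1, TOTAL_PAGES + 1, BATCH_SIZE):
            batch = range(start, min(start + BATCH_SIZE, TOTAL_PAGES + 1))
            # scrape_page is your coroutine; gather runs the whole batch concurrently.
            results = await asyncio.gather(*(scrape_page(session, p) for p in batch))
            for rows, page_titles in results:
                all_rows.extend(rows)
                titles = titles or page_titles
            print(f"Finished pages {batch.start}-{batch.stop - 1}, rows so far: {len(all_rows)}")

    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if titles:
            writer.writerow(titles)
        writer.writerows(all_rows)

asyncio.run(main())

Writing the CSV once at the end (or once per batch) is much cheaper than reopening the file for every page, and keeping the concurrency capped reduces the chance of getting rate-limited or blocked.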