I'm trying to scrape a table from this website using the bs4 and requests libraries, but I couldn't find any relevant information in the XHR or JS sections of Chrome's inspect tool, nor any JSON file to pull the data from.
I was hoping to find an API endpoint in the site's XHR or JS requests, but I didn't manage to find anything there, so I decided to scrape the data from each page using this code (written with the help of ChatGPT), which extracts the data from each page and saves it into a .csv file:
async def scrape_page(session, page):
    """Scrape data from a single page."""
    url = BASE_URL + str(page)
    html = await fetch_page(session, url, page)
    if not html:
        logging.error(f"❌ Failed to fetch page {page}. Skipping...")
        return [], None  # Skip if page fails

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": lambda x: x and x.startswith("guid-")})
    if not table:
        logging.warning(f"⚠️ No table found on page {page}. Check if the structure has changed!")
        return [], None

    titles = [th.text.strip() for th in table.find_all("th")]
    rows = table.find_all("tr")[1:]  # Skip first row (headers)
    data = [[td.text.strip() for td in row.find_all("td")] for row in rows]

    print(f"✅ Scraped {len(data)} records from page {page}")  # DEBUG PRINT
    return data, titles
The problem I've run into is that there are a lot of pages (approximately 20,000 pages with 15 rows of data on each), and it's taking too long to scrape all of it. Any suggestions on how I could optimize this process?
1 Answer
I examined the target website you're scraping, and since the data isn't loaded via a visible API, your current page-by-page method is necessary. However, to improve efficiency and speed, consider an asynchronous, concurrent approach. Here's how:
- Use Scrapy – it's optimized for fast, scalable scraping and handles concurrency, retries, and CSV export for you (see the spider sketch below).
- Process in batches – instead of scraping pages one by one, fetch around 100 at a time to reduce overhead (see the asyncio sketch after this list).
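Here's a minimal Scrapy spider sketch for the first suggestion. The URL pattern, spider name, and settings are illustrative placeholders; adapt them to the real site and its table structure:

import scrapy

class TableSpider(scrapy.Spider):
    name = "table_spider"
    # Hypothetical paginated URL -- substitute the real one from your BASE_URL.
    start_urls = [f"https://example.com/table?page={p}" for p in range(1, 20001)]
    custom_settings = {
        "CONCURRENT_REQUESTS": 32,  # Scrapy schedules requests concurrently and retries failures
        "DOWNLOAD_DELAY": 0.1,      # small delay to stay polite to the server
    }

    def parse(self, response):
        table = response.xpath('//table[starts-with(@id, "guid-")]')
        headers = [th.xpath("string(.)").get("").strip() for th in table.xpath(".//th")]
        for row in table.xpath(".//tr[position() > 1]"):
            cells = [td.xpath("string(.)").get("").strip() for td in row.xpath("./td")]
            yield dict(zip(headers, cells))

Run it with scrapy runspider table_spider.py -o table.csv and Scrapy takes care of the CSV export for you.

If you'd rather keep your aiohttp/BeautifulSoup code, here's a rough sketch of the batching idea that reuses the scrape_page coroutine from your question. BASE_URL, TOTAL_PAGES, the batch size, and the connection limit are placeholders, and it assumes your scrape_page (and its bs4 imports) are defined in the same file:

import asyncio
import csv
import logging

import aiohttp

BASE_URL = "https://example.com/table?page="  # placeholder -- use your real paginated URL
TOTAL_PAGES = 20000   # roughly the page count you mentioned
BATCH_SIZE = 100      # pages launched per batch (illustrative)

async def fetch_page(session, url, page):
    """Fetch one page's HTML, returning None on failure."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            resp.raise_for_status()
            return await resp.text()
    except Exception as exc:
        logging.error(f"Request for page {page} failed: {exc}")
        return None

async def main():
    all_rows, titles = [], None
    # The connector limit caps simultaneous connections so the server isn't overwhelmed.
    connector = aiohttp.TCPConnector(limit=25)
    async with aiohttp.ClientSession(connector=connector) as session:
        for start in range(1, TOTAL_PAGES + 1, BATCH_SIZE):
            batch = range(start, min(start + BATCH_SIZE, TOTAL_PAGES + 1))
            # scrape_page is your coroutine; gather runs the whole batch concurrently.
            results = await asyncio.gather(*(scrape_page(session, p) for p in batch))
            for rows, page_titles in results:
                all_rows.extend(rows)
                titles = titles or page_titles
            print(f"Finished pages {batch.start}-{batch.stop - 1}, rows so far: {len(all_rows)}")

    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if titles:
            writer.writerow(titles)
        writer.writerows(all_rows)

asyncio.run(main())

Writing the CSV once at the end (or once per batch) is much cheaper than reopening the file for every page, and keeping the concurrency capped reduces the chance of getting rate-limited or blocked.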