Description
I am trying to scrape a site (/) to get the name, price, and discounted price of the products listed.
When I run a script to scrape just the first page of the site using scrapy crawl, the number of requests shown in the console is enormous. I understand that the Playwright middleware acts like a browser and makes additional requests for images and other assets to render the site in full detail, but even accounting for that, the volume of GET requests in my console is huge. This makes it really hard to debug and slows down my scraper. I want to understand what these requests are and why they are being made, and also how I can limit them to the bare minimum. I know about the page.route
method but am unable to figure out which resource types to filter out so that my scraper still gets its data.
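For reference, this is the kind of filter I have been experimenting with via scrapy-playwright's PLAYWRIGHT_ABORT_REQUEST setting (a callable that receives the Playwright request and returns True to abort it). The set of resource types blocked here is my guess at what is safe to drop; I am not sure whether blocking stylesheets breaks anything on this site:

```python
# settings.py -- block heavy, non-essential resource types at the browser level.
# BLOCKED_RESOURCE_TYPES is my guess; "document", "script", "xhr" and "fetch"
# are deliberately left alone so the page can still build the DOM I parse.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}


def should_abort_request(request):
    """Return True if the Playwright request should be aborted before it is sent."""
    return request.resource_type in BLOCKED_RESOURCE_TYPES


PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

Is this the right way to cut the request volume down, and is this set of resource types reasonable?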
Steps to Reproduce
- Ran scrapy crawl product_playwright with the following code in my spider file:
import scrapy
from scrapy_playwright.page import PageMethod


class ProductPlaywrightSpider(scrapy.Spider):
    name = "product_playwright"

    def start_requests(self):
        url = "/"
        # errback must be passed to scrapy.Request itself, not inside meta,
        # otherwise it is never registered as the error callback
        yield scrapy.Request(
            url,
            self.parse,
            errback=self.errback,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        products = response.xpath('//li[@class="bp-p-blueberryDealCard bp-p-filterGrid_item bp-p-dealCard bp-c-card"]')
        screenshot = await page.screenshot(
            path='D:\\Desktop\\Company\\PythonProject\\slickdeals\\pic_of_courses\\page1.png',
            full_page=True,
        )
        await page.close()
        for product in products:
            yield dict(
                name=product.xpath(".//a[@class='bp-c-card_title bp-c-link']/text()").get(),
                discounted_price=product.xpath('.//descendant::span[@class="bp-p-dealCard_price"]/text()').get(),
                original_price=product.xpath('.//descendant::span[@class="bp-p-dealCard_originalPrice"]/text()').get(),
                name_of_store=product.xpath('.//descendant::span[@class="bp-c-card_subtitle"]/text()').get(),
            )

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
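The page.route approach I mentioned above would look roughly like this (the handler name and the set of blocked resource types are my own; I am unsure which types the deal cards actually need in order to render):

```python
import asyncio

# My guess at resource types that can be dropped without breaking the DOM.
HEAVY_TYPES = {"image", "media", "font"}


async def block_heavy_resources(route):
    """Playwright route handler: abort heavy asset requests, let the rest through."""
    if route.request.resource_type in HEAVY_TYPES:
        await route.abort()
    else:
        await route.continue_()

# Registration (inside start_requests), via scrapy-playwright's PageMethod:
#   playwright_page_methods=[PageMethod("route", "**/*", block_heavy_resources)]
```

Would registering this handler before navigation be enough, or do I also need to keep stylesheets for the selectors to match?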
Expected behavior: (I had to trim a lot of the log below, as it exceeded the 354000-character limit)
I am not sure about this
Actual behavior:
scrapy crawl product_playwright
2025-02-09 12:03:47 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: slickdeals)
2025-02-09 12:03:47 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.12.8 (tags/v3.12.8:2dc476b, Dec 3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-11-10.0.26100-SP0
2025-02-09 12:03:47 [scrapy.addons] INFO: Enabled addons:
[]
2025-02-09 12:03:47 [asyncio] DEBUG: Using selector: SelectSelector
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-09 12:03:47 [scrapy.extensions.telnet] INFO: Telnet Password: 8a6d05924bf76441
2025-02-09 12:03:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2025-02-09 12:03:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'slickdeals',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'slickdeals.spiders',
'SPIDER_MODULES': ['slickdeals.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2025-02-09 12:03:47 [asyncio] DEBUG: Using proactor: IocpProactor
2025-02-09 12:03:47 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2025-02-09 12:03:48 [scrapy.core.engine] INFO: Spider opened
2025-02-09 12:03:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-02-09 12:03:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-02-09 12:03:48 [scrapy-playwright] INFO: Starting download handler
2025-02-09 12:03:48 [scrapy-playwright] INFO: Starting download handler
2025-02-09 12:03:53 [scrapy-playwright] INFO: Launching browser chromium
2025-02-09 12:03:53 [scrapy-playwright] INFO: Browser chromium launched
2025-02-09 12:03:53 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2025-02-09 12:03:55 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2025-02-09 12:03:55 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /> (resource type: document)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 />
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .php?section=app.php&url=%2Fcomputer-deals%2F> (resource type: other, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET :wght@400;600;700&display=swap> (resource type: stylesheet, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET ;style=14&n=global-critical-desktop%2Cglobal-desktop%2Clegacy-global-desktop%2Cjqueryui%2Ccomponents> (resource type: stylesheet, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /[email protected]> (resource type: image, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /[email protected]> (resource type: image, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .es.00330623.js> (resource type: script, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /[email protected]> (resource type: image, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 /[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 /[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .browser.2e0e5c307a4e1b12459ebc95cb41237cc62c717b960391ec21ae6b3a2d3df526.js> (resource type: script, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .es.00330623.js>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 /[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 :wght@400;600;700&display=swap>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .browser.2e0e5c307a4e1b12459ebc95cb41237cc62c717b960391ec21ae6b3a2d3df526.js>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .php?section=app.php&url=%2Fcomputer-deals%2F>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js?9310> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .js?9310>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <301 .js> (location: ;upapi=true)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 ;style=14&n=global-critical-desktop%2Cglobal-desktop%2Clegacy-global-desktop%2Cjqueryui%2Ccomponents>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET ;upapi=true> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 ;upapi=true>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js?1> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js?id=GTM-5XP5PSM&l=gtmDl> (resource type: script, referrer: /)
2025-02-09 13:13:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 998621,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 30.030142,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2025, 2, 9, 7, 43, 15, 263535, tzinfo=datetime.timezone.utc),
'item_scraped_count': 40,
'items_per_minute': None,
'log_count/DEBUG': 637,
'log_count/INFO': 17,
'playwright/browser_count': 1,
'playwright/context_count': 1,
'playwright/context_count/max_concurrent': 1,
'playwright/context_count/persistent/False': 1,
'playwright/context_count/remote/False': 1,
'playwright/page_count': 1,
'playwright/page_count/max_concurrent': 1,
'playwright/request_count': 322,
'playwright/request_count/method/GET': 306,
'playwright/request_count/method/HEAD': 1,
'playwright/request_count/method/POST': 15,
'playwright/request_count/navigation': 33,
'playwright/request_count/resource_type/document': 33,
'playwright/request_count/resource_type/fetch': 48,
'playwright/request_count/resource_type/font': 6,
'playwright/request_count/resource_type/image': 157,
'playwright/request_count/resource_type/other': 2,
'playwright/request_count/resource_type/script': 62,
'playwright/request_count/resource_type/stylesheet': 2,
'playwright/request_count/resource_type/xhr': 12,
'playwright/response_count': 266,
'playwright/response_count/method/GET': 252,
'playwright/response_count/method/HEAD': 1,
'playwright/response_count/method/POST': 13,
'playwright/response_count/resource_type/document': 32,
'playwright/response_count/resource_type/fetch': 45,
'playwright/response_count/resource_type/font': 6,
'playwright/response_count/resource_type/image': 121,
'playwright/response_count/resource_type/other': 2,
'playwright/response_count/resource_type/script': 47,
'playwright/response_count/resource_type/stylesheet': 2,
'playwright/response_count/resource_type/xhr': 11,
'response_received_count': 1,
'responses_per_minute': None,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2025, 2, 9, 7, 42, 45, 233393, tzinfo=datetime.timezone.utc)}
2025-02-09 13:13:15 [scrapy.core.engine] INFO: Spider closed (finished)
2025-02-09 13:13:15 [scrapy-playwright] INFO: Closing download handler
2025-02-09 13:13:16 [scrapy-playwright] INFO: Closing download handler
2025-02-09 13:13:16 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2025-02-09 13:13:16 [scrapy-playwright] INFO: Closing browser
2025-02-09 13:13:16 [scrapy-playwright] DEBUG: Browser disconnected
Reproduces how often: every time
Versions (I have added Playwright's version too)
scrapy version --verbose
Scrapy : 2.12.0
playwright : 1.49.1
scrapy-playwright : 0.0.42
lxml : 5.3.0.0
libxml2 : 2.11.7
cssselect : 1.2.0
parsel : 1.10.0
w3lib : 2.2.1
Twisted : 24.11.0
Python : 3.12.8 (tags/v3.12.8:2dc476b, Dec 3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
pyOpenSSL : 25.0.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform : Windows-11-10.0.26100-SP0