
web scraping - What resource types to filter out to make the scraper avoid unnecessary requests in scrapy-playwright - Stack Overflow


Description

I am trying to scrape a site (/) to get the name, price, and discounted price of the listed products.

When I run a script to scrape just the first page of the site using scrapy crawl, the number of requests shown in the console is enormous. I understand that I am using the Playwright middleware, so it acts like a browser and makes additional requests for images and other assets to render the site in full detail, but even accounting for that, the volume of GET requests in my console is huge. This makes debugging really hard and slows down my scraper. I want to understand what these requests are and why they are being made. Also, how can I limit them to the bare minimum? I know about the page.route method, but I cannot figure out which resource types to filter out while still letting my scraper get its data.
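For reference, a minimal sketch of what such filtering could look like using scrapy-playwright's `PLAYWRIGHT_ABORT_REQUEST` setting, which applies the predicate to every request the browser makes (unlike a `page.route` call issued after navigation has already started). The exact set of resource types to block is an assumption: the product names and prices here live in the HTML document itself, so images, fonts, media, and stylesheets are usually safe to drop, while blocking scripts or xhr can break sites that render content client-side.

```python
# settings.py -- sketch, assuming scrapy-playwright's PLAYWRIGHT_ABORT_REQUEST
# setting. The blocked set below is a guess at what this page can do without;
# test before also blocking "script" or "xhr" on JS-rendered sites.

BLOCKED_RESOURCE_TYPES = {"image", "font", "media", "stylesheet"}

def should_abort_request(request):
    # request is a playwright.async_api.Request; returning True aborts it,
    # so the browser never fetches that resource.
    return request.resource_type in BLOCKED_RESOURCE_TYPES

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

With this in place, the aborted requests show up in the crawl stats under `playwright/request_count/aborted`, which makes it easy to verify how much traffic the filter is actually saving.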

Steps to Reproduce

  1. Ran scrapy crawl product_playwright with the following code in my spider:
import scrapy
from scrapy_playwright.page import PageMethod


class ProductPlaywrightSpider(scrapy.Spider):
    name = "product_playwright"
    def start_requests(self):
        url = "/"
        yield scrapy.Request(
            url,
            callback=self.parse,
            # errback is a Request argument, not a meta key
            errback=self.errback,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        products = response.xpath('//li[@class="bp-p-blueberryDealCard bp-p-filterGrid_item bp-p-dealCard bp-c-card"]')
        await page.screenshot(
            path='D:\\Desktop\\Company\\PythonProject\\slickdeals\\pic_of_courses\\page1.png',
            full_page=True)
        await page.close()
        for product in products:
            yield dict(
                name = product.xpath(".//a[@class='bp-c-card_title bp-c-link']/text()").get(),
                discounted_price = product.xpath('.//descendant::span[@class="bp-p-dealCard_price"]/text()').get(),
                original_price = product.xpath('.//descendant::span[@class="bp-p-dealCard_originalPrice"]/text()').get(),
                name_of_store = product.xpath('.//descendant::span[@class="bp-c-card_subtitle"]/text()').get()
            )


    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()

Expected behavior: (Had to trim a lot, as the output exceeded the 354,000-character limit)

I am not sure about this

Actual behavior:

scrapy crawl product_playwright
2025-02-09 12:03:47 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: slickdeals)
2025-02-09 12:03:47 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-11-10.0.26100-SP0
2025-02-09 12:03:47 [scrapy.addons] INFO: Enabled addons:
[]
2025-02-09 12:03:47 [asyncio] DEBUG: Using selector: SelectSelector
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-09 12:03:47 [scrapy.extensions.telnet] INFO: Telnet Password: 8a6d05924bf76441
2025-02-09 12:03:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2025-02-09 12:03:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'slickdeals',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'slickdeals.spiders',
 'SPIDER_MODULES': ['slickdeals.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2025-02-09 12:03:47 [asyncio] DEBUG: Using proactor: IocpProactor
2025-02-09 12:03:47 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2025-02-09 12:03:48 [scrapy.core.engine] INFO: Spider opened
2025-02-09 12:03:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-02-09 12:03:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-02-09 12:03:48 [scrapy-playwright] INFO: Starting download handler
2025-02-09 12:03:48 [scrapy-playwright] INFO: Starting download handler
2025-02-09 12:03:53 [scrapy-playwright] INFO: Launching browser chromium
2025-02-09 12:03:53 [scrapy-playwright] INFO: Browser chromium launched
2025-02-09 12:03:53 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2025-02-09 12:03:55 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2025-02-09 12:03:55 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /> (resource type: document)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 />
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .php?section=app.php&url=%2Fcomputer-deals%2F> (resource type: other, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET :wght@400;600;700&display=swap> (resource type: stylesheet, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET ;style=14&n=global-critical-desktop%2Cglobal-desktop%2Clegacy-global-desktop%2Cjqueryui%2Ccomponents> (resource type: stylesheet, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /[email protected]> (resource type: image, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /[email protected]> (resource type: image, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .es.00330623.js> (resource type: script, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /[email protected]> (resource type: image, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 /[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 /[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .browser.2e0e5c307a4e1b12459ebc95cb41237cc62c717b960391ec21ae6b3a2d3df526.js> (resource type: script, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .es.00330623.js>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 /[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 :wght@400;600;700&display=swap>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .browser.2e0e5c307a4e1b12459ebc95cb41237cc62c717b960391ec21ae6b3a2d3df526.js>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .php?section=app.php&url=%2Fcomputer-deals%2F>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js?9310> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .js?9310>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <301 .js> (location: ;upapi=true)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 ;style=14&n=global-critical-desktop%2Cglobal-desktop%2Clegacy-global-desktop%2Cjqueryui%2Ccomponents>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET ;upapi=true> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 ;upapi=true>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js?1> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js?id=GTM-5XP5PSM&l=gtmDl> (resource type: script, referrer: /)
2025-02-09 13:13:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 998621,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 30.030142,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2025, 2, 9, 7, 43, 15, 263535, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 40,
 'items_per_minute': None,
 'log_count/DEBUG': 637,
 'log_count/INFO': 17,
 'playwright/browser_count': 1,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 322,
 'playwright/request_count/method/GET': 306,
 'playwright/request_count/method/HEAD': 1,
 'playwright/request_count/method/POST': 15,
 'playwright/request_count/navigation': 33,
 'playwright/request_count/resource_type/document': 33,
 'playwright/request_count/resource_type/fetch': 48,
 'playwright/request_count/resource_type/font': 6,
 'playwright/request_count/resource_type/image': 157,
 'playwright/request_count/resource_type/other': 2,
 'playwright/request_count/resource_type/script': 62,
 'playwright/request_count/resource_type/stylesheet': 2,
 'playwright/request_count/resource_type/xhr': 12,
 'playwright/response_count': 266,
 'playwright/response_count/method/GET': 252,
 'playwright/response_count/method/HEAD': 1,
 'playwright/response_count/method/POST': 13,
 'playwright/response_count/resource_type/document': 32,
 'playwright/response_count/resource_type/fetch': 45,
 'playwright/response_count/resource_type/font': 6,
 'playwright/response_count/resource_type/image': 121,
 'playwright/response_count/resource_type/other': 2,
 'playwright/response_count/resource_type/script': 47,
 'playwright/response_count/resource_type/stylesheet': 2,
 'playwright/response_count/resource_type/xhr': 11,
 'response_received_count': 1,
 'responses_per_minute': None,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2025, 2, 9, 7, 42, 45, 233393, tzinfo=datetime.timezone.utc)}
2025-02-09 13:13:15 [scrapy.core.engine] INFO: Spider closed (finished)
2025-02-09 13:13:15 [scrapy-playwright] INFO: Closing download handler
2025-02-09 13:13:16 [scrapy-playwright] INFO: Closing download handler
2025-02-09 13:13:16 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2025-02-09 13:13:16 [scrapy-playwright] INFO: Closing browser
2025-02-09 13:13:16 [scrapy-playwright] DEBUG: Browser disconnected

Reproduces how often: every time

Versions (Playwright's version included too)

scrapy version --verbose
Scrapy       : 2.12.0
playwright   : 1.49.1
scrapy-playwright : 0.0.42
lxml         : 5.3.0.0
libxml2      : 2.11.7
cssselect    : 1.2.0
parsel       : 1.10.0
w3lib        : 2.2.1
Twisted      : 24.11.0
Python       : 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
pyOpenSSL    : 25.0.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform     : Windows-11-10.0.26100-SP0
