
web scraping - What resource types to filter out to make the scraper avoid unnecessary requests in scrapy-playwright - Stack Overflow


Description

I am trying to scrape a site (/) to get the name, price, and discounted price of the listed products.

When I run a script to scrape just the first page of the site using scrapy crawl, the number of requests shown in the console is enormous. I understand that I am using the Playwright middleware, so it acts like a browser and makes additional requests for images and other assets to render the site in full detail, but even accounting for that, the volume of GET requests in my console is huge. This makes debugging really hard and slows down my scraper. I want to understand what these requests are and why they are being made. Also, how can I limit them to the bare minimum? I know about the page.route method, but I cannot figure out which resource types to filter out while still letting my scraper get its data.
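For reference, a minimal sketch of what such filtering could look like using scrapy-playwright's `PLAYWRIGHT_ABORT_REQUEST` setting, which applies the predicate to every request the browser makes (unlike a `page.route` call issued after navigation has already started). The exact set of resource types to block is an assumption: the product names and prices here live in the HTML document itself, so images, fonts, media, and stylesheets are usually safe to drop, while blocking scripts or xhr can break sites that render content client-side.

```python
# settings.py -- sketch, assuming scrapy-playwright's PLAYWRIGHT_ABORT_REQUEST
# setting. The blocked set below is a guess at what this page can do without;
# test before also blocking "script" or "xhr" on JS-rendered sites.

BLOCKED_RESOURCE_TYPES = {"image", "font", "media", "stylesheet"}

def should_abort_request(request):
    # request is a playwright.async_api.Request; returning True aborts it,
    # so the browser never fetches that resource.
    return request.resource_type in BLOCKED_RESOURCE_TYPES

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
```

With this in place, the aborted requests show up in the crawl stats under `playwright/request_count/aborted`, which makes it easy to verify how much traffic the filter is actually saving.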

Steps to Reproduce

  1. Ran scrapy crawl product_playwright with the following code in my spider:
import scrapy
from scrapy_playwright.page import PageMethod


class ProductPlaywrightSpider(scrapy.Spider):
    name = "product_playwright"
    def start_requests(self):
        url = "/"
        yield scrapy.Request(
            url,
            callback=self.parse,
            # errback is a Request argument, not a meta key
            errback=self.errback,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        products = response.xpath('//li[@class="bp-p-blueberryDealCard bp-p-filterGrid_item bp-p-dealCard bp-c-card"]')
        await page.screenshot(
            path='D:\\Desktop\\Company\\PythonProject\\slickdeals\\pic_of_courses\\page1.png',
            full_page=True)
        await page.close()
        for product in products:
            yield dict(
                name = product.xpath(".//a[@class='bp-c-card_title bp-c-link']/text()").get(),
                discounted_price = product.xpath('.//descendant::span[@class="bp-p-dealCard_price"]/text()').get(),
                original_price = product.xpath('.//descendant::span[@class="bp-p-dealCard_originalPrice"]/text()').get(),
                name_of_store = product.xpath('.//descendant::span[@class="bp-c-card_subtitle"]/text()').get()
            )


    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()

Expected behavior: (Had to trim a lot, as the output exceeded the 354,000-character limit)

I am not sure about this

Actual behavior:

scrapy crawl product_playwright
2025-02-09 12:03:47 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: slickdeals)
2025-02-09 12:03:47 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.11.0, Python 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)], pyOpenSSL 24.3.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-11-10.0.26100-SP0
2025-02-09 12:03:47 [scrapy.addons] INFO: Enabled addons:
[]
2025-02-09 12:03:47 [asyncio] DEBUG: Using selector: SelectSelector
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-09 12:03:47 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-09 12:03:47 [scrapy.extensions.telnet] INFO: Telnet Password: 8a6d05924bf76441
2025-02-09 12:03:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2025-02-09 12:03:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'slickdeals',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'slickdeals.spiders',
 'SPIDER_MODULES': ['slickdeals.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2025-02-09 12:03:47 [asyncio] DEBUG: Using proactor: IocpProactor
2025-02-09 12:03:47 [scrapy-playwright] INFO: Started loop on separate thread: <ProactorEventLoop running=True closed=False debug=False>
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2025-02-09 12:03:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2025-02-09 12:03:48 [scrapy.core.engine] INFO: Spider opened
2025-02-09 12:03:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-02-09 12:03:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-02-09 12:03:48 [scrapy-playwright] INFO: Starting download handler
2025-02-09 12:03:48 [scrapy-playwright] INFO: Starting download handler
2025-02-09 12:03:53 [scrapy-playwright] INFO: Launching browser chromium
2025-02-09 12:03:53 [scrapy-playwright] INFO: Browser chromium launched
2025-02-09 12:03:53 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2025-02-09 12:03:55 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2025-02-09 12:03:55 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /> (resource type: document)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 />
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .php?section=app.php&url=%2Fcomputer-deals%2F> (resource type: other, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET :wght@400;600;700&display=swap> (resource type: stylesheet, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET ;style=14&n=global-critical-desktop%2Cglobal-desktop%2Clegacy-global-desktop%2Cjqueryui%2Ccomponents> (resource type: stylesheet, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /[email protected]> (resource type: image, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /[email protected]> (resource type: image, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .es.00330623.js> (resource type: script, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET /[email protected]> (resource type: image, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 /[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 /[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .browser.2e0e5c307a4e1b12459ebc95cb41237cc62c717b960391ec21ae6b3a2d3df526.js> (resource type: script, referrer: /)
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .es.00330623.js>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 /[email protected]>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 :wght@400;600;700&display=swap>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .browser.2e0e5c307a4e1b12459ebc95cb41237cc62c717b960391ec21ae6b3a2d3df526.js>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .php?section=app.php&url=%2Fcomputer-deals%2F>
2025-02-09 12:03:59 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js?9310> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 .js?9310>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <301 .js> (location: ;upapi=true)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 ;style=14&n=global-critical-desktop%2Cglobal-desktop%2Clegacy-global-desktop%2Cjqueryui%2Ccomponents>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET ;upapi=true> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Response: <200 ;upapi=true>
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js?1> (resource type: script, referrer: /)
2025-02-09 12:04:00 [scrapy-playwright] DEBUG: [Context=default] Request: <GET .js?id=GTM-5XP5PSM&l=gtmDl> (resource type: script, referrer: /)
2025-02-09 13:13:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 998621,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 30.030142,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2025, 2, 9, 7, 43, 15, 263535, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 40,
 'items_per_minute': None,
 'log_count/DEBUG': 637,
 'log_count/INFO': 17,
 'playwright/browser_count': 1,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/False': 1,
 'playwright/page_count': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 322,
 'playwright/request_count/method/GET': 306,
 'playwright/request_count/method/HEAD': 1,
 'playwright/request_count/method/POST': 15,
 'playwright/request_count/navigation': 33,
 'playwright/request_count/resource_type/document': 33,
 'playwright/request_count/resource_type/fetch': 48,
 'playwright/request_count/resource_type/font': 6,
 'playwright/request_count/resource_type/image': 157,
 'playwright/request_count/resource_type/other': 2,
 'playwright/request_count/resource_type/script': 62,
 'playwright/request_count/resource_type/stylesheet': 2,
 'playwright/request_count/resource_type/xhr': 12,
 'playwright/response_count': 266,
 'playwright/response_count/method/GET': 252,
 'playwright/response_count/method/HEAD': 1,
 'playwright/response_count/method/POST': 13,
 'playwright/response_count/resource_type/document': 32,
 'playwright/response_count/resource_type/fetch': 45,
 'playwright/response_count/resource_type/font': 6,
 'playwright/response_count/resource_type/image': 121,
 'playwright/response_count/resource_type/other': 2,
 'playwright/response_count/resource_type/script': 47,
 'playwright/response_count/resource_type/stylesheet': 2,
 'playwright/response_count/resource_type/xhr': 11,
 'response_received_count': 1,
 'responses_per_minute': None,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2025, 2, 9, 7, 42, 45, 233393, tzinfo=datetime.timezone.utc)}
2025-02-09 13:13:15 [scrapy.core.engine] INFO: Spider closed (finished)
2025-02-09 13:13:15 [scrapy-playwright] INFO: Closing download handler
2025-02-09 13:13:16 [scrapy-playwright] INFO: Closing download handler
2025-02-09 13:13:16 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False, remote=False)
2025-02-09 13:13:16 [scrapy-playwright] INFO: Closing browser
2025-02-09 13:13:16 [scrapy-playwright] DEBUG: Browser disconnected

Reproduces how often: every time

Versions (Playwright's version included too)

scrapy version --verbose
Scrapy       : 2.12.0
playwright   : 1.49.1
scrapy-playwright : 0.0.42
lxml         : 5.3.0.0
libxml2      : 2.11.7
cssselect    : 1.2.0
parsel       : 1.10.0
w3lib        : 2.2.1
Twisted      : 24.11.0
Python       : 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
pyOpenSSL    : 25.0.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform     : Windows-11-10.0.26100-SP0
