
python - Unable to use CrawlerProcess to run a spider as a script to debug it at breakpoints - Stack Overflow


Description

I was trying to debug my Scrapy spider by running it through CrawlerProcess, but I keep hitting the problem below.

Steps to Reproduce

  1. I made a Python file named runner.py (it's in the same directory as my scrapy.cfg file) with the following code:
from scrapy.crawler import CrawlerProcess
from scrapy.utils import project
from glasses_shop_uk.spiders.bestsellers import BestsellersSpider

process = CrawlerProcess(settings=project.get_project_settings())
process.crawl(BestsellersSpider)
process.start()
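As an IDE-independent fallback, Python's built-in `breakpoint()` can stop execution inside `parse` even when runner.py is launched with plain `python`, bypassing the IDE debugger entirely. The sketch below (all names are hypothetical stand-ins, not part of the real spider) shows the hook mechanism `breakpoint()` relies on, with a custom hook substituted so it runs non-interactively:

```python
import sys

# Hypothetical stand-in for a spider callback; in the real spider this
# would be BestsellersSpider.parse.
def parse_stub():
    breakpoint()  # with the default hook this drops into pdb right here
    return "parsed"

# For this non-interactive sketch, replace the default pdb hook with one
# that just records that the breakpoint line was reached.
hits = []
sys.breakpointhook = lambda *args, **kwargs: hits.append("breakpoint reached")

result = parse_stub()
print(result, hits)  # parsed ['breakpoint reached']
```

With the default hook left in place, the same `breakpoint()` call drops into pdb, which is a useful sanity check when the IDE debugger itself is what's misbehaving.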
  2. In my spider script, named bestsellers.py, this is the code (the breakpoint location is shown with the commented line):
import scrapy

class BestsellersSpider(scrapy.Spider):
    name = "bestsellers"
    allowed_domains = ["www.glassesshop"]
    start_urls = ["…"]


    def parse(self, response):
        # my breakpoint is here
        products = response.xpath('//div[@id="product-lists"]/div/div[@class="pb-5 mb-lg-3 product-list-row text-center product-list-item"]')
        for product in products:
            product_name = product.xpath('.//descendant::a[starts-with(@class,"active product-img")]/img/@alt').get()
            product_price = product.xpath('.//descendant::div[@class="p-price"]/div[starts-with(@class,"active")]/span/span[1]/text()').get()
            product_url_relative = product.xpath('.//descendant::a[starts-with(@class,"active product-img")]/@href').get()
            product_url_full = f"{product_url_relative}"
            yield scrapy.Request(product_url_full, callback=self.parse_product, meta={"product_name": product_name, "product_price": product_price})

    def parse_product(self, response):
        product_name = response.meta["product_name"]
        product_price = response.meta["product_price"]
        product_img_url = response.xpath('//*[@id="app"]/div/section/div/div/div[1]/div[1]/div[2]/div[2]/div/div/div[1]/img/@src').get()
        yield {
            "product_name": product_name,
            "product_price": product_price,
            "product_img_url": product_img_url,
            "product_url" : response.url
        }
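One side note on the spider code above: `product_url_full = f"{product_url_relative}"` does not actually turn the relative href into an absolute URL. The usual Scrapy idiom is `response.urljoin(product_url_relative)`, which resolves the href against the page URL the same way the standard-library `urljoin` does. A minimal sketch, with hypothetical placeholder URLs standing in for `response.url` and the extracted href:

```python
from urllib.parse import urljoin

# Hypothetical page URL and href, standing in for response.url and the
# extracted product_url_relative.
page_url = "https://www.example.com/bestsellers"
relative_href = "/glasses/some-frame-p123.html"

# response.urljoin(relative_href) would resolve the same way:
full_url = urljoin(page_url, relative_href)
print(full_url)  # https://www.example.com/glasses/some-frame-p123.html
```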
  3. Now I run the runner.py script in debugging mode and get the traceback below.

Expected behavior:

I enter the debugger once the parse method starts on the response from the start URL.

Actual behavior:

import sys; print('Python %s on %s' % (sys.version, sys.platform))
D:\Desktop\Company\PythonProject\.venv\Scripts\python.exe -X pycache_prefix=C:\Users\Risha\AppData\Local\JetBrains\PyCharm2024.3\cpython-cache "C:/Program Files/JetBrains/PyCharm 2024.3.1.1/plugins/python-ce/helpers/pydev/pydevd.py" --multiprocess --qt-support=auto --port 29781 --file D:\Desktop\Company\PythonProject\glasses_shop_uk\runner.py 
2025-02-01 14:49:59 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: glasses_shop_uk)
2025-02-01 14:49:59 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.10.0, w3lib 2.2.1, Twisted 24.11.0, Python 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)], pyOpenSSL 25.0.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-11-10.0.26100-SP0
2025-02-01 14:49:59 [scrapy.addons] INFO: Enabled addons:
[]
2025-02-01 14:49:59 [asyncio] DEBUG: Using selector: SelectSelector
2025-02-01 14:49:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-01 14:49:59 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-01 14:49:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-01 14:49:59 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-01 14:49:59 [scrapy.extensions.telnet] INFO: Telnet Password: dc50ac58d34cb67d
2025-02-01 14:50:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2025-02-01 14:50:00 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'glasses_shop_uk',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'glasses_shop_uk.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['glasses_shop_uk.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2025-02-01 14:50:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2025-02-01 14:50:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2025-02-01 14:50:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2025-02-01 14:50:00 [scrapy.core.engine] INFO: Spider opened
2025-02-01 14:50:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-02-01 14:50:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-02-01 14:50:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET .txt> (referer: )
2025-02-01 14:50:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET ; (referer: )
2025-02-01 14:50:03 [asyncio] ERROR: Exception in callback <Task pending name='Task-1' coro=<SpiderMiddlewareManager.scrape_response.<locals>.process_callback_output() running at D:\Desktop\Company\PythonProject\.venv\Lib\site-packages\scrapy\core\spidermw.py:313> cb=[Deferred.fromFuture.<locals>.adapt() at D:\Desktop\Company\PythonProject\.venv\Lib\site-packages\twisted\internet\defer.py:1251]>()
handle: <Handle <Task pending name='Task-1' coro=<SpiderMiddlewareManager.scrape_response.<locals>.process_callback_output() running at D:\Desktop\Company\PythonProject\.venv\Lib\site-packages\scrapy\core\spidermw.py:313> cb=[Deferred.fromFuture.<locals>.adapt() at D:\Desktop\Company\PythonProject\.venv\Lib\site-packages\twisted\internet\defer.py:1251]>()>
Traceback (most recent call last):
  File "C:\Users\Risha\AppData\Local\Programs\Python\Python312\Lib\asyncio\events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
TypeError: 'Task' object is not callable
2025-02-01 14:51:00 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2025-02-01 14:52:00 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-02-01 14:53:00 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Reproduces how often: every time

Versions

scrapy version --verbose
Scrapy       : 2.12.0
lxml         : 5.3.0.0
libxml2      : 2.11.7
cssselect    : 1.2.0
parsel       : 1.10.0
w3lib        : 2.2.1
Twisted      : 24.11.0
Python       : 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
pyOpenSSL    : 25.0.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform     : Windows-11-10.0.26100-SP0

Please help, as running a spider as a script like this is a really useful way to debug it.
