Description
I was trying to debug my Scrapy spider by running it through CrawlerProcess, but I keep hitting the problem below.
Steps to Reproduce
- I made a Python file named runner.py (it's in the same directory as my scrapy.cfg file) with the following code:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils import project

from glasses_shop_uk.spiders.bestsellers import BestsellersSpider

process = CrawlerProcess(settings=project.get_project_settings())
process.crawl(BestsellersSpider)
process.start()
```
- My spider, bestsellers.py, contains the following code (the breakpoint location is marked with a comment):

```python
import scrapy


class BestsellersSpider(scrapy.Spider):
    name = "bestsellers"
    allowed_domains = ["www.glassesshop"]
    start_urls = [";]

    def parse(self, response):
        # my breakpoint is here
        products = response.xpath('//div[@id="product-lists"]/div/div[@class="pb-5 mb-lg-3 product-list-row text-center product-list-item"]')
        for product in products:
            product_name = product.xpath('.//descendant::a[starts-with(@class,"active product-img")]/img/@alt').get()
            product_price = product.xpath('.//descendant::div[@class="p-price"]/div[starts-with(@class,"active")]/span/span[1]/text()').get()
            product_url_relative = product.xpath('.//descendant::a[starts-with(@class,"active product-img")]/@href').get()
            product_url_full = f"{product_url_relative}"
            yield scrapy.Request(product_url_full, callback=self.parse_product, meta={"product_name": product_name, "product_price": product_price})

    def parse_product(self, response):
        product_name = response.meta["product_name"]
        product_price = response.meta["product_price"]
        product_img_url = response.xpath('//*[@id="app"]/div/section/div/div/div[1]/div[1]/div[2]/div[2]/div/div/div[1]/img/@src').get()
        yield {
            "product_name": product_name,
            "product_price": product_price,
            "product_img_url": product_img_url,
            "product_url": response.url,
        }
```
- Now I run runner.py in the PyCharm debugger and get a traceback.
Expected behavior:
The debugger stops at my breakpoint once the parse method starts on the response from the start URL.
Actual behavior:

```
import sys; print('Python %s on %s' % (sys.version, sys.platform))
D:\Desktop\Company\PythonProject\.venv\Scripts\python.exe -X pycache_prefix=C:\Users\Risha\AppData\Local\JetBrains\PyCharm2024.3\cpython-cache "C:/Program Files/JetBrains/PyCharm 2024.3.1.1/plugins/python-ce/helpers/pydev/pydevd.py" --multiprocess --qt-support=auto --port 29781 --file D:\Desktop\Company\PythonProject\glasses_shop_uk\runner.py
2025-02-01 14:49:59 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: glasses_shop_uk)
2025-02-01 14:49:59 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.10.0, w3lib 2.2.1, Twisted 24.11.0, Python 3.12.8 (tags/v3.12.8:2dc476b, Dec 3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)], pyOpenSSL 25.0.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Windows-11-10.0.26100-SP0
2025-02-01 14:49:59 [scrapy.addons] INFO: Enabled addons:
[]
2025-02-01 14:49:59 [asyncio] DEBUG: Using selector: SelectSelector
2025-02-01 14:49:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-01 14:49:59 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-01 14:49:59 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2025-02-01 14:49:59 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2025-02-01 14:49:59 [scrapy.extensions.telnet] INFO: Telnet Password: dc50ac58d34cb67d
2025-02-01 14:50:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2025-02-01 14:50:00 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'glasses_shop_uk',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'glasses_shop_uk.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['glasses_shop_uk.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2025-02-01 14:50:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2025-02-01 14:50:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2025-02-01 14:50:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2025-02-01 14:50:00 [scrapy.core.engine] INFO: Spider opened
2025-02-01 14:50:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-02-01 14:50:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2025-02-01 14:50:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET .txt> (referer: )
2025-02-01 14:50:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET ; (referer: )
2025-02-01 14:50:03 [asyncio] ERROR: Exception in callback <Task pending name='Task-1' coro=<SpiderMiddlewareManager.scrape_response.<locals>.process_callback_output() running at D:\Desktop\Company\PythonProject\.venv\Lib\site-packages\scrapy\core\spidermw.py:313> cb=[Deferred.fromFuture.<locals>.adapt() at D:\Desktop\Company\PythonProject\.venv\Lib\site-packages\twisted\internet\defer.py:1251]>()
handle: <Handle <Task pending name='Task-1' coro=<SpiderMiddlewareManager.scrape_response.<locals>.process_callback_output() running at D:\Desktop\Company\PythonProject\.venv\Lib\site-packages\scrapy\core\spidermw.py:313> cb=[Deferred.fromFuture.<locals>.adapt() at D:\Desktop\Company\PythonProject\.venv\Lib\site-packages\twisted\internet\defer.py:1251]>()>
Traceback (most recent call last):
  File "C:\Users\Risha\AppData\Local\Programs\Python\Python312\Lib\asyncio\events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
TypeError: 'Task' object is not callable
2025-02-01 14:51:00 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2025-02-01 14:52:00 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-02-01 14:53:00 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
```
Reproduces how often: every time
Versions

```
$ scrapy version --verbose
Scrapy       : 2.12.0
lxml         : 5.3.0.0
libxml2      : 2.11.7
cssselect    : 1.2.0
parsel       : 1.10.0
w3lib        : 2.2.1
Twisted      : 24.11.0
Python       : 3.12.8 (tags/v3.12.8:2dc476b, Dec 3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
pyOpenSSL    : 25.0.0 (OpenSSL 3.4.0 22 Oct 2024)
cryptography : 44.0.0
Platform     : Windows-11-10.0.26100-SP0
```
Please help; being able to step through a spider like this is a really useful debugging workflow.