
Get sitemap urls using scrapy without scraping urls - Stack Overflow


I am building my first Scrapy spider and am trying to retrieve the URLs in the sitemap. The code works, but it seems that Scrapy is also already crawling the website URLs themselves. I do not want that. How do I get only the URLs in the sitemap? Thanks a lot for your advice:

import scrapy
from scrapy.crawler import CrawlerProcess

class SitemapSpider(scrapy.spiders.SitemapSpider):
    name = "sitemap_spider"
    
    def __init__(self, sitemap_url, *args, **kwargs):
        super(SitemapSpider, self).__init__(*args, **kwargs)
        self.sitemap_urls = [sitemap_url]
        self.extracted_urls = []

    def parse(self, response):
        print(response.url)
        yield None

def run_sitemap_scraper(sitemap_url):
    
    # Run the scraper
    process = CrawlerProcess()
    process.crawl(SitemapSpider, sitemap_url=sitemap_url)
    process.start()

# Example usage
run_sitemap_scraper("https://ferienparkguide.de/sitemap_index.xml")

asked Mar 25 at 6:51 by hal1988
  • It's possible that you don't need Scrapy in this case. You can adapt the parsing code in SitemapSpider for your custom script or just do it from scratch (see the sketch below). – wRAR, Mar 25 at 20:00
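A from-scratch version of that suggestion could look like the sketch below (not from the original thread): it fetches the sitemap, or sitemap index, with the standard library and collects every <loc> without ever requesting the pages themselves. It assumes uncompressed XML sitemaps; the function name and depth limit are my own.

    import urllib.request
    import xml.etree.ElementTree as ET

    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def fetch_sitemap_urls(sitemap_url, max_depth=2):
        # Download and parse the sitemap XML (assumes it is not gzipped).
        with urllib.request.urlopen(sitemap_url) as resp:
            root = ET.fromstring(resp.read())
        if root.tag == NS + "sitemapindex" and max_depth > 0:
            # Sitemap index: recurse into each nested sitemap.
            urls = []
            for loc in root.iter(NS + "loc"):
                urls.extend(fetch_sitemap_urls(loc.text.strip(), max_depth - 1))
            return urls
        # Plain urlset: the <loc> entries are the page URLs themselves.
        return [loc.text.strip() for loc in root.iter(NS + "loc")]

    print(fetch_sitemap_urls("https://ferienparkguide.de/sitemap_index.xml"))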

2 Answers


I solved my case a bit differently in the end. Instead of the parse method above, I used:

    # Only follow xml and txt entries
    def sitemap_filter(self, entries):
        for entry in entries:
            print(entry["loc"])
            if entry["loc"].endswith(".xml") or entry["loc"].endswith(".txt"):
                yield entry

    # Do not parse any websites
    def parse(self, response):
        yield None
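Put together, a complete spider using this filter might look like the sketch below. The extracted_urls list and the closed callback are my additions, not part of the original answer: page URLs are recorded inside the filter instead of being yielded, so the pages themselves are never downloaded.

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class SitemapOnlySpider(scrapy.spiders.SitemapSpider):
        name = "sitemap_only_spider"

        def __init__(self, sitemap_url, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.sitemap_urls = [sitemap_url]
            self.extracted_urls = []

        def sitemap_filter(self, entries):
            for entry in entries:
                loc = entry["loc"]
                if loc.endswith(".xml") or loc.endswith(".txt"):
                    # Nested sitemap: yield it so Scrapy fetches and parses it.
                    yield entry
                else:
                    # Page URL: record it but do not yield it, so the
                    # page itself is never requested.
                    self.extracted_urls.append(loc)

        def closed(self, reason):
            # Runs once the crawl finishes.
            print(self.extracted_urls)

    process = CrawlerProcess()
    process.crawl(SitemapOnlySpider,
                  sitemap_url="https://ferienparkguide.de/sitemap_index.xml")
    process.start()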

Have a look at scrapy.spiders.SitemapSpider._parse_sitemap
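That method builds on the scrapy.utils.sitemap.Sitemap helper, which you can also borrow outside a spider. A minimal sketch, assuming you fetch the XML yourself (the function name is mine):

    import urllib.request
    from scrapy.utils.sitemap import Sitemap

    def iter_sitemap_locs(url):
        with urllib.request.urlopen(url) as resp:
            sitemap = Sitemap(resp.read())
        for entry in sitemap:
            if sitemap.type == "sitemapindex":
                # Entries of a sitemap index point at nested sitemaps.
                yield from iter_sitemap_locs(entry["loc"])
            else:
                yield entry["loc"]

    for loc in iter_sitemap_locs("https://ferienparkguide.de/sitemap_index.xml"):
        print(loc)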
