python - Extracting text from Wikisource using BeautifulSoup returns empty result - Stack Overflow

I'm trying to extract the text of a book from a Wikisource page using BeautifulSoup, but the result is always empty. The page I'm working on is Le Père Goriot by Balzac.

Here's the code I'm using:

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find the main text section
        text_section = soup.find("div", {"class": "mw-parser-output"})
        if not text_section:
            raise ValueError("Text section not found.")
        
        # Extract text from paragraphs and other elements
        text_elements = text_section.find_all(["p", "div"])
        text = "\n".join(element.get_text().strip() for element in text_elements if element.get_text().strip())
        
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource./wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")

Problem: The extract_text function always returns an empty string, even though the page clearly contains text. I suspect the issue is related to the structure of the Wikisource page, but I'm not sure how to fix it.

Asked Jan 30 at 21:31 by Hugo Durif, edited Jan 30 at 21:41
  • The problem is that text_section does not contain any p or div tags. Are you sure you are looking at the right tag and attribute to pull the information that you want? – Andrew Ryan, commented Jan 30 at 22:02
  • I'm sorry, but I'm new to this and I don't fully understand how it works yet. However, when I use 'inspect element', it seems to me that the book's text is indeed contained within several <p> tags. – Hugo Durif, commented Jan 30 at 22:18
2 Answers

To find the text section you are using the class mw-parser-output, but two different div elements on the page have this class, and the first one does not contain the text. The find function returns only the first matching element, which is why you get no text.
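To make the find behaviour concrete, here is a minimal sketch on inline HTML. The markup is a hypothetical simplification, not the real Wikisource page; it only assumes that two divs share the class and that the first one holds no p or div children.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two divs share the class,
# and only the second one wraps the actual text.
html = """
<div class="mw-parser-output"><span>indicator</span></div>
<div class="mw-parser-output">
  <div class="prp-pages-output"><p>First paragraph.</p><p>Second paragraph.</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match; find_all() returns every match
first = soup.find("div", class_="mw-parser-output")
all_matches = soup.find_all("div", class_="mw-parser-output")

# The first match has no <p> or <div> descendants, so nothing is collected
paragraphs_in_first = first.find_all(["p", "div"])
text = "\n".join(el.get_text().strip() for el in paragraphs_in_first)

print(len(all_matches))
print(repr(text))
```

With this sample, joining an empty list gives an empty string, which matches the "No text found." outcome described in the question.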

The div with the class prp-pages-output contains all the text you want, and it sits inside the second div with the class you used. You can select on this class instead.

You also don't need to collect the p and div tags individually. Calling get_text() on the parent element works fine.

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find the main text section
        text_section = soup.find("div", {"class": "prp-pages-output"})
        if not text_section:
            raise ValueError("Text section not found.")
        
        # Take the text directly from the parent element
        text = text_section.get_text().strip()
        
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource./wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
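
If the paragraphs run together when the text is taken from the parent, get_text also accepts a separator and a strip flag. A small sketch on a hypothetical inline snippet (not the real page markup):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the text container on the page
html = '<div class="prp-pages-output"><p> Line one. </p><p>Line two.</p></div>'
section = BeautifulSoup(html, "html.parser").find("div", class_="prp-pages-output")

# Default: all strings are concatenated exactly as they appear
plain = section.get_text()

# With a separator and strip=True, each string is trimmed and joined by "\n"
separated = section.get_text(separator="\n", strip=True)

print(repr(plain))
print(repr(separated))
```

Whether you need the separator depends on how the real page lays out its whitespace, so treat this as an option rather than a required change.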

However, the first div and the first two p elements are not text from the book; they hold data about the book and the titles/links of the previous and next books. If you want only the book content and not the other text, try the following. It uses a CSS selector that selects all sibling elements that come after the div tag containing the meta info.

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract the text
        text_elements = soup.select("div.prp-pages-output > div[itemid] ~ *")
        text = "\n".join(element.get_text().strip() for element in text_elements)
        
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource./wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
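
To see what the selector above matches, here is a small sketch on inline HTML. The structure and the itemid value are hypothetical; the only assumption carried over from the answer is that a div with an itemid attribute holds the meta info and the book text follows it as siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical layout: navigation, then a metadata div, then the book text
html = """
<div class="prp-pages-output">
  <p>Navigation</p>
  <div itemid="meta">Book metadata</div>
  <p>Chapter text one.</p>
  <p>Chapter text two.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "A ~ *" selects every later sibling of A; ">" restricts A to direct children
elements = soup.select("div.prp-pages-output > div[itemid] ~ *")
texts = [el.get_text().strip() for el in elements]

print(texts)
```

Everything before the metadata div, and the metadata div itself, is skipped. If no div with an itemid attribute exists, select returns an empty list and the function returns an empty string.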

I think the class was incorrect. I inspected the page, changed the class in the script, and I think it works:

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the main text section
        text_section = soup.find("div", {"class": "prp-pages-output"})
        if not text_section:
            raise ValueError("Text section not found.")

        # Extract text from paragraphs and other elements
        text_elements = text_section.find_all(["p", "div"])
        text = "\n".join(element.get_text().strip() for element in text_elements if element.get_text().strip())

        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource./wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
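
One thing to watch with find_all(["p", "div"]): it searches all descendants, so a p nested inside a div is collected twice, once on its own and once as part of the div's text. A minimal sketch on hypothetical markup (whether the real page nests elements this way is an assumption):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a <p> nested inside an inner <div>
html = '<div class="prp-pages-output"><div><p>Only once in the page.</p></div></div>'
section = BeautifulSoup(html, "html.parser").find("div", class_="prp-pages-output")

# find_all() is recursive: it returns the inner <div> and the <p> inside it
elements = section.find_all(["p", "div"])
joined = "\n".join(el.get_text().strip() for el in elements)

# Taking the text from the parent yields each string a single time
direct = section.get_text().strip()

print(joined.count("Only once in the page."))
print(direct)
```

If the output contains repeated passages, taking the text from the parent element avoids the duplication.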