python - Extracting text from Wikisource using BeautifulSoup returns empty result

I'm trying to extract the text of a book from a Wikisource page using BeautifulSoup, but the result is always empty. The page I'm working on is Le Père Goriot by Balzac.

Here's the code I'm using:

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find the main text section
        text_section = soup.find("div", {"class": "mw-parser-output"})
        if not text_section:
            raise ValueError("Text section not found.")
        
        # Extract text from paragraphs and other elements
        text_elements = text_section.find_all(["p", "div"])
        text = "\n".join(element.get_text().strip() for element in text_elements if element.get_text().strip())
        
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")

Problem: The extract_text function always returns an empty string, even though the page clearly contains text. I suspect the issue is related to the structure of the Wikisource page, but I'm not sure how to fix it.

Asked Jan 30 at 21:31 by Hugo Durif; edited Jan 30 at 21:41

The problem is that text_section does not contain any p or div tags. Are you sure you are looking at the right tag and attribute for the information you want? – Andrew Ryan, Jan 30 at 22:02
I'm sorry, but I'm new to this and don't fully understand how it works yet. However, when I use 'inspect element', the book's text does seem to be contained within several <p> tags. – Hugo Durif, Jan 30 at 22:18

2 Answers

You are locating the text section with the class mw-parser-output, but two different div elements on the page carry this class, and the first one does not contain the book text. The find function returns only the first matching element, which is why you get no text.
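
To make the find behaviour concrete, here is a minimal offline sketch. SAMPLE_HTML is a hypothetical, heavily simplified stand-in for the page (the real markup is more complex); it only assumes, as described above, that two divs share the class and only the second holds the text.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (an assumption, not the real page):
# two divs share the class, and only the second one holds the book text.
SAMPLE_HTML = """
<div class="mw-parser-output"><span>header only</span></div>
<div class="mw-parser-output">
  <div class="prp-pages-output"><p>First paragraph.</p><p>Second paragraph.</p></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() stops at the first match, which has no p or div descendants.
first = soup.find("div", {"class": "mw-parser-output"})
paragraphs_in_first = first.find_all(["p", "div"])

# find_all() returns every match, which makes the duplicate class visible.
all_matches = soup.find_all("div", {"class": "mw-parser-output"})
paragraph_counts = [len(div.find_all("p")) for div in all_matches]

print(len(paragraphs_in_first))  # 0
print(paragraph_counts)          # [0, 2]
```

Joining an empty list of elements yields an empty string, so the original function returns "" and the script prints "No text found." without raising any error.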

The div with class prp-pages-output contains all the text you want, and it sits inside the second div with the class you used. Select on this class instead.

You also don't need to collect the p and div tags individually. Calling get_text() on the parent element returns the text directly:

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Find the main text section
        text_section = soup.find("div", {"class": "prp-pages-output"})
        if not text_section:
            raise ValueError("Text section not found.")
        
        # Extract all text directly from the parent element
        text = text_section.get_text().strip()
        
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")

However, the first div and the first two p elements hold metadata about the book and the previous/next book's title and link, not the book text. If you want only the book content, try the following. It uses a CSS selector that matches every element after the div that contains the meta info.

import requests
from bs4 import BeautifulSoup

def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract the text
        text_elements = soup.select("div.prp-pages-output > div[itemid] ~ *")
        text = "\n".join(element.get_text().strip() for element in text_elements)
        
        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
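
For readers unfamiliar with the ~ combinator used in that selector, the following sketch shows what it matches. The markup is again a hypothetical simplification (the itemid value and the text are made up); only the selector string is taken from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: a metadata div followed by the book text.
SAMPLE_HTML = """
<div class="prp-pages-output">
  <div itemid="meta">Title and navigation links</div>
  <p>First paragraph of the book.</p>
  <p>Second paragraph of the book.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "A ~ *" matches every later sibling of A; the metadata div itself
# and anything placed before it are not selected.
body_elements = soup.select("div.prp-pages-output > div[itemid] ~ *")
body_text = "\n".join(el.get_text().strip() for el in body_elements)

print(body_text)
```

If the selector matches nothing, select returns an empty list and the joined text is an empty string, so it may be worth checking for that case explicitly rather than relying on the except block.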

I think the class was incorrect. I inspected the page, changed the class in the script, and it seems to work:

import requests
from bs4 import BeautifulSoup



def extract_text(url):
    try:
        # Fetch the page content
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')


        # Find the main text section
        text_section = soup.find("div", {"class": "prp-pages-output"})
        if not text_section:
            raise ValueError("Text section not found.")

        # Extract text from paragraphs and other elements
        text_elements = text_section.find_all(["p", "div"])
        text = "\n".join(element.get_text().strip() for element in text_elements if element.get_text().strip())

        return text
    except Exception as e:
        print(f"Error extracting text from {url}: {e}")
        return None

# Example usage
url = "https://fr.wikisource.org/wiki/Le_P%C3%A8re_Goriot_(1855)"
text = extract_text(url)
if text:
    print(text)
else:
    print("No text found.")
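
One thing to watch with find_all(["p", "div"]): it searches all descendants, so if a div wraps a p, the same text is collected once for the div and once for the p. Whether the real page nests its elements this way is an assumption; the sketch below uses hypothetical markup only to show the effect and compare it with calling get_text() on the parent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a paragraph nested inside an inner div.
SAMPLE_HTML = """
<div class="prp-pages-output">
  <div class="chapter"><p>Only paragraph.</p></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
section = soup.find("div", {"class": "prp-pages-output"})

# Both the inner div and its p are matched, so the text appears twice.
nested = [el.get_text().strip() for el in section.find_all(["p", "div"])]

# Reading the text from the parent element yields it once.
direct = section.get_text().strip()

print(nested)  # ['Only paragraph.', 'Only paragraph.']
print(direct)  # Only paragraph.
```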
