
selenium webdriver - How can I scrape the data within the "informations détaillées" section of this webpage?


Hi guys, I am brand new to web scraping.

I am trying to scrape the data within the "informations détaillées" section of this webpage (https://gallica.bnf.fr/ark:/12148/cb42768809f/date) so that I can fill a SQL database with every field it contains.

This is a test URL. I have a list of 500 URLs similar to this one that I requested from this website's API, and I intend to apply my Python function to all the URLs in the list once it works.

Any advice to help me extract the information I need from this webpage? Thank you so much!

First I tried BeautifulSoup, but the problem is that the "informations détaillées" section appears only when you click the dropdown button.

I tried several BeautifulSoup snippets, like the following one, but it didn't work:

import requests
from bs4 import BeautifulSoup

def get_metadata_bs4(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    try:
        title = soup.find("h1").text.strip() if soup.find("h1") else "Titre inconnu"

        publisher = soup.select_one("dl dd:nth-of-type(1)").text.strip() if soup.select_one("dl dd:nth-of-type(1)") else "Auteur inconnu"

        date_of_publication = soup.select_one("dl dd:nth-of-type(2)").text.strip() if soup.select_one("dl dd:nth-of-type(2)") else "Date inconnue"

        return {"title": title, "publisher": publisher, "date_of_publication": date_of_publication}

    except Exception as e:
        print(f"Erreur pour {url}: {e}")
        return None

# Test with a single link
url_test = "https://gallica.bnf.fr/ark:/12148/cb42768809f/date"
print(get_metadata_bs4(url_test))
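One way to see why a requests-based approach fails here is to check whether the section's content is present in the raw HTML at all. This helper is my own sketch (not from the original post); pass it the server's raw response and a metadata value you expect inside the detailed section:

```python
def appears_in_static_html(html, needle):
    # True if `needle` occurs in the HTML as served by the server,
    # i.e. before any JavaScript runs. If a value from the
    # "informations détaillées" section is missing here, requests +
    # BeautifulSoup alone cannot see it.
    return needle in html

# Usage sketch (assumes the `requests` package is installed):
# html = requests.get("https://gallica.bnf.fr/ark:/12148/cb42768809f/date").text
# print(appears_in_static_html(html, "F. Rouet"))
```

Note the check must target a value from inside the section (e.g. a publisher name), not the section's label, since the label itself may be in the static page while the content is loaded later.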

So I tried Selenium, but this is my first time using this Python library… I tried to find the correct XPath in the source code and replace "metadata-class" with this XPath in the following code block:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Selenium configuration
chrome_options = Options()
chrome_options.add_argument("--headless")  # Headless mode (no GUI)
driver = webdriver.Chrome(options=chrome_options)

def get_metadata_from_notice(url):
    driver.get(url)
    time.sleep(2)  # Allow time for the page to load

    try:
        # Click the "Informations détaillées" dropdown
        dropdown = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, "//div[contains(text(), 'Informations détaillées')]"))
        )
        dropdown.click()
        time.sleep(2)  # Wait for the content to load after the click
    except Exception as e:
        print(f"⚠️ Erreur lors du clic sur {url} : {e}")
        return None

    try:
        # Extract the metadata once the dropdown is open
        metadata_section = driver.find_element(By.XPATH, "//div[@class='metadata-class']")  # To be replaced with the correct class
        metadata_text = metadata_section.text
        return {"url": url, "metadata": metadata_text}
    except Exception as e:
        print(f"⚠️ Impossible de récupérer les métadonnées pour {url} : {e}")
        return None

# Test on one URL
test_url = "https://gallica.bnf.fr/ark:/12148/cb42768809f/date"
print(get_metadata_from_notice(test_url))

# Close Selenium
driver.quit()

but it keeps giving me results like this one:

⚠️ Impossible de récupérer les métadonnées pour https://gallica.bnf.fr/ark:/12148/cb42768809f/date
⚠️ Erreur sur https://gallica.bnf.fr/ark:/12148/cb452698066/date : Message: 
Stacktrace:
    GetHandleVerifier [0x00007FF7940A02F5+28725]
    (No symbol) [0x00007FF794002AE0]
    (No symbol) [0x00007FF793E9510A]
    (No symbol) [0x00007FF793EE93D2]
    (No symbol) [0x00007FF793EE95FC]
    (No symbol) [0x00007FF793F33407]
    (No symbol) [0x00007FF793F0FFEF]
    (No symbol) [0x00007FF793F30181]
    (No symbol) [0x00007FF793F0FD53]
    (No symbol) [0x00007FF793EDA0E3]
    (No symbol) [0x00007FF793EDB471]
    GetHandleVerifier [0x00007FF7943CF30D+3366989]
    GetHandleVerifier [0x00007FF7943E12F0+3440688]
    GetHandleVerifier [0x00007FF7943D78FD+3401277]
    GetHandleVerifier [0x00007FF79416AAAB+858091]
    (No symbol) [0x00007FF79400E74F]
    (No symbol) [0x00007FF79400A304]
    (No symbol) [0x00007FF79400A49D]
    (No symbol) [0x00007FF793FF8B69]
    BaseThreadInitThunk [0x00007FFC0A7D259D+29]
    RtlUserThreadStart [0x00007FFC0BA0AF38+40]
Asked Feb 7 at 9:15 by Moran Hanane; edited by chitown88.

2 Answers


No need to use Selenium: a simple curl request in a shell will yield the result:

curl https://gallica.bnf.fr/services/ajax/notice/ark:/12148/cb42768809f/date

How did I find this? Simply open your browser's devtools, select the Network tab, and click on "Informations détaillées"; a new GET entry will appear.

As @iliak mentioned, you can get the information with a GET request. You have to insert services/ajax/notice/ into your URLs, then parse the JSON to get the data.
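That URL rewrite can be sketched in Python as follows (the helper name `to_notice_api_url` is mine, not from the answer):

```python
def to_notice_api_url(url):
    # Insert "services/ajax/notice/" between the host and the ark path,
    # turning a notice page URL into the AJAX endpoint the page itself calls.
    return url.replace("gallica.bnf.fr/", "gallica.bnf.fr/services/ajax/notice/", 1)

api_url = to_notice_api_url("https://gallica.bnf.fr/ark:/12148/cb42768809f/date")
print(api_url)
# https://gallica.bnf.fr/services/ajax/notice/ark:/12148/cb42768809f/date
# The response can then be fetched with requests.get(api_url) and parsed.
```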

For Selenium, try the following code. It gets the information and formats the data using pandas.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Selenium configuration
chrome_options = Options()
chrome_options.add_argument("--headless")  # Headless mode (no GUI)
driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 10)

def get_metadata_from_notice(url):
    driver.get(url)

    # Click the dropdown, then wait until the details list is visible
    details = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div#moreInfosRegion")))
    details.click()
    metadata_section = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "dl.noticeDetailsArea")))

    # Pair each <dt> label with the <dd> value that follows it
    titles = metadata_section.find_elements(By.XPATH, "./dt")
    data = []
    for title in titles:
        content = title.find_element(By.XPATH, "./following-sibling::dd[1]").text
        data.append({"Title": title.text, "Content": content})
    return data


# Test on one URL
test_url = "https://gallica.bnf.fr/ark:/12148/cb42768809f/date"
df = pd.DataFrame(get_metadata_from_notice(test_url))
print(df)

# Close Selenium
driver.quit()

OUTPUT:

                 Title                                            Content
0              Title :   Bulletin paroissial (Valence (Drôme), Paroiss...
1              Title :   Bulletin paroissial mensuel de la cathédrale ...
2          Publisher :                                        F. Rouet ()
3   Publication date :                                               1907
4            Subject :   Guerre mondiale (1914-1918) -- Aspect religie...
5       Relationship :     http://catalogue.bnf.fr/ark:/12148/cb42768809f
6           Language :                                             french
7           Language :                                             French
8         Identifier :                        ark:/12148/cb42768809f/date
9             Source :   Bibliothèque nationale de France, département...
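To apply this to the full list of 500 URLs and load the result into the SQL database the question mentions, a loop like the following could work. This is a sketch under my own naming: `fetch` stands for any per-URL scraper (such as `get_metadata_from_notice` above), and sqlite3 is just one possible database backend.

```python
import sqlite3
import pandas as pd

def scrape_all(urls, fetch, db_path="notices.db"):
    # fetch: any callable(url) -> list of {"Title": ..., "Content": ...} dicts,
    # e.g. the get_metadata_from_notice function above.
    rows = []
    for url in urls:
        try:
            for item in fetch(url):
                item["url"] = url  # record which notice each row came from
                rows.append(item)
        except Exception as e:
            print(f"Skipping {url}: {e}")  # one bad URL should not stop the batch
    df = pd.DataFrame(rows)
    conn = sqlite3.connect(db_path)
    df.to_sql("metadata", conn, if_exists="replace", index=False)
    conn.close()
    return df
```

Wrapping each URL in its own try/except matters with 500 pages: a single timeout or missing element then costs one row, not the whole run.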
