
web scraping - Selenium and ChromeDriver Setup Issue in Google Colab: Compatibility Problems with undetected-chromedriver (and n


I am a social scientist with a very limited background in computational methods and Python (I am mostly self-taught). This is my first time posting here, so please bear with me if I misuse any technical terms or my descriptions are verbose.

TL;DR: I want to scrape news articles from a website using Selenium and ChromeDriver in Google Colab. I need undetected-chromedriver to avoid the site's anti-bot detection. But the script I used earlier no longer works, and I haven't been able to fix it for weeks. According to AI, the problem might be "a TypeError related to the executable_path argument, which seems to be a compatibility issue between undetected-chromedriver and Selenium 4.10.0."

Now some details:

I'm trying to scrape news texts from a website that blocks bots, so the Beautiful Soup approach I used for other sites doesn't work there. About a month ago, I finally managed to run a script in Google Colab that successfully scraped some data overnight and stored it as .txt files on my Drive before it stopped. (This happened despite the activated Caffeine extension, so maybe Colab kicked me out at some point? This was my first time using Colab, so I might be wrong.)

I’m using Google Colab with the following packages:

selenium==4.10.0, undetected-chromedriver==3.5.5

# Install required packages
!pip install selenium==4.10.0 undetected-chromedriver

# Download and set up Chrome and ChromeDriver
!wget -q -O /tmp/chrome-linux64.zip https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/115.0.5790.102/linux64/chrome-linux64.zip
!unzip -o /tmp/chrome-linux64.zip -d /tmp
!mv /tmp/chrome-linux64/chrome /usr/local/bin/chrome
!chmod +x /usr/local/bin/chrome
!chrome --version

!wget -q -O /tmp/chromedriver-linux64.zip https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/115.0.5790.102/linux64/chromedriver-linux64.zip
!unzip -o /tmp/chromedriver-linux64.zip -d /tmp
!mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver
!chmod +x /usr/local/bin/chromedriver
!chromedriver --version

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
import time
import csv

def setup_driver():
    options = uc.ChromeOptions()
    options.headless = False  # Set to True for headless mode
    driver = uc.Chrome(options=options)
    return driver

def scrape_news():
    driver = setup_driver()
    try:
        url = "https://www.akparti..tr/haberler/"
        driver.get(url)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'a.news-page__card'))
        )
        print("Page loaded successfully.")
    finally:
        driver.quit()

if __name__ == '__main__':
    scrape_news()

Anyway, after that initial successful run, I've tried to run it again, but it has not worked since, due to persistent compatibility issues between Selenium, ChromeDriver, and undetected-chromedriver.

I've also tried manually installing lower versions of Chrome and ChromeDriver using wget and unzip, but that did not work either (or maybe I'm not doing it properly?).

I've also tried the google-colab-selenium package to simplify the setup, but it failed with the DriverFinder error shown below. I also tried downgrading Selenium to 4.6.0 and 4.10.0 to ensure compatibility with google-colab-selenium, but the issue persists:

Requirement already satisfied: h11<1,>=0.9.0 in /usr/local/lib/python3.11/dist-packages (from wsproto>=0.14->trio-websocket~=0.9->selenium->google-colab-selenium[undetected]) (0.14.0)
Downloading google_colab_selenium-1.0.14-py3-none-any.whl (8.2 kB)
Installing collected packages: google-colab-selenium
Successfully installed google-colab-selenium-1.0.14
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-c1a995ad582d> in <cell line: 0>()
    173 
    174 if __name__ == '__main__':
--> 175     scrape_news()

4 frames
/usr/local/lib/python3.11/dist-packages/google_colab_selenium/colab_selenium_manager.py in <module>
      9 from selenium.webdriver.chrome.service import Service
     10 from selenium.webdriver.chrome.options import Options
---> 11 from selenium.webdriver.common.driver_finder import DriverFinder
     12 
     13 

ModuleNotFoundError: No module named 'selenium.webdriver.common.driver_finder'

---------------------------------------------------------------------------

Is there a way to resolve the ModuleNotFoundError: No module named 'selenium.webdriver.common.driver_finder' error when using google-colab-selenium? Or is there a better way to set up Selenium and ChromeDriver in Google Colab that avoids these compatibility issues altogether? If not, what alternative tools or libraries for web scraping in Colab are more reliable? I am quite stuck, and all suggestions and help are very welcome.

asked Mar 10 at 12:10 by Hedda Gabler

1 Answer

Your issue is due to compatibility problems between Selenium 4.10.0 and undetected-chromedriver. Here’s a reliable way to set up Selenium and ChromeDriver in Google Colab:

1. Install Required Packages

!pip install selenium==4.9.1 undetected-chromedriver==3.4.6

2. Setup Chrome & ChromeDriver

!apt-get update
!apt-get install -y unzip wget
!wget -q -O /tmp/chromedriver.zip https://chromedriver.storage.googleapis.com/114.0.5735.90/chromedriver_linux64.zip
!unzip -o /tmp/chromedriver.zip -d /usr/local/bin/
!chmod +x /usr/local/bin/chromedriver

!wget -q -O /tmp/google-chrome-stable_current_amd64.deb https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
!dpkg -i /tmp/google-chrome-stable_current_amd64.deb
!apt-get -fy install

3. Use undetected-chromedriver

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def setup_driver():
    options = uc.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    
    driver = uc.Chrome(options=options, version_main=114)  # Ensure compatibility
    return driver

def scrape_news():
    driver = setup_driver()
    try:
        url = "https://www.akparti..tr/haberler/"
        driver.get(url)
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'a.news-page__card'))
        )
        print("Page loaded successfully.")
    finally:
        driver.quit()

if __name__ == '__main__':
    scrape_news()

Why This Works

  1. Downgrades Selenium to 4.9.1 (the last version that pairs cleanly with undetected-chromedriver 3.4.6).

  2. Installs ChromeDriver version 114 (compatible with Selenium and undetected-chromedriver).

  3. Uses version_main=114 in uc.Chrome() to ensure compatibility.

  4. Avoids google-colab-selenium, which has module issues.
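As a guard against future breakage of the kind in points 2 and 3, you can sanity-check that Chrome and ChromeDriver share a major version before launching the driver. A minimal sketch, where `major_version` is a hypothetical helper and the sample strings stand in for real `--version` output:

```python
import re

def major_version(version_output: str) -> int:
    """Pull the major version out of output like 'Google Chrome 114.0.5735.90'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version string found in {version_output!r}")
    return int(match.group(1))

# In Colab you would capture real output, e.g.:
#   chrome_out = subprocess.check_output(["google-chrome", "--version"], text=True)
#   driver_out = subprocess.check_output(["chromedriver", "--version"], text=True)
chrome_out = "Google Chrome 114.0.5735.90"   # sample output
driver_out = "ChromeDriver 114.0.5735.90"    # sample output

assert major_version(chrome_out) == major_version(driver_out), \
    "Chrome and ChromeDriver major versions must match"
```

If the assertion fires, pass the Chrome major version explicitly via `version_main=` in `uc.Chrome()` as shown above, or reinstall a matching ChromeDriver.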

Alternatives to Selenium

  • Scrapy (Better for large-scale scraping)

  • Playwright (More reliable anti-bot evasion)

  • Requests + BeautifulSoup (If JS rendering is not needed)
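For the Requests + BeautifulSoup route, the parsing side can be sketched against the `a.news-page__card` selector from the question. This is illustrative only: the sample HTML is made up, and a site that blocks bots will likely also block plain `requests` traffic.

```python
from bs4 import BeautifulSoup

def extract_card_links(html: str) -> list[str]:
    """Collect href values from anchors matching the question's CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.news-page__card") if a.has_attr("href")]

# Made-up HTML standing in for requests.get(url).text:
sample = ('<div><a class="news-page__card" href="/haberler/1">One</a>'
          '<a class="news-page__card" href="/haberler/2">Two</a></div>')
print(extract_card_links(sample))  # ['/haberler/1', '/haberler/2']
```

The same function would work on a fetched page whenever the article links are present in the initial HTML rather than rendered by JavaScript.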

I bet this will fix your issue!
