I want to scrape anchor links with class="_1UoZlX" from the search results on this particular page - https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4io
When I created a soup from the page I realised that the search results are being rendered using React JS and hence I can't find them in the page source (or in the soup).
Here's my code
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

listUrls = ['https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof']
PHANTOMJS_PATH = './phantomjs'
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
urls = []

for url in listUrls:
    browser.get(url)
    wait = WebDriverWait(browser, 20)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
    soup = BeautifulSoup(browser.page_source, "html.parser")
    results = soup.findAll('a', {'class': "_1UoZlX"})
    for result in results:
        link = result["href"]
        print link
        urls.append(link)

print urls
This is the error I'm getting.
Traceback (most recent call last):
  File "fetch_urls.py", line 19, in <module>
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "_1UoZlX")))
  File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Screenshot: available via screen
Someone mentioned in this answer that there is a way to use Selenium to process the JavaScript on a page. Can someone elaborate on that? I did some googling but couldn't find an approach that works for this particular case.
2 Answers
There is no problem with your code; the problem is the website you are scraping. For some reason it never stops loading, which prevents the page from being parsed and the rest of your code from running.
I tried with wikipedia to confirm the same:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

listUrls = ["https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"]
# browser = webdriver.PhantomJS('/usr/local/bin/phantomjs')
browser = webdriver.Chrome("./chromedriver")
urls = []

for url in listUrls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    results = soup.findAll('a', {'class': "mw-redirect"})
    for result in results:
        link = result["href"]
        urls.append(link)

print urls
Outputs:
[u'/wiki/List_of_states_and_territories_of_India_by_area', u'/wiki/List_of_Indian_states_by_GDP_per_capita', u'/wiki/Constitutional_republic', u'/wiki/States_and_territories_of_India', u'/wiki/National_Capital_Territory_of_Delhi', u'/wiki/States_Reorganisation_Act', u'/wiki/High_Courts_of_India', u'/wiki/Delhi_NCT', u'/wiki/Bengaluru', u'/wiki/Madras', u'/wiki/Andhra_Pradesh_Capital_City', u'/wiki/States_and_territories_of_India', u'/wiki/Jammu_(city)']
P.S. I'm using the Chrome driver in order to run the script against the real Chrome browser for debugging purposes. Download the Chrome driver from https://chromedriver.storage.googleapis.com/index.html?path=2.27/
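If the Flipkart page really never stops loading, one workaround is to cap how long Selenium waits for the load and then parse whatever HTML has been rendered so far. A minimal sketch (not part of the original code; the 20-second timeout is an arbitrary choice):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

url = 'https://www.flipkart.com/search?as=on&as-pos=1_1_ic_sam&as-show=on&otracker=start&page=6&q=samsung+mobiles&sid=tyy%2F4iof'
browser = webdriver.PhantomJS('./phantomjs')
browser.set_page_load_timeout(20)  # stop waiting for the page after 20 seconds

try:
    browser.get(url)
except TimeoutException:
    # The load timed out; cancel any outstanding requests and work with
    # whatever the browser has rendered so far.
    browser.execute_script("window.stop();")

soup = BeautifulSoup(browser.page_source, "html.parser")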
Selenium will render the page, including the JavaScript, and your code is working properly: it waits for the element to be generated. In your case, Selenium never finds that CSS element because the URL you gave does not render the result page. Instead, it generates the following error page:
https://i.sstatic.net/8C6BW.jpg
This page does not contain the CSS class your code is waiting for. Try the Firefox web driver to see what is happening.
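To actually see what the headless browser rendered when the wait times out, you can catch the TimeoutException and save a screenshot instead of letting the script crash. A minimal sketch, meant to replace the wait line inside the question's loop (the file name is arbitrary; note that a CSS class selector needs a leading dot, so "._1UoZlX" rather than "_1UoZlX"):

from selenium.common.exceptions import TimeoutException

try:
    # "._1UoZlX" (with the leading dot) selects by class; a bare
    # "_1UoZlX" would be treated as a tag name and never match.
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "._1UoZlX")))
except TimeoutException:
    # Dump whatever the browser rendered, e.g. an error or captcha
    # page served instead of the search results.
    browser.save_screenshot("debug.png")
    print browser.title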