I'm trying to write a Python script that parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without selenium's webdriver, which opens a browser that runs the scripts needed to display the website properly.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda./asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which prints the contents, including the element of interest: <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
Edit: I've previously tried urllib and, in bash, wget, both of which lack the required JavaScript interpretation.
- I'm planning to create a small Python script. – boolean.is.null Commented Oct 13, 2015 at 0:30
- Ok, I'm working on it :) I'll post my answer in a bit. Just to make sure I got it right, You need the price element, right ? – Pedro Lobito Commented Oct 13, 2015 at 0:32
- You want to hide the browser? Duplicate of stackoverflow./questions/5370762/… – RobertB Commented Oct 13, 2015 at 0:36
- In the meanwhile, you can take a look at crummy./software/BeautifulSoup/bs4/doc; to install, use pip install BeautifulSoup4 – Pedro Lobito Commented Oct 13, 2015 at 0:51
- You can only parse that page with a browser. The page doesn't display anything if javascript isn't enabled. Selenium is the way to go. – Pedro Lobito Commented Oct 13, 2015 at 0:57
2 Answers
As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply fetch and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches its data from an internal API in a structured fashion. Based on the URL, the product number appears to be 910000800509. If you look at the network tab in Chrome dev tools (or your browser's equivalent), you'll see that a GET request is made to the following URL: http://groceries.asda./api/items/view?itemid=910000800509.
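You can make the request with just the requests module (its r.json() method decodes the response, so the json module isn't needed):
import requests

# Internal API endpoint seen in the browser's network tab
url = 'http://groceries.asda./api/items/view?itemid=910000800509'
r = requests.get(url)
# The response is JSON; the price sits on the first entry of 'items'
price = r.json()['items'][0]['price']
print(price)
£13.00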
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
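If you'd rather not depend on requests, the same call can be sketched with only the standard library (urllib.request and json, swapped in here for requests). This is a rough sketch, not a tested client: the endpoint string is copied as it appears above, the helper names `extract_price` and `fetch_price` are made up for illustration, and the JSON shape (`items` list with a `price` key) is assumed from the snippet above.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied as it appears in the answer above.
API_URL = 'http://groceries.asda./api/items/view'


def extract_price(payload):
    """Return the first item's price, or None if the expected keys are missing.

    Assumes the shape used above: {'items': [{'price': ...}, ...]}.
    """
    items = payload.get('items') or []
    if not items:
        return None
    return items[0].get('price')


def fetch_price(item_id, timeout=10):
    """Request the item JSON and return its price (hypothetical helper)."""
    query = urllib.parse.urlencode({'itemid': item_id})
    with urllib.request.urlopen(f'{API_URL}?{query}', timeout=timeout) as resp:
        payload = json.loads(resp.read().decode('utf-8'))
    return extract_price(payload)
```

Keeping the key lookup in its own function means a changed or empty response gives None instead of a KeyError or IndexError, and the parsing can be checked without touching the network.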
How could I get the element of interest without the browser opening, or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda./asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled. Based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate this, is selenium.
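For the "without the browser opening" part of the question, one option is to run Firefox headless through selenium. The sketch below assumes a recent selenium with Firefox and its driver installed. The function names are hypothetical, and the regular expression simply targets the span quoted in the question. The page may also need an explicit wait before the price is rendered, which this sketch does not attempt.

```python
import re

# Matches the span shown in the question; the class name is taken from there.
PRICE_RE = re.compile(r'<span class="prod-price-inner">([^<]*)</span>')


def extract_price_from_html(page_source):
    """Return the text of the first price span, or None if it is absent."""
    match = PRICE_RE.search(page_source)
    return match.group(1).strip() if match else None


def fetch_page_source(url):
    """Load the page in headless Firefox and return the rendered HTML."""
    # Imported here so the parsing helper works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument('-headless')  # run Firefox without opening a window
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # always shut the browser down, even on errors
```

Usage would be along the lines of `print(extract_price_from_html(fetch_page_source(url)))`, with `url` set to the product page from the question.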