How does one parse HTML documents which make heavy use of JavaScript? I know there are a few libraries in Python which can parse static XML/HTML files. I'm basically looking for a programme or library (or even a Firefox plugin) which reads HTML+JavaScript, executes the JavaScript, and outputs HTML without JavaScript that looks identical when displayed in a browser.
As a simple example
<a href="javascript:web_link(34, true);">link</a>
should be replaced by the appropriate value the javascript function returns, e.g.
<a href="http://www.example.com">link</a>
A more complex example would be a saved Facebook HTML page which is littered with loads of JavaScript code.
Probably related to "How to 'execute' HTML+Javascript page with Node.js", but do I really need Node.js and JSDOM? Also slightly related is "Python library for rendering HTML and javascript", but I'm not interested in rendering, just the pure HTML output.
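To make the limitation of static parsers concrete, here is a minimal sketch (Python 3, standard library only) that locates `javascript:` links in a document. The class and function names are made up for illustration. It can only report where such links are; working out what `web_link(34, true)` returns still needs something that actually runs the script.

```python
from html.parser import HTMLParser


class JavascriptLinkFinder(HTMLParser):
    """Collect href values of <a> tags that are javascript: pseudo-URLs."""

    def __init__(self):
        super().__init__()
        self.js_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.strip().lower().startswith("javascript:"):
                self.js_links.append(value)


def find_javascript_links(html_text):
    """Return the javascript: hrefs found in html_text, in document order."""
    finder = JavascriptLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.js_links


if __name__ == "__main__":
    sample = '<a href="javascript:web_link(34, true);">link</a>'
    print(find_javascript_links(sample))
```

A static pass like this is still useful as a check: running it on the output of whatever tool executes the page shows which links were left unresolved.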
asked Aug 15, 2011 at 10:53 by tomtom (edited May 23, 2017 at 9:59)
- Either get a JavaScript runtime and sort something out with it, or analyse the code and work out what it's going to end up (strongly per-site configuration). – Chris Morgan Commented Aug 17, 2011 at 14:46
- stackoverflow.com/questions/19465510/… – gliptak Commented Oct 31, 2013 at 1:30
3 Answers
You can use Selenium with Python, as detailed here.
Example:
# Python 2 example: drives a Selenium server over XML-RPC
import os
import xmlrpclib

# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpclib.ServerProxy(server_url)

# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)

# Launch the browser (Windows batch file)
os.system('start run_firefox.bat')

# Open the page, then check and fill in the search form
print app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/')
print app.verifyTitle('Amazon.com: Welcome')
print app.verifySelected('url', 'All Products')
print app.select('url', 'Books')
print app.verifySelected('url', 'Books')
print app.verifyValue('field-keywords', '')
print app.type('field-keywords', 'Python Cookbook')

# Submit the search and verify the results page
print app.clickAndWait('Go')
print app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook')
print app.verifyTextPresent('Python Cookbook', '')
print app.verifyTextPresent('Alex Martelli, David Ascher', '')
print app.testComplete()
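A browser driven this way hands back markup in which the scripts have already run, but the `<script>` elements themselves are still there. As a rough sketch of the clean-up step the question asks for (Python 3, standard library only, names made up for illustration), the following re-emits HTML with `<script>` elements removed. It is an assumption about what "without JavaScript" should mean: inline handlers such as `onclick` and any remaining `javascript:` hrefs are left untouched.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML as parsed, dropping <script> elements and their content."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; are passed through as written
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            # get_starttag_text() returns the tag exactly as it appeared in the source
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closing <script/> is simply dropped
        if tag != "script":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def strip_scripts(html_text):
    """Return html_text with every <script> element removed."""
    stripper = ScriptStripper()
    stripper.feed(html_text)
    stripper.close()
    return "".join(stripper.parts)
```

The parser lowercases end-tag names and does not repair broken markup, so the output is only as well-formed as the input.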
From the Mozilla Gecko FAQ:
Q. Can you invoke the Gecko engine from a Unix shell script? Could you send it HTML and get back a web page that might be sent to the printer?
A. Not really supported; you can probably get something close to what you want by writing your own application using Gecko's embedding APIs, though. Note that it's currently not possible to print without a widget on the screen to render to.
Embedding Gecko in a program that outputs what you want may be way too heavy, but at least your output will be as good as it gets.
PhantomJS can be loaded using Selenium:
$ ipython
In [1]: from selenium import webdriver
In [2]: browser=webdriver.PhantomJS()
In [3]: browser.get('http://seleniumhq/')
In [4]: browser.title
Out[4]: u'Selenium - Web Browser Automation'
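The session above only reads the title, but the same driver object can return the whole page after the scripts have run. Below is a minimal Python 3 sketch of that, with a hypothetical helper name `rendered_html`. Selenium 4 no longer includes the PhantomJS driver, so the command-line part assumes Firefox with geckodriver installed instead; any WebDriver object should work the same way. Content that scripts add after the initial page load may need an explicit wait, which is not shown here.

```python
import sys


def rendered_html(driver, url):
    """Load url in an already-started WebDriver session and return the page
    source as the browser reports it after loading."""
    driver.get(url)
    return driver.page_source


def main(url):
    # Imported here so rendered_html() can be reused without Selenium installed.
    from selenium import webdriver

    browser = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        print(rendered_html(browser, url))
    finally:
        browser.quit()  # always close the browser, even if loading fails


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run as a script, it takes the page URL as its only argument and prints the resulting HTML; a saved page can be passed as a `file://` URL.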