
javascript - How to get the source code of webbrowser with python - Stack Overflow


I am writing a spider with Scrapy; however, I have come across some websites which are rendered with JS, so urllib2.open_url does not work. I have found that I can open the browser with webbrowser.open_new(url), but I did not find how to get the source code of the page with webbrowser. Is there any way to do this with webbrowser, or is there any other solution for dealing with JS sites without webbrowser?


asked Jan 11, 2013 at 2:56 by user806135 · edited Jan 11, 2013 at 3:01 by valentinas
  • A webbrowser does not store the markup of a page, it holds a DOM. – Bergi Commented Jan 11, 2013 at 3:07

4 Answers


You can use a scraper with a WebKit engine; several are available out there.

One of them is dryscrape.

Example:

import dryscrape

search_term = 'dryscrape'

# set up a web scraping session
sess = dryscrape.Session(base_url='http://google.com')

# we don't need images
sess.set_attribute('auto_load_images', False)

# visit the homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()

# extract all links
for link in sess.xpath('//a[@href]'):
    print(link['href'])

# save a screenshot of the web page
sess.render('google.png')
print("Screenshot written to 'google.png'")

See more info at:

https://github.com/niklasb/dryscrape
https://dryscrape.readthedocs.org/en/latest/index.html

If you need a full JS engine, there are a number of ways you can drive WebKit from Python. Until recently, this sort of thing was done with Selenium, which drives an entire browser.

More recently there are newer and simpler ways to run a WebKit engine (which includes a JavaScript engine) from Python. See this SO question: Headless Browser for Python (Javascript support REQUIRED!)

It references this blog post as an example: Scraping Javascript Webpages with Webkit. It looks to do more or less just what you need.
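As a sketch of the Selenium route mentioned above (the helper name `fetch_rendered_html` is illustrative; this assumes the `selenium` package and a matching browser driver are installed):

```python
# Hedged sketch: drive a real browser with Selenium and return the DOM
# after JavaScript has executed. Assumes the `selenium` package plus a
# browser driver (e.g. geckodriver for Firefox) are installed.

def fetch_rendered_html(url):
    """Load `url` in Firefox and return the rendered DOM as HTML."""
    from selenium import webdriver  # lazy import: optional dependency

    driver = webdriver.Firefox()    # or webdriver.Chrome()
    try:
        driver.get(url)
        # page_source is the serialized DOM *after* scripts have run,
        # unlike the raw bytes urllib2.urlopen() would give you
        return driver.page_source
    finally:
        driver.quit()

if __name__ == "__main__":
    print(len(fetch_rendered_html("http://example.com")))
```

Because Selenium launches a full browser, it is heavier than a headless WebKit wrapper, but it handles any page a real user could load.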

I have been trying to find an answer to the same problem for a few days now.

I suggest you try the Qt framework with WebKit. There are two Python bindings: one is PyQt and the other is PySide. You can use them directly if you want to create something more complex or if you want 100% control over your code.
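A minimal sketch of that approach with PyQt4's QtWebKit (this mirrors the widely circulated QWebPage recipe; the class and function names here are illustrative, and PyQt4 with QtWebKit must be installed):

```python
# Sketch: render a JS-heavy page with PyQt4's QtWebKit and grab the DOM.
# Names below are illustrative; assumes PyQt4 (with QtWebKit) is installed.
import sys

def render_page(url):
    """Load `url` in a headless QWebPage and return the HTML after JS runs."""
    # lazy imports: PyQt4 is an optional, heavyweight dependency
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    class _Render(QWebPage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._load_finished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()  # blocks until _load_finished calls quit()

        def _load_finished(self, result):
            # the frame now holds the DOM after JavaScript has executed
            self.html = self.mainFrame().toHtml()
            self.app.quit()

    return _Render(url).html

if __name__ == "__main__":
    print(render_page("http://example.com")[:200])
```

The key idea is to wait for `loadFinished` before serializing the frame, so any DOM changes made by page scripts are included.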

For trivial stuff like executing JavaScript in a browser environment you can use Ghost.py. It has some sort of documentation and some problems when using it from the command line, but otherwise it's just great.
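A Ghost.py session might look roughly like this (based on the early Ghost.py API, which has changed between versions; treat the exact calls as assumptions and check the current documentation):

```python
# Hedged sketch of the early (circa 2013) Ghost.py API; the library's
# interface has changed across versions, so verify against its docs.

def get_page_with_ghost(url):
    from ghost import Ghost  # lazy import: optional dependency

    ghost = Ghost()
    page, extra_resources = ghost.open(url)  # load the page, run its JS
    result, resources = ghost.evaluate("document.title")  # run arbitrary JS
    return ghost.content                     # the rendered HTML

if __name__ == "__main__":
    print(get_page_with_ghost("http://example.com")[:200])
```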

If you need to process JavaScript you'll need to implement a JavaScript engine. This makes your spider much more complex, mainly because JavaScript almost always modifies the DOM based on time or an action taken by the user. This makes it extremely challenging to process JS in a crawler. If you really need to process JavaScript in your spider you can have a look at the JavaScript engine by Mozilla: https://developer.mozilla.org/en/docs/SpiderMonkey
