
java - How to scrape HTTPS javascript web pages - Stack Overflow


I am trying to monitor day-to-day prices from an online catalogue. The site uses HTTPS and generates the catalogue pages with JavaScript. How can I interface with the site and make it generate the pages I need?

I have done this with other sites where the HTML can easily be accessed, and I have no problem parsing the HTML once it is generated.

I only know Python and Java.

Thanks in advance.

asked Apr 6, 2011 at 5:41 by jsj

3 Answers


Take a look at HtmlUnit, a headless browser written in Java that can be fully controlled by your code. A simple example can be seen here: http://htmlunit.sourceforge.net/gettingStarted.html

(obligatory warning: by screen-scraping the site, you may be breaking its ToS and possibly opening yourself to lawsuits; check whether you are allowed to do it before you start)

If they've created a Web API that their JavaScript interfaces with, you might be able to scrape that directly, rather than trying to go the HTML route.
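As a rough illustration of going straight to the API, here is a minimal Python sketch using only the standard library. The endpoint `CATALOGUE_URL` and the JSON shape (`{"items": [{"name": ..., "price": ...}]}`) are assumptions made up for the example; the real URL and structure would have to be read off the requests the site's JavaScript makes, for instance in the browser's network inspector.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL the site's JavaScript actually calls.
CATALOGUE_URL = "https://example.com/api/catalogue?page=1"


def fetch_json(url, timeout=10):
    """Download a JSON document over HTTPS and decode it."""
    request = urllib.request.Request(url, headers={"User-Agent": "price-monitor/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_prices(payload):
    """Map item name to price.

    Assumes the (hypothetical) shape {"items": [{"name": ..., "price": ...}]}.
    """
    return {item["name"]: float(item["price"]) for item in payload.get("items", [])}
```

Running `extract_prices(fetch_json(CATALOGUE_URL))` once a day and storing the result would cover the monitoring part. `urllib` handles HTTPS on its own, so the protocol is not an obstacle here.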

If they've obfuscated it or that option isn't available for some other reason, you'll basically need a Web browser to evaluate the JavaScript and then scrape the browser's DOM. Perhaps write a browser plugin?
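One way to get a browser to evaluate the JavaScript without writing a plugin is to drive one from a script. The sketch below uses Selenium with headless Chrome, which is a swapped-in tool not mentioned in this thread, and it assumes Selenium and a matching Chrome driver are installed. The `price` class name is also an assumption; the real catalogue markup will differ.

```python
from html.parser import HTMLParser


def render_page(url):
    """Return the HTML of `url` after a real browser has run the page's JavaScript."""
    # Imported lazily so the parsing code below works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


class PriceParser(HTMLParser):
    """Collect text found directly inside elements whose class list includes 'price'.

    Simplification: assumes price elements contain plain text with no nested tags.
    """

    def __init__(self):
        super().__init__()
        self.prices = []
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def parse_prices(html):
    """Return the price strings found in an HTML document."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices
```

`parse_prices(render_page(url))` then yields the rendered prices. If the page fills in its content after the initial load, the script would probably need to wait for the relevant elements before reading `page_source`.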

I use WebKit through its Python bindings for scraping JavaScript content. See here for example.
