Getting Final HTML with Javascript rendered Java as String - Stack Overflow

I want to scrape data from an HTML page, but its reviews are loaded by JavaScript. A normal URL fetch in Java returns only the original HTML, without the JavaScript executed. I want the final page, after the JavaScript has run.

Example: http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp

This page has comments in a Facebook plugin, which are fetched by JavaScript.

The same applies to this page: http://www.imdb.com/title/tt0848228/reviews

What should I do?

asked Jun 3, 2012 at 17:21 by KillerTheLord, edited Jun 3, 2012 at 17:25 by Pointy
  • Your only real option for doing things like that in general is to harness a web browser as a component of your own software. Have the browser fetch the page and simulate whatever interactions are necessary for the JavaScript to do what it does, then examine the DOM. – Pointy, commented Jun 3, 2012 at 17:27
  • There should be a way to use the Facebook API to fetch the comments on that post as well, together with the rest of the page contents. – Fabrício Matté, commented Jun 3, 2012 at 17:30

3 Answers

Use phantomjs: http://phantomjs

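var page = require('webpage').create();
var url = "http://www.glamsham.com/movies/reviews/rowdy-rathore-movie-review-cheers-for-rowdy-akki-051207.asp";
page.open(url, function (status) {
    // Give the page's JavaScript (e.g. the Facebook plugin) 10 seconds to run
    setTimeout(function () {
        // Where you want to save it
        page.render("screenshot.png");
        // You can access its content using jQuery (if the page loads it).
        // Return strings: page.evaluate can only pass back serializable values.
        var fbComments = page.evaluate(function () {
            return $(".fb-comments iframe").contents().find(".postContainer")
                .map(function () { return this.outerHTML; }).get();
        });
        console.log((fbComments || []).join("\n"));
        phantom.exit(); // without this the PhantomJS process never terminates
    }, 10000);
});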

You have to run PhantomJS with the --web-security=no option to allow cross-domain interaction (i.e. for the Facebook iframe).
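Since the question is about getting the result into Java as a String, here is a minimal sketch of launching PhantomJS from Java with the option above and capturing its standard output. It assumes that a `phantomjs` binary is on the PATH and that a hypothetical script `render.js` prints the rendered HTML (for example `page.content`) to standard output. The class and method names are illustrative only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class PhantomRunner {

    /** Builds the command line. Assumes "phantomjs" is on the PATH; scriptPath is a hypothetical script. */
    static List<String> buildCommand(String scriptPath, String url) {
        return List.of("phantomjs", "--web-security=no", scriptPath, url);
    }

    /** Runs the command and returns everything it wrote to standard output as a String. */
    static String runAndCapture(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show PhantomJS errors on our stderr
                .start();
        String output;
        try (InputStream stdout = process.getInputStream()) {
            output = new String(stdout.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("phantomjs exited with code " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the URL to render; the script receives it as its first argument
        String html = runAndCapture(buildCommand("render.js", args[0]));
        System.out.println(html.length() + " characters of rendered HTML");
    }
}
```

Inside the script, the URL passed on the command line is available as `require('system').args[1]`.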

To communicate with other applications from PhantomJS, you can use a web server or make a POST request: https://github.com/ariya/phantomjs/blob/master/examples/post.js
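On the Java side, the receiving end of such a POST request could look roughly like the sketch below, which uses the JDK's built-in `com.sun.net.httpserver.HttpServer`. The port 8080, the `/rendered` path and all class and method names are assumptions made for illustration. The PhantomJS script would have to POST the rendered HTML to that address.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

final class RenderedPageReceiver {

    /** Reads a whole request body as a UTF-8 String. */
    static String readBody(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Starts a local server; the future completes with the body of the first POST to /rendered. */
    static HttpServer start(int port, CompletableFuture<String> received) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        server.createContext("/rendered", exchange -> {
            try {
                if (!"POST".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(405, -1); // method not allowed, no body
                    return;
                }
                received.complete(readBody(exchange.getRequestBody()));
                exchange.sendResponseHeaders(204, -1); // accepted, no response body
            } finally {
                exchange.close();
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> received = new CompletableFuture<>();
        HttpServer server = start(8080, received); // port 8080 is an arbitrary choice
        String html = received.get();              // blocks until PhantomJS posts the page
        System.out.println(html.length() + " characters of rendered HTML received");
        server.stop(0);
    }
}
```

A future is used here so that the calling code can simply block until the page arrives. A real application would probably want a timeout on `received.get()`.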

You can use HtmlUnit, a Java-based "GUI-less browser". It loads the page the way a web browser does, executing its JavaScript, so you can easily get the final rendered output of any page. You can disable this behaviour, though.

UPDATE: You asked for an example. You don't have to do anything extra:

Example:

WebClient webClient = new WebClient();
HtmlPage myPage = ((HtmlPage) webClient.getPage(myUrl));

UPDATE 2: You can get iframe as follows:

HtmlPage myFrame = (HtmlPage) myPage.getFrameByName(myIframeName).getEnclosedPage();

Please read the HtmlUnit documentation. When it comes to getting page content, there is very little you can't do in HtmlUnit.

A simple way to solve this problem is HtmlUnit, a Java API. I think it can help you access the content produced by the executed JavaScript as plain HTML.

WebClient webClient = new WebClient();
HtmlPage myPage = (HtmlPage) webClient.getPage(new URL("YourURL"));
System.out.println(myPage.getVisibleText());