Is there a way to retrieve the fully rendered html from a page with javascript post rendering? If I use curl, it simply retrieves the base html, but lacks the post rendering of iframes, javascript processing etc.
What would be the best way to accomplish this?
You will need a whole browser setup. See possible duplicate stackoverflow.com/q/125177/1048572 – Bergi
6 Answers
As no-one else has answered (except the comment above, but I'll come to that later) I'll try to help as much as possible.
There's no "simple" answer. PHP can't process javascript/navigate the DOM natively, so you need something that can.
Your options as I see it:
If you are after a screen grab (which is what I'm hoping, as you also want Flash to load), I suggest you use one of the commercial APIs that are out there for doing this. You can find some in this list http://www.programmableweb.com/apitag/?q=thumbnail, for example http://www.programmableweb.com/api/convertapi-web2image
Otherwise you need to run something yourself that can handle JavaScript and the DOM on, or connected to, your server. For this, you'd need an automated browser that you can run server-side and get the information you need (see the sketch at the end of this answer). Follow the list in Bergi's comment above and you'd need to test a suitable solution - the main one, Selenium, is great for "unit testing" on a known website, but I'm not sure how I'd script it to handle random sites, for example. As you would (presumably) only have one "automated browser" and you don't know how long each page will take to load, you'd need to queue the requests and handle one at a time. You'd also need to ensure pop-up alert()s are handled, all the third party libraries (you say you want Flash?!) are installed, handle redirects, timeouts and potential memory hogs (if running this non-stop, you'll periodically want to kill your browser and restart it to clean out the memory!). Also handle virus attacks, pop-up windows and requests to close the browser completely.
Thirdly, VB has a web-browser component. I used it for a project a long time ago to do something similar, but on a known site. Whether it's possible with .NET (to me, it's a huge security risk), and how you program for unknowns (e.g. pop-ups and Flash) I have no idea. But if you're desperate, an adventurous .NET developer may be able to suggest more.
In summary - if you want more than a screen grab and can't choose option 1, good luck ;)
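To give a flavour of option 2, here is a minimal sketch using Selenium's .NET bindings with a headless Chrome instance - the headless flag, the NuGet packages and the class/method names are my own assumptions for illustration, not something from the question:

using OpenQA.Selenium.Chrome;

class RenderedHtmlFetcher
{
    // Sketch only: assumes the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver
    // NuGet packages and a matching Chrome install on the server.
    public static string GetRenderedHtml(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");   // no visible window, suitable for a server

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);
            // PageSource is the DOM after scripts have run, not the raw HTTP response.
            return driver.PageSource;
        }
    }
}

In practice you'd still want an explicit wait for whatever late-loading content you care about, plus the queueing and timeout handling described above.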
Use a "terminal" browser like w3m or lynx. Even if the site you want to access needs login, this is possible, for example:
curl [-u login:pass] http://www.a_page.com | w3m -T text/html -dump
or
curl [-u login:pass] http://www.a_page.com | lynx -stdin -dump
This should give you the whole html with all frames etc.
If you're looking for something scriptable with no GUI you could use a headless browser. I've used PhantomJS for similar tasks.
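For illustration, one way to drive a local PhantomJS binary from .NET is to write a small render script to disk and read its stdout. Everything below (the class name, the temp-file approach, the assumption that phantomjs is on the PATH) is a sketch, not an official recipe:

using System.Diagnostics;
using System.IO;

class PhantomRenderer
{
    // PhantomJS script: loads the URL passed on the command line and prints the
    // rendered DOM to stdout once the page has finished loading.
    const string RenderScript = @"
var page = require('webpage').create();
var url = require('system').args[1];
page.open(url, function (status) {
    console.log(page.content);
    phantom.exit(status === 'success' ? 0 : 1);
});";

    public static string Render(string url)
    {
        var scriptPath = Path.Combine(Path.GetTempPath(), "render.js");
        File.WriteAllText(scriptPath, RenderScript);

        var psi = new ProcessStartInfo("phantomjs", "\"" + scriptPath + "\" \"" + url + "\"")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (var process = Process.Start(psi))
        {
            var html = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return html;
        }
    }
}

As with any headless setup, you'd still want a timeout around the process call for pages that never finish loading.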
If still relevant, I found that the easiest way to do this is using PhantomJS as a service:
public string GetPagePhantomJs(string url)
{
    using (var client = new System.Net.Http.HttpClient())
    {
        client.DefaultRequestHeaders.ExpectContinue = false;
        // Ask the service to render the page and return the result as plain text.
        var pageRequestJson = new System.Net.Http.StringContent(@"{'url':'" + url + "','renderType':'plainText','outputAsJson':false }");
        var response = client.PostAsync("https://PhantomJsCloud.com/api/browser/v2/SECRET_KEY/", pageRequestJson).Result;
        return response.Content.ReadAsStringAsync().Result;
    }
}
It is really simple; when subscribing to the service there is a free plan that allows 500 pages/day. Replace SECRET_KEY with the key you receive when you sign up.
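For example, calling the method above might look like this (the URL and output file are just placeholders):

// Hypothetical usage of GetPagePhantomJs; URL and file name are placeholders.
string rendered = GetPagePhantomJs("http://www.example.com");
System.IO.File.WriteAllText("rendered.txt", rendered);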
Look at this command line tool: IECapt.exe
It has no javascript support, but lynx was useful for me in a situation where I needed to do processing of data from a webpage. This way I got the (plaintext) rendering and didn't have to filter through the raw html tags as with curl.
lynx -nonumbers -dump -width=9999999 ${url} | grep ... et cetera.