
javascript - How to programmatically capture a web page with forced updates - Stack Overflow


I need to capture a web site and am looking for an appropriate library or program to do this. The website uses JavaScript and pushes updates to the page, and I need to capture these as well as the page itself. I am using curl to capture the page itself, but I don't know how to capture the updates. Given a choice, I would use C++.

Regards


asked Dec 27, 2008 at 15:25 by Howard May

5 Answers


Install Firefox and GreaseMonkey. Have the GM script add DOM event listeners where appropriate to track modifications. You can then use XMLHttpRequest to send the information to a server, or write it to local files with XPCOM file I/O operations.
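A minimal sketch of such a userscript, assuming a hypothetical target site and a local collection endpoint (neither comes from the question), might look like this:

    // ==UserScript==
    // @name       capture-dom-updates
    // @namespace  example
    // @include    http://example.com/*
    // ==/UserScript==
    // (the @include pattern above is a placeholder for the site you want to capture)

    // DOMSubtreeModified can fire many times for one logical update, so wait
    // until the burst of mutations settles before shipping a snapshot.
    var pending = null;
    document.addEventListener('DOMSubtreeModified', function () {
        if (pending) {
            clearTimeout(pending);
        }
        pending = setTimeout(function () {
            // GM_xmlhttpRequest is GreaseMonkey's cross-domain XHR; the collector
            // URL is an assumption - point it at whatever server you control.
            GM_xmlhttpRequest({
                method: 'POST',
                url: 'http://localhost:8080/capture',
                headers: { 'Content-Type': 'text/html' },
                data: document.documentElement.innerHTML
            });
        }, 500);
    }, false);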

With this, you can do what you want in a dozen lines and little to no reverse engineering, whereas what others have advised (screen scraping) will require thousands of lines of code for a JavaScript-heavy site, IMO.

Addenda: this is /not/ a job for C++. And should you do it in C++ anyway, you will end up having to reverse engineer the JS, so you might as well just learn enough JS to use GreaseMonkey in the first place.

If you still want to use C++ and curl, try to figure out what the JavaScript in the page does - I assume it just uses a timer to send an AJAX request and updates the page (although it could be more complicated). Use a tool like Firefox with Firebug (the "Net" panel is what you want) to see what kind of request it is - you'll get:

  • the URL of the request
  • the parameters
  • the returned content (it could be HTML, text, XML or JSON)

With a bit of luck you'll have enough to mimic the behavior in C++ with curl. If you can't make anything out of the gathered data, you'll have to browse through the JavaScript and try to figure out what it is doing (but most of the time page updates are really simple).
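For illustration only, the page-side update logic you would be reverse engineering is often just a timer plus an XHR along these lines (the /updates endpoint and the 'ticker' element are invented for the sketch):

    // Typical polling pattern: every few seconds, fetch a small update
    // endpoint and splice the result into the page.
    setInterval(function () {
        var xhr = new XMLHttpRequest();
        xhr.open('GET', '/updates?since=' + new Date().getTime(), true); // hypothetical endpoint
        xhr.onreadystatechange = function () {
            if (xhr.readyState === 4 && xhr.status === 200) {
                document.getElementById('ticker').innerHTML = xhr.responseText; // hypothetical element
            }
        };
        xhr.send(null);
    }, 5000);

Once Firebug's Net panel shows you the real equivalent of that request, fetching the same URL with curl (from the shell or via libcurl in C++) captures the same payload the page receives.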

The easy way to do this would be to do it inside a browser, e.g. as a Firefox plugin (written in JavaScript) - if this is needed for anything other than a pet project this might be a bit inelegant, but it should be really easy to do:

  • monitor the DOM tree for updates (the HTML DOM Level 2 spec has all kinds of "mutation" events, but I have never used them so I don't know much about them or whether they "work"/are supported - see DOM mutation events). There is even a possibility this kind of stuff would work in GreaseMonkey, which would mean you wouldn't have to make a full Firefox plugin - e.g. Post-processing a page after it renders should get you started (you don't want to track 'load', but something like "DOMSubtreeModified"). If the mutation events don't work, you can always use a timer and compare the HTML contents (see the sketch after this list).
  • or do as Firebug does and monitor the network requests and do something with the results
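A rough sketch of the timer fallback mentioned in the first bullet, assuming you only need to collect changed snapshots from inside the page (or a GreaseMonkey script):

    // Timer fallback: snapshot the page whenever its markup changes.
    var lastHtml = document.body.innerHTML;
    var captured = [];

    setInterval(function () {
        var current = document.body.innerHTML;
        if (current !== lastHtml) {
            captured.push({ time: new Date().getTime(), html: current });
            lastHtml = current;
            // Swap this for an XMLHttpRequest or a file write to persist the snapshot.
            console.log('captured update #' + captured.length);
        }
    }, 1000);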

Take a look at SpiderMonkey.

I've not actually used it in anger so am unsure if it will do what you want. I have come across it used optionally with the Scrapy web-crawling and screen-scraping framework written in Python.

Alternatively, could you reverse-engineer how the JavaScript push updates are carried out, and access them directly? It sounds like you'll need to store these updates and/or apply them to the base HTML page.

If you are looking for static web page scraping, BeautifulSoup (Python) is one of the best and easiest.

If you are looking to scrape JavaScript-rendered tickers or similar content, that cannot be done until the page is rendered, so it is not possible with BeautifulSoup alone. You will have to use a headless browser like Crowbar (from the SIMILE project, built on XULRunner), which renders the JavaScript content; the output of that rendered content can then be fed to the BeautifulSoup scraper as input.

The problem is that your web pages are updating because script code is executing on the page. Using curl isn't going to get you there for that.

Not sure of your exact needs... but you could write a JavaScript-injector bookmarklet that adds a button to any web page and lets you grab the DOM or body HTML manually whenever you want... This is how many of the clip-marking apps work.
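As an illustration of that idea, a bookmarklet along these lines would do it (expanded here for readability; it would be collapsed onto a single javascript: line to save as a bookmark):

    // Adds a "Grab HTML" button to the page; clicking it opens a window with
    // the current body markup, including any updates pushed so far.
    javascript:(function () {
        var btn = document.createElement('button');
        btn.textContent = 'Grab HTML';
        btn.style.cssText = 'position:fixed;top:10px;right:10px;z-index:99999;';
        btn.onclick = function () {
            var w = window.open('', '_blank');
            w.document.write('<pre>' +
                document.body.innerHTML.replace(/&/g, '&amp;').replace(/</g, '&lt;') +
                '</pre>');
            w.document.close();
        };
        document.body.appendChild(btn);
    })();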

If you need something that automatically captures updates as they occur - like a movie - then you're going to need something more involved...
