I want to code a Perl application that would crawl some websites and collect images and links from those webpages. Because most pages use JavaScript to generate their HTML content, I essentially need a client browser with JavaScript support so that I can parse the final HTML as it stands after the JavaScript has generated and/or modified it. What are my options?
If possible, please publish some implementation code or link to some example(s).
5 Answers
There are several options:
- Win32::IE::Mechanize on Windows
- Mozilla::Mechanize
- WWW::Mechanize::Firefox
- WWW::Selenium
- Wight
Options that spring to mind:
You could have Perl use Selenium and have a full-blown browser do the work for you (a minimal WWW::Selenium sketch follows these options).
You can download and compile V8 or another open source JavaScript engine and have Perl call an external program to evaluate the JavaScript.
Perl's LWP modules do not execute JavaScript, so on their own they will only give you the raw HTML as served, before any scripts have run.
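To illustrate the Selenium route mentioned above, here is a rough sketch using WWW::Selenium. It assumes a Selenium server is already running on localhost:4444 and the URL is a placeholder:

use strict;
use warnings;
use WWW::Selenium;

my $sel = WWW::Selenium->new(
    host        => 'localhost',
    port        => 4444,
    browser     => '*firefox',
    browser_url => 'http://example.com/',
);

$sel->start;
$sel->open('http://example.com/');
$sel->wait_for_page_to_load(30_000);

# get_html_source returns the DOM after the browser has run the JavaScript
my $html = $sel->get_html_source;
$sel->stop;

print length($html), " bytes of rendered HTML\n";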
WWW::Scripter with the WWW::Scripter::Plugin::JavaScript and WWW::Scripter::Plugin::Ajax plugins seems like the closest you'll get without using an actual browser (the modules WWW::Selenium, Mozilla::Mechanize or Win32::IE::Mechanize use real browsers).
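As a rough sketch of how that might look (the URL is a placeholder; since WWW::Scripter subclasses WWW::Mechanize, the usual get/links/images calls should apply):

use strict;
use warnings;
use WWW::Scripter;

my $w = WWW::Scripter->new;
$w->use_plugin('JavaScript');   # run inline and linked scripts
$w->use_plugin('Ajax');         # provide XMLHttpRequest to those scripts

$w->get('http://example.com/');

# Collect links and image URLs from the scripted DOM
print $_->url, "\n" for $w->links;
print $_->url, "\n" for $w->images;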
Check the complete working example featured in Scraping pages full of JavaScript. It uses Web::Scraper for HTML processing and Gtk3::WebKit to process dynamic content. However, the latter is quite a PITA to install. If there are not that many pages you need to scrape (< 1000), fetching the post-processed DOM content through PhantomJS is an interesting option. I've written the following script for that purpose:
var page = require('webpage').create(),
    system = require('system'),
    fs = require('fs'),
    address, output;

if (system.args.length < 3 || system.args.length > 5) {
    console.log('Usage: phantomjs --load-images=no html.js URL filename');
    phantom.exit(1);
} else {
    address = system.args[1];
    output = system.args[2];
    page.open(address, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
        } else {
            // page.content is the DOM after PhantomJS has run the page's JavaScript
            fs.write(output, page.content, 'w');
        }
        phantom.exit();
    });
}
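On the Perl side, a minimal way to drive this script (saved as html.js, matching the usage line above) and read the rendered page back might be the following; the URL and output file name are placeholders:

use strict;
use warnings;

my $url    = 'http://example.com/';
my $output = 'page.html';

# Let PhantomJS render the page and dump the post-processed DOM to $output
system('phantomjs', '--load-images=no', 'html.js', $url, $output) == 0
    or die "phantomjs failed: $?";

# Slurp the rendered HTML back in for link/image extraction
open my $fh, '<', $output or die "Cannot open $output: $!";
my $html = do { local $/; <$fh> };
close $fh;

print length($html), " bytes of rendered HTML\n";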
There's something like that on CPAN already: a module called Wight. I haven't tested it yet, though.
WWW::Mechanize::Firefox can be used together with mozrepl, so all JavaScript on the page is executed by a real Firefox instance.
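As a rough sketch (it assumes Firefox is running with the mozrepl extension listening on its default port, and the URL is a placeholder):

use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new;
$mech->get('http://example.com/');

# content() returns the DOM as Firefox sees it, after JavaScript has run
my $html = $mech->content;

# links() follows the usual WWW::Mechanize interface
print $_->url, "\n" for $mech->links;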