I want to do some basic scripting in JavaScript. Basically, I want to download a wikiquote page and scrape it.
What's the best way to do this? How do I get the page? I tried jQuery.get():
$.get('http://en.wikiquote.org/wiki/Last_words', function(data) { console.log(data); })
But the log is simply some error object, and the console displays:
XMLHttpRequest cannot load http://en.wikiquote.org/wiki/Last_words. Origin null is not allowed by Access-Control-Allow-Origin.
GET http://en.wikiquote.org/wiki/Last_words undefined (undefined)
So I guess I'm not taking the correct approach. What should I be doing instead?
Also, once I DO download the file, what tools are available for me to traverse it? XPath? RegEx? Is there a way to generate a DOM model from it and attach jQuery?
An interesting possibility would be to somehow open a tiny pop-up which downloads the page, then run my script to scrape the page and return data. I am aware this sounds a lot like script injection. Is it even possible to do this in a friendly manner?
Asked May 27, 2011 at 15:55 by George Mauer. 3 Answers
Assuming you are limiting yourself to JavaScript running in the browser, and documents that are not on the same host as the page running the script: you can't.
The Same Origin security policy makes this impossible. Without it, a web page could request data from any site the user can access (including LAN sites), with the user's IP address, cookies, and anything else that might be used for authentication. (All your banking are belong to us.)
Wikiquote exposes an API. You can use JSONP to make a request to the API and get the data pre-parsed and ready to go:
document.body.appendChild(document.createElement("script")).src =
    "http://en.wikiquote.org/w/api.php?action=query&titles=Last_words" +
    "&prop=revisions&rvlimit=1&rvprop=content&format=json&callback=handleQuote";

function handleQuote(quote)
{
    // quote is the JSON response from Wikiquote
}
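For illustration, here is one way the handler might dig the revision text out of that response. This sketch assumes the classic MediaWiki JSON layout (query.pages keyed by page id, revision content under the "*" key), so treat the exact shape as an assumption to verify against the live API:

```javascript
// Pull the wiki markup out of a query/revisions API response.
// Assumes the classic MediaWiki JSON shape: response.query.pages is an
// object keyed by numeric page id, and each revision's text is under "*".
function extractWikitext(response) {
  var pages = response.query.pages;
  for (var pageId in pages) {           // iterate because the page id is unknown
    var revisions = pages[pageId].revisions;
    if (revisions && revisions.length) {
      return revisions[0]["*"];         // revision content lives under the "*" key
    }
  }
  return null;                          // no revisions found
}

function handleQuote(quote) {
  console.log(extractWikitext(quote));
}
```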
Note that the response is returned as wiki markup, not HTML. You'll have to do some parsing to get HTML, if that's what you're after. Edit: use action=parse&page=Last_words to get HTML.
You can preview the response in your browser by changing the format argument from json to jsonfm and pasting the URL into the address bar:
Wiki markup:
http://en.wikiquote.org/w/api.php?action=query&titles=Last_words&prop=revisions&rvlimit=1&rvprop=content&format=jsonfm&callback=handleQuote
HTML:
http://en.wikiquote.org/w/api.php?action=parse&page=Last_words&format=jsonfm&callback=handleQuote
Edit: I really only answered half (or less) of your question. As for how to interact with the data once you've got it, jQuery makes it simple. If you pass an HTML string into $(), jQuery constructs the elements for you. Then you can access them via jQuery or DOM methods:
var paragraphs = $(someHTML).find("p");
A simple way to get the HTML from any domain via JavaScript is to make your Ajax request to a local server page that requests the document for you. You could write a generic handler (.ashx) page with something like:
public void ProcessRequest(HttpContext context)
{
    string url = context.Request.QueryString["url"];
    if (Uri.IsWellFormedUriString(url, UriKind.Absolute))
    {
        context.Response.Write(new WebClient().DownloadString(url));
    }
}
And then call it with jQuery:
var url = encodeURIComponent("http://en.wikiquote.org/wiki/Last_words");
$.get("fetch.ashx?url=" + url, function (response)
{
    var $response = $(response);
});
Edit: Newer browsers do support some cross-domain data retrieval through JavaScript by implementing Cross-Origin Resource Sharing (CORS). Firefox and Chrome support CORS via XMLHttpRequest. IE8 and IE9 support CORS with XDomainRequest. The catch is that the server also has to support CORS. In short, the server must include a response header of Access-Control-Allow-Origin: * in order for the client to process the response. And sadly, it appears Wikiquote is not sending that header in its response. Here's a hefty article on the internals of CORS.
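To make that header check concrete, here is a deliberately simplified model of the rule (browsers implement a fuller algorithm with credentials modes and preflights, so this is only an illustration):

```javascript
// Simplified model of the CORS read check: a page may read a cross-origin
// response only if Access-Control-Allow-Origin is "*" or exactly matches
// the requesting page's origin. Missing header means the read is blocked.
function corsAllowsRead(allowOriginHeader, requestingOrigin) {
  if (!allowOriginHeader) {
    return false;               // no header in the response: blocked
  }
  return allowOriginHeader === "*" || allowOriginHeader === requestingOrigin;
}
```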
XMLHttpRequest cannot be used for cross-domain requests. You could load the page in an iframe and try to get the details from there, but I recommend doing this server-side (using a DOM or SAX parser, to answer your other question), since doing it in JavaScript is clearly not very elegant.