I want to do some basic scripting in JavaScript. Basically, I want to download a wikiquote page and scrape it.
What's the best way to do this? How do I get the page? I tried jQuery.get():
$.get('http://en.wikiquote.org/wiki/Last_words', function(data) { console.log(data); })
But the log is simply some error object, and the console displays:
XMLHttpRequest cannot load http://en.wikiquote.org/wiki/Last_words. Origin null is not allowed by Access-Control-Allow-Origin.
GET http://en.wikiquote.org/wiki/Last_words undefined (undefined)
So I guess I'm not taking the correct approach. What should I be doing instead?
Also, once I DO download the file, what tools are available for me to traverse it? XPath? RegEx? Is there a way to generate a DOM model from it and attach jQuery?
An interesting possibility would be to somehow open a tiny pop-up which downloads the page, then run my script to scrape the page and return data. I am aware this sounds a lot like script injection. Is it even possible to do this in a friendly manner?
Asked May 27, 2011 at 15:55 by George Mauer. 3 Answers
Assuming you are limiting yourself to JavaScript running in the browser, and documents that are not on the same host as the page running the script: you can't.
The Same Origin security policy makes this impossible. Without it, a web page could request data from any site the user can access (including LAN sites), with the user's IP address, cookies, and anything else that might be used for authentication. (All your banking are belong to us.)
Wikiquote exposes an API. You can use JSONP to make a request to the API and get the data pre-parsed and ready to go:
document.body.appendChild(document.createElement("script")).src =
    "http://en.wikiquote.org/w/api.php?action=query&titles=Last_words" +
    "&prop=revisions&rvlimit=1&rvprop=content&format=json&callback=handleQuote";

function handleQuote(quote)
{
    // quote is the JSON response from Wikiquote
}
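For illustration, here is one way the handler might dig the revision text out of that response. This sketch assumes the classic MediaWiki JSON layout (query.pages keyed by page id, revision content under the "*" key), so treat the exact shape as an assumption to verify against the live API:

```javascript
// Pull the wiki markup out of a query/revisions API response.
// Assumes the classic MediaWiki JSON shape: response.query.pages is an
// object keyed by numeric page id, and each revision's text is under "*".
function extractWikitext(response) {
  var pages = response.query.pages;
  for (var pageId in pages) {           // iterate because the page id is unknown
    var revisions = pages[pageId].revisions;
    if (revisions && revisions.length) {
      return revisions[0]["*"];         // revision content lives under the "*" key
    }
  }
  return null;                          // no revisions found
}

function handleQuote(quote) {
  console.log(extractWikitext(quote));
}
```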
Note that the response is returned as wiki markup, not HTML. You'll have to do some parsing to get HTML, if that's what you're after. Edit: use action=parse&page=Last_words to get HTML.
You can preview the response in your browser by changing the format argument from json to jsonfm and pasting the URL into the address bar:
Wiki markup:
http://en.wikiquote.org/w/api.php?action=query&titles=Last_words&prop=revisions&rvlimit=1&rvprop=content&format=jsonfm&callback=handleQuote
HTML:
http://en.wikiquote.org/w/api.php?action=parse&page=Last_words&format=jsonfm&callback=handleQuote
Edit: I really only answered half (or less) of your question. As for how to interact with the data once you've got it, jQuery makes it simple. If you pass an HTML string into $(), jQuery constructs the elements for you. Then you can access them via jQuery or DOM methods:
var paragraphs = $(someHTML).find("p");
A simple way to get the HTML from any domain via JavaScript is to make your Ajax request to a local server page that requests the document for you. You could write a generic handler (.ashx) page with something like:
public void ProcessRequest(HttpContext context)
{
    string url = context.Request.QueryString["url"];
    if (Uri.IsWellFormedUriString(url, UriKind.Absolute))
    {
        context.Response.Write(new WebClient().DownloadString(url));
    }
}
And then call it with jQuery:
var url = encodeURIComponent("http://en.wikiquote.org/wiki/Last_words");
$.get("fetch.ashx?url=" + url, function (response)
{
    var $response = $(response);
});
Edit: Newer browsers do support some cross-domain data retrieval through JavaScript by implementing Cross-Origin Resource Sharing (CORS). Firefox and Chrome support CORS via XMLHttpRequest. IE8 and IE9 support CORS with XDomainRequest. The catch is that the server also has to support CORS. In short, the server must include a response header of Access-Control-Allow-Origin: * in order for the client to process the response. And sadly, it appears Wikiquote is not sending that header in its response. Here's a hefty article on the internals of CORS.
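To make that header check concrete, here is a deliberately simplified model of the rule (browsers implement a fuller algorithm with credentials modes and preflights, so this is only an illustration):

```javascript
// Simplified model of the CORS read check: a page may read a cross-origin
// response only if Access-Control-Allow-Origin is "*" or exactly matches
// the requesting page's origin. Missing header means the read is blocked.
function corsAllowsRead(allowOriginHeader, requestingOrigin) {
  if (!allowOriginHeader) {
    return false;               // no header in the response: blocked
  }
  return allowOriginHeader === "*" || allowOriginHeader === requestingOrigin;
}
```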
XMLHttpRequest cannot be used for cross-domain requests. You could load the page in an iframe and try to get the details from there, but I recommend doing this server-side (using a DOM or SAX parser, to answer your other question), since doing it in JavaScript is clearly not very elegant.