javascript - Using phantomjs to crawl a sitemap

I am very new to phantomjs. I have been messing with the following for far too long. I know I am missing something very simple. I have the following sitemap.xml:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
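<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">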
  <url>
    <loc>/</loc>
    <changefreq>always</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>/vehicles</loc>
    <lastmod>2013-01-07</lastmod>
  </url>
</urlset>

Now all I am trying to do is use phantomjs to get the url values from the xml document. I have the following.

page.open("sitemap.xml", function(status) {
    if(status !== "success") {
        console.log("Unable to open sitemap.");
    } else {
        // Stuck here
        console.log(page.content);
    }
});

The contents of the XML file are printed to the screen correctly, but how do I now work with the document as XML? I just need to get the first child of each url node. I have tried feeding the XML document to a DOMParser, but that does not seem right. Your help will be much appreciated.

Also, how do you debug PhantomJS so I can see an object in its full glory? For example, if I console.log an object in Dev Tools, I can expand it and see the key-value pairs. I am guessing the terminal does not offer this luxury?

asked Jan 7, 2013 at 17:17 by TYRONEMICHAEL

4 Answers

PhantomJS allows you to call JavaScript from within the page context. Here is my solution using plain old JavaScript.

This assumes the sitemap looks like this:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2014-07-07T14:09:27+00:00</lastmod>
    <changefreq>always</changefreq>
  </url>
</urlset>

I can get the URLs from the above sitemap using the code below.

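var page = require('webpage').create();
page.open('http://xxxx/static/sitemap/sitemap.xml', function(status) {
    if (status !== 'success') {
        console.log('Unable to open sitemap.');
        phantom.exit(1);
        return;
    }
    // Parse the raw XML text and collect every <loc> element.
    var parser = new DOMParser();
    var xmlDoc = parser.parseFromString(page.content, 'text/xml');
    var loc = xmlDoc.getElementsByTagName('loc');
    console.log(loc.length);
    for (var i = 0; i < loc.length; i++) {
        // textContent holds the URL inside each <loc> element.
        var url = loc[i].textContent;
        console.log(url);
    }
    phantom.exit();
});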
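The loop above only collects the URLs. To actually crawl them, each page has to be opened in turn, because page.open is asynchronous and calling it in a plain for loop on the same page object would start a new load before the previous one finishes. The following is a minimal sketch of one way to do that, not part of the original answer. The names crawlSequentially, openUrl, onPage and onDone are hypothetical; openUrl is passed in so the helper stays independent of PhantomJS.

```javascript
// Visit each URL in order, waiting for one load to finish before
// starting the next.
//
// Assumptions (all names are illustrative):
//   openUrl(url, callback) - starts a load and calls callback(status)
//   onPage(url, status)    - called once per URL after it has loaded
//   onDone()               - called once after the last URL
function crawlSequentially(urls, openUrl, onPage, onDone) {
    var index = 0;

    function next() {
        if (index >= urls.length) {
            onDone();
            return;
        }
        var url = urls[index];
        index += 1;
        openUrl(url, function (status) {
            onPage(url, status);
            next(); // only move on once this page has reported back
        });
    }

    next();
}
```

In a PhantomJS script, openUrl could be function (url, cb) { page.open(url, cb); }, and onDone would be the place to call phantom.exit().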

Use libxmljs to parse your XML string and get the data you want!

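Another idea: you could inject jQuery into the page and parse the XML like this:

page.open("sitemap.xml", function(status) {
    if(status !== "success") {
        console.log("Unable to open sitemap.");
    } else {
        page.injectJs('j-query.js'); // path to jQuery
        // page.evaluate can only return JSON-serializable values,
        // so return the text of each node rather than the jQuery object.
        var output = page.evaluate(function() {
            return $('url > :first-child').map(function() {
                return $(this).text();
            }).get();
        });
        console.log(JSON.stringify(output));
    }
    phantom.exit();
});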

Someone created a test suite for testing XML sitemaps using CasperJS; maybe you can adapt the code to your specific needs.

From the author:

This script will attempt to crawl through a specified sitemap to check children pages for broken urls, images, css, and Javascript. Errors will be recorded to the logfile specified.

Usage:

casperjs sitemap_xml_testing.js --sitemap=<URL TO SITEMAP> --logfile=<LOG FILE NAME>

gmazin automated sitemap testing on Bitbucket
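On the debugging part of the question, which the answers above do not address: the terminal has no expandable object view, but an object can be serialized to indented text before logging it. The following is a minimal sketch using only JSON.stringify, which is plain ES5 and so should be usable in a PhantomJS script as well. The helper name dump and its placeholder strings are hypothetical.

```javascript
// Turn a value into indented JSON text suitable for console.log.
// Functions are replaced by a placeholder, and any object that has
// already been visited (a true cycle or just a repeated reference)
// is replaced by a marker so that JSON.stringify does not throw.
function dump(value, indent) {
    var seen = [];
    return JSON.stringify(value, function (key, val) {
        if (typeof val === 'function') {
            return '[Function]';
        }
        if (val !== null && typeof val === 'object') {
            if (seen.indexOf(val) !== -1) {
                return '[Circular/Repeated]';
            }
            seen.push(val);
        }
        return val;
    }, indent === undefined ? 2 : indent);
}
```

Usage would be console.log(dump(someObject)). Note that messages logged inside page.evaluate only reach the terminal if a page.onConsoleMessage handler is set to forward them.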
