javascript - Using XPath in node.js - Stack Overflow

I am building a little document parser in node.js. To test it, I have a raw HTML file, which is normally downloaded from the real website when the application runs.

I want to extract the first code example from each section of the Console.WriteLine documentation that matches my constraint: it has to be written in C#. To do that, I have this sample XPath:

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]

If I test the XPath online, I get the expected results, which is in this Gist.

In my node.js application, I am using xmldom and xpath to try and parse that exact same information out:

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::div/following-sibling::div/pre[position()>1]/code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
var sampleNodes = xpath.select(exampleLookup,doc);

This does not return anything, however.

What might be going on here?

Asked Nov 23, 2017 at 23:08 by Den

2 Answers

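This is most likely caused by the default namespace (xmlns="http://www.w3.org/1999/xhtml") in your HTML (XHTML). In XPath 1.0, an unprefixed element name such as div only matches elements that are in no namespace, so the div, pre, and code steps select nothing once the elements are in the XHTML namespace.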
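One way to check this hypothesis before rewriting the XPath is to look at the namespace of the parsed document's root element. The following is a minimal sketch, not a definitive diagnostic: the helper names rootNamespace and needsNamespacePrefix are hypothetical, and the code assumes that doc is the document returned by parseFromString. It relies only on the standard DOM properties documentElement and namespaceURI.

```javascript
// Hypothetical helper: returns the namespace URI of the root element
// of a parsed DOM document, or null if there is none.
function rootNamespace(doc) {
  const root = doc && doc.documentElement;
  return root ? (root.namespaceURI || null) : null;
}

// Hypothetical helper: true when the root element is namespaced, in which
// case unprefixed XPath name tests (div, pre, code) will not match it.
function needsNamespacePrefix(doc) {
  return rootNamespace(doc) !== null;
}

// Stand-in object so the sketch runs on its own; in the real application
// you would pass the document produced by the DOM parser instead.
const stubDoc = {
  documentElement: { namespaceURI: "http://www.w3.org/1999/xhtml" },
};

console.log(rootNamespace(stubDoc));
console.log(needsNamespacePrefix(stubDoc));
```

If the reported namespace is the XHTML one, either of the two approaches described in this answer should apply.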

Looking at the xpath docs, you should be able to bind the namespace to a prefix using useNamespaces and use the prefix in your xpath (untested)...

var xpath = require('xpath');
var dom = require('xmldom').DOMParser;

var exampleLookup = `//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::x:div/following-sibling::x:div/x:pre[position()>1]/x:code[contains(@class,'lang-csharp')]`;
var doc = new dom().parseFromString(rawHtmlString, 'text/html');
// useNamespaces returns a select function bound to the prefix map; call it instead of xpath.select
var select = xpath.useNamespaces({"x": "http://www.w3.org/1999/xhtml"});
var sampleNodes = select(exampleLookup, doc);

Instead of binding the namespace to a prefix, you could also use local-name() in your XPath, but I wouldn't recommend it. This is also covered in the docs.

Example...

//*[@id='System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_']/parent::*[local-name()='div']/following-sibling::*[local-name()='div']/*[local-name()='pre'][position()>1]/*[local-name()='code'][contains(@class,'lang-csharp')]
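Writing the local-name() predicates by hand is error-prone, so the expression can also be assembled from its steps. This is only a sketch: localStep is a hypothetical helper, not part of the xpath package, and it does nothing more than string concatenation.

```javascript
// Hypothetical helper: builds a namespace-agnostic XPath step that matches
// any element whose local name is `name`, followed by optional predicates.
function localStep(name, predicate = "") {
  return `*[local-name()='${name}']${predicate}`;
}

const id =
  "System_Console_WriteLine_System_String_System_Object_System_Object_System_Object_";

// Join the individual location steps with "/" to form the full expression.
const exampleLookup = [
  `//*[@id='${id}']`,
  `parent::${localStep("div")}`,
  `following-sibling::${localStep("div")}`,
  localStep("pre", "[position()>1]"),
  localStep("code", "[contains(@class,'lang-csharp')]"),
].join("/");

console.log(exampleLookup);
```

The resulting string is the same expression as the one above and can be passed to xpath.select without any namespace binding.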

The xpath-html library can help you query HTML with XPath, with minimal effort and few lines of code.

const fs = require("fs");
const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");

const xpath = require("xpath-html");
const node = xpath.fromPageSource(html).findElement("//*[contains(text(), 'with love')]");

console.log(`The matched tag name is "${node.getTagName()}"`);
console.log(`Your full text is "${node.getText()}"`);