javascript - How to select all text nodes after specific element - Stack Overflow

HTML:

<div class="someclass">
    <h3>First</h3> 
    <strong>Second</strong> 
    <hr>
    Third
    <br>
    Fourth
    <br>
    <em></em>
    ...
</div>

From the div node above, I want to get all child text nodes after the hr ("Third", "Fourth", ... and there might be more).

If I do

document.querySelectorAll('div.someclass>hr~*')

I get NodeList [ br, br, em, ... ], with no text nodes.

With the following

document.querySelector('div.someclass').textContent

I get the text of all nodes as a single string.

I can get each text node individually with

var third = document.querySelector('div.someclass').childNodes[6].textContent
var fourth = document.querySelector('div.someclass').childNodes[8].textContent

so I tried

document.querySelector('div.someclass').childNodes[5:]  // SyntaxError

and slice()

document.querySelector('div.someclass').childNodes.slice(5)  // TypeError

So is there any way to get all child text nodes starting from the hr node?

UPDATE

I forgot to mention that this question is about web scraping, not web development: I cannot change the HTML source code.

asked Feb 9, 2018 at 12:08 by Andersson; edited Feb 10, 2018 at 3:03 by BoltClock
  • " I get NodeList [ br, br, em, ... ] - no text nodes " maybe just put the text into <p> tags and they will also be returned? – messerbill Commented Feb 9, 2018 at 12:11
  • @messerbill, Sorry, I didn't mention that this question is about web-scraping, but not web-development.. I cannot change HTML source code – Andersson Commented Feb 9, 2018 at 12:13
  • how about $(".someClass").text() – Atal Shrivastava Commented Feb 9, 2018 at 12:13

4 Answers

You can take the div's innerHTML, split it on <hr> to get the markup that follows the hr, and put that markup into a separate hidden div. You can then walk that div's child nodes to get your content:

var content = document.querySelector('.someclass').innerHTML;
content = content.split('<hr>');
content = content[1];

document.querySelector('.hide').innerHTML = content;
/**/

var nodes = document.querySelector('.hide').childNodes;
for (var i = 0; i < nodes.length; i++) {
  console.log(nodes[i].textContent);
}

CSS:

.hide {
  display: none;
}

HTML:

<div class="someclass">
  <h3>First</h3>
  <strong>Second</strong>
  <hr> Third
  <br> Fourth
  <br>
  <em></em> ...
</div>
<div class="hide"></div>

.childNodes includes both text and non-text nodes.

Your syntax error is because JavaScript has no array slicing syntax like [5:].

Also, a NodeList is array-like but is not an array, which is why slice doesn't work directly on childNodes.

1) get your NodeList

var nodes = document.querySelector('.someclass').childNodes;

2) Convert NodeList to actual array

nodes = Array.prototype.slice.call(nodes);

(Note: in modern ES6 browsers you can use nodes = Array.from(nodes) instead. Modern browsers also support .forEach directly on NodeList objects, so there you can skip the array conversion.)

3) Iterate and collect the text nodes you want

What you collect depends on your own logic, but you can iterate over the nodes and check node.nodeType == Node.TEXT_NODE to tell whether a given node is a text node.

var foundHr = false,
    results = [];
nodes.forEach(el => {
    if (el.tagName == 'HR') { foundHr = true; }
    else if (foundHr && el.nodeType == Node.TEXT_NODE) {
        results.push(el.textContent);
    }
});
console.log(results);
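
As a variation on the loop above, here is a minimal sketch that starts at the hr itself and follows nextSibling, so no array conversion or flag is needed. The helper name textAfter is hypothetical, the constant 3 stands in for Node.TEXT_NODE, and dropping whitespace-only text nodes is an assumption about what a scraper wants:

```javascript
// nodeType value for text nodes (same as Node.TEXT_NODE in a browser).
const TEXT_NODE = 3;

// Collects the trimmed, non-empty text of every text node that follows
// `start` among its siblings. `textAfter` is a hypothetical helper name.
function textAfter(start) {
  const results = [];
  for (let node = start.nextSibling; node; node = node.nextSibling) {
    if (node.nodeType !== TEXT_NODE) continue; // skip br, em, comments, ...
    const text = node.textContent.trim();
    if (text) results.push(text); // ignore whitespace-only text nodes
  }
  return results;
}

// Browser usage (assumes the markup from the question):
// textAfter(document.querySelector('div.someclass > hr'));
```

Because it only looks at siblings of the hr, text nested inside later elements (for example inside the em) is not collected.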

You may get all text nodes under node using this piece of code:

var walker = document.createTreeWalker(node, NodeFilter.SHOW_TEXT, null, false);
var textNode;
var result = [];
while (textNode = walker.nextNode()) {
    result.push(textNode);
}

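Now you have a real Array of text nodes, so you can slice() it as you wish. Note that the index counts text nodes only, in document order, including whitespace-only ones and those nested inside child elements such as h3 and strong, so it is not the same as a childNodes index: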

console.log(result.slice(5));
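
Counting text nodes to find the right slice index is fragile if the markup before the hr changes. A possible alternative, sketched here under assumptions, is to set the walker's currentNode to the hr so that nextNode() only returns text nodes that come after it. The helper name textNodesAfter is hypothetical, and the constant 4 stands in for NodeFilter.SHOW_TEXT:

```javascript
const SHOW_TEXT = 4; // numeric value of NodeFilter.SHOW_TEXT

// Returns the text nodes inside `root` that come after `start` in document
// order. `textNodesAfter` is a hypothetical helper name.
function textNodesAfter(root, start) {
  const walker = root.ownerDocument.createTreeWalker(root, SHOW_TEXT);
  walker.currentNode = start; // begin the walk at `start` instead of at `root`
  const result = [];
  let textNode;
  while ((textNode = walker.nextNode())) {
    result.push(textNode);
  }
  return result;
}

// Browser usage (assumes the markup from the question):
// const root = document.querySelector('div.someclass');
// const texts = textNodesAfter(root, root.querySelector('hr'))
//   .map(n => n.textContent.trim())
//   .filter(Boolean);
```

Unlike a childNodes loop, the walker also descends into later elements, so text inside the em (if any) would be included.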

Another [Recursive] Solution

const { childNodes } = <Element>document.querySelector('.someclass');  // throws an error if element doesn't exist
const siblings = sliceNodeList([], false, 'hr', ...childNodes);

function sliceNodeList(nodes: Node[], deep: boolean, selector: string, node?: Node, ...more: Node[]) {
    if (!selector) return nodes;
    if (!node) return nodes;
    const { nodeType } = node;
    const handle = {  // not concerned with Attr Nodes
        [Node.ELEMENT_NODE]: handleElement,
        [Node.TEXT_NODE]: handleOther,
        [Node.COMMENT_NODE]: handleOther,
    }[ nodeType ];
    
    if (handle) handle.call(this, nodes, deep, selector, node);
    if (more.length) sliceNodeList(nodes, deep, selector, ...more);
    return nodes;
}

function handleElement(nodes: Node[], deep: boolean, selector: string, node: Element) {
    const { childNodes } = node;  // not concerned with Attr Nodes
    const matches = node.matches(selector);
    
    if (nodes.length) nodes.push(node);  // assume we must have already matched
    else if (matches) nodes.push(node);  // we matched on an element
    
    if (deep) sliceNodeList(nodes, deep, selector, ...childNodes);  // keep diving into substructures
    return nodes;
}

function handleOther(nodes: Node[], deep: boolean, selector: string, node: Text|Comment) {
    // Push in place: the caller ignores the return value, so returning a new array would drop the node.
    // Collect if we have already matched, or if this Text or Comment value equals the selector.
    if (nodes.length || node.data === selector) nodes.push(node);
    return nodes;
}

Basically, this uses recursion: it checks whether there are more siblings while dispatching each node to the handler for its node type. When deep === true it uses mutual recursion, with the element handler calling back into the parent function to dive into nested structures. We didn't concern ourselves here with Attr nodes, but you can set up a handler for those too.

Because a Text and a Comment share enough of the same interface (node.data) we are able to reuse the same function. That function, similar to handleElement, assumes that if (nodes.length) then we've already matched and can safely & reliably add the current node to the collection.

Unlike handleElement, a Text or Comment cannot call node.matches. However, we can still match on the node if its value is equal to the selector, which gives us the ability to select by text. We could go a step further here: instead of a strict equality check (===) on the entire text (or comment) value, we could use something like String.prototype.includes or a RegExp.
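
The looser matching mentioned above could look roughly like this. It is only a sketch: matchesText is a hypothetical helper that would replace the node.data === selector comparison, and it assumes the selector may be either a string or a RegExp:

```javascript
// Returns true when `data` (the node.data of a Text or Comment) matches `pattern`.
// A string pattern is treated as a substring; a RegExp is tested as-is.
// `matchesText` is a hypothetical helper name.
function matchesText(data, pattern) {
  return pattern instanceof RegExp ? pattern.test(data) : data.includes(pattern);
}

// Example calls with stand-in strings instead of real node.data values:
console.log(matchesText('\n    Third\n    ', 'Third')); // substring match
console.log(matchesText('Fourth', /^f/i));               // RegExp match
```

If a RegExp is reused across many nodes, avoid the g and y flags, since test() then keeps state in lastIndex between calls.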

We could have used for, while, or other iteration approaches. It seems to me, however, that recursion is a better fit for walking the DOM, because the DOM is composite (Node implements the Composite pattern) and thus a recursive structure. I also wrote this very quickly in VSCode and did not run it in a browser console, but I write this pattern all the time, so it should be close to working.

Hope this helps.
