HTML:
<div class="someclass">
<h3>First</h3>
<strong>Second</strong>
<hr>
Third
<br>
Fourth
<br>
<em></em>
...
</div>
From the above div node I want to get all child text nodes after hr ("Third", "Fourth", ... and there might be more).
If I do
document.querySelectorAll('div.someclass>hr~*')
I get NodeList [ br, br, em, ... ], with no text nodes.
With the following
document.querySelector('div.someclass').textContent
I get all the text nodes as a single string.
I can get each text node as
var third = document.querySelector('div.someclass').childNodes[6].textContent
var fourth = document.querySelector('div.someclass').childNodes[8].textContent
so I tried
document.querySelector('div.someclass').childNodes[5:] # SyntaxError
and slice()
document.querySelector('div.someclass').childNodes.slice(5) # TypeError
So is there any way I can get all child text nodes starting from the hr node?
UPDATE
I forgot to mention that this question is about web scraping, not web development. I cannot change the HTML source code.
Asked Feb 9, 2018 at 12:08 by Andersson; edited Feb 10, 2018 at 3:03 by BoltClock.
- "I get NodeList [ br, br, em, ... ] - no text nodes": maybe just put the text into <p> tags and they will also be returned? – messerbill Commented Feb 9, 2018 at 12:11
- @messerbill, sorry, I didn't mention that this question is about web scraping, not web development. I cannot change the HTML source code. – Andersson Commented Feb 9, 2018 at 12:13
- How about S(".someClass").text()? – Atal Shrivastava Commented Feb 9, 2018 at 12:13
4 Answers
You can get the content and split it on hr to get the HTML after the hr, then put that content into a separate div. You can then work with this div to get your content:
var content = document.querySelector('.someclass').innerHTML;
content = content.split('<hr>');
content = content[1];
document.querySelector('.hide').innerHTML = content;
/**/
var nodes = document.querySelector('.hide').childNodes;
for (var i = 0; i < nodes.length; i++) {
console.log(nodes[i].textContent);
}
.hide {
display: none;
}
<div class="someclass">
<h3>First</h3>
<strong>Second</strong>
<hr> Third
<br> Fourth
<br>
<em></em> ...
</div>
<div class="hide"></div>
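The same result can probably be had without rewriting innerHTML, by walking the siblings that follow the hr. This is only a minimal sketch, assuming the hr is a direct child of the container and that whitespace-only text nodes are not wanted. textAfter is a hypothetical helper name, not part of the DOM API:

```javascript
// Hypothetical helper: collect the trimmed text of every text node
// that follows `start` among its siblings.
function textAfter(start) {
  const TEXT_NODE = 3; // same numeric value as Node.TEXT_NODE in browsers
  const out = [];
  for (let node = start.nextSibling; node; node = node.nextSibling) {
    if (node.nodeType === TEXT_NODE) {
      const text = node.textContent.trim();
      if (text) out.push(text); // skip whitespace-only nodes
    }
  }
  return out;
}

// In a browser console, with the markup from the question:
// textAfter(document.querySelector('div.someclass > hr'));
```

Because it only reads nextSibling, nodeType and textContent, this leaves the page untouched, which may matter when scraping.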
.childNodes includes both text and non-text nodes.
Your syntax error is because JavaScript has no array-slicing syntax like [5:].
Also, a NodeList is array-like but is not an array, which is why slice doesn't work directly on childNodes.
1) Get your NodeList
var nodeList = document.querySelector('.someclass').childNodes;
2) Convert the NodeList to an actual array
var nodes = Array.prototype.slice.call(nodeList);
(Note: in modern ES6 browsers you can do nodes = Array.from(nodeList);
Modern browsers have also added .forEach support to NodeList objects, so there you can use .forEach directly on a NodeList without converting it to an array.)
3) Iterate and collect the text nodes you want
This depends on your own logic, but you can iterate the nodes and test node.nodeType == Node.TEXT_NODE to see whether a given node is a text node.
var foundHr = false,
results = [];
nodes.forEach(el => {
if (el.tagName == 'HR') { foundHr = true; }
else if (foundHr && el.nodeType == Node.TEXT_NODE) {
results.push(el.textContent);
}
});
console.log(results);
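The conversion in step 2 also makes the slicing the question attempted possible. Here is a sketch of that variant, assuming an HTML document (where nodeName is reported in upper case) and that the marker element is a direct child. sliceAfterTag is a hypothetical helper name:

```javascript
// Hypothetical helper: return the text of the child text nodes that come
// after the first child whose nodeName matches `tagName`.
function sliceAfterTag(parent, tagName) {
  const TEXT_NODE = 3; // same numeric value as Node.TEXT_NODE in browsers
  const nodes = Array.from(parent.childNodes); // array-like -> real array
  const index = nodes.findIndex(n => n.nodeName === tagName.toUpperCase());
  if (index === -1) return []; // marker element not found
  return nodes
    .slice(index + 1)                       // everything after the marker
    .filter(n => n.nodeType === TEXT_NODE)  // keep text nodes only
    .map(n => n.textContent);
}

// In a browser console:
// sliceAfterTag(document.querySelector('div.someclass'), 'hr');
```

Unlike a hard-coded slice(5), this does not depend on how many nodes precede the hr.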
You may get all text nodes under node
using this piece of code:
var walker = document.createTreeWalker(node, NodeFilter.SHOW_TEXT, null, false);
var textNode;
var result = [];
while (textNode = walker.nextNode()) {
result.push(textNode);
}
And you've got an Array
of text nodes, so you can slice()
it as you wish:
console.log(result.slice(5));
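Note that the walker returns every text node under node, including those nested inside child elements and the whitespace-only ones produced by line breaks between tags, so the right slice index depends on the markup. Here is a sketch of a helper that reduces such a list to non-empty strings. nonEmptyText is a hypothetical name, and it only assumes each item has a textContent property:

```javascript
// Hypothetical helper: map text nodes to trimmed strings and drop
// the ones that contained only whitespace.
function nonEmptyText(textNodes) {
  return Array.from(textNodes, n => n.textContent.trim()).filter(Boolean);
}

// With the `result` array built above:
// nonEmptyText(result.slice(5));
```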
Another [Recursive] Solution
const { childNodes } = <Element>document.querySelector('.someclass'); // throws an error if element doesn't exist
const siblings = sliceNodeList([], false, 'hr', ...childNodes);
function sliceNodeList(nodes: Node[], deep: boolean, selector: string, node?: Node, ...more: Node[]) {
if (!selector) return nodes;
if (!node) return nodes;
const { nodeType } = node;
const handle = { // not concerned with Attr Nodes
[Node.ELEMENT_NODE]: handleElement,
[Node.TEXT_NODE]: handleOther,
[Node.COMMENT_NODE]: handleOther,
}[ nodeType ];
if (handle) handle.call(this, nodes, deep, selector, node);
if (more.length) sliceNodeList(nodes, deep, selector, ...more);
return nodes;
}
function handleElement(nodes: Node[], deep: boolean, selector: string, node: Element) {
const { childNodes } = node; // not concerned with Attr Nodes
const matches = node.matches(selector);
if (nodes.length) nodes.push(node); // assume we must have already matched
else if (matches) nodes.push(node); // we matched on an element
if (deep) sliceNodeList(nodes, deep, selector, ...childNodes); // keep diving into substructures
return nodes;
}
function handleOther(nodes: Node[], deep: boolean, selector: string, node: Text|Comment) {
// push in place: sliceNodeList ignores the handler's return value
if (nodes.length) nodes.push(node); // assume we must have already matched
else if (node.data === selector) nodes.push(node); // we matched on a Text or Comment value
return nodes;
}
Basically, this just uses recursion by checking if there are more siblings while mapping to the correct node type's handler. It uses mutual recursion (if deep === true) by calling the parent function again to dive into nested structures. We didn't concern ourselves here with Attr nodes, but you can set up a handler for that.
Because a Text and a Comment share enough of the same interface (node.data), we are able to reuse the same function. That function, similar to handleElement, assumes that if (nodes.length) is truthy then we've already matched and can safely and reliably add the current node to the collection.
Unlike an Element in handleElement, a Text or Comment cannot call node.matches. However, we can still match on the node if its value is equal to the selector, giving us the ability to select by text. We could go a step further here: instead of using strict equality (===) on the entire text (or comment) value, we could use something like String.prototype.includes or a RegExp.
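As a rough sketch of that looser matching, the comparison could be pulled out into a small function that accepts either a substring or a RegExp. matchesData is a hypothetical name and is not part of the code above:

```javascript
// Hypothetical helper: test a Text/Comment value against a pattern.
// A string pattern matches as a substring; a RegExp is tested directly.
// Note: a RegExp with the g or y flag keeps state between test() calls,
// so those flags are best avoided here.
function matchesData(data, pattern) {
  if (pattern instanceof RegExp) return pattern.test(data);
  return data.includes(pattern);
}

// Possible use inside handleOther, replacing the strict comparison:
// if (matchesData(node.data, selector)) nodes.push(node);
```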
We could have used for, while, or other iteration approaches. It seems to me, however, that recursion is a better approach to walking the DOM, as the DOM is composite (Node implements the Composite pattern) and thus a recursive structure. I also just whipped this up in VSCode very quickly and didn't even run it in a browser console, but I write this pattern all the time, so it should be pretty close to good-to-go.
Hopefully helpfully.