I have not seen a satisfactory answer for this question. This is basically a duplicate of this question, but it was improperly closed and the answers given are not sufficient.
I have come up with my own solution which I will post below.
This can be useful for web scraping, or in my case, running tests on a JavaScript library that handles custom elements. I make sure it is producing the output that I want, then I use this function to scrape the HTML for a given test output and use that copied HTML as the expected output to compare the test against in the future.
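The comparison step described above could be sketched roughly like this. The helper names normalizeMarkup and matchesExpected are hypothetical (they are not part of any library), and the sketch assumes that whitespace between tags is not significant for the test, which would not hold for content inside pre elements, for example.

```javascript
// Hypothetical helper: normalise markup so that formatting-only
// differences (indentation, line breaks) do not fail the comparison.
// Assumption: whitespace between tags carries no meaning in the test.
function normalizeMarkup(html) {
  return html
    .replace(/>\s+</g, '><') // drop whitespace between adjacent tags
    .replace(/\s+/g, ' ')    // collapse any remaining whitespace runs
    .trim();
}

// Hypothetical helper: compare scraped HTML against the stored expected HTML.
function matchesExpected(actual, expected) {
  return normalizeMarkup(actual) === normalizeMarkup(expected);
}
```

In a test, actual would be the string scraped from the rendered component and expected the HTML copied from an earlier, manually verified run.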
Asked Nov 6, 2021 at 20:42 by Moss
3 Answers
Here is a function that does what is requested. Note that it ignores HTML comments and other fringe cases, but it retrieves regular elements, text nodes, and custom elements with shadowRoots. It also handles slotted template content. It has not been tested exhaustively but seems to be working well for my needs.
Use it like extractHTML(document.body) or extractHTML(document.getElementById('app')).
function extractHTML(node) {
  // return a blank string if not a valid node
  if (!node) return ''
  // if it is a text node just return the trimmed textContent
  if (node.nodeType === 3) return node.textContent.trim()
  // beyond here, only deal with element nodes
  if (node.nodeType !== 1) return ''
  let html = ''
  // clone the node for its outer html sans inner html
  let outer = node.cloneNode()
  // if the node has a shadowRoot, jump into it
  node = node.shadowRoot || node
  if (node.children.length) {
    // we checked for children but now iterate over childNodes
    // which includes #text nodes (and even other things)
    for (let n of node.childNodes) {
      // if the node is a slot
      if (n.assignedNodes) {
        const assigned = n.assignedNodes()
        if (assigned.length) {
          // an assigned slot: a slot can have several assigned nodes,
          // so recurse into each of them
          for (let a of assigned) html += extractHTML(a)
        // an unassigned slot: keep its fallback content
        } else { html += n.innerHTML }
      // node is not a slot, recurse
      } else { html += extractHTML(n) }
    }
  // node has no element children
  } else { html = node.innerHTML }
  // insert all the (children's) innerHTML
  // into the (cloned) parent element
  // and return the whole package
  outer.innerHTML = html
  return outer.outerHTML
}
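One caveat about the text-node branch: textContent returns the unescaped text, so a text node whose source was `&lt;b&gt;` comes back as `<b>` and is parsed as a real element once it is assigned to outer.innerHTML. A minimal sketch of a workaround, assuming a hypothetical helper named escapeText that is not part of the function above:

```javascript
// Hypothetical helper: escape the characters that would otherwise be
// re-parsed as markup when the string is assigned to innerHTML.
function escapeText(text) {
  return text
    .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}
```

The text-node branch could then return escapeText(node.textContent.trim()) instead of the raw string. Whether that matters depends on whether the scraped pages contain escaped markup in their text.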
Only if shadowRoots are created with the mode: "open" setting can you access shadowRoots from the outside.
You can then dive into elements and shadowRoots with something like:
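const shadowDive = (
  el,
  selector,
  match = (m, r) => console.warn('match', m, r)
) => {
  // search inside the open shadowRoot if there is one, else in the element itself
  let root = el.shadowRoot || el;
  const found = root.querySelector(selector);
  if (found) match(found, root);
  // recurse into every child element to reach nested shadowRoots
  [...root.children].forEach(child => shadowDive(child, selector, match));
}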
Note: extracting raw HTML is pointless if Web Component styling is based on shadowDOM behaviour; you will lose all correct styling.
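If you want the matches back as a collection instead of handling them in a callback, a variation could look like the sketch below. The name shadowQueryAll is hypothetical, and it only relies on querySelectorAll and shadowRoot, so closed shadowRoots are still skipped. It collects into a Set so that an element is never reported twice.

```javascript
// Hypothetical variation on shadowDive: collect every match across
// open shadowRoots and return them as a Set.
const shadowQueryAll = (root, selector, found = new Set()) => {
  // querySelectorAll only sees the current tree; it does not cross shadow boundaries
  for (const m of root.querySelectorAll(selector)) found.add(m);
  // visit every element in this tree and descend into any open shadowRoot
  for (const el of root.querySelectorAll('*')) {
    if (el.shadowRoot) shadowQueryAll(el.shadowRoot, selector, found);
  }
  return found;
};
```

Use it like [...shadowQueryAll(document.body, 'button')] to get an array of all matching elements, including those inside open shadowRoots.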
For me the solution by Moss nearly worked; only the slots were missing from the output. I used an LLM to improve the answer, tested it, and now slots are included just like I needed. For anyone else stumbling across this and needing slots, here is the code:
function extractHTML(node) {
  // return a blank string if not a valid node
  if (!node) return '';
  // if it is a text node just return the trimmed textContent
  if (node.nodeType === 3) return node.textContent.trim();
  // beyond here, only deal with element nodes
  if (node.nodeType !== 1) return '';
  let html = '';
  // clone the node for its outer html sans inner html
  let outer = node.cloneNode();
  // if the node has a shadowroot, jump into it
  node = node.shadowRoot || node;
  if (node.children.length || node.childNodes.length) {
    // iterate over childNodes which includes #text nodes and slots
    for (let n of node.childNodes) {
      // if the node is a slot
      if (n.nodeName === 'SLOT') {
        // check if slot has assigned nodes
        let assignedNodes = n.assignedNodes();
        if (assignedNodes.length > 0) {
          // if there are assigned nodes, recurse over them
          for (let assignedNode of assignedNodes) {
            html += extractHTML(assignedNode);
          }
        } else {
          // if no assigned nodes, preserve the <slot> element itself
          html += n.outerHTML;
        }
      // node is not a slot, recurse normally
      } else {
        html += extractHTML(n);
      }
    }
  } else {
    // node has no children, insert its innerHTML
    html = node.innerHTML;
  }
  // insert all the (children's) innerHTML
  // into the (cloned) parent element
  // and return the whole package
  outer.innerHTML = html;
  return outer.outerHTML;
}