javascript - How can I get all the HTML in a document or node containing shadowRoot elements - Stack Overflow

I have not seen a satisfactory answer for this question. This is basically a duplicate of this question, but it was improperly closed and the answers given are not sufficient.

I have come up with my own solution which I will post below.

This can be useful for web scraping, or in my case, running tests on a JavaScript library that handles custom elements. I make sure it is producing the output that I want, then I use this function to scrape the HTML for a given test output and use that copied HTML as the expected output to compare the test against in the future.
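As a rough illustration of the testing workflow described above, here is a minimal sketch of comparing extracted HTML against a stored expected string. The helper names normalizeHtml and htmlMatches and the sample markup are hypothetical, not part of the original post, and the whitespace normalization is an assumption that would not suit whitespace-sensitive content such as pre elements.

```javascript
// Hypothetical helpers for comparing scraped HTML with a saved expectation.

// Collapse whitespace so that formatting differences do not fail the test.
function normalizeHtml(html) {
    return html
        .replace(/>\s+</g, '><') // drop whitespace between adjacent tags
        .replace(/\s+/g, ' ')    // collapse remaining runs of whitespace
        .trim();
}

// True when both strings are equal after normalization.
function htmlMatches(actual, expected) {
    return normalizeHtml(actual) === normalizeHtml(expected);
}

// In a browser test, `actual` would be the string returned by the
// extraction function for the node under test.
const expected = `
    <my-card>
        <h2>Title</h2>
        <p>Body text</p>
    </my-card>`;
const actual = '<my-card><h2>Title</h2><p>Body text</p></my-card>';

console.log(htmlMatches(actual, expected)); // true
```

Normalizing both sides keeps the stored expectation readable (it can be indented) while still comparing the structure that matters.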

asked Nov 6, 2021 at 20:42 by Moss

3 Answers

Here is a function that can do what is requested. Note that it ignores HTML comments and other fringe things. But it retrieves regular elements, text nodes, and custom elements with shadowRoots. It also handles slotted template content. It has not been tested exhaustively but seems to be working well for my needs.

Use it like extractHTML(document.body) or extractHTML(document.getElementById('app')).

function extractHTML(node) {
            
    // return a blank string if not a valid node
    if (!node) return ''

    // if it is a text node just return the trimmed textContent
    if (node.nodeType===3) return node.textContent.trim()

    //beyond here, only deal with element nodes
    if (node.nodeType!==1) return ''

    let html = ''

    // clone the node for its outer html sans inner html
    let outer = node.cloneNode()

    // if the node has a shadowroot, jump into it
    node = node.shadowRoot || node
    
    if (node.children.length) {
        
        // we checked for children but now iterate over childNodes
        // which includes #text nodes (and even other things)
        for (let n of node.childNodes) {
            
            // if the node is a slot
            if (n.assignedNodes) {
                
                // an assigned slot
                if (n.assignedNodes()[0]){
                    // Note: a slot can have more than one assigned node;
                    // only the first one is extracted here
                    html += extractHTML(n.assignedNodes()[0])

                // an unassigned slot
                } else { html += n.innerHTML }                    

            // node is not a slot, recurse
            } else { html += extractHTML(n) }
        }

    // node has no children
    } else { html = node.innerHTML }

    // insert all the (children's) innerHTML 
    // into the (cloned) parent element
    // and return the whole package
    outer.innerHTML = html
    return outer.outerHTML
    
}

You can only access shadowRoots from the outside if they were created with the mode:"open" setting. For a closed shadowRoot, element.shadowRoot is null.
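To see where this limitation might affect an extraction, here is a small sketch that lists custom elements whose shadowRoot property is null. The function name findOpaqueCustomElements is hypothetical, and the check is only a heuristic: a custom element (its tag name always contains a hyphen) with a null shadowRoot either has a closed root or has no shadow root at all, and the two cases cannot be told apart from outside.

```javascript
// Hypothetical helper: collect custom elements whose shadow content cannot
// be read from outside (closed root) or that simply have no shadow root.
function findOpaqueCustomElements(el, found = []) {
    // Custom element names always contain a hyphen.
    const isCustom = typeof el.localName === 'string' && el.localName.includes('-');
    if (isCustom && !el.shadowRoot) found.push(el);

    // Descend into the open shadow root when there is one, else the element.
    const root = el.shadowRoot || el;
    for (const child of root.children) {
        findOpaqueCustomElements(child, found);
    }
    return found;
}

// Browser-only usage; skipped when no DOM is available.
if (typeof document !== 'undefined') {
    console.log(findOpaqueCustomElements(document.body));
}
```

Any element reported here is a place where the extracted HTML may be missing rendered content.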

You can then dive into elements and shadowRoots with something like:

const shadowDive = (
    el,
    selector,
    match = (m, r) => console.warn('match', m, r)
) => {
    let root = el.shadowRoot || el;
    root.querySelector(selector) && match(root.querySelector(selector), root);
    [...root.children].map(el => shadowDive(el, selector, match));
}
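A variation on the same idea, as a sketch only: instead of calling a callback, the walk below returns an array of every matching element and visits each element once, so a match is reported a single time. The name collectDeep is hypothetical. It relies on Element.matches and on the children collection, and it descends into both the open shadow root and the light DOM children of each host, which is a design choice rather than something the answer above prescribes.

```javascript
// Hypothetical helper: collect all elements matching `selector`,
// crossing open shadow roots along the way.
function collectDeep(el, selector, found = []) {
    // Document and shadow root objects have no matches(), hence the guard.
    if (typeof el.matches === 'function' && el.matches(selector)) {
        found.push(el);
    }
    // Closed shadow roots report el.shadowRoot === null and are skipped.
    if (el.shadowRoot) {
        for (const child of el.shadowRoot.children) {
            collectDeep(child, selector, found);
        }
    }
    // Light DOM children (including slotted content) are walked as well.
    for (const child of el.children) {
        collectDeep(child, selector, found);
    }
    return found;
}

// Browser-only usage; skipped when no DOM is available.
if (typeof document !== 'undefined') {
    console.log(collectDeep(document.body, 'button'));
}
```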

Note: extracting raw HTML is pointless if Web Component styling is based on shadowDOM behaviour; you will lose all correct styling.

For me the solution by Moss nearly worked. Just slots were not included in the output. I used an LLM to improve the answer and tested it and now slots were included just like I needed. For anyone else stumbling across this and needing slots, here is the code:

function extractHTML(node) {

    // return a blank string if not a valid node
    if (!node) return '';

    // if it is a text node just return the trimmed textContent
    if (node.nodeType === 3) return node.textContent.trim();

    // beyond here, only deal with element nodes
    if (node.nodeType !== 1) return '';

    let html = '';

    // clone the node for its outer html sans inner html
    let outer = node.cloneNode();

    // if the node has a shadowroot, jump into it
    node = node.shadowRoot || node;

    if (node.children.length || node.childNodes.length) {

        // iterate over childNodes which includes #text nodes and slots
        for (let n of node.childNodes) {

            // if the node is a slot
            if (n.nodeName === 'SLOT') {

                // check if slot has assigned nodes
                let assignedNodes = n.assignedNodes();
                if (assignedNodes.length > 0) {
                    // if there are assigned nodes, recurse over them
                    for (let assignedNode of assignedNodes) {
                        html += extractHTML(assignedNode);
                    }
                } else {
                    // if no assigned nodes, preserve the <slot> element itself
                    html += n.outerHTML;
                }

            // node is not a slot, recurse normally
            } else {
                html += extractHTML(n);
            }
        }
    } else {
        // node has no children, insert its innerHTML
        html = node.innerHTML;
    }

    // insert all the (children's) innerHTML
    // into the (cloned) parent element
    // and return the whole package
    outer.innerHTML = html;
    return outer.outerHTML;

}