html

My intent was to advise, on the question "Delete everything between two strings (inclusive)", using the HTMLDocument parser instead of a text-based replace command.
But somehow the OuterHTML property of the <aside> element doesn't include the whole element up to and including the </aside> end tag:

$Html = @'
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
        <h1>Some header elements</h1>
        <aside>
            <p>huge text in between aside</p>
        </aside>
        <div>
            <p>huge text in between div</p>
        </div>
        <p>Some other elements</p>
    </body>
</html>
'@

Parsing

function ParseHtml($String) {
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    } 
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}
$Document = ParseHtml $Html

`<aside>`

$Document.getElementsByTagName('aside') | ForEach-Object { $_.OuterHTML }

<ASIDE>

`<div>`

$Document.getElementsByTagName('div') | ForEach-Object { $_.OuterHTML }

<DIV><P>huge text in between div</P></DIV>

What is so special about the <aside> element that explains the difference from other elements, e.g. a <div>?
What is the proper way to get the whole <aside> element, up to and including the </aside> end tag?

asked Mar 3 at 9:46 by iRon, edited Mar 4 at 18:18 by artkoshelev

Probably an issue with an outdated parser that doesn't know about any of the "new" elements that HTML5 introduced, I suppose? Does anything change if you add <!DOCTYPE html> to the very start of your HTML code? – C3roe Commented Mar 3 at 10:03
@C3roe, adding <!DOCTYPE html> or even <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3./TR/html4/loose.dtd"> doesn't change the results. – iRon Commented Mar 3 at 10:10
I suspect @C3roe is spot on here. The only thing special about it is that it isn't recognized as a standard HTML tag: you see the same behavior with literally any other tag introduced with or after HTML5 (section, header, nav, etc.) as well as with any non-existent tag (e.g. <iRon> </iRon>). Safe to surmise it just doesn't speak anything newer than HTML 4.01. – Mathias R. Jessen Commented Mar 3 at 10:28


1 Answer

I believe the answer has already been given in the comments by C3roe and Mathias: the parser isn't able to correctly interpret elements introduced in HTML5. As a workaround, you can use a more modern parser, for example the one used by ConvertFrom-Html (its default engine is AgilityPack).

$parsed = $html | ConvertFrom-Html
$parsed.SelectSingleNode('//aside').Remove()
$parsed.OuterHtml

# <html>
#     <head>
#         <title>Title</title>
#     </head>
#     <body>
#         <h1>Some header elements</h1>
# 
#         <div>
#             <p>huge text in between div</p>
#         </div>
#         <p>Some other elements</p>
#     </body>
# </html>

For simple HTML like that in the question, which happens to be well-formed XML, you could get away with using XmlDocument to parse it: select the node, then call RemoveChild() on its parent node.

$xml = [xml]::new()
$xml.PreserveWhitespace = $true
$xml.LoadXml($html)
$node = $xml.SelectSingleNode('//aside')
$null = $node.ParentNode.RemoveChild($node)
$xml.OuterXml
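
If more than one <aside> may be present, or the input might not be well-formed XML, the same idea can be wrapped in a small helper. This is only a minimal sketch and not part of the approach above as posted: the function name `Remove-XmlElement` and its parameters `-Markup` and `-TagName` are hypothetical, and it assumes the tag name can be used directly in an XPath expression (no namespaces).

```powershell
# Hypothetical helper (name and parameters are illustrative only).
# Removes every element with the given tag name from markup that is
# well-formed XML. Emits a warning and returns $null if parsing fails.
function Remove-XmlElement {
    param(
        [Parameter(Mandatory)][string]$Markup,
        [Parameter(Mandatory)][string]$TagName
    )

    $xml = [xml]::new()
    $xml.PreserveWhitespace = $true

    try {
        $xml.LoadXml($Markup)
    }
    catch {
        # LoadXml throws when the markup is not well-formed XML,
        # which is common for real-world HTML.
        Write-Warning "Input is not well-formed XML: $($_.Exception.Message)"
        return $null
    }

    # Copy the matches into an array first, so that removing nodes
    # does not disturb the enumeration of the node list.
    $nodes = @($xml.SelectNodes("//$TagName"))
    foreach ($node in $nodes) {
        $null = $node.ParentNode.RemoveChild($node)
    }

    $xml.OuterXml
}
```

With the here-string from the question, `Remove-XmlElement -Markup $Html -TagName 'aside'` would be the equivalent call. For HTML that is not valid XML (unclosed <br>, unquoted attributes, and so on) the helper only warns and returns nothing, so a real HTML parser remains the safer choice.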


html - Why isn't the end tag included in an ASIDE.OuterHTML - Stack Overflow
