My intent was to give advice on the question "Delete everything between two strings (inclusive)": use the HTMLDocument parser instead of a text-based replace command.
But somehow the OuterHTML property of the <aside> element doesn't include the whole element, up to and including the </aside> end tag:
html
$Html = @'
<html>
<head>
<title>Title</title>
</head>
<body>
<h1>Some header elements</h1>
<aside>
<p>huge text in between aside</p>
</aside>
<div>
<p>huge text in between div</p>
</div>
<p>Some other elements</p>
</body>
</html>
'@
Parsing
function ParseHtml($String) {
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    }
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}
$Document = ParseHtml $Html
<aside>
$Document.getElementsByTagName('aside') | ForEach-Object { $_.OuterHTML }
<ASIDE>
<div>
$Document.getElementsByTagName('div') | ForEach-Object { $_.OuterHTML }
<DIV><P>huge text in between div</P></DIV>
- What is so special about the <aside> element that explains the difference from other elements, e.g. a <div>?
- What is the proper way to get the whole <aside> element, up to and including the </aside> end tag?
1 Answer
I believe the answer has already been given in the comments by C3roe and Mathias: the parser isn't able to correctly interpret elements introduced in HTML5. As a workaround, you can use a more modern parser, for example the one used by ConvertFrom-Html (its default engine is AgilityPack).
$parsed = $html | ConvertFrom-Html
$parsed.SelectSingleNode('//aside').Remove()
$parsed.OuterHtml
# <html>
# <head>
# <title>Title</title>
# </head>
# <body>
# <h1>Some header elements</h1>
#
# <div>
# <p>huge text in between div</p>
# </div>
# <p>Some other elements</p>
# </body>
# </html>
For simple HTML like the one in the question, you could get away with using XmlDocument to parse it: select the node, then call RemoveChild() on its parent node. This only works as long as the markup is also well-formed XML.
$xml = [xml]::new()
$xml.PreserveWhitespace = $true
$xml.LoadXml($html)
$node = $xml.SelectSingleNode('//aside')
$null = $node.ParentNode.RemoveChild($node)
$xml.OuterXml
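As a rough sketch of how the XmlDocument approach above could be generalized, the snippet below removes every element with a given tag name rather than only the first match. The function name Remove-XmlElement and its parameters are hypothetical, made up for this illustration, and the sketch assumes the input is well-formed XML (for example, no unclosed <br> tags); anything else should go through a real HTML parser.

```powershell
# Hypothetical helper (not part of any module): removes all elements named
# $TagName from markup that is assumed to be well-formed XML.
function Remove-XmlElement {
    param(
        [Parameter(Mandatory)][string] $Markup,
        [Parameter(Mandatory)][string] $TagName
    )

    $xml = [xml]::new()
    $xml.PreserveWhitespace = $true

    try {
        $xml.LoadXml($Markup)
    }
    catch {
        # LoadXml throws on anything that is not well-formed XML.
        throw "Markup is not well-formed XML, use an HTML parser instead: $($_.Exception.Message)"
    }

    # @(...) copies the matches into an array first, so removing nodes
    # does not interfere with the enumeration of the node list.
    foreach ($node in @($xml.SelectNodes("//$TagName"))) {
        $null = $node.ParentNode.RemoveChild($node)
    }

    $xml.OuterXml
}

$sample = '<body><aside><p>a</p></aside><div><p>b</p></div></body>'
Remove-XmlElement -Markup $sample -TagName 'aside'
# <body><div><p>b</p></div></body>
```

Throwing on malformed input is a deliberate choice here: silently returning the original string would hide the fact that nothing was removed.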
<!DOCTYPE html> to the very start of your HTML code? – C3roe, commented Mar 3 at 10:03
<!DOCTYPE html> or even <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3./TR/html4/loose.dtd"> doesn't change the results. – iRon, commented Mar 3 at 10:10
section, header, nav, etc.) as well as any non-existing tag (e.g. <iRon> </iRon>). Safe to surmise it just doesn't speak anything newer than HTML 4.01 – Mathias R. Jessen, commented Mar 3 at 10:28