I have badly formed HTML, thanks to MS Word 10's "Save as htm, html". Here's a sample of what I'm trying to sanitize.
<html xmlns:v="urn:schemas-microsoft-com:vml"... other xmlns>
<head>
<meta tags, title, styles, a couple comments too (they are irrelevant to the question)>
</head>
<body lang=EN-US link=blue vlink=purple style='tab-interval:36.0pt'>
<div class=WordSection1>
<h1>Pros and Cons of a Website</h1>
<p class=MsoBodyText align=left style='a long irrelevant list'><span style='long list'><o:p> </o:p></span></p>(this is a sample of what it uses as line breaks. Take note of the <o:p> tag).
<p class=MsoBodyText style='margin-right:5.75pt;line-height:115%'>
A<span style='letter-spacing:.05pt'> </span>SAMPLE<span style='letter-spacing:.05pt'> </span>TEXT
</p>
</div>
<div class=WordSection2>...same pattern in div 1</div>
<div class=WordSection3>...same...</div>
</body>
</html>
What I need from all of this is:
<div>...A SAMPLE TEXT</div>
<div>...same pattern in div 1</div>
<div>...same...</div>
What I have so far:
$dom = new DOMDocument;
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$body = $xpath->query('//html/body');
$nodes = $body->item(0)->getElementsByTagName('*');
foreach ($nodes as $node) {
    if ($node->tagName == 'script') $node->parentNode->removeChild($node);
    if ($node->tagName == 'a') continue;
    $attrs = $xpath->query('@*', $node);
    foreach ($attrs as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($body->item(0)));
It gives me:
<body lang="EN-US" link="blue" vlink="purple" style="tab-interval:36.0pt">
<div>
<h1>Pros and Cons of a Website</h1>
<p><p> </p></p>
<p>A SAMPLE TEXT</p>
</div>
<div>...same pattern in div 1</div>
<div>...same...</div>
</body>
That is fine, except that I want the body tag removed. I also want the h1 and its content removed, but when I write:
if($node->tagName=='script' || $node->tagName=='h1') $node->parentNode->removeChild($node);
something weird happens:
<p><p> </p></p> becomes <p class="MsoBodyText" ...all the very long attributes I was trying to remove in the first place><p> </p></p>
I've come across some very good answers, such as:
- How to get innerHTML of DOMNode? (I don't know how to properly implement Haim Evgi's answer or Keyacom's answer. Marco Marsala's answer is the closest I got, but the divs all kept their classes.)
3 Answers
The DOMNodeList returned by getElementsByTagName() is live, so the removal of h1 shifts the list of $nodes, causing <p class="MsoBodyText"> to be skipped in the next iteration. To avoid this, replace foreach with a for loop and decrement the current index whenever a node is removed.
$dom = new DOMDocument;
@$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$bodyNode = $xpath->query('//html/body')->item(0);
$nodes = $bodyNode->getElementsByTagName('*');
for ($i = 0; $i < $nodes->count(); $i++) {
    $node = $nodes->item($i);
    if ($node->tagName == 'script' || $node->tagName == 'h1') {
        $node->parentNode->removeChild($node);
        $i--;     // the live list shifted left, so revisit this index
        continue; // nothing left to clean on a removed node
    }
    if ($node->tagName == 'a') {
        continue;
    }
    // Strip every attribute from the remaining elements.
    $attrs = $xpath->query('@*', $node);
    foreach ($attrs as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($bodyNode)) . PHP_EOL;
Then, the saveHTML() function can be invoked for each child node, resulting in a combined output that omits the parent body tag.
$inner = [];
foreach ($bodyNode->childNodes as $node) {
    $inner[] = trim($bodyNode->ownerDocument->saveHTML($node));
}
echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;
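The per-child loop above can also be wrapped in a small reusable function. The following is only a sketch: the name innerHtml() is hypothetical (the DOM extension has no such function), and the sample markup is invented for illustration.

```php
<?php
// Hypothetical helper: serialize the children of a node, but not the node itself.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes only that node's subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Invented sample input standing in for the Word export.
$dom = new DOMDocument;
$dom->loadHTML(
    '<html><body lang="EN-US"><div>one</div><div>two</div></body></html>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

$body = $dom->getElementsByTagName('body')->item(0);
$out = innerHtml($body);
echo $out, PHP_EOL;
```

The body tag and its attributes never appear in the result because only its children are serialized.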
As an alternative, extract the text alone and recreate the wrapping tag.
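$inner = [];
foreach ($bodyNode->childNodes as $node) {
    // textContent is decoded text, so escape it again before emitting HTML.
    $text = htmlspecialchars(trim($node->textContent));
    if ($node->nodeType != XML_ELEMENT_NODE) {
        $inner[] = $text;
        continue;
    }
    // Recreate the wrapping tag without any of its attributes.
    $inner[] = sprintf('<%s>%s</%s>',
        $node->tagName, $text, $node->tagName);
}
echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;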
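Another way to sidestep the shifting list, instead of adjusting the index, is to copy the live DOMNodeList into a plain array with iterator_to_array() before removing anything. This is a minimal sketch under assumptions: the sample markup is invented, and the cleaning rules simply mirror those in the question.

```php
<?php
// Invented sample input standing in for the Word export.
$html = '<html><body lang="EN-US"><div class="WordSection1"><h1>Title</h1>'
      . '<p class="MsoBodyText" style="x">A<span style="y"> </span>SAMPLE</p>'
      . '</div></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);

// The snapshot is an ordinary array, so removals cannot shift it.
$snapshot = iterator_to_array($body->getElementsByTagName('*'));

foreach ($snapshot as $node) {
    if ($node->tagName === 'h1' || $node->tagName === 'script') {
        $node->parentNode->removeChild($node);
        continue;
    }
    if ($node->tagName === 'a') {
        continue; // links keep their attributes, as in the question
    }
    // Remove attributes one at a time until none are left.
    while ($node->attributes->length) {
        $node->removeAttribute($node->attributes->item(0)->name);
    }
}

// Serialize the children of body only, then drop the now-bare span tags.
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $dom->saveHTML($child);
}
$out = str_ireplace(['<span>', '</span>'], '', $out);
echo $out, PHP_EOL;
```

Descendants of a removed element remain in the snapshot and are still visited, which is harmless here because they are never serialized.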
I really appreciate that some people took the time to read through this very long question and provide solutions. I feared I had asked a bad or tiresome question that nobody would have time to answer, so I worked out a solution myself. When I came back to post it, I noticed that others had already answered, but I'm posting it anyway.
$dom = new DOMDocument;
libxml_use_internal_errors(true); // suppresses the invalid HTML tag warnings
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_use_internal_errors(false);
$body = $dom->getElementsByTagName("html");
$nodes = $body->item(0)->getElementsByTagName('*');
$innerHTML = ''; // stays empty if the document has no body element
for ($i = $nodes->length; --$i >= 0; ) { // note the reverse loop: removals cannot shift unvisited nodes
    $node = $nodes->item($i);
    if ($node->tagName == 'script' || $node->tagName == 'h1') {
        $node->parentNode->removeChild($node);
        continue; // nothing left to clean on a removed node
    }
    if ($node->tagName == 'a') continue;
    if ($node->tagName == 'body') {
        // body is reached last, after all of its descendants have been cleaned
        $innerHTML = '';
        $chnodes = $node->childNodes;
        foreach ($chnodes as $chnode) {
            $innerHTML .= $node->parentNode->ownerDocument->saveHTML($chnode);
        }
    }
    $attributes = $node->attributes;
    while ($attributes->length) {
        $node->removeAttribute($attributes->item(0)->name);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $innerHTML);
If you have a recent PHP version at hand (8+), you can move all the child nodes of the body element into a document fragment and call saveHTML() on it:
$element = $body->item(0); # the body element itself from xpath result
$fragment = $dom->createDocumentFragment();
$fragment->append(...$element->childNodes);
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($fragment));
This moves the child nodes into the fragment, so it only solves the inner HTML problem and can only be applied once. Where you put it therefore matters.
It does suggest, though, that it is often better to collect the elements you want to export by appending them to a fragment than to remove the unwanted ones from the original document.
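The collect-by-appending idea could look roughly like the sketch below. It rests on assumptions: the sample markup is invented, each section is taken to be a div directly under body, and only the text of its p children is kept. It uses appendChild() rather than append(), so it does not depend on PHP 8.

```php
<?php
// Invented sample input standing in for the Word export.
$html = '<html><body lang="EN-US">'
      . '<div class="WordSection1"><h1>Title</h1>'
      . '<p class="MsoBodyText"><span style="x"> </span></p>'
      . '<p class="MsoBodyText" style="y">A<span style="z"> </span>SAMPLE<span> </span>TEXT</p></div>'
      . '<div class="WordSection2"><p>Second section</p></div>'
      . '</body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$fragment = $dom->createDocumentFragment();

foreach ($xpath->query('//body/div') as $section) {
    // Build a fresh, attribute-free div instead of cleaning the old one.
    $cleanDiv = $dom->createElement('div');
    foreach ($xpath->query('./p', $section) as $p) {
        // Trim ordinary whitespace and non-breaking spaces from both ends.
        $text = preg_replace('/^[\s\x{A0}]+|[\s\x{A0}]+$/u', '', $p->textContent);
        if ($text === '') {
            continue; // skip the empty paragraphs Word uses as line breaks
        }
        $cleanP = $dom->createElement('p');
        $cleanP->appendChild($dom->createTextNode($text));
        $cleanDiv->appendChild($cleanP);
    }
    $fragment->appendChild($cleanDiv);
}

$out = $dom->saveHTML($fragment);
echo $out, PHP_EOL;
```

Because nothing is removed from the original document, there is no live list to shift, and the h1, the spans, and every attribute are left behind simply by never being copied.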
for ($i = $nodes->length - 1; $i >= 0; $i--) { ...
– You Old Fool Commented Jan 19 at 2:37