php - How can I remove tag names but leave the inner html contents using DOMDocument

I have some terribly formed HTML, thanks to MS Word 10's "save as htm, html". Here's a sample of what I'm trying to sanitize.

<html xmlns:v="urn:schemas-microsoft-com:vml"... other xmlns>
    <head>
        <meta tags, title, styles, a couple comments too (they are irrelevant to the question)>
    </head>
    <body lang=EN-US link=blue vlink=purple style='tab-interval:36.0pt'>
        <div class=WordSection1>
            <h1>Pros and Cons of a Website</h1>
            <p class=MsoBodyText align=left style='a long irrelevant list'><span style='long list'><o:p>&nbsp;</o:p></span></p>(this is a sample of what it uses as line breaks. Take note of the <o:p> tag).
            <p class=MsoBodyText style='margin-right:5.75pt;line-height:115%'>
                A<span style='letter-spacing:.05pt'> </span>SAMPLE<span style='letter-spacing:.05pt'> </span>TEXT
            </p>
        </div>
        <div class=WordSection2>...same pattern in div 1</div>
        <div class=WordSection3>...same...</div>
   </body>
</html>

What I need from all of this is:

<div>...A SAMPLE TEXT</div>
<div>...same pattern in div 1</div>
<div>...same...</div>

What I have so far:

$dom = new DOMDocument;
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$body = $xpath->query('//html/body');
$nodes = $body->item(0)->getElementsByTagName('*');
foreach ($nodes as $node) {
    if($node->tagName=='script') $node->parentNode->removeChild($node);
    if($node->tagName=='a') continue;
    $attrs = $xpath->query('@*', $node);
    foreach($attrs as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($body->item(0)));

It gives me:

<body lang="EN-US" link="blue" vlink="purple" style="tab-interval:36.0pt">
    <div>
        <h1>Pros and Cons of a Website</h1>
        <p><p> </p></p>
        <p>A SAMPLE TEXT</p>
    </div>
    <div>...same pattern in div 1</div>
    <div>...same...</div>
</body>

which I'm good with, but I want the body tag out. I also want the h1 and its content out too, but when I say:

if($node->tagName=='script' || $node->tagName=='h1') $node->parentNode->removeChild($node);

something weird happens:

<p><p> </p></p> becomes <p class="MsoBodyText" ...all that very long stuff I was trying to remove in the first place><p> </p></p>

I've come across some very good answers like:

How to get innerHTML of DOMNode? (I don't know how to properly implement Haim Evgi's answer, or Keyacom's.) Marco Marsala's answer is the closest I got, but the divs all kept their classes.

Asked Jan 18 at 8:57 (edited Jan 18 at 9:04) by Chimdi

You can extract the text from the div element by accessing the textContent and replace all multiple whitespaces with just one to clean it up. Demo: 3v4l./VpIEq – Markus Zeller, Jan 18 at 14:09
To avoid funkiness when removing or altering nodes, iterate backwards over the node list: for ($i = $nodes->length - 1; $i >= 0; $i--) { ... – You Old Fool, Jan 19 at 2:37

3 Answers

getElementsByTagName() returns a live list, so removing h1 shifts the remaining entries of $nodes and causes <p class="MsoBodyText"> to be skipped in the next iteration. To avoid this, replace foreach with a for loop and decrement the current index whenever a node is removed.

$dom = new DOMDocument;
@$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

$bodyNode = $xpath->query('//html/body')->item(0);
$nodes = $bodyNode->getElementsByTagName('*');

for ($i = 0; $i < $nodes->count(); $i++) {
    $node = $nodes->item($i);
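    if ($node->tagName == 'script' || $node->tagName == 'h1') {
        $node->parentNode->removeChild($node);
        $i--; // the live list shifted left, so revisit this index
        continue; // the removed node needs no further processing
    }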
    if ($node->tagName == 'a') {
        continue;
    }
    $attrs = $xpath->query('@*', $node);
    foreach ($attrs as $attr) {
        $attr->parentNode->removeAttribute($attr->nodeName);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($bodyNode)) . PHP_EOL;
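As an alternative to adjusting the index by hand, the live list can be copied into a plain array before the loop. The following is only a minimal sketch of that idea, not part of the original answer; the sample markup in $html is a made-up stand-in for the Word export.

```php
<?php
// Hypothetical sample markup, not the asker's real Word export.
$html = '<div><h1>Title</h1><p class="MsoBodyText" align="left">A</p><p class="x">B</p></div>';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// getElementsByTagName() returns a live list; copying it into an array
// means that removing nodes cannot shift the positions being iterated.
$snapshot = iterator_to_array($dom->getElementsByTagName('*'));

foreach ($snapshot as $node) {
    if ($node->tagName === 'h1' || $node->tagName === 'script') {
        $node->parentNode->removeChild($node);
        continue; // nothing more to do for a removed node
    }
    // The attribute map is live as well, so keep removing item(0) until it is empty.
    while ($node->attributes->length > 0) {
        $node->removeAttribute($node->attributes->item(0)->nodeName);
    }
}

$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
```

Because the array is a snapshot, a plain foreach is safe here and no index bookkeeping is needed.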

Then, the saveHTML() function can be invoked for each child node, resulting in a combined output that omits the parent body tag.

$inner = [];
foreach ($bodyNode->childNodes as $node) {
    $inner []= trim($bodyNode->ownerDocument->saveHTML($node));
}
echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;

As an alternative, extract the text alone and recreate the wrapping tag.

$inner = [];
foreach ($bodyNode->childNodes as $node) {
    $text = trim($node->textContent);
    if ($node->nodeType != XML_ELEMENT_NODE) {
        $inner []= $text;
        continue;
    }
    $inner []= sprintf('<%s>%s</%s>',
        $node->tagName, $text, $node->tagName);
}
echo implode(PHP_EOL, array_filter($inner)) . PHP_EOL;

I really appreciate that some people actually took the time to read through the very long question and provide solutions. I had worked out a solution myself in the meantime, because I feared I had asked a bad or tedious question that nobody would have time to answer. When I came to post it, I noticed that some devs had already answered, but I'm posting it anyway.

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect the invalid HTML tag errors internally instead of emitting warnings
$dom->loadHTML($filecontent, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
libxml_use_internal_errors(false);
$html = $dom->getElementsByTagName("html");
$nodes = $html->item(0)->getElementsByTagName('*');
$innerHTML = '';
for ($i = $nodes->length; --$i >= 0; ) { // iterate backwards so removals don't shift unvisited nodes
    $node = $nodes->item($i);
    if ($node->tagName == 'script' || $node->tagName == 'h1') {
        $node->parentNode->removeChild($node);
        continue; // the removed node needs no further processing
    }
    if ($node->tagName == 'a') continue;
    if ($node->tagName == 'body') {
        // body's descendants were visited earlier in this reverse loop, so they are already cleaned
        foreach ($node->childNodes as $chnode) {
            $innerHTML .= $dom->saveHTML($chnode);
        }
    }
    $attributes = $node->attributes;
    while ($attributes->length) {
        $node->removeAttribute($attributes->item(0)->name);
    }
}
echo str_ireplace(['<span>', '</span>'], '', $innerHTML);
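The str_ireplace() call only matches the spans once every attribute has been stripped. A DOM-based way to drop a tag while keeping its contents is sketched below, assuming it is acceptable to modify the tree in place; unwrapElement() is a hypothetical helper name and the sample markup is made up for illustration.

```php
<?php
// Hypothetical sample markup modelled on the question's second paragraph.
$html = '<div><p>A<span style="letter-spacing:.05pt"> </span>SAMPLE<span style="letter-spacing:.05pt"> </span>TEXT</p></div>';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

/**
 * Replace an element with its own child nodes (hypothetical helper).
 */
function unwrapElement(DOMElement $element): void
{
    $parent = $element->parentNode;
    // Each insertBefore() moves the first child out, so loop until none are left.
    while ($element->firstChild !== null) {
        $parent->insertBefore($element->firstChild, $element);
    }
    $parent->removeChild($element);
}

// Snapshot the live list first, because unwrapping removes spans from it.
foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
    unwrapElement($span);
}

$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
```

This works on spans that still carry attributes, so it does not depend on the attribute-stripping step having run first.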

If you have a recent PHP version (8+) at hand, you can move all of the body's child nodes into a document fragment and call saveHTML() on it:

$element = $body->item(0); # the body element itself from xpath result

$fragment = $dom->createDocumentFragment(); 
$fragment->append(...$element->childNodes);

echo str_ireplace(['<span>', '</span>'], '', $dom->saveHTML($fragment));

This moves the child nodes into the fragment, so it only solves the inner-HTML problem and can be applied only once. Where you place it in the script therefore matters.

It also suggests that it is often better to collect the elements you want to export by appending them to a fragment than to remove the unwanted ones from the original document.
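The collect-instead-of-remove idea could look roughly like the sketch below. It is an illustration under assumptions, not the answerer's code: the sample markup is invented, and the choice to copy only the text of direct p children of each div is an assumption about what should be kept.

```php
<?php
// Hypothetical sample markup; class names mimic the question's Word export.
$html = '<html><body lang="EN-US"><div class="WordSection1"><h1>Title</h1>'
      . '<p class="MsoBodyText">A SAMPLE TEXT</p></div>'
      . '<div class="WordSection2"><p>More</p></div></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Build the output from fresh, attribute-free elements instead of
// deleting unwanted nodes from the parsed document.
$fragment = $dom->createDocumentFragment();

foreach ($xpath->query('//body/div') as $div) {
    $cleanDiv = $dom->createElement('div');
    // Assumption: only direct <p> children are wanted; h1 and scripts are never copied.
    foreach ($xpath->query('./p', $div) as $p) {
        $cleanP = $dom->createElement('p');
        $cleanP->appendChild($dom->createTextNode(trim($p->textContent)));
        $cleanDiv->appendChild($cleanP);
    }
    $fragment->appendChild($cleanDiv);
}

$result = $dom->saveHTML($fragment);
echo $result, PHP_EOL;
```

Since the source document is only read, the loop needs no special handling for live lists, and the body tag never appears in the output.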
