I want to strip the tags from an HTML string but preserve its line breaks.
I want the same behaviour as copying text in a browser and pasting it into Notepad.
For example, code that converts:
<div>x1</div><div>x2</div>
to x1\nx2
<p>x1</p><p>x2</p>
to x1\nx2
<b>x1</b><i>x2</i>
to x1x2
x1<br>x2
to x1\nx2
Simply removing all tags (/<.*?>/g) does not work.
Creating a dummy <div>, setting its innerHTML
and reading its textContent
also removes the line breaks.
Any help?
Asked Jul 27, 2011 at 16:02 by Taha Jahangir, edited Apr 14, 2012 at 14:35
4 Answers
How does this work for you? It replaces every occurrence of <br>, </div>, and </p> with a \n, and then strips the remaining tags. It's goofy, but it's at least a start.
var fixed = text_to_fix.replace(/<(?:br|\/div|\/p)>/g, "\n")
  .replace(/<.*?>/g, "");
This doesn't work for all HTML, however. Just the tags you mentioned.
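Building on the same idea, here is a minimal sketch that also accepts <br/>, <br /> and upper-case tags, and drops the trailing newline left behind by the last closing block tag. The name stripTagsKeepBreaks is made up for illustration, and the sketch still only knows about the tags from the question.

```javascript
// Hypothetical helper: a sketch of the regex approach above, not a general HTML-to-text converter.
function stripTagsKeepBreaks(html) {
  return html
    // <br>, <br/>, <br /> and <br class="x"> all become a newline
    .replace(/<br\b[^>]*>/gi, "\n")
    // a closing </div> or </p> ends a line
    .replace(/<\/(?:div|p)\s*>/gi, "\n")
    // drop every remaining tag
    .replace(/<[^>]*>/g, "")
    // remove all trailing newlines, e.g. the one left by the last </div>
    .replace(/\n+$/, "");
}

// The four examples from the question
console.log(JSON.stringify(stripTagsKeepBreaks("<div>x1</div><div>x2</div>"))); // "x1\nx2"
console.log(JSON.stringify(stripTagsKeepBreaks("<p>x1</p><p>x2</p>")));         // "x1\nx2"
console.log(JSON.stringify(stripTagsKeepBreaks("<b>x1</b><i>x2</i>")));         // "x1x2"
console.log(JSON.stringify(stripTagsKeepBreaks("x1<br>x2")));                   // "x1\nx2"
```

Note that the last replace removes every trailing newline, so input that deliberately ends with <br> loses that break too.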
Try:
function strip_tags(str) {
  return str
    .replace(/(<(br[^>]*)>)/ig, '\n')
    .replace(/(<([^>]+)>)/ig, '');
}
var str = '<div>x1</div><div>x2</div><br>'+'<p>x1</p><p>x2</p>'+'<b>x1</b><i>x2</i>';
This will strip the tags and replace <br /> or <br> with new lines, but adding new lines for block elements would take more time to work out.
Here is a demo
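For reference, a small runnable sketch of what this function returns for the sample string. It repeats the function so the snippet is self-contained; only the console.log call is new.

```javascript
function strip_tags(str) {
  return str
    .replace(/(<(br[^>]*)>)/ig, '\n') // <br> and <br /> become newlines
    .replace(/(<([^>]+)>)/ig, '');    // every other tag is removed
}

var str = '<div>x1</div><div>x2</div><br>' + '<p>x1</p><p>x2</p>' + '<b>x1</b><i>x2</i>';

// Only the <br> produces a line break; the <div> and <p> contents run together.
console.log(JSON.stringify(strip_tags(str))); // "x1x2\nx1x2x1x2"
```

This shows the limitation mentioned above: block elements such as <div> and <p> are not separated by new lines.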
This is as far as I got before I got bored...
const strip_tags = (html) => {
  let tmp = document.createElement("div");
  // Turn each opening block-level tag into a newline before parsing.
  tmp.innerHTML = html
    .replace(/(<(br[^>]*)>)/ig, '\n')
    .replace(/(<(p[^>]*)>)/ig, '\n')
    .replace(/(<(div[^>]*)>)/ig, '\n')
    .replace(/(<(h[1-6][^>]*)>)/ig, '\n')
    .replace(/(<(li[^>]*)>)/ig, '\n')
    .replace(/(<(ul[^>]*)>)/ig, '\n')
    .replace(/(<(ol[^>]*)>)/ig, '\n')
    .replace(/(<(blockquote[^>]*)>)/ig, '\n')
    .replace(/(<(pre[^>]*)>)/ig, '\n')
    .replace(/(<(hr[^>]*)>)/ig, '\n')
    .replace(/(<(table[^>]*)>)/ig, '\n')
    .replace(/(<(tr[^>]*)>)/ig, '\n')
    .replace(/(<(td[^>]*)>)/ig, '\n')
    .replace(/(<(th[^>]*)>)/ig, '\n')
    .replace(/(<(caption[^>]*)>)/ig, '\n')
    .replace(/(<(dl[^>]*)>)/ig, '\n')
    .replace(/(<(dt[^>]*)>)/ig, '\n')
    .replace(/(<(dd[^>]*)>)/ig, '\n')
    .replace(/(<(address[^>]*)>)/ig, '\n')
    .replace(/(<(section[^>]*)>)/ig, '\n')
    .replace(/(<(article[^>]*)>)/ig, '\n')
    .replace(/(<(aside[^>]*)>)/ig, '\n');
  // The browser drops the remaining tags; the inserted newlines survive in textContent.
  return tmp.textContent || tmp.innerText || "";
}
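The long chain above can be collapsed into one pattern built from a list of tag names. The following is a rough sketch of that, with two deliberate differences that are my own assumptions rather than part of the answer: it adds \b so that "p" no longer also matches <param> or "li" matches <link>, and it strips the remaining tags with a regex instead of a dummy <div>, so it runs outside a browser but does not decode entities such as &amp;. The names BLOCK_TAGS, BLOCK_OPEN and stripTagsNoDom are hypothetical.

```javascript
// Tag names taken from the chain of replace calls above.
const BLOCK_TAGS = [
  "br", "p", "div", "h[1-6]", "li", "ul", "ol", "blockquote", "pre", "hr",
  "table", "tr", "td", "th", "caption", "dl", "dt", "dd",
  "address", "section", "article", "aside",
];

// One pattern for every opening block tag; \b requires the tag name to end there.
const BLOCK_OPEN = new RegExp("<(?:" + BLOCK_TAGS.join("|") + ")\\b[^>]*>", "gi");

function stripTagsNoDom(html) {
  return html
    .replace(BLOCK_OPEN, "\n") // opening block tags start a new line
    .replace(/<[^>]*>/g, "")   // remove all remaining tags
    .replace(/^\n+/, "");      // drop the newlines produced by leading block tags
}

console.log(JSON.stringify(stripTagsNoDom("<div>x1</div><div>x2</div>"))); // "x1\nx2"
console.log(JSON.stringify(stripTagsNoDom("<b>x1</b><i>x2</i>")));         // "x1x2"
```

Nested block elements still produce blank lines in the middle of the text, since every opening tag adds its own newline.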
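You can use this:
function stripTags(html) {
  // Convert line-breaking tags before stripping; a generic strip first would leave nothing to convert.
  return html.replace(/<br\s*\/?>/gi, '\n').replace(/<\/(?:div|p)>/gi, '\n').replace(/<[^>]+>/g, '');
}
The function replaces <br> tags and closing </div> and </p> tags with line breaks, then removes all remaining tags. For the examples in the question this gives the desired output, apart from a trailing line break after the last closing block tag.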
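A quick sketch of why the order of the replace calls matters in a chain like this: once a generic tag pattern has run, no <br> is left for a later pattern to match. The names stripFirst and stripLast are made up for the comparison.

```javascript
// Generic strip first: the <br> is already gone when the second replace runs.
const stripFirst = (html) =>
  html.replace(/<[^>]+>/g, "").replace(/<br>/g, "\n");

// Generic strip last: the <br> is converted before the other tags are removed.
const stripLast = (html) =>
  html.replace(/<br>/g, "\n").replace(/<[^>]+>/g, "");

console.log(JSON.stringify(stripFirst("x1<br>x2"))); // "x1x2"
console.log(JSON.stringify(stripLast("x1<br>x2")));  // "x1\nx2"
```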