sanitization - Strip tags with javascript and handle line breaks

I want to strip the tags from an HTML string but preserve its line breaks.

I want the same behaviour as copying text in a browser and pasting it into Notepad.

For example, code that converts:

<div>x1</div><div>x2</div> to x1\nx2
<p>x1</p><p>x2</p> to x1\nx2
<b>x1</b><i>x2</i> to x1x2
x1<br>x2 to x1\nx2

Removing all tags with /<.*?>/g does not work. Creating a dummy <div>, setting its innerHTML, and reading its textContent also removes the line breaks.

Any help?

asked Jul 27, 2011 at 16:02 by Taha Jahangir, edited Apr 14, 2012 at 14:35

4 Answers

How does this work for you? This replaces every occurrence of <br>, </div>, and </p> with a \n and then strips the remaining tags. It's goofy, but it's at least a start.

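// <br>, <br/>, <br />, </div> and </p> become newlines; every other tag is removed.
const fixed = text_to_fix.replace(/<(?:br\s*\/?|\/div|\/p)>/gi, "\n")
                         .replace(/<[^>]*>/g, "");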

This doesn't work for all HTML, however. Just the tags you mentioned.
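
A minimal sketch of how the same idea might be extended beyond the tags in the question, assuming a hand-picked list of block-level tag names (BLOCK_TAGS and htmlToText are hypothetical names, not part of the answer above). It turns <br> and the listed closing tags into \n, strips whatever is left, and trims the trailing break so the output matches the examples in the question. It is still regex-based, so it will not cope with comments, malformed markup, or a > inside an attribute value.

```javascript
// Hypothetical list of block-level tags; extend it as needed.
const BLOCK_TAGS = ["div", "p", "li", "h1", "h2", "h3", "h4", "h5", "h6", "tr", "blockquote", "pre"];

// Matches <br> (with or without attributes or a trailing slash)
// and the closing tag of any element listed in BLOCK_TAGS.
const BREAK_RE = new RegExp("<(?:br\\b[^>]*|/(?:" + BLOCK_TAGS.join("|") + ")\\s*)>", "gi");

function htmlToText(html) {
  return html
    .replace(BREAK_RE, "\n")  // line-breaking tags become newlines
    .replace(/<[^>]*>/g, "")  // every remaining tag is dropped
    .replace(/\n+$/, "");     // trim the breaks left by the last block
}

console.log(JSON.stringify(htmlToText("<div>x1</div><div>x2</div>"))); // "x1\nx2"
console.log(JSON.stringify(htmlToText("<b>x1</b><i>x2</i>")));         // "x1x2"
```

Only closing tags are converted, so a block contributes one break at its end rather than one on each side.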

Try:

function strip_tags(str){
    return str
             .replace(/(<(br[^>]*)>)/ig, '\n')
             .replace(/(<([^>]+)>)/ig,'');
}

var str = '<div>x1</div><div>x2</div><br>'+'<p>x1</p><p>x2</p>'+'<b>x1</b><i>x2</i>';

This will strip the tags and replace <br /> or <br> with new lines, but coming up with a solution that adds new lines for block elements would take quite some time.
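
To make the limitation concrete, this is roughly what the function returns for the sample string. The function is repeated so the snippet runs on its own, and sample is just a local name chosen here for that string.

```javascript
function strip_tags(str) {
  return str
    .replace(/(<(br[^>]*)>)/ig, '\n')  // <br> and <br /> become newlines
    .replace(/(<([^>]+)>)/ig, '');     // every other tag is removed
}

const sample = '<div>x1</div><div>x2</div><br>' + '<p>x1</p><p>x2</p>' + '<b>x1</b><i>x2</i>';

// Only the <br> yields a line break; the <div> and <p> blocks run together.
console.log(JSON.stringify(strip_tags(sample))); // "x1x2\nx1x2x1x2"
```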

This is as far as I got before I got bored...

const strip_tags = (html) => {
    // Tags that start a new line when rendered. The \b stops short names from
    // matching longer ones (for example "li" matching <link>).
    const lineBreakTags = /<(?:br|p|div|h[1-6]|li|ul|ol|blockquote|pre|hr|table|tr|td|th|caption|dl|dt|dd|address|section|article|aside)\b[^>]*>/ig;
    // Caution: assigning untrusted HTML to innerHTML can still run inline event
    // handlers (e.g. <img onerror=...>), even on a detached element.
    const tmp = document.createElement("div");
    tmp.innerHTML = html.replace(lineBreakTags, '\n');
    return tmp.textContent || tmp.innerText || "";
}
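
Because every listed opening tag becomes a \n, nested markup such as <ul><li> produces runs of blank lines and a leading break. One possible post-processing step is sketched below, assuming that collapsing consecutive breaks into one is acceptable for your text. tidyLineBreaks is a hypothetical helper name, not part of the answer above.

```javascript
function tidyLineBreaks(text) {
  return text
    .replace(/[ \t]*\n[ \t]*/g, "\n")  // drop spaces and tabs around each break
    .replace(/\n{2,}/g, "\n")          // collapse runs of breaks into one
    .replace(/^\n+|\n+$/g, "");        // trim leading and trailing breaks
}

console.log(JSON.stringify(tidyLineBreaks("\n\nx1\n\n x2\n"))); // "x1\nx2"
```

If blank lines between paragraphs should survive, as they do when pasting into Notepad, the second replacement could cap runs at two breaks instead of one.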

You can use this

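function stripTags(html) {
     return html
         .replace(/<br\s*\/?>/gi, '\n')  // <br> becomes a line break
         .replace(/<\/[^>]+>/g, '\n')    // closing tags become line breaks
         .replace(/<[^>]+>/g, '');       // remaining (opening) tags are removed
}

The order matters: if the generic /<[^>]+>/g pattern runs first, it removes every tag and the other two replacements never match. Written this way, the function replaces <br> and closing tags with line breaks and strips the opening tags. Note that every closing tag produces a break, so inline markup such as <b>x1</b><i>x2</i> becomes x1\nx2\n rather than x1x2; limit the closing-tag pattern to block-level elements if that matters.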
