I want to remove HTML tags from a given string using JavaScript. I looked into the existing approaches, but each of them has an unsolved problem.
Current solutions
(1) Using JavaScript: create a temporary div element and read its text
function remove_tags(html)
{
var tmp = document.createElement("DIV");
tmp.innerHTML = html;
return tmp.textContent||tmp.innerText;
}
(2) Using regex
function remove_tags(html)
{
return html.replace(/<(?:.|\n)*?>/gm, '');
}
(3) Using jQuery
function remove_tags(html)
{
return jQuery(html).text();
}
These three solutions work correctly, but if the string is like this
<div> hello <hi all !> </div>
the stripped string is
hello
But I need to remove only the HTML tags, so the result should be hello <hi all !>
Edited: The background is that I want to remove all HTML tags from user input in a particular text area, but I still want to allow users to enter text such as <hi all>. The current approaches remove any content enclosed in <>.
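To make the problem concrete, here is a minimal sketch that runs approach (2) on the sample string. The function name removeTagsWithRegex is only a label chosen for this sketch; the regular expression is the one from approach (2).

```javascript
// Approach (2) from the question, renamed for this sketch.
function removeTagsWithRegex(html) {
  // Removes everything between any "<" and the next ">".
  return html.replace(/<(?:.|\n)*?>/gm, "");
}

var input = "<div> hello <hi all !> </div>";
var output = removeTagsWithRegex(input);

// The pseudo-tag <hi all !> is removed together with the real <div> tags,
// so only "hello" (plus surrounding whitespace) is left.
console.log(JSON.stringify(output));
```

The regex cannot tell a real tag from user text that merely looks like one, which is exactly the behaviour I want to avoid.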
- 4 If you want special parsing rules for invalid HTML, you will need to write a parser. Note that the last jQuery version is no different to the first, and a regular expression will not do the job for anything other than trivial input. – RobG Commented Jun 18, 2013 at 8:58
- 2 In addition to RobG's comment: Maybe it would help if you'd explain the background, so that we can suggest better solutions. Why are you using JavaScript for this? Where is the invalid HTML coming from? – RoToRa Commented Jun 18, 2013 at 9:07
- @RobG: I disagree, in this particular case. I think I have a fairly robust solution below, I'd appreciate your input. – Andy E Commented Jun 18, 2013 at 10:39
- @chacka Regarding your edit: You shouldn't use JavaScript for this. JavaScript is easily circumvented, and removing dangerous HTML is important. Do it server-side, for example using a markup library, just as Stack Overflow does on this site. They will remove and/or escape any problematic HTML. – RoToRa Commented Jun 18, 2013 at 11:01
- @RoToRa: Stack Overflow also has a live preview that is rendered using JavaScript. I agree, though, and common sense says to sanitize at the server before storing in the database or outputting to the page. – Andy E Commented Jun 18, 2013 at 11:04
6 Answers
Using a regex might not be a problem if you consider a different approach. For instance, looking for all tags, and then checking to see if the tag name matches a list of defined, valid HTML tag names:
var protos = document.body.constructor === window.HTMLBodyElement,
validHTMLTags =/^(?:a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdi|bdo|bgsound|big|blink|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|data|datalist|dd|del|details|dfn|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6|head|header|hgroup|hr|html|i|iframe|img|input|ins|isindex|kbd|keygen|label|legend|li|link|listing|main|map|mark|marquee|menu|menuitem|meta|meter|nav|nobr|noframes|noscript|object|ol|optgroup|option|output|p|param|plaintext|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|spacer|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video|wbr|xmp)$/i;
function sanitize(txt) {
var // This regex normalises anything between quotes
normaliseQuotes = /=(["'])(?=[^\1]*[<>])[^\1]*\1/g,
normaliseFn = function ($0, q, sym) {
return $0.replace(/</g, '&lt;').replace(/>/g, '&gt;');
},
replaceInvalid = function ($0, tag, off, txt) {
var
// Is it a valid tag?
invalidTag = protos &&
document.createElement(tag) instanceof HTMLUnknownElement
|| !validHTMLTags.test(tag),
// Is the tag complete?
isComplete = txt.slice(off+1).search(/^[^<]+>/) > -1;
return invalidTag || !isComplete ? '&lt;' + tag : $0;
};
txt = txt.replace(normaliseQuotes, normaliseFn)
.replace(/<(\w+)/g, replaceInvalid);
var tmp = document.createElement("DIV");
tmp.innerHTML = txt;
return "textContent" in tmp ? tmp.textContent : tmp.innerHTML;
}
Working Demo: http://jsfiddle.net/m9vZg/3/
This works because browsers parse '>' as text if it isn't part of a matching '<' opening tag. It doesn't suffer the same problems as trying to parse HTML tags using a regular expression, because you're only looking for the opening delimiter and the tag name, everything else is irrelevant.
It's also future-proof: the WebIDL specification tells vendors how to implement prototypes for HTML elements, so we try to create an HTML element from the current matching tag. If the element is an instance of HTMLUnknownElement, we know that it's not a valid HTML tag. The validHTMLTags regular expression defines a list of HTML tags for older browsers, such as IE 6 and 7, that do not implement these prototypes.
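As a rough, DOM-free illustration of the first step only (escaping the opening delimiter of unrecognised tag names), the following sketch may help. The names knownTags and escapeUnknownTags, and the shortened tag list, are assumptions made for this example and are not part of the code above; the real answer additionally checks HTMLUnknownElement and tag completeness before handing the result to the browser's parser.

```javascript
// Hypothetical, deliberately shortened allowlist used only for this sketch.
var knownTags = /^(?:a|b|div|em|i|p|span|strong)$/i;

// Escape the "<" of every tag-like token whose name is not in the allowlist.
// Recognised tags are left untouched so a real HTML parser can strip them later.
function escapeUnknownTags(txt) {
  return txt.replace(/<(\w+)/g, function (match, tag) {
    return knownTags.test(tag) ? match : "&lt;" + tag;
  });
}

var escaped = escapeUnknownTags("<div> hello <hi all !> </div>");
// escaped is now: "<div> hello &lt;hi all !> </div>"
console.log(escaped);
```

Once the unknown "<" has been escaped, assigning the string to innerHTML and reading textContent would leave the <hi all !> text in place while the real div is dropped.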
If you want to keep invalid markup untouched, regular expressions are your best bet. Something like this might work:
text = html.replace(/<\/?(span|div|img|p...)\b[^<>]*>/g, "")
Expand (span|div|img|p...) into a list of all tags (or only those you want to remove). NB: the list must be sorted by length, longer tags first!
This may produce incorrect results in some edge cases (such as attributes containing <> characters), but the only real alternative would be to write a complete HTML parser yourself. That would not be extremely complicated, but it might be overkill here. Let us know.
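A minimal sketch of how the pattern could be built from a list of tag names follows. The helper name makeTagStripper and the four-tag list are assumptions for illustration; in practice you would pass the full list of tags you want removed.

```javascript
// Hypothetical helper: returns a function that strips only the given tags.
function makeTagStripper(tagNames) {
  // Sort longer names first, following the note above.
  var sorted = tagNames.slice().sort(function (a, b) {
    return b.length - a.length;
  });
  // Matches an opening or closing tag whose name is one of the listed names.
  var pattern = new RegExp("<\\/?(?:" + sorted.join("|") + ")\\b[^<>]*>", "gi");
  return function (html) {
    return html.replace(pattern, "");
  };
}

var stripTags = makeTagStripper(["div", "span", "img", "p"]);
var result = stripTags("<div> hello <hi all !> </div>");
// result is " hello <hi all !> ": the div tags are gone, the pseudo-tag stays.
console.log(JSON.stringify(result));
```

Because only listed names are matched, anything else between angle brackets is left as the user typed it.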
var StrippedString = OriginalString.replace(/(<([^>]+)>)/ig,"");
Here is my solution:
function removeTags(){
var txt = document.getElementById('myString').value;
var rex = /(<([^>]+)>)/ig;
alert(txt.replace(rex , ""));
}
I use a regular expression to prevent HTML tags from being entered in my textarea.
Example:
<form>
<textarea class="box"></textarea>
<button>Submit</button>
</form>
<script>
$(".box").focusout( function(e) {
var reg =/<(.|\n)*?>/g;
if (reg.test($('.box').val()) == true) {
alert('HTML Tag are not allowed');
}
e.preventDefault();
});
</script>
<script type="text/javascript">
function removeHTMLTags() {
var str="<html><p>I want to remove HTML tags</p></html>";
alert(str.replace(/<[^>]+>/g, ''));
}</script>