(clarification: this is an old question that has been tweaked for admin purposes)
There have been a fair number of questions on this site about parsing HTML from textareas and whatnot, or about not allowing HTML in textboxes. My question is similar: how would I detect whether HTML is present in a textbox? Would I need to run it through a regular expression covering all known HTML tags? Is there a current library for .NET that can detect when HTML is inserted into a textarea?
Edit: Similarly, is there a JavaScript library that does this?
Edit #2: Because of the way the web app works (it validates textarea text on asynchronous postback using ASP.NET's Validate method), it bombs before it can get back to the code-behind to use HTML.Encode. My concern was finding another way of handling HTML in those instances.
Asked Sep 28, 2010 at 6:03 by a.muppet; edited Sep 28, 2010 at 6:06 by Marc Gravell.
6 Answers
Not really an answer, but why do you need it at all? You need to sanitize HTML input only if you are going to output it without modifications, i.e. if you actually want to allow your users to use HTML. And if you want that, you do not have to "detect" HTML, you just need to make sure that you handle it safely. Jeff Atwood has a good routine for this.
If you want to prevent HTML output altogether, you can take whatever the user inputs, without any checks. Just take care to HtmlEncode it, and store it that way. Then your output will not contain any "real" HTML from what the user wrote.
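The encode-instead-of-detect idea above can be sketched in JavaScript, for the client-side half of the question. This is only an illustration: `escapeHtml` is a hypothetical helper name, not a built-in or a library function, and on the server the .NET HtmlEncode call the answer mentions would do the same job.

```javascript
// Hypothetical helper (not a built-in): replaces the characters that are
// significant in HTML with character references, so user text is shown
// literally instead of being interpreted as markup.
function escapeHtml(text) {
  const entities = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  // A single pass over the input, so an "&" produced by one replacement
  // is never encoded a second time.
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}
```

With this, `escapeHtml("<b>hi</b>")` yields `&lt;b&gt;hi&lt;/b&gt;`, which a browser displays as the literal text the user typed.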
Yes, a regular expression is probably the easiest way to do that.
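One regex would be: <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> (run it with case-insensitive matching turned on, since the pattern itself only lists uppercase letters).
You can run that in both ASP.NET and JavaScript. The .NET Framework class you use is System.Text.RegularExpressions.Regex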
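As a rough sketch of the JavaScript side, the same pattern can be written as a regex literal. The `i` and `s` flags and the helper name `containsPairedTag` are assumptions added here for illustration; the answer only gives the bare pattern.

```javascript
// The pattern from the answer. The "i" flag lets lowercase tag names match,
// and the "s" flag lets "." match newlines inside the element body.
// No "g" flag, so test() keeps no state between calls.
const pairedTag = /<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>/is;

// Hypothetical helper: true when an opening tag and its matching
// closing tag are both present in the text.
function containsPairedTag(text) {
  return pairedTag.test(text);
}
```

Note that this only reports elements with both an opening and a closing tag, so a lone `<br>` or an unclosed `<script>` is not flagged.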
Hope that helps!
bool containsHtml = Regex.IsMatch(MyTextbox.Text, @"<(.|\n)*?>");
As far as I know, you cannot paste HTML into a TextArea and have it work automatically, at least in .NET 2.0. ASP.NET's request validation automatically rejects input that looks like HTML. To accept it, you need to set ValidateRequest to false in the page directive (if I remember correctly).
If you want to allow HTML tags and want to pick from a possible list of tags, I suggest you look up 'Markdown' and this Jeff Atwood post.
+1 Sunny. “detecting” HTML is a fool's errand. You need to escape it on output, and as long as you're doing that you're safe. If you're not escaping it, sanitisation hacks aren't going to make you secure, they're just going to obfuscate the problem.
"Due to the way that the web app works (It validates textarea text on asynchronous postback using the Validate method of ASP.NET)"
Yeah, you'll want to stop doing that. ASP.NET's “request validation” is utterly bogus and needs to be turned off if you want to have any chance of processing uploaded content consistently.
Well, in HTML you can't do a lot without a less than symbol "<".
So, I would look for a less than symbol followed by some characters followed by a greater than symbol. If you find that, you can pretty much be assured that it is HTML.
I don't think you have to look for specific tags, as HTML will ignore invalid tags as part of the specification and it would still be considered HTML.
EDIT: Oops! Almost forgot... the ampersand character! If you see one in the text, you MIGHT have HTML, since it is used to specify special characters (like &copy; for ©). This can be dangerous because the user could specify &lt; for <, so it might turn into HTML later...
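A minimal sketch of that heuristic in JavaScript might look like the following. The names `tagLike`, `entityLike` and `looksLikeHtml` are made up for this example, and the exact patterns are one possible reading of "a less than symbol, some characters, a greater than symbol" plus the ampersand check.

```javascript
// "<" followed by some characters and then ">" (any tag-like run).
const tagLike = /<[^<>]+>/;

// Character references such as &copy; or &lt;, plus numeric forms
// like &#169; and &#xA9;.
const entityLike = /&(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);/i;

// Hypothetical helper: true when the text contains something that
// could be (or could later turn into) HTML.
function looksLikeHtml(text) {
  return tagLike.test(text) || entityLike.test(text);
}
```

This is a heuristic, not a parser: ordinary text such as "1 < 2 and 3 > 2" is also flagged, because it has a "<" followed later by a ">".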