I'm trying to prevent XSS requests with HtmlSanitizer in my .NET web app project, e.g.:
var sanitiser = new HtmlSanitizer();
var result = sanitiser.Sanitize(rawText);
When the body or query contains certain words, the sanitizer returns the wrong result. For example:
sanitiser.Sanitize("&pounds=10")
Output: £s=10
sanitiser.Sanitize("&centerId=2")
Output: ¢erId=2
I'm assuming that sanitiser.Sanitize("&centerId=2") should output &centerId=2.
The output is wrong for &center or &cent.
How can I resolve this?
Asked Feb 8 at 8:31 by rahman, edited Feb 8 at 8:49 by DarkBee
1 Answer
It's because the input is being parsed as HTML, and &pound and &cent get validly consumed as legacy HTML entities. I believe that this is the intended default behavior of HtmlSanitizer's parser, AngleSharp. Anything you pass in to sanitiser.Sanitize will get parsed as HTML and have its HTML entities consumed.
Before HtmlSanitizer does any sanitization, it parses the input as an HTML document using AngleSharp:
/// <summary>
/// Sanitizes the specified HTML body fragment. If a document is given, only the body part will be returned.
/// </summary>
/// <param name="html">The HTML body fragment to sanitize.</param>
/// <param name="baseUrl">The base URL relative URLs are resolved against. No resolution if empty.</param>
/// <returns>The sanitized HTML document.</returns>
public IHtmlDocument SanitizeDom(string html, string baseUrl = "")
{
    var parser = HtmlParserFactory();
    var dom = parser.ParseDocument("<!doctype html><html><body>" + html);
    if (dom.Body != null)
        DoSanitize(dom, dom.Body, baseUrl);
    return dom;
}
When var dom = parser.ParseDocument(...) is called, your string gets transformed from &center to ¢er. If you step through the code in a debugger and execute dom.Body.ChildNodes.ToHtml(), you can see that the string is already transformed before the call to DoSanitize happens.
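To reproduce this without a debugger, the parsing step can be run on its own. The following is a minimal sketch, assuming the AngleSharp NuGet package is referenced directly; it mirrors the ParseDocument call from SanitizeDom above and skips sanitization entirely. The variable names are illustrative only.

```csharp
using System;
using AngleSharp.Html.Parser;

// Parse the input the same way SanitizeDom does, but without sanitizing.
var parser = new HtmlParser();
var dom = parser.ParseDocument("<!doctype html><html><body>" + "&centerId=2");

// The body text has already had "&cent" decoded by the parser at this point,
// so the change is visible before any sanitizer logic would run.
var text = dom.Body?.TextContent;
Console.WriteLine(text);
```

If the printed text starts with ¢ rather than &cent, the transformation comes from the HTML parser and not from the sanitization rules.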
We can see this is also true if we make an HTML snippet with only &center or &pounds as the HTML content. This is just how HTML gets parsed:
&pounds renders as £s
&center renders as ¢er
According to this answer on "Why do HTML entity names with dec < 255 not require semicolon?", these HTML entities (cent, pound) are recognized without a semicolon because their decimal character code is less than 256.
Apparently, AngleSharp offers an option, IsNotConsumingCharacterReferences, that could help us here (discussed here), but that option isn't exposed to us through HtmlSanitizer's API. If it were, you would instead get the output &amp;pounds=10 or &amp;centerId=2, which is still not your expected output. I don't think HtmlSanitizer will work for you in the way you expect it to here.
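The effect of that option can be checked against AngleSharp directly, outside of HtmlSanitizer. This is a rough sketch, assuming the option is set through HtmlParserOptions in the AngleSharp version you have installed; it does not go through HtmlSanitizer, so nothing here is sanitized.

```csharp
using System;
using AngleSharp.Html.Parser;

// Assumption: IsNotConsumingCharacterReferences is a settable property
// on HtmlParserOptions in the installed AngleSharp version.
var options = new HtmlParserOptions { IsNotConsumingCharacterReferences = true };
var parser = new HtmlParser(options);
var dom = parser.ParseDocument("<!doctype html><html><body>" + "&centerId=2");

// TextContent is the raw text of the body; InnerHtml is the serialized
// markup, where a literal ampersand is written out in escaped form.
var text = dom.Body?.TextContent;
var html = dom.Body?.InnerHtml;
Console.WriteLine(text);
Console.WriteLine(html);
```

With the option enabled, the text should keep its literal ampersand, and the serialized markup should carry the escaped form described above, which is why the result still differs from the raw input.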
[...] &cent in &center as it's missing ; to be an HTML entity. But the strange thing is why you sanitize such text? Is that your untrusted input? Also, you can't always expect the same output as input from the sanitizer. That's the whole point of sanitizing HTML in the first place. – Paweł Łukasik, commented Feb 8 at 10:06
[...] &centerId=2 in this form written anywhere on the page so that you need to HTML sanitize it? – Paweł Łukasik, commented 2 days ago
[...] HtmlSanitizer) – Paweł Łukasik, commented 2 days ago