
c# - Receiving wrong output when using, e.g. &center - Stack Overflow


I'm trying to prevent XSS with HtmlSanitizer in my .NET web app project, e.g.:

var sanitiser = new HtmlSanitizer();
var result = sanitiser.Sanitize(rawText);

When the body or query contains certain words, the sanitizer returns the wrong output. For example:

  • sanitiser.Sanitize("&pounds=10") Output: £s=10

  • sanitiser.Sanitize("&centerId=2") Output: ¢erId=2

I'd expect sanitiser.Sanitize("&centerId=2") to output &centerId=2.

The output is wrong for anything starting with &center or &cent.

How can I resolve this?

Example:


Asked Feb 8 at 8:31 by rahman; edited Feb 8 at 8:49 by DarkBee.
  • I don't know why it replaces &cent in &center, since it's missing the ; needed to be an HTML entity. But the stranger thing is: why do you sanitize such text? Is that your untrusted input? Also, you can't always expect the sanitizer's output to match its input. That's the whole point of sanitizing HTML in the first place. – Paweł Łukasik Commented Feb 8 at 10:06
  • I sanitize the query and body of every request; in an OWASP penetration test, every input value is untrusted. @PawełŁukasik – rahman Commented Feb 8 at 10:15
  • But is &centerId=2 written anywhere on the page in this form, so that you need to HTML-sanitize it? – Paweł Łukasik Commented 2 days ago
  • No, it's in the request query, like localhost:4050/api/getlist?person=1&centerId=2, so I sanitize the request to prevent XSS attacks. – rahman Commented 2 days ago
  • But for XSS to happen, that content needs to be sent back to the user. If it isn't, there's no XSS. There might be other attacks, but you would guard against those in a different way (not with HtmlSanitizer). – Paweł Łukasik Commented 2 days ago

1 Answer


It's because the input is being parsed as HTML, and &cent and &pound are validly consumed as legacy HTML entities. I believe this is the intended default behavior of HtmlSanitizer's parser, AngleSharp. Anything you pass in to sanitiser.Sanitize will get parsed as HTML and have its HTML entities consumed.

Before HtmlSanitizer does any sanitization, it parses the input as an HTML document using AngleSharp:

/// <summary>
/// Sanitizes the specified HTML body fragment. If a document is given, only the body part will be returned.
/// </summary>
/// <param name="html">The HTML body fragment to sanitize.</param>
/// <param name="baseUrl">The base URL relative URLs are resolved against. No resolution if empty.</param>
/// <returns>The sanitized HTML document.</returns>
public IHtmlDocument SanitizeDom(string html, string baseUrl = "")
{
    var parser = HtmlParserFactory();
    var dom = parser.ParseDocument("<!doctype html><html><body>" + html);

    if (dom.Body != null)
        DoSanitize(dom, dom.Body, baseUrl);

    return dom;
}

When var dom = parser.ParseDocument(...) is called, this is the point when your string gets transformed from &center to ¢er. If you step through the code in a debugger and execute dom.Body.ChildNodes.ToHtml(), you can see that the string is already transformed before the call to DoSanitize happens.
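The parse step can also be reproduced in isolation, without HtmlSanitizer. The following is a minimal sketch, assuming the AngleSharp NuGet package is referenced; EntityParsingDemo and ParseAsBodyText are hypothetical names used only for illustration, and the wrapper string mirrors the one in SanitizeDom above.

```csharp
using AngleSharp.Html.Parser;

public static class EntityParsingDemo
{
    // Hypothetical helper: parses the input the same way SanitizeDom does
    // and returns the text content of the resulting <body>.
    public static string ParseAsBodyText(string input)
    {
        var parser = new HtmlParser();

        // Same wrapper that HtmlSanitizer puts in front of the fragment.
        var dom = parser.ParseDocument("<!doctype html><html><body>" + input);

        // By this point the legacy entities have already been consumed,
        // with no sanitization involved.
        return dom.Body?.TextContent ?? string.Empty;
    }
}
```

Calling EntityParsingDemo.ParseAsBodyText("&centerId=2") should give ¢erId=2, matching the output reported in the question.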

We can see this is also true if we make an HTML snippet with only &center or &pounds as the HTML content - this is just how HTML gets parsed:

&pounds
&center

According to this answer on "Why do HTML entity names with dec < 255 not require semicolon?", HTML parsers accept these entities (cent, pound) without a trailing semicolon because they belong to the legacy set whose character codes are below 256.

Apparently, AngleSharp offers an option IsNotConsumingCharacterReferences that could help us here (discussed here), but that option isn't exposed to us through HtmlSanitizer's API. If it were, you would instead get the output &amp;pounds=10 or &amp;centerId=2, which is still not your expected output. I don't think HtmlSanitizer will work for you in the way you expect it to here.
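As the comments point out, XSS only arises when the value is sent back to the user, so one alternative is to leave the query string untouched on the way in and HTML-encode values at the point of output. The following is a minimal sketch using only the .NET base class library; QueryEchoExample and EncodeParamForHtml are hypothetical names, and whether this fits your app depends on where (and if) the value is rendered.

```csharp
using System;
using System.Net;
using System.Web;

public static class QueryEchoExample
{
    // Hypothetical helper: reads one query parameter and returns it
    // HTML-encoded, ready to be written into an HTML page.
    public static string EncodeParamForHtml(string query, string name)
    {
        // ParseQueryString splits on '&' and '=' and URL-decodes the values.
        // It does not interpret HTML entities, so "&centerId" stays a
        // parameter separator followed by the name "centerId".
        var values = HttpUtility.ParseQueryString(query);
        string raw = values[name] ?? string.Empty;

        // Encode only when the value is about to be placed in HTML output.
        return WebUtility.HtmlEncode(raw);
    }
}
```

With the query from the comments, EncodeParamForHtml("?person=1&centerId=2", "centerId") returns 2, while a value such as <b>hi</b> comes back as &lt;b&gt;hi&lt;/b&gt; instead of being rendered as markup.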
