Java: Best way to remove Javascript from HTML - Stack Overflow

What's the best library or approach for removing JavaScript from HTML that will be displayed?

For example, take:

<html><body><span onmousemove='doBadXss()'>test</span></body></html>

and leave:

<html><body><span>test</span></body></html>

I see the DeXSS project. But is that the best way to go?

Asked by mtyson on Nov 11, 2010 at 16:33; edited Nov 11, 2010 at 16:47.
  • Probably, the easiest way to do it is to use XSLT (write a stylesheet that copies the allowable elements and attributes), but that only works if your document is XHTML (unless XSLT has an HTML mode---I can't remember if there's one). – C. K. Young Commented Nov 11, 2010 at 16:38
  • That you wrote "IE" instead of "i.e." confused me to no end! – JasonFruit Commented Nov 11, 2010 at 16:45
  • @JasonFruit: lolz! i too got confused. – Rakesh Juyal Commented Nov 11, 2010 at 16:47
  • Possible duplicate of How to "Purify" HTML code to prevent XSS attacks in Java or JSP ? – BalusC Commented Nov 11, 2010 at 17:01

3 Answers

jsoup has a simple method, Jsoup.clean, for sanitizing HTML against a whitelist. See http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer

A whitelist is safer than the blacklist approach DeXSS uses. From the DeXSS page:

There are still a number of known XSS attacks that DeXSS does not yet detect.

A blacklist only disallows known unsafe constructions, while a whitelist only allows known safe constructions. So only a whitelist protects against constructions that are unsafe but not yet known.
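
To make the difference concrete, here is a minimal sketch in plain Java that filters one tag's attributes both ways. The class name AttributeFilterDemo and both attribute lists are made up for illustration; this is not how jsoup or DeXSS is implemented, only the general idea.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class AttributeFilterDemo {
    // Blacklist: the handlers someone thought of in advance (hypothetical list).
    static final Set<String> BLOCKED =
            new HashSet<>(Arrays.asList("onclick", "onload", "onmouseover"));

    // Whitelist: the only attributes considered harmless (hypothetical list).
    static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("title", "lang", "dir"));

    // Keeps every attribute that is not explicitly blocked.
    static Map<String, String> filterWithBlacklist(Map<String, String> attrs) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (!BLOCKED.contains(e.getKey().toLowerCase(Locale.ROOT))) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    // Keeps only the attributes that are explicitly allowed.
    static Map<String, String> filterWithWhitelist(Map<String, String> attrs) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (ALLOWED.contains(e.getKey().toLowerCase(Locale.ROOT))) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("title", "hello");
        attrs.put("onmousemove", "doBadXss()"); // not on the blacklist above

        // The blacklist never heard of onmousemove, so the handler slips through.
        System.out.println(filterWithBlacklist(attrs));
        // The whitelist drops it simply because it is not a known-safe attribute.
        System.out.println(filterWithWhitelist(attrs));
    }
}
```

The handler from the question, onmousemove, was left off the blacklist on purpose: the blacklist filter keeps it, while the whitelist filter discards it without ever needing to know it exists.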

The easiest way is not to accept such markup in the first place. It probably makes sense to allow only a few very simple tags in free-form fields and to disallow attributes entirely.

Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.

Similarly, an even easier approach is to offer a text-based syntax such as Markdown for editing. There are not many ways to exploit the SO edit area, for instance: it accepts Markdown syntax plus a limited list of tags without attributes.
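
One way to get "simple tags only, no attributes" is to escape the whole input and then re-enable a short list of bare tags. The sketch below assumes that approach; the class name SimpleMarkupDemo and the tag list are hypothetical, and this is not meant to describe what SO actually does.

```java
import java.util.regex.Pattern;

final class SimpleMarkupDemo {
    // Matches an escaped opening or closing tag that carries no attributes,
    // e.g. "&lt;b&gt;" or "&lt;/em&gt;". The tag list is a hypothetical choice.
    private static final Pattern ALLOWED_TAG = Pattern.compile(
            "&lt;(/?)(b|i|em|strong|p|code)&gt;", Pattern.CASE_INSENSITIVE);

    // Escapes the five characters that are significant in HTML text and attributes.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Escapes everything, then turns the allowed bare tags back into real tags.
    // A tag with any attribute no longer matches the pattern and stays escaped.
    static String render(String userInput) {
        String escaped = escapeHtml(userInput);
        return ALLOWED_TAG.matcher(escaped).replaceAll("<$1$2>");
    }

    public static void main(String[] args) {
        System.out.println(render(
                "<b>hi</b> <span onmousemove='doBadXss()'>test</span>"));
    }
}
```

With this sketch the bold tags survive, while the span and its handler are shown as literal text instead of being interpreted. It does not check that tags are balanced, so an unclosed tag can still upset the page layout.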

You could try dom4j (http://dom4j.sourceforge.net/dom4j-1.6.1/). It builds an in-memory tree of the document (as opposed to streaming it like SAX) and lets you easily traverse and manipulate that tree, removing attributes such as onmouseover (or entire elements such as <script>) before writing the result back out or streaming it somewhere. Depending on how wild your HTML is, you may need to clean it up first; JTidy (http://jtidy.sourceforge.net/) is good for that.

Obviously, doing all this at page render time adds some overhead.
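
The tree-walking idea can be sketched as follows. To keep the example free of external dependencies it uses the JDK's built-in org.w3c.dom API rather than dom4j, so the calls differ from dom4j's, and it assumes the input is already well-formed XHTML. The class name DomScriptStripper is made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomScriptStripper {

    // Parses well-formed XHTML, strips on* attributes and <script> elements,
    // and serializes the tree back to a string.
    static String strip(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so untrusted input cannot pull in
            // external entities. Input that carries a DOCTYPE will fail to parse.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            clean(doc.getDocumentElement());

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException
                | IOException | TransformerException e) {
            throw new IllegalArgumentException("Could not process input", e);
        }
    }

    private static void clean(Element element) {
        // Collect names first: the attribute map is live, so removing entries
        // while iterating over it would skip some of them.
        NamedNodeMap attrs = element.getAttributes();
        List<String> handlers = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            String name = attrs.item(i).getNodeName();
            if (name.toLowerCase(Locale.ROOT).startsWith("on")) {
                handlers.add(name);
            }
        }
        for (String name : handlers) {
            element.removeAttribute(name);
        }

        // Snapshot the children for the same reason, then recurse.
        NodeList children = element.getChildNodes();
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            snapshot.add(children.item(i));
        }
        for (Node child : snapshot) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element childElement = (Element) child;
            if ("script".equalsIgnoreCase(childElement.getTagName())) {
                element.removeChild(child);
            } else {
                clean(childElement);
            }
        }
    }
}
```

Passing the question's example through DomScriptStripper.strip should return the span without its onmousemove attribute. Note that removing on* attributes and script elements is itself a blacklist: an href holding a javascript: URL, for example, passes through untouched, which is the weakness the whitelist answer describes.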
