PHP: HTML Attribute Encoding / JavaScript Decoding

What's the proper way to encode untrusted data for HTML attribute context? For example:

<input type="hidden" value="<?php echo $data; ?>" />

I usually use htmlentities() or htmlspecialchars() to do this:

<input type="hidden" value="<?php echo htmlentities($data); ?>" />

However, I recently ran into an issue where this was breaking my application when the data I needed to pass was a URL which needed to be handed off to JavaScript to change the page location:

<input id="foo" type="hidden" value="foo?bar=1&amp;baz=2" />
<script>
    // ...
    window.location = document.getElementById('foo').value;
    // ...
</script>

In this case, foo is a C program, and it doesn't understand the encoded characters in the URL and segfaults.

I can simply grab the value in JavaScript and do something like value.replace('&amp;', '&'), but that seems kludgy, and only works for ampersands.

So, my question is: is there a better way to go about the encoding or decoding of data that gets injected into HTML attributes?

I have read all of OWASP's XSS Prevention Cheat Sheet, and it sounds to me like as long as I'm careful to quote my attributes, the only character I need to encode is the quote itself (") - in which case I could use something like str_replace('"', '&quot;', ...) - but I'm not sure I'm understanding it properly.

Asked May 1, 2012 at 20:36 by FtDRbwLXw6; edited May 4, 2012 at 12:48

Doesn't urlencode take care of that in PHP? There are a few code examples in the comments on the PHP manual page that show how to protect against XSS too: php.net/manual/en/function.urlencode.php – GillesC Commented May 1, 2012 at 20:43
@gillesc: urlencode() is for encoding URL parameters, not whole URLs, and does not encode for the HTML attribute context. There is a section in the manual that even talks about this - "Leave it as &, but simply encode your URLs using htmlentities() or htmlspecialchars()." – FtDRbwLXw6 Commented May 2, 2012 at 13:00
Are you sure about window.location = document.getElementById('foo');? I think it should be window.location = document.getElementById('foo').value; and that redirects to the right page (foo?bar=1&baz=2). – Okan Kocyigit Commented May 4, 2012 at 8:26
@ocanal: Thank you, I've corrected that, but this does not address the problem, because it will redirect to foo?bar=1&amp;baz=2. PHP is able to understand this, but foo is not a PHP script, and just crashes unless the URL is like foo?bar=1&baz=2. – FtDRbwLXw6 Commented May 4, 2012 at 12:51
The actual value of the input in your case is foo?bar=1&baz=2, as demonstrated here. Your script as posted won't result in a redirect to foo?bar=1&amp;baz=2 but to foo?bar=1&baz=2. – lanzz Commented Jul 10, 2012 at 21:18

5 Answers

Your current method of using htmlentities() or htmlspecialchars() is the right approach.

The example you provided is correct HTML:

<input id="foo" type="hidden" value="foo?bar=1&amp;baz=2" />

The ampersand in the value attribute does indeed need to be HTML encoded, otherwise your HTML is invalid. Most browsers would parse it correctly with an & in there, but that doesn't change the fact that it's invalid and you are correct to be encoding it.

Your problem lies not in the encoding of the value, which is good, but in the fact that you're using JavaScript code that doesn't decode it properly.

In fact, I'm surprised at this, because your JS code is accessing the DOM, and the DOM should be returning the decoded values.

I wrote a JSfiddle to prove this to myself: http://jsfiddle.net/qRd4Z/

Running this gives me an alert box with the decoded value, as I expected. Changing it to console.log also gives the result I expect. So I'm not sure why you're getting different results. Perhaps you're using a different browser? It might be worth specifying which one you're testing with. Or perhaps you've double-encoded the entities by mistake? Can you confirm that's not the case?

What's the proper way to encode untrusted data for HTML attribute context?

If you add double quotes around the attribute value, htmlspecialchars() is enough.

 <input id="foo" type="hidden" value="foo?bar=1&amp;baz=2" />

This is correct, and the browser will send foo?bar=1&baz=2 (with &amp; decoded to &) to the server. If the server isn't seeing foo?bar=1&baz=2, you must have encoded the value twice.

Getting the value in javascript should return foo?bar=1&baz=2 too (e.g. document.getElementById('foo').value must return foo?bar=1&baz=2).

View the source of the page using your browser and see the actual source of the input field.

If you are modifying the input field's value using Javascript, then the script must be double-encoding it.
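To make the double-encoding symptom concrete, here is a minimal sketch in plain JavaScript. Note that `escapeHtml` and `looksDoubleEncoded` are hypothetical helpers written for this illustration: the first is only a rough stand-in for PHP's htmlspecialchars() with ENT_QUOTES, and the second is a heuristic that would also fire on data that legitimately contains the literal text `&amp;`.

```javascript
// Hypothetical helper: a rough JavaScript stand-in for PHP's htmlspecialchars()
// with ENT_QUOTES, used here only to reproduce the double-encoding symptom.
function escapeHtml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#039;' };
  return String(text).replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Hypothetical heuristic: an "&amp;" immediately followed by another entity body
// (e.g. "&amp;amp;" or "&amp;quot;") suggests the value was escaped twice.
function looksDoubleEncoded(markup) {
  return /&amp;(?:amp|lt|gt|quot|apos|#0?39);/i.test(markup);
}

const url = 'foo?bar=1&baz=2';
const once = escapeHtml(url);   // what belongs in the attribute source
const twice = escapeHtml(once); // what a double-encoding bug produces

console.log(once);
console.log(twice);
console.log(looksDoubleEncoded(once), looksDoubleEncoded(twice));
```

If "View source" shows something like the second string in the value attribute, the browser will decode it only once and the script will see a literal &amp; in the URL.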

BTW your program shouldn't segfault because of wrong user input ;)

You can use the DOM to decode the value:

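function decodeHTMLSpecialChars(input){
  // Let the browser's HTML parser decode the entities inside a detached <div>.
  // Caution: innerHTML parses markup, so only pass strings you trust.
  var div = document.createElement('div');
  div.innerHTML = input;
  // Return the first child node's value (the decoded text when the input has no tags).
  return div.childNodes.length === 0 ? "" : div.childNodes[0].nodeValue;
}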

This will render the following string:

'http://someurl.com/foo?bar=1&amp;baz=2'

to this:

decodeHTMLSpecialChars('http://someurl.com/foo?bar=1&amp;baz=2');
// => 'http://someurl.com/foo?bar=1&baz=2'

And no, there isn't a better way: escaping with htmlspecialchars() is the standard method for HTML encoding, and it is doing the job just fine for you.
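If a DOM is not available, or the string should not be handed to an HTML parser, a string-only decoder is one possible alternative. The sketch below is an assumption-laden illustration, not a full entity decoder: `decodeBasicEntities` is a hypothetical helper that handles only the entities htmlspecialchars() emits, plus the common spellings of the apostrophe.

```javascript
// Hypothetical helper (not part of any library): decodes only the entities that
// htmlspecialchars() emits, plus the common spellings of the apostrophe.
const ENTITY_TO_CHAR = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#039;': "'",
  '&#39;': "'",
  '&apos;': "'",
};

function decodeBasicEntities(text) {
  // One pass over the input: each entity is replaced exactly once, so
  // "&amp;lt;" becomes "&lt;" rather than "<".
  return String(text).replace(
    /&(?:amp|lt|gt|quot|apos|#0?39);/g,
    (entity) => ENTITY_TO_CHAR[entity]
  );
}

console.log(decodeBasicEntities('http://someurl.com/foo?bar=1&amp;baz=2'));
```

Doing the replacement in a single pass matters: chaining separate replace() calls that decode &amp; first would turn an encoded `&amp;lt;` into `<`, which is one decoding step too many.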

Could you not just use the html_entity_decode function in PHPJS:

http://phpjs.org/functions/html_entity_decode

Other than that you could base64 encode your data instead...
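As a rough sketch of the base64 idea: the base64 alphabet (A-Z, a-z, 0-9, +, / and =) contains no characters that need escaping inside a quoted attribute, so the server could write the encoded string into the value and the script could decode it. `decodeBase64Utf8` is a hypothetical helper name, and the sketch assumes an environment with global `atob` and `TextDecoder` (browsers, or Node.js 16 and later) and that the original bytes are UTF-8.

```javascript
// Hypothetical helper: decodes a base64 string (e.g. one produced by PHP's
// base64_encode()) into a JavaScript string, treating the bytes as UTF-8.
function decodeBase64Utf8(encoded) {
  const binary = atob(encoded); // one character per decoded byte
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder('utf-8').decode(bytes);
}

// 'Zm9vP2Jhcj0xJmJhej0y' is the base64 encoding of 'foo?bar=1&baz=2'.
console.log(decodeBase64Utf8('Zm9vP2Jhcj0xJmJhej0y'));
```

In the page from the question, the argument would come from document.getElementById('foo').value instead of a literal. Keep in mind that base64 only sidesteps the attribute escaping; the decoded URL is still untrusted data.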

Please note that using htmlentities() as it is, with its default flags, isn't enough!

By default, the only HTML special characters it encodes are " < > &

It doesn't escape ', which can create a problem!

Make sure you pass the flags argument (such as ENT_QUOTES) to these functions; you can find the usage and examples here
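The single-quote problem can be shown with a small sketch. It is written in JavaScript so it can be run standalone, and the two helper functions are hypothetical stand-ins for escaping with ENT_COMPAT (the default flag before PHP 8.1) and with ENT_QUOTES; as far as I know, PHP 8.1 changed the default to include ENT_QUOTES, so this mainly concerns older versions or code that passes ENT_COMPAT explicitly.

```javascript
// Hypothetical helpers mimicking two escaping modes; the names are illustrative.
// escapeDoubleQuotesOnly: roughly like ENT_COMPAT, escapes & < > "
// escapeBothQuotes: roughly like ENT_QUOTES, additionally escapes '
function escapeDoubleQuotesOnly(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' };
  return String(text).replace(/[&<>"]/g, (ch) => entities[ch]);
}

function escapeBothQuotes(text) {
  return escapeDoubleQuotesOnly(text).replace(/'/g, '&#039;');
}

const untrusted = "x' onmouseover='alert(1)";

// In a single-quoted attribute, the unescaped apostrophe ends the value early
// and the rest of the input is parsed as a new attribute.
const unsafe = "<input value='" + escapeDoubleQuotesOnly(untrusted) + "' />";
const safe = "<input value='" + escapeBothQuotes(untrusted) + "' />";

console.log(unsafe);
console.log(safe);
```

With double-quoted attributes, as in the question, escaping the double quote already covers this case; the extra flag matters wherever single-quoted attributes may appear.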
