Unicode and URI encoding, decoding and escaping in JavaScript - Stack Overflow

If you look at this table here, it has a list of escape sequences for Unicode characters that don't actually work for me.

For example, for "%96", which should be a –, I get an error when trying to decode:

decodeURIComponent("%96");
URIError: URI malformed

If I attempt to encode "–" I actually get:

encodeURIComponent("–");
"%E2%80%93"

I searched through the internet and I saw this page, which mentions using escape and unescape with decodeURIComponent and encodeURIComponent respectively. This doesn't seem to help because %96 doesn't show up as "–" no matter what I try and this of course wouldn't work:

decodeURIComponent(escape("%96"));
"%96"

Not very helpful.

How can I get "%96" to be a "–" with JavaScript (without hardcoding a map for every single possible unicode character I may run into)?

asked Apr 7, 2010 at 22:59 by Bjorn

3 Answers

The sequence %XX in a URI encodes an "octet", that is, an eight-bit byte. This raises the question of which Unicode character the decoded byte refers to. If my memory serves me correctly, older versions of the URI specification did not define which charset was assumed. Later versions recommend UTF-8 as the default encoding charset. That is, to decode a URI, you decode each %XX sequence into a byte and then convert the resulting bytes into a string using the UTF-8 character set.

This explains why %96 won't decode. The byte 0x96 on its own isn't a valid UTF-8 sequence. As it lies beyond ASCII, it would need a lead byte before it to indicate a multi-byte character. (See the UTF-8 specification for more details.) The JavaScript encodeURIComponent() and decodeURIComponent() functions both assume UTF-8 (as they should), so I wouldn't expect %96 to decode correctly.
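To make the point about 0x96 concrete, here is a minimal sketch that classifies a byte by its UTF-8 bit pattern. The helper name `utf8ByteRole` is hypothetical (it is not part of any library), and it looks only at the leading bits, so it does not reject every invalid lead byte.

```javascript
// Hypothetical helper: classify a byte by its leading bits in UTF-8.
// Bit-pattern check only; it does not flag overlong leads such as 0xC0.
function utf8ByteRole(byte) {
  if (byte < 0x80) return "ascii";        // 0xxxxxxx
  if (byte < 0xC0) return "continuation"; // 10xxxxxx
  if (byte < 0xE0) return "lead-2";       // 110xxxxx
  if (byte < 0xF0) return "lead-3";       // 1110xxxx
  if (byte < 0xF8) return "lead-4";       // 11110xxx
  return "invalid";
}

// 0x96 is a continuation byte, so it cannot start a character.
const roleOf96 = utf8ByteRole(0x96);

// The correct encoding of U+2013 is one lead byte plus two continuation bytes.
const rolesOfEnDash = [0xE2, 0x80, 0x93].map(utf8ByteRole);

// decodeURIComponent accepts the UTF-8 form and rejects the lone byte.
const decoded = decodeURIComponent("%E2%80%93");
let loneByteError = null;
try {
  decodeURIComponent("%96");
} catch (e) {
  loneByteError = e;
}
```

Here `roleOf96` is "continuation", `rolesOfEnDash` is ["lead-3", "continuation", "continuation"], and `loneByteError` is the URIError from the question.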

The character you referenced is U+2013, an en-dash. How on earth does the page you reference get an en-dash from hex 0x96 (decimal 150)? They are obviously not assuming UTF-8 encoding, which is the standard. They are not assuming ASCII, which doesn't contain this character. They are not even assuming ISO-8859-1, which is a standard encoding that uses one byte per character. It turns out they are assuming the special Windows 1252 code page. That is, the URI you are trying to decode assumes that the user is on a Windows machine, and even worse, on a Windows machine in English (or one of a few other Western languages).

In short, the table you're using is bad. It's out-of-date and assumes that the user is on an English Windows system. The up-to-date and correct way to encode non-ASCII values is to convert them to UTF-8 and then encode each octet using %XX. That's why you got %E2%80%93 when you tried to encode the character, and that's what decodeURIComponent() is expecting. The URI you're using is not encoded correctly. If you have no other choice, you can guess that the URI is using Windows 1252, convert the bytes yourself, and then use a Windows 1252 table to find out what Unicode values were intended. But that's risky: how do you know which URI uses which table? That's why everybody settled on UTF-8. If possible, tell whoever is giving you these URIs to encode them correctly.
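If guessing Windows 1252 really is the only option, the fallback described above could be sketched roughly as follows. The names `CP1252_HIGH`, `decodeCp1252Component` and `decodeLenient` are hypothetical. The sketch relies on the fact that Windows 1252 differs from ISO-8859-1 only in the 32 bytes 0x80 to 0x9F, so a 32-entry table is enough and no per-character map is needed.

```javascript
// Windows 1252 characters for bytes 0x80..0x9F, in order. The five bytes with
// no assigned character (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1
// control of the same value; that choice is an assumption of this sketch.
const CP1252_HIGH =
  "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
  "\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
  "\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
  "\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178";

// Hypothetical helper: treat every %XX escape as one Windows 1252 byte.
function decodeCp1252Component(input) {
  return input.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const byte = parseInt(hex, 16);
    if (byte >= 0x80 && byte <= 0x9F) {
      return CP1252_HIGH.charAt(byte - 0x80);
    }
    // Outside 0x80..0x9F the byte value equals the Unicode code point.
    return String.fromCharCode(byte);
  });
}

// Try the standard UTF-8 decoding first; guess Windows 1252 only on failure.
function decodeLenient(input) {
  try {
    return decodeURIComponent(input);
  } catch (e) {
    if (e instanceof URIError) return decodeCp1252Component(input);
    throw e;
  }
}
```

With this, `decodeLenient("%96")` yields "–" while `decodeLenient("%E2%80%93")` still goes through the normal UTF-8 path. The guess can be wrong for input that was produced with some other single-byte code page, which is the risk the answer warns about. Runtimes that support the `windows-1252` label in `TextDecoder` could replace the table, but that support is not assumed here.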

Posting as a community wiki entry as it's from "Building Scalable Web Sites" by Cal Henderson. The book says it's OK to reproduce significant portions of the examples though. You may be able to create a special case for "-" with it.

function escape_utf8(data) {
    // Strict comparison so that 0 and false are encoded rather than dropped
    if (data === '' || data == null) {
        return '';
    }
    data = data.toString();
    var buffer = '';
    for (var i = 0; i < data.length; i++) {
        // codePointAt joins a surrogate pair into a single code point
        var c = data.codePointAt(i);
        var bs = [];
        if (c >= 0x10000) {
            // 4 bytes; the character took two UTF-16 units, so skip the second
            i++;
            bs[0] = 0xF0 | ((c & 0x1C0000) >>> 18);
            bs[1] = 0x80 | ((c & 0x3F000) >>> 12);
            bs[2] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[3] = 0x80 | (c & 0x3F);
        } else if (c >= 0x800) {
            // 3 bytes
            bs[0] = 0xE0 | ((c & 0xF000) >>> 12);
            bs[1] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[2] = 0x80 | (c & 0x3F);
        } else if (c >= 0x80) {
            // 2 bytes
            bs[0] = 0xC0 | ((c & 0x7C0) >>> 6);
            bs[1] = 0x80 | (c & 0x3F);
        } else {
            // 1 byte
            bs[0] = c;
        }
        // Emit every byte as %XX, including plain ASCII
        for (var j = 0; j < bs.length; j++) {
            var b = bs[j];
            var hex = nibble_to_hex((b & 0xF0) >>> 4) + nibble_to_hex(b & 0x0F);
            buffer += '%' + hex;
        }
    }
    return buffer;
}
function nibble_to_hex(nibble) {
    var chars = '0123456789ABCDEF';
    return chars.charAt(nibble);
}
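For comparison, a shorter sketch of the same idea is possible where the built-in `TextEncoder` is available (modern browsers and Node.js). The function name `percentEncodeUtf8` is hypothetical. Like the book's function, it escapes every byte, including plain ASCII, which `encodeURIComponent` does not do.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of a string.
function percentEncodeUtf8(value) {
  if (value === '' || value == null) {
    return '';
  }
  let out = '';
  // TextEncoder always produces UTF-8 bytes.
  for (const byte of new TextEncoder().encode(String(value))) {
    out += '%' + byte.toString(16).toUpperCase().padStart(2, '0');
  }
  return out;
}

const encodedDash = percentEncodeUtf8('\u2013'); // "%E2%80%93"
const roundTrip = decodeURIComponent(percentEncodeUtf8('a\u2013b'));
```

The output is ordinary UTF-8 percent-encoding, so `decodeURIComponent` reads it back unchanged.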

See this question, specifically this answer:

there is a special “%uNNNN” format for encoding Unicode UTF-16 code points, instead of encoding UTF-8 bytes

I suspect "–" is one of those characters, since 0x96 in the extended ASCII table (code page 437) is û.
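The "%uNNNN" format mentioned in that quote is what the legacy `escape()` and `unescape()` functions use. A quick sketch of how they treat the en-dash, assuming a runtime that still ships these deprecated functions (current browsers and Node.js do):

```javascript
// escape() writes code units above 0xFF in the %uNNNN form.
const escapedDash = escape("\u2013"); // "%u2013"

// unescape() reverses that form.
const unescapedDash = unescape("%u2013"); // en-dash, U+2013

// A plain %XX is read as the code point with that value, so %96
// becomes the control character U+0096, not an en-dash.
const unescaped96 = unescape("%96");
```

So `%96` is not a `%uNNNN` sequence, and `unescape()` alone does not turn it into "–".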
