javascript - Convert UTF-8 String with only 8 Bits per Character

I have a JavaScript string that contains characters that have a charCode greater than 255.

I want to be able to encode/decode that string into another string that has all its charCode less than or equal to 255.

There is no restriction on the characters (e.g., they can be non-printable).

I want a solution that is as fast as possible and that produces a string as small as possible.

It must also work for any UTF-8 character.

I found out that encodeURI does exactly that, but it seems that it takes a lot of space.

encodeURI('ĉ') === "%C4%89" // 6 bytes...

Is there anything better than encodeURI?

asked Jun 16, 2016 at 19:19 by RainingChain; edited Jul 10, 2018 at 0:09 by Grant Miller

Do you have any other requirements on the encoding, other than that there is no charCode greater than 255? Is it allowed to have quotation marks, spaces, non-printable characters, NUL characters? – Paul Commented Jun 16, 2016 at 19:22
No other requirements. The data is sent as binary. – RainingChain Commented Jun 16, 2016 at 19:24
Fast and as small as possible are somewhat mutually exclusive. You could try LZW compression of the string. Just how large is the string you want to compress, and why do you need to compress it? E.g. if it is for a GET request, perhaps you could use a POST request instead, which would transmit the bytes quite effectively. – Andrew Morton Commented Jun 16, 2016 at 19:32
You could convert each character's char code to base 255 and then delimit them with the one unused character. – Paul Commented Jun 16, 2016 at 19:32
@AndrewMorton I'm using a compression library that encodes an object into a binary buffer. That library assumes each character of the strings within the object fits in 1 byte. – RainingChain Commented Jun 16, 2016 at 19:35

3 Answers

What you want to do is encode your string as UTF-8. Googling for how to do that in JavaScript, I found http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html, which gives:

function encode_utf8( s ) {
  return unescape( encodeURIComponent( s ) );
}

function decode_utf8( s ) {
  return decodeURIComponent( escape( s ) );
}

In short, this is almost exactly what you already found, plus unescaping each '%xx' code into a single character whose char code is that byte.
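
Since escape() and unescape() are deprecated, the same round trip can also be sketched with TextEncoder and TextDecoder. This assumes an environment that provides both (modern browsers, recent Node.js); the function names encodeUtf8Binary and decodeUtf8Binary are made up for illustration and do not come from the linked article.

```javascript
// Sketch only: UTF-8 "binary string" round trip without escape()/unescape().
// Assumes TextEncoder/TextDecoder are available as globals.

// Returns a string in which every char code is one UTF-8 byte (0..255).
function encodeUtf8Binary(s) {
  const bytes = new TextEncoder().encode(s); // Uint8Array of UTF-8 bytes
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// Reverses encodeUtf8Binary: reads each char code as one byte and decodes.
function decodeUtf8Binary(b) {
  const bytes = new Uint8Array(b.length);
  for (let i = 0; i < b.length; i++) {
    bytes[i] = b.charCodeAt(i);
  }
  // ignoreBOM: true keeps a leading U+FEFF instead of silently dropping it.
  return new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);
}

console.log(encodeUtf8Binary("ĉ").length); // 2
```

One difference to be aware of: for a lone surrogate, encodeURIComponent throws a URIError, while TextEncoder substitutes U+FFFD, so such input does not survive the round trip in either variant.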

You can get the char code of a character (its UTF-16 code unit, not an ASCII value) with .charCodeAt(position). You can use it to split one character into several characters that each fit in 8 bits.

First, get the char code of every character by looping through the string. While the char code is 255 or higher, subtract 255 from it and append a ÿ (char code 255, the last character of the extended ASCII table) to the result. Once the code is below 255, convert it to a character with String.fromCharCode(charCode) and append that as well. Keep in mind that the output grows with the char code: a code unit near 65535 expands to more than 250 characters, so this only pays off when the codes stay close to 255.

function encode(string) {
    var result = "";
    for (var i = 0; i < string.length; i++) {
        var charCode = string.charCodeAt(i);
        // Emit one ÿ per 255 subtracted; the remainder ends up in 0..254,
        // so the character that closes a group is never ÿ itself.
        while (charCode >= 255) {
            result += "ÿ";
            charCode -= 255;
        }
        result += String.fromCharCode(charCode);
    }
    return result;
}

Because the character that closes each group is never ÿ, the groups need no separator (a comma separator would break on input that itself contains commas). The decoder adds 255 for every ÿ and emits a character as soon as it reads anything else.

function decode(string) {
    var result = "";
    var charCode = 0;
    for (var i = 0; i < string.length; i++) {
        var code = string.charCodeAt(i);
        if (code === 255) {
            // ÿ belongs to the prefix and is worth 255
            charCode += 255;
        } else {
            // Any other character closes the group
            result += String.fromCharCode(charCode + code);
            charCode = 0;
        }
    }
    return result;
}

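UTF-8 is already an encoding for Unicode text that is built from 8-bit units (one character takes one to four bytes). You can simply send the UTF-8 string over the wire.

Generally, JavaScript strings consist of UTF-16 code units.

For such strings, you can either encode each UTF-16 code unit as two 8-bit characters or use a variable-length encoding such as UTF-8.

If many of your characters lie at or above U+0800, the first might produce smaller results: UTF-8 needs three bytes for each of those, while the fixed-length encoding always uses two bytes per code unit. ASCII characters, on the other hand, take one byte in UTF-8 and two in the fixed-length encoding.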
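
To make the trade-off between the two options concrete, here is a small sketch that only counts the output size of each encoding without building the encoded string. The helper names utf8Length and fixedLength are hypothetical and chosen for this illustration.

```javascript
// Sketch only: compare output sizes of the two encodings.

// Number of bytes UTF-8 needs for s, counted per code point.
function utf8Length(s) {
  let n = 0;
  for (const ch of s) {            // for...of iterates by code point
    const cp = ch.codePointAt(0);
    n += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return n;
}

// Number of bytes when every UTF-16 code unit is stored as two bytes.
function fixedLength(s) {
  return s.length * 2;
}

console.log(utf8Length("abc"), fixedLength("abc"));   // 3 6
console.log(utf8Length("科技"), fixedLength("科技")); // 6 4
```

A caller that wants the smallest result could compare both numbers per string and pick the encoding accordingly, but would then also need to record which one was used, for example in a leading flag byte.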

// See http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

function encode_fixed_length(s) {
  let length = s.length << 1,
      bytes = new Array(length);
  for (let i = 0; i < length; ++i) {
    let code = s.charCodeAt(i >> 1);
    bytes[i] = code >> 8;
    bytes[++i] = code & 0xFF;
  }
  return String.fromCharCode.apply(undefined, bytes);
}

function decode_fixed_length(s) {
  let length = s.length,
      chars = new Array(length >> 1);
  for (let i = 0; i < length; ++i) {
    chars[i >> 1] = (s.charCodeAt(i) << 8) + s.charCodeAt(++i);
  }
  return String.fromCharCode.apply(undefined, chars);
}

const string_1 = "\u0000\u000F\u00FF";
const string_2 = "\u00FF\u0FFF\uFFFF";

console.log(encode_fixed_length(string_1)); // "\x00\x00\x00\x0F\x00\xFF"
console.log(encode_fixed_length(string_2)); // "\x00\xFF\x0F\xFF\xFF\xFF"

console.log(encode_utf8(string_1));         // "\x00\x0F\xC3\xBF" 
console.log(encode_utf8(string_2));         // "\xC3\xBF\xE0\xBF\xBF\xEF\xBF\xBF"
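
One caveat with encode_fixed_length and decode_fixed_length above: String.fromCharCode.apply passes every array element as a separate argument, and some engines throw a RangeError once that array gets large. A possible workaround, sketched here under the assumption that chunks of 0x2000 code units are safe (the CHUNK value and the function names are made up for illustration, and engine limits differ), is to convert the input piece by piece:

```javascript
// Sketch only: chunked variant of the fixed-length encoding.
// CHUNK is a hypothetical, conservative size; engine limits differ.
const CHUNK = 0x2000;

function encodeFixedLengthChunked(s) {
  const parts = [];
  for (let start = 0; start < s.length; start += CHUNK) {
    const end = Math.min(start + CHUNK, s.length);
    const bytes = new Array((end - start) * 2);
    for (let i = start, j = 0; i < end; i++) {
      const code = s.charCodeAt(i);
      bytes[j++] = code >> 8;    // high byte
      bytes[j++] = code & 0xff;  // low byte
    }
    parts.push(String.fromCharCode.apply(null, bytes));
  }
  return parts.join("");
}

function decodeFixedLengthChunked(s) {
  const parts = [];
  // Step by an even number of bytes so that byte pairs are never split.
  for (let start = 0; start < s.length; start += CHUNK * 2) {
    const end = Math.min(start + CHUNK * 2, s.length);
    const units = new Array((end - start) >> 1);
    for (let i = start, j = 0; i + 1 < end; i += 2) {
      units[j++] = (s.charCodeAt(i) << 8) | s.charCodeAt(i + 1);
    }
    parts.push(String.fromCharCode.apply(null, units));
  }
  return parts.join("");
}
```

The output format is the same as that of encode_fixed_length (high byte first), so the two variants can decode each other's results.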

Performance comparison: See https://jsfiddle/r0d9pm25/1/

Results for 500000 iterations in Firefox 47:

6159.91ms encode_fixed_length()
7177.35ms encode_utf8()
