I have a JavaScript string that contains characters that have a charCode greater than 255.
I want to be able to encode/decode that string into another string that has all its charCodes less than or equal to 255.
There is no restriction on the characters (e.g. they can be non-printable).
I want a solution that is as fast as possible and that produces a string as small as possible.
It must also work for any UTF-8 character.
I found out that encodeURI does exactly that, but it seems that it takes a lot of space.
encodeURI('ĉ') === "%C4%89" // 6 bytes...
Is there anything better than encodeURI?
- Do you have any other requirements on the encoding, other than that there is no charCode greater than 255? Is it allowed to have quotation marks, spaces, non-printable characters, NUL characters? – Paul Commented Jun 16, 2016 at 19:22
- No other requirements. The data is sent as binary. – RainingChain Commented Jun 16, 2016 at 19:24
- Fast and as small as possible are somewhat mutually exclusive. You could try LZW compression of the string. Just how large is the string you want to compress, and why do you need to compress it? E.g. if it is for a GET request, perhaps you could use a POST request instead, which would transmit the bytes quite effectively. – Andrew Morton Commented Jun 16, 2016 at 19:32
- You could convert each character's charCode to base 255 and then delimit them with the one unused character. – Paul Commented Jun 16, 2016 at 19:32
- @AndrewMorton I'm using a compression library that encodes an object into a binary buffer. That library assumes each character of the strings within the object fits in 1 byte. – RainingChain Commented Jun 16, 2016 at 19:35
3 Answers
What you want to do is encode your string as UTF-8. Googling for how to do that in JavaScript, I found http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html, which gives:
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}
In short, this is almost exactly what you already found, plus unescaping each '%xx' code into a single character with that byte value.
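Since escape and unescape are deprecated, the same one-character-per-byte result can probably also be produced with the standard TextEncoder and TextDecoder APIs. The following is only a minimal sketch and is not taken from the linked article: the names utf8ToBinaryString and binaryStringToUtf8 are made up for illustration, and it assumes TextEncoder and TextDecoder exist as globals (modern browsers, recent Node.js).

```javascript
// Hypothetical helper: UTF-8 encode `s`, then map each byte to one
// character whose charCode is that byte (0..255).
function utf8ToBinaryString(s) {
  const bytes = new TextEncoder().encode(s); // Uint8Array of UTF-8 bytes
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

// Hypothetical helper: inverse of utf8ToBinaryString.
function binaryStringToUtf8(b) {
  const bytes = new Uint8Array(b.length);
  for (let i = 0; i < b.length; i++) {
    bytes[i] = b.charCodeAt(i);
  }
  // ignoreBOM: true keeps a leading U+FEFF instead of silently dropping it.
  return new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);
}

const encoded = utf8ToBinaryString("ĉ");
console.log(encoded.length);                      // 2
console.log(encodeURI("ĉ").length);               // 6
console.log(binaryStringToUtf8(encoded) === "ĉ"); // true
```

One behavioural difference to be aware of: for a string containing a lone surrogate, encodeURIComponent throws a URIError, while TextEncoder substitutes U+FFFD, so neither approach round-trips such input.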
You can get the char code (UTF-16 code unit) of a character with .charCodeAt(position), and you can use it to split one character into several.
First, loop through the string and get the char code of every character. For each one, start with an empty temporary string; while the char code is higher than 255, subtract 255 from it and append a ÿ (char code 255, the last character of Latin-1) to the temporary string. Once the code is 255 or lower, convert it to a character with String.fromCharCode(charCode), append it to the temporary string, and finally replace the original character with this string.
function encode(string) {
  var result = [];
  for (var i = 0; i < string.length; i++) {
    var charCode = string.charCodeAt(i);
    var temp = "";
    while (charCode > 255) {
      temp += "ÿ";
      charCode -= 255;
    }
    result.push(temp + String.fromCharCode(charCode));
  }
  return result.join(",");
}
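The above encoder puts a comma between groups, which could cause problems when decoding, so we use the ,(?!,) regex to match only the last comma in a run of commas. Note that this does not fully solve the problem: if the input itself contains a comma, "a,b" encodes to "a,,,b", which splits into "a,," and "b" and does not decode back to the original.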
function decode(string) {
  var characters = string.split(/,(?!,)/g);
  var result = "";
  for (var i = 0; i < characters.length; i++) {
    var charCode = 0;
    for (var j = 0; j < characters[i].length; j++) {
      charCode += characters[i].charCodeAt(j);
    }
    result += String.fromCharCode(charCode);
  }
  return result;
}
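The comma delimiter can be avoided if every group is guaranteed to end with a character below 255, because ÿ then only ever acts as a continuation marker. The following is a sketch of that variant, not the code from the answer above: encodeUnary and decodeUnary are hypothetical names, and the only change to the encoder is using >= 255 instead of > 255.

```javascript
// Hypothetical variant: emit "ÿ" (code 255) while the remaining value is
// 255 or more, so the final character of each group is always below 255.
function encodeUnary(string) {
  var result = "";
  for (var i = 0; i < string.length; i++) {
    var charCode = string.charCodeAt(i);
    while (charCode >= 255) {
      result += "\u00FF";
      charCode -= 255;
    }
    result += String.fromCharCode(charCode);
  }
  return result;
}

// Sum the codes until a character below 255 closes the current group.
function decodeUnary(string) {
  var result = "";
  var charCode = 0;
  for (var i = 0; i < string.length; i++) {
    var code = string.charCodeAt(i);
    charCode += code;
    if (code < 255) {
      result += String.fromCharCode(charCode);
      charCode = 0;
    }
  }
  return result;
}

console.log(decodeUnary(encodeUnary("a,b")) === "a,b"); // true
console.log(encodeUnary("ĉ").length);                   // 2
console.log(encodeUnary("\uFFFF").length);              // 258
```

The last line shows the cost of this family of encodings: a single high code unit can expand to hundreds of characters, so it is only compact when most char codes are close to 255.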
UTF-8 is already an encoding for Unicode text that uses 8-bit code units (one to four bytes per character). You can simply send the UTF-8 string over the wire.
Generally, JavaScript strings consist of UTF-16 code units.
For such strings, you can either encode each UTF-16 code unit as two 8-bit characters or use a variable-length encoding such as UTF-8.
If you have many non-ASCII characters, the first might produce smaller results: UTF-8 needs three bytes for characters from U+0800 to U+FFFF, while the fixed-length encoding always needs two per code unit.
// See http://monsur.hossa.in/2012/07/20/utf-8-in-javascript.html
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

// Split every 16-bit code unit into a high byte and a low byte.
function encode_fixed_length(s) {
  let length = s.length << 1,
      bytes = new Array(length);
  for (let i = 0; i < length; ++i) {
    let code = s.charCodeAt(i >> 1);
    bytes[i] = code >> 8;
    bytes[++i] = code & 0xFF;
  }
  // apply() passes every element as an argument, so very long
  // strings may exceed the engine's argument limit.
  return String.fromCharCode.apply(undefined, bytes);
}

// Recombine each pair of bytes into one 16-bit code unit.
function decode_fixed_length(s) {
  let length = s.length,
      chars = new Array(length >> 1);
  for (let i = 0; i < length; ++i) {
    chars[i >> 1] = (s.charCodeAt(i) << 8) + s.charCodeAt(++i);
  }
  return String.fromCharCode.apply(undefined, chars);
}

let string_1 = "\u0000\u000F\u00FF";
let string_2 = "\u00FF\u0FFF\uFFFF";
console.log(encode_fixed_length(string_1)); // "\x00\x00\x00\x0F\x00\xFF"
console.log(encode_fixed_length(string_2)); // "\x00\xFF\x0F\xFF\xFF\xFF"
console.log(encode_utf8(string_1)); // "\x00\x0F\xC3\xBF"
console.log(encode_utf8(string_2)); // "\xC3\xBF\xE0\xBF\xBF\xEF\xBF\xBF"
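To get a feel for which of the two encodings is smaller for a given input, the output sizes can be computed without building the encoded strings. This is a rough sketch under assumptions: encodedSizes is a hypothetical helper, and it relies on a global TextEncoder to count UTF-8 bytes.

```javascript
// Hypothetical helper: size in 8-bit characters of each encoding of `s`.
function encodedSizes(s) {
  return {
    // Two bytes per UTF-16 code unit, regardless of content.
    fixedLength: s.length * 2,
    // Number of bytes in the UTF-8 encoding of the string.
    utf8: new TextEncoder().encode(s).length
  };
}

console.log(encodedSizes("hello"));  // { fixedLength: 10, utf8: 5 }
console.log(encodedSizes("ĉĉĉ"));    // { fixedLength: 6, utf8: 6 }
console.log(encodedSizes("日本語")); // { fixedLength: 6, utf8: 9 }
```

ASCII-heavy text favours UTF-8, text in the U+0080 to U+07FF range is a tie, and text above U+07FF favours the fixed-length form.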
Performance comparison: See https://jsfiddle/r0d9pm25/1/
Results for 500000 iterations in Firefox 47:
- 6159.91ms encode_fixed_length()
- 7177.35ms encode_utf8()