Make a utf-8 string shorter with a utf-32 encoding in Javascript?

I'm trying to find a way to compress/decompress a string in JavaScript. By compress I mean making the string look shorter (fewer characters). That's my goal.

Here's an example of how things should work:

// The string that I want to make shorter
// It will only contain [a-zA-Z0-9] chars and some punctuation like ()[]{}.,;'"!
var string = "I like bananas !";

// The compressed string, maybe something like "䐓㐛꯱字",
// which is shorter than the original
var shortString = compress(string);  

// The original string, "I like bananas !"
var originalString = decompress(shortString);

Here's my first idea (maybe there's a better way to get to my goal, and if so I'm interested in it).

I know that my original string will be in UTF-8. So I'm thinking of using UTF-32 for the encoding, which should divide the length of the string by 4.

But I don't know how to write these two functions that construct new strings with a different encoding. Here's the code I have so far, which doesn't work...

function compress(string) {
    string = unescape(encodeURIComponent(string));
    var newString = '';

    for (var i = 0; i < string.length; i++) {
        var char = string.charCodeAt(i);
        newString += parseInt(char, 8).toString(32);
    }

    return newString;
}


asked Oct 30, 2017 at 18:42 by Thomas, edited Nov 1, 2017 at 20:34

1 "Compressing to UTF-32" makes little sense. Is there at least any restriction on the valid input? E.g. while your idea may somewhat work with English, you won't be able to compress "バナナ好き" into "䐓㐛꯱字". – deceze ♦ Commented Nov 1, 2017 at 10:32
I know that my input string will only contain English characters, like [a-zA-Z0-9] and some punctuation like ()[]{}.,; – Thomas Commented Nov 1, 2017 at 10:34
3 UTF-8, UTF-16 and UTF-32 are character encodings for the Unicode character set. In JavaScript, all strings are counted sequences of UTF-16 code units (charCodeAt). "New strings with different encoding" doesn't make sense. I think you mean to map a sequence of characters from a fixed subset of Unicode into a shorter sequence of characters. That doesn't have anything to do with changing character encodings. – Tom Blodget Commented Nov 1, 2017 at 17:01
My goal is to make the string shorter. If you have a better idea than utf-8 and utf-32, let me know. – Thomas Commented Nov 1, 2017 at 20:27
3 Compressing a string to a string is highly problematic. JavaScript strings use the UTF-16 encoding of the Unicode character set. If the result is to be valid, it must satisfy quite a few rules, including: each UTF-16 code unit or surrogate pair must encode a defined Unicode codepoint. A bit packing scheme can't just be mathematical. BTW—Is this an XY Problem? – Tom Blodget Commented Nov 2, 2017 at 5:09


4 Answers


Since you're using a set of fewer than 100 characters, and JavaScript strings are sequences of UTF-16 code units (which means each unit can take 65536 possible values), you can combine the character codes so that one "compressed" character stands for two original characters. This lets you compress strings to half the length.

Like this for example:

document.getElementById('compressBtn').addEventListener('click', function() {
  var stringToCompress = document.getElementById('tocompress').value;
  var compressedString = compress(stringToCompress);
  var decompressedString = decompress(compressedString);

  if (stringToCompress === decompressedString) {
    document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;
  } else {
    document.getElementById('display').innerHTML = "This string cannot be compressed"
  }

})


function compress(string) {
  string = unescape(encodeURIComponent(string));
  var newString = '',
    char, nextChar, combinedCharCode;

  for (var i = 0; i < string.length; i += 2) {
    char = string.charCodeAt(i);

    if ((i + 1) < string.length) {

      // The second character's code must fit in two decimal digits, otherwise the combined
      // number could go over 65535. Subtracting 31 maps char codes 32..126 to 1..95.
      // Characters below char code 32 (control characters) aren't in your basic character set,
      // but it's a limitation: anything under char code 32 will cause an error
      nextChar = string.charCodeAt(i + 1) - 31;

      // this is to pad the result, because you could have a code that is single digit, which would make 
      // decompression a bit harder
      combinedCharCode = char + "" + nextChar.toLocaleString('en', {
        minimumIntegerDigits: 2
      });

      // You take the concatenated code string and convert it back to a number, then a character
      newString += String.fromCharCode(parseInt(combinedCharCode, 10));

    } else {

      // Odd-length strings leave a last character without a partner; keep it as is
      newString += string.charAt(i);
    }
  }
  return newString;
}

function decompress(string) {

  var newString = '',
    char, codeStr, firstCharCode, lastCharCode;

  for (var i = 0; i < string.length; i++) {
    char = string.charCodeAt(i);
    if (char > 132) {
      codeStr = char.toString(10);

      // You take the first part of the compressed char code, it's your first letter
      firstCharCode = parseInt(codeStr.substring(0, codeStr.length - 2), 10);

      // For the second one you need to add 31 back.
      lastCharCode = parseInt(codeStr.substring(codeStr.length - 2, codeStr.length), 10) + 31;

      // You put back the 2 characters you had originally
      newString += String.fromCharCode(firstCharCode) + String.fromCharCode(lastCharCode);
    } else {
      newString += string.charAt(i);
    }
  }
  return newString;
}

var stringToCompress = 'I like bananas!';
var compressedString = compress(stringToCompress);
var decompressedString = decompress(compressedString);

document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;

body {
  padding: 10px;
}

#tocompress {
  width: 200px;
}

<input id="tocompress" placeholder="enter string to compress" />
<button id="compressBtn">
  Compress input
</button>
<div id="display">

</div>

As for using UTF-32 to compress further: I might be wrong, but from my understanding it's not feasible. Here's why:

The approach above basically concatenates two 1-byte values into one 2-byte value. This is possible because JavaScript strings are made of 2-byte (16-bit) units (note that, from what I understand, the engine may decide to store them differently, which would make this compression unnecessary from a purely memory-space point of view; that being said, in the end, one character is considered to be 16 bits). A cleaner way to do the compression above would in fact be to use binary numbers instead of decimal; it would make much more sense. Like this for example:

document.getElementById('compressBtn').addEventListener('click', function() {
  var stringToCompress = document.getElementById('tocompress').value;
  var compressedString = compress(stringToCompress);
  var decompressedString = decompress(compressedString);

  if (stringToCompress === decompressedString) {
    document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;
  } else {
    document.getElementById('display').innerHTML = "This string cannot be compressed"
  }

})


function compress(string) {
  string = unescape(encodeURIComponent(string));
  var newString = '',
    char, nextChar, combinedCharCode;

  for (var i = 0; i < string.length; i += 2) {
  
    // convert to binary instead of keeping the decimal
    char = string.charCodeAt(i).toString(2);

    if ((i + 1) < string.length) {

     
      nextChar = string.charCodeAt(i + 1).toString(2) ;
     

      // you still need padding, see this answer https://stackoverflow.com/questions/27641812/way-to-add-leading-zeroes-to-binary-string-in-javascript
      combinedCharCode = "0000000".substr(char.length) + char + "" + "0000000".substr(nextChar.length) + nextChar;

      // You take the concatenated code string, parse it as a binary number, then convert it to a character
      newString += String.fromCharCode(parseInt(combinedCharCode, 2));

    } else {

      // Odd-length strings leave a last character without a partner; keep it as is
      newString += string.charAt(i);
    }
  }
  return newString;
}

function decompress(string) {

  var newString = '',
    char, codeStr, firstCharCode, lastCharCode;

  for (var i = 0; i < string.length; i++) {
    char = string.charCodeAt(i);
    if (char > 132) {
      codeStr = char.toString(2);

      // You take the first part (everything above the low 7 bits) of the compressed char code, it's your first letter
      firstCharCode = parseInt(codeStr.substring(0, codeStr.length - 7), 2);

      // then the low 7 bits, which hold the second letter
      lastCharCode = parseInt(codeStr.substring(codeStr.length - 7, codeStr.length), 2);

      // You put back the 2 characters you had originally
      newString += String.fromCharCode(firstCharCode) + String.fromCharCode(lastCharCode);
    } else {
      newString += string.charAt(i);
    }
  }
  return newString;
}

var stringToCompress = 'I like bananas!';
var compressedString = compress(stringToCompress);
var decompressedString = decompress(compressedString);

document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;

<input id="tocompress" placeholder="enter string to compress" />
<button id="compressBtn">
  Compress input
</button>
<div id="display">

</div>
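For what it's worth, the same two-characters-per-unit idea can also be sketched with bitwise operators instead of building binary strings. This is only an illustration, under the assumption that the input is printable ASCII (char codes 32 to 126); `packPairs` and `unpackPairs` are hypothetical names, not part of the snippet above.

```javascript
// Sketch: pack two 7-bit ASCII characters into one UTF-16 code unit.
// Assumption: every input character has a char code between 32 and 126.
function packPairs(text) {
  let out = "";
  for (let i = 0; i < text.length; i += 2) {
    const a = text.charCodeAt(i);
    if (a < 32 || a > 126) throw new RangeError("Unsupported character at index " + i);
    if (i + 1 < text.length) {
      const b = text.charCodeAt(i + 1);
      if (b < 32 || b > 126) throw new RangeError("Unsupported character at index " + (i + 1));
      // High 7 bits hold the first character, low 7 bits the second.
      // The result is at most 14 bits wide, so it stays below the surrogate range.
      out += String.fromCharCode((a << 7) | b);
    } else {
      // Odd length: the last character has no partner and is kept as is.
      out += text.charAt(i);
    }
  }
  return out;
}

function unpackPairs(packed) {
  let out = "";
  for (let i = 0; i < packed.length; i++) {
    const code = packed.charCodeAt(i);
    if (code > 127) {
      // A packed unit: split it back into its two 7-bit halves.
      out += String.fromCharCode(code >> 7, code & 0x7f);
    } else {
      // An unpaired trailing character.
      out += packed.charAt(i);
    }
  }
  return out;
}

const demoPacked = packPairs("I like bananas !");
console.log(demoPacked.length, unpackPairs(demoPacked));
```

Because every first character is at least 32, a packed unit is always above 127, which is what lets `unpackPairs` tell packed units from a leftover single character.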

So why not push the logic further and use UTF-32, which should be 4 bytes, meaning four 1-byte characters? One problem is that JavaScript strings are made of 2-byte units. It's true that you can use pairs of 16-bit units (surrogate pairs) to represent characters that don't fit in 16 bits. Like this:

document.getElementById('test').innerHTML = "\uD834\uDD1E";

<div id="test"></div>

But if you test the length of the resulting string, you'll see that it's 2, even if there's only one "character". So from a javascript perspective, you're not reducing the actual string length.
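A small sketch of that check, runnable outside the browser too. The variable names are only illustrative.

```javascript
// U+1D11E (musical symbol G clef) written as a UTF-16 surrogate pair.
const clef = "\uD834\uDD1E";

// .length counts UTF-16 code units, not visible characters.
const codeUnits = clef.length;

// Array.from iterates the string by code point.
const codePoints = Array.from(clef).length;

// codePointAt reads the full code point behind the surrogate pair.
const codePointHex = clef.codePointAt(0).toString(16);

console.log(codeUnits, codePoints, codePointHex);
```

So a single character outside the 16-bit range still costs two units of string length.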

The other thing is that UTF-32 can in fact represent fewer than 2²¹ characters. See here: https://en.wikipedia.org/wiki/UTF-32

It is a protocol to encode Unicode code points that uses exactly 32 bits per Unicode code point (but a number of leading bits must be zero as there are fewer than 2²¹ Unicode code points)

So you don't really have 4 bytes; in fact you don't even have 3, which is what you would need to pack three 1-byte characters. So UTF-32 doesn't seem to be a way to compress even more. And since JavaScript natively uses 2-byte string units, packing two characters per unit seems to me to be the most efficient, using that approach at least.

If your strings only contain ASCII characters [0, 127] you can "compress" the string using a custom 6 or 7-bit code page.

You can do this several ways, but I think one of the simpler methods is to define an array holding all allowed characters - a LUT, lookup-table if you like, then use its index value as the encoded value. You would of course have to manually mask and shift the encoded value into a typed array.

If your LUT looked like this:

var lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";

you would in this case deal with a LUT of length 72, which means we need a 7-bit range, or [0, 127] (if the length were 64 or less we could have reduced it to 6-bit [0, 63] values).
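As a rough sketch, the bit range can be derived from the table length rather than worked out by hand. `bitsNeeded` and `lutExample` are hypothetical names used only for this illustration.

```javascript
// Smallest number of bits that can represent every index in [0, tableLength - 1].
function bitsNeeded(tableLength) {
  let bits = 1;
  // Grow the width until 2^bits covers the whole table.
  while ((1 << bits) < tableLength) bits++;
  return bits;
}

// Copy of the lookup table from the text above.
const lutExample = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";

console.log(lutExample.length, bitsNeeded(lutExample.length)); // this table needs 7 bits
console.log(bitsNeeded(64));                                   // a 64-entry table fits in 6 bits
```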

Then you would take each character in the string and convert it to an index value (you would normally do all the following steps in a single operation but I have separated them for simplicity):

var lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";
var str = "I like bananas !";
var page = [];

Array.prototype.forEach.call(str, function(ch) {
  var i = lut.indexOf(ch);
  if (i < 0) throw new Error("Invalid character - can't encode");
  page.push(i);
});

console.log("Intermediate page:", page);

You can always tweak the LUT so that the most used characters are at the beginning, then support a variable encoding bit-range: find the max index value and use that to determine which range you want to encode in. You can add an initial bit as a flag for which range the encoding uses (for example bit 0 set if 6-bit fits, otherwise use the 7-bit range).
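A minimal sketch of that range selection, assuming only the two ranges mentioned (6-bit and 7-bit). `chooseRange` and the shape of its return value are hypothetical, chosen just for this illustration.

```javascript
// Pick the narrowest range that fits every index of the intermediate page.
// flag = 1 means the 6-bit range is enough, flag = 0 means 7 bits are used.
function chooseRange(indices) {
  const max = Math.max(...indices);
  return max < 64 ? { flag: 1, bits: 6 } : { flag: 0, bits: 7 };
}

console.log(chooseRange([35, 0, 12, 9]));  // all indices below 64: the 6-bit range fits
console.log(chooseRange([35, 0, 67]));     // an index of 64 or more forces the 7-bit range
```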

Now that you know the indices we can start to encode the binary output itself using a 7-bit approach. Since the smallest element a JavaScript typed array offers is a byte, i.e. 8-bit width, we have to do all the split, shift and merge operations manually.

This means we need to keep track of remainder and position on a bit-level.

Say first index value was the following 7-bit value (full 7-bit range for readability - all in pseudo format):

&b01111111

The first step would be to shift it over to bit position 0 and keep track of a remainder:

&b01111111 << 1

Resulting in:

&b11111110
         ^
new bit position: 7
new remainder   : 1

Then the next index value, for example:

&b01010101

would be encoded like this - first convert to 7-bit value in its own byte representation:

&b01010101 << 1 => &b10101010

Then get the remainder part first. To obtain this we shift everything to the right by 8 bits minus the current remainder (within modulo of 8):

remainderValue = &b10101010 >>> (8 - remainder)

leaving us with the following representation:

&b00000001

(Note that we use triple >>> to shift right to avoid issues with sign.)

Next step now is to merge this value with our previous value that has already been encoded and stored into our destination byte array - for this we'll use an OR operation:

Index 0      New value     Result in index 0 (index of dst. array)
&b11111110 | &b00000001 => &b11111111

then go to next index in our destination array and store the rest of the current value, then update the remainder and position.

The "leftover" of the byte is calculated like this using the original (after shifting it) 7-bit byte value:

leftover = &b10101010 << remainder => &b01010100

which we now put into the next position:

Index 0    Index 1   (destination array index, not page index)
&b11111111 01010100
                 ^

new bit position: 14
new remainder   : 2
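The two-value walkthrough above can be reproduced with real JavaScript operators. This is only a sketch of those exact steps; the variable names are made up, and the `& 0xff` masks are needed because JavaScript bitwise operators work on 32-bit integers rather than on bytes.

```javascript
const firstIndex = 0b1111111;   // first 7-bit value
const secondIndex = 0b1010101;  // second 7-bit value

// Shift both values left by one so each 7-bit value starts at the top of its own byte.
const firstShifted = (firstIndex << 1) & 0xff;    // 0b11111110
const secondShifted = (secondIndex << 1) & 0xff;  // 0b10101010

// After the first value, one bit of destination byte 0 is still free.
const remainderBits = 1;

// The top bits of the second value fill the free space in byte 0.
const byte0 = firstShifted | (secondShifted >>> (8 - remainderBits));

// The rest of the second value goes to the start of byte 1.
const byte1 = (secondShifted << remainderBits) & 0xff;

console.log(byte0.toString(2).padStart(8, "0"), byte1.toString(2).padStart(8, "0"));
```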

And so on with the remaining index values. See this answer for actual code on how you can do this in JavaScript - the code in this answer doesn't deal with string encoding per se, but it shows how you can shift byte buffers bit-wise, which is essentially what you need for this task.

To calculate the remainder step, use 8-bits minus your custom bit-range:

step = 8 - newRange (here 7) => 1

This will also be the start remainder. For each character, you'll add the step to remainder after it has been processed, but remember to use modulo 8 (byte width) when you use it for shifting:

remainder += step;
numOfBitsToShift = remainder % 8;

Bit-position uses of course the bit-range, in this case 7:

bitPosition += 7;

Then, to find which indices you're dealing with, divide bitPosition by 8: if the result has a fractional part you have to deal with two indexes (old and new); if it doesn't, the current position represents the new index only (only a shift is needed for the current index value).

You can also use modulo: when the modulo of remainder equals step, you know that you are dealing with a single index in the destination.

To calculate the final length you would use the bit-length and the length of the string, then ceil the result so that all characters will fit into an 8-bit byte array, which is the smallest element size we can get in JavaScript:

dstLength = Math.ceil(7 * str.length / 8);

To decode you just reverse all the steps.
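Putting the steps together, here is a minimal sketch of a 7-bit encoder and decoder. Note that it swaps the remainder/position bookkeeping described above for a small bit accumulator, which produces the same byte layout but is a different way of writing it. `encode7`, `decode7`, `LUT7` and `BITS7` are hypothetical names, and the decoder takes the character count as an assumption, because trailing padding bits could otherwise be read as one extra index 0 character.

```javascript
// Copy of the lookup table from the text above.
const LUT7 = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";
const BITS7 = 7;

function encode7(text) {
  const out = new Uint8Array(Math.ceil(BITS7 * text.length / 8));
  let acc = 0;      // bits collected but not yet written
  let accBits = 0;  // how many bits acc currently holds
  let pos = 0;      // next free index in the destination array
  for (const ch of text) {
    const index = LUT7.indexOf(ch);
    if (index < 0) throw new Error("Invalid character - can't encode: " + ch);
    acc = (acc << BITS7) | index;
    accBits += BITS7;
    while (accBits >= 8) {
      // Write the oldest 8 bits as one byte.
      accBits -= 8;
      out[pos++] = (acc >>> accBits) & 0xff;
    }
    acc &= (1 << accBits) - 1; // keep only the bits not written yet
  }
  // Flush leftover bits, padded with zeros on the right.
  if (accBits > 0) out[pos] = (acc << (8 - accBits)) & 0xff;
  return out;
}

function decode7(bytes, charCount) {
  let text = "";
  let acc = 0, accBits = 0, pos = 0;
  while (text.length < charCount) {
    while (accBits < BITS7) {
      // Pull in one more byte when fewer than 7 bits are available.
      acc = (acc << 8) | bytes[pos++];
      accBits += 8;
    }
    accBits -= BITS7;
    text += LUT7[(acc >>> accBits) & 0x7f];
    acc &= (1 << accBits) - 1;
  }
  return text;
}

const encodedDemo = encode7("I like bananas !");
console.log(encodedDemo.length, decode7(encodedDemo, 16));
```

In practice the character count (or the unused bit count of the last byte) would have to be stored alongside the bytes.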

An alternative, if you use long strings or have to move forward fast, is to use an established compressor such as zlib which has a very compact header as well as good performance in JavaScript in the case of the linked solution. This will also deal with "patterns" in the string to further optimize the resulting size.

Disclaimer: as this is mostly a theoretical answer there might be some errors. Feel free to comment if any are found. Refer to linked answer for actual code example.

for full code see here: https://repl.it/NyMl/1

using the Uint8Array you can work with the bytes.

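let msg = "This is some message";

// Collect the char code of each character (each fits in one byte for ASCII input)
let data = [];

for (let i = 0; i < msg.length; ++i) {
  data[i] = msg.charCodeAt(i);
}

// View the same bytes as 16-bit units: two characters per element.
// Note: the byte length must be even, otherwise the Uint16Array constructor throws.
let i8 = new Uint8Array(data);
let i16 = new Uint16Array(i8.buffer);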
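To actually get a shorter string out of those 16-bit values, and to get the original back, something like the following sketch could work. It assumes ASCII-only input that contains no NUL character; `asciiToUnits` and `unitsToAscii` are hypothetical names, and the byte order inside each unit depends on the platform, so the result is only meant to be decoded on a machine with the same endianness.

```javascript
// Sketch: view ASCII bytes as 16-bit units and turn them into a string half as long.
function asciiToUnits(text) {
  // Pad to an even byte count so the buffer can be viewed as Uint16Array.
  const bytes = new Uint8Array(text.length + (text.length % 2));
  for (let i = 0; i < text.length; i++) {
    bytes[i] = text.charCodeAt(i);
  }
  const units = new Uint16Array(bytes.buffer);
  // Spreading is fine for short strings; very long inputs would need chunking.
  return String.fromCharCode(...units);
}

function unitsToAscii(packed) {
  const units = new Uint16Array(packed.length);
  for (let i = 0; i < packed.length; i++) {
    units[i] = packed.charCodeAt(i);
  }
  // Read the same memory back as single bytes.
  const bytes = new Uint8Array(units.buffer);
  const text = String.fromCharCode(...bytes);
  // Drop the padding byte that was added for odd-length input.
  return text.endsWith("\0") ? text.slice(0, -1) : text;
}

const shortMsg = asciiToUnits("This is some message");
console.log(shortMsg.length, unitsToAscii(shortMsg));
```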

you could also think of a compression like this: http://pieroxy.net/blog/pages/lz-string/demo.html

if you don't want to use a 3rd party library, the lz based compression should be fairly simple. see here (wikipedia)

I use the same library mentioned above, lz-string https://github.com/pieroxy/lz-string, and it creates file sizes that are smaller than most of the binary formats like Protocol Buffers.

I compress via Node.js like this:

var compressedString = LZString.compressToUTF16(str);

And I decompress client side like this:

var decompressedString = LZString.decompressFromUTF16(str);
