
How viable is base128 encoding for scenarios like JavaScript strings? - Stack Overflow


I recently found that base32, base64 and base128 are the most efficient forms of base-n encoding, and that while base58, Ascii85, base91, base92 et al do provide some efficiency improvements over the ubiquitous base64 due to their use of more characters, there are some mapping losses; for example, there happen to be 272 indices per character-pair in base92 that are impossible to map to from base-10 powers of 2 and are thus completely wasted. (Base91 encoding only has a similar loss of 89 characters (as found by the script in the link above) but it's patented.)

It would be great if it were viable to use base128 in modern-day real-world scenarios.

There are 92 characters available within 0x21 (33) to 0x7E (126) sans \ and ", which make for a great start to creating JSONifiable strings with the most characters possible.
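The 92-character alphabet is easy to enumerate directly (a quick JavaScript sketch; the variable name is my own):

```javascript
// Enumerate the printable ASCII range 0x21-0x7E, excluding the two
// characters that would need escaping inside a JSON string: " and \
const alphabet = [];
for (let code = 0x21; code <= 0x7E; code++) {
  if (code !== 0x22 /* " */ && code !== 0x5C /* \ */) {
    alphabet.push(String.fromCharCode(code));
  }
}
console.log(alphabet.length); // 94 printable characters minus 2 = 92
```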

Here are a few ways I envisage the rest of the characters could be found. This is the question I'm asking.

  • Just dumbly use Unicode

    Two-byte Unicode characters could be used to fill in the remaining 36 required indices. Highly suboptimal; I wouldn't be surprised if this was worse than base64 on the wire. Would only be useful for Unicode character counting scenarios like tweet length. Not exactly what I'm going for.
     

  • Select 36 non-Unicode characters from within the upper (>128) ASCII range

    JavaScript was built with the expectation that character encoding configuration will occasionally go horribly wrong. So the language (and web browsers) handle printing arbitrary and unprintable binary data just fine. So why not just use the upper ASCII range? It's there to be used, right?

    One very real problem could be data going over HTTP and falling through one or more "can-opener" proxies on the way between my browser and the server. How badly could this go? I'm aware that WebSockets over HTTP caused some real pain a couple of years ago, and potentially even today.
     

  • Kind of use UTF-8 in interesting ways

    UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!
     

  • Determine 36 magic bytes that will work for various esoteric reasons

    Maybe there are some high ASCII characters that will successfully traverse >99% of the Internet infrastructure for various historical or implementational reasons. What characters might these be?

 
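To make the third idea above concrete: the bytes of the form 10xxxxxx are exactly 0x80 through 0xBF, 64 values in total (a quick sketch):

```javascript
// Collect every byte of the form 10xxxxxx, i.e. a valid UTF-8
// continuation byte. The top two bits fixed at "10" leave 6 free bits,
// hence 64 possible values.
const continuationBytes = [];
for (let b = 0; b <= 0xFF; b++) {
  if ((b & 0xC0) === 0x80) continuationBytes.push(b);
}
console.log(continuationBytes.length);           // 64
console.log(continuationBytes[0].toString(16));  // "80"
console.log(continuationBytes[63].toString(16)); // "bf"
```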

Base64 is ubiquitous and has wound up being used everywhere, and it's easy to understand why: it was defined in 1987 to use a carefully-chosen, very restricted alphabet of A-Z, a-z, 0-9, + and / that was (and remains) difficult for most environments (such as mainframes using non-ASCII encoding) to have problems with.

EBCDIC mainframes and MIME email are still very much out there, but today base64 has also wound up as a heavily-used pipe within JavaScript to handle the case of "something in this data path might choke on binary", and the collective overhead it adds is nontrivial.

There's currently only one other question on SO regarding the general viability of base128 encoding, and literally every single answer has one or more issues. The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW). All the others have various problems (which I can explain further if desired).

This question is an attempt to re-ask the above with some additional unambiguous subject clarification, in the hope that a concrete go/no-go can be determined.


asked Apr 8, 2017 at 3:01 by i336_; edited May 23, 2017 at 12:25 by CommunityBot 2
  • IMHO base91 is the best solution, having both good encoded size and encode/decode speed. If you want base128 then it's better to use a binary format. A little bit below that you can use github.com/kevinAlbs/Base122 – phuclv Commented Apr 16, 2018 at 15:27
  • 92 characters available within 0x21 (33) to 0x7E (126) sans \ and ". Well, if you add the space character 0x20 you get 93 characters available. – Cœur Commented Jun 4, 2018 at 8:11

5 Answers


Select 36 non-Unicode characters from within the upper (>128) ASCII range

Base128 is not effective because you must use characters with codes greater than 128. For characters with codes >= 128, Chrome sends two bytes... (so a string with 1 MB of these characters will become 2 MB of bytes on sending... so you lose all the profit). This phenomenon doesn't appear for base64 strings (so we lose only ~33%). More details here in the "update" section.
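The doubling this answer describes is easy to verify: when a JavaScript string is serialized as UTF-8, every code unit at or above 0x80 costs two bytes (a quick Node.js sketch):

```javascript
// One character below 0x80 vs. one at/above it, measured as UTF-8 bytes.
const ascii = 'A';      // U+0041, one byte in UTF-8
const high = '\u00FF';  // U+00FF, the kind of "upper range" character proposed
console.log(Buffer.byteLength(ascii, 'utf8')); // 1
console.log(Buffer.byteLength(high, 'utf8'));  // 2

// A 1,000-character string of such characters serializes to 2,000 bytes:
console.log(Buffer.byteLength(high.repeat(1000), 'utf8')); // 2000
```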

It's viable in the sense of being technically possible, but it's not viable in the sense of being able to achieve a result better than a much simpler alternative: using HTTP gzip compression. In practice if compression is enabled, the Huffman encoding of the strings will negate the 1/3 increase in size from base64 encoding because each character in the base64 string has only 6 bits of entropy.

As a test, I tried generating a 1 MB file of random data using a utility like Dummy File Creator. Then I base64 encoded it and gzipped the resulting file using 7zip.

  • Original data: 1,048,576 bytes
  • Base64 encoded data: 1,398,104 bytes
  • Gzipped base64 encoded data: 1,060,329 bytes

That's only a 1.12% increase in size (and the overhead of encoding -> compressing -> decompressing -> decoding).

Base128 encoding would take 1,198,373 bytes, so you'd have to compress it too if you wanted comparable file size. Gzip compression is a standard feature in all modern browsers, so what's the case for base128 and all the extra complexity that would entail?

The reason base64 is used so much is that it uses English letters and numbers to encode a binary stream. Technically we can use higher bases, but the problem with them is that they will need to fit some character set.

UTF-8 is one of the most widely used charsets, and if you are using XML or JSON to transmit data, you can very well use a Base256 encoding like the one below:

https://github.com/bharatmicrosystems/base256

  • Kind of use UTF-8 in interesting ways

    UTF-8 defines 1- to 4-byte long sequences to encapsulate Unicode codepoints. Bytes 2 to 4 always start with 10xxxxxx. There are 64 characters within that range. If I pass through a naïve proxy that filters characters outside the Unicode range on a character-by-character basis, using bytes within this range might mean my data would get through unscathed!

This is actually quite viable and has been used in base-122. Despite the name, it's in fact base-128, because the 6 invalid values (128 – 122) are encoded specially, so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits are encoded in 1 byte; in reality it can be optimized to be more efficient than base-128.

Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
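As a rough illustration of the scheme quoted above, here is a simplified, unoptimized encoder sketch (Node.js assumed; the real kevinAlbs/Base122 implementation also handles a "shortened last chunk" marker for clean decoding, which this sketch omits):

```javascript
// The six "illegal" 7-bit values that cannot appear raw in an
// HTML/JS string context: null, \n, \r, ", &, backslash.
const ILLEGAL = [0x00, 0x0A, 0x0D, 0x22, 0x26, 0x5C];

function encodeBase122(data) {
  const out = [];
  let bitIndex = 0; // absolute bit position within `data`

  // Read the next 7 bits of input as one chunk (zero-padded at the end).
  function next7() {
    if (bitIndex >= data.length * 8) return null;
    let chunk = 0;
    for (let i = 0; i < 7; i++, bitIndex++) {
      chunk <<= 1;
      if (bitIndex < data.length * 8) {
        chunk |= (data[bitIndex >> 3] >> (7 - (bitIndex & 7))) & 1;
      }
    }
    return chunk;
  }

  let chunk;
  while ((chunk = next7()) !== null) {
    const s = ILLEGAL.indexOf(chunk);
    if (s === -1) {
      out.push(chunk); // legal chunk: one byte, 0xxxxxxx
    } else {
      // Illegal chunk: fold it and the next chunk into a two-byte
      // UTF-8 character of the form 110sss1x 10xxxxxx.
      const x = next7() ?? 0;
      out.push(0xC2 | (s << 2) | ((x >> 6) & 1));
      out.push(0x80 | (x & 0x3F));
    }
  }
  return Buffer.from(out);
}

const encoded = encodeBase122(Buffer.from([0x00, 0x0A, 0x41, 0xFF, 0x5C]));
// The output contains none of the illegal bytes and is valid UTF-8.
```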

§2.2 Base-122 Encoding

You can find the implementation on GitHub.


The accepted answer suggests that base128 must exactly use the first 128 characters of ASCII, ...

Base-122 doesn't exactly use the first 128 ASCII characters, so it can be stored normally in a null-terminated string. But as for:

... and the only answer that acknowledges that the encoded alphabet can use any characters proceeds to claim that base128 is not in use because the encoded characters must be easily retypeable (which base58 is optimized for, FWIW)

Encodings that use non-printable characters are generally not for typing by hand but for transmission. For example, base-122 is optimized for storing binary data in JavaScript strings in a UTF-8 HTML file, which probably works best for your use case.

Base128 is viable for strings. Two Base128 values (7 bits each) fit into one 16-bit character. You can set the highest bit to 1, and escape any control characters or string-specific characters ('/', '\' and '"') in the low byte whenever the high byte happens to be zero. Why not use Base256 and use every bit? To escape unwanted characters you can keep another string with pairs of positions and instructions for how to recode those positions.
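A minimal sketch of the pairing idea (an illustration of this answer's suggestion, not a production scheme; note that some resulting code units land in the surrogate range 0xD800-0xDFFF, so such strings are fine in memory but not safe to serialize as UTF-8):

```javascript
// Pack pairs of 7-bit values into single UTF-16 code units, with the
// highest bit forced to 1 so the high byte is never zero.
function packPairs(values) {
  let s = '';
  for (let i = 0; i < values.length; i += 2) {
    const hi = values[i];
    const lo = values[i + 1] ?? 0; // zero-pad an odd-length input
    s += String.fromCharCode(0x8000 | (hi << 8) | lo);
  }
  return s;
}

function unpackPairs(s, count) {
  const values = [];
  for (let i = 0; i < s.length; i++) {
    // charCodeAt (not codePointAt) so lone surrogates round-trip too.
    const unit = s.charCodeAt(i);
    values.push((unit >> 8) & 0x7F, unit & 0x7F);
  }
  return values.slice(0, count); // drop the padding value, if any
}

const input = [5, 127, 0, 64, 88];
const packed = packPairs(input);
console.log(packed.length);                     // 3 code units for 5 values
console.log(unpackPairs(packed, input.length)); // [ 5, 127, 0, 64, 88 ]
```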

Base128 strings are viable for everything (variables, localStorage, even sending, because HTTP and modern servers are 8-bit clean) but not for files (except latin1).

Storing a string as a UTF-8 file restricts you to at most Base128 (7 bits), because the 8th bit is used in UTF-8 to distinguish ASCII (bit is 0) from a Unicode character stored in more than one byte (bit is 1).

The propositions you have made all amount to the same thing - using another byte:

  • Just dumbly use Unicode

Unicode needs another byte!

  • Select 36 non-Unicode characters from within the upper (>128) ASCII range

In UTF-8 you cannot use the upper ASCII range without setting bit 8 to 1, which means another byte is used.

  • Kind of use UTF-8 in interesting ways

Here you proposed another byte. Using all four bytes is even worse, because the upper bits are reserved/lost/not useful in the encoding.

  • Determine 36 magic bytes that will work for various esoteric reasons

Those magic bytes would almost all have to be ASCII codes below 32, but at the very least not null, backslash, ampersand, newline or carriage return. You propose "high ASCII characters"; those need another byte.

The closest thing to a Base128 is Base-122, which in fact is a Base128 because "the other byte is the next byte". Unfortunately, base-64 seems to compress better than base-122, which may be due to the more redundant sequences of bits in base-64 being easier to compress. Interestingly, using base-64 with gzip compresses it enough to make it smaller than the original. If compressed before encoding, then of course Base-122 would win. A good compression takes away redundant information, making the entropy more random with an even distribution of frequencies; no encoding has anything left to gain from it.

Another proposal is to avoid setting the 8th bit - so avoiding handling another byte. Then the nearest we come to is basE91 and Base94.

Another proposal I can think of: the first three characters tell what escaped character should be inserted at which position. After that position, another three bytes tell the offset to the next position... and so on. If the number doesn't fit in two characters, it can jump as far as possible, which signals that another "jump" follows.

The first of the three characters is the escape (an illegal string/HTML/UTF-8 character, a control character, '/', '"', '&', '<', '>', ...). It can also be a code telling that the next two characters are a length (an encoded number) of a following sequence of escaped characters to insert at once, or a code for a run-length of the same character.
