
javascript - Encoding MessagePack objects containing Node.js Buffers - Stack Overflow


I'm using node-msgpack to encode and decode messages passed around between machines. One thing I'd like to be able to do is wrap raw Buffer data in an object and encode that with Messagepack.

msgpack = require('msgpack')
buf = <Buffer 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 00 00 ...>
obj = {foo: buf}
packed = msgpack.pack(obj)

In the example above, I wanted to do a consistency check on the raw bytes of buffers nested in an object. So buf was obtained like so:

var buf = fs.readFileSync('some_image.png');

In a perfect world, I would have obtained:

new Buffer(msgpack.unpack(packed).foo);

#> <Buffer 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 00 00 ...>

Instead, I end up with some random bytes. Digging a little deeper, I came upon the following curiosity:

enc = 'ascii'
new Buffer(buf.toString(enc), enc)
#> <Buffer *ef bf bd* 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 00 00 ...>

buf
#> <Buffer *89* 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52 00 00 02 00 ...>

The first byte is the problem. I tried using different encodings with no luck. What is happening here, and what can I do to circumvent the issue?

EDIT:

Originally, the buf was a buffer I had generated with msgpack itself, thus double-packing data. To avoid any confusion, I replaced that with another buffer obtained by reading an image, which raised the same problem.


edited Jan 3, 2013 at 5:42 by matehat; asked Dec 18, 2012 at 1:20 by matehat
  • Where is the "ac" coming from in your original code? > new Buffer("Hello World!") gives me "<Buffer 48 65 6c 6c 6f 20 57 6f 72 6c 64 21>" (without the "ac") – Hector Correa Commented Dec 18, 2012 at 2:13
  • Quoting the above, buf = msgpack.pack("Hello World!") :) It's a prefix msgpack puts there to know that the following bytes are raw bytes and to encode its length. That's why I'm expecting msgpack.unpack(new Buffer(msgpack.unpack(packed).foo)); to return "Hello World!". – matehat Commented Dec 18, 2012 at 2:17
  • It looks like you are double packing. (1) when you do buf = msgpack.pack("Hello World!") and (2) when you do packed = msgpack.pack(obj), right? – Hector Correa Commented Dec 18, 2012 at 2:22
  • Well I'm only double packing because I want to check the general case when a raw Buffer is packed using msgpack. If the buffer is still consistent after packing and unpacking, it should work, right? Whether that buf is obtained from compressed data or a JPG file shouldn't matter, I guess. – matehat Commented Dec 18, 2012 at 2:29

1 Answer


The buffer corruption problem occurs whenever binary data is decoded using any text encoding other than base64 or hex, and node-msgpack doesn't seem to pick either of those up. It automatically falls back to 'utf-8', which irreversibly corrupts the buffer. They had to do something like that so we don't end up with a bunch of Buffer objects in place of ordinary strings, which is what most of our msgpack objects are usually made of.


EDIT:

The three bytes shown to be problematic above represent the UTF-8 replacement character. A quick test shows that this character replaced the unrecognizable 0x89 byte at the start:

new Buffer((new Buffer('89', 'hex')).toString('utf-8'), 'utf-8')
//> <Buffer ef bf bd>

This line of C++ code from node-msgpack is responsible for the behavior. When it intercepts a Buffer instance in a data structure given to the encoder, it just blindly converts it to a String, equivalent to executing buffer.toString(), which by default assumes UTF-8 encoding and replaces every unrecognizable byte sequence with the character above.

The alternative module suggested below works around this by leaving the buffer as raw bytes rather than trying to convert it to a string, but by doing so it is incompatible with other MessagePack implementations. If compatibility is a concern, a workaround is to encode non-UTF-8 buffers ahead of time with a binary-safe encoding like binary, base64 or hex. base64 and hex will inevitably grow the size of the data by a significant amount, but they leave it consistent and are safest to use when transporting data across HTTP. If size is a concern as well, piping the MessagePack result through a streaming compression algorithm like Snappy can be a good option.


Turns out another module, msgpack-js (a msgpack encoder/decoder written entirely in JavaScript), leaves raw binary data as such, hence solving the above problem. Here's how its author describes it:

I've extended the format a little to allow for encoding and decoding of undefined and Buffer instances.

This required three new type codes that were previously marked as "reserved". This change means that using these new types will render your serialized data incompatible with other messagepack implementations that don't have the same extension.

As a bonus, it's also more performant than the C++ extension-based module mentioned earlier. It's also much younger, so maybe not as thoroughly tested. Time will tell. Here is the result of a quick benchmark I did, based off the one included in node-msgpack, comparing the two libraries (as well as the native JSON parser):

node-msgpack pack:   3793 ms
node-msgpack unpack: 1340 ms

msgpack-js pack:   3132 ms
msgpack-js unpack: 983 ms

json pack:   1223 ms
json unpack: 483 ms

So while we see a performance improvement with the pure-JavaScript msgpack codec, JSON is still way more performant.
