node.js - JavaScript RegExp matches binary data - Stack Overflow

I'm using RegExp to match a run of bytes other than 0x1b in the sequence [0xcb, 0x98, 0x1b]:

var r = /[^\x1b]+/g;
r.exec(Buffer.from([0xcb, 0x98, 0x1b])); // Buffer.from replaces the deprecated new Buffer(); same behavior
console.log(r.lastIndex);

I expected that the pattern would match 0xcb, 0x98 and r.lastIndex == 2, but it matches 0xcb only and r.lastIndex == 1. Is that a bug or something?

asked May 13, 2017 at 0:21 by Chris Lin
  • What is expected result? [203, 152]? – guest271314 Commented May 13, 2017 at 0:27
  • Yes, [0xcb, 0x98] – Chris Lin Commented May 13, 2017 at 0:31

3 Answers

regexp.exec() implicitly converts its argument to a string and Buffer's default encoding for toString() is UTF-8. So you will not be able to see the individual bytes any longer with that encoding. Instead, you will need to explicitly use the 'latin1' encoding (e.g. Buffer.from([0xcb, 0x98, 0x1b]).toString('latin1')), which is a single-byte encoding (making the result the equivalent of '\xcb\x98\x1b').
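As a rough sketch of the latin1 approach described above (the helper name matchUntilEscape is made up for illustration and is not part of any API), decoding with 'latin1' keeps one character per byte, so the match and lastIndex line up with byte offsets:

```javascript
// Sketch only: matchUntilEscape is a hypothetical helper, not a Node.js API.
const buf = Buffer.from([0xcb, 0x98, 0x1b]);

function matchUntilEscape(bytes) {
  const re = /[^\x1b]+/g;                 // fresh regex, so lastIndex starts at 0
  const text = bytes.toString('latin1');  // one character per byte
  const m = re.exec(text);
  return {
    // Convert the matched characters back to bytes with the same encoding.
    matched: m ? Buffer.from(m[0], 'latin1') : Buffer.alloc(0),
    lastIndex: re.lastIndex,
  };
}

const result = matchUntilEscape(buf);
console.log(result.matched);   // <Buffer cb 98>
console.log(result.lastIndex); // 2
```

Converting the match back with the same 'latin1' encoding is what makes the round trip lossless; mixing encodings between the two steps would change the bytes.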

RegExp.prototype.exec works on strings. This means that the Buffer is being implicitly cast to a string by the toString method.

In doing so, the bytes are read as a UTF-8 string, as UTF-8 is the default encoding. From Node's Buffer documentation:

buf.toString([encoding[, start[, end]]])

encoding <string> The character encoding to decode to. Default: 'utf8'

...

0xcb and 0x98 are read as a single UTF-8 character (˘), thus the third byte ends up being at the 1 index, not the 2 index.
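To see this for yourself, a small check along these lines (variable names are only illustrative) shows that the default decoding turns the three bytes into a two-character string, which is why lastIndex comes out as 1:

```javascript
// Illustrative check of the implicit UTF-8 decoding; names are arbitrary.
const utf8Bytes = Buffer.from([0xcb, 0x98, 0x1b]);

const asUtf8 = utf8Bytes.toString();               // default encoding is 'utf8'
console.log(asUtf8.length);                        // 2 (three bytes, two characters)
console.log(asUtf8.codePointAt(0).toString(16));   // 2d8 (U+02D8, the breve character)
console.log(asUtf8.indexOf('\x1b'));               // 1

// Passing the Buffer straight to exec() triggers the same conversion.
const utf8Re = /[^\x1b]+/g;
const utf8Match = utf8Re.exec(utf8Bytes);
console.log(utf8Match[0] === '\u02d8');            // true
console.log(utf8Re.lastIndex);                     // 1
```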

One option might be to explicitly call the toString method with a different encoding, but I'm thinking regex is probably not the best option here.

You can use Array.prototype.filter() to return an array containing only the values not equal to 0x1b:

Buffer.from([0xcb, 0x98, 0x1b].filter(byte => byte !== 0x1b))
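Note that filter() drops every 0x1b byte wherever it appears, which is not quite the same as taking the run of bytes before the first 0x1b. If the latter is what you want, something like the following might work without any string conversion (bytesBeforeEscape is a hypothetical helper name, and the sample data is made up):

```javascript
// Sketch only: bytesBeforeEscape is a hypothetical helper, not a Node.js API.
function bytesBeforeEscape(data) {
  const end = data.indexOf(0x1b);               // -1 when no 0x1b is present
  return end === -1 ? data : data.subarray(0, end);
}

// Made-up sample with a byte after the escape to show the difference.
const sample = Buffer.from([0xcb, 0x98, 0x1b, 0x41]);

const prefix = bytesBeforeEscape(sample);
const filtered = Buffer.from([...sample].filter(byte => byte !== 0x1b));

console.log(prefix);   // <Buffer cb 98>
console.log(filtered); // <Buffer cb 98 41>
```

subarray() returns a view over the same memory rather than a copy, so wrap the result in Buffer.from(...) if you need an independent buffer.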