
encoding - How to Convert UTF8 ArrayBuffer to UTF16 JavaScript String - Stack Overflow


The answers from here got me started on how to use the ArrayBuffer:

Converting between strings and ArrayBuffers

However, they take quite a few different approaches. The main one is this:

function ab2str(buf) {
  return String.fromCharCode.apply(null, new Uint16Array(buf));
}

function str2ab(str) {
  var buf = new ArrayBuffer(str.length*2); // 2 bytes for each char
  var bufView = new Uint16Array(buf);
  for (var i=0, strLen=str.length; i<strLen; i++) {
    bufView[i] = str.charCodeAt(i);
  }
  return buf;
}

I wanted to clarify the difference between UTF-8 and UTF-16 encoding, though, because I'm not 100% sure this is correct.

So in JavaScript, in my understanding, all strings are UTF16 encoded. But the raw bytes you might have in your own ArrayBuffer can be in any encoding.
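
(For illustration, a quick console check shows how JavaScript exposes UTF-16 code units; the characters below are just examples I picked:)

var clef = '𝄞';                               // U+1D11E, outside the Basic Multilingual Plane
console.log(clef.length);                     // 2 - stored as a surrogate pair of UTF-16 code units
console.log(clef.charCodeAt(0).toString(16)); // "d834" (high surrogate)
console.log(clef.charCodeAt(1).toString(16)); // "dd1e" (low surrogate)
console.log('å'.charCodeAt(0));               // 229 - a single code unit, even though UTF-8 needs two bytes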

So say that I have provided an ArrayBuffer to the browser from an XMLHttpRequest, and those bytes from the backend are in UTF8 encoding:

var r = new XMLHttpRequest()
r.open('GET', '/x', true)
r.responseType = 'arraybuffer'
r.onload = function(){
  var b = r.response
  if (!b) return
  var v = new Uint8Array(b)
}
r.send(null)

So now we have the ArrayBuffer b from the response r in the Uint8Array view v.

The question is: if I want to convert this into a JavaScript string, what should I do?

From my understanding, the raw bytes we have in v are encoded in UTF8 (and were sent to the browser encoded in UTF8). If we were to do this though, I don't think it would work right:

function ab2str(buf) {
  return String.fromCharCode.apply(null, new Uint16Array(buf));
}

From my understanding, since the bytes are in UTF-8 and JavaScript strings are in UTF-16, you need to do this instead:

function ab2str(buf) {
  return String.fromCharCode.apply(null, new Uint8Array(buf));
}

So, use Uint8Array instead of Uint16Array. That is the first question: how to go from UTF-8 bytes to a JS string.
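
To illustrate the kind of case I am worried about (the bytes below are just hypothetical examples of what the backend might send):

function ab2str(buf) {
  return String.fromCharCode.apply(null, new Uint8Array(buf));
}

var ascii = new Uint8Array([104, 105]).buffer;            // UTF-8 bytes for "hi"
var nonAscii = new Uint8Array([104, 0xC3, 0xA5]).buffer;  // UTF-8 bytes for "hå"
console.log(ab2str(ascii));    // "hi" - looks fine
console.log(ab2str(nonAscii)); // "hÃ¥" - the two-byte å is not decoded into one character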

The second question is how to go back to UTF-8 bytes from a JavaScript string. That is, I am not sure this would encode correctly:

function str2ab(str) {
  var buf = new ArrayBuffer(str.length*2); // 2 bytes for each char
  var bufView = new Uint16Array(buf);
  for (var i=0, strLen=str.length; i<strLen; i++) {
    bufView[i] = str.charCodeAt(i);
  }
  return buf;
}

I am not sure what to change in this one though, to get back to a UTF8 ArrayBuffer. Something like this seems incorrect:

function str2ab(str) {
  var buf = new ArrayBuffer(str.length*2); // 2 bytes for each char
  var bufView = new Uint8Array(buf);
  for (var i=0, strLen=str.length; i<strLen; i++) {
    bufView[i] = str.charCodeAt(i);
  }
  return buf;
}
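
To illustrate the concern in the other direction (again with hypothetical input): charCodeAt gives UTF-16 code units, so writing them into single bytes does not produce UTF-8, and anything above 255 gets truncated:

// using the Uint8Array-based str2ab above
console.log(new Uint8Array(str2ab('å'))); // [229, 0] - not the UTF-8 bytes [195, 165]
console.log(new Uint8Array(str2ab('€'))); // [172, 0] - 0x20AC truncated to its low byte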

Anyway, I am just trying to clarify exactly how to go from UTF-8 bytes, which encode a string coming from the backend, to a UTF-16 JavaScript string on the frontend.


asked Jul 24, 2018 at 21:12 by Lance Pollard
  • "String.fromCharCode.apply(null, new Uint8Array(buf))" - no, that only works for ASCII strings. You'll need a proper TextDecoder (and a TextEncoder for reversal). – Bergi Commented Feb 4, 2023 at 20:31

2 Answers


Why not use the TextDecoder interface instead of rolling your own? Are you limited to a browser that doesn't support it?

const decoder = new TextDecoder('UTF-8')
const dataStr = decoder.decode(dataBuf) 
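
For example, applied to the asker's XHR scenario (a minimal sketch, assuming the response bytes really are UTF-8; /x is just the placeholder endpoint from the question):

var r = new XMLHttpRequest();
r.open('GET', '/x', true);
r.responseType = 'arraybuffer';
r.onload = function () {
  var b = r.response;
  if (!b) return;
  // UTF-8 bytes -> JS (UTF-16) string
  var str = new TextDecoder('utf-8').decode(new Uint8Array(b));
  // JS string -> UTF-8 bytes (a Uint8Array backed by an ArrayBuffer)
  var bytes = new TextEncoder().encode(str);
  console.log(str, bytes.buffer);
};
r.send(null);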

We need a few facts to understand what is happening:

1. JS uses UTF-16

First of all, JS uses UTF-16 to store characters, as mentioned in the "Unicode strings" section here: https://developer.mozilla.org/en-US/docs/Web/API/btoa

2. UTF-16 and UTF-8

UTF-8 and UTF-16 do not mean that every character is represented by one byte or by two bytes. UTF-8, like UTF-16, is a variable-length encoding.
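
For example (assuming TextEncoder is available), the number of UTF-8 bytes per character varies:

var enc = new TextEncoder();
console.log(enc.encode('a').length);  // 1 byte
console.log(enc.encode('å').length);  // 2 bytes
console.log(enc.encode('€').length);  // 3 bytes
console.log(enc.encode('😀').length); // 4 bytes (and '😀'.length is 2 UTF-16 code units in a JS string)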

3. ArrayBuffer and encodings

"hello" by one byte (Uint8Array): [104, 101, 108, 108, 111]
the same by two bytes (Uint16Array): [0, 104, 0, 101, 0, 108, 0, 108, 0, 111]

There is no encoding in an ArrayBuffer itself, because an ArrayBuffer just holds raw numbers (bytes).

Iterating over the second array is different from iterating over the first one, and a two-byte number cannot be packed into a one-byte slot.
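
A small sketch of that difference (the byte order shown assumes a little-endian machine):

var buf = new ArrayBuffer(4);
var u16 = new Uint16Array(buf);
u16[0] = 104; // 'h'
u16[1] = 101; // 'e'
console.log(Array.from(new Uint16Array(buf))); // [104, 101]       - two 16-bit numbers
console.log(Array.from(new Uint8Array(buf)));  // [104, 0, 101, 0] - the same four raw bytes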


When you receive a response from the server in UTF-8, you receive it as a sequence of bytes. If every character in the data is stored in a single byte, your code will work fine - that is the case for characters like [a-zA-Z0-9] and common punctuation. But if you receive a character that UTF-8 stores in two bytes, the transcription into UTF-16 will be incorrect:

0xC3 0xA5 (one symbol å) -> 0x00 0xC3 0x00 0xA5 (two symbols "Ã¥")

So as long as you never transfer characters outside the range of Latin letters, digits, and punctuation, you can use your code and it will appear to work, even though it is not correct.
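
A quick check of that example (assuming TextDecoder is available; the two bytes are the UTF-8 encoding of å):

var utf8 = new Uint8Array([0xC3, 0xA5]);            // UTF-8 bytes for "å"
console.log(String.fromCharCode.apply(null, utf8)); // "Ã¥" - bytes misread as UTF-16 code units
console.log(new TextDecoder('utf-8').decode(utf8)); // "å"  - decoded correctly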
