Check if a byte sequence is a valid UTF-8 sequence in JavaScript - Stack Overflow

Is there a simple way to check if a string is a valid UTF-8 sequence in JavaScript?

I really do not want to end up with a regular expression like this:

Regex to detect invalid UTF-8 string

P.S.: I am receiving data from external API and sometimes (very rarely but it happens) it returns data with invalid UTF-8 sequences. Trying to put them into PostgreSQL results in an appropriate error.

asked Dec 17, 2013 at 16:09 by zavg; edited Jul 26, 2021 at 21:26 by Peter Mortensen
  • I don't think that really makes any sense. A string is a list of characters. UTF-8 is a way of representing characters in a binary format. A string in itself does not have an encoding. – njzk2 Commented Dec 17, 2013 at 16:12
  • Unless you are trying to determine if a string can be represented completely using UTF-8 encoding? – njzk2 Commented Dec 17, 2013 at 16:12
  • The only way to check for valid UTF-8 is to check whether or not it contains invalid UTF-8 characters. The regex you linked is an effective, concise and efficient way to perform the check. You can, of course, check against your own dictionary, in a custom-tuned way. – PA. Commented Dec 17, 2013 at 16:13
  • I don't know of any built-in method, so last time I needed this, I used text.match(/[\x80-\xFF]+/) to gather potential problems, and checked each match against the UTF-8 specification -- 52 lines of code. Using that regexp is actually a pretty neat, fast, and simple way. – Jongware Commented Dec 17, 2013 at 16:14
  • Or are you trying to figure out if a sequence of bytes can be interpreted as a UTF-8 encoded string? – njzk2 Commented Dec 17, 2013 at 16:15

1 Answer

UTF-8 is in fact a simple encoding, but still what you are asking can't be done with a one-liner. You have to:

  1. Override the Content-Type of the response so that your script receives a byte array, and prevent the browser/library from interpreting the response itself
  2. Loop over the bytes to build characters. Note that UTF-8 is a variable-length encoding, and that's why some sequences are invalid.
  3. If an invalid octet is found, skip it
  4. If needed, deserialize the JSON/XML/whatever string to a JavaScript object, handling any failures
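
The steps above could be sketched roughly as follows. This is not the answerer's code: the helper names (`fetchBytes`, `decodeSkippingInvalid`, `parseJsonBytes`) are made up for illustration, and it assumes a runtime that provides the standard `fetch` and `TextDecoder` APIs. Instead of looping by hand, it leans on `TextDecoder`, which by default replaces each malformed sequence with U+FFFD.

```javascript
// Illustrative helper names; none of these come from the original answer.

// Step 1: request the raw bytes so nothing decodes the body for us.
// Assumes a runtime with the standard fetch API; `url` is a placeholder.
async function fetchBytes(url) {
  const response = await fetch(url);
  return new Uint8Array(await response.arrayBuffer());
}

// Steps 2 and 3: decode leniently. By default TextDecoder replaces each
// malformed sequence with U+FFFD, which is then stripped out.
// Caveat: a genuine U+FFFD in the input is stripped as well.
function decodeSkippingInvalid(bytes) {
  const text = new TextDecoder("utf-8").decode(bytes);
  return text.replace(/\uFFFD/g, "");
}

// Step 4: parse, reporting failure instead of throwing.
function parseJsonBytes(bytes) {
  try {
    return { ok: true, value: JSON.parse(decodeSkippingInvalid(bytes)) };
  } catch (error) {
    return { ok: false, error };
  }
}
```

If rejecting bad input is preferable to skipping it, constructing the decoder as `new TextDecoder("utf-8", { fatal: true })` makes `decode` throw a `TypeError` on the first malformed sequence.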

Deciding if a certain array is a valid UTF-8 sequence is quite a straightforward task (just a bunch of if statements and bit shifts), but again it's not a one-liner.
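
As a minimal sketch of what such a check might look like, here is a hand-written validator. The function name `isValidUtf8` is an assumption, not something from the answer, and the rules it applies (no overlong forms, no surrogates, nothing above U+10FFFF) follow the UTF-8 definition in RFC 3629. It expects a `Uint8Array` or a plain array of byte values.

```javascript
// Hypothetical validator; returns true only for well-formed UTF-8.
function isValidUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    if (lead <= 0x7f) { // single-byte (ASCII) character
      i += 1;
      continue;
    }

    let need; // number of continuation bytes that must follow
    let min;  // smallest code point allowed for this length
    let cp;   // code point accumulated so far
    if ((lead & 0xe0) === 0xc0) {
      need = 1; min = 0x80; cp = lead & 0x1f;
    } else if ((lead & 0xf0) === 0xe0) {
      need = 2; min = 0x800; cp = lead & 0x0f;
    } else if ((lead & 0xf8) === 0xf0) {
      need = 3; min = 0x10000; cp = lead & 0x07;
    } else {
      return false; // stray continuation byte, or 0xF8..0xFF
    }

    if (i + need >= bytes.length) return false; // sequence is truncated

    for (let k = 1; k <= need; k++) {
      const next = bytes[i + k];
      if ((next & 0xc0) !== 0x80) return false; // not of the form 10xxxxxx
      cp = (cp << 6) | (next & 0x3f);
    }

    if (cp < min) return false;                     // overlong encoding
    if (cp >= 0xd800 && cp <= 0xdfff) return false; // UTF-16 surrogate
    if (cp > 0x10ffff) return false;                // beyond Unicode range

    i += need + 1;
  }
  return true;
}
```

For example, `isValidUtf8(new Uint8Array([0xe2, 0x82, 0xac]))` (the euro sign) returns `true`, while `isValidUtf8(new Uint8Array([0xc0, 0x80]))` (an overlong encoding of U+0000) returns `false`.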