utf-8 - Check if a byte sequence is a valid UTF-8 sequence in JavaScript

Is there a simple way to check if a string is a valid UTF-8 sequence in JavaScript?

I really do not want to end up with a regular expression like this:

Regex to detect invalid UTF-8 string

P.S.: I am receiving data from an external API and sometimes (very rarely, but it happens) it returns data with invalid UTF-8 sequences. Trying to put them into PostgreSQL results in an appropriate error.

asked Dec 17, 2013 at 16:09 by zavg; edited Jul 26, 2021 at 21:26 by Peter Mortensen

I don't think that really makes any sense. A string is a list of characters. UTF-8 is a way of representing characters in a binary format. A string in itself does not have an encoding. – njzk2 Commented Dec 17, 2013 at 16:12
Unless you are trying to determine if a string can be represented completely using UTF-8 encoding? – njzk2 Commented Dec 17, 2013 at 16:12
The only way to check for valid UTF-8 is to check whether or not it contains invalid UTF-8 chars. The regex you linked is an effective, concise and efficient way to perform the check. You can, of course, check against your own dictionary, in a custom tuned way. – PA. Commented Dec 17, 2013 at 16:13
I don't know of any built-in method so last time I needed this, I used text.match(/[\x80-\xFF]+/) to gather potential problems, and checked each match against the UTF-8 specification -- 52 lines of code. Using that regexp is actually a pretty neat, fast, and simple way. – Jongware Commented Dec 17, 2013 at 16:14
Or you are trying to figure out if a sequence of bytes can be interpreted as a UTF-8 encoded string? – njzk2 Commented Dec 17, 2013 at 16:15

1 Answer

UTF-8 is in fact a simple encoding, but what you are asking still can't be done with a one-liner. You have to:

1. Override the Content-Type of the response so that your script receives a byte array and the browser or library does not interpret the response itself.
2. Loop over the bytes to build characters. Note that UTF-8 is a variable-length encoding, which is why some byte sequences are invalid.
3. If an invalid octet is found, skip it.
4. If needed, deserialize the JSON/XML/whatever string into a JavaScript object, handling any failures.
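As a rough sketch of the last steps, assuming the response body is already available as raw bytes and that the runtime provides the standard `TextDecoder` (modern browsers and Node.js do; it was not widely available when this answer was written): a decoder created with `fatal: true` throws on malformed input instead of silently substituting U+FFFD. The helper names `decodeUtf8Strict` and `parseJsonBytes` are made up for illustration and are not part of the original answer.

```javascript
// Hypothetical helper: decode bytes as UTF-8 and report malformed input.
// With { fatal: true }, TextDecoder.decode() throws a TypeError on invalid
// UTF-8 instead of inserting the replacement character U+FFFD.
function decodeUtf8Strict(bytes) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  try {
    return { ok: true, text: decoder.decode(bytes) };
  } catch (err) {
    return { ok: false, text: null };
  }
}

// Hypothetical helper: decode first, then deserialize, handling both failures.
function parseJsonBytes(bytes) {
  const decoded = decodeUtf8Strict(bytes);
  if (!decoded.ok) {
    return { ok: false, reason: 'invalid-utf8', value: null };
  }
  try {
    return { ok: true, reason: null, value: JSON.parse(decoded.text) };
  } catch (err) {
    return { ok: false, reason: 'invalid-json', value: null };
  }
}
```

With `fetch`, the raw bytes could be obtained as `new Uint8Array(await response.arrayBuffer())`, which avoids any text decoding by the browser. Note that a decoder created without `fatal: true` does not skip invalid octets; it replaces them with U+FFFD.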

Deciding whether a given byte array is a valid UTF-8 sequence is quite a straightforward task (just a bunch of if statements and bit shifts), but again, it's not a one-line thing.
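The "if statements and bit shifts" approach might look roughly like the sketch below. It is one possible hand-written check, not the answerer's own code; the function name `isValidUtf8` is hypothetical, and the input is assumed to be a `Uint8Array` or a plain array of integers in the range 0 to 255. It rejects truncated sequences, stray continuation bytes, overlong forms, surrogate code points, and values above U+10FFFF.

```javascript
// Hypothetical validator: returns true if `bytes` is well-formed UTF-8.
function isValidUtf8(bytes) {
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let needed; // number of continuation bytes expected after the lead byte
    let min;    // smallest code point allowed for this length (rejects overlong forms)
    let cp;     // code point accumulated from the payload bits

    if ((lead & 0x80) === 0x00) {        // 0xxxxxxx: single-byte (ASCII)
      i += 1;
      continue;
    } else if ((lead & 0xe0) === 0xc0) { // 110xxxxx: two-byte sequence
      needed = 1; min = 0x80; cp = lead & 0x1f;
    } else if ((lead & 0xf0) === 0xe0) { // 1110xxxx: three-byte sequence
      needed = 2; min = 0x800; cp = lead & 0x0f;
    } else if ((lead & 0xf8) === 0xf0) { // 11110xxx: four-byte sequence
      needed = 3; min = 0x10000; cp = lead & 0x07;
    } else {
      return false; // stray continuation byte or an illegal lead byte
    }

    if (i + needed >= bytes.length) return false; // sequence is truncated

    for (let k = 1; k <= needed; k++) {
      const b = bytes[i + k];
      if ((b & 0xc0) !== 0x80) return false; // continuation must be 10xxxxxx
      cp = (cp << 6) | (b & 0x3f);
    }

    if (cp < min) return false;                     // overlong encoding
    if (cp > 0x10ffff) return false;                // beyond the Unicode range
    if (cp >= 0xd800 && cp <= 0xdfff) return false; // UTF-16 surrogate

    i += needed + 1;
  }
  return true;
}
```

For example, `isValidUtf8([0xe2, 0x82, 0xac])` (the euro sign) passes, while `isValidUtf8([0xc0, 0xaf])` fails because it is an overlong encoding of `/`.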
