c - If I am using UTF-8 strings is it risky to use standard string handling that assumes null termination?

From what I understand it is very rare for UTF-8 strings to have embedded NULLs, however there is the case that a person can put a NULL into a Unicode string explicitly with "X\0Y" or something like that. Apparently the Unicode standard supports an embedded NULL in this way. However, as far as I can see there is no use of NULLs outside of this, or at least no common use of NULLs in UTF-8 encodings.

So, the question is: if I allow users of my software to supply any UTF-8 string, am I taking a significant risk by processing those strings with functions that assume strings are NULL terminated? In other words, how often am I likely to encounter an embedded NULL "in the wild"?

asked Feb 16 at 18:23 by Tyler Durden, edited Feb 17 at 14:29 by int main

1 Duplicate? Can UTF-8 contain zero byte? – Weather Vane Commented Feb 16 at 18:33
@WeatherVane Yes, I know that. In fact, in my question I said I know that it can contain a NULL byte. That is not my question. – Tyler Durden Commented Feb 16 at 18:36
1 You can put a null byte into any string, and your question says "rare" and "no common use of NULLs in UTF-8 encodings.", whereas the linked question says "none". – Weather Vane Commented Feb 16 at 18:39
4 There is absolutely nothing special whatsoever about UTF-8 compared to any other encoding, so you could restate your question by removing all mentions of UTF-8 from it. – n. m. could be an AI Commented Feb 16 at 18:42
2 Note that the C standard §7.1.1 Definitions of terms ¶1 defines: A string is a contiguous sequence of characters terminated by and including the first null character. No string includes anything beyond the first null byte. "String literals" can contain null bytes, but the library's string functions can't read beyond the first null byte (though the memory functions can if the code knows that there is extra data there — via sizeof, for example). I've linked to the C11 text, but all the editions from C90 through C23 contain the same text. – Jonathan Leffler Commented Feb 17 at 3:30


2 Answers


Definition (from any C standard):

A string is a contiguous sequence of characters terminated by and including the first null character

It follows that there are no embedded null characters in C strings.

Your users cannot put a null character in the middle of a string. It's physically and logically impossible. The string ends where the null character is, by definition.

A user can put a null character in the middle of a character array (which is not a string) or in the middle of a file (which is also not a string). How to deal with those is up to you. The encoding is irrelevant. UTF-8 does not pose any additional challenges in this regard compared to any other encoding.
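To make the distinction between a string and a character array concrete, here is a minimal sketch. The helper name has_embedded_nul is made up for illustration; it only shows that code which tracks an explicit length can see past the first null byte, while strlen() cannot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a standard function): reports whether the
   first len bytes of a length-delimited buffer contain a zero byte.
   The caller must know len from somewhere other than strlen(), for
   example from sizeof, a file size, or a length field in a protocol. */
int has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    /* Skip memchr entirely for empty input so it never sees a null pointer. */
    return len > 0 && memchr(buf, '\0', len) != NULL;
}
```

For example, given const char data[] = "X\0Y", sizeof data is 4 (X, a zero byte, Y, and the terminator added by the literal), but strlen(data) is 1, because the string ends at the first null character. Calling has_embedded_nul(data, sizeof data - 1) reports the embedded zero byte that the string functions never get to see.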

Some multibyte encodings allow zero bytes in characters that are not the null character. If you use such an encoding, you need to be extra careful not to confuse a zero byte with the null character. UTF-8 is not one of those. A zero byte always represents the null character in UTF-8.
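The claim about UTF-8 can be checked mechanically. Below is a rough sketch of a UTF-8 encoder plus a brute-force scan over every code point; the function names utf8_encode and first_cp_with_zero_byte are invented for this illustration. Every lead byte of a multi-byte sequence is at least 0xC0 and every continuation byte is at least 0x80, so only U+0000 can produce a 0x00 byte. By contrast, UTF-16 in little-endian byte order stores U+0041 as the bytes 0x41 0x00.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative encoder: writes the UTF-8 form of cp into out and
   returns the number of bytes, or 0 if cp is a surrogate or is
   above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Scans every code point except U+0000 and returns the first one whose
   encoding contains a zero byte, or 0 if there is none. */
uint32_t first_cp_with_zero_byte(void)
{
    unsigned char buf[4];
    for (uint32_t cp = 1; cp <= 0x10FFFF; cp++) {
        size_t n = utf8_encode(cp, buf);
        if (n > 0 && memchr(buf, 0x00, n) != NULL)
            return cp;
    }
    return 0;
}
```

The scan finds no such code point, which is why a zero byte in well-formed UTF-8 can only be the null character itself.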

The ASCII NUL character that terminates a C string is encoded in UTF-8 as the single byte 0x00, so there is no problem as long as you treat your UTF-8 C strings as ending at the first U+0000 code point. Whether you decode that terminating 0x00 byte into code point U+0000 as part of the string is up to you; the string functions require the terminator either way.

If you want to allow code point U+0000 inside a Unicode string, you need some way to distinguish the terminating null byte (0x00) from the encoding of an embedded U+0000, so that arbitrary content passes through unchanged. I have seen software that does this by encoding an embedded U+0000 as the invalid UTF-8 sequence 0xC0 0x80. The standard decoding algorithm would map that sequence to U+0000, but it is an overlong form and therefore not a permitted UTF-8 encoding. With that convention, 0xC0 0x80 acts as an escape for U+0000, the byte 0x00 remains the string terminator, and functions such as strlen() still return the length of the string (in bytes, not in code points).
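A rough sketch of the 0xC0 0x80 escape, assuming the input arrives as a length-delimited byte buffer. The names escape_nul, unescape_nul and ESCAPE_FAILED are made up for this example. The same convention is used by Java under the name "modified UTF-8". Keep in mind that a strict UTF-8 decoder rejects 0xC0 0x80, so escaped strings should be unescaped before being handed to other software.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ESCAPE_FAILED ((size_t)-1)

/* Copies in[0..in_len) to out, replacing each 0x00 byte with the
   overlong pair 0xC0 0x80, then appends a real 0x00 terminator.
   Returns the length without the terminator, or ESCAPE_FAILED if
   out_cap is too small. */
size_t escape_nul(const unsigned char *in, size_t in_len,
                  unsigned char *out, size_t out_cap)
{
    assert(out != NULL);
    size_t w = 0;
    for (size_t i = 0; i < in_len; i++) {
        size_t need = (in[i] == 0x00) ? 2 : 1;
        if (w + need >= out_cap) /* always leave room for the terminator */
            return ESCAPE_FAILED;
        if (in[i] == 0x00) {
            out[w++] = 0xC0;
            out[w++] = 0x80;
        } else {
            out[w++] = in[i];
        }
    }
    if (w >= out_cap)
        return ESCAPE_FAILED;
    out[w] = 0x00;
    return w;
}

/* Reverses escape_nul: reads a null-terminated escaped string and writes
   the raw bytes, turning each 0xC0 0x80 pair back into 0x00. Returns the
   number of raw bytes, or ESCAPE_FAILED if out_cap is too small. */
size_t unescape_nul(const char *in, unsigned char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)in;
    size_t w = 0;
    while (*p != 0x00) {
        if (w >= out_cap)
            return ESCAPE_FAILED;
        /* p[1] is safe to read: p[0] is non-zero, so at worst p[1] is
           the terminator. */
        if (p[0] == 0xC0 && p[1] == 0x80) {
            out[w++] = 0x00;
            p += 2;
        } else {
            out[w++] = *p++;
        }
    }
    return w;
}
```

With this in place, the three raw bytes 'X', 0x00, 'Y' become the four-byte C string "X\xC0\x80Y", and strlen() on the result returns 4 instead of stopping after the X.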
