From what I understand, it is very rare for UTF-8 strings to contain embedded NULLs. However, a person can put a NULL into a Unicode string explicitly with "X\0Y" or something like that, and apparently the Unicode standard supports an embedded NULL in this way. As far as I can see, NULLs are not used otherwise, or at least not commonly, in UTF-8 encodings.
So, the question is: if I allow users of my software to use any UTF-8 string, am I taking a significant risk by processing those strings with functions that assume strings are NULL-terminated? In other words, how often might I encounter an embedded NULL "in the wild"?
Asked Feb 16 at 18:23 by Tyler Durden; edited Feb 17 at 14:29 by int main.
2 Answers
Definition (from any C standard):
A string is a contiguous sequence of characters terminated by and including the first null character
It follows that there are no embedded null characters in C strings.
Your users cannot put a null character in the middle of a string. It's physically and logically impossible. The string ends where the null character is, by definition.
A user can put a null character in the middle of a character array (which is not a string) or in the middle of a file (which is also not a string). How to deal with those is up to you. The encoding is irrelevant. UTF-8 does not pose any additional challenges in this regard compared to any other encoding.
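To make the distinction between a string and a character array concrete, here is a minimal sketch in C. The helper name has_embedded_nul is hypothetical (it is not a standard function), and it assumes the caller already knows the number of data bytes from somewhere other than strlen(), for example from fread() or a length-prefixed field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if any of the first `len` bytes of
 * `buf` is a zero byte. `len` is the number of data bytes, tracked by the
 * caller; it does not include any terminator. */
bool has_embedded_nul(const char *buf, size_t len)
{
    assert(buf != NULL);
    return memchr(buf, 0, len) != NULL;
}
```

For `char data[] = "X\0Y";` the array holds four bytes, but strlen(data) reports 1 because the string ends at the first null character, while has_embedded_nul(data, 3) returns true. Whether to reject such input, truncate it, or carry an explicit length everywhere is a design decision for the application.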
Some multibyte encodings allow zero bytes in characters that are not the null character. If you use such an encoding, you may need to be a little extra careful not to confuse a zero byte with the null character. UTF-8 is not one of those. A zero byte always represents the null character in UTF-8.
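The reason a zero byte can only ever mean the null character in UTF-8 is visible in the encoding itself: every byte of a multi-byte sequence has its high bit set. The following is a rough sketch of an encoder to show this; the function name utf8_encode is made up for illustration and is not part of the C standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes code point `cp` as UTF-8 into `out`.
 * Returns the number of bytes written (1 to 4), or 0 if `cp` is a
 * surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* ASCII, including U+0000 */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* lead byte 110xxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                   /* lead byte 1110xxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* lead byte 11110xxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+0100 encodes as the bytes 0xC4 0x80 in UTF-8, whereas its UTF-16 code unit 0x0100 contains a zero byte. Only U+0000 itself produces the byte 0x00 here.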
The ASCII NUL character used to terminate a C string encodes in UTF-8 as the single byte 0x00, so there is no problem as long as you treat your UTF-8 C strings as ending at the first U+0000 code point. Whether you decode that terminating 0x00 byte into U+0000 as part of the string or not is up to you; the C string functions require the string to be null-terminated either way.
If you want to allow U+0000 inside a Unicode string, however, you need some way to tell the 0x00 byte that terminates the string apart from the 0x00 byte that is the UTF-8 encoding of U+0000, so that the data passes through transparently. I've seen software that does this with an invalid UTF-8 sequence such as 0xC0 0x80, which would decode to U+0000 under the standard decoding algorithm but is not allowed as a UTF-8 encoding of it. You can treat 0xC0 0x80 as an escape that stands for an embedded U+0000, in which case strlen() still gives you the length of the string (in bytes, not in code points); alternatively, you can use that sequence as the final string delimiter, in which case strlen() no longer applies.
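As an illustration of the escape variant, here is a minimal sketch that rewrites each embedded zero byte as 0xC0 0x80 and then null-terminates the result. This is the same convention Java calls "modified UTF-8". The function name escape_nul and its interface are assumptions made for this example, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strlen, for callers of the result */

/* Hypothetical helper: copies `in_len` bytes from `in` to `out`, replacing
 * each 0x00 byte with the two-byte escape 0xC0 0x80, then appends a
 * terminating 0x00. Returns the number of bytes written, not counting the
 * terminator, or (size_t)-1 if `out_cap` is too small. */
size_t escape_nul(const unsigned char *in, size_t in_len,
                  unsigned char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        size_t need = (in[i] == 0x00) ? 2 : 1;
        if (n + need >= out_cap)          /* keep one byte for the terminator */
            return (size_t)-1;
        if (in[i] == 0x00) {
            out[n++] = 0xC0;              /* overlong form of U+0000 */
            out[n++] = 0x80;
        } else {
            out[n++] = in[i];
        }
    }
    if (n >= out_cap)                     /* no room for the terminator */
        return (size_t)-1;
    out[n] = 0x00;
    return n;
}
```

With the three input bytes 'X', 0x00, 'Y', the output is 'X', 0xC0, 0x80, 'Y' followed by the terminator, and strlen() on it reports 4. The trade-off is that the result is no longer valid UTF-8: a strict validator will reject it, so whatever reads it must know to map 0xC0 0x80 back to U+0000.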
sizeof, for example). I've linked to the C11 text, but all the editions from C90 through C23 contain the same text. – Jonathan Leffler, commented Feb 17 at 3:30