I know I can write a non-ASCII character literal using a Unicode escape sequence like:
wchar_t myChar = L'\u00C6';
But is there any guarantee that the resulting numerical value of myChar is actually hexadecimal C6? Or does the C language specification leave this as an implementation-defined detail?
Section 6.10.8 of this (apparent?) draft spec seems to imply that such a guarantee exists only if the optional __STDC_ISO_10646__ macro is defined (I guess either explicitly or as a compiler default). But I'm not 100% sure of my understanding, or of how official that doc is (the truly official spec seems hidden behind a paywall). So I'm wondering whether anyone knows for sure.
Update:
To clarify, this question has nothing to do with the issue of Unicode characters that don't fit in 16 bits. It has to do with the relationship between a character's "short identifier" (the hexadecimal code shown on Unicode charts and used in the escape code) versus the corresponding numerical value of the wchar_t variable. That is, whether this code:
wchar_t myChar = L'\u00C6';
printf("%04X", myChar);
could result in output such as:
007B
The value 007B is arbitrary; the point is just that it is something other than 00C6. I'm not aware of anything in the language specification that requires the numerical value of the wchar_t to equal the "short identifier" (as a concrete hypothetical example, imagine a C implementation that maps each character to a wchar_t whose numerical value is the 2's complement of the "short identifier").
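For concreteness, here is a minimal sketch of how one could probe a particular implementation (it assumes a C11 compiler for _Static_assert); it only tests the property I'm asking about on one compiler, and proves nothing about the standard itself:
#include <wchar.h>

/* Fails to compile on any implementation where \u00C6 does not map to 0xC6. */
_Static_assert(L'\u00C6' == 0xC6,
               "wchar_t value of \\u00C6 differs from its short identifier");

int main(void)
{
    wchar_t myChar = L'\u00C6';
    return myChar == 0xC6 ? 0 : 1;
}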
3 Answers
Regarding the __STDC_ISO_10646__ macro, I think your reading of the standard is correct. Quoting the N1570 draft of the C11 standard:
__STDC_ISO_10646__
An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.
If the macro is defined, the integer value of the wchar_t object will equal the hex value of the character's short identifier. Note that this doesn't apply to random hex strings, only to the "required set".
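As an illustration (a sketch of mine, not part of the quoted standard text), a program can report whether the macro is defined and, if so, which ISO/IEC 10646 revision the implementation claims to track:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
#ifdef __STDC_ISO_10646__
    /* The guarantee applies: wchar_t values match Unicode short identifiers
       for the required set as of the stated year/month. */
    printf("__STDC_ISO_10646__ = %ldL\n", (long)__STDC_ISO_10646__);
    printf("L'\\u00C6' = %04X\n", (unsigned)L'\u00C6'); /* prints 00C6 */
#else
    /* No guarantee: the wide-character encoding is implementation-defined. */
    puts("__STDC_ISO_10646__ is not defined");
#endif
    return 0;
}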
Not at all. wchar_t myChar = L'\u00C6'; means it holds the character Æ, not the numerical value 0xC6.
The relation between characters and numerical values is called the character encoding, which is implementation-defined. Your question therefore is whether the wchar_t character encoding is a Unicode encoding such as UTF-16 or UTF-32. That's not at all guaranteed by the C Standard.
You can write wchar_t myChar = L'\xC6';. That means myChar has the numerical value 0xC6.
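To make the distinction concrete, here is a small sketch (my addition, assuming a hosted implementation); on a wide encoding that isn't Unicode the first line may print something other than 00C6, while the second always prints 00C6:
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t byName  = L'\u00C6'; /* the character Æ; numeric value is encoding-dependent */
    wchar_t byValue = L'\xC6';   /* literally the numeric value 0xC6 */

    printf("%04X\n", (unsigned)byName);  /* 00C6 only if the wide encoding matches Unicode here */
    printf("%04X\n", (unsigned)byValue); /* always 00C6 */
    return 0;
}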
The short answer is NO. The mapping is addressed by neither the ANSI C standard nor the Unicode Consortium. But if you continue reading you will see why.
Unicode talks about code points, which are 21-bit unsigned numbers in the range 0...0x10FFFF (independently of how they are represented), and also about the different encodings used to represent them: UTF-8 requires up to 4 bytes to encode a single code point (many code points need fewer); UTF-16 requires two 16-bit words when the code point is in the extended planes, i.e. bigger than 0xFFFF, which is called a surrogate pair, in which each word holds up to 10 bits of the code point value; and UTF-32 requires 4 bytes to encode each code point.
On the other side, the C standard leaves everything related to character sets and encodings to the implementation, so you must adhere to how your implementation handles the different locales and character sets and how it encodes them. This is normally handled by the locale. wchar_t is a type whose size depends on the implementation. As @ikegami explains in one comment, wchar_t is not necessarily a 16-bit type; it could be as small as 8 bits or as big as 32 bits. What I mean to say is that the Standard doesn't say anything about the encodings; they are left to the implementation. @ikegami also says in his comment that it's commonly 32 bits. Well, neither the ANSI C standard requires it to be 32 bits, nor does the Unicode Consortium require C to comply with that. So, strictly speaking, no definitive answer can be given to this question. Anyway, I'll do my best to explain how far I have been able to use the different encodings proposed by Unicode from C without trouble.
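A quick way to see what a given implementation actually provides (this is my own illustration, not something required by the standard) is to print the size and range of wchar_t:
#include <stdio.h>
#include <wchar.h>   /* WCHAR_MIN, WCHAR_MAX */

int main(void)
{
    /* Reports how wide wchar_t is on this particular implementation. */
    printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    printf("WCHAR_MIN = %ld, WCHAR_MAX = %lu\n",
           (long)WCHAR_MIN, (unsigned long)WCHAR_MAX);
    return 0;
}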
All Unicode code points can be represented in the UTF-8, UTF-16, and UTF-32 encodings; you only need to understand how this happens. I normally use UTF-8, because it lets me work in contexts that are not language dependent (like when you use collation sequences and need the full code point available). wchar_t is not a widely used type for this, because ANSI C didn't set a standard size for it, so you have to cope with what is available.
UTF-8 represents a code point as a sequence of 8-bit units, requiring 1, 2, 3 or even 4 bytes per code point (a minimal encoding sketch follows this list). ASCII characters coincide in position with Unicode code points U+0000 to U+007F. The ISO Latin-1 characters coincide with code points U+0080 to U+00FF (these take two UTF-8 bytes), and in fact all code points between U+0080 and U+07FF require two bytes. From U+0800 to U+FFFF everything can be represented in three bytes, and only characters in the extended planes (from U+10000 to U+10FFFF) require 4 bytes.
UTF-16 represents a code point as a sequence of 16-bit units, requiring 1 or 2 units to cover the full set of code points. Unicode reserves code points U+D800 to U+DFFF for the surrogates, and only extended-plane characters are encoded with a surrogate pair, so all characters in the range U+010000 to U+10FFFF are encoded this way.
UTF-32 uses a full 32-bit number to hold the code point, so the correspondence is one to one (only values up to 0x10FFFF, that is 2^20 + 65536 possible code points, are actually used).
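Here is a minimal sketch of the UTF-8 byte-length rules just described (my own illustration; it assumes the code point is already valid, below 0x110000, and does no surrogate checking):
#include <stdio.h>

/* Encode one code point (assumed valid, < 0x110000) into UTF-8.
   Returns the number of bytes written to out (which must hold 4 bytes). */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* U+0000..U+007F: 1 byte  */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {               /* U+0080..U+07FF: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {             /* U+0800..U+FFFF: 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                               /* U+10000..U+10FFFF: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0xC6, buf);        /* U+00C6 -> C3 86 */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);
    putchar('\n');
    return 0;
}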
Unicode uses 21-bit code points, so if you want to handle all language-dependent features, you have two options:
Use full code point support internally, meaning that you work with the 21-bit code point values of characters directly.
Use a mixed approach (e.g. Java internally uses a variant of UTF-16; Unicode doesn't say anything about how strings are delimited), with 16-bit surrogates for code points in the extended planes; a sketch of the surrogate computation follows this list. You can also use UTF-8 (as I sometimes do), which requires you to delimit strings with some convention of your own so you don't clash with Unicode conventions.
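As an illustration of that second option (again a sketch of mine, not something the standard prescribes), the UTF-16 surrogate pair for an extended-plane code point is computed like this:
#include <stdio.h>

/* Split an extended-plane code point (assumed 0x10000..0x10FFFF)
   into a UTF-16 surrogate pair. */
static void utf16_surrogates(unsigned long cp,
                             unsigned *high, unsigned *low)
{
    cp -= 0x10000;                             /* 20 bits remain  */
    *high = 0xD800 | (unsigned)(cp >> 10);     /* top 10 bits     */
    *low  = 0xDC00 | (unsigned)(cp & 0x3FF);   /* bottom 10 bits  */
}

int main(void)
{
    unsigned hi, lo;
    utf16_surrogates(0x1F600, &hi, &lo);       /* U+1F600 -> D83D DE00 */
    printf("%04X %04X\n", hi, lo);
    return 0;
}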
It seems that, despite code points in the extended planes turning up now and then, the second approach (e.g. using the UTF-16 encoding) is the most effective. Where wchar_t is a 16-bit type it seems appropriate to use it this way, but you still need to check how the wchar_t library routines delimit strings and adapt to that (most probably the standard library already does this for you, but check your stdlib documentation).
Conclusion
ANSI C doesn't specify how characters are encoded (only how strings are delimited) or which characters a data type must be able to represent, and Unicode doesn't specify how characters are to be used in a computer (only what they represent and how they can be encoded in a data system). Both standards leave this to the implementations. You are on your own to use them as you want, but for interoperability the existing implementations do try to address the problem in a flexible way (you have to read the appropriate documentation for this). Unicode tries to organize the universal character set into groups by frequency of use, leaving the rarest characters in the positions that need the most bytes to encode, which IMHO is a very intelligent design. But if you want to use the UTF-8 or UTF-16 encoding (using 8-bit or 16-bit types), you will need to deal with multi-unit code points; only UTF-32 (using a 32-bit type for every character) lets you handle each character in one information unit. As ikegami says in his comment, wchar_t's size is unspecified in the standard, so it may or may not fit a Unicode code point.
Note: normally this problem is addressed by the locale-dependent routines; look at how your implementation handles the different locales and select your locale appropriately. You can set the locale used in your environment through environment variables at runtime, and query it with the locale(1) command. For example:
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=
$ _
This means I'm using the UTF-8 encoding. AFAIK gcc/POSIX implements wchar_t as a 32-bit quantity and provides full support for the UTF-8 encoding, but I have not tested other encodings. (UTF-8 can be handled with the normal char routines, but I have never checked whether the standard library supports Unicode code points encoded as UTF-8 strings.) Collation order in the different locales will probably not be handled properly. Read the documentation of your implementation.
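For completeness, here is a small sketch of my own (assuming a glibc/POSIX system with a UTF-8 locale like the one shown above) that converts a UTF-8 multibyte string to wchar_t with the standard locale-dependent routines:
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* Pick up the locale from the environment (e.g. en_US.UTF-8). */
    if (setlocale(LC_ALL, "") == NULL) {
        fputs("cannot set locale\n", stderr);
        return 1;
    }

    const char *utf8 = "\xC3\x86";        /* UTF-8 bytes for U+00C6 (Æ) */
    wchar_t wide[8];
    size_t n = mbstowcs(wide, utf8, 8);   /* locale-dependent conversion */
    if (n == (size_t)-1) {
        fputs("conversion failed\n", stderr);
        return 1;
    }

    /* On glibc with a UTF-8 locale this prints 00C6; the C standard
       itself does not guarantee that value. */
    printf("%04X\n", (unsigned)wide[0]);
    return 0;
}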
wchar_t is only 16 bits wide and uses UTF-16LE encoding. – Ian Abbott, Apr 1 at 11:15
wchar_t: "In practice, it is 32 bits and holds UTF-32 on Linux and many other non-Windows systems, but 16 bits and holds UTF-16 code units on Windows." – Ted Lyngmo, Apr 1 at 11:33
wchar_t is 4 bytes. But even on Windows, my question is only related to Unicode characters for which the syntax wchar_t myChar = L'\u00C6' actually works (characters that actually fit in 16 bits). This question is not at all related to the issue of characters that won't fit into a single wchar_t. The question has to do with whether the numerical value of the wchar_t variable, expressed in hexadecimal, is guaranteed to equal the hexadecimal value of the code points in the string literal. – NikS, Apr 1 at 11:34
A UTF-16 character constant is prefixed by the letter u. A UTF-32 character constant is prefixed by the letter U. – Ted Lyngmo, Apr 1 at 11:37
wchar_t isn't specifically for Unicode (any flavour). IIRC, it pre-dates UTF-16, having been introduced for the existing wide character encodings of the time. The only character types with a connection to Unicode are char8_t, char16_t and char32_t. – Toby Speight, commented yesterday