Will JavaScript's String prototype method toUpperCase() deliver the naturally expected result in every UTF-8-supported language/charset?
I've tried Simplified Chinese, Korean, Tamil, Japanese and Cyrillic, and the results seemed reasonable so far. Can I rely on the method being language-safe?
Example:
"イロハニホヘトチリヌルヲワカヨタレソツネナラムウヰノオクヤマケフコエテアサキユメミシヱヒモセス".toUpperCase()
> "イロハニホヘトチリヌルヲワカヨタレソツネナラムウヰノオクヤマケフコエテアサキユメミシヱヒモセス"
Edit: As @Quentin pointed out, there is also String.prototype.toLocaleUpperCase(), which is probably even "safer" to use, but I also have to support IE 8 and above, as well as WebKit-based browsers. Since it is part of the ECMAScript 3 standard, it should be available on all those browsers, right?
Does anyone know of any cases where using it delivers naturally unexpected results?
asked Jun 10, 2015 at 17:06 by connexo, edited Jun 10, 2015 at 17:27
- 2 "No" is a safe bet here. There are a lot of languages with UTF-8 characters and many of them do not even have the concept of upper or lower case characters. – tadman Commented Jun 10, 2015 at 17:08
- 4 See also developer.mozilla/en-US/docs/Web/JavaScript/Reference/… – Quentin Commented Jun 10, 2015 at 17:12
- Small aside: Please politely inform your Windows XP users that without security updates, they are (98% likely) part of a global botnet that makes network engineers' jobs much harder. – Katana314 Commented Jun 10, 2015 at 17:36
- 1 @Katana314 an aside that is non-related. Why are you going OT? – connexo Commented Jun 10, 2015 at 17:43
- @connexo Well, because you mentioned that you're supporting IE8 and above. Windows 7, with security updates, will be on IE11, so the most common reason to support IE8 is Windows XP. I usually won't point out minor things like "Your image should have an alt!" but for reasonably large issues, people usually at least provide a short comment on them to make sure they're aware; or perhaps include it as a note in their answer. – Katana314 Commented Jun 10, 2015 at 17:56
2 Answers
What do you expect?
JavaScript's toUpperCase() method is supposed to use the "locale invariant upper case mapping" as defined by the Unicode standard. So, basically, "i".toUpperCase() is supposed to be I in all cases. In cases where the locale invariant upper case mapping consists of multiple letters, some (mostly older) browsers will not upper case them correctly; for example, "ß".toUpperCase() should be SS, but is not always.
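A quick way to check the locale-invariant mapping described above. This is only a minimal sketch, assuming an engine that applies the Unicode special casing rules (current browsers and Node.js do); the variable names are purely illustrative.

```javascript
// Locale-invariant uppercasing: the result does not depend on the user's locale.
const plain = "i".toUpperCase();       // "I"

// One-to-many mapping from the Unicode special casing rules:
// the single character "ß" uppercases to the two characters "SS".
const sharpS = "ß".toUpperCase();

// Consequence: uppercasing can change the length of a string.
const word = "straße";
const upperWord = word.toUpperCase();  // "STRASSE"

console.log(plain, sharpS, upperWord, word.length, upperWord.length);
```

If a target browser prints something other than SS for the second value, it does not handle multi-letter mappings, and any code that assumes the length is preserved (or that it is always extended) should be treated with care.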
Also, there are locales that have different uppercase rules than the rest of the world, the most notable example being Turkish, where the uppercase version of i is İ (and vice versa) and the lowercase version of I is ı (and vice versa).
If you want that behaviour, you will need a browser that is set to Turkish locale, and you have to use the toLocaleUpperCase() method.
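In engines that implement the ECMAScript Internationalization API, toLocaleUpperCase() and toLocaleLowerCase() also accept a locale tag, so the Turkish rules can be requested explicitly instead of depending on the browser's locale setting. A minimal sketch, assuming such an engine with Turkish locale data available; engines without that support ignore the argument and fall back to the host locale, so this does not help on IE 8.

```javascript
// "tr" is the BCP 47 language tag for Turkish.
const trUpper = "i".toLocaleUpperCase("tr");  // dotted capital İ (U+0130)
const trLower = "I".toLocaleLowerCase("tr");  // dotless small ı (U+0131)

// The locale-invariant methods keep the familiar ASCII pairing.
const invUpper = "i".toUpperCase();           // "I"
const invLower = "I".toLowerCase();           // "i"

console.log(trUpper, trLower, invUpper, invLower);
```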
Also note that some writing systems have a third case, "title case", which is applied to the first letter of a word when you want to "capitalize" it. This is also defined by the Unicode standard (for example, the title case of the ligature ǌ is ǋ while the upper case is Ǌ), but (as far as I know) not available to JavaScript. Therefore if you try to capitalize a word using substring and toUpperCase, expect it to be wrong in rare cases.
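The title case problem can be reproduced with the nj ligature. This is a sketch only: capitalize is a hypothetical helper written for this illustration, not a built-in, and the sample word is just an example.

```javascript
// Naive capitalization: uppercase the first UTF-16 code unit, keep the rest.
// (capitalize is a hypothetical helper, not part of the language.)
function capitalize(word) {
  return word.substring(0, 1).toUpperCase() + word.substring(1);
}

const small = "\u01CC";  // LATIN SMALL LETTER NJ
const upper = "\u01CA";  // LATIN CAPITAL LETTER NJ (uppercase form)
const title = "\u01CB";  // LATIN CAPITAL LETTER N WITH SMALL LETTER J (title case form)

const result = capitalize(small + "egov");

console.log(result.charAt(0) === upper);  // true: the full uppercase form is used
console.log(result.charAt(0) === title);  // false: the title case form is never produced
```

Getting the title case form would need a lookup table or a library that ships the Unicode title case mappings, since the string methods only expose upper and lower case.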
Yes. From the spec:
[Returns] a String where each character is either the Unicode uppercase equivalent of the corresponding character of [the input] or the actual corresponding character of [the input] if no Unicode uppercase equivalent exists.
For the purposes of this operation, the 16-bit code units of the Strings are treated as code points in the Unicode Basic Multilingual Plane. Surrogate code points are directly transferred from [input to output] without any mapping.
The result must be derived according to the case mappings in the Unicode character database (this explicitly includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies it in Unicode 2.1.8 and later).
So while this might not exactly match your language's expectations (as many languages use the same characters but not necessarily in the same way), it does certainly deliver the naturally expected result as specified in the Unicode Character Database.
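The fallback rule quoted from the spec (characters with no uppercase equivalent are passed through unchanged) is why the katakana example in the question returns its input. A small sketch of both branches of that rule, with sample strings chosen only for illustration:

```javascript
// Scripts without case (here katakana) come back unchanged.
const kana = "イロハ";
const kanaUpper = kana.toUpperCase();

// Scripts with case (here Cyrillic and Greek) are mapped via the Unicode database.
const cyrillicUpper = "привет".toUpperCase();
const greekUpper = "αβγ".toUpperCase();

console.log(kanaUpper === kana, cyrillicUpper, greekUpper);
```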