I have a strange problem that I can't explain. I'm trying to manipulate a string containing an accented character such as "é". The string is the name of an image selected through a file input (input type="file").
What I can't understand is why, when I loop over the string with for, the accented character is split into two characters. Here is an example:
My é is split into two characters, like this: e &  ́.
"é".length
=> 2
Is it possible that UTF-8 is involved?
I really don't understand this at all!
asked Sep 2, 2013 at 17:24 by hypee
- Which browser are you using? It returns 1 in my Chrome. – MD Sayem Ahmed Commented Sep 2, 2013 at 17:26
- In some rare cases it is possible to write this letter with two characters. I read about this in the context of LaTeX. – rekire Commented Sep 2, 2013 at 17:27
- Your character also returns 1 on Firefox. – Kevin Ji Commented Sep 2, 2013 at 17:29
2 Answers
They are called Combining Diacritical Marks. They are a part of Unicode: diacritics that can be attached ("chained") to any base character. So the length of the string in that case is 2, because there is the e and the combining acute accent (U+0301). The precomposed characters like àéèìòù have been kept for compatibility, but now any character can be accented :-) Clearly 99% of programmers don't know this, and 99.9% of programs support it very badly. I'm quite sure they could be used as an attack vector somewhere (but I'm not paranoid :-) )
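As a rough illustration of the precomposed versus combining forms described above, the sketch below uses the standard String.prototype.normalize method (ES2015) to convert between them. The variable names are only illustrative, and it assumes an environment that supports normalize.

```javascript
// Decomposed form: base letter "e" followed by U+0301 COMBINING ACUTE ACCENT.
const decomposed = "e\u0301";
// Precomposed form: the single code point U+00E9.
const precomposed = "\u00e9";

// The two strings render the same but are not equal as raw strings.
const equalRaw = decomposed === precomposed;

// NFC composes base + combining mark into a precomposed character where one exists.
const composed = decomposed.normalize("NFC");
// NFD goes the other way and splits the precomposed character apart.
const redecomposed = precomposed.normalize("NFD");

console.log(equalRaw);                           // false
console.log(decomposed.length, composed.length); // 2 1
console.log(composed === precomposed);           // true
console.log(redecomposed === decomposed);        // true
```

Normalizing a file name to NFC before measuring or comparing it is one possible way to get the length of 1 that the question expected. Whether that suits a given application is a judgment call.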
I'll add that even Jon Skeet wasn't sure in 2009 how they worked: http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/
"You see, I couldn't remember whether combining characters came before or after base characters"
:-) :-)
Rather than UTF-8, the more likely cause is a combining diacritical mark.
>>> "e\u0301"
"é"
>>> "e\u0301".length
2
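To check which form a given string is actually in, it can help to print its code points. The helper below is a hypothetical sketch (codePoints is not a built-in function) that relies on the standard Array.from and codePointAt.

```javascript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
function codePoints(str) {
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(codePoints("e\u0301")); // [ 'U+0065', 'U+0301' ]
console.log(codePoints("\u00e9"));  // [ 'U+00E9' ]
```

If the string taken from the file input prints as U+0065 followed by U+0301, it is in the decomposed form, which would explain a length of 2.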
JavaScript strings are sequences of UTF-16 code units, and length counts those units, so the precomposed "é" (U+00E9) fits in a single code unit and has a length of 1.
Characters outside the BMP (those with code points above U+FFFF), however, have a length of 2, because they are encoded as two UTF-16 code units (a surrogate pair).
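The difference between code units and code points can be sketched as follows. U+1F600 is chosen here only as an example of a non-BMP character; it is not taken from the original post.

```javascript
// U+1F600 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const emoji = "\u{1F600}";

console.log(emoji.length);             // 2 (UTF-16 code units)
console.log(Array.from(emoji).length); // 1 (code points)

// The two code units that make up the surrogate pair.
console.log(emoji.charCodeAt(0).toString(16)); // d83d
console.log(emoji.charCodeAt(1).toString(16)); // de00

// Counting code points does not merge combining marks:
// "e" + U+0301 is still two code points.
console.log(Array.from("e\u0301").length); // 2
```

So iterating by code point handles surrogate pairs, but a base letter with a combining mark still counts as two unless the string is normalized first.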
>>> "