
javascript - Strange length of accent as "é" string return 2 - Stack Overflow


I have a strange problem that I can't explain. I'm trying to manipulate a string that contains an accented character such as "é". The string comes from the name of an image selected through a file input.

What I can't understand is why, when I iterate over the string with a for loop, the accented character is split into two characters. Here is an example to make it clearer:

My é is divided into two characters, like this: e and ́.

"é".length
=> 2

Is it possible that UTF-8 is involved?

I really don't understand this at all!


Asked Sep 2, 2013 at 17:24 by hypee
  • 6 Which browser are you using? It returns 1 in my chrome. – MD Sayem Ahmed Commented Sep 2, 2013 at 17:26
  • In some rare cases it is possible to write this letter with two characters. I read this in the context of LaTeX. – rekire Commented Sep 2, 2013 at 17:27
  • 2 Your character also returns 1 on Firefox. – Kevin Ji Commented Sep 2, 2013 at 17:29

2 Answers


They are called Combining Diacritical Marks. They are a "piece" of Unicode: diacritics that can be attached to, and "chained" on, any base character. In that case the length of the string is clearly 2, because it contains the e and the combining acute accent (U+0301). The precomposed characters like àéèìòù have been kept for compatibility, but now any character can be accented :-) Clearly 99% of programmers don't know this, and 99.9% of programs support it very badly. I'm quite sure they could be used as an attack vector somewhere (but I'm not paranoid :-) )

I'll add that even Skeet in 2009 wasn't sure how they worked: http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/

You see, I couldn't remember whether combining characters came before or after base characters

:-) :-)

Rather than UTF-8, it's more likely that combining diacritical marks are involved.

>>> "e\u0301"
"é"
>>> "e\u0301".length
2
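If the goal is to get a length of 1 for such a string, one option is Unicode normalization. The sketch below assumes an ES2015+ environment, where `String.prototype.normalize` is available (it did not exist in most browsers when the question was asked). The variable names are only illustrative.

```javascript
// Decomposed form: base letter "e" followed by U+0301 COMBINING ACUTE ACCENT.
const decomposed = "e\u0301";
// Precomposed form: the single code point U+00E9.
const precomposed = "\u00e9";

console.log(decomposed.length);          // 2
console.log(precomposed.length);         // 1
console.log(decomposed === precomposed); // false, although both render as "é"

// NFC composes a base character and its combining marks into a
// precomposed code point when one exists.
const composed = decomposed.normalize("NFC");
console.log(composed.length);            // 1
console.log(composed === precomposed);   // true

// NFD goes the other way and splits the precomposed character again.
console.log(precomposed.normalize("NFD").length); // 2
```

Normalizing both sides to the same form before comparing or measuring strings avoids the mismatch. Note that NFC only shortens sequences for which a precomposed character exists.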

JavaScript strings are usually encoded as UTF-16, so a string can hold the whole precomposed "é" (U+00E9) in a single code unit.


But characters outside the BMP (those with code points beyond U+FFFF) will have a length of 2, as they are encoded as 2 UTF-16 code units (a surrogate pair).

>>> "
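The point about characters outside the BMP can be sketched as follows. The example character (U+1D11E, MUSICAL SYMBOL G CLEF) and the helper name `countCodePoints` are choices made here for illustration, not taken from the original answer, and the `\u{...}` escape assumes ES2015 or later.

```javascript
// U+1D11E lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const clef = "\u{1D11E}";

// .length counts UTF-16 code units, not characters.
console.log(clef.length); // 2

// Hypothetical helper: string iteration walks code points, so a
// surrogate pair is counted once.
function countCodePoints(str) {
  return Array.from(str).length;
}

console.log(countCodePoints(clef));            // 1
console.log(clef.codePointAt(0).toString(16)); // "1d11e"

// Counting code points does not merge combining marks:
// "e" + U+0301 is still two code points.
console.log(countCodePoints("e\u0301"));       // 2
```

So there are two separate reasons a visible character can report a length of 2: a surrogate pair (one code point, two code units) or a combining sequence (two code points).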