javascript - Looking up unicode character set of language in JS - Stack Overflow

How can I find information about a Unicode character (e.g. the character set it belongs to) in JavaScript?

E.g.

00e9  LATIN SMALL LETTER E WITH ACUTE
0bf2  TAMIL NUMBER ONE THOUSAND

I am aware of a way to find details about a Unicode code point in Python, using the unicodedata library. Is there a way to find this information in JS?

PS: I am using this for Chrome extension development, so a solution using the Chrome extension APIs would also work.

asked May 9, 2012 at 11:18 by varrunr; edited May 13, 2012 at 22:13
  • Can you guarantee that the input is always limited to the chosen language? For example, "naive" is often spelt with a diaeresis (ï), and annoying people like me use Unicode Greek like α. Consider just measuring the character set of the inputs you get. – Phil H Commented May 9, 2012 at 11:22
  • I cannot guarantee that. The idea is to find out whether the domain of a URL entered in the browser has characters which are outside the languages currently set. E.g. if Chinese and English are set, I need to detect whether a character which belongs to neither is part of the URL. This can just be restricted to the alphabet of the language in this case. – varrunr Commented May 10, 2012 at 4:17
  • Yes, you’re wrong about English being in the code point range of 0–127. Very very wrong. – tchrist Commented May 13, 2012 at 14:38
  • Thanks for the clarifications. I have fixed the problem description after going through the answers. – varrunr Commented May 13, 2012 at 22:15
3 Answers

English-language text is dominated by code points from the Latin, Common, and Inherited scripts, and in some corpora, also Greek.

For example, the PubMed Open Access collection, which is a very large collection of entirely English-language text, is filled with non-ASCII code points. Fully 90% of the non-ASCII occurrences are accounted for by only 36 distinct code points, as follows:

rank  percent cumulative  code glyph  GC=??   Name
---------------------------------------------------------------------
   1  18.553%  18.553%  U+02013 ‹–›  GC=Pd    EN DASH
   2   7.422%  25.974%  U+000A0 ‹ ›  GC=Zs    NO-BREAK SPACE
   3   7.033%  33.007%  U+000B1 ‹±›  GC=Sm    PLUS-MINUS SIGN
   4   5.461%  38.469%  U+02212 ‹−›  GC=Sm    MINUS SIGN
   5   4.196%  42.664%  U+02003 ‹ ›  GC=Zs    EM SPACE
   6   3.682%  46.346%  U+003BC ‹μ›  GC=Ll    GREEK SMALL LETTER MU
   7   3.619%  49.965%  U+003B2 ‹β›  GC=Ll    GREEK SMALL LETTER BETA
   8   3.568%  53.534%  U+003B1 ‹α›  GC=Ll    GREEK SMALL LETTER ALPHA
   9   3.426%  56.959%  U+0200A ‹ ›  GC=Zs    HAIR SPACE
  10   3.221%  60.181%  U+000B0 ‹°›  GC=So    DEGREE SIGN
  11   2.931%  63.112%  U+02009 ‹ ›  GC=Zs    THIN SPACE
  12   2.620%  65.732%  U+02019 ‹’›  GC=Pf    RIGHT SINGLE QUOTATION MARK
  13   2.506%  68.238%  U+02032 ‹′›  GC=Po    PRIME
  14   2.441%  70.679%  U+000D7 ‹×›  GC=Sm    MULTIPLICATION SIGN
  15   2.042%  72.722%  U+0201D ‹”›  GC=Pf    RIGHT DOUBLE QUOTATION MARK
  16   2.039%  74.761%  U+0201C ‹“›  GC=Pi    LEFT DOUBLE QUOTATION MARK
  17   1.536%  76.296%  U+00394 ‹Δ›  GC=Lu    GREEK CAPITAL LETTER DELTA
  18   1.415%  77.712%  U+000B5 ‹µ›  GC=Ll    MICRO SIGN
  19   1.337%  79.049%  U+003B3 ‹γ›  GC=Ll    GREEK SMALL LETTER GAMMA
  20   1.210%  80.259%  U+000E9 ‹é›  GC=Ll    LATIN SMALL LETTER E WITH ACUTE
  21   1.152%  81.410%  U+02014 ‹—›  GC=Pd    EM DASH
  22   1.135%  82.546%  U+02018 ‹‘›  GC=Pi    LEFT SINGLE QUOTATION MARK
  23   0.998%  83.543%  U+000A9 ‹©›  GC=So    COPYRIGHT SIGN
  24   0.710%  84.253%  U+02265 ‹≥›  GC=Sm    GREATER-THAN OR EQUAL TO
  25   0.600%  84.853%  U+000F6 ‹ö›  GC=Ll    LATIN SMALL LETTER O WITH DIAERESIS
  26   0.599%  85.452%  U+000B7 ‹·›  GC=Po    MIDDLE DOT
  27   0.597%  86.049%  U+02022 ‹•›  GC=Po    BULLET
  28   0.594%  86.644%  U+0223C ‹∼›  GC=Sm    TILDE OPERATOR
  29   0.573%  87.217%  U+003BA ‹κ›  GC=Ll    GREEK SMALL LETTER KAPPA
  30   0.569%  87.785%  U+000FC ‹ü›  GC=Ll    LATIN SMALL LETTER U WITH DIAERESIS
  31   0.493%  88.278%  U+02264 ‹≤›  GC=Sm    LESS-THAN OR EQUAL TO
  32   0.440%  88.718%  U+000AE ‹®›  GC=So    REGISTERED SIGN
  33   0.433%  89.152%  U+000E4 ‹ä›  GC=Ll    LATIN SMALL LETTER A WITH DIAERESIS
  34   0.422%  89.573%  U+02020 ‹†›  GC=Po    DAGGER
  35   0.407%  89.980%  U+003B4 ‹δ›  GC=Ll    GREEK SMALL LETTER DELTA

One way to detect those would be to use the Unicode regular expression that says a character must either be from the Latin, Greek, Common, or Inherited scripts.
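In JavaScript, such a test can be sketched with Unicode property escapes, which need the `u` flag and an ES2018-capable engine. This is only a minimal sketch: the helper name `isEnglishLike` is hypothetical, and it checks the Script property of each code point, nothing more.

```javascript
// Matches one code point whose Script property is Latin, Greek, Common, or Inherited.
// Unicode property escapes (\p{...}) only work with the `u` flag.
const englishLike =
  /^[\p{Script=Latin}\p{Script=Greek}\p{Script=Common}\p{Script=Inherited}]$/u;

// Hypothetical helper: true if every code point in `text` belongs to one of
// the four scripts. An empty string trivially passes.
function isEnglishLike(text) {
  for (const ch of text) {
    // for...of walks the string by code point, not by UTF-16 code unit
    if (!englishLike.test(ch)) return false;
  }
  return true;
}

console.log(isEnglishLike("naïve α ± 5 μm")); // true
console.log(isEnglishLike("Привет"));          // false
```

Punctuation, digits, and spaces pass because they are in the Common script; combining marks pass because they are Inherited.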

In this corpus, these four scripts comprise well over 99% of the code points. However, there are also a great many super-low-frequency code points in this dataset that fall outside those four scripts (e.g. Cyrillic, Han, Kana, Hangul, etc.). You would throw those out as false negatives if you restricted input to the four ultra-common scripts previously listed. There are 239 such distinct code points in this dataset, of which the top 50 most frequent are the following:

rank  percent cumulative  code glyph  GC=??   Name
---------------------------------------------------------------------
 295   0.002%  99.828%  U+00424 ‹Ф›  GC=Lu    CYRILLIC CAPITAL LETTER EF
 381   0.001%  99.916%  U+0043A ‹к›  GC=Ll    CYRILLIC SMALL LETTER KA
 454   0.000%  99.949%  U+00413 ‹Г›  GC=Lu    CYRILLIC CAPITAL LETTER GHE
 491   0.000%  99.959%  U+0AD6D ‹국›  GC=Lo    HANGUL SYLLABLE GUG
 499   0.000%  99.961%  U+003EC ‹Ϭ›  GC=Lu    COPTIC CAPITAL LETTER SHIMA
 513   0.000%  99.965%  U+00406 ‹І›  GC=Lu    CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
 528   0.000%  99.968%  U+00416 ‹Ж›  GC=Lu    CYRILLIC CAPITAL LETTER ZHE
 534   0.000%  99.969%  U+00430 ‹а›  GC=Ll    CYRILLIC SMALL LETTER A
 539   0.000%  99.970%  U+0041F ‹П›  GC=Lu    CYRILLIC CAPITAL LETTER PE
 545   0.000%  99.971%  U+00421 ‹С›  GC=Lu    CYRILLIC CAPITAL LETTER ES
 553   0.000%  99.972%  U+0D55C ‹한›  GC=Lo    HANGUL SYLLABLE HAN
 555   0.000%  99.972%  U+00404 ‹Є›  GC=Lu    CYRILLIC CAPITAL LETTER UKRAINIAN IE
 566   0.000%  99.974%  U+0C5B4 ‹어›  GC=Lo    HANGUL SYLLABLE EO
 567   0.000%  99.974%  U+0041A ‹К›  GC=Lu    CYRILLIC CAPITAL LETTER KA
 568   0.000%  99.974%  U+0041B ‹Л›  GC=Lu    CYRILLIC CAPITAL LETTER EL
 571   0.000%  99.975%  U+0B2C8 ‹니›  GC=Lo    HANGUL SYLLABLE NI
 575   0.000%  99.975%  U+0AE4C ‹까›  GC=Lo    HANGUL SYLLABLE GGA
 578   0.000%  99.976%  U+00428 ‹Ш›  GC=Lu    CYRILLIC CAPITAL LETTER SHA
 579   0.000%  99.976%  U+00454 ‹є›  GC=Ll    CYRILLIC SMALL LETTER UKRAINIAN IE
 585   0.000%  99.977%  U+00418 ‹И›  GC=Lu    CYRILLIC CAPITAL LETTER I
 587   0.000%  99.977%  U+0B2E4 ‹다›  GC=Lo    HANGUL SYLLABLE DA
 600   0.000%  99.978%  U+00440 ‹р›  GC=Ll    CYRILLIC SMALL LETTER ER
 610   0.000%  99.980%  U+00457 ‹ї›  GC=Ll    CYRILLIC SMALL LETTER YI
 614   0.000%  99.980%  U+0C74C ‹음›  GC=Lo    HANGUL SYLLABLE EUM
 623   0.000%  99.981%  U+0BD80 ‹부›  GC=Lo    HANGUL SYLLABLE BU
 624   0.000%  99.981%  U+0C545 ‹악›  GC=Lo    HANGUL SYLLABLE AG
 625   0.000%  99.981%  U+0C778 ‹인›  GC=Lo    HANGUL SYLLABLE IN
 640   0.000%  99.982%  U+0C5D0 ‹에›  GC=Lo    HANGUL SYLLABLE E
 641   0.000%  99.983%  U+0C744 ‹을›  GC=Lo    HANGUL SYLLABLE EUL
 645   0.000%  99.983%  U+00438 ‹и›  GC=Ll    CYRILLIC SMALL LETTER I
 664   0.000%  99.984%  U+0041C ‹М›  GC=Lu    CYRILLIC CAPITAL LETTER EM
 665   0.000%  99.984%  U+00436 ‹ж›  GC=Ll    CYRILLIC SMALL LETTER ZHE
 674   0.000%  99.985%  U+0C774 ‹이›  GC=Lo    HANGUL SYLLABLE I
 678   0.000%  99.985%  U+00431 ‹б›  GC=Ll    CYRILLIC SMALL LETTER BE
 679   0.000%  99.986%  U+00435 ‹е›  GC=Ll    CYRILLIC SMALL LETTER IE
 689   0.000%  99.986%  U+0B300 ‹대›  GC=Lo    HANGUL SYLLABLE DAE
 690   0.000%  99.986%  U+0BD84 ‹분›  GC=Lo    HANGUL SYLLABLE BUN
 691   0.000%  99.986%  U+0C678 ‹외›  GC=Lo    HANGUL SYLLABLE OE
 696   0.000%  99.987%  U+005DB ‹כ›  GC=Lo    HEBREW LETTER KAF
 703   0.000%  99.987%  U+0B85C ‹로›  GC=Lo    HANGUL SYLLABLE RO
 711   0.000%  99.988%  U+0041D ‹Н›  GC=Lu    CYRILLIC CAPITAL LETTER EN
 712   0.000%  99.988%  U+004D9 ‹ә›  GC=Ll    CYRILLIC SMALL LETTER SCHWA
 725   0.000%  99.988%  U+0B294 ‹는›  GC=Lo    HANGUL SYLLABLE NEUN
 726   0.000%  99.988%  U+0B9CC ‹만›  GC=Lo    HANGUL SYLLABLE MAN
 727   0.000%  99.988%  U+0C11C ‹서›  GC=Lo    HANGUL SYLLABLE SEO
 728   0.000%  99.989%  U+0C2B5 ‹습›  GC=Lo    HANGUL SYLLABLE SEUB
 729   0.000%  99.989%  U+0C601 ‹영›  GC=Lo    HANGUL SYLLABLE YEONG
 741   0.000%  99.989%  U+00441 ‹с›  GC=Ll    CYRILLIC SMALL LETTER ES
 742   0.000%  99.989%  U+00444 ‹ф›  GC=Ll    CYRILLIC SMALL LETTER EF
 743   0.000%  99.989%  U+004B0 ‹Ұ›  GC=Lu    CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE
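A listing like the one above could be gathered in JavaScript by tallying every code point that fails the four-script test. The following is a rough sketch rather than the method used to build the table; `tallyOutliers` is a hypothetical name, and it reports only the code, glyph, and count, since JavaScript has no built-in way to look up character names.

```javascript
const fourScripts =
  /^[\p{Script=Latin}\p{Script=Greek}\p{Script=Common}\p{Script=Inherited}]$/u;

// Hypothetical helper: count each code point in `text` that is outside the
// four scripts. Returns rows of [ "U+XXXXX", glyph, count ], most frequent first.
function tallyOutliers(text) {
  const counts = new Map();
  for (const ch of text) {
    if (fourScripts.test(ch)) continue; // skip Latin, Greek, Common, Inherited
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  return [...counts]
    .sort((a, b) => b[1] - a[1])
    .map(([ch, n]) => [
      // Format the code point as five upper-case hex digits, as in the tables
      "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(5, "0"),
      ch,
      n,
    ]);
}

console.log(tallyOutliers("Фаза к 한국 к"));
```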

Of those 239 distinct trans-ASCII code points, 59 of them are also outside Unicode’s Basic Multilingual Plane, so any processing must be able to handle the full range of Unicode. All but one of these are mathematical letters. These are the top 20 of those:

rank  percent cumulative  code glyph  GC=??   Name
---------------------------------------------------------------------
 227   0.004%  99.660%  U+1D49E ‹
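In JavaScript, handling the full range of Unicode means walking a string by code point rather than by UTF-16 code unit, because anything above U+FFFF is stored as a surrogate pair. A minimal sketch, using U+1D49E from the row above as the sample; `astralCodePoints` is a hypothetical helper name.

```javascript
// Hypothetical helper: list the code points of `text` that lie outside the
// Basic Multilingual Plane (above U+FFFF).
function astralCodePoints(text) {
  const result = [];
  for (const ch of text) {
    // for...of yields whole code points, so a surrogate pair arrives as one item
    const cp = ch.codePointAt(0);
    if (cp > 0xffff) result.push("U+" + cp.toString(16).toUpperCase());
  }
  return result;
}

const sample = "x \u{1D49E} y";
console.log(sample.length);            // 6 (UTF-16 code units)
console.log([...sample].length);       // 5 (code points)
console.log(astralCodePoints(sample)); // [ 'U+1D49E' ]
```

Indexing with `sample[2]` or `charCodeAt` would return half of the surrogate pair, which is why the sketch relies on `for...of` and `codePointAt`.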