My application relied on this function to test whether a string is Korean:
const isKoreanWord = (input) => {
  const match = input.match(/[\u3131-\uD79D]/g);
  return match ? match.length === input.length : false;
}
isKoreanWord('만두'); // true
isKoreanWord('mandu'); // false
That worked until I added Chinese support; now the function gives wrong results:
isKoreanWord('幹嘛'); // true
I believe this happens because Korean and Chinese characters are intermingled in the same Unicode range.
How should I correct this function so that it returns true only if the input contains nothing but Korean characters?
- 1 By "Korean characters" you mean hangul? 'Cause Chinese characters are also used in Korea. Asking to distinguish "Chinese Chinese characters" from "Korean Chinese characters" is like asking to distinguish English from French. – deceze ♦ Commented Oct 25, 2018 at 12:33
- @deceze Yes I meant hangul. How to distinguish between hangul and hanja. – vdegenne Commented Oct 25, 2018 at 12:34
- @deceze Also, I don't think your comparison holds: English and French both derive from Latin, so yes, it is extremely hard to distinguish the two languages, while Korean uses Chinese as its base language and Chinese, well... uses Chinese as its own historical base language. – vdegenne Commented Oct 25, 2018 at 12:40
- 1 I'm talking purely about the writing system used. If you just look at the range of letters, English is indistinguishable from French. In the same way, seeing just a few Chinese characters it's virtually impossible to tell whether it's a Chinese word or a word used in the context of Korean. – deceze ♦ Commented Oct 25, 2018 at 12:43
- 2 "Korean characters" means hangul, there's no exception. – wonsuc Commented Mar 26, 2019 at 6:59
3 Answers
Here are the Unicode ranges you need for Hangul (taken from its Wikipedia page):
U+AC00–U+D7AF
U+1100–U+11FF
U+3130–U+318F
U+A960–U+A97F
U+D7B0–U+D7FF
So the regex in your .match call should look like this:
const match = input.match(/[\uac00-\ud7af]|[\u1100-\u11ff]|[\u3130-\u318f]|[\ua960-\ua97f]|[\ud7b0-\ud7ff]/g);
A shorter version that matches Korean (Hangul) characters, with all ranges in a single character class:
const regexKorean = /[\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]/g
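For completeness, here is a minimal sketch of the whole function rebuilt on those ranges. The original range `\u3131-\uD79D` also spans the CJK Unified Ideographs block (U+4E00–U+9FFF), which is why '幹嘛' passed. The name `HANGUL_ONLY` is only illustrative, and anchoring the pattern with `^...+$` instead of counting matches is one possible design choice, not the only one.

```javascript
// All five Hangul ranges from the list above in one character class.
// ^ and $ with the + quantifier require every character of the input
// to be Hangul, so no match counting is needed.
const HANGUL_ONLY = /^[\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]+$/;

const isKoreanWord = (input) => HANGUL_ONLY.test(input);

console.log(isKoreanWord('만두'));      // true
console.log(isKoreanWord('mandu'));     // false
console.log(isKoreanWord('幹嘛'));      // false
console.log(isKoreanWord('만두mandu')); // false (mixed input)
console.log(isKoreanWord(''));          // false (same as the original function)
```

The regex has no `g` flag on purpose: a global regex keeps state in `lastIndex` between `.test()` calls, which can produce surprising results when the same regex object is reused.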
In modern browsers, you can use Unicode property escapes directly:
const RE = /\p{sc=Hangul}/u
console.log(RE.test('만두')) // true
console.log(RE.test('mandu')) // false
console.log(RE.test('幹嘛')) // false
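Note that the unanchored pattern above only checks whether the input contains at least one Hangul character. To answer the original question (only Hangul), the pattern can be anchored. This is a sketch; `isHangulOnly` and `HANGUL_ONLY_RE` are hypothetical names, and it assumes a runtime that supports Unicode property escapes with the `u` flag.

```javascript
// Script=Hangul is the long form of sc=Hangul.
// ^ ... +$ requires the whole string to consist of Hangul characters.
const HANGUL_ONLY_RE = /^\p{Script=Hangul}+$/u;

const isHangulOnly = (input) => HANGUL_ONLY_RE.test(input);

console.log(isHangulOnly('만두'));      // true
console.log(isHangulOnly('mandu'));     // false
console.log(isHangulOnly('幹嘛'));      // false
console.log(isHangulOnly('만두mandu')); // false

// For comparison, the unanchored version accepts mixed input:
console.log(/\p{sc=Hangul}/u.test('만두mandu')); // true
```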