javascript - regex words matching for Chinese and Japanese character - Stack Overflow

I know the pattern for detecting whether a string consists of Chinese characters, but that's not what I need. I need to check whether given characters are found in a string.

const words_found = (words, values) => 
 words.some(word => 
   values.match(new RegExp(word + '\\b', 'i'))
)

words_found(['james'], 'my name is james') // true

but it fails for Chinese characters:

words_found(['一个'], '你说到这是一个测试') // false

asked Jul 4, 2018 at 3:42 by user9728810
  • Summing up: either build a custom word boundary using lookaheads/negated character classes or, if you plan to only support ECMAScript 2018 compatible JS environments, use new RegExp(word + '(?!\\p{L})', 'i') (not sure you need i since Chinese letters are caseless). To match a whole letter word, use new RegExp('(^|\\P{L})(' + word + ')(?!\\p{L})', 'i'). – Wiktor Stribiżew Commented Jul 4, 2018 at 7:25
2 Answers

Read the documentation for word boundaries.

A word boundary matches the position between a word character followed by a non-word character, or between a non-word character followed by a word character.

where "word character" is something that matches \w (in JavaScript, only ASCII letters, digits, and the underscore, i.e. [A-Za-z0-9_]), and "non-word character" is something that matches \W.

Note that all Chinese characters, in the sense that we usually think of them, are considered "non-word characters" as relates to the definition of word boundaries in JavaScript regular expressions. In other words, there is no word boundary between 一 and 个, because both are non-word characters; similarly, there is no word boundary between 一个 and 测试, because both 个 and 测 are non-word characters.
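As a quick check of the point above, here is a small sketch that can be pasted into a console. The helper name isWordChar is made up for illustration and is not part of the question's code.

```javascript
// \w is ASCII-only in JavaScript: [A-Za-z0-9_].
// isWordChar is a hypothetical helper used only for this demonstration.
const isWordChar = (ch) => /\w/.test(ch);

console.log(isWordChar('s'));  // true
console.log(isWordChar('个')); // false

// \b needs a word character on exactly one side of the position.
console.log(/james\b/.test('my name is james'));   // true:  "s", then end of string
console.log(/一个\b/.test('你说到这是一个测试'));  // false: 个 and 测 are both non-word
console.log(/一个\b/.test('一个'));                // false: 个, then end of string
console.log(/一个\b/.test('一个test'));            // true:  个 (non-word), then "t" (word)
```

The last line shows that the pattern from the question only matches a Chinese term when an ASCII word character happens to follow it, which is rarely what is wanted.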

With regard to Japanese, Chinese, and Korean, which do not generally use spaces, there is not even a single clear definition of what "word" means, and therefore no clear concept of "word character" or "word boundary". There are libraries that people have worked on for years, some involving machine learning, which try to break text into meaningful word-like segments, and they all do it in slightly different ways. The relevant question here is why you want to break the Chinese into what you are thinking of as "words" (or find strings which occur right before "word boundaries"). What is the point of the \\b that forces the match to occur right before a word boundary? What case are you trying to exclude?

Using Unicode regexp properties

However, you may be able to use the Unicode property escapes introduced in ECMAScript 2018 (http://2ality.com/2017/07/regexp-unicode-property-escapes.html). For instance, to match Chinese strings occurring before something that doesn't look like a Chinese character (or any letter), you could use

new RegExp(`${word}(?=$|\\P{Letter})`, "u")

Roughly speaking, this translates into "find the word, but only if it is followed (using look-ahead, the (?= part) by either end-of-string ($) or a character which does not have the Unicode property 'Letter'". The "u" flag enables Unicode processing, which is required for \p{...} and \P{...} to work.

Of course, this will not help you find 一个 as a "word" inside 你说到这是一个测试, because the following character 测 falls into the Unicode class "Letter", and so will not match \P{Letter}.
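Here is a minimal sketch of the look-ahead approach in runnable form. The helper names escapeRegExp and endsAtNonLetter are assumptions made for this example. Note that the backslash has to be doubled when the pattern is written inside a string or template literal.

```javascript
// Hypothetical helper: escape regex metacharacters in the search term
// so that input such as "a.b" is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hypothetical helper: true when `word` occurs in `text` and is followed by
// end-of-string or by a character without the Unicode "Letter" property.
const endsAtNonLetter = (word, text) =>
  new RegExp(`${escapeRegExp(word)}(?=$|\\P{Letter})`, 'u').test(text);

console.log(endsAtNonLetter('james', 'my name is james'));  // true:  end of string
console.log(endsAtNonLetter('jam', 'my name is james'));    // false: followed by "e"
console.log(endsAtNonLetter('测试', '你说到这是一个测试')); // true:  end of string
console.log(endsAtNonLetter('一个', '这是一个。'));         // true:  followed by punctuation
console.log(endsAtNonLetter('一个', '你说到这是一个测试')); // false: 测 is a Letter
```

The last call is the case from the question, and it confirms that a letter-based boundary does not help when the term sits in the middle of running Chinese text.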

By the way, to match any "non-word" symbol in Unicode, you can use:

[^\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]
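As a rough illustration of how that class behaves, the sketch below wraps it in two helpers. The names NON_WORD_SOURCE, isNonWord, and splitOnNonWord are invented for this example, and the "u" flag is required for the property escapes to be recognized.

```javascript
// The character class from above, written as a string (backslashes doubled).
const NON_WORD_SOURCE =
  '[^\\p{Alphabetic}\\p{Mark}\\p{Decimal_Number}\\p{Connector_Punctuation}\\p{Join_Control}]';

// Hypothetical helper: is this single character a Unicode "non-word" symbol?
const isNonWord = (ch) => new RegExp(`^${NON_WORD_SOURCE}$`, 'u').test(ch);

// Hypothetical helper: split text on runs of non-word symbols.
const splitOnNonWord = (text) =>
  text.split(new RegExp(`${NON_WORD_SOURCE}+`, 'u')).filter(Boolean);

console.log(isNonWord('测')); // false: Han characters are Alphabetic
console.log(isNonWord('_'));  // false: Connector_Punctuation
console.log(isNonWord('。')); // true
console.log(isNonWord(' '));  // true

console.log(splitOnNonWord('这是一个测试,hello world_1!'));
// [ '这是一个测试', 'hello', 'world_1' ]
```

The split result shows the limitation discussed earlier: an unbroken run of Chinese text comes back as a single segment, so this class separates scripts and punctuation but does not segment Chinese into words.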

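\b only matches at a boundary between a word character and a non-word character. In JavaScript, where \w is ASCII-only, every character in '你说到这是一个测试' is a non-word character, so the string contains no word boundary at all and '一个' followed by \b cannot match. (In regex engines whose \b is Unicode-aware, the entire string is treated as one word instead: '一个' still fails because it is not at the edge of that word, while '测试' at the end matches.) For Chinese words, a simple substring match is usually enough.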
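A simple substring check along those lines might look like the sketch below. The function name wordsFound and the case-folding via toLowerCase are choices made for this example, not something the original code specified beyond its "i" flag.

```javascript
// Hypothetical rewrite of the question's function using plain substring search.
// Lowercasing both sides approximates the original case-insensitive "i" flag.
const wordsFound = (words, text) => {
  const haystack = text.toLowerCase();
  return words.some((word) => haystack.includes(word.toLowerCase()));
};

console.log(wordsFound(['一个'], '你说到这是一个测试')); // true
console.log(wordsFound(['james'], 'my name is James'));  // true
console.log(wordsFound(['bob'], 'my name is james'));    // false
console.log(wordsFound(['jam'], 'my name is james'));    // true (partial Latin words match too)
```

The trade-off is visible in the last call: without any boundary check, 'jam' is found inside 'james'. If that matters for Latin-script terms, one option is to keep the \b pattern for ASCII-only terms and use the substring check for everything else.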
