JavaScript regex to get first character of each word in a sentence (Persian, and English sentence) - Stack Overflow

Suppose I have the following string:

var englishSentence = 'Hellow World';
var persianSentence = 'گروه جوانان خلاق';

For English I use the following regex, but how can I write a regex that supports Persian, or a mix of both?

  var matches = englishSentence.match(/\b(\w)/g);
  var acronym = matches.join('');

asked Apr 12, 2018 at 9:48 by jones
  • How should the output look? – gurvinder372, Apr 12, 2018 at 9:50
  • try sentence.split( /\s+/ ).map( s => s.charAt(0) ) – gurvinder372, Apr 12, 2018 at 9:51
  • For Persian it should be گجخ – jones, Apr 12, 2018 at 9:52

5 Answers

Root cause

JavaScript has no built-in way to match a Unicode word boundary: \b is not Unicode-aware, even in ECMAScript 2018.

Solutions

For ECMAScript 2018 compatible browsers (e.g. the latest versions of Chrome as of April 2018) you may use:

var englishSentence = 'Hellow World';
var persianSentence = 'گروه جوانان خلاق';
var reg = /(?<!\p{L}\p{M}*)\p{L}\p{M}*/gu;
console.log(englishSentence.match(reg));
console.log(persianSentence.match(reg));

Details

  • (?<!\p{L}\p{M}*) - a negative lookbehind that fails the match if the current position is immediately preceded by a Unicode letter followed by 0+ diacritics
  • \p{L}\p{M}* - a Unicode letter followed by 0+ diacritics
  • gu - g: global, find all matches; u: make the pattern Unicode-aware (required for \p{...}).
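As a rough sketch, the pattern above could be wrapped in a helper that returns the acronym the question asks for. The name acronym is hypothetical, and the engine is assumed to support lookbehind and Unicode property escapes:

```javascript
// Hypothetical helper, not part of the original answer.
// Assumes an engine with ES2018 lookbehind and \p{...} support.
function acronym(sentence) {
  var reg = /(?<!\p{L}\p{M}*)\p{L}\p{M}*/gu;
  // match() returns null when nothing matches, so fall back to an empty array.
  return (sentence.match(reg) || []).join('');
}

console.log(acronym('Hellow World'));        // HW
console.log(acronym('گروه جوانان خلاق'));    // گجخ
console.log(acronym('این یک test است'));     // mixed Persian and English input
```

The null fallback matters for input with no letters at all, where calling join directly on the match result would throw.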

If you need the same functionality in older/other browsers, use XRegExp:

function getFirstLetters(s, regex) {
  var results = [];
  // The callback runs once per match; Group 1 holds the first letter of the word.
  XRegExp.forEach(s, regex, function (match, i) {
    results.push(match[1]);
  });
  return results;
}
var rx = XRegExp("(?:^|[^\\pL\\pM])(\\pL\\pM*)", "gu");
console.log(getFirstLetters("Hello world", rx));
console.log(getFirstLetters('گروه جوانان خلاق', rx));
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.2.0/xregexp-all.js"></script>

Details

  • (?:^|[^\\pL\\pM]) - a non-capturing group that matches the start of the string (^) or any char other than a Unicode letter or diacritic
  • (\\pL\\pM*) - Group 1: any Unicode letter followed with 0+ diacritics.

Here, we need to extract Group 1 value, hence .push(match[1]) upon each match.
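The same capture-group idea can also be written as a native regex for engines that support Unicode property escapes but not lookbehind. This is only a sketch under that assumption; getFirstLettersNative is a hypothetical name, not part of XRegExp:

```javascript
// Sketch only: the XRegExp pattern above rewritten as a native regex.
// Assumes \p{...} support (u flag); no lookbehind is needed.
function getFirstLettersNative(s) {
  var rx = /(?:^|[^\p{L}\p{M}])(\p{L}\p{M}*)/gu;
  var results = [], m;
  // Each match consumes at least one letter, so the loop always advances.
  while ((m = rx.exec(s)) !== null) {
    results.push(m[1]); // Group 1: the letter plus any diacritics
  }
  return results;
}

console.log(getFirstLettersNative('Hello world'));
console.log(getFirstLettersNative('گروه جوانان خلاق'));
```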

You can split by space(s) and then get the first character of each item

var output = sentence.split( /\s+/ ).map( s => s.charAt(0) ).join("")

Demo

var fnGetFirstChar = (sentence) => sentence.split( /\s+/ ).map( s => s.charAt(0) ).join("");

var englishSentence = 'Hellow World';
var persianSentence = 'گروه جوانان خلاق';

console.log( fnGetFirstChar( englishSentence ) );

console.log( fnGetFirstChar( persianSentence ) );
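One caveat with charAt(0) is that it returns a single UTF-16 code unit, so a word that starts with a character outside the Basic Multilingual Plane would yield half of a surrogate pair. A possible variant, sketched here with the hypothetical name firstChars, takes the first code point instead:

```javascript
// Hypothetical variant of fnGetFirstChar, shown as a sketch.
var firstChars = (sentence) =>
  sentence
    .split(/\s+/)
    .filter(Boolean)                     // skip empty items from leading/trailing whitespace
    .map((word) => Array.from(word)[0])  // first code point rather than first UTF-16 unit
    .join('');

console.log(firstChars('  Hellow   World '));   // HW
console.log(firstChars('گروه جوانان خلاق'));    // گجخ
```

Like the original, this treats any non-whitespace character as the start of a word, so leading punctuation is returned as-is.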

If you're doing this in code, one way of doing it is with

(?:\s|^)(\S)

It matches a non white space character (\S) preceded by a white space OR beginning of string (\s|^), capturing the non white space character to capture group 1.

var sentence  = 'Hello World\n'+
                'گروه جوانان خلاق',
    re        = /(?:\s|^)(\S)/g,
    result    = '',
    m;

// Append capture group 1 (the first non-whitespace character) of every match.
while ((m = re.exec(sentence)) !== null) {
  result += m[1];
}

console.log( result );

It is better to use a character range from آ to ی along with a-z, since \b in JavaScript does not recognize non-ASCII letters, whereas in most other regex flavors it does.

console.log(
  "سلام حالت چطوره؟".match(/( |^)[آ-یa-z](?=[آ-یa-z])/gi).map(x => x.trim()).join('')
)

console.log(
  "این یک test است".match(/( |^)[آ-یa-z](?=[آ-یa-z])/gi).map(x => x.trim()).join('')
)

Breakdown:

  • ( |^) Match a space or the beginning of the input string
  • [آ-یa-z] Match a Farsi or Latin letter
  • (?= Start a positive lookahead
    • [آ-یa-z] Require that another Farsi or Latin letter follows
  • ) End of the positive lookahead

Note: the character range from آ to ی covers more than the Farsi alphabet (it includes some Arabic-only letters too). I doubt you use those letters anywhere, but for a precise match use an explicit character class:

[اآبپتثجچحخدذرزژسشصضطظعفقگکلمنوهی]

console.log(
    "سلام دوست من".match(/( |^)[اآبپتثجچحخدذرزژسشصضطظعفقگکلمنوهیa-z](?=[اآبپتثجچحخدذرزژسشصضطظعفقگکلمنوهیa-z])/gi).map(x => x.trim()).join('')
)
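Because the lookahead requires a second letter, one-letter words such as و are skipped by the patterns above. If those should be included, one option is to drop the lookahead and capture the letter in a group. This is a sketch under that assumption rather than the answer's original pattern, and firstLetters is a hypothetical name:

```javascript
// Sketch: same character range, but without the lookahead,
// so single-letter words are included as well.
function firstLetters(text) {
  var re = /(?:^| )([آ-یa-z])/gi;
  var out = '', m;
  while ((m = re.exec(text)) !== null) {
    out += m[1]; // Group 1 is the letter without the leading space
  }
  return out;
}

console.log(firstLetters('من و تو'));
console.log(firstLetters('سلام دوست من'));
```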

In JS you can simulate a word boundary.

Enable the engine's Unicode option and use the properties [\p{L}\p{N}_] to define a word character, then spell out the left and right boundaries with lookarounds.

/(?:(?<![\p{L}\p{N}_])(?=[\p{L}\p{N}_])|(?<=[\p{L}\p{N}_])(?![\p{L}\p{N}_]))/gu

The sample below uses Korean text, but the pattern applies to any Unicode text.

https://regex101.com/r/Mjttej/1

(?:                           # Cluster start
   (?<! [\p{L}\p{N}_] )          # Lookbehind: previous char is NOT a word char
   (?= [\p{L}\p{N}_] )           # Lookahead: next char IS a word char
   
 |                              # or,
   
   (?<= [\p{L}\p{N}_] )          # Lookbehind: previous char IS a word char
   (?! [\p{L}\p{N}_] )           # Lookahead: next char is NOT a word char
                                 # -------
)                             # Cluster end
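To tie this back to the question, the left half of the simulated boundary can be combined with the same word class to pick out the first character of each word. The following is a minimal sketch; the names WORD, leftBoundary, firstCharRe and firstWordChars are hypothetical, and lookbehind plus \p{...} support is assumed:

```javascript
// Sketch: reuse the answer's word class to find the first character of each word.
var WORD = '[\\p{L}\\p{N}_]';
// Left boundary only: not preceded by a word char, and followed by one.
var leftBoundary = '(?<!' + WORD + ')(?=' + WORD + ')';
var firstCharRe = new RegExp(leftBoundary + WORD, 'gu');

function firstWordChars(text) {
  // match() returns null when there is no match, so fall back to [].
  return (text.match(firstCharRe) || []).join('');
}

console.log(firstWordChars('Hello World'));        // HW
console.log(firstWordChars('گروه جوانان خلاق'));   // گجخ
console.log(firstWordChars('한국어 문장'));
```

The word class here does not include combining marks, so text written with diacritics may need \p{M} added to it.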
