Suppose I have the following strings:
var englishSentence = 'Hellow World';
var persianSentence = 'گروه جوانان خلاق';
For English I use the following regex, but how can I write a regex that supports Persian, or a mix of the two?
var matches = englishSentence.match(/\b(\w)/g);
var acronym = matches.join('');
asked Apr 12, 2018 at 9:48
jones
- How should the output look? – gurvinder372 Commented Apr 12, 2018 at 9:50
- try sentence.split( /\s+/ ).map( s => s.charAt(0) ) – gurvinder372 Commented Apr 12, 2018 at 9:51
- For Persian it should be گجخ – jones Commented Apr 12, 2018 at 9:52
5 Answers
Root cause
There is no built-in way to match a Unicode word boundary: \b is not Unicode-aware, even in ECMAScript 2018.
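A quick check of that claim, as a minimal sketch (the variable names here are only for illustration):

```javascript
var englishWords = 'Hellow World';
var persianWords = 'گروه جوانان خلاق';

// \w only covers [A-Za-z0-9_], so \b never finds a boundary inside Persian text.
console.log(englishWords.match(/\b(\w)/g));  // [ 'H', 'W' ]
console.log(persianWords.match(/\b(\w)/g));  // null

// Adding the u flag alone does not change what \w and \b mean here.
console.log(persianWords.match(/\b(\w)/gu)); // null
```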
Solutions
For ECMAScript 2018 compatible browsers (e.g. the latest versions of Chrome as of April 2018) you may use:
var englishSentence = 'Hellow World';
var persianSentence = 'گروه جوانان خلاق';
var reg = /(?<!\p{L}\p{M}*)\p{L}\p{M}*/gu;
console.log(englishSentence.match(reg));
console.log(persianSentence.match(reg));
Details
(?<!\p{L}\p{M}*) - a negative lookbehind that fails the match if there is a Unicode letter followed with 0+ diacritics immediately to the left
\p{L}\p{M}* - a Unicode letter followed with 0+ diacritics
gu - the flags: g - global, search for all matches; u - make the pattern Unicode aware.
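To get the acronym itself rather than the array of matches, the pattern can be wrapped in a small helper. This is only a sketch: unicodeAcronym is a hypothetical name, and it assumes an engine that supports both lookbehind and \p{...} escapes.

```javascript
// Hypothetical helper: joins the first letter (plus any combining marks) of every word.
function unicodeAcronym(sentence) {
  var reg = /(?<!\p{L}\p{M}*)\p{L}\p{M}*/gu;
  // match() returns null when nothing matches, so fall back to an empty array.
  return (sentence.match(reg) || []).join('');
}

console.log(unicodeAcronym('Hellow World'));     // HW
console.log(unicodeAcronym('گروه جوانان خلاق')); // گجخ
console.log(unicodeAcronym('123 !?') === '');    // true (no letters at all)
```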
If you need the same functionality in older/other browsers, use XRegExp:
function getFirstLetters(s, regex) {
  var results = [];
  // XRegExp.forEach calls the callback once per match; Group 1 holds the letter.
  XRegExp.forEach(s, regex, function (match, i) {
    results.push(match[1]);
  });
  return results;
}
var rx = XRegExp("(?:^|[^\\pL\\pM])(\\pL\\pM*)", "gu");
console.log(getFirstLetters("Hello world", rx));
console.log(getFirstLetters('گروه جوانان خلاق', rx));
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.2.0/xregexp-all.js"></script>
Details
(?:^|[^\\pL\\pM]) - a non-capturing group that matches the start of the string (^) or any char other than a Unicode letter or diacritic
(\\pL\\pM*) - Group 1: any Unicode letter followed with 0+ diacritics.
Here, we need to extract the Group 1 value, hence .push(match[1]) upon each match.
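The same Group 1 idea can also be tried with a native regex, without XRegExp. The sketch below assumes an engine that understands \p{...} with the u flag; it does not need lookbehind. firstLettersNoLookbehind is a hypothetical name.

```javascript
// Hypothetical helper: same pattern as the XRegExp version, written natively.
function firstLettersNoLookbehind(s) {
  var rx = /(?:^|[^\p{L}\p{M}])(\p{L}\p{M}*)/gu;
  var results = [];
  var m;
  // exec() advances rx.lastIndex on every call because of the g flag.
  while ((m = rx.exec(s)) !== null) {
    results.push(m[1]); // Group 1: the letter plus any combining marks
  }
  return results;
}

console.log(firstLettersNoLookbehind('Hello world'));      // [ 'H', 'w' ]
console.log(firstLettersNoLookbehind('گروه جوانان خلاق')); // three items: گ, ج, خ
```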
You can split by whitespace and then take the first character of each item:
var output = sentence.split( /\s+/ ).map( s => s.charAt(0) ).join("")
Demo
var fnGetFirstChar = (sentence) => sentence.split( /\s+/ ).map( s => s.charAt(0) ).join("");
var englishSentence = 'Hellow World';
var persianSentence = 'گروه جوانان خلاق';
console.log( fnGetFirstChar( englishSentence ) );
console.log( fnGetFirstChar( persianSentence ) );
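One possible refinement, offered as a sketch rather than a fix to the answer above: charAt(0) returns a single UTF-16 code unit, so a word that starts with a character outside the Basic Multilingual Plane would be cut in half. Array.from iterates by code point instead. acronymBySplit is a hypothetical name.

```javascript
// Hypothetical variant of the split approach that keeps astral characters whole.
function acronymBySplit(sentence) {
  return sentence
    .trim()
    .split(/\s+/)
    .filter(Boolean)                                          // drops the '' produced by an empty input
    .map(function (word) { return Array.from(word)[0]; })     // first code point, not first code unit
    .join('');
}

console.log(acronymBySplit('Hellow World'));       // HW
console.log(acronymBySplit('  گروه جوانان خلاق ')); // گجخ
```

Like the original one-liner, this treats any non-whitespace character as the start of a word, so punctuation such as an opening parenthesis would end up in the result.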
If you're doing this in code, one way of doing it is with (?:\s|^)(\S)
It matches a non-whitespace character (\S) preceded by whitespace or the beginning of the string (\s|^), capturing the non-whitespace character in capture group 1.
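var sentence = 'Hello World\n' +
    'گروه جوانان خلاق',
    re = /(?:\s|^)(\S)/g,
    result = '',
    m; // declared here so it does not leak as an implicit global
while ((m = re.exec(sentence)) !== null) {
    result += m[1]; // group 1 holds the first character of each word
}
console.log(result);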
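The loop above produces one combined string for the whole input. If a separate acronym per line is wanted, which is an assumption about the use case, the same regex can be applied line by line. acronymsPerLine is a hypothetical name.

```javascript
// Hypothetical helper: one acronym per input line, using the same (?:\s|^)(\S) pattern.
function acronymsPerLine(text) {
  return text.split('\n').map(function (line) {
    var re = /(?:\s|^)(\S)/g; // fresh regex per line, so lastIndex starts at 0
    var result = '';
    var m;
    while ((m = re.exec(line)) !== null) {
      result += m[1];
    }
    return result;
  });
}

console.log(acronymsPerLine('Hello World\nگروه جوانان خلاق')); // two items: 'HW' and 'گجخ'
```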
You'd better use a character range from آ to ی along with a-z, since a word boundary in JS doesn't recognize non-ASCII letters, while in most other flavors it does.
console.log(
"سلام حالت چطوره؟".match(/( |^)[آ-یa-z](?=[آ-یa-z])/gi).map(x => x.trim()).join('')
)
console.log(
"این یک test است".match(/( |^)[آ-یa-z](?=[آ-یa-z])/gi).map(x => x.trim()).join('')
)
Breakdown:
(?: |^) Match a space or beginning of input string
[آ-ی] Match a character from Farsi
(?= Start a positive lookahead
[آ-ی] If followed by another Farsi character
) End of positive lookahead
Note: the character range from آ to ی contains more than the Farsi alphabet (some Arabic letters too). For a precise match (I doubt you use those letters anywhere, though), use an explicit character class:
[اآبپتثجچحخدذرزژسشصضطظعفقگکلمنوهی]
console.log(
"سلام دوست من".match(/( |^)[اآبپتثجچحخدذرزژسشصضطظعفقگکلمنوهیa-z](?=[اآبپتثجچحخدذرزژسشصضطظعفقگکلمنوهیa-z])/gi).map(x => x.trim()).join('')
)
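Two things to keep in mind with the snippets above: match() returns null when nothing matches, so the chained .map() would throw, and the lookahead means one-letter words are skipped. A guarded version might look like this sketch (acronymByRange is a hypothetical name; the range is the same [آ-یa-z] class used above):

```javascript
// Hypothetical wrapper around the range-based pattern, with a null guard.
function acronymByRange(sentence) {
  var re = /(?: |^)[آ-یa-z](?=[آ-یa-z])/gi;
  var matches = sentence.match(re) || []; // match() yields null when there is no match
  return matches
    .map(function (x) { return x.trim(); }) // strip the leading space that was consumed
    .join('');
}

console.log(acronymByRange('Hello World'));    // HW
console.log(acronymByRange('a b') === '');     // true: one-letter words fail the lookahead
console.log(acronymByRange('سلام دوست من'));
```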
In JS you can simulate a word boundary.
Enable the engine's Unicode option (the u flag) and use the properties [\p{L}\p{N}_] to define a word character, then just do the math for the left/right boundaries:
/(?:(?<![\p{L}\p{N}_])(?=[\p{L}\p{N}_])|(?<=[\p{L}\p{N}_])(?![\p{L}\p{N}_]))/gu
This is a Korean sample, but it applies to any Unicode text.
https://regex101.com/r/Mjttej/1
(?: # Cluster start
(?<! [\p{L}\p{N}_] ) # Lookbehind assertion for a char that is NOT a word char
(?= [\p{L}\p{N}_] ) # Lookahead assertion for a char that IS a word char
| # or,
(?<= [\p{L}\p{N}_] ) # Lookbehind assertion for a char that IS a word char
(?! [\p{L}\p{N}_] ) # Lookahead assertion for a char that is NOT a word char
# -------
) # Cluster end
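Applied to the original question, only the left-hand half of that boundary is needed. The following is a sketch under the same assumptions (lookbehind and \p{...} support); WORD, unicodeBoundary, wordStart and firstWordChars are hypothetical names.

```javascript
var WORD = '[\\p{L}\\p{N}_]';

// Full simulated boundary, built from the pattern above.
var unicodeBoundary = new RegExp(
  '(?:(?<!' + WORD + ')(?=' + WORD + ')|(?<=' + WORD + ')(?!' + WORD + '))', 'gu');

// Left boundary followed by the word character itself.
var wordStart = new RegExp('(?<!' + WORD + ')' + WORD, 'gu');

function firstWordChars(s) {
  return s.match(wordStart) || []; // match() returns null when nothing matches
}

console.log('Hello World'.split(unicodeBoundary));         // [ 'Hello', ' ', 'World' ]
console.log(firstWordChars('Hello World').join(''));       // HW
console.log(firstWordChars('گروه جوانان خلاق').join(''));  // گجخ
```

Note that this word class counts digits and the underscore as word characters and does not include combining marks (\p{M}), so it behaves slightly differently from the \p{L}\p{M}* pattern in the first answer.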