I need to wrap each word from a list of Unicode words in {} wherever it occurs in a Unicode string. Here is my code:
var txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
var re = new RegExp("(^|\\W)(one|tw|two two|two|twöu|three|föur)(?=\\W|$)", "gi");
alert(txt.replace(re, '$1{$2}'));
It returns:
¿{One};{one} {one}é {two two} {two two} {two} {tw}ö {tw}öu {three};;{tw}ä;{föur}?
but should be:
¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?
What am I doing wrong?
asked Apr 6, 2011 at 7:28 by John; edited Mar 29, 2012 at 18:19 by tchrist
Comment: You are doing nothing wrong, unfortunately. JavaScript is. A perfect, standards-compliant solution is not possible under JavaScript as it is currently implemented, but there is a JavaScript plugin library that you may be able to co-opt into working for you in this particular case. See my answer below. – tchrist, Apr 8, 2011 at 14:07
3 Answers
The Problem
What am I doing wrong?
Unfortunately, the answer is that you are doing nothing wrong. Javascript is.
The problem is that Javascript does not support Unicode regular expressions as such are spelled out in The Unicode Standard.
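A quick way to see the limitation is to test how the built-in shorthand classes treat an accented letter. This is only an illustrative sketch: the variable names are made up for the example, and it assumes the regexes are used in the legacy mode (without the `u` flag) that the question relies on.

```javascript
// In legacy (non-`u`) mode, \w, \W and \b only know about ASCII word characters.
var asciiWord = /^\w$/;

var checks = {
  // "e" is an ASCII letter, so it counts as a word character.
  plainLetterIsWord: asciiWord.test("e"),
  // "\u00e9" is é; it is NOT treated as a word character.
  accentedLetterIsWord: asciiWord.test("\u00e9"),
  // ...so \W happily matches it.
  accentedLetterIsNonWord: /^\W$/.test("\u00e9"),
  // \b therefore sees a "word boundary" between "one" and "é" in "oneé".
  boundaryInsideWord: /\bone\b/.test("one\u00e9")
};

console.log(checks);
```

This is why the pattern in the question matches the "one" inside "oneé" and the "tw" inside "twö".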
There is, however, a rather nice library called XRegExp which has a JavaScript plugin that helps a great deal. I recommend it, albeit with several notable caveats. You need to know what it can do, and what it cannot.
What It Does
- Corrects various bugs and inconsistencies in JavaScript implementations, including in its split function.
- Supports the BMP code points covered by the 6.1 release of the Unicode Character Database, from January 2012.
- Correctly ignores case, space, hyphen-minuses, and underscores in Unicode property names, per The Standard, something which even Java gets wrong.
- Supports the Unicode General Categories, like \p{L} for letters and \p{Sc} for currency symbols.
- Supports the standard full property names, like \p{Letter} for \p{L} and \p{Currency_Symbol} for \p{Sc}.
- Supports the Unicode Script properties, like \p{Latin}, \p{Greek}, and \p{Common}.
- Supports the Unicode Block properties, like \p{InBasic_Latin} and \p{InMathematical_Alphanumeric_Symbols}.
- Supports the other 9 Unicode properties needed for level-1 compliance: \p{Alphabetic}, \p{Uppercase}, \p{Lowercase}, \p{White_Space}, \p{Noncharacter_Code_Point}, \p{Default_Ignorable_Code_Point}, \p{Any}, \p{ASCII}, and \p{Assigned}.
- Supports named captures instead of just numbered ones, using standard notation to do so: (?<NAME>⋯) to declare a named group, \k<NAME> to backref it by name, and ${NAME} in the replacement pattern (and in general you access it using result.NAME in your code). This is the same syntax used by Perl 5.10, Java 7, .NET, and several other languages. It makes writing complex regexes a lot easier by letting you name parts instead of just numbering them, so that when you move stuff around you don’t have to recalculate the numbered variables.
- Supports /s aka (?s) mode so that dot matches any single code point, rather than anything except for a linebreak sequence. Most other regex engines support this mode.
- Supports /x aka (?x) mode so that whitespace and comments are ignored (if unescaped). Most regex engines support this mode. It is absolutely indispensable for creating legible, and hence maintainable, patterns.
- Supports embedded comments even when not in /x mode, using the standard (?#⋯) notation to do so (such as seen in Perl). This lets you put comments in individual regex pieces without going all the way to /x mode, which is often important in developing more complex patterns, by allowing you to build them up piece-wise.
- Supports extensibility, so that you can add new token types if you want, such as \a to mean the ALERT character, or the POSIXish character classes.
What It Doesn’t
You should be careful, however, about the things that it does not do:
- Does not support full Unicode, but only code points from Plane 0. This is a forbidden restriction, as The Unicode Standard requires that there be no difference between astral and non-astral code points in a regular expression. Even Java didn’t get this right until JDK7. (However, the v2.1.0 development version does support full Unicode.)
- Does not support \X for grapheme clusters, or \R for linebreak sequences.
- Does not support two-part properties, like \p{GC=Letter}, \p{Block=Phonetic_Extensions}, \p{Script=Greek}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, and \p{Numeric_Value=10}.
- Does not update the character class shortcuts to operate per the requirements of UTS#18. Standard JavaScript only allows \s to match the Unicode \p{White_Space} property; it does not allow \d to match \p{Nd} (although some old browsers will do that anyway!) nor \w to match [\p{Alphabetic}\pM\p{Nd}\p{Pc}], let alone providing Unicode-aware versions of \b and \B, all of which are part of the requirements for supporting Unicode Regular Expressions.
- Does not support some commonly used properties. In practice, the one that is missing is \p{digit}, and perhaps also the rather useful \p{Dash}, \p{Math}, \p{Diacritic}, and \p{Quotation_Mark} properties.
- Has no support for grapheme clusters, whether using \X or even via (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). This is a really big deal.
Workarounds
Here are a few workarounds to handle a few of the places where the library doesn’t follow The Unicode Standard:
- For the missing \w, you can use [\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers, as they’re not \p{Nd}-type numbers, which are the only ones that count as alphanumeric.
- For the missing \W, you can therefore use the set-complement of the previous one, so [^\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers.
- Since \b is really the same as (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)), you could plug that \w definition into that sequence to create a Unicode-aware version of \b, provided that JavaScript supported all four directions of lookaround, which when last I checked, it did not. You have to have both positive and negative lookbehind, not just lookahead, to do this correctly. JavaScript neglects to support those, at least as far as I can see.
- Since \B is really the same as (?:(?<=\w)(?=\w)|(?<!\w)(?!\w)), you could do the same, but subject to the same conditions.
- For the missing \X, you can get sorta close by using \P{M}\p{M}*, but that incorrectly splits up CRLF constructs and allows marks on the same, all of which is really quite wrong.
- For the missing \R, you can construct a work-around using (?:\r\n|[\n-\r\u0085\u2028\u2029]).
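As a rough sketch of how the \w, \b and \R workarounds could be applied to the question's word list, the following assumes an engine that implements the ES2018 regex additions (the `u` flag, \p{...} property escapes, and lookbehind). That is an assumption about the runtime, not something the engines discussed in this answer provided, and it does not use XRegExp. The helper names (`escapeRegExp`, `wrapWords`, `splitLines`) are hypothetical, and the word-character class leaves out the Enclosed Alphanumerics block because native property escapes have no block properties.

```javascript
// Assumption: ES2018+ engine (u flag, \p{...} escapes, lookbehind).
// Approximation of the \w workaround above, minus the block property.
const WORD_CHAR = "[\\p{L}\\p{Nl}\\p{Nd}\\p{M}]";

// Hypothetical helper: escape regex metacharacters in a literal word.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap every listed word in {} when it is not
// adjacent to another word character on either side.
function wrapWords(text, words) {
  // Longest alternatives first, so "two two" is tried before "two" and "tw".
  const alternation = [...words]
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");
  // (?<!w) ... (?!w) plays the role of a Unicode-aware \b on both sides.
  const re = new RegExp(
    `(?<!${WORD_CHAR})(?:${alternation})(?!${WORD_CHAR})`,
    "giu"
  );
  return text.replace(re, "{$&}");
}

// The \R workaround: CRLF counts as one break, then the single-character breaks.
const LINE_BREAK = /(?:\r\n|[\n-\r\u0085\u2028\u2029])/u;

function splitLines(text) {
  return text.split(LINE_BREAK);
}

const txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
const words = ["one", "tw", "two two", "two", "twöu", "three", "föur"];
const wrapped = wrapWords(txt, words);
console.log(wrapped);
// ¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?

const lines = splitLines("a\r\nb\u2028c\u000Bd");
console.log(lines); // [ 'a', 'b', 'c', 'd' ]
```

Because the boundaries are lookarounds rather than a captured delimiter, the replacement only needs the whole match ($&) and never consumes the neighbouring character.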
Summary
The conclusion is that JavaScript’s regexes are completely unsuited for Unicode work. However, the XRegExp plugin moves closer to making that feasible. If you can live with its restrictions, this is probably easier than switching to a different but Unicode-aware programming language. It’s certainly better than being unable to use Unicode regexes at all.
However, it is still a rather long way from meeting the very most basic requirements (Level 1 support) for Unicode regexes as spelled out in the standard. Someday you are going to want to be able to match characters whether they have accent marks on them or not, or which sit in the Mathematical Alphanumeric Symbols block, or which use the Unicode case-mapping and case-folding definitions, or which follow The Unicode Standard for alphanumeric sorts or for line- and word-breaking, and you cannot do any of those things in JavaScript even with the plug-in.
So you might wish to consider using a language that is compliant with The Unicode Standard if you actually need to handle Unicode. JavaScript just doesn’t manage that.
Firstly, unless the regex is dynamic, please use the /.../gi notation.
The reason it returns the wrong result is that \W in JavaScript is really just [^0-9a-zA-Z_]. Accented characters like é are not considered word characters. You need to exclude them manually.
var re = /(^|[^a-zäéö])(one|tw|two two|two|twöu|three|föur)(?=[^a-zäéö]|$)/gi;
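To check that approach against the string from the question, here is a minimal sketch. The helper name `bracketWords` and its `extraLetters` parameter are hypothetical, introduced only for this example. It also adds digits and the underscore to the class so that it mirrors what \w covers, and it uses '$1{$2}' as the replacement so that no extra space is inserted. The words and extra letters are assumed to contain no regex metacharacters.

```javascript
// Hypothetical helper: wrap each listed word in {} when it is delimited by
// characters outside the explicit word-character class (or by string edges).
function bracketWords(text, words, extraLetters) {
  // ASCII word characters plus the accented letters supplied by the caller.
  var wordChar = "0-9a-z_" + extraLetters;
  var re = new RegExp(
    "(^|[^" + wordChar + "])(" + words.join("|") + ")(?=[^" + wordChar + "]|$)",
    "gi"
  );
  // $1 restores the consumed delimiter; $2 is the matched word.
  return text.replace(re, "$1{$2}");
}

var sample = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
var list = ["one", "tw", "two two", "two", "twöu", "three", "föur"];
var bracketed = bracketWords(sample, list, "äéö");
console.log(bracketed);
// ¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?
```

The drawback of this design is that every accented letter the text may contain has to be listed by hand; any letter left out of the class is treated as a delimiter again.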
Try this:
var txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
var re = new RegExp("(^|\\W)(one|two two|two|twöu|three|föur)(?=[^a-zé]|$)", "gi");
alert(txt.replace(re, '$1{$2}'));
Let me know if it doesn't work.