When I try to call a function of CrossUI(a js framework) to rtrim text contents of DOM which is grep by jQuery, the firefox and chrome will be run into busy.
I found this regular expression in the source code blocks the browser.
I tried /[\s\uFEFF\xA0]+$/
and it works.
Why /(\s|\uFEFF|\xA0)+$/
is stuck? What's internal different between them?
$('body').text().replace(/(\s|\uFEFF|\xA0)+$/, "");
alert('pass');
<script src=".7.2/jquery.min.js"></script>
<div>
<div> </div>
<div>a</div>
</div>
When I try to call a function of CrossUI(a js framework) to rtrim text contents of DOM which is grep by jQuery, the firefox and chrome will be run into busy.
I found this regular expression in the source code blocks the browser.
I tried /[\s\uFEFF\xA0]+$/
and it works.
Why /(\s|\uFEFF|\xA0)+$/
is stuck? What's internal different between them?
$('body').text().replace(/(\s|\uFEFF|\xA0)+$/, "");
alert('pass');
<script src="https://ajax.googleapis./ajax/libs/jquery/1.7.2/jquery.min.js"></script>
<div>
<div> </div>
<div>a</div>
</div>
Share
Improve this question
edited Jul 21, 2015 at 8:06
Junyo
asked Jul 21, 2015 at 7:39
JunyoJunyo
3855 silver badges12 bronze badges
7
-
Well, the square bracket one is a character class, where
|
is included twice… so it will also remove all vertical bars. – Sebastian Simon Commented Jul 21, 2015 at 7:51 - @Xufox Thx, I've fix the code – Junyo Commented Jul 21, 2015 at 8:07
-
1
It seems to be another case of catastrophic backtracking. This is also visible with a simplified example of your RegEx:
/(?:\s|\s)+$/
. – Sebastian Simon Commented Jul 21, 2015 at 8:08 -
1
Try to remove
\xA0
, since it is also subsumed under\s
(JavaScript's\s
being equivalent to[ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]
). – Amadan Commented Jul 21, 2015 at 8:12 -
1
@Amadan:
\s
includes\ufeff
, according to ECMA 5.1 and also 6. The article on MDN missed that part. – nhahtdh Commented Jul 21, 2015 at 10:44
2 Answers
Reset to default 4According to ECMA 5.1 specification, \s
includes WhiteSpace and LineTerminator, and WhiteSpace includes U+FEFF and U+00A0 in its definition.
A simple test
/^\s+$/.test("\ufeff\u00a0")
shows that IE9 and recent versions of Firefox (38) and Chrome (43) follows the specs for those 2 characters. If you decide to drop support for old browsers, you don't need to add those characters manually into the regex. Just use \s
.
If you need to support them in old browser, include them in a character class:
[\s\ufeff\u00a0]
Using alternation will cause catastrophic backtracking in ECMA 5.1 pliant browsers. Since alternation creates a choice point to backtrack1 and \s
matches U+00A0 in ECMA 5.1, \s
and \xA0
in \s|\uFEFF|\xA0
presents 2 valid choices to match one no-break-space. When you have a string of consecutive whitespaces (as defined by \s
), you will have O(2n) cases to check, where n is the number of no-break-space in aforementioned substring. The same applies for \ufeff
, but it's rarer for such character to appear in large quantity.
On the contrary, a character class does not create a choice point for backtracking, so it's safe to use in both new and old browsers.
1 Technically, the engine is allowed to rewrite the alternation in the question into a character class. However, it's not monly done in practice, since it plicates the implementation of the engine.
A character set /[aa]/
is equivalent to /[a]/
: it is a set, it is assumed that every element can only appear once (and extra ones are ignored). This all works okay since each option is only one character - there is no space for plications to arise.
However, in case of failure, an alternation /(a|a)/
needs to check every branch, just in case, because it can't guarantee that a decision of one branch over another won't have consequences. An alternation does not guarantee fixed width, it does not guarantee invariance of capture groups etc. In this case, yes, both branches are the same; but the regexp engine doesn't check for this.
Thus, if you have /[aa]+$/
and are checking aaaaX
, you have four checks, one for each character, before the match fails due to non-end-of-stringness -- same as for /[a]+$/
(and in fact same as for /a+$/
). But for /(a|a)+$/
, you have 2 * (1 + 2 * (1 + 2 * (1 + 2)))
checks, thirty in total, as each of the branches gets checked. For each additional a
in the string, your time doubles, since the engine needs to check both the a
branch and also the a
branch (!) to see if one of them, miraculously, manages to match.
So to apply this to your problem. As said in ments, you have \xA0
in one branch, and it is also implicitly present in \s
; thus /(\s|\uFEFF|\xA0)+$/
will double its execution time on each and every
that you have in a white-space sequence that is not at the end of string. (The actual rtrim
part, the one that does get replaced, does not pose the problem - the end of string white-space sequence is done without any delay, as the first tested branch, the one with all \s
, succeeds and does not backtrack.)