Yesterday I asked a question about detecting invalid XML characters in Java, and this expression works as expected:
String xml10pattern = "[^"
+ "\u0009\r\n" // #x9 | #xA | #xD
+ "\u0020-\uD7FF" // [#x20-#xD7FF]
+ "\uE000-\uFFFD" // [#xE000-#xFFFD]
+ "\ud800\udc00-\udbff\udfff" // [#x10000-#x10FFFF]
+ "]";
However, I realized it would be better to check for invalid characters on the client side using JavaScript, but I didn't succeed.
I almost got it working, except for the range U+10000–U+10FFFF: http://jsfiddle/mymxyjaf/15/
For the last range, I tried
var rg = /[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\ud800\udc00-\udbff\udfff]/g;
but it doesn't work. regextester reports "Range values reversed". I think this is because \ud800\udc00-\udbff\udfff is interpreted as three expressions:
\ud800; \udc00-\udbff; \udfff
and, of course, the middle one fails.
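A minimal sketch that checks this diagnosis, assuming a regex engine that treats each \uXXXX escape as a separate 16-bit code unit (the default when no "u" flag is given). The helper name compileError is hypothetical and exists only for this illustration:

```javascript
// Hypothetical helper: returns the name of the error thrown while compiling
// a pattern, or null if the pattern compiles.
function compileError(source, flags) {
  try {
    new RegExp(source, flags);
    return null;
  } catch (e) {
    return e.name;
  }
}

// The character class from the question, written as a string so that a
// compile failure can be caught instead of aborting the whole script.
var fullClass = '[^\\u0009\\r\\n\\u0020-\\uD7FF\\uE000-\\uFFFD\\ud800\\udc00-\\udbff\\udfff]';

// Only the suspected culprit: a range whose start (\udc00) is above its end (\udbff).
var middleRange = '[\\udc00-\\udbff]';

var fullClassError = compileError(fullClass, '');     // 'SyntaxError'
var middleRangeError = compileError(middleRange, ''); // 'SyntaxError'
```

Both patterns fail to compile for the same reason, which matches the "Range values reversed" message.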
So my question is: how do I convert the Java regular expression above to JavaScript?
Thanks.
==== UPDATE ====
Thanks to @collapsar's comments, I tried using two regular expressions.
In doing so, I realized I can't use a negated character class [^...]: it would discard valid characters like U+10001. I mean, this is not right:
function validateIllegalChars(str) {
var re1 = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g;
var re2 = /[^[\uD800-\uDBFF][\uDC00-\uDFFF]]/g;
var str2 = str.replace(re1, '').replace(re2, ''); // First replace would remove all valid characters [#x10000-#x10FFFF]
alert('str2:' + str2);
if (str2 != str) return false;
return true;
}
Then I tried the following (http://jsfiddle/mymxyjaf/18/):
function valPos(str) {
var re1 = /[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g;
var re2 = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
var str2 = str.replace(re1, '').replace(re2, '');
if (str2.length === 0) return true;
alert('str2:' + str2 + '; length: ' + str2.length);
return false;
}
However, when I call this function as valPos('eo' + String.fromCharCode(65537)), where 65537 is U+10001, it returns false.
What is wrong, or how can I solve it?
- The \u notation (so far) only supports 16-bit code units. This SO answer will solve your problem (specify the code points in question as surrogate pairs). However, you should be able to use the original solution if you create a RegExp object from a string: new RegExp(xml10pattern); with xml10pattern defined as in your question. – collapsar Commented Mar 13, 2015 at 12:11
- @collapsar, I think it does not work. For instance, U+D801 shouldn't be accepted (it's not valid XML), yet it seems to be accepted: jsfiddle/mymxyjaf/16. What is wrong? – Albert Commented Mar 13, 2015 at 12:42
- In your fiddle, you have nested character classes in your first regex. This is a syntax error. Follow the recipe in the cited answer: you cannot build a single negated character class (or a single regex) because the limits of the offending code points will be represented by 2 characters. – collapsar Commented Mar 13, 2015 at 12:53
- @collapsar, the expression I just used is var re = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD[\uD800-\uDBFF][\uDC00-\uDFFF]]/g;. It looks like it doesn't treat U+D801 as half of a surrogate pair. It seems it only checks the first part, [\uD800-\uDBFF]. – Albert Commented Mar 13, 2015 at 12:53
- @collapsar, so you mean I must use two regular expressions? One for 16-bit code points, and the other for U+10000 - U+10FFFF? – Albert Commented Mar 13, 2015 at 12:56
1 Answer
I finally solved it.
The answer to my own question, as @collapsar told me, could be:
function validateIllegalChars(str) {
var re1 = /[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g; // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
var re2 = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g; // [#x10000-#x10FFFF]
var res = str.replace(re1, '').replace(re2, ''); // Should remove any valid character
if (!!res && res.length > 0) { // any remaining characters, means input str is not valid
return false;
}
return true;
}
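One edge case may be worth guarding against: with two chained replace calls, an input such as '\uD801a\uDC00' has its 'a' removed first, the two lone surrogates then sit next to each other, and the second replace removes them as if they were a valid pair. A possible alternative, sketched below, matches the whole string in a single pass. The names XML10_VALID, isValidXml10, XML10_INVALID_U and isValidXml10Unicode are hypothetical, and the second variant assumes an engine with ES2015 support for the "u" flag:

```javascript
// Single pass: each step consumes either one valid BMP character or one
// complete surrogate pair (high half followed directly by low half).
var XML10_VALID = /^(?:[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/;

function isValidXml10(str) {
  return XML10_VALID.test(str);
}

// ES2015+ variant: with the "u" flag the class works on whole code points,
// so the negated class from the Java version can be expressed directly.
// A lone surrogate is seen as its own code point and falls outside every range.
var XML10_INVALID_U = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/u;

function isValidXml10Unicode(str) {
  return !XML10_INVALID_U.test(str);
}
```

Both functions accept 'eo\uD800\uDC01' (U+10001 written as a surrogate pair) and reject '\uD801' as well as '\uD801a\uDC00'.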
The previous examples (the ones I posted on jsfiddle) didn't work for me because String.fromCharCode(65537) does not generate the character with code point U+10001, as I thought, but U+0001.
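A short sketch of that difference, for anyone building similar test strings. String.fromCharCode keeps only the low 16 bits of each argument; String.fromCodePoint is standard but requires ES2015 support, which is an assumption about the target browsers. The variable names are only for illustration:

```javascript
// 65537 is 0x10001; only the low 16 bits (0x0001) survive.
var truncated = String.fromCharCode(65537);

// Option 1: pass the two surrogate halves of U+10001 explicitly.
var viaSurrogates = String.fromCharCode(0xD800, 0xDC01);

// Option 2 (ES2015+): fromCodePoint accepts the full code point.
var viaCodePoint = String.fromCodePoint(0x10001);

var truncatedCode = truncated.charCodeAt(0);        // 1
var sameString = (viaSurrogates === viaCodePoint);  // true
var pairLength = viaCodePoint.length;               // 2 (two UTF-16 code units)
```

With either option, a test string such as 'eo' + viaCodePoint really contains U+10001.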
Thanks for the help.