regex - Remove Unicode characters within various ranges in javascript - Stack Overflow

I'm trying to remove every Unicode character in a string if it falls in any of the ranges below.

\uD800-\uDFFF
\u1D800-\u1DFFF
\u2D800-\u2DFFF
\u3D800-\u3DFFF
\u4D800-\u4DFFF
\u5D800-\u5DFFF
\u6D800-\u6DFFF
\u7D800-\u7DFFF
\u8D800-\u8DFFF
\u9D800-\u9DFFF
\uAD800-\uADFFF
\uBD800-\uBDFFF
\uCD800-\uCDFFF
\uDD800-\uDDFFF
\uED800-\uEDFFF
\uFD800-\uFDFFF
\u10D800-\u10DFFF

As an initial prototype, I tried to just remove characters within the first range by using a regex in the replace function.

var buffer = "he\udfffllo world";
var output = buffer.replace(/[\ud800-\udfff]/g, "");
d.innerText = buffer + " is replaced with " + output;

In this case, the character seems to have been replaced fine.

However, when I replace that with

var buffer = "he\udfffllo worl\u1dfffd";
var output = buffer.replace(/[\ud800-\udfff\u1d800-\u1dfff]/g, "");
d.innerText = buffer + " is replaced with " + output;

I see something unexpected. My output shows up as:

he�llo worl᷿fd is replaced with

There are two things to note here:

  1. \u1dfff does not show up as one character: \u1dff gets converted to a character and the f at the end is treated as its own character.
  2. The result is an empty string.

Any suggestions on how I can accomplish this would be much appreciated.
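Both observations can be reproduced outside the browser. The sketch below assumes a modern JavaScript engine (ES2015 or later); the variable names are illustrative only. Without the u flag, a \u escape consumes exactly four hex digits, so \u1d800-\u1dfff inside a character class is read as the single character \u1d80, then the range from the digit 0 to \u1dff, then the letter f. That range covers every ASCII letter, which is why almost nothing survives. The \u{...} escape together with the u flag is the standard way to address code points above U+FFFF.

```javascript
// "\u1dfff" is two UTF-16 code units: U+1DFF followed by the letter "f".
const escaped = "\u1dfff";
const firstUnit = escaped.charCodeAt(0); // 0x1dff
const secondChar = escaped[1];           // "f"

// The class below is parsed as: \ud800-\udfff, \u1d80, 0-\u1dff, f.
// Everything from "0" (U+0030) up to U+1DFF is removed; only the space (U+0020) is left.
const buffer = "he\udfffllo worl\u1dfffd";
const output = buffer.replace(/[\ud800-\udfff\u1d800-\u1dfff]/g, "");

// With the u flag, \u{...} names a full code point, so the range means what it says.
const astral = "worl\u{1DFFF}d";
const stripped = astral.replace(/[\u{1D800}-\u{1DFFF}]/gu, "");

console.log(JSON.stringify(output));   // a single space
console.log(stripped);                 // world
```

The output of the second attempt is therefore a lone space rather than a truly empty string, which is easy to miss when it is rendered through innerText.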


EDIT

My overall goal is to filter out all characters that the encodeURIComponent function considers invalid. I ran some tests and found the list above to be the set of characters that are invalid. For instance, the code below, which first converts 1dfff to a Unicode character before passing it to encodeURIComponent, causes the latter function to raise an exception.

var v = String.fromCharCode(122879);
var uriComponent = encodeURIComponent(v);
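A minimal sketch of what that exception looks like, assuming a standard engine such as Node.js or a current browser (the variable names are hypothetical). encodeURIComponent throws a URIError when the string contains a surrogate code unit that is not part of a valid high/low pair, and it encodes a properly paired surrogate as an ordinary four-byte UTF-8 sequence.

```javascript
// The same call as in the question, wrapped so the error can be inspected.
const lone = String.fromCharCode(122879);
let caught = null;
try {
  encodeURIComponent(lone);
} catch (err) {
  caught = err; // URIError: the string holds an unpaired surrogate
}

// A well-formed surrogate pair (U+1F600) encodes without any error.
const paired = "\ud83d\ude00";
const encodedPair = encodeURIComponent(paired);

console.log(caught instanceof URIError); // true
console.log(encodedPair);                // %F0%9F%98%80
```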

I edited parts of the question after @Blender pointed out that I was using x instead of u in my code to represent Unicode characters.


EDIT 2

I investigated my technique for fetching the "invalid" Unicode ranges further, and as it turns out, if you give String.fromCharCode a number that's larger than 16 bits, it'll just look at the lowest 16 bits of the number. That explains the pattern I was seeing. So as it turns out, I only need to worry about the first range.
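The truncation can be checked directly. This is a small sketch, assuming ES2015 or later for String.fromCodePoint; the variable names are illustrative. String.fromCharCode keeps only the low 16 bits of its argument, so 0x1DFFF becomes the lone surrogate 0xDFFF, whereas String.fromCodePoint builds the real code point U+1DFFF as a surrogate pair that encodeURIComponent accepts.

```javascript
// fromCharCode truncates to 16 bits: 0x1DFFF & 0xFFFF === 0xDFFF.
const truncated = String.fromCharCode(0x1dfff);
const truncatedUnit = truncated.charCodeAt(0); // 0xdfff, a lone low surrogate

// fromCodePoint produces the actual code point as a two-unit surrogate pair.
const full = String.fromCodePoint(0x1dfff);
const fullCodePoint = full.codePointAt(0);     // 0x1dfff

// The real U+1DFFF is well-formed UTF-16, so it encodes without throwing.
const encodedFull = encodeURIComponent(full);

console.log(truncatedUnit.toString(16)); // dfff
console.log(full.length);                // 2
console.log(encodedFull);                // %F0%9D%BF%BF
```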

asked Jun 2, 2013 at 2:27 by K Mehta; edited Jun 2, 2013 at 4:51 by hippietrail
  • \xdfff is interpreted as \xdf, f and f. – Blender Commented Jun 2, 2013 at 2:44
  • Ahh you're right, that explains why I was seeing weird results in my second attempt. Changing that part of the question now. – K Mehta Commented Jun 2, 2013 at 2:49
  • The notation \u1D800 and most of the other notations are not valid at all (or, technically, \u1D800 means U+1D80 followed by the digit zero). Please formulate your question in terms of Unicode characters, not using assumed (and invalid) escape notations for them. – Jukka K. Korpela Commented Jun 2, 2013 at 12:02
  • @JukkaK.Korpela What I've often noticed on SO is that people who've had knowledge about a topic for a while often forget that others exploring the same topic don't know even the basics to formulate the question correctly. In fact, if they did, they'd be far enough along to have answered their question themselves, leaving the entire point of SO moot. I'm not advocating for laziness; I researched, but just didn't know the right keywords to put into a search engine, which is why I turned to SO. So I formulated my question the best I knew how and posted it here and even edited as I learned more. – K Mehta Commented Jun 2, 2013 at 20:01

1 Answer

It seems you're trying to remove Unicode surrogate code units from the string. However, only U+D800 through U+DFFF are surrogate code points; the remaining values you name are not, and could be allocated to valid Unicode characters. In that case, the following will suffice (use \u rather than \x to refer to Unicode characters):

buffer.replace(/[\ud800-\udfff]/g, "");
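One caveat worth illustrating: the pattern above removes every surrogate code unit, including those that form valid pairs (emoji and other characters above U+FFFF). If only the unpaired surrogates that make encodeURIComponent throw should go, a variant with lookarounds can be used. This is a sketch rather than part of the original answer; it assumes an engine with regex lookbehind support (ES2018), and stripLoneSurrogates is a hypothetical helper name.

```javascript
// Matches a high surrogate not followed by a low one,
// or a low surrogate not preceded by a high one.
const loneSurrogates =
  /[\ud800-\udbff](?![\udc00-\udfff])|(?<![\ud800-\udbff])[\udc00-\udfff]/g;

// Hypothetical helper: drops unpaired surrogates, keeps valid pairs.
function stripLoneSurrogates(text) {
  return text.replace(loneSurrogates, "");
}

const mixed = "he\udfffllo \ud83d\ude00 worl\ud800d";
const cleaned = stripLoneSurrogates(mixed);            // emoji is preserved
const allRemoved = mixed.replace(/[\ud800-\udfff]/g, ""); // emoji is lost

console.log(cleaned === "hello \ud83d\ude00 world"); // true
console.log(allRemoved === "hello  world");          // true
console.log(encodeURIComponent(cleaned));            // no URIError
```

Which variant fits depends on whether the input is expected to contain characters outside the Basic Multilingual Plane.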