
Javascript - regex to remove special characters but also keep greek characters - Stack Overflow


I am trying to remove special characters from a piece of text, but using the following regular expression

var desired = stringToReplace.replace(/[^\w\s]/gi, '')

(found here: javascript regexp remove all special characters)

has the side effect of also deleting Greek characters, which I don't want.

Can someone also explain to me how to use character ranges in regular expressions? Is there a character map that can help me define the range I want?

Answer:

[a-zA-Z0-9ΆΈ-ώ\s]   # See my 2nd comment under Joeytje50's answer.

Asked Apr 27, 2014 at 18:31 by tgogos; edited May 23, 2017 at 12:00.
  • You need to define what you mean by “greek characters”. Do you mean letters and punctuation marks used in modern Greek, or any characters that belong to the Greek script (Greek writing system)? – Jukka K. Korpela Commented Apr 27, 2014 at 19:19

3 Answers

The way these ranges are defined is based on their character code. So, since A has char code 65, and z has char code 122, the following regex:

[A-z]

would match every letter, but also every character whose char code falls between the uppercase and the lowercase letters, namely codes 91 through 96: the characters [ \ ] ^ _ and `.
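To see this concretely, here is a minimal sketch that walks the char codes from A to z and collects the non-letters that [A-z] still accepts. The variable names are only illustrative.

```javascript
// Char codes of the two endpoints of the [A-z] range.
const codeOfA = 'A'.charCodeAt(0); // 65
const codeOfZ = 'z'.charCodeAt(0); // 122

// Collect every character that [A-z] matches but [a-zA-Z] does not.
const strayMatches = [];
for (let code = codeOfA; code <= codeOfZ; code++) {
  const ch = String.fromCharCode(code);
  if (/[A-z]/.test(ch) && !/[a-zA-Z]/.test(ch)) {
    strayMatches.push(ch);
  }
}

console.log(strayMatches.join('')); // [\]^_`
```

This is why [a-zA-Z] is normally written as two separate ranges.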

Now, for Greek letters, the character codes for the uppercase characters are 913-937 for alpha through omega, and the lowercase characters are 945-969 for alpha through omega (this includes both lowercase variants of sigma, namely ς (962) and σ (963)).
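If you want to check such codes yourself, a short sketch like the following prints the decimal code point of each boundary letter. The array name is just an illustration.

```javascript
// Boundary letters: capital alpha/omega, small alpha, both sigmas, small omega.
const boundaries = ['Α', 'Ω', 'α', 'ς', 'σ', 'ω'];

// codePointAt(0) returns the decimal code point of the first character.
const codes = boundaries.map((ch) => ch.codePointAt(0));

console.log(codes); // [913, 937, 945, 962, 963, 969]
```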

So, to match Latin letters, Greek letters, and Arabic numerals, you need the following character class (add a ^ right after the opening bracket to match every character except those, which is what you want for removal):

[a-zA-Z0-9α-ωΑ-Ω]

So, ranges of Greek letters work just like ranges of Latin letters.
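A minimal sketch of how this class would be used for removal: negate it with ^ and replace every match with an empty string. The sample string is made up.

```javascript
// Made-up input mixing Greek, Latin, digits, and punctuation.
const sample = 'αβγ-ABC_123!';

// The leading ^ negates the class, so everything NOT listed is removed.
const cleaned = sample.replace(/[^a-zA-Z0-9α-ωΑ-Ω]/g, '');

console.log(cleaned); // αβγABC123
```

Unlike \w, this class does not include the underscore, so _ is removed as well.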


Edit: I've tested this on a Google-Translated Lipsum, and it looks like this doesn't take accented letters into account. I checked the character codes of these accented letters, and it turns out they sit right before and right after the lowercase letters. So, the following regex also covers the accented lowercase letters:

[a-zA-Z0-9ά-ωΑ-ώ]


This expanded range now also includes άέήίΰ (char codes 940 through 944) and ϊϋόύώ (codes 970 through 974).

To also include whitespace (spaces, tabs, newlines), simply include a \s in the range:

[a-zA-Z0-9ά-ωΑ-ώ\s]
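The difference is easy to see on a made-up sample. This sketch assumes the input uses precomposed accented letters, so it calls normalize('NFC') first; a letter stored as base letter plus combining accent would not fall inside these ranges.

```javascript
// Made-up sample; NFC makes sure accented letters are single code points.
const word = 'παράδειγμα, ώρα!'.normalize('NFC');

// Unaccented ranges only: the accented letters are stripped too.
const basic = word.replace(/[^a-zA-Z0-9α-ωΑ-Ω\s]/g, '');

// Extended ranges: accented lowercase letters survive.
const extended = word.replace(/[^a-zA-Z0-9ά-ωΑ-ώ\s]/g, '');

console.log(basic);    // παρδειγμα ρα
console.log(extended); // παράδειγμα ώρα
```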



Edit: Apparently there are more Greek letters that need to be included in this range, namely those in the range [Ά-Ϋ], which sits right before ά, so the new regex looks like this:

[a-zA-Z0-9Ά-ωΑ-ώ\s]
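One thing to be aware of with a range that starts at Ά (code 902): it also covers code 903, the Greek ano teleia (·), which is punctuation. The class quoted in the question, ΆΈ-ώ, lists Ά on its own and starts the range at Έ (904), which skips it. A small sketch, with the classes written as \u escapes so the code points are explicit:

```javascript
// U+0387 GREEK ANO TELEIA, code 903, sits between Ά (902) and Έ (904).
const anoTeleia = String.fromCodePoint(903);

// Same as [a-zA-Z0-9Ά-ωΑ-ώ\s]
const rangeFromAlphaTonos = /[a-zA-Z0-9\u0386-\u03c9\u0391-\u03ce\s]/;

// Same as [a-zA-Z0-9ΆΈ-ώ\s]
const rangeSkipping903 = /[a-zA-Z0-9\u0386\u0388-\u03ce\s]/;

console.log(rangeFromAlphaTonos.test(anoTeleia)); // true
console.log(rangeSkipping903.test(anoTeleia));    // false
```

Whether that matters depends on whether you count the ano teleia as a special character to remove.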


Try adding the range of Greek characters like this:

/[^\w\sΆΈ-ϗἀ-῾]/gi

I created this pattern by looking at Unicode pages 0370 Greek and Coptic and 1F00 - Greek Extended. I don't speak Greek, and it's likely that a more restricted character set would be more appropriate, but this seems to work:

"-ἄλφα-".replace(/[^\w\sΆΈ-ϗἀ-῾]/gi, ''); // "ἄλφα"

var stringToReplace = "παράδειγμαs & /(";
var result = stringToReplace.replace(/[^\u0370-\u03FF\w\s]/mg, "");

DEMO:

http://jsfiddle.net/tuga/LKjYd/

0370-03FF Greek and Coptic Character Block 

http://apps.timwhitlock.info/js/regex
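As an alternative to spelling out ranges by hand, newer JavaScript engines (ES2018 and later) support Unicode property escapes with the u flag. This is a different technique from the ones in the answers above, shown here only as a sketch on a made-up input; note that \p{Script=Greek} also matches a few non-letter signs that belong to the Greek script.

```javascript
// Made-up input: polytonic and monotonic Greek plus punctuation.
const input = '-ἄλφα- & παράδειγμα_1!'.normalize('NFC');

// Block range U+0370-U+03FF: misses Greek Extended (U+1F00-U+1FFF),
// and \w keeps the underscore.
const byBlock = input.replace(/[^\u0370-\u03FF\w\s]/g, '');

// Script-based class; the u flag is required for \p{...}.
const byScript = input.replace(/[^\p{Script=Greek}a-zA-Z0-9\s]/gu, '');

console.log(byBlock);  // λφα  παράδειγμα_1
console.log(byScript); // ἄλφα  παράδειγμα1
```

The block-based version drops the polytonic ἄ because it lives outside the Greek and Coptic block, while the script-based version keeps it.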
