javascript - Removing control characters from a UTF-8 string in PHP

I am removing control characters (tab, CR, LF, \v and all other invisible characters) on the client side after input, but since the client cannot be trusted, I have to remove them on the server too.

According to this table: http://www.utf8-chartable.de/

the control characters run from x00 to x1F and from x7F to x9F. So my client-side (JavaScript) control character removal function is:

return s.replace(/[\x00-\x1F\x7F-\x9F]/g, "");

and my server-side (PHP) control character removal function is:

$s = preg_replace('/[\x00-\x1F\x7F-\x9F]/', '', $s);

This causes problems with international UTF-8 characters such as ς (bytes xCF x82), but only in PHP, because the byte x82 falls inside the second range. The JavaScript equivalent does not cause any problems.

My question is: should I remove the control characters from 7F to 9F at all? As far as I understand, the byte values from 127 to 159 (7F to 9F) can obviously be part of a valid UTF-8 string.

Also, maybe I shouldn't even filter the control characters from 0 to 31, because some of those values might also appear inside some unusual (Japanese? Chinese?) but valid UTF-8 characters?

asked Jan 22, 2014 at 13:28 by MirrorMirror, edited Jul 30, 2018 at 21:18 by Rory O'Kane

Maybe this helps you: stackoverflow.com/q/12543476/1066234 It is a different regex. – Avatar Commented Jan 18, 2016 at 17:30

2 Answers

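It seems that I just need to add the u flag to the regex. With it, PCRE treats both the pattern and the subject as UTF-8, so \x7F-\x9F matches code points rather than raw bytes. The call becomes: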

$s = preg_replace('/[\x00-\x1F\x7F-\x9F]/u', '', $s);
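
To see why the flag matters, here is a minimal JavaScript sketch that imitates what the pattern does without `u`: it drops every single byte in the two ranges from the UTF-8 encoding of the input. It only simulates byte-wise matching and does not call PHP. The helper names `stripControlBytes` and `stripControlCodePoints` are made up for this illustration.

```javascript
// Hypothetical helpers, for illustration only.

// Imitates byte-wise matching (like the PHP pattern without the u flag):
// every byte in 0x00-0x1F or 0x7F-0x9F is dropped from the UTF-8 encoding.
function stripControlBytes(s) {
  const bytes = new TextEncoder().encode(s);
  const kept = bytes.filter(b => !(b <= 0x1f || (b >= 0x7f && b <= 0x9f)));
  // Broken sequences left behind decode to U+FFFD (replacement character).
  return new TextDecoder("utf-8").decode(kept);
}

// Character-wise matching, which is how the JavaScript regex behaves here.
function stripControlCodePoints(s) {
  return s.replace(/[\x00-\x1F\x7F-\x9F]/g, "");
}

// "a", TAB, "ς" (bytes CF 82), U+0085 (bytes C2 85), "b"
const input = "a\tς\u0085b";

console.log(JSON.stringify(stripControlBytes(input)));      // ς is destroyed
console.log(JSON.stringify(stripControlCodePoints(input))); // "aςb"
```

The byte-wise version removes the continuation byte x82 of ς and leaves a dangling xCF behind, which is the corruption described in the question. The character-wise version removes only the real control characters.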

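should I remove the control characters from 7F to 9F? To my understanding, the byte values from 127 to 159 (7F to 9F) can obviously be part of a valid UTF-8 string?

You should not strip them as raw bytes, except for \x7F, which is encoded as the single byte \x7F in UTF-8. The bytes \x80 to \x9F are continuation bytes of multi-byte sequences in UTF-8, so removing them byte by byte corrupts other characters. The control characters U+0080 to U+009F themselves are encoded as two bytes each, \xC2\x80 to \xC2\x9F.

maybe I shouldn't even filter the 00 to 31 control characters because some of those values can also appear in some weird (Japanese? Chinese?) but valid UTF-8 characters?

Those control characters are still control characters in UTF-8: each one is encoded as the same single byte, and that byte never occurs inside a multi-byte sequence. Their presence may indicate mojibake; if you want to repair it, preserve them, otherwise filter them out.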
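
A small JavaScript sketch to check this, assuming an environment with `TextEncoder` (modern browsers, Node.js). The helper `utf8Hex` is a made-up name for this illustration; it just prints the UTF-8 bytes of a string as hex.

```javascript
// Hypothetical helper, for illustration only: UTF-8 bytes of a string as hex.
function utf8Hex(s) {
  return Array.from(new TextEncoder().encode(s), b =>
    b.toString(16).padStart(2, "0")
  ).join(" ");
}

console.log(utf8Hex("\t"));     // 09  (control character, one byte)
console.log(utf8Hex("\x7F"));   // 7f  (DEL, one byte)
console.log(utf8Hex("\u0085")); // c2 85  (a 7F-9F range control, two bytes)
console.log(utf8Hex("ς"));      // cf 82
console.log(utf8Hex("日本語"));  // e6 97 a5 e6 9c ac e8 aa 9e

// Every byte of a non-ASCII character is 0x80 or higher, so the bytes
// 0x00-0x1F and 0x7F never appear inside a multi-byte character.
const nonAscii = "ς日本語中文한국어";
const allHigh = new TextEncoder().encode(nonAscii).every(b => b >= 0x80);
console.log(allHigh); // true
```

Note that 日本語 contains the bytes 97, 9c and 9e, all inside the 80 to 9F range, which is why that range must be matched as characters and not as bytes.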
