A character such as '這' matches NonEnglish and takes up 2 spaces, while an English character takes up 1 space. The maximum length is 10 spaces. How do I get the part of the string that fits in the first 10 spaces?
For the example below, how do I get the result "This這 is"?
I'm trying to use a for loop starting from the first character, but I don't know how to get each character in the string...
var string = "This這 is是 English中文 …";
var NonEnglish = "[^\u0000-\u0080]+",
    Pattern = new RegExp(NonEnglish),
    MaxLength = 10,
    Ratio = 2;
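To make the counting rule concrete, here is a minimal sketch that only measures a string under the rule described above. The name weightedLength is hypothetical (it is not part of the original code), and the sketch assumes that every character outside \u0000-\u0080 costs `ratio` spaces and every other character costs 1.

```javascript
// Hypothetical helper: total "space" used by a string under the rule above.
// Assumption: characters outside \u0000-\u0080 cost `ratio` spaces, all others cost 1.
function weightedLength(text, ratio) {
    var nonEnglish = /[^\u0000-\u0080]/; // matches ONE character outside the range
    var total = 0;
    for (var i = 0; i < text.length; i++) {
        total += nonEnglish.test(text.charAt(i)) ? ratio : 1;
    }
    return total;
}

console.log(weightedLength("This", 2));        // 4
console.log(weightedLength("This這 is", 2));   // 9
console.log(weightedLength("This這 is是", 2)); // 11
```

With a limit of 10, "This這 is" (9 spaces) still fits, while adding '是' would bring the total to 11, which is why the expected result stops there.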
asked Feb 27, 2014 at 5:19 by user1775888 (edited Feb 27, 2014 at 5:21)
- Do you need to get the first 10 symbols of the string, or what? – Y.Puzyrenko Commented Feb 27, 2014 at 5:29
- If it's a mix of English & non-English, can't you just remove the non-English since you don't need it? Then do a split after that. – fedmich Commented Feb 27, 2014 at 5:29
- @Good.luck I need to get the first 10 symbols, but each non-English character counts as 2 symbols. – user1775888 Commented Feb 27, 2014 at 5:30
- @fedmich ?? The words are just an example; the string might be th中文isisiisi – user1775888 Commented Feb 27, 2014 at 5:32
- @user1775888 Are we supposed to use the same regex you provide or something of our own? – HighBoots Commented Feb 27, 2014 at 5:38
2 Answers
If you mean you want to get the part of the string up to the point where its length reaches 10, here's the answer:
var string = "This這 is是 English中文 …";
function check(string){
    // Characters in \u0000-\u0080 count as 1; the other characters, which the OP wants, count as 2
    var nonEnglish = /[^\u0000-\u0080]/, maxLength = 10, length = 0, i = 0;
    // you can index into strings just like arrays
    for(; i < string.length; i++){
        // if the character is what the OP wants, it costs 2, else 1
        var width = nonEnglish.test(string[i]) ? 2 : 1;
        // if this character would push the length past 10, break out of the loop
        if(length + width > maxLength) break;
        length += width;
    }
    // return the string from the first character up to (not including) the index where the loop stopped
    return string.slice(0, i);
}
alert(check(string));
EDIT 1:
- Replaced .match with .test. The former returns a whole array while the latter simply returns true or false.
- Improved RegEx. Since we are checking only one character at a time, there is no need for the + that was there before. The ^ has to stay inside the brackets, because it is what negates the \u0000-\u0080 range.
- Replaced len with string.length.
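The points in the edit note can be checked directly. The following is a small sketch rather than the answerer's code, and the variable names are made up for illustration. It shows what .test and .match return, and what the brackets, ^ and + contribute to the pattern.

```javascript
// Illustrative only: the variable names here are hypothetical.
var oneChar = /[^\u0000-\u0080]/;  // ^ inside [...] negates the range: one character outside it
var run = /[^\u0000-\u0080]+/;     // + extends the match to a run of such characters

// .test returns a boolean, which is all a per-character check needs
console.log(oneChar.test("這")); // true
console.log(oneChar.test("T"));  // false

// .match returns an array of match details (or null when nothing matches)
console.log("中文abc".match(run)[0]);     // "中文"
console.log("中文abc".match(oneChar)[0]); // "中"
console.log("abc".match(oneChar));        // null

// Without the brackets and ^, the pattern is a literal sequence: U+0000, "-", U+0080
console.log(/\u0000-\u0080/.test("這"));  // false
```

The last line is why the brackets matter: a pattern without them never matches an ordinary character, so every character would be counted as 1.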
I'd suggest something along the following lines (assuming that you're trying to break the string up into snippets that are <= 10 bytes in length when encoded as UTF-8; note that by this measure a character such as '這' counts as 3 bytes, not 2):
var string = "This這 is是 English中文 …";
function byteCount(text) {
    // get the number of UTF-8 bytes consumed by a string
    return encodeURI(text).split(/%..|./).length - 1;
}
function tokenize(text, targetLen) {
    // break a string up into snippets that are <= our target length
    var result = [];
    var pos = 0;
    var current = "";
    while (pos < text.length) {
        var next = current + text.charAt(pos);
        if (byteCount(next) > targetLen) {
            // this character would overflow: close the snippet and retry the same character
            result.push(current);
            current = "";
            pos--;
        }
        else if (byteCount(next) === targetLen) {
            // exact fit: close the snippet including this character
            result.push(next);
            current = "";
        }
        else {
            current = next;
        }
        pos++;
    }
    // keep whatever is left over as the last snippet
    if (current !== "") {
        result.push(current);
    }
    return result;
}
console.log(tokenize(string, 10));
http://jsfiddle/5pc6L/
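The byteCount helper works because encodeURI percent-encodes each UTF-8 byte of a non-ASCII character as one %XX group, so counting groups plus plain characters gives the byte length. As a cross-check, here is a sketch that compares it with TextEncoder, assuming a runtime where TextEncoder is available (modern browsers and Node.js). The name byteCountEncoder is hypothetical.

```javascript
// From the answer above: count UTF-8 bytes via percent-encoding.
function byteCount(text) {
    return encodeURI(text).split(/%..|./).length - 1;
}

// Hypothetical cross-check; assumes TextEncoder exists in the runtime.
function byteCountEncoder(text) {
    return new TextEncoder().encode(text).length;
}

console.log(encodeURI("這"));         // "%E9%80%99": three %XX groups
console.log(byteCount("這"));         // 3
console.log(byteCountEncoder("這"));  // 3
console.log(byteCount("This這 is"));  // 10
```

One difference worth knowing: encodeURI throws a URIError when the string contains a lone surrogate, which can happen if a string is cut in the middle of a surrogate pair, whereas TextEncoder substitutes U+FFFD instead of throwing.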