javascript - Regex character count, but some count for three - Stack Overflow

I'm trying to build a regular expression that places a limit on the input length, but not all characters count equally towards this length. I'll put the rationale at the bottom of the question. As a simple example, let's limit the maximum length to 12 and allow only a and b, but b counts for 3 characters.

Allowed are:

  • aa (anything less than 12 is fine).
  • aaaaaaaaaaaa (exactly 12 is fine).
  • aaabaaab (6 + 2 * 3 = 12, which is fine).
  • abaaaaab (still 6 + 2 * 3 = 12).

Disallowed is:

  • aaaaaaaaaaaaa (13 a's).
  • bbbba (1 + 4 * 3 = 13, which is too much).
  • baaaaaaab (7 + 2 * 3 = 13, which is too much).

I've made an attempt that gets fairly close:

^(a{0,3}|b){0,4}$

This matches on up to 4 clusters that may consist of 0-3 a's or one b.

However, it fails to match on my last positive example: abaaaaab, because that forces the first cluster to be the single a at the beginning, consumes a second cluster for the b, then leaves only 2 more clusters for the rest, aaaaab, which is too long.
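To see that behaviour concretely, here is a small sketch that runs the attempted pattern over the seven examples from the lists above. The `examples` table and the `results` array are names introduced here for illustration only.

```javascript
// The attempted pattern from the question: up to 4 clusters of "0-3 a's or one b".
const attempt = /^(a{0,3}|b){0,4}$/;

// [input, should it be accepted?] taken from the allowed/disallowed lists.
const examples = [
  ['aa', true],
  ['aaaaaaaaaaaa', true],
  ['aaabaaab', true],
  ['abaaaaab', true],
  ['aaaaaaaaaaaaa', false],
  ['bbbba', false],
  ['baaaaaaab', false],
];

// Record what the pattern says for each example next to what is wanted.
const results = examples.map(([input, wanted]) => ({
  input,
  wanted,
  got: attempt.test(input),
}));

for (const { input, wanted, got } of results) {
  console.log(input, got === wanted ? 'ok' : `wrong (wanted ${wanted}, got ${got})`);
}
```

Only abaaaaab should be reported as wrong, since it needs five clusters (a, b, aaa, aa, b) no matter how the a's are grouped.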

Constraints

  • Must run in JavaScript. This regex is supplied to Qt, which apparently uses JavaScript's syntax.
  • Doesn't really need to be fast. In the end it'll only be applied to strings of up to 40 characters. I hope it validates within 50ms or so, but slightly slower is acceptable.

Rationale

Why do I need to do this with a regular expression?

It's for a user interface in Qt via PyQt and QML. The user can type a name for a profile in a text field. This profile name is URL-encoded (special characters are replaced by %XX) and then saved on the user's file system. We encounter problems when the user types a lot of special characters, such as Chinese, which then encode to a very long file name. It turns out that at around 17 characters, this file name becomes too long for some file systems. The URL-encoding encodes as UTF-8, which has up to 4 bytes per character, resulting in up to 12 characters in the file name for a single typed character (as each of these bytes gets percent-encoded).
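As a rough illustration of how the encoded length grows, the sketch below measures a name after percent-encoding. It uses JavaScript's built-in `encodeURIComponent` as a stand-in; the application's actual encoder (on the Python side) may treat some punctuation differently, and `encodedLength` is a name made up for this example.

```javascript
// Length of a profile name after percent-encoding.
// Assumption: encodeURIComponent approximates the encoder the application uses.
function encodedLength(name) {
  return encodeURIComponent(name).length;
}

console.log(encodedLength('profile')); // ASCII letters are left as they are
console.log(encodedLength('敗'));      // one 3-byte character turns into %E6%95%97
console.log(encodedLength('😀'));      // a 4-byte character turns into four %XX groups
```

Each UTF-8 byte of a non-ASCII character costs three characters in the file name, which is where the "counts for three" weighting in the simplified example comes from.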

16 characters is too short for profile names. Even some of our default names exceed that. We need a variable limit based on these special characters.

Qt normally allows you to specify a Validator to determine which values are acceptable in a text box. We tried implementing such a validator, but that resulted in a segfault upstream, due to a bug in PyQt. It can't seem to handle custom Validator implementations at the moment. However, PyQt also exposes three built-in validators. Two apply only to numbers. The third is a regex validator that allows you to put a regular expression that matches all valid strings. Hence the need for this regular expression.

asked Oct 28, 2016 at 1:32 by Ghostkeeper
  • I can make this regex without much trouble, but I feel dirty doing it. I've made several attempts, but can't make a good, generic solution that can be expanded for longer strings (length 13 for example) or higher values (b=4 for example) – Addison Commented Oct 28, 2016 at 4:29
  • Could you not check the length of the submitted name (after URL-encoding) and then decide to accept or reject it? That seems the simplest solution. – A. L Commented Oct 28, 2016 at 5:00
  • I'm bookmarking this question as my point of reference on how to ask a good regex question. Too many regex questions out there are sloppily written, unspecific and unclear. This is perfect. – Tim Pietzcker Commented Oct 28, 2016 at 5:28
  • @A.Lau That's impossible. I've tried a solution where I could write my own validator via PyQt, but that resulted in a segfault. We traced that to a bug in PyQt and submitted a chreq for Riverbank Solutions. I'm therefore limited to using one of their built-in validators. The only validator that applies to other stuff than numbers is the RegExpValidator. – Ghostkeeper Commented Oct 28, 2016 at 8:04

3 Answers

There is no really straightforward way to do this, given the limitations of regexp. You're going to have to test for all combinations; for a limit of 40, that means thirteen b with up to one a, twelve b with up to four a, and so on. We will build a little program to generate these for us. The basic format for testing for up to four a will be

/^(?=([^a]*a){0,4}[^a]*$)/

We'll write a little routine to create these lookaheads for us, given some letter and a minimum and maximum number of occurrences:

function matchLetter(c, m, n) {
  return `(?=([^${c}]*${c}){${m},${n}}[^${c}]*$)`;
}

> matchLetter('a', 0, 4)
< "(?=([^a]*a){0,4}[^a]*$)"

We can combine these to test for three b with up to three a:

/^(?=([^b]*b){3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)/

We will write a function to create such combined lookaheads, which match exactly m occurrences of c1 and up to n occurrences of c2:

function matchTwoLetters(c1, m, c2, n) {
  return matchLetter(c1, m, m) + matchLetter(c2, 0, n);
}

We can use this to match exactly twelve b and up to four a, for a total of forty or less:

> matchTwoLetters('b', 12, 'a', 4)
< "(?=([^b]*b){12,12}[^b]*$)(?=([^a]*a){0,4}[^a]*$)"

It remains to simply create versions of this for each count of b, and glom them together (for the case of a max count of 12):

function makeRegExp() {
  const res = [];
  for (let bs = 0; bs <= 4; bs++)
    res.push(matchTwoLetters('b', bs, 'a', 12 - bs*3));
  return new RegExp(`^(${res.join('|')})`);
}

> makeRegExp().source
< "^((?=([^b]*b){0,0}[^b]*$)(?=([^a]*a){0,12}[^a]*$)|(?=([^b]*b){1,1}[^b]*$)(?=([^a]*a){0,9}[^a]*$)|(?=([^b]*b){2,2}[^b]*$)(?=([^a]*a){0,6}[^a]*$)|(?=([^b]*b){3,3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)|(?=([^b]*b){4,4}[^b]*$)(?=([^a]*a){0,0}[^a]*$))"

Now you can do the test with

makeRegExp().test("baabaaa");

For the case of length=40, the regexp is 679 characters long. A very rough benchmark shows that it executes in under a microsecond.
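The generator above hard-codes the limit of 12 and the weight of 3. A minimal sketch of a parameterized variant follows, assuming the two symbols are plain letters that need no escaping. The names `countLookahead` and `makeWeightedRegExp` are invented here, and the trailing character class that rejects any other character is an addition that the functions above do not have.

```javascript
// Lookahead asserting that `c` occurs between `min` and `max` times in the input.
function countLookahead(c, min, max) {
  return `(?=(?:[^${c}]*${c}){${min},${max}}[^${c}]*$)`;
}

// Build a regexp accepting strings of `light` (weight 1) and `heavy`
// (weight `heavyWeight`) whose total weight is at most `limit`.
function makeWeightedRegExp(limit, heavyWeight, light = 'a', heavy = 'b') {
  const alternatives = [];
  for (let heavies = 0; heavies * heavyWeight <= limit; heavies++) {
    // Exactly `heavies` heavy symbols leave this much room for light ones.
    const room = limit - heavies * heavyWeight;
    alternatives.push(
      countLookahead(heavy, heavies, heavies) + countLookahead(light, 0, room)
    );
  }
  // Addition not in the answer: consume the input and allow only the two symbols.
  return new RegExp(`^(?:${alternatives.join('|')})[${light}${heavy}]*$`);
}

const weighted = makeWeightedRegExp(12, 3);
console.log(weighted.test('abaaaaab'));  // 6 + 2 * 3 = 12
console.log(weighted.test('baaaaaaab')); // 7 + 2 * 3 = 13
```

Calling `makeWeightedRegExp(13, 4)` would cover the variations raised in the comments on the question (a longer limit, or b counting for 4).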

If you want to count bytes when multibyte encoding is present, you can use this function:

function bytesLength(str) {
  var s = str.length;
  for (var i = s-1; i > -1; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) {s++;}
    else if (code > 0x7ff && code <= 0xffff) {s+=2;}
    if (code >= 0xDC00 && code <= 0xDFFF) {i--;}
  }
  return s;
}

console.log(bytesLength('敗')); // length 3
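An alternative sketch, not taken from the answer above: iterating with `for...of` walks the string by code point, so surrogate pairs need no special handling. `utf8ByteLength` is a name chosen for this example, and an unpaired surrogate is counted as 3 bytes here, which is an assumption about how the encoder would treat it.

```javascript
// Count the UTF-8 bytes of a string, one code point at a time.
function utf8ByteLength(str) {
  let bytes = 0;
  for (const ch of str) {            // for...of yields whole code points
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) bytes += 1;        // ASCII
    else if (cp <= 0x7ff) bytes += 2;  // e.g. accented Latin letters
    else if (cp <= 0xffff) bytes += 3; // rest of the Basic Multilingual Plane
    else bytes += 4;                   // supplementary planes, e.g. emoji
  }
  return bytes;
}

console.log(utf8ByteLength('敗'));
console.log(utf8ByteLength('a敗😀'));
```

Multiplying the byte count of the non-ASCII characters by three gives their share of the percent-encoded file name length.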

Try using something like this:

^((a{1,3}|b){1,4}|(a{1,4}|a?b|ba){1,3}|((a{2,3}|b){2}|aaba|abaa){2})$

Example: https://regex101.com/r/yTTiEX/6

This breaks it up into the logical possibilities:

4 parts, each with a value up to 3.
3 parts, each with a value up to 4.
2 parts, each with a value up to 6.
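Because this pattern enumerates cases by hand, it is worth checking it against the plain counting rule. The sketch below does that by brute force over every non-empty string of a's and b's up to 13 characters; `weightedLength` and `findMismatches` are helper names introduced here, not part of the answer.

```javascript
const candidate = /^((a{1,3}|b){1,4}|(a{1,4}|a?b|ba){1,3}|((a{2,3}|b){2}|aaba|abaa){2})$/;

// Reference rule from the question: a counts for 1, b counts for 3.
function weightedLength(str) {
  let total = 0;
  for (const ch of str) total += ch === 'b' ? 3 : 1;
  return total;
}

// Enumerate every non-empty string over {a, b} up to maxLen and collect
// the ones where the pattern and the reference rule disagree.
function findMismatches(pattern, limit, maxLen) {
  const mismatches = [];
  let current = [''];
  for (let len = 1; len <= maxLen; len++) {
    current = current.flatMap(s => [s + 'a', s + 'b']);
    for (const s of current) {
      if (pattern.test(s) !== (weightedLength(s) <= limit)) mismatches.push(s);
    }
  }
  return mismatches;
}

const mismatches = findMismatches(candidate, 12, 13);
console.log(mismatches.length === 0 ? 'no mismatches' : mismatches);
```

Tracing by hand suggests the pattern never accepts an over-long string, but that it may miss some valid ones: aabbab (3 + 3 * 3 = 12), for instance, does not seem to fit any of the three alternatives, so further cases may be needed.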
