I'm trying to build a regular expression that places a limit on the input length, but not all characters count equally towards this length. I'll put the rationale at the bottom of the question. As a simple example, let's limit the maximum length to 12 and allow only a and b, but b counts for 3 characters.

Allowed are:

aa (anything less than 12 is fine).
aaaaaaaaaaaa (exactly 12 is fine).
aaabaaab (6 + 2 * 3 = 12, which is fine).
abaaaaab (still 6 + 2 * 3 = 12).

Disallowed is:

aaaaaaaaaaaaa (13 a's).
bbbba (1 + 4 * 3 = 13, which is too much).
baaaaaaab (7 + 2 * 3 = 13, which is too much).

I've made an attempt that gets fairly close:

^(a{0,3}|b){0,4}$

This matches on up to 4 clusters that may consist of 0-3 a's or one b.

However, it fails to match on my last positive example: abaaaaab, because that forces the first cluster to be the single a at the beginning, consumes a second cluster for the b, then leaves only 2 more clusters for the rest, aaaaab, which is too long.
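
To make the failure concrete, here is a minimal sketch that compares the attempt with a plain-code reference on the examples above. The helper name `weightedLength` is hypothetical and only serves as a cross-check; it is not something the validator itself could use.

```javascript
// Reference check in plain code: every 'b' counts as 3, every 'a' as 1.
function weightedLength(s) {
  let total = 0;
  for (const ch of s) total += ch === 'b' ? 3 : 1;
  return total;
}

const attempt = /^(a{0,3}|b){0,4}$/;

// The four allowed and three disallowed examples from the question.
const samples = [
  'aa', 'a'.repeat(12), 'aaabaaab', 'abaaaaab',
  'a'.repeat(13), 'bbbba', 'baaaaaaab',
];

// Collect the samples where the attempt disagrees with the reference.
const mismatches = samples.filter(
  s => attempt.test(s) !== (weightedLength(s) <= 12));

console.log(mismatches); // → [ 'abaaaaab' ]
```

Only abaaaaab is misjudged, which matches the cluster explanation above.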

Constraints

Must run in JavaScript. This regex is supplied to Qt, which apparently uses JavaScript's syntax.
Doesn't really need to be fast. In the end it'll only be applied to strings of up to 40 characters. I hope it validates within 50ms or so, but slightly slower is acceptable.

Rationale

Why do I need to do this with a regular expression?

It's for a user interface in Qt via PyQt and QML. The user can type a name for a profile in a text field. This profile name is URL-encoded (special characters are replaced by %XX) and then saved on the user's file system. We encounter problems when the user types a lot of special characters, such as Chinese, which then encode to a very long file name. It turns out that at around 17 characters, this file name becomes too long for some file systems. The URL-encoding encodes as UTF-8, which has up to 4 bytes per character, resulting in up to 12 characters in the file name (as each of these bytes gets percent-encoded).

16 characters is too short for profile names. Even some of our default names exceed that. We need a variable limit based on these special characters.

Qt normally allows you to specify a Validator to determine which values are acceptable in a text box. We tried implementing such a validator, but that resulted in a segfault upstream, due to a bug in PyQt. It can't seem to handle custom Validator implementations at the moment. However, PyQt also exposes three built-in validators. Two apply only to numbers. The third is a regex validator that allows you to put a regular expression that matches all valid strings. Hence the need for this regular expression.

asked Oct 28, 2016 at 1:32 by Ghostkeeper

I can make this regex without much trouble, but I feel dirty doing it. I've made several attempts, but can't make a good, generic solution that can be expanded for longer strings (length 13 for example) or higher values (b=4 for example) – Addison Commented Oct 28, 2016 at 4:29
Could you not measure the length of the submitted name (after URL-encoding) and then decide to accept or reject it? Seems the simplest solution. – A. L Commented Oct 28, 2016 at 5:00
I'm bookmarking this question as my point of reference on how to ask a good regex question. Too many regex questions out there are sloppily written, unspecific and unclear. This is perfect. – Tim Pietzcker Commented Oct 28, 2016 at 5:28
@A.Lau That's impossible. I've tried a solution where I could write my own validator via PyQt, but that resulted in a segfault. We traced that to a bug in PyQt and submitted a chreq for Riverbank Solutions. I'm therefore limited to using one of their built-in validators. The only validator that applies to other stuff than numbers is the RegExpValidator. – Ghostkeeper Commented Oct 28, 2016 at 8:04


3 Answers

There is no really straightforward way to do this, given the limitations of regexp. You're going to have to test for all combinations, such as thirteen b with up to one a, twelve b with up to four a, and so on (these figures are for a maximum weighted length of 40). We will build a little program to generate these for us. The basic format for testing for up to four a will be

/^(?=([^a]*a){0,4}[^a]*$)/

We'll write a little routine to create these lookaheads for us, given some letter and a minimum and maximum number of occurrences:

function matchLetter(c, m, n) {
  return `(?=([^${c}]*${c}){${m},${n}}[^${c}]*$)`;
}

> matchLetter('a', 0, 4)
< "(?=([^a]*a){0,4}[^a]*$)"

We can combine these to test for three b with up to three a:

/^(?=([^b]*b){3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)/
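
As a quick sanity check, the two lookahead patterns shown so far can be tried on a few strings. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
// Single lookahead: at most four 'a' anywhere in the string.
const upToFourA = /^(?=([^a]*a){0,4}[^a]*$)/;

// Two lookaheads at the same position: exactly three 'b' AND at most three 'a'.
const threeBUpToThreeA = /^(?=([^b]*b){3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)/;

console.log(upToFourA.test('abababab'));        // → true  (four a)
console.log(upToFourA.test('ababababa'));       // → false (five a)
console.log(threeBUpToThreeA.test('bababa'));   // → true  (three b, three a)
console.log(threeBUpToThreeA.test('bbaaa'));    // → false (only two b)
console.log(threeBUpToThreeA.test('bbbaaaa'));  // → false (four a)
```

Because each lookahead consumes no input, both conditions are evaluated from the start of the string, independently of where the letters actually occur.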

We will write a function to create such combined lookaheads which matches exactly m occurrences of c1 and up to n occurrences of c2:

function matchTwoLetters(c1, m, c2, n) {
  return matchLetter(c1, m, m) + matchLetter(c2, 0, n);
}

We can use this to match exactly twelve b and up to four a, for a total of forty or less:

> matchTwoLetters('b', 12, 'a', 4)
< "(?=([^b]*b){12,12}[^b]*$)(?=([^a]*a){0,4}[^a]*$)"

It remains to simply create versions of this for each count of b, and glom them together (for the case of a max count of 12):

function makeRegExp() {
  const res = [];
  for (let bs = 0; bs <= 4; bs++)
    res.push(matchTwoLetters('b', bs, 'a', 12 - bs*3));
  return new RegExp(`^(${res.join('|')})`);
}

> makeRegExp()
< "^((?=([^b]*b){0,0}[^b]*$)(?=([^a]*a){0,12}[^a]*$)|(?=([^b]*b){1,1}[^b]*$)(?=([^a]*a){0,9}[^a]*$)|(?=([^b]*b){2,2}[^b]*$)(?=([^a]*a){0,6}[^a]*$)|(?=([^b]*b){3,3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)|(?=([^b]*b){4,4}[^b]*$)(?=([^a]*a){0,0}[^a]*$))"

Now you can do the test with

makeRegExp().test("baabaaa");

For the case of length=40, the regexp is 679 characters long. A very rough benchmark shows that it executes in under a microsecond.
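
The same idea can be parameterised and cross-checked by brute force. The following is a sketch under assumptions, not part of the code above: `makeWeightedRegExp`, `weightedLength` and `countMismatches` are hypothetical names, and the trailing `[ab]*$` is an addition. It is there because the lookahead-only pattern consumes no input and does not restrict the alphabet, which may matter if the validator requires the pattern to match the entire input (worth checking for the Qt validator in use).

```javascript
// Lookahead asserting between m and n occurrences of the single letter c.
function matchLetter(c, m, n) {
  return `(?=([^${c}]*${c}){${m},${n}}[^${c}]*$)`;
}

// Hypothetical generalisation: each 'b' costs `weight`, each 'a' costs 1.
function makeWeightedRegExp(maxLength, weight) {
  const alternatives = [];
  for (let bs = 0; bs * weight <= maxLength; bs++) {
    alternatives.push(
      matchLetter('b', bs, bs) + matchLetter('a', 0, maxLength - bs * weight));
  }
  // The trailing [ab]*$ consumes the input and rejects any other character.
  return new RegExp(`^(?:${alternatives.join('|')})[ab]*$`);
}

// Plain-code reference, used only to cross-check the pattern.
function weightedLength(s, weight) {
  let total = 0;
  for (const ch of s) total += ch === 'b' ? weight : 1;
  return total;
}

// Compare regex and reference on every a/b string of up to maxChars characters.
function countMismatches(maxLength, weight, maxChars) {
  const re = makeWeightedRegExp(maxLength, weight);
  let mismatches = 0;
  let level = [''];
  for (let len = 0; len <= maxChars; len++) {
    for (const s of level) {
      if (re.test(s) !== (weightedLength(s, weight) <= maxLength)) mismatches++;
    }
    level = level.flatMap(s => [s + 'a', s + 'b']);
  }
  return mismatches;
}

console.log(countMismatches(12, 3, 13)); // → 0
```

Changing the limit to 13 or the weight of b to 4 then only requires different arguments, for example `makeWeightedRegExp(13, 4)`.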

If you want to count bytes when multibyte encoding is present, you can use this function:

function bytesLength(str) {
  var s = str.length;
  for (var i = s-1; i > -1; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) {s++;}
    else if (code > 0x7ff && code <= 0xffff) {s+=2;}
    if (code >= 0xDC00 && code <= 0xDFFF) {i--;}
  }
  return s;
}

console.log(bytesLength('敗')); // length 3
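
Since the question is ultimately about the percent-encoded file name, the encoded length can also be measured directly. This is a rough sketch that assumes JavaScript's built-in `encodeURIComponent` is close enough to the encoder that writes the file name; the set of characters left unescaped may differ on the Python side. `encodedLength` is a hypothetical helper name.

```javascript
// Length of the percent-encoded form: each non-ASCII UTF-8 byte becomes "%XX".
function encodedLength(name) {
  return encodeURIComponent(name).length;
}

console.log(encodedLength('abc'));        // → 3  (unreserved ASCII stays as-is)
console.log(encodedLength('敗'));          // → 9  (3 UTF-8 bytes, 3 characters each)
console.log(encodedLength('\u{1F600}'));  // → 12 (4 UTF-8 bytes, 3 characters each)
```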

Try using something like this:

^((a{1,3}|b){1,4}|(a{1,4}|a?b|ba){1,3}|((a{2,3}|b){2}|aaba|abaa){2})$

Example: https://regex101.com/r/yTTiEX/6

This breaks it up into the logical possibilities:

4 parts, each with a value up to 3.
3 parts, each with a value up to 4.
2 parts, each with a value up to 6.
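
A short sketch that runs this pattern against the seven examples from the question. It checks only those examples and is not an exhaustive proof that every string of weight 12 or less is accepted.

```javascript
const pattern =
  /^((a{1,3}|b){1,4}|(a{1,4}|a?b|ba){1,3}|((a{2,3}|b){2}|aaba|abaa){2})$/;

// Examples taken from the question.
const allowed = ['aa', 'a'.repeat(12), 'aaabaaab', 'abaaaaab'];
const disallowed = ['a'.repeat(13), 'bbbba', 'baaaaaaab'];

console.log(allowed.every(s => pattern.test(s)));    // → true
console.log(disallowed.some(s => pattern.test(s)));  // → false
```

Note that every alternative requires at least one character, so the empty string is not matched by this pattern.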


javascript - Regex character count, but some count for three - Stack Overflow
