I have a large (gigabyte) file where an S-expression appears, and I want to skip to the end of the S-expression. The depth of the S-expression is limited to 2, so I tried using a Python regexp (b'\\((?:[^()]|\\((?:[^()]|)*\\))*\\)'). This turned out to consume too much RAM, and digging deeper I found that the memory consumption of moderately complex regexps seems highly unpredictable when the match is large. For instance, the following four equivalent regexps all match a full ten-megabyte string. The fourth one (arguably the most complex) uses a reasonable amount of RAM (about 30 MB), whereas the others consume about one gigabyte:
import re
dot = b'[' + b''.join(b'\\x%02x' % (c,) for c in range(256)) + b']'  # character class matching any byte value
assert re.match(b'(?:.|.)*', b'a'*10000000).end() > 1000000
assert re.match(b'(?:.|.|a)*', b'a'*10000000).end() > 1000000
assert re.match(b'(?:%s|%s)*' % (dot,dot), b'a'*10000000).end() > 1000000
assert re.match(b'(?:%s|%s|a)*' % (dot,dot), b'a'*10000000).end() > 1000000
(using Python 3.12.3)
Is there a reasonable way to predict whether the performance of a Python regexp will scale? And in particular, are there design principles I can follow to avoid performance pitfalls?
(This question is specifically about the re module, because I prefer to use standard Python libraries; I suspect this would not be an issue if I switched to a third-party lib like regex.)
- regex101 has a tool that shows how a regexp is processed. I'm not sure if this is what you need. I don't think there's any easy way to analyze a regexp statically. There may be some heuristics to look for things like catastrophic backtracking. – Barmar Commented Mar 28 at 20:18
1 Answer
In Python 3.11 (or later), try using "possessive quantifiers" instead. In your example, that means replacing instances of * with *+.
Your regexp doesn't require backtracking, and a possessive quantifier tells the match engine not to bother saving any backtracking info to begin with. This can save a lot of RAM and time.
A more capable engine could deduce on its own that backtracking isn't useful in your case, but that's beyond what Python's engine can do.
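For example, a possessive version of the question's pattern might look like the sketch below (assuming Python 3.11+; I've dropped the redundant empty alternative from the inner group, and data is just a placeholder for the bytes read from your file):
import re

# Possessive quantifier (*+): match greedily and never give anything back,
# so the engine doesn't save backtracking state for these repetitions.
sexpr = re.compile(b'\\((?:[^()]|\\([^()]*+\\))*+\\)')

data = b'(a (b c) d)'  # placeholder for the file contents
m = sexpr.search(data)
if m:
    end_of_sexpr = m.end()  # position just past the closing parenthesis
Since this pattern never needs to backtrack on well-formed input, the possessive form should match the same strings while keeping memory use flat.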
Relatedly, when backtracking isn't useful, it can also help to replace non-capturing groups ((?:...)) with atomic groups ((?>...)). Atomic groups are also non-capturing, but additionally skip saving backtracking information. Possessive quantifiers are in fact syntactic sugar for the more long-winded atomic-group form.
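As a sketch of that equivalence, X*+ behaves like (?>X*), so the possessive pattern above could also be written with explicit atomic groups:
import re

# (?>...) is an atomic group: once it has matched, the engine discards
# the backtracking positions recorded inside it.
atomic = re.compile(b'\\((?>(?:[^()]|\\((?>[^()]*)\\))*)\\)')
possessive = re.compile(b'\\((?:[^()]|\\([^()]*+\\))*+\\)')

s = b'(a (b c) d)'
assert atomic.match(s).end() == possessive.match(s).end() == len(s)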