
python - Predicting `re` regexp memory consumption - Stack Overflow


I have a large (gigabyte) file where an S-expression appears, and I want to skip to the end of the S-expression. The depth of the S-expression is limited to 2, so I tried using a Python regexp (b'\\((?:[^()]|\\((?:[^()]|)*\\))*\\)'). This turned out to consume too much RAM, and digging deeper I found that memory consumption of moderately complex regexps seems highly unpredictable if the match is large. For instance, the following four equivalent regexps all match a full ten megabyte string. The fourth one (arguably the most complex one) uses a reasonable amount (30M) of RAM, whereas the others consume one gigabyte:

import re

# Character class equivalent to '.' with re.DOTALL: it explicitly lists
# all 256 possible byte values.
dot = b'[' + b''.join(b'\\x%02x' % (c,) for c in range(256)) + b']'

assert re.match(b'(?:.|.)*', b'a'*10000000).end() > 1000000
assert re.match(b'(?:.|.|a)*', b'a'*10000000).end() > 1000000
assert re.match(b'(?:%s|%s)*' % (dot, dot), b'a'*10000000).end() > 1000000
assert re.match(b'(?:%s|%s|a)*' % (dot, dot), b'a'*10000000).end() > 1000000

(using Python 3.12.3)
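(One way to reproduce these numbers: run each pattern in a fresh child process and read that process's peak RSS afterwards. A minimal sketch, Unix-only, assuming ru_maxrss is reported in kilobytes as on Linux; the peak_rss_kb helper name is just for illustration:)

import subprocess
import sys

def peak_rss_kb(pattern: bytes) -> int:
    # Run one match in a fresh interpreter and report that process's
    # peak RSS (Unix-only; on Linux ru_maxrss is in kilobytes).
    child = (
        "import re, resource, sys\n"
        "re.match(sys.argv[1].encode('latin-1'), b'a' * 10000000)\n"
        "print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", child, pattern.decode('latin-1')],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

for pat in (b'(?:.|.)*', b'(?:.|.|a)*'):
    print(pat, peak_rss_kb(pat), "KB")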

Is there a reasonable way to predict whether a Python regexp's performance will scale? And in particular, are there design principles I can follow to avoid performance pitfalls?

(This question is specifically about the re module, because I prefer to use standard Python libraries; I suspect this would not be an issue if I switched to a third-party lib like regex)


asked Mar 28 at 20:05 by Erik Carstensen
  • regex101 has a tool that shows how a regexp is processed. I'm not sure if this is what you need. I don't think there's any easy way to analyze a regexp statically. There may be some heuristics to look for things like catastrophic backtracking. – Barmar, Mar 28 at 20:18

1 Answer


In Python 3.11 (or later), try using "possessive quantifiers" instead. Meaning, in your example, replacing instances of * with *+.
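For example, the depth-2 S-expression pattern from the question could be rewritten this way (a sketch assuming Python 3.11+; the SEXPR name is just for illustration, and the redundant empty alternative from the original pattern is dropped):

import re

# Each * made possessive (*+), so the engine discards backtracking
# state as it goes. Requires Python 3.11 or later.
SEXPR = re.compile(rb'\((?:[^()]|\((?:[^()])*+\))*+\)')

m = SEXPR.match(b'(a (b c) d)')
assert m is not None and m.end() == 11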

Your regexp doesn't require backtracking, and a possessive quantifier tells the match engine not to bother saving any backtracking info to begin with. This can save a lot of RAM and time.

A more capable engine could deduce on its own that backtracking isn't useful in your case, but that's beyond what Python's engine can do.

Relatedly, when backtracking isn't useful, it can also help to replace non-capturing groups ((?:...)) with atomic groups ((?>...)). Atomic groups are non-capturing too, but additionally skip saving backtracking information. Possessive quantifiers are in fact syntactic sugar for a more long-winded atomic-group spelling.
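A small demonstration of the equivalence (again assuming Python 3.11+; the p1/p2 names are just for illustration):

import re

s = b'abc def'

# A possessive star and its atomic-group spelling match identically:
p1 = re.compile(rb'(?:[^()])*+')        # possessive quantifier
p2 = re.compile(rb'(?>(?:[^()])*)')     # same thing as an atomic group

assert p1.match(s).end() == p2.match(s).end() == len(s)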
