python - How to skip, if starts with, but match other strings

I want to match and substitute within strings as shown in the example below, but skip any string that starts with test or !!. I used a negative lookahead to skip the unwanted strings, but the (Street|,)(?=\d) part, which should match Street and every comma that is followed by a digit so that UK/ can be appended after it, is not working as expected.

import re
input = [ 'Street1-2,4,6,8-10', 
          '!! Street4/31/2',
          'test Street4' ]
pattern = r'(^(?!test\s|!!\s).*(Street|,)(?=\d))'
output = [re.sub(pattern, r'\g<1>UK/', line) for line in input ]

Actual output:

['Street1-2,4,6,UK/8-10', '!! Street4/31/2', 'test Street4']

Expected output:

['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4']
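
One way to see why only the last comma is affected is to list what the pattern actually matches. The following is a minimal sketch using only the standard re module; the variable names line and matches are illustrative and not part of the original code.

```python
import re

line = 'Street1-2,4,6,8-10'
pattern = r'(^(?!test\s|!!\s).*(Street|,)(?=\d))'

# The pattern is anchored with ^, so it can match at most once per string
# (no re.MULTILINE here). The greedy .* then runs as far right as possible
# and backtracks only to the LAST 'Street' or ',' that precedes a digit.
matches = [m.group(0) for m in re.finditer(pattern, line)]

print(matches)
# ['Street1-2,4,6,']
```

Because there is a single match ending at the last comma, re.sub inserts UK/ only once, which is the actual output shown above.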

asked Mar 19 at 7:04 by Sunil Bojanapally; edited Mar 28 at 15:25 by Arvind Kumar Avinash

5 maybe simpler would be to use ... if input.startswith(('test', '!!')) else ... – furas Commented Mar 19 at 7:55
Does a valid input always start with Street1? What should happen if input is abc-3,4,5,6 ? – anubhava Commented Mar 19 at 10:25

6 Answers

You could change the pattern to use 2 capture groups, and then use a callback with re.sub.

The callback checks if there is a group 1 value. If there is, use it in the replacement, else use group 2 followed by UK/

^((?:!!|test)\s.*)|(Street|,)(?=\d)

The regex matches

^((?:!!|test)\s.*) Capture either !! or test at the start of the string followed by a whitespace char and then the rest of the line in group 1
| Or
(Street|,)(?=\d) Capture either Street or , in group 2 while asserting a digit to the right

See a regex101 demo

import re

lst = ['Street1-2,4,6,8-10',
       '!! Street4/31/2',
       'test Street4']

pattern = r'^((?:!!|test)\s.*)|(Street|,)(?=\d)'

output = [re.sub(pattern, lambda m: m.group(1) or m.group(2) + 'UK/', line) for line in lst]

print(output)

Output

['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4']
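
To make the callback logic easier to follow, here is a small sketch that lists which capture group is set for each match. The helper name describe is hypothetical and only for illustration; the pattern is the one from this answer.

```python
import re

pattern = r'^((?:!!|test)\s.*)|(Street|,)(?=\d)'

def describe(line):
    # Return (group 1, group 2) for every match found in the line.
    # Exactly one of the two groups is set per match; the other is None.
    return [(m.group(1), m.group(2)) for m in re.finditer(pattern, line)]

normal = describe('Street1-2,4,6,8-10')
skipped = describe('test Street4')

print(normal)
# [(None, 'Street'), (None, ','), (None, ','), (None, ',')]
print(skipped)
# [('test Street4', None)]
```

In the lambda, + binds more tightly than or, so the expression reads as m.group(1) or (m.group(2) + 'UK/'). When group 1 is set, the or short-circuits and the whole line is put back unchanged, so group 2 (which is None in that case) is never concatenated.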

Here is one robust solution using Python's regex module, which allows us to use PCRE features such as (*SKIP)(*F) and \G.

^(?:!!|test)\h.*(*SKIP)(*F)|(\bStreet|\G(?!^)[\d-]*,)(?=\d)

RegEx Demo

RegEx Details:

^: Start
(?:!!|test): Match !! or test
\h.*: Match a horizontal whitespace followed by any text till end of line
(*SKIP)(*F): Skip these matches altogether
|: OR
(: Start capture group #1
- \b: Match word boundary
- Street: Match Street
- |: OR
- \G: Start from end position of the previous match
- (?!^): Make sure we are NOT at the start position
- [\d-]*: Match 0 or more of digit or hyphen characters
- ,: Match a comma
): Close capture group #1
(?=\d): Lookahead to assert that we have a digit ahead

Code

import regex
arr = [ 'Street1-2,4,6,8-10', '!! Street4/31/2', 'test Street4', '1,2,3' ]

rx = regex.compile(r'^(?:!!|test)\h.*(*SKIP)(*F)|(\bStreet|\G(?!^)[\d-]*,)(?=\d)')
output = [rx.sub(r'\1UK/', s) for s in arr]

print(output)

Output:

['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4', '1,2,3']

If you are fine with using the regex library, use the first approach. Alternatively, use the second approach at the end of this answer to stick with the built-in re module and move the check for unwanted lines into Python.

First approach

Try matching:

^(?:test|!!)\s.*(*SKIP)(*FAIL)|(?<=Street|,)(?=\d)

and replacing with:

UK/

See: regex101

import regex as re
input = [ 'Street1-2,4,6,8-10', 
          '!! Street4/31/2',
          'test Street4' ]
pattern = r'^(?:test|!!)\s.*(*SKIP)(*FAIL)|(?<=Street|,)(?=\d)'
output = [re.sub(pattern, r'UK/', line) for line in input ]

print(output)

Explanation

MATCH:

^(?:test|!!)\s.*: Match all strings that you do not want...
(*SKIP)(*FAIL): ... and disregard them, making use of this technique.
|: For all valid strings do:
(?<=Street|,): Look for a point behind "Street" or "," ...
(?=\d): ... but ahead of a digit.

REPLACE:

UK/: Replace with "UK/"

Second approach

Alternatively, move the check for unwanted starts into Python, as suggested by @furas:

import re
input = [ 'Street1-2,4,6,8-10', 
          '!! Street4/31/2',
          'test Street4']
unwanted_starts=("test","!!")
pattern = r'(?:(?<=Street)|(?<=,))(?=\d)'
output = [
    re.sub(pattern, r'UK/', line) 
    if not line.startswith(unwanted_starts) 
    else line for line in input
    ]

print(output)

Note that (?<=Street|,) becomes (?:(?<=Street)|(?<=,)) because the built-in re module only supports fixed-width lookbehinds, and the two alternatives Street and , have different lengths.
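
The difference can be checked directly. This is a minimal sketch assuming a current CPython release, where re rejects a lookbehind whose alternatives differ in length; the helper name compiles is hypothetical and only for illustration.

```python
import re

def compiles(pattern):
    # Hypothetical helper: report whether the built-in re module accepts the pattern.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

variable_width = r'(?<=Street|,)(?=\d)'          # alternatives of length 6 and 1
fixed_width = r'(?:(?<=Street)|(?<=,))(?=\d)'    # each lookbehind has one fixed length

print(compiles(variable_width))
# False
print(compiles(fixed_width))
# True

result = re.sub(fixed_width, 'UK/', 'Street1-2,4')
print(result)
# StreetUK/1-2,UK/4
```

Both patterns are zero-width, so re.sub inserts UK/ at each matching position without consuming any characters.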

Here is an approach that uses only the built-in re module. First, match every occurrence of Street, together with the numbers that follow it, that is not directly preceded by test or !! and a space. Then use a regex callback to prefix each street number (or range of numbers) in that match with UK/.

import re
inp = ['Street1-2,4,6,8-10', 
       '!! Street4/31/2',
       'test Street4']

def repl(m):
    return re.sub(r'(\d+(?:-\d+)?)', r'UK/\1', m.group(0))

output = [re.sub(r'(?<!\btest\b )(?<!!! )Street(?:\d+(?:-\d+)?)(?:[,\/](?:\d+(?:-\d+)?)*)*', repl, x) for x in inp]
print(output)

# ['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4']
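
The one-line pattern above is dense, so here is the same idea written with re.VERBOSE and comments. This is an illustrative sketch: the names street_numbers and add_prefix are hypothetical, and the literal spaces are written as [ ] because verbose mode ignores unescaped whitespace outside character classes.

```python
import re

street_numbers = re.compile(r'''
    (?<!\btest\b[ ])          # not directly preceded by the word test and a space
    (?<!!![ ])                # not directly preceded by !! and a space
    Street
    \d+(?:-\d+)?              # first number or range, e.g. 1-2
    (?:
        [,/]                  # separator
        (?:\d+(?:-\d+)?)*     # following number or range
    )*
''', re.VERBOSE)

def add_prefix(m):
    # Prefix every number or range inside the matched text with UK/.
    return re.sub(r'\d+(?:-\d+)?', r'UK/\g<0>', m.group(0))

lines = ['Street1-2,4,6,8-10', '!! Street4/31/2', 'test Street4']
output = [street_numbers.sub(add_prefix, line) for line in lines]

print(output)
# ['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4']
```

Since the lookbehinds only inspect the characters immediately before Street, a line such as test foo Street4 would still be rewritten; checking the start of the line with str.startswith would be needed to exclude it.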

import re

input = [
    'Street1-2,4,6,8-10',
    '!! Street4/31/2',
    'test Street4'
]

# Pattern to match 'Street' or ',' followed by a digit, excluding lines starting with 'test' or '!!'
pattern = r'^(?!test\s|!!\s).*?(Street|,)(?=\d)'

def substitute_line(line):
    while re.search(pattern, line):
        line = re.sub(pattern, lambda m: m.group(0) + 'UK/', line)
    return line

output = [substitute_line(line) for line in input]

print(output)

# ['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4']

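I tried the pattern below. It gives the expected result for the first string, but note that the lookahead is tested at the position of Street or the comma, not at the start of the line, so it does not skip the lines that begin with test or !!.

import re

input = [ 'Street1-2,4,6,8-10',
          '!! Street4/31/2',
          'test Street4' ]

# My solution
pattern = r'(?!^test\s|!!\s)(Street|,)(?=\d)'
# Your solution (kept for comparison; uncommenting it overrides the pattern above)
# pattern = r'(^(?!test\s|!!\s).*(Street|,)(?=\d))'

output = [re.sub(pattern, r'\1UK/', line) for line in input ]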

print(output)
