I want to match and substitute in strings as shown in the example below, but skip strings that start with test or !!. I used a negative lookahead to skip the unwanted strings, but the (Street|,)(?=\d) part, which should match Street or a comma before a digit so that UK/ can be appended after group 1, is not working as expected.
import re
input = [ 'Street1-2,4,6,8-10',
'!! Street4/31/2',
'test Street4' ]
pattern = r'(^(?!test\s|!!\s).*(Street|,)(?=\d))'
output = [re.sub(pattern, r'\g<1>UK/', line) for line in input ]
Actual output:
['Street1-2,4,6,UK/8-10', '!! Street4/31/2', 'test Street4']
Expected output:
['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4']
6 Answers
You could change the pattern to use 2 capture groups, and then use a callback with re.sub.
The callback checks if there is a group 1 value. If there is, use it in the replacement, else use group 2 followed by UK/
^((?:!!|test)\s.*)|(Street|,)(?=\d)
The regex matches:
^((?:!!|test)\s.*) : Capture either !! or test at the start of the string, followed by a whitespace char and then the rest of the line, in group 1
| : Or
(Street|,)(?=\d) : Capture either Street or , in group 2 while asserting a digit to the right
See a regex101 demo
import re
lst = ['Street1-2,4,6,8-10',
'!! Street4/31/2',
'test Street4']
pattern = r'^((?:!!|test)\s.*)|(Street|,)(?=\d)'
output = [re.sub(pattern, lambda m: m.group(1) or m.group(2) + 'UK/', line) for line in lst]
print(output)
Output
['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4']
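To make the callback's branching easier to follow, here is a minimal sketch that spells the lambda out as a named function. The names PATTERN, add_prefix and rewrite are illustrative and not part of the answer above; the pattern itself is unchanged.

```python
import re

# Group 1: a whole line starting with "!! " or "test " (kept unchanged).
# Group 2: "Street" or "," directly before a digit (gets "UK/" appended).
PATTERN = re.compile(r'^((?:!!|test)\s.*)|(Street|,)(?=\d)')


def add_prefix(m: re.Match) -> str:
    """Return the replacement text for one match."""
    if m.group(1) is not None:
        # Excluded line: put the matched text back untouched.
        return m.group(1)
    # Wanted match: keep "Street" or "," and append the prefix.
    return m.group(2) + 'UK/'


def rewrite(line: str) -> str:
    """Apply the substitution to a single line."""
    return PATTERN.sub(add_prefix, line)


if __name__ == '__main__':
    for line in ['Street1-2,4,6,8-10', '!! Street4/31/2', 'test Street4']:
        print(rewrite(line))
```

Because an excluded line is consumed entirely by the first alternative, the second alternative never gets a chance to match inside it.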
Here is one robust solution using Python's third-party regex module, which allows us to use PCRE features such as (*SKIP)(*F) and \G.
^(?:!!|test)\h.*(*SKIP)(*F)|(\bStreet|\G(?!^)[\d-]*,)(?=\d)
RegEx Demo
RegEx Details:
^ : Start
(?:!!|test) : Match !! or test
\h.* : Match a horizontal whitespace followed by any text till end of line
(*SKIP)(*F) : Skip these matches altogether
| : OR
( : Start capture group #1
\b : Match word boundary
Street : Match Street
| : OR
\G : Start from end position of the previous match
(?!^) : Make sure we are NOT at the start position
[\d-]* : Match 0 or more of digit or hyphen characters
, : Match a comma
) : Close capture group #1
(?=\d) : Lookahead to assert that we have a digit ahead
Code
import regex
arr = [ 'Street1-2,4,6,8-10', '!! Street4/31/2', 'test Street4', '1,2,3' ]
rx = regex.compile(r'^(?:!!|test)\h.*(*SKIP)(*F)|(\bStreet|\G(?!^)[\d-]*,)(?=\d)')
output = [rx.sub(r'\1UK/', s) for s in arr]
print(output)
Output:
['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4', '1,2,3']
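The fourth input, '1,2,3', shows what the \G branch is for: a comma only counts when it continues a run that began at Street. As a rough comparison, the sketch below uses only the standard re module (which supports neither \G nor (*SKIP)(*F)) with the simpler (Street|,)(?=\d) alternation, and that version rewrites the Street-less line as well. The names simple and prefix_all are illustrative.

```python
import re

# Stdlib-only pattern: any comma directly before a digit matches,
# whether or not "Street" came earlier in the line.
simple = re.compile(r'^((?:!!|test)\s.*)|(Street|,)(?=\d)')


def prefix_all(line: str) -> str:
    """Append "UK/" after every Street or comma that precedes a digit."""
    return simple.sub(lambda m: m.group(1) or m.group(2) + 'UK/', line)


print(prefix_all('Street1-2,4'))  # StreetUK/1-2,UK/4
print(prefix_all('1,2,3'))        # 1,UK/2,UK/3 (no Street, yet the commas are rewritten)
```

Whether that difference matters depends on whether lines without Street can occur in the real input.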
If you are fine with using the regex library, use the first approach. Alternatively, use the second approach at the end of this answer to stick with the built-in re module and move the check for unwanted lines to Python.
First approach
Try matching:
^(?:test|!!)\s.*(*SKIP)(*FAIL)|(?<=Street|,)(?=\d)
and replacing with:
UK/
See: regex101
import regex as re
input = [ 'Street1-2,4,6,8-10',
'!! Street4/31/2',
'test Street4' ]
pattern = r'^(?:test|!!)\s.*(*SKIP)(*FAIL)|(?<=Street|,)(?=\d)'
output = [re.sub(pattern, r'UK/', line) for line in input ]
print(output)
Explanation
MATCH:
^(?:test|!!)\s.* : Match all strings that you do not want...
(*SKIP)(*FAIL) : ... and disregard them, making use of this technique.
| : For all valid strings do:
(?<=Street|,) : Look for a point behind "Street" or "," ...
(?=\d) : ... but ahead of a digit.
REPLACE:
UK/ : Replace with "UK/"
Second approach
Alternatively, move the check for unwanted starts to Python, as suggested by @furas:
import re
input = [ 'Street1-2,4,6,8-10',
'!! Street4/31/2',
'test Street4']
unwanted_starts=("test","!!")
pattern = r'(?:(?<=Street)|(?<=,))(?=\d)'
output = [
re.sub(pattern, r'UK/', line)
if not line.startswith(unwanted_starts)
else line for line in input
]
print(output)
Note that (?<=Street|,) becomes (?:(?<=Street)|(?<=,)) because the standard re module only supports fixed-width lookbehinds.
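The fixed-width restriction can be checked directly. In this sketch, which uses only the standard library, compiling the combined lookbehind raises re.error, while the split form compiles and gives the expected result. The variable names combined_ok and split are illustrative.

```python
import re

# Alternatives of different lengths ("Street" is 6 characters, "," is 1)
# are rejected inside a single lookbehind by the standard re module.
try:
    re.compile(r'(?<=Street|,)(?=\d)')
    combined_ok = True
except re.error:
    combined_ok = False

# One fixed-width lookbehind per alternative is accepted.
split = re.compile(r'(?:(?<=Street)|(?<=,))(?=\d)')

print(combined_ok)                              # False
print(split.sub('UK/', 'Street1-2,4,6,8-10'))   # StreetUK/1-2,UK/4,UK/6,UK/8-10
```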
Here is a robust approach. We first match every occurrence of Street that is not immediately preceded by test or !!. Then we use a regex callback to selectively replace each street number (or range of numbers) with the UK prefix followed by the same number/range.
import re
inp = ['Street1-2,4,6,8-10',
'!! Street4/31/2',
'test Street4']
def repl(m):
    # Prefix every number or number range inside the matched text
    return re.sub(r'(\d+(?:-\d+)?)', r'UK/\1', m.group(0))
output = [re.sub(r'(?<!\btest\b )(?<!!! )Street(?:\d+(?:-\d+)?)(?:[,\/](?:\d+(?:-\d+)?)*)*', repl, x) for x in inp]
print(output)
# ['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4']
import re
input = [
'Street1-2,4,6,8-10',
'!! Street4/31/2',
'test Street4'
]
# Pattern to match 'Street' or ',' followed by a digit, excluding lines starting with 'test' or '!!'
pattern = r'^(?!test\s|!!\s).*?(Street|,)(?=\d)'
def substitute_line(line):
    # Repeat until no un-prefixed 'Street' or ',' before a digit remains
    while re.search(pattern, line):
        line = re.sub(pattern, lambda m: m.group(0) + 'UK/', line)
    return line
output = [substitute_line(line) for line in input]
print(output)
# ['StreetUK/1-2,UK/4,UK/6,UK/8-10', '!! Street4/31/2', 'test Street4']
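The loop is needed because the pattern is anchored with ^, so each re.sub call can replace at most one occurrence; the lazy .*? makes it the leftmost one that has not been prefixed yet. The sketch below records the line after each pass to show this. trace is a hypothetical helper added for illustration only.

```python
import re

pattern = re.compile(r'^(?!test\s|!!\s).*?(Street|,)(?=\d)')


def trace(line: str) -> list[str]:
    """Return the line as it looks after each substitution pass."""
    steps = []
    while pattern.search(line):
        # The match must start at ^, so only one replacement happens per pass.
        line = pattern.sub(lambda m: m.group(0) + 'UK/', line)
        steps.append(line)
    return steps


for step in trace('Street1-2,4,6,8-10'):
    print(step)
# StreetUK/1-2,4,6,8-10
# StreetUK/1-2,UK/4,6,8-10
# StreetUK/1-2,UK/4,UK/6,8-10
# StreetUK/1-2,UK/4,UK/6,UK/8-10
```

The loop ends because every inserted UK/ puts a letter after the Street or comma it follows, so that position no longer satisfies the (?=\d) lookahead.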
I have just tried this solution, and it works with your input.
import re
input = [ 'Street1-2,4,6,8-10',
'!! Street4/31/2',
'test Street4' ]
# My solution
pattern = r'(?!^test\s|!!\s)(Street|,)(?=\d)'
# Your solution
pattern = r'(^(?!test\s|!!\s).*(Street|,)(?=\d))'
output = [re.sub(pattern, r'\1UK/', line) for line in input ]
print(output)
... if input.startswith(('test', '!!')) else ...
– furas Commented Mar 19 at 7:55
Street1? What should happen if input is abc-3,4,5,6?
– anubhava Commented Mar 19 at 10:25