I have some raw data similar to the dataframe below:
import pandas as pd

df = pd.DataFrame([{'var1': '220-224 (Even) roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': 'site of 5 to 9 (odd) roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '16, 19 roadname3', 'var2': 'location 3', 'var3': 'area 3'}]
                  )
df
                             var1        var2    var3
0        220-224 (Even) roadname1  location 1  area 1
1  site of 5 to 9 (odd) roadname2  location 2  area 2
2                16, 19 roadname3  location 3  area 3
I would like to write a function that will split the var1 strings so that each number indicated becomes a separate row in the dataframe, with an output such as:
df = pd.DataFrame([{'var1': '220 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
{'var1': '222 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
{'var1': '224 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
{'var1': '5 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
{'var1': '7 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
{'var1': '9 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
{'var1': '16 roadname3', 'var2': 'location 3', 'var3': 'area 3'},
{'var1': '19 roadname3', 'var2': 'location 3', 'var3': 'area 3'},]
)
df
            var1        var2    var3
0  220 roadname1  location 1  area 1
1  222 roadname1  location 1  area 1
2  224 roadname1  location 1  area 1
3    5 roadname2  location 2  area 2
4    7 roadname2  location 2  area 2
5    9 roadname2  location 2  area 2
6   16 roadname3  location 3  area 3
7   19 roadname3  location 3  area 3
The strings vary somewhat in capitalization and in how the number ranges are written, and I am not sure whether there is an efficient way to do this that can handle that variation.
Asked Nov 20, 2024 at 12:45 by suzanne
2 Answers
Use a custom function to split the ranges (below is an example using regular expressions), then explode:
import re

def parse_range(s):
    # handle the "x-y" / "x to y" case with optional odd/even
    pat1 = r'^\D*(\d+)(?:-|\s+to\s+)(\d+)(?:\s*\((even|odd)\))?\s*(.*)$'
    # handle the "a,b,c" case
    pat2 = r'^\D*([\d ,]+)\s*(.*)$'
    # NB. the string is lowercased before matching, so the road name
    # in the result is lowercased as well
    m1 = re.search(pat1, s.lower())
    if m1:
        end = m1.group(4)
        if m1.group(3):  # if odd/even only generate every other value
            # NB. there is no check that odd/even actually matches the
            # parity of the numbers, but it is easy to add if needed
            return [f'{i} {end}' for i in
                    range(int(m1.group(1)), int(m1.group(2))+1, 2)]
        else:  # generate all numbers in range
            return [f'{i} {end}' for i in
                    range(int(m1.group(1)), int(m1.group(2))+1)]
    m2 = re.search(pat2, s.lower())
    if m2:  # second case, split individual numbers
        end = m2.group(2)
        return [f'{i} {end}' for i in re.findall(r'\d+', m2.group(1))]
    return s  # fallback, return the string unchanged

out = (df.assign(var1=df['var1'].map(parse_range))
         .explode('var1')
      )
Output:
            var1        var2    var3
0  220 roadname1  location 1  area 1
0  222 roadname1  location 1  area 1
0  224 roadname1  location 1  area 1
1    5 roadname2  location 2  area 2
1    7 roadname2  location 2  area 2
1    9 roadname2  location 2  area 2
2   16 roadname3  location 3  area 3
2   19 roadname3  location 3  area 3
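The comment in parse_range notes that the odd/even label is not checked against the actual numbers. A minimal sketch of how that check might be added is below. The name parse_range_checked and the compiled pattern names are hypothetical, not part of the answer. The sketch also assumes you would rather match case-insensitively with re.IGNORECASE than lowercase the string, so the road name keeps its original capitalization, and it passes explode(ignore_index=True) (pandas 1.1 or later) to get a fresh 0..n-1 index as in the question's desired output.

```python
import re

import pandas as pd

# Same two patterns as parse_range, compiled once (names are hypothetical)
RANGE_PAT = re.compile(
    r'^\D*(\d+)(?:-|\s+to\s+)(\d+)(?:\s*\((even|odd)\))?\s*(.*)$',
    flags=re.IGNORECASE,
)
LIST_PAT = re.compile(r'^\D*([\d ,]+)\s*(.*)$')


def parse_range_checked(s):
    """Expand an 'x-y', 'x to y' or 'a, b' prefix into one string per number."""
    m = RANGE_PAT.search(s)
    if m:
        start, stop = int(m.group(1)), int(m.group(2))
        parity, end = m.group(3), m.group(4)
        numbers = range(start, stop + 1)
        if parity:
            # keep only the numbers whose parity matches the label,
            # whatever the parity of the first number is
            want = 0 if parity.lower() == 'even' else 1
            numbers = [i for i in numbers if i % 2 == want]
        return [f'{i} {end}' for i in numbers]
    m = LIST_PAT.search(s)
    if m:
        found = re.findall(r'\d+', m.group(1))
        if found:  # only split when at least one number was found
            return [f'{i} {m.group(2)}' for i in found]
    return s  # fallback, return the string unchanged


df = pd.DataFrame([
    {'var1': '220-224 (Even) roadname1', 'var2': 'location 1', 'var3': 'area 1'},
    {'var1': 'site of 5 to 9 (odd) roadname2', 'var2': 'location 2', 'var3': 'area 2'},
    {'var1': '16, 19 roadname3', 'var2': 'location 3', 'var3': 'area 3'},
])

out = (df.assign(var1=df['var1'].map(parse_range_checked))
         .explode('var1', ignore_index=True))
print(out)
```

With this variant, a label that disagrees with the first number (for example '221-224 (Even) High St') yields only 222 and 224 instead of stepping by 2 from 221.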
A possible solution, whose steps are:
First, it uses str.extractall to extract all digit sequences from var1, resulting in a dataframe d with a MultiIndex (original row label, match number).
Then, it employs xs to select the first two matches (match numbers 0 and 1 on the second index level) from d, concatenates them along the columns using pd.concat, and converts them to integers.
Next, it applies a lambda function to create a range from the first to the second number, incrementing by 2, and assigns this range to a new 'aux' column.
The dataframe is then exploded with explode to expand each range into separate rows.
Subsequently, it uses pipe to overwrite the 'aux' column by concatenating the string representation of the range values with the last word extracted from var1 using str.extract with expand=False.
Finally, the dataframe is reset to a default integer index with reset_index, dropping the old index.
d = df['var1'].str.extractall(r'(\d+)')

(df.assign(
    aux = pd.concat([d.xs(i, level=1) for i in [0, 1]], axis=1)
    .astype(int)
    .apply(lambda x: range(x.iloc[0], x.iloc[1] + 1, 2), axis=1))
 .explode('aux').pipe(
     lambda x: x.assign(
         aux=x['aux'].astype(str) + ' ' + x['var1'].str.extract(r'(\w+)$', expand=False)))
 .reset_index(drop=True))
Output:
                             var1        var2    var3            aux
0        220-224 (Even) roadname1  location 1  area 1  220 roadname1
1        220-224 (Even) roadname1  location 1  area 1  222 roadname1
2        220-224 (Even) roadname1  location 1  area 1  224 roadname1
3  site of 5 to 9 (odd) roadname2  location 2  area 2    5 roadname2
4  site of 5 to 9 (odd) roadname2  location 2  area 2    7 roadname2
5  site of 5 to 9 (odd) roadname2  location 2  area 2    9 roadname2
6                16, 19 roadname3  location 3  area 3   16 roadname3
7                16, 19 roadname3  location 3  area 3   18 roadname3
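To make the intermediate objects in this answer easier to follow, here is a small sketch that prints them for the sample data. It only uses the calls already shown in the answer; the variable names first_two and ranges are hypothetical. It also shows where the 18 in the last output row comes from: every row is treated as a range with step 2, including the comma-separated one.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['220-224 (Even) roadname1',
                            'site of 5 to 9 (odd) roadname2',
                            '16, 19 roadname3']})

# One row per digit run; the index is (original row label, match number).
# The digit at the end of each road name is captured too (match number 2).
d = df['var1'].str.extractall(r'(\d+)')
print(d)

# Keep match numbers 0 and 1 for every original row, side by side
first_two = pd.concat([d.xs(i, level=1) for i in [0, 1]], axis=1).astype(int)
print(first_two)

# The answer then builds range(first, second + 1, 2) for every row
ranges = [list(range(a, b + 1, 2)) for a, b in first_two.to_numpy()]
print(ranges)
```

The last list is [16, 18], because range(16, 20, 2) is applied to the '16, 19' row as if it were a range.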
16, 19 should yield those two exact values; if you had 9, 5, 12 then this should be those 3 specific values. Check my answer for an example. – mozway Commented Nov 20, 2024 at 13:40
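One way the second approach might be adapted so that comma-separated numbers are kept as given is sketched below. This is not the answerer's code: the helper expand and the two compiled patterns are hypothetical, and the sketch keeps that answer's assumption that the road name is the last whitespace-separated word of var1. Numbers are only expanded into a range when exactly two of them are joined by '-' or 'to', and the step of 2 is only used when an (even)/(odd) label is present.

```python
import re

import pandas as pd

RANGE_SEP = re.compile(r'\d\s*(?:-|to)\s*\d', flags=re.IGNORECASE)
PARITY = re.compile(r'\((?:even|odd)\)', flags=re.IGNORECASE)


def expand(s):
    """Hypothetical helper: one 'number road' string per house number in s."""
    # Assumption: the road name is the last word, everything before it
    # describes the numbers
    head, _, road = s.rpartition(' ')
    nums = [int(n) for n in re.findall(r'\d+', head)]
    if len(nums) == 2 and RANGE_SEP.search(head):
        step = 2 if PARITY.search(head) else 1
        nums = list(range(nums[0], nums[1] + 1, step))
    # If no number was found, keep the original string as a single item
    return [f'{n} {road}' for n in nums] or [s]


df = pd.DataFrame([
    {'var1': '220-224 (Even) roadname1', 'var2': 'location 1', 'var3': 'area 1'},
    {'var1': 'site of 5 to 9 (odd) roadname2', 'var2': 'location 2', 'var3': 'area 2'},
    {'var1': '16, 19 roadname3', 'var2': 'location 3', 'var3': 'area 3'},
])

out = df.assign(aux=df['var1'].map(expand)).explode('aux', ignore_index=True)
print(out)
```

Here the '16, 19' row produces 16 and 19, and a string such as '9, 5, 12 roadname4' would produce those three values in the order written.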