python - Expand a dataframe row into multiple rows based on string conditions - Stack Overflow

I have some raw data similar to the dataframe below:


import pandas as pd

df = pd.DataFrame([{'var1': '220-224 (Even) roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': 'site of 5 to 9 (odd) roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '16, 19 roadname3', 'var2': 'location 3', 'var3': 'area 3'}])
df

                             var1        var2    var3
0        220-224 (Even) roadname1  location 1  area 1
1  site of 5 to 9 (odd) roadname2  location 2  area 2
2                16, 19 roadname3  location 3  area 3

I would like to write a function that will split the var1 strings so that each number indicated becomes a separate row in the dataframe, with an output such as:


df = pd.DataFrame([{'var1': '220 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': '222 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': '224 roadname1', 'var2': 'location 1', 'var3': 'area 1'},
                   {'var1': '5 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '7 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '9 roadname2', 'var2': 'location 2', 'var3': 'area 2'},
                   {'var1': '16 roadname3', 'var2': 'location 3', 'var3': 'area 3'},
                   {'var1': '19 roadname3', 'var2': 'location 3', 'var3': 'area 3'}])
df

            var1        var2    var3
0  220 roadname1  location 1  area 1
1  222 roadname1  location 1  area 1
2  224 roadname1  location 1  area 1
3    5 roadname2  location 2  area 2
4    7 roadname2  location 2  area 2
5    9 roadname2  location 2  area 2
6   16 roadname3  location 3  area 3
7   19 roadname3  location 3  area 3

The string formats vary somewhat, both in capitalization and in how number ranges are written, and I am not sure whether there is an efficient way to do this that can handle the variation.

asked Nov 20, 2024 at 12:45 by suzanne
  • Could you please explain why the last row of the expected output is 19 instead of 18? Should not the numbers be incremented by 2? – PaulS Commented Nov 20, 2024 at 13:37
  • @PaulS presumably 16, 19 should yield those two exact values; if you had 9, 5, 12 then this should be those 3 specific values. Check my answer for an example. – mozway Commented Nov 20, 2024 at 13:40
  • It is unclear whether that is the right interpretation, @mozway. For instance, in my country, the doors on one side of a street are odd-numbered, while the doors on the other side are even-numbered. – PaulS Commented Nov 20, 2024 at 13:43

2 Answers

Use a custom function to split the ranges (below is an example using regular expressions), then explode:

import re

def parse_range(s):
    # handle the "x-y" / "x to y" case with optional odd/even
    pat1 = r'^\D*(\d+)(?:-|\s+to\s+)(\d+)(?:\s*\((even|odd)\))?\s*(.*)$'
    # handle the "a,b,c" case
    pat2 = r'^\D*([\d ,]+)\s*(.*)$'
    # match case-insensitively so the road name keeps its original case
    m1 = re.search(pat1, s, flags=re.IGNORECASE)
    if m1:
        end = m1.group(4)
        if m1.group(3): # if odd/even only generate every other value
            # NB. there is no check that odd/even actually matches the
            # parity of the numbers, but it is easy to add if needed
            return [f'{i} {end}' for i in
                    range(int(m1.group(1)), int(m1.group(2))+1, 2)]
        else: # generate all numbers in range
            return [f'{i} {end}' for i in
                    range(int(m1.group(1)), int(m1.group(2))+1)]
    m2 = re.search(pat2, s, flags=re.IGNORECASE)
    if m2: # second case, split individual numbers
        end = m2.group(2)
        return [f'{i} {end}' for i in re.findall(r'\d+', m2.group(1))]
    return s # fallback, return the string unchanged
    
out = (df.assign(var1=df['var1'].map(parse_range))
         .explode('var1')
      )

Output:

            var1        var2    var3
0  220 roadname1  location 1  area 1
0  222 roadname1  location 1  area 1
0  224 roadname1  location 1  area 1
1    5 roadname2  location 2  area 2
1    7 roadname2  location 2  area 2
1    9 roadname2  location 2  area 2
2   16 roadname3  location 3  area 3
2   19 roadname3  location 3  area 3
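
The comment in parse_range notes that the odd/even marker is not checked against the numbers. One way to add that check, shown here only as a sketch, is to filter the full range by parity instead of stepping by 2 from the start value. The helper name expand_range and its signature are hypothetical and not part of the answer above.

```python
def expand_range(start, stop, parity=None):
    """Expand start..stop inclusive (hypothetical helper).

    If parity is 'odd' or 'even' (any capitalization), keep only the
    numbers with that parity; otherwise keep every number in the range.
    """
    numbers = range(start, stop + 1)
    if parity is None:
        return list(numbers)
    wanted = 0 if parity.lower() == 'even' else 1
    return [n for n in numbers if n % 2 == wanted]


print(expand_range(220, 224, 'Even'))
print(expand_range(5, 9, 'odd'))
print(expand_range(5, 10, 'even'))  # start is odd, but only even numbers are kept
```

Filtering rather than stepping means a mismatched start value (for instance "5 to 10 (even)") still produces only numbers of the stated parity. Whether that is the desired behaviour for such input is an assumption.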

A possible solution, whose steps are:

  • First, it uses str.extractall to extract all digit sequences from var1, resulting in a dataframe d with a MultiIndex.

  • Then, it employs xs to select the first two matches of each row (match numbers 0 and 1 on the second index level of d), concatenates them along the columns using pd.concat, and converts them to integers.

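  • Next, it applies a lambda function to create a range from the first to the second number, incrementing by 2, and assigns this range to a new 'aux' column. Because the step is always 2, a comma-separated pair such as 16, 19 is treated as a range and yields 16 and 18.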

  • The dataframe is then exploded with explode to expand each range into separate rows.

  • Subsequently, it uses pipe to overwrite the 'aux' column by concatenating the string representation of the range values with the last word of var1, extracted using str.extract with expand=False.

  • Finally, the dataframe is reset to a default integer index with reset_index, dropping the old index.
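
The first two steps can be inspected in isolation, which may make the MultiIndex easier to follow. This is only an illustrative sketch: the names first, second and bounds, and the column labels start and stop, are introduced here and do not appear in the answer's code.

```python
import pandas as pd

df = pd.DataFrame({'var1': ['220-224 (Even) roadname1',
                            'site of 5 to 9 (odd) roadname2',
                            '16, 19 roadname3']})

# One row per digit sequence found; the index has two levels:
# the original row label and the match number within that row.
d = df['var1'].str.extractall(r'(\d+)')

# xs(i, level=1) keeps match number i of every row and drops that level.
first = d.xs(0, level=1)[0].astype(int)
second = d.xs(1, level=1)[0].astype(int)

# Side by side, indexed by the original row labels.
bounds = pd.concat([first, second], axis=1, keys=['start', 'stop'])
print(bounds)
```

Note that the digit at the end of each road name is also extracted (as the third match), but only matches 0 and 1 are used.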

d = df['var1'].str.extractall(r'(\d+)')
(df.assign(
    aux = pd.concat([d.xs(i, level=1) for i in [0, 1]], axis=1)
    .astype(int)
    .apply(lambda x: range(x.iloc[0], x.iloc[1] + 1, 2), axis=1))
 .explode('aux').pipe(
     lambda x: x.assign(
         aux=x['aux'].astype(str) + ' ' + x['var1'].str.extract(r'(\w+)$', expand=False)))
 .reset_index(drop=True))

Output:

                             var1        var2    var3            aux
0        220-224 (Even) roadname1  location 1  area 1  220 roadname1
1        220-224 (Even) roadname1  location 1  area 1  222 roadname1
2        220-224 (Even) roadname1  location 1  area 1  224 roadname1
3  site of 5 to 9 (odd) roadname2  location 2  area 2    5 roadname2
4  site of 5 to 9 (odd) roadname2  location 2  area 2    7 roadname2
5  site of 5 to 9 (odd) roadname2  location 2  area 2    9 roadname2
6                16, 19 roadname3  location 3  area 3   16 roadname3
7                16, 19 roadname3  location 3  area 3   18 roadname3
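
The last row above is 18 rather than the 19 the question expects, because every row is expanded as a range with step 2. If comma-separated numbers are meant as exact values, as one commenter suggested, a possible adaptation is sketched below. It assumes that a range is always written as "a-b" or "a to b", that the road name is the last word, and that the helper name numbers_for is hypothetical.

```python
import re
import pandas as pd

def numbers_for(s):
    """Hypothetical helper: expand 'a-b' / 'a to b', else list the numbers."""
    m = re.search(r'(\d+)\s*(?:-|to)\s*(\d+)', s, flags=re.IGNORECASE)
    if m:
        lo, hi = int(m.group(1)), int(m.group(2))
        # step by 2 only when an (odd)/(even) marker is present
        has_parity = re.search(r'\((?:odd|even)\)', s, flags=re.IGNORECASE)
        return list(range(lo, hi + 1, 2 if has_parity else 1))
    # no range separator: take the numbers before the last word (road name)
    head = s.rsplit(' ', 1)[0]
    return [int(n) for n in re.findall(r'\d+', head)]

df = pd.DataFrame([
    {'var1': '220-224 (Even) roadname1', 'var2': 'location 1', 'var3': 'area 1'},
    {'var1': 'site of 5 to 9 (odd) roadname2', 'var2': 'location 2', 'var3': 'area 2'},
    {'var1': '16, 19 roadname3', 'var2': 'location 3', 'var3': 'area 3'},
])

road = df['var1'].str.extract(r'(\w+)$', expand=False)
out = (df.assign(num=df['var1'].map(numbers_for), road=road)
         .explode('num', ignore_index=True))  # ignore_index needs pandas >= 1.1
out['var1'] = out['num'].astype(str) + ' ' + out['road']
out = out.drop(columns=['num', 'road'])
print(out)
```

A road name that itself contains "to" between two numbers would be misread as a range, so the pattern would need tightening for real data.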