python - Pandas Dataframe: add columns based on list of samples and column headers - Stack Overflow

I want to add columns to my df whose values are looked up from the sample list in one column, using the remaining column headers as sample names. In detail: based on column 11, I want to add three columns named 11_1, 11_2 and 11_3, holding the values of the samples listed in 11, and then do the same for 00.

A small part of my input data:

import numpy as np
import pandas as pd

df_matrix_data = {'11': [['P4-1', 'P4-2', 'P4-3'], ['P4-1', 'P4-3', 'P4-4']],
                  '00': [['P4-4', 'P4-6', 'P4-7',], ['P4-2', 'P4-5', 'P4-7']],
                  'P4-1': [1, 2], 'P4-2': [6, 8], 'P4-3': [5, 2], 'P4-4': [2, 3], 'P4-5': [np.nan, 2], 'P4-6': [6, np.nan],
                  'P4-7': [3, 2]}
df_matrix = pd.DataFrame.from_dict(df_matrix_data)

will look like this:

                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  P4-7
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0     3
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN     2

and desired output should look like this:

                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  P4-7  11_1  11_2  11_3  00_1  00_2  00_3
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0     3     1     6     5     2     6     3
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN     2     2     2     3     8     2     2

Any ideas on how to do this?

asked Mar 12 at 12:36 by emor

3 Answers

Another possible solution:

df_matrix.assign(
    **{f"{k}_{i+1}": df_matrix.apply(
        lambda row: row[row[k][i]], axis=1) 
       for k in ['11', '00'] for i in range(3)})

This uses a dictionary comprehension inside assign, iterating over each list column (k, e.g. '11') and each list position (i, 0 to 2). Each new column such as 11_1 is built by taking the sample name at that position (e.g. row['11'][0]) and looking up its value in the same row via the lambda.
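
To make the lookup concrete, here is a small sketch that unrolls a single step of that comprehension (k='11', i=0) on the question's data. The names first_sample, col_11_1 and result are illustrative only. Note that assign returns a new DataFrame, so the result has to be bound to a name if you want to keep it.

```python
import numpy as np
import pandas as pd

df_matrix = pd.DataFrame({
    '11': [['P4-1', 'P4-2', 'P4-3'], ['P4-1', 'P4-3', 'P4-4']],
    '00': [['P4-4', 'P4-6', 'P4-7'], ['P4-2', 'P4-5', 'P4-7']],
    'P4-1': [1, 2], 'P4-2': [6, 8], 'P4-3': [5, 2], 'P4-4': [2, 3],
    'P4-5': [np.nan, 2], 'P4-6': [6, np.nan], 'P4-7': [3, 2],
})

# Name of the first sample listed in column '11', row by row
first_sample = df_matrix['11'].str[0]

# Look that name up in the same row (one step of the comprehension)
col_11_1 = df_matrix.apply(lambda row: row[row['11'][0]], axis=1)

# assign leaves df_matrix untouched; keep the frame it returns
result = df_matrix.assign(**{'11_1': col_11_1})
print(result[['11', '11_1']])
```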


To avoid the inefficient apply:

df_matrix.assign(
    **{f"{k}_{i+1}": df_matrix.values[
    np.arange(len(df_matrix)), 
    df_matrix.columns.get_indexer(df_matrix[k].str[i])]
       for k in ['11', '00'] for i in range(3)})

It uses Index.get_indexer to convert the column names into integer column positions, then NumPy integer indexing to pick one value per row.
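
A minimal sketch of that idea on a toy frame (the frame toy and the names col_pos, row_pos and picked are hypothetical, not from the answer). One caveat worth checking on real data: get_indexer returns -1 for a label it cannot find, and -1 used as a NumPy index silently selects the last column.

```python
import numpy as np
import pandas as pd

# Toy data: 'pick' names the column whose value we want in each row
toy = pd.DataFrame({"pick": ["b", "a"], "a": [10, 20], "b": [30, 40]})

# Integer position of each picked name within toy.columns
col_pos = toy.columns.get_indexer(toy["pick"])

# Guard against labels that were not found (reported as -1)
if (col_pos < 0).any():
    raise KeyError("some names in 'pick' are not columns of the frame")

# Pair row i with column col_pos[i] in the underlying array
row_pos = np.arange(len(toy))
picked = toy.to_numpy()[row_pos, col_pos]
print(picked)
```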

Output:

                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  \
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0   
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN   

   P4-7  11_1  11_2  11_3  00_1  00_2  00_3  
0     3     1     6     5     2   6.0     3  
1     2     2     2     3     8   2.0     2

You can split the input in two depending on the target column, then reshape with melt, merge, pivot (with help of groupby.cumcount), and join:

# columns to consider
cols = ['11', '00']

# first reshape the columns with the lists
tmp1 = (df_matrix[cols]
        .melt(value_name='col', ignore_index=False)
        .explode('col').reset_index()
        .assign(n=lambda x: x.groupby(['index', 'variable']).cumcount()+1)
       )
# then reshape the columns with the values
tmp2 = (df_matrix.drop(columns=cols, errors='ignore')
        .melt(var_name='col', ignore_index=False)
        .reset_index()
       )

# merge, reshape, rename columns
out = tmp1.merge(tmp2, how='left').pivot(index='index', columns=['variable', 'n'], values='value')
out.columns = out.columns.map(lambda x: f'{x[0]}_{x[1]}')

# join to original
out = df_matrix.join(out)

Output:

                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  P4-7  11_1  11_2  11_3  00_1  00_2  00_3
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0     3   1.0   6.0   5.0   2.0   6.0   3.0
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN     2   2.0   2.0   3.0   8.0   2.0   2.0

Intermediates:

# tmp1
    index variable   col  n
0       0       11  P4-1  1
1       0       11  P4-2  2
2       0       11  P4-3  3
3       1       11  P4-1  1
4       1       11  P4-3  2
5       1       11  P4-4  3
6       0       00  P4-4  1
7       0       00  P4-6  2
8       0       00  P4-7  3
9       1       00  P4-2  1
10      1       00  P4-5  2
11      1       00  P4-7  3

# tmp2
    index   col  value
0       0  P4-1    1.0
1       1  P4-1    2.0
2       0  P4-2    6.0
3       1  P4-2    8.0
4       0  P4-3    5.0
5       1  P4-3    2.0
6       0  P4-4    2.0
7       1  P4-4    3.0
8       0  P4-5    NaN
9       1  P4-5    2.0
10      0  P4-6    6.0
11      1  P4-6    NaN
12      0  P4-7    3.0
13      1  P4-7    2.0
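
The numbering column n in tmp1 comes from groupby.cumcount, which counts rows within each group starting from 0. A small sketch on a toy frame (toy and long_form are hypothetical names) shows how melt, explode and cumcount combine to give each list element its position:

```python
import pandas as pd

# Toy frame with one list column of uneven lengths
toy = pd.DataFrame({"11": [["a", "b", "c"], ["a", "c"]]})

long_form = (toy.melt(value_name="col", ignore_index=False)  # keep the row labels
                .explode("col")                              # one row per list element
                .reset_index())                              # row labels become 'index'

# Position of each element within its (row, source column) group, 1-based
long_form["n"] = long_form.groupby(["index", "variable"]).cumcount() + 1
print(long_form)
```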

Here is the script:

import pandas as pd
import numpy as np

df_matrix_data = {
    '11': [['P4-1', 'P4-2', 'P4-3'], ['P4-1', 'P4-3', 'P4-4']],
    '00': [['P4-4', 'P4-6', 'P4-7'], ['P4-2', 'P4-5', 'P4-7']],
    'P4-1': [1, 2], 'P4-2': [6, 8], 'P4-3': [5, 2], 'P4-4': [2, 3], 
    'P4-5': [np.nan, 2], 'P4-6': [6, np.nan], 'P4-7': [3, 2]
}

df_matrix = pd.DataFrame.from_dict(df_matrix_data)

def extract_values(row, col_name):
    return [row[item] if item in row else np.nan for item in row[col_name]]

for col in ['11', '00']:
    extracted_values = df_matrix.apply(lambda row: extract_values(row, col), axis=1)
    df_expanded = pd.DataFrame(extracted_values.tolist(), columns=[f"{col}_{i+1}" for i in range(extracted_values.str.len().max())])
    df_matrix = pd.concat([df_matrix, df_expanded], axis=1)

print(df_matrix)

The 00_2 column comes out as float (6.0, 2.0). If you want to drop the trailing .0, add this line just above the "df_matrix = pd.concat(...)" line inside the loop:

df_expanded = df_expanded.map(lambda x: int(x) if pd.notna(x) else np.nan)
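
As an alternative sketch, and an assumption on my part rather than part of the script above: pandas' nullable Int64 dtype can hold whole numbers alongside missing values, so a column that contains NaN does not fall back to float. The frame below is a hand-made stand-in for df_expanded. Also note that DataFrame.map was added in pandas 2.1; earlier versions call it applymap.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for df_expanded, including one missing value
df_expanded = pd.DataFrame({
    "00_1": [2, 8],
    "00_2": [6.0, 2.0],
    "00_3": [3.0, np.nan],
})

# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>
as_int = df_expanded.astype("Int64")
print(as_int)
```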
