I want to add columns to my df whose values are looked up from the sample list stored in one column, using the names in that list as column headers. In detail: based on the 11 column, I want to add 3 columns named 11_1, 11_2 and 11_3, holding the values of the samples listed in 11 for that row, and then do the same for 00.
A small sample of my input data:
import numpy as np
import pandas as pd

df_matrix_data = {'11': [['P4-1', 'P4-2', 'P4-3'], ['P4-1', 'P4-3', 'P4-4']],
                  '00': [['P4-4', 'P4-6', 'P4-7'], ['P4-2', 'P4-5', 'P4-7']],
                  'P4-1': [1, 2], 'P4-2': [6, 8], 'P4-3': [5, 2], 'P4-4': [2, 3],
                  'P4-5': [np.nan, 2], 'P4-6': [6, np.nan],
                  'P4-7': [3, 2]}
df_matrix = pd.DataFrame.from_dict(df_matrix_data)
The DataFrame looks like this:
                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  P4-7
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0     3
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN     2
and the desired output should look like this:
                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  P4-7  11_1  11_2  11_3  00_1  00_2  00_3
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0     3     1     6     5     2     6     3
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN     2     2     2     3     8     2     2
Any ideas on how to do this?
Asked Mar 12 at 12:36 by emor
3 Answers
Another possible solution:
df_matrix.assign(
    **{f"{k}_{i+1}": df_matrix.apply(
           lambda row: row[row[k][i]], axis=1)
       for k in ['11', '00'] for i in range(3)})
It uses a dictionary comprehension inside assign, iterating over each key (e.g., '11') and list index (0 to 2). Each generated column, such as 11_1, takes the list element (e.g., row['11'][0]) and returns the value stored under that name in the same row via the lambda.
To avoid the slow row-wise apply:
df_matrix.assign(
    **{f"{k}_{i+1}": df_matrix.values[
           np.arange(len(df_matrix)),
           df_matrix.columns.get_indexer(df_matrix[k].str[i])]
       for k in ['11', '00'] for i in range(3)})
It uses Index.get_indexer to convert column names to integer positions, then selects one value per row with NumPy integer-array indexing.
Output:
                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  \
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN

   P4-7 11_1 11_2 11_3 00_1 00_2 00_3
0     3    1    6    5    2  6.0    3
1     2    2    2    3    8  2.0    2
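The get_indexer step in the apply-free version above may be easier to follow in isolation. The sketch below uses a hypothetical toy frame (the names df, pick, a, b and c are made up for illustration) and picks the first listed column for each row. It also adds a guard that is not part of the answer: get_indexer returns -1 for names it cannot find, and -1 would silently select the last column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "pick" holds column names, the rest hold values
df = pd.DataFrame({
    "pick": [["a", "b"], ["b", "c"]],
    "a": [1, 2],
    "b": [6, 8],
    "c": [5, 2],
})

first_names = df["pick"].str[0]                 # first name per row: "a", "b"
col_pos = df.columns.get_indexer(first_names)   # integer positions in df.columns

# Guard (an addition, not in the answer): -1 means "name not found"
if (col_pos == -1).any():
    raise KeyError("a listed name is not a column of df")

# Pair row i with its own column position to pull one value per row
row_pos = np.arange(len(df))
picked = df.to_numpy()[row_pos, col_pos]
print(picked)
```

Because the frame mixes lists and numbers, to_numpy() returns an object array, so the selected values keep their original Python or NumPy scalar types.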
You can split the input in two depending on the target column, then reshape with melt, merge, pivot (with the help of groupby.cumcount), and join:
# columns to consider
cols = ['11', '00']
# first reshape the columns with the lists
tmp1 = (df_matrix[cols]
        .melt(value_name='col', ignore_index=False)
        .explode('col').reset_index()
        .assign(n=lambda x: x.groupby(['index', 'variable']).cumcount()+1)
       )
# then reshape the columns with the values
tmp2 = (df_matrix.drop(columns=cols, errors='ignore')
        .melt(var_name='col', ignore_index=False)
        .reset_index()
       )
# merge, reshape, rename columns
out = tmp1.merge(tmp2, how='left').pivot(index='index', columns=['variable', 'n'], values='value')
out.columns = out.columns.map(lambda x: f'{x[0]}_{x[1]}')
# join to original
out = df_matrix.join(out)
Output:
                   11                  00  P4-1  P4-2  P4-3  P4-4  P4-5  P4-6  P4-7  11_1  11_2  11_3  00_1  00_2  00_3
0  [P4-1, P4-2, P4-3]  [P4-4, P4-6, P4-7]     1     6     5     2   NaN   6.0     3   1.0   6.0   5.0   2.0   6.0   3.0
1  [P4-1, P4-3, P4-4]  [P4-2, P4-5, P4-7]     2     8     2     3   2.0   NaN     2   2.0   2.0   3.0   8.0   2.0   2.0
Intermediates:
# tmp1
    index variable   col  n
0       0       11  P4-1  1
1       0       11  P4-2  2
2       0       11  P4-3  3
3       1       11  P4-1  1
4       1       11  P4-3  2
5       1       11  P4-4  3
6       0       00  P4-4  1
7       0       00  P4-6  2
8       0       00  P4-7  3
9       1       00  P4-2  1
10      1       00  P4-5  2
11      1       00  P4-7  3
# tmp2
    index   col  value
0       0  P4-1    1.0
1       1  P4-1    2.0
2       0  P4-2    6.0
3       1  P4-2    8.0
4       0  P4-3    5.0
5       1  P4-3    2.0
6       0  P4-4    2.0
7       1  P4-4    3.0
8       0  P4-5    NaN
9       1  P4-5    2.0
10      0  P4-6    6.0
11      1  P4-6    NaN
12      0  P4-7    3.0
13      1  P4-7    2.0
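The n column in tmp1 is what later becomes the _1, _2, _3 suffix. As a minimal sketch of how groupby.cumcount produces it, here is the '11' half of tmp1 rebuilt by hand (the variable name long_form is hypothetical, chosen for illustration):

```python
import pandas as pd

# Hand-built copy of the '11' rows of tmp1, before the n column is added
long_form = pd.DataFrame({
    "index": [0, 0, 0, 1, 1, 1],
    "variable": ["11"] * 6,
    "col": ["P4-1", "P4-2", "P4-3", "P4-1", "P4-3", "P4-4"],
})

# cumcount numbers the rows within each (index, variable) group from 0,
# so adding 1 gives the 1-based position of each name in its original list
long_form["n"] = long_form.groupby(["index", "variable"]).cumcount() + 1
print(long_form)
```

The numbering relies on explode keeping the list elements in their original order within each row, which is why the counter can stand in for the list position.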
Here is the script:
import pandas as pd
import numpy as np
df_matrix_data = {
'11': [['P4-1', 'P4-2', 'P4-3'], ['P4-1', 'P4-3', 'P4-4']],
'00': [['P4-4', 'P4-6', 'P4-7'], ['P4-2', 'P4-5', 'P4-7']],
'P4-1': [1, 2], 'P4-2': [6, 8], 'P4-3': [5, 2], 'P4-4': [2, 3],
'P4-5': [np.nan, 2], 'P4-6': [6, np.nan], 'P4-7': [3, 2]
}
df_matrix = pd.DataFrame.from_dict(df_matrix_data)
def extract_values(row, col_name):
    # look up each listed sample name in the row; NaN if there is no such column
    return [row[item] if item in row else np.nan for item in row[col_name]]
for col in ['11', '00']:
    extracted_values = df_matrix.apply(lambda row: extract_values(row, col), axis=1)
    df_expanded = pd.DataFrame(extracted_values.tolist(), columns=[f"{col}_{i+1}" for i in range(extracted_values.str.len().max())])
    df_matrix = pd.concat([df_matrix, df_expanded], axis=1)
print(df_matrix)
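The 00_2 column is printed with a trailing .0 because its values come from P4-5 and P4-6, which are float columns (they contain NaN). To remove it, add the following line inside the loop, just above the "df_matrix = pd.concat(...)" line. DataFrame.map requires pandas 2.1 or later; on older versions use applymap instead. A column that still contains NaN after the conversion stays float.
    df_expanded = df_expanded.map(lambda x: int(x) if pd.notna(x) else np.nan)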
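If some of the new columns can still contain NaN, one alternative (a suggestion, not part of the answer above) is pandas' nullable Int64 dtype, which stores missing values as <NA> and so keeps the rest of the column as integers. A minimal sketch on hypothetical data, assuming every non-missing value is a whole number (astype raises otherwise):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_expanded, with one missing value
df_expanded = pd.DataFrame({"00_1": [2.0, 8.0], "00_2": [6.0, np.nan]})

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>
as_int = df_expanded.astype("Int64")
print(as_int)
print(as_int.dtypes)
```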