python - Expand and then sort dataframe based on the value order in the first row

Suppose I have a DataFrame with the following format of strings separated by commas:

Index	ColumnName
0	apple,peach,orange,pear,
1	orange, pear,apple
2	pear
3	peach,apple
4	orange

Suppose I have a DataFrame with the following format of strings separated by commas:

Index	ColumnName
0	apple,peach,orange,pear,
1	orange, pear,apple
2	pear
3	peach,apple
4	orange

The actual number of rows will be greater than 10,000.

I want to expand the DataFrame and sort the DataFrame by row 0.

My expected output is below, where None is of type NoneType:

Index	0	1	2	3
0	apple	peach	orange	pear
1	apple	None	orange	pear
2	None	None	None	pear
3	apple	peach	None	None
4	None	None	orange	None

I have expanded the data using the following code:

df = df['ColumnName'].str.split(',', expand=True)  # Expand initial DataFrame

However, I am unable to sort or reorder the data as desired despite trying various combinations of df.sort_values().

Share Improve this question edited Mar 19 at 13:33 qingy2019 6489 silver badges25 bronze badges asked Mar 16 at 18:41 user29981437 233 bronze badges

Add a comment |

3 Answers 3

Sorted by: Reset to default 1

Here is another way:

s = df['columnName'].str.strip(',').str.split(', ?').explode()

s.set_axis(pd.MultiIndex.from_frame(s.groupby(s).ngroup().reset_index())).unstack()

Output:

0          0       1      2     3
index                            
0      apple  orange  peach  pear
1      apple  orange    NaN  pear
2        NaN     NaN    NaN  pear
3      apple     NaN  peach   NaN
4        NaN  orange    NaN   NaN

We can use np.repeat and after check matches cells for each row in your dataframe

#df = df.set_index("index") #optional if index is a column
all_data = df["columnName"].str.split(",", expand=True).to_numpy()
data_ordered = np.repeat([all_data[0, :-1]], df.shape[0], axis=0)
final_df = \
pd.DataFrame(np.where(np.equal(data_ordered[..., np.newaxis], 
                               all_data[:, np.newaxis, :])
                .any(axis=2), 
                      data_ordered, 
                      None), index=df.index)
print(final_df)

#           0      1       2     3
#index                            
#0      apple  peach  orange  pear
#1      apple   None  orange  pear
#2       None   None    None  pear
#3      apple  peach    None  None
#4       None   None  orange  None

A possible solution:

aux = df['columnName'].str.split(',')
order = aux.iloc[0]
exploded = aux.explode()
m = exploded.values[:, None] == np.array(order)[None, :]

rows, cols = np.where(m)
result_rows = exploded.index

result = np.full((len(aux), len(order)), None)
result[result_rows, cols] = exploded.values[rows]

(pd.DataFrame(result, index=df.index, columns=order)
 .set_axis(range(len(order)), axis=1).reset_index())

With explode, it expands lists into individual rows for element-wise comparison. Afterwards, with m = exploded.values[:, None] == np.array(order)[None, :], it uses numpy broadcasting to create a 2D boolean mask of matches. Next, np.where finds matching positions, and result[result_rows, cols] = ... fills aligned values into a preallocated array.

Another possible solution, based on dictionary comprehension:

out = df['columnName'].str.split(',')

(pd.DataFrame([{
    x: x if x in row else None for x in out[0]} for row in out])
.set_axis(range(len(df)-1), axis=1).reset_index())

It constructs a dataframe where each row aligns elements to the order of the first row. For each row, a dictionary comprehension checks if elements from the first row exist (assigning None otherwise), creating aligned columns. set_axis reindexes columns numerically, and reset_index adds the original index as a column.

Yet another possible solution, based on merge:

out = df['columnName'].str.split(',', expand=True).T
pd.concat(
    [out[[0]]] + [out[[0]].merge(out[[i]], left_on=0, right_on=i, how='left')[[i]] 
     for i in out.columns[1:]], axis=1).T.reset_index()

It splits the columnName into a transposed dataframe (out) (assuming the last comma in the first row is a typo), then uses pd.concat to sequentially merge each subsequent column with the first column (left_on=0, right_on=i, via pd.merge), aligning values based on the first column’s order. The result is transposed and combined, mimicking row-wise alignment to the first row’s structure. The final reset_index restores row indices.

Output:

   index      0      1       2     3
0      0  apple  peach  orange  pear
1      1  apple    NaN  orange   NaN
2      2    NaN    NaN     NaN  pear
3      3  apple  peach     NaN   NaN
4      4    NaN    NaN  orange   NaN

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

python - Expand and then sort dataframe based on the value order in the first row - Stack Overflow

3 Answers 3

与本文相关的文章

评论列表(0)