最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

python - Expand and then sort dataframe based on the value order in the first row - Stack Overflow

programmeradmin4浏览0评论

Suppose I have a DataFrame with the following format of strings separated by commas:

Index ColumnName
0 apple,peach,orange,pear,
1 orange, pear,apple
2 pear
3 peach,apple
4 orange

Suppose I have a DataFrame with the following format of strings separated by commas:

Index ColumnName
0 apple,peach,orange,pear,
1 orange, pear,apple
2 pear
3 peach,apple
4 orange

The actual number of rows will be greater than 10,000.

I want to expand the DataFrame and sort the DataFrame by row 0.

My expected output is below, where None is of type NoneType:

Index 0 1 2 3
0 apple peach orange pear
1 apple None orange pear
2 None None None pear
3 apple peach None None
4 None None orange None

I have expanded the data using the following code:

df = df['ColumnName'].str.split(',', expand=True)  # Expand initial DataFrame

However, I am unable to sort or reorder the data as desired despite trying various combinations of df.sort_values().

Share Improve this question edited Mar 19 at 13:33 qingy2019 6489 silver badges25 bronze badges asked Mar 16 at 18:41 user29981437user29981437 233 bronze badges
Add a comment  | 

3 Answers 3

Reset to default 1

Here is another way:

s = df['columnName'].str.strip(',').str.split(', ?').explode()

s.set_axis(pd.MultiIndex.from_frame(s.groupby(s).ngroup().reset_index())).unstack()

Output:

0          0       1      2     3
index                            
0      apple  orange  peach  pear
1      apple  orange    NaN  pear
2        NaN     NaN    NaN  pear
3      apple     NaN  peach   NaN
4        NaN  orange    NaN   NaN

We can use np.repeat and after check matches cells for each row in your dataframe

#df = df.set_index("index") #optional if index is a column
all_data = df["columnName"].str.split(",", expand=True).to_numpy()
data_ordered = np.repeat([all_data[0, :-1]], df.shape[0], axis=0)
final_df = \
pd.DataFrame(np.where(np.equal(data_ordered[..., np.newaxis], 
                               all_data[:, np.newaxis, :])
                .any(axis=2), 
                      data_ordered, 
                      None), index=df.index)
print(final_df)

#           0      1       2     3
#index                            
#0      apple  peach  orange  pear
#1      apple   None  orange  pear
#2       None   None    None  pear
#3      apple  peach    None  None
#4       None   None  orange  None

A possible solution:

aux = df['columnName'].str.split(',')
order = aux.iloc[0]
exploded = aux.explode()
m = exploded.values[:, None] == np.array(order)[None, :]

rows, cols = np.where(m)
result_rows = exploded.index

result = np.full((len(aux), len(order)), None)
result[result_rows, cols] = exploded.values[rows]

(pd.DataFrame(result, index=df.index, columns=order)
 .set_axis(range(len(order)), axis=1).reset_index())

With explode, it expands lists into individual rows for element-wise comparison. Afterwards, with m = exploded.values[:, None] == np.array(order)[None, :], it uses numpy broadcasting to create a 2D boolean mask of matches. Next, np.where finds matching positions, and result[result_rows, cols] = ... fills aligned values into a preallocated array.


Another possible solution, based on dictionary comprehension:

out = df['columnName'].str.split(',')

(pd.DataFrame([{
    x: x if x in row else None for x in out[0]} for row in out])
.set_axis(range(len(df)-1), axis=1).reset_index())

It constructs a dataframe where each row aligns elements to the order of the first row. For each row, a dictionary comprehension checks if elements from the first row exist (assigning None otherwise), creating aligned columns. set_axis reindexes columns numerically, and reset_index adds the original index as a column.


Yet another possible solution, based on merge:

out = df['columnName'].str.split(',', expand=True).T
pd.concat(
    [out[[0]]] + [out[[0]].merge(out[[i]], left_on=0, right_on=i, how='left')[[i]] 
     for i in out.columns[1:]], axis=1).T.reset_index()

It splits the columnName into a transposed dataframe (out) (assuming the last comma in the first row is a typo), then uses pd.concat to sequentially merge each subsequent column with the first column (left_on=0, right_on=i, via pd.merge), aligning values based on the first column’s order. The result is transposed and combined, mimicking row-wise alignment to the first row’s structure. The final reset_index restores row indices.


Output:

   index      0      1       2     3
0      0  apple  peach  orange  pear
1      1  apple    NaN  orange   NaN
2      2    NaN    NaN     NaN  pear
3      3  apple  peach     NaN   NaN
4      4    NaN    NaN  orange   NaN
发布评论

评论列表(0)

  1. 暂无评论