Suppose I have a DataFrame with the following format of strings separated by commas:
Index | ColumnName |
---|---|
0 | apple,peach,orange,pear, |
1 | orange, pear,apple |
2 | pear |
3 | peach,apple |
4 | orange |
Suppose I have a DataFrame with the following format of strings separated by commas:
Index | ColumnName |
---|---|
0 | apple,peach,orange,pear, |
1 | orange, pear,apple |
2 | pear |
3 | peach,apple |
4 | orange |
The actual number of rows will be greater than 10,000.
I want to expand the DataFrame and sort the DataFrame by row 0.
My expected output is below, where None is of type NoneType:
Index | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | apple | peach | orange | pear |
1 | apple | None | orange | pear |
2 | None | None | None | pear |
3 | apple | peach | None | None |
4 | None | None | orange | None |
I have expanded the data using the following code:
df = df['ColumnName'].str.split(',', expand=True) # Expand initial DataFrame
However, I am unable to sort or reorder the data as desired despite trying various combinations of df.sort_values()
.
3 Answers
Reset to default 1Here is another way:
s = df['columnName'].str.strip(',').str.split(', ?').explode()
s.set_axis(pd.MultiIndex.from_frame(s.groupby(s).ngroup().reset_index())).unstack()
Output:
0 0 1 2 3
index
0 apple orange peach pear
1 apple orange NaN pear
2 NaN NaN NaN pear
3 apple NaN peach NaN
4 NaN orange NaN NaN
We can use np.repeat
and after check matches cells for each row in your dataframe
#df = df.set_index("index") #optional if index is a column
all_data = df["columnName"].str.split(",", expand=True).to_numpy()
data_ordered = np.repeat([all_data[0, :-1]], df.shape[0], axis=0)
final_df = \
pd.DataFrame(np.where(np.equal(data_ordered[..., np.newaxis],
all_data[:, np.newaxis, :])
.any(axis=2),
data_ordered,
None), index=df.index)
print(final_df)
# 0 1 2 3
#index
#0 apple peach orange pear
#1 apple None orange pear
#2 None None None pear
#3 apple peach None None
#4 None None orange None
A possible solution:
aux = df['columnName'].str.split(',')
order = aux.iloc[0]
exploded = aux.explode()
m = exploded.values[:, None] == np.array(order)[None, :]
rows, cols = np.where(m)
result_rows = exploded.index
result = np.full((len(aux), len(order)), None)
result[result_rows, cols] = exploded.values[rows]
(pd.DataFrame(result, index=df.index, columns=order)
.set_axis(range(len(order)), axis=1).reset_index())
With explode
, it expands lists into individual rows for element-wise comparison. Afterwards, with m = exploded.values[:, None] == np.array(order)[None, :]
, it uses numpy broadcasting to create a 2D boolean mask of matches. Next, np.where
finds matching positions, and result[result_rows, cols] = ...
fills aligned values into a preallocated array.
Another possible solution, based on dictionary comprehension:
out = df['columnName'].str.split(',')
(pd.DataFrame([{
x: x if x in row else None for x in out[0]} for row in out])
.set_axis(range(len(df)-1), axis=1).reset_index())
It constructs a dataframe where each row aligns elements to the order of the first row. For each row, a dictionary comprehension checks if elements from the first row exist (assigning None
otherwise), creating aligned columns. set_axis
reindexes columns numerically, and reset_index
adds the original index as a column.
Yet another possible solution, based on merge
:
out = df['columnName'].str.split(',', expand=True).T
pd.concat(
[out[[0]]] + [out[[0]].merge(out[[i]], left_on=0, right_on=i, how='left')[[i]]
for i in out.columns[1:]], axis=1).T.reset_index()
It splits the columnName
into a transposed dataframe (out
) (assuming the last comma in the first row is a typo), then uses pd.concat
to sequentially merge each subsequent column with the first column (left_on=0
, right_on=i
, via pd.merge
), aligning values based on the first column’s order. The result is transposed and combined, mimicking row-wise alignment to the first row’s structure. The final reset_index
restores row indices.
Output:
index 0 1 2 3
0 0 apple peach orange pear
1 1 apple NaN orange NaN
2 2 NaN NaN NaN pear
3 3 apple peach NaN NaN
4 4 NaN NaN orange NaN