I have irregular 3D point data that looks something like this:
import numpy as np
import pandas as pd
xx, yy = np.meshgrid(
    np.linspace(-50, 50, 101),
    np.linspace(-50, 50, 101),
)
rng = np.random.default_rng(12345)
xx += rng.normal(size=101 * 101).reshape((101, 101))
yy += rng.normal(size=101 * 101).reshape((101, 101))
df3d = pd.DataFrame({
    "X": np.broadcast_to(xx, (11, 101, 101)).T.flatten(),
    "Y": np.broadcast_to(yy, (11, 101, 101)).T.flatten(),
    "Z": np.broadcast_to(np.arange(11, dtype=float), (101, 101, 11)).flatten(),
})
df3d
X Y Z
0 -51.423825 -51.287428 0.0
1 -51.423825 -51.287428 1.0
2 -51.423825 -51.287428 2.0
3 -51.423825 -51.287428 3.0
4 -51.423825 -51.287428 4.0
... ... ...
112206 51.593733 50.465087 6.0
112207 51.593733 50.465087 7.0
112208 51.593733 50.465087 8.0
112209 51.593733 50.465087 9.0
112210 51.593733 50.465087 10.0
[112211 rows x 3 columns]
For my analysis, I need to group these into 2D locations, each with one or more Z measures (it's not always 11 in my real-world data):
gb2d = df3d.groupby(["X", "Y"])
df2d = gb2d["Z"].count().to_frame("count")
df2d["Zmin"] = gb2d["Z"].min()
df2d["Zmax"] = gb2d["Z"].max()
df2d.reset_index()
X Y count Zmin Zmax
0 -51.995857 -49.653017 11 0.0 10.0
1 -51.939229 24.073164 11 0.0 10.0
2 -51.740996 -5.415639 11 0.0 10.0
3 -51.645503 21.830189 11 0.0 10.0
4 -51.639759 -42.850923 11 0.0 10.0
... ... ... ... ...
10196 51.593733 50.465087 11 0.0 10.0
10197 51.905789 37.538099 11 0.0 10.0
10198 51.989935 -32.464752 11 0.0 10.0
10199 52.530599 -40.110744 11 0.0 10.0
10200 52.902015 -6.111877 11 0.0 10.0
[10201 rows x 5 columns]
Question: How would I assign the integer index from df2d (shown above) back to the parent df3d frame?
My best attempt works, but does not scale well with larger frames. For example:
# Build an (X, Y) -> integer-position lookup, then index it with one tuple per row
idx2d = pd.Series(np.arange(len(df2d)), index=df2d.index)
df3d["idx2d"] = idx2d.loc[df3d[["X", "Y"]].to_records(index=False).tolist()].values
This works at this sample size, but blows past my 32 GB of RAM on my real-world data of 24 million points. What's a better way that won't eat up all of my RAM?
Answer:
IIUC, you can aggregate in one pass, name the resulting integer index, promote it to a column, and merge it back onto df3d:
(
    df3d.groupby(['X', 'Y'], as_index=False)['Z']
    .agg(['count', 'min', 'max'])  # one row per (X, Y) location
    .rename_axis('d2dindex')       # name the integer row index...
    .reset_index()                 # ...and promote it to a column
    .merge(df3d)                   # join back on the shared X and Y columns
)
Output:
d2dindex X Y count min max Z
0 0 -51.995857 -49.653017 11 0.0 10.0 0.0
1 0 -51.995857 -49.653017 11 0.0 10.0 1.0
2 0 -51.995857 -49.653017 11 0.0 10.0 2.0
3 0 -51.995857 -49.653017 11 0.0 10.0 3.0
4 0 -51.995857 -49.653017 11 0.0 10.0 4.0
... ... ... ... ... ... ... ...
112206 10200 52.902015 -6.111877 11 0.0 10.0 6.0
112207 10200 52.902015 -6.111877 11 0.0 10.0 7.0
112208 10200 52.902015 -6.111877 11 0.0 10.0 8.0
112209 10200 52.902015 -6.111877 11 0.0 10.0 9.0
112210 10200 52.902015 -6.111877 11 0.0 10.0 10.0
[112211 rows x 7 columns]
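If the end goal is only the integer group label on df3d, not the aggregate columns, an alternative worth mentioning (not part of the answer above) is GroupBy.ngroup(), which avoids the merge entirely. With the groupby default sort=True, ngroup() numbers groups in sorted (X, Y) key order, which is the same order as df2d's rows. A minimal sketch:

# Label each df3d row with its 2D group's ordinal position.
# With sort=True (the groupby default), ngroup() numbers groups in
# sorted (X, Y) order, matching df2d's default integer index.
df3d["idx2d"] = df3d.groupby(["X", "Y"]).ngroup()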