I have irregular 3D point data that looks something like this:
import numpy as np
import pandas as pd
xx, yy = np.meshgrid(
    np.linspace(-50, 50, 101),
    np.linspace(-50, 50, 101),
)
rng = np.random.default_rng(12345)
xx += rng.normal(size=101 * 101).reshape((101, 101))
yy += rng.normal(size=101 * 101).reshape((101, 101))
df3d = pd.DataFrame({
    "X": np.broadcast_to(xx, (11, 101, 101)).T.flatten(),
    "Y": np.broadcast_to(yy, (11, 101, 101)).T.flatten(),
    "Z": np.broadcast_to(np.arange(11, dtype=float), (101, 101, 11)).flatten(),
})
df3d
X Y Z
0 -51.423825 -51.287428 0.0
1 -51.423825 -51.287428 1.0
2 -51.423825 -51.287428 2.0
3 -51.423825 -51.287428 3.0
4 -51.423825 -51.287428 4.0
... ... ...
112206 51.593733 50.465087 6.0
112207 51.593733 50.465087 7.0
112208 51.593733 50.465087 8.0
112209 51.593733 50.465087 9.0
112210 51.593733 50.465087 10.0
[112211 rows x 3 columns]
For my analysis, I need to group these into 2D locations, each with one or more Z measures (it's not always 11 in my real-world data):
gb2d = df3d.groupby(["X", "Y"])
df2d = gb2d["Z"].count().to_frame("count")
df2d["Zmin"] = gb2d["Z"].min()
df2d["Zmax"] = gb2d["Z"].max()
df2d.reset_index()
X Y count Zmin Zmax
0 -51.995857 -49.653017 11 0.0 10.0
1 -51.939229 24.073164 11 0.0 10.0
2 -51.740996 -5.415639 11 0.0 10.0
3 -51.645503 21.830189 11 0.0 10.0
4 -51.639759 -42.850923 11 0.0 10.0
... ... ... ... ...
10196 51.593733 50.465087 11 0.0 10.0
10197 51.905789 37.538099 11 0.0 10.0
10198 51.989935 -32.464752 11 0.0 10.0
10199 52.530599 -40.110744 11 0.0 10.0
10200 52.902015 -6.111877 11 0.0 10.0
[10201 rows x 5 columns]
Question: How would I assign the integer index from df2d (shown above) back to the parent df3d frame?
My best attempt works, but does not scale well with larger frames. For example:
# Build an (X, Y) -> integer-position lookup, then index it with one tuple per row
idx2d = pd.Series(np.arange(len(df2d)), index=df2d.index)
df3d["idx2d"] = idx2d.loc[df3d[["X", "Y"]].to_records(index=False).tolist()].values
This works at this sample size, but blows past my 32 GB of RAM on my real-world data of 24 million points. What's a better way that won't eat up all of my RAM?
Answer:
IIUC, you can aggregate in one pass, name the resulting integer index, promote it to a column, and merge it back onto df3d:
(
    df3d.groupby(['X', 'Y'], as_index=False)['Z']
    .agg(['count', 'min', 'max'])  # one row per (X, Y) location
    .rename_axis('d2dindex')       # name the integer row index...
    .reset_index()                 # ...and promote it to a column
    .merge(df3d)                   # join back on the shared X and Y columns
)
Output:
d2dindex X Y count min max Z
0 0 -51.995857 -49.653017 11 0.0 10.0 0.0
1 0 -51.995857 -49.653017 11 0.0 10.0 1.0
2 0 -51.995857 -49.653017 11 0.0 10.0 2.0
3 0 -51.995857 -49.653017 11 0.0 10.0 3.0
4 0 -51.995857 -49.653017 11 0.0 10.0 4.0
... ... ... ... ... ... ... ...
112206 10200 52.902015 -6.111877 11 0.0 10.0 6.0
112207 10200 52.902015 -6.111877 11 0.0 10.0 7.0
112208 10200 52.902015 -6.111877 11 0.0 10.0 8.0
112209 10200 52.902015 -6.111877 11 0.0 10.0 9.0
112210 10200 52.902015 -6.111877 11 0.0 10.0 10.0
[112211 rows x 7 columns]
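If the end goal is only the integer group label on df3d, not the aggregate columns, an alternative worth mentioning (not part of the answer above) is GroupBy.ngroup(), which avoids the merge entirely. With the groupby default sort=True, ngroup() numbers groups in sorted (X, Y) key order, which is the same order as df2d's rows. A minimal sketch:

# Label each df3d row with its 2D group's ordinal position.
# With sort=True (the groupby default), ngroup() numbers groups in
# sorted (X, Y) order, matching df2d's default integer index.
df3d["idx2d"] = df3d.groupby(["X", "Y"]).ngroup()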