I need to pass a variable number of columns to a user-defined function. The docs suggest first creating a pl.struct and then letting the function extract the individual fields. Here's the example given on the website:
import polars as pl
from numba import guvectorize, int64, float64

# Add two arrays together:
@guvectorize([(int64[:], int64[:], float64[:])], "(n),(n)->(n)")
def add(arr, arr2, result):
    for i in range(len(arr)):
        result[i] = arr[i] + arr2[i]
df3 = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})
out = df3.select(
    # Create a struct that has two columns in it:
    pl.struct(["values1", "values2"])
    # Pass the struct to a lambda that then passes the individual columns to
    # the add() function:
    .map_batches(
        lambda combined: add(
            combined.struct.field("values1"), combined.struct.field("values2")
        )
    )
    .alias("add_columns")
)
print(out)
Now, in my case, I don't know upfront how many columns will enter the pl.struct. Think of using a selector like pl.struct(cs.float()). In my user-defined function, I need to operate on a np.array. That is, the user-defined function will have one input argument that takes the whole array. How can I then extract it within the user-defined function?
EDIT: The output of my user-defined function will be an array that has the exact same shape as the input array. This array needs to be appended to the existing dataframe on axis 1 (new columns).
EDIT: Using pl.concat_arr might be one way to attack my concrete issue. My use case would be along the following lines:
import polars as pl

def multiply_by_two(arr):
    # In reality, there are some complex array operations.
    return arr * 2

df = pl.DataFrame({"values1": [1, 2, 3], "values2": [10, 20, 30]})
out = df.select(
    # Create an array consisting of two columns:
    pl.concat_arr(["values1", "values2"])
    .map_batches(lambda arr: multiply_by_two(arr))
    .alias("result")
)
The new computed column result holds an array that has the same shape as the input array. I need to unnest the array (something like pl.struct.unnest()). The headings should be the original headings suffixed by "result" (values1_result and values2_result).
Also, I would like to make use of @guvectorize to speed things up.
1 Answer
A few things. If you use .to_numpy() on either an Array or a struct, it seems to return the same np.array, so the choice between the two comes down to memory efficiency and features. The elements of an Array aren't named, and you want the output names to correspond to the input columns, so you probably want a struct. I'm not sure what the memory implications are between the two. I know that going from columns to a struct is cheaper than going from columns to an Array, but intuitively columns -> struct -> np.array ought to cost about the same as columns -> Array -> np.array.
Anyway, with that said, here's how to do it:
def multiply_by_two(arr: pl.Series) -> pl.Series:
    # Capture the field names of the input struct:
    names = arr.struct.fields
    arrnp = arr.to_numpy()
    res = arrnp * 2
    return pl.Series(res).arr.to_struct(fields=[f"{name}_result" for name in names])

df.with_columns(
    # Create a struct consisting of two columns:
    pl.struct(["values1", "values2"])
    .map_batches(lambda arr: multiply_by_two(arr))
    .alias("result")
).unnest("result")
shape: (3, 4)
┌─────────┬─────────┬────────────────┬────────────────┐
│ values1 ┆ values2 ┆ values1_result ┆ values2_result │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪════════════════╪════════════════╡
│ 1 ┆ 10 ┆ 2 ┆ 20 │
│ 2 ┆ 20 ┆ 4 ┆ 40 │
│ 3 ┆ 30 ┆ 6 ┆ 60 │
└─────────┴─────────┴────────────────┴────────────────┘
You can't unnest from within the .with_columns; you have to do it at the DataFrame level.
As for combining the above with numba, it should work in much the same way. Search for polars and numba to find other questions and answers where the two are used together. If you can form a more specific question about their interaction, then ask away.
Comments:
Within map_batches you have a Series; if you want a numpy array you can just call s.to_numpy() - but it may make more sense to use pl.concat_arr instead of a struct for your use case? Perhaps you can give a code example that is closer to the actual task. – jqurious
arr is a pl.Series, you can just use arr.to_numpy() if you want a numpy array - right? – jqurious
If you mean calling arr.to_numpy() within multiply_by_two, then you're correct. Having said that, I think this is not going to work in conjunction with guvectorize. I am also struggling when performing map_batches over a group_by. – Andi