I am currently working on a DuckDB UDF using PyArrow compute. It works great so far. Now I need to clip a value between 0 and 1, e.g. this minimal example:
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc
import duckdb
def funcArrow(x: float) -> float:
    return np.clip(x, 0, 1)  # this should be something like pc.clip(x, 0, 1)
con = duckdb.connect("test.db")
con.create_function("funcArrow", funcArrow, type="arrow")
print(con.sql("SELECT funcArrow(x) AS f FROM myTable;"))
I tried using pc.min, pc.max, or if/else branches, but I always run into errors. Any suggestions?
Thank you
asked Jan 30 at 14:13 by user2148566
Comment: what errors are you getting? – 0x26res, Jan 31 at 22:54
1 Answer
"I tried using pc.min, pc.max but always run into errors."
Much like in SQL, min and max aggregate vertically (over all values in a column) and don't do a horizontal, per-row comparison.
In the absence of a PyArrow Compute clip function, I was able to find min_element_wise and max_element_wise, which seem to me (at first glance) analogous to SQL's least and greatest respectively.
You can use these together to clip the minimum value to 0 and the maximum value to 1.
You may decide whether you prefer a UDF or expressing this in native SQL (which I assume will be more performant). The example below includes both options.
import duckdb
import pyarrow as pa
import pyarrow.compute as pc

def funcArrow(x: float) -> float:
    # Raise values below 0 up to 0, then lower values above 1 down to 1.
    return pc.min_element_wise(pc.max_element_wise(x, 0), 1)

con = duckdb.connect()
con.create_function("funcArrow", funcArrow, type="arrow")
con.sql("""
    SELECT
        x,
        funcArrow(x) as udf,
        least(greatest(x, 0), 1) as sql,
    FROM UNNEST([-1, 0, 0.5, 1, 2]) as tbl(x)
""")
┌───────────────┬────────┬───────────────┐
│       x       │  udf   │      sql      │
│ decimal(11,1) │ double │ decimal(11,1) │
├───────────────┼────────┼───────────────┤
│          -1.0 │    0.0 │           0.0 │
│           0.0 │    0.0 │           0.0 │
│           0.5 │    0.5 │           0.5 │
│           1.0 │    1.0 │           1.0 │
│           2.0 │    1.0 │           1.0 │
└───────────────┴────────┴───────────────┘