When doing some types of data processing, I want to receive an informative error message in polars. For example, if I have the following transformation:
import polars as pl

lf = pl.LazyFrame(
    {
        "first_and_middle_name": ["mister banana", "yoda the jedi", "not gonna"],
        "middle_and_last_name": ["banana muffin", "jedi master", "work at all"],
    }
)
split_first_name = pl.col("first_and_middle_name").str.split(" ").list
split_last_name = pl.col("middle_and_last_name").str.split(" ").list
lf.with_columns(
    pl.when(split_first_name.last() == split_last_name.first())
    .then(
        pl.col("first_and_middle_name")
        + " "
        + split_last_name.slice(1, split_last_name.len()).list.join(" ")
    )
    .otherwise(pl.lit(None))
    .alias("full_name")
).collect()
I want to receive an informative error that the last row was problematic, instead of a "null".
I couldn't find in the documentation of polars what's a good way to do that. I found hacks like defining a UDF to run there and throw an exception, but this feels like a strange detour.
asked Feb 16 at 19:27 by nadavge
3 Answers
In general you should perform the validation separately, before doing the actual operation.
You can use the polars.testing functions for that:
from polars.testing import assert_frame_equal
...
assert_frame_equal(
    lf.select(split_first_name.last().alias('middle_name')),
    lf.select(split_last_name.first().alias('middle_name')),
)
There are also some libraries focused exclusively on data validation such as pandera and patito
There are various discussions on GitHub about it:
- https://github.com/pola-rs/polars/issues/16120
- https://github.com/pola-rs/polars/issues/11064
The problem seems to be that it is quite difficult to add to Polars without affecting the optimization phases.
There is a pending PR pull/20100 to add assert_err and assert_warn frame methods.
We aren't certain yet. They essentially would block many optimizations.
Not sure if it is useful, but in order to generate an exception from a bool operation I have used division.
lf.with_columns((1 / (split_first_name.last() == split_last_name.first()))).collect()
shape: (3, 3)
┌───────────────────────┬──────────────────────┬─────────┐
│ first_and_middle_name ┆ middle_and_last_name ┆ literal │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════════════════════╪══════════════════════╪═════════╡
│ mister banana ┆ banana muffin ┆ 1.0 │
│ yoda the jedi ┆ jedi master ┆ 1.0 │
│ not gonna ┆ work at all ┆ inf │
└───────────────────────┴──────────────────────┴─────────┘
inf cannot be held in an integer column, so .cast() will raise.
lf.with_columns(
    (1 / (split_first_name.last() == split_last_name.first())).cast(pl.Int64)
).collect()
# InvalidOperationError:
# conversion from `f64` to `i64` failed in column 'literal'
# for 1 out of 1 values: [inf]
I think there are potentially two concerns here.
1. Short Circuiting for fail-fast:
If you've got 2B rows to go through and the first one fails, you'd like it to raise immediately rather than process all 2B rows. You could try using .all(), but polars operates columnwise, so it's going to compute pl.col("first_and_middle_name").str.split(" ").list.last() for the whole column before it checks the bool condition. If you're streaming, it would do this in chunks and hopefully be able to return early, but I'm not sure if that's the case; it's worth experimenting with. For completeness, before the main step you'd do:
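# Derive from Exception (not BaseException) so ordinary `except Exception` handlers can catch it.
class FullNameBad(Exception):
    """Bad Full Name exists"""

# True only if every row passes the middle-name check.
valid_success = (
    lf.select((split_first_name.last() == split_last_name.first()).all())
    .collect()
    .item()
)
if not valid_success:
    raise FullNameBad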
2. "Idiomatic" syntax
I put idiomatic in quotes as it can be subjective. Personally, I don't like using extra libraries to make sure my data is the way I think it should be. With that in mind, I like to use pipe with each step a named function:
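# Derive from Exception (not BaseException) so ordinary `except Exception` handlers can catch it.
class FullNameBad(Exception):
    """Bad Full Name exists"""

def validate_full_name(df: pl.DataFrame) -> pl.DataFrame:
    # Rows where the when/then expression fell through to null are the failures.
    filtered = df.filter(pl.col("full_name").is_null())
    if filtered.height == 0:
        return df
    if filtered.height == 1:
        msg = "there is 1 failing row"
    else:
        msg = f"there are {filtered.height} failing rows"
    # Put the offending rows on their own lines below the summary.
    msg += "\n" + str(filtered)
    raise FullNameBad(msg)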
Then you can just tack that on to the end with a pipe:
lf.with_columns(
    pl.when(split_first_name.last() == split_last_name.first())
    .then(
        pl.col("first_and_middle_name")
        + " "
        + split_last_name.slice(1, split_last_name.len()).list.join(" ")
    )
    .otherwise(pl.lit(None))
    .alias("full_name")
).collect().pipe(validate_full_name)
FullNameBad: there is 1 failing row
shape: (1, 3)
┌───────────────────────┬──────────────────────┬───────────┐
│ first_and_middle_name ┆ middle_and_last_name ┆ full_name │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════════════════╪══════════════════════╪═══════════╡
│ not gonna ┆ work at all ┆ null │
└───────────────────────┴──────────────────────┴───────────┘
To take the pipe approach to its full extent it would look like:
def make_full_name(lf: pl.LazyFrame) -> pl.LazyFrame:
    split_first_name = pl.col("first_and_middle_name").str.split(" ").list
    split_last_name = pl.col("middle_and_last_name").str.split(" ").list
    return lf.with_columns(
        pl.when(split_first_name.last() == split_last_name.first())
        .then(
            pl.col("first_and_middle_name")
            + " "
            + split_last_name.slice(1, split_last_name.len()).list.join(" ")
        )
        .otherwise(pl.lit(None))
        .alias("full_name")
    )

def main(lf):
    df = (
        lf
        .pipe(make_full_name)
        .collect()
        .pipe(validate_full_name)
    )
    # Return the validated frame so the caller can use it.
    return df
Combining the two
You could make a prevalidate_names function like:
def prevalidate_names(lf: pl.LazyFrame) -> pl.LazyFrame:
    split_first_name = pl.col("first_and_middle_name").str.split(" ").list
    split_last_name = pl.col("middle_and_last_name").str.split(" ").list
    # Only the check is collected here; the returned frame stays lazy.
    valid_success = (
        lf.select((split_first_name.last() == split_last_name.first()).all())
        .collect()
        .item()
    )
    if valid_success:
        return lf
    else:
        raise FullNameBad

def main(lf):
    df = (
        lf
        .pipe(prevalidate_names)
        .pipe(make_full_name)
        .collect()
    )
    return df