Is there an idiomatic way of raising an error in polars - Stack Overflow

When doing some types of data processing, I want to receive an informative error message in polars. For example, given the following transformation:

import polars as pl

lf = pl.LazyFrame(
    {
        "first_and_middle_name": ["mister banana", "yoda the jedi", "not gonna"],
        "middle_and_last_name": ["banana muffin", "jedi master", "work at all"],
    }
)

split_first_name = pl.col("first_and_middle_name").str.split(" ").list
split_last_name = pl.col("middle_and_last_name").str.split(" ").list

lf.with_columns(
    pl.when(split_first_name.last() == split_last_name.first())
    .then(
        pl.col("first_and_middle_name")
        + " "
        + split_last_name.slice(1, split_last_name.len()).list.join(" ")
    )
    .otherwise(pl.lit(None))
    .alias("full_name")
).collect()

I want to receive an informative error that the last row was problematic, instead of a "null".

I couldn't find in the documentation of polars what's a good way to do that. I found hacks like defining a UDF to run there and throw an exception, but this feels like a strange detour.

asked Feb 16 at 19:27 by nadavge

3 Answers

In general you should perform the validation separately, before doing the actual operation.

You can use polars.testing functions for that:

from polars.testing import assert_frame_equal

...

assert_frame_equal(
    lf.select(split_first_name.last().alias('middle_name')),
    lf.select(split_last_name.first().alias('middle_name')),
)

There are also some libraries focused exclusively on data validation, such as pandera and patito.

There are various discussions on GitHub about it:

  • https://github.com/pola-rs/polars/issues/16120
  • https://github.com/pola-rs/polars/issues/11064

The problem seems to be that it is quite difficult to add to Polars without affecting the optimization phases.

There is a pending PR pull/20100 to add assert_err and assert_warn frame methods.

We aren't certain yet. They essentially would block many optimizations.

Not sure if it is useful, but in order to generate an exception from a bool operation I have used division.

lf.with_columns((1 / (split_first_name.last() == split_last_name.first()))).collect()
shape: (3, 3)
┌───────────────────────┬──────────────────────┬─────────┐
│ first_and_middle_name ┆ middle_and_last_name ┆ literal │
│ ---                   ┆ ---                  ┆ ---     │
│ str                   ┆ str                  ┆ f64     │
╞═══════════════════════╪══════════════════════╪═════════╡
│ mister banana         ┆ banana muffin        ┆ 1.0     │
│ yoda the jedi         ┆ jedi master          ┆ 1.0     │
│ not gonna             ┆ work at all          ┆ inf     │
└───────────────────────┴──────────────────────┴─────────┘

inf cannot be held in an integer column, so .cast() will raise.

lf.with_columns(
    (1 / (split_first_name.last() == split_last_name.first())).cast(pl.Int64)
).collect()
# InvalidOperationError: 
#   conversion from `f64` to `i64` failed in column 'literal' 
#   for 1 out of 1 values: [inf]

I think there are potentially two concerns here.

1. Short Circuiting for fail-fast:

If you have 2B rows to go through and the first one fails, you'd like it to raise immediately rather than process all 2B rows. You could try using .all(), but polars operates column-wise, so it will compute pl.col("first_and_middle_name").str.split(" ").list.last() for the whole column before it checks the boolean condition. If you're streaming, it would do this in chunks and hopefully be able to return early, but I'm not sure that's the case; it's worth experimenting with. For completeness, before the main step you'd do:

# Derive from Exception (not BaseException) so ordinary `except Exception` handlers catch it.
class FullNameBad(Exception):
    """Bad Full Name exists"""

valid_success = (
    lf.select((split_first_name.last() == split_last_name.first()).all())
    .collect()
    .item()
)

if not valid_success:
    raise FullNameBad

2. "Idiomatic" syntax

I put idiomatic in quotes as it can be subjective. Personally, I don't like using extra libraries to make sure my data is the way I think it should be. With that in mind I like to use pipe with each step a named function:

class FullNameBad(Exception):
    """Bad Full Name exists"""


def validate_full_name(df: pl.DataFrame) -> pl.DataFrame:
    # Rows whose name parts did not line up were given a null full_name.
    filtered = df.filter(pl.col("full_name").is_null())
    if filtered.height == 0:
        return df
    if filtered.height == 1:
        msg = "there is 1 failing row"
    else:
        msg = f"there are {filtered.height} failing rows"
    # Put the offending rows on their own line so the error shows them.
    msg += "\n" + str(filtered)
    raise FullNameBad(msg)

Then you can just tack that on to the end with a pipe:

lf.with_columns(
    pl.when(split_first_name.last() == split_last_name.first())
    .then(
        pl.col("first_and_middle_name")
        + " "
        + split_last_name.slice(1, split_last_name.len()).list.join(" ")
    )
    .otherwise(pl.lit(None))
    .alias("full_name")
).collect().pipe(validate_full_name)

FullNameBad: there is 1 failing row
shape: (1, 3)
┌───────────────────────┬──────────────────────┬───────────┐
│ first_and_middle_name ┆ middle_and_last_name ┆ full_name │
│ ---                   ┆ ---                  ┆ ---       │
│ str                   ┆ str                  ┆ str       │
╞═══════════════════════╪══════════════════════╪═══════════╡
│ not gonna             ┆ work at all          ┆ null      │
└───────────────────────┴──────────────────────┴───────────┘

To take the pipe approach to its full extent it would look like:

def make_full_name(lf: pl.LazyFrame) -> pl.LazyFrame:
    split_first_name = pl.col("first_and_middle_name").str.split(" ").list
    split_last_name = pl.col("middle_and_last_name").str.split(" ").list
    return lf.with_columns(
        pl.when(split_first_name.last() == split_last_name.first())
        .then(
            pl.col("first_and_middle_name")
            + " "
            + split_last_name.slice(1, split_last_name.len()).list.join(" ")
        )
        .otherwise(pl.lit(None))
        .alias("full_name")
    )

def main(lf):
    df = (
        lf
        .pipe(make_full_name)
        .collect()
        .pipe(validate_full_name)
    )
    # Return the validated frame so the caller can use it.
    return df

Combining the two

You could make a prevalidate_names function like:

def prevalidate_names(lf: pl.LazyFrame) -> pl.LazyFrame:
    split_first_name = pl.col("first_and_middle_name").str.split(" ").list
    split_last_name = pl.col("middle_and_last_name").str.split(" ").list
    valid_success = (
        lf.select((split_first_name.last() == split_last_name.first()).all())
        .collect()
        .item()
    )
    if valid_success:
        return lf
    else:
        raise FullNameBad

def main(lf):
    df = (
        lf
        .pipe(prevalidate_names)
        .pipe(make_full_name)
    )
    # Still a LazyFrame at this point: call .collect() when the result is needed.
    return df