filter - Execution of complex filtering procedures in PySpark

I'm currently trying to implement some filtering procedures in PySpark (for educational purposes).

I'm new to PySpark, so I decided to ask for help.

My dataframe looks like this:

ID     ApplicationDate  Loansum Company Decision
ID1    2020-06-01       100     B       Negative
ID1    2020-06-04       50      M       Positive
ID1    2020-06-05       50      M       Positive

ID1    2020-06-10       10      M       Positive

ID1    2020-06-15       60      B       Negative
ID1    2020-07-15       40      B       Positive
ID1    2020-06-22       20      M       Positive

ID1    2020-07-01       100     B       Negative
ID1    2020-07-02       40      B       Positive
ID1    2020-07-03       70      M       Positive

ID1    2020-08-01       100     B       Negative
ID1    2020-08-01       40      B       Positive
ID1    2020-08-02       100     M       Positive

ID2    2020-10-01       100     B       Negative
ID2    2020-10-04       50      M       Positive
ID2    2020-10-05       50      M       Positive

ID2    2020-10-10       10      M       Positive

ID2    2020-10-15       60      B       Negative
ID2    2020-10-15       40      B       Positive
ID2    2020-10-22       20      M       Positive

ID2    2020-10-01       100     B       Negative
ID2    2020-10-02       40      B       Positive
ID2    2020-10-03       70      M       Positive

My goal is to filter the dataframe so that, for each ID, I find and extract all the cases where:

The ApplicationDate gap between the first Loansum issued by Company "B" and the next nearest Loansums issued by Company "M" does not exceed 5 days;
The Loansums of all "Positive" issued loans are not more than 20% above the Loansum of the loan with a "Negative" Decision.
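
To make the two conditions concrete, here is a minimal plain-Python sketch of how I read them for a single group of applications. It is not PySpark code and rests on several assumptions: the rows are already split into groups (as the blank lines in the sample suggest, since the dataframe has no group column), "should not be 20% more" means the total of the "Positive" Loansums is at most 120% of the "Negative" Loansum, and every "M" application on or after the first "B" application must fall within 5 days. The name group_passes and its parameters are hypothetical.

```python
from datetime import date

# One application row: (application_date, loansum, company, decision).
# Assumption: rows are already split into groups; the sample has no group column.


def group_passes(rows, max_days=5, max_ratio=1.2):
    """Return True if one group of applications meets both conditions."""
    b_dates = [d for d, _, company, _ in rows if company == "B"]
    m_dates = [d for d, _, company, _ in rows if company == "M"]
    negative = [s for _, s, _, decision in rows if decision == "Negative"]
    if not b_dates or not m_dates or not negative:
        return False  # nothing to compare against

    first_b = min(b_dates)
    # Condition 1: every "M" application on or after the first "B" application
    # lies within max_days of it.
    gaps = [(d - first_b).days for d in m_dates if d >= first_b]
    if not gaps or max(gaps) > max_days:
        return False

    # Condition 2: the total "Positive" Loansum is at most max_ratio times
    # the "Negative" Loansum (assumed reading of "not 20% more").
    positive_total = sum(s for _, s, _, decision in rows if decision == "Positive")
    return positive_total <= max_ratio * negative[0]


# Three groups taken from the ID1 sample rows above.
within_limits = [
    (date(2020, 6, 1), 100, "B", "Negative"),
    (date(2020, 6, 4), 50, "M", "Positive"),
    (date(2020, 6, 5), 50, "M", "Positive"),
]
gap_too_long = [
    (date(2020, 6, 15), 60, "B", "Negative"),
    (date(2020, 7, 15), 40, "B", "Positive"),
    (date(2020, 6, 22), 20, "M", "Positive"),
]
sum_too_large = [
    (date(2020, 8, 1), 100, "B", "Negative"),
    (date(2020, 8, 1), 40, "B", "Positive"),
    (date(2020, 8, 2), 100, "M", "Positive"),
]

for name, rows in [("within_limits", within_limits),
                   ("gap_too_long", gap_too_long),
                   ("sum_too_large", sum_too_large)]:
    print(name, group_passes(rows))
```

Under these assumptions the first group is kept, the second fails the 5-day check (7 days between 2020-06-15 and 2020-06-22), and the third fails the 20% check (140 against a limit of 120). In PySpark the same per-group check could probably be run through groupBy(...).applyInPandas(...), but only once the dataframe has a column that identifies each group.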

My expected result:

ID     ApplicationDate  Loansum Company Decision
ID1    2020-06-01       100     B       Negative
ID1    2020-06-04       50      M       Positive
ID1    2020-06-05       50      M       Positive

ID1    2020-07-01       100     B       Negative
ID1    2020-07-02       40      B       Positive
ID1    2020-07-03       70      M       Positive

ID2    2020-10-01       100     B       Negative
ID2    2020-10-04       50      M       Positive
ID2    2020-10-05       50      M       Positive

Any help is highly appreciated!

filter - Execution of complex filtering procedures in PySpark - Stack Overflow