filter - Execution of complex filtering procedures in PySpark - Stack Overflow

I'm currently trying to implement some filtering procedures in PySpark (for educational purposes).

I'm new to PySpark, so I decided to ask for help.

My dataframe looks like this:

ID     ApplicationDate  Loansum Company Decision
ID1    2020-06-01       100     B       Negative
ID1    2020-06-04       50      M       Positive
ID1    2020-06-05       50      M       Positive

ID1    2020-06-10       10      M       Positive

ID1    2020-06-15       60      B       Negative
ID1    2020-07-15       40      B       Positive
ID1    2020-06-22       20      M       Positive

ID1    2020-07-01       100     B       Negative
ID1    2020-07-02       40      B       Positive
ID1    2020-07-03       70      M       Positive

ID1    2020-08-01       100     B       Negative
ID1    2020-08-01       40      B       Positive
ID1    2020-08-02       100     M       Positive

ID2    2020-10-01       100     B       Negative
ID2    2020-10-04       50      M       Positive
ID2    2020-10-05       50      M       Positive

ID2    2020-10-10       10      M       Positive

ID2    2020-10-15       60      B       Negative
ID2    2020-10-15       40      B       Positive
ID2    2020-10-22       20      M       Positive

ID2    2020-10-01       100     B       Negative
ID2    2020-10-02       40      B       Positive
ID2    2020-10-03       70      M       Positive

My goal is to filter my dataframe in such a way that, for each ID, I find and extract all the cases where:

  1. The gap between the ApplicationDate of the first Loansum issued by Company "B" and that of the next nearest Loansum issued by Company "M" does not exceed 5 days;
  2. The Loansums of all "Positive" issued loans do not exceed the Loansum of the loan with a "Negative" Decision by more than 20%.
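To pin down what the two conditions mean before writing any PySpark, here is a minimal plain-Python sketch of the check for a single case. It rests on several assumptions that are not stated in the question: the rows are already split into cases (the blank-line-separated blocks in the sample), condition 2 compares the *sum* of the "Positive" Loansums against the first "Negative" Loansum, and the names `row` and `case_qualifies` are purely illustrative.

```python
from datetime import date


def row(app_date, loansum, company, decision):
    """Build one record using the column names from the question."""
    return {
        "ApplicationDate": date.fromisoformat(app_date),
        "Loansum": loansum,
        "Company": company,
        "Decision": decision,
    }


def case_qualifies(rows, max_gap_days=5, tolerance_pct=20):
    """Return True if one case (a list of rows) meets both conditions.

    Assumed reading of the conditions:
      1. days between the first "B" row and the next "M" row <= max_gap_days
      2. sum of "Positive" Loansums <= first "Negative" Loansum + tolerance_pct %
    """
    rows = sorted(rows, key=lambda r: r["ApplicationDate"])
    b_rows = [r for r in rows if r["Company"] == "B"]
    negatives = [r for r in rows if r["Decision"] == "Negative"]
    if not b_rows or not negatives:
        return False  # nothing to compare against

    first_b = b_rows[0]
    later_m = [
        r for r in rows
        if r["Company"] == "M"
        and r["ApplicationDate"] >= first_b["ApplicationDate"]
    ]
    if not later_m:
        return False

    gap_days = (later_m[0]["ApplicationDate"] - first_b["ApplicationDate"]).days
    positive_total = sum(r["Loansum"] for r in rows if r["Decision"] == "Positive")
    negative_sum = negatives[0]["Loansum"]

    # Integer arithmetic avoids floating-point surprises at the 20% boundary.
    within_tolerance = positive_total * 100 <= negative_sum * (100 + tolerance_pct)
    return gap_days <= max_gap_days and within_tolerance


# Two cases copied from the sample data for ID1.
june_case = [
    row("2020-06-01", 100, "B", "Negative"),
    row("2020-06-04", 50, "M", "Positive"),
    row("2020-06-05", 50, "M", "Positive"),
]
august_case = [
    row("2020-08-01", 100, "B", "Negative"),
    row("2020-08-01", 40, "B", "Positive"),
    row("2020-08-02", 100, "M", "Positive"),
]

print(case_qualifies(june_case))    # True: 3-day gap, 100 <= 120
print(case_qualifies(august_case))  # False: 1-day gap, but 140 > 120
```

Under this reading, the June case is kept and the August case is dropped, which agrees with the expected result below. In PySpark the same predicate could probably be run per case with something like `groupBy(...).applyInPandas(...)`, but only once a column identifying each case exists; the dataframe as shown has no such column, so how to derive it remains an open part of the question.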

My expected result:

ID     ApplicationDate  Loansum Company Decision
ID1    2020-06-01       100     B       Negative
ID1    2020-06-04       50      M       Positive
ID1    2020-06-05       50      M       Positive

ID1    2020-07-01       100     B       Negative
ID1    2020-07-02       40      B       Positive
ID1    2020-07-03       70      M       Positive

ID2    2020-10-01       100     B       Negative
ID2    2020-10-04       50      M       Positive
ID2    2020-10-05       50      M       Positive

Any help is highly appreciated!
