I'm trying to implement some filtering logic in PySpark (for educational purposes).
I'm new to PySpark, so I decided to ask for help.
My dataframe looks like this:
ID ApplicationDate Loansum Company Decision
ID1 2020-06-01 100 B Negative
ID1 2020-06-04 50 M Positive
ID1 2020-06-05 50 M Positive
ID1 2020-06-10 10 M Positive
ID1 2020-06-15 60 B Negative
ID1 2020-07-15 40 B Positive
ID1 2020-06-22 20 M Positive
ID1 2020-07-01 100 B Negative
ID1 2020-07-02 40 B Positive
ID1 2020-07-03 70 M Positive
ID1 2020-08-01 100 B Negative
ID1 2020-08-01 40 B Positive
ID1 2020-08-02 100 M Positive
ID2 2020-10-01 100 B Negative
ID2 2020-10-04 50 M Positive
ID2 2020-10-05 50 M Positive
ID2 2020-10-10 10 M Positive
ID2 2020-10-15 60 B Negative
ID2 2020-10-15 40 B Positive
ID2 2020-10-22 20 M Positive
ID2 2020-10-01 100 B Negative
ID2 2020-10-02 40 B Positive
ID2 2020-10-03 70 M Positive
My goal is to filter the dataframe so that, for each ID, I find and extract all the cases where:
- The ApplicationDate of the first loan issued by Company "B" and the ApplicationDates of the next nearest loans issued by Company "M" are no more than 5 days apart;
- The Loansums of all "Positive" loans together are not more than 20% above the Loansum of the loan with a "Negative" Decision.
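To make the two rules concrete, here is a plain-Python sketch (not PySpark) of how I read them for a single "Negative" loan and the loans that follow it. The helper name `qualifies`, its parameters, and the reading of "5 days" as an inclusive window are my own assumptions for illustration; what I'm looking for is the PySpark equivalent.

```python
from datetime import date


def qualifies(negative, followers, max_days=5, max_extra_pct=20):
    """Return True if one "Negative" loan and the loans after it pass both rules.

    negative  -- (application_date, loansum) of the "Negative" loan
    followers -- list of (application_date, loansum, company, decision)
    """
    neg_date, neg_sum = negative
    # Keep only the later loans applied for within max_days of the negative one.
    in_window = [row for row in followers if 0 <= (row[0] - neg_date).days <= max_days]
    # Rule 1 (as I read it): at least one company "M" loan falls inside the window.
    has_m = any(company == "M" for _, _, company, _ in in_window)
    # Rule 2 (as I read it): "Positive" loans in the window total at most
    # 120% of the negative Loansum; integer arithmetic avoids float rounding.
    positive_total = sum(s for _, s, _, d in in_window if d == "Positive")
    return has_m and positive_total * 100 <= neg_sum * (100 + max_extra_pct)


# Three groups taken from the ID1 rows above.
june_1 = qualifies((date(2020, 6, 1), 100), [
    (date(2020, 6, 4), 50, "M", "Positive"),
    (date(2020, 6, 5), 50, "M", "Positive"),
    (date(2020, 6, 10), 10, "M", "Positive"),
])
june_15 = qualifies((date(2020, 6, 15), 60), [
    (date(2020, 7, 15), 40, "B", "Positive"),
    (date(2020, 6, 22), 20, "M", "Positive"),
])
august_1 = qualifies((date(2020, 8, 1), 100), [
    (date(2020, 8, 1), 40, "B", "Positive"),
    (date(2020, 8, 2), 100, "M", "Positive"),
])
print(june_1, june_15, august_1)  # prints: True False False
```

Under this reading, the 2020-06-01 group is kept, the 2020-06-15 group is dropped because the nearest "M" loan is 7 days away, and the 2020-08-01 group is dropped because the "Positive" loans total 140 against a limit of 120.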
My expected result:
ID ApplicationDate Loansum Company Decision
ID1 2020-06-01 100 B Negative
ID1 2020-06-04 50 M Positive
ID1 2020-06-05 50 M Positive
ID1 2020-07-01 100 B Negative
ID1 2020-07-02 40 B Positive
ID1 2020-07-03 70 M Positive
ID2 2020-10-01 100 B Negative
ID2 2020-10-04 50 M Positive
ID2 2020-10-05 50 M Positive
Any help is highly appreciated!