I have a dataset like this one:
+--------+---------+------+
| game_id|player_id|rating|
+--------+---------+------+
| 23681 | 1132 | 1|
| 16357 | 4700 | 1|
| 33324 | 3245 | 1|
| 33324 | 2324 | 1|
| 12575 | 3218 | 1|
| 12575 | 1140 | 1|
| 19252 | 7789 | 1|
| 19252 | 6255 | 1|
| 19252 | 4479 | 1|
| 19252 | 2357 | 1|
+--------+---------+------+
Here is some information about the dataset:
users: 192
games: 425
givenRatings: 1039
totalPossibleRatings: 81600
density: 0.012732843137254903
sparsity: 0.9872671568627451
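For reference, the density and sparsity figures follow directly from the counts above. A minimal plain-Java sketch of that arithmetic (the class name DatasetStats and its methods are only illustrative, not part of my Spark job):

```java
public class DatasetStats {
    // Fraction of all (user, game) pairs that actually carry a rating.
    static double density(long givenRatings, long users, long games) {
        return (double) givenRatings / (users * games);
    }

    // Fraction of (user, game) pairs with no rating at all.
    static double sparsity(long givenRatings, long users, long games) {
        return 1.0 - density(givenRatings, users, games);
    }

    public static void main(String[] args) {
        long users = 192, games = 425, givenRatings = 1039;
        System.out.println("totalPossibleRatings: " + users * games);
        System.out.println("density: " + density(givenRatings, users, games));
        System.out.println("sparsity: " + sparsity(givenRatings, users, games));
    }
}
```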
The dataset shows which games a user has played. It contains only 1s for the played games, so the games a user has not played are blank rather than 0. I have been reading the Spark documentation, and it says that for implicit data I should use the .setImplicitPrefs(true) setting:
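import org.apache.spark.ml.recommendation.ALS;

// Implicit-feedback ALS: each rating is treated as an observation, not an explicit score.
ALS als = new ALS()
    .setMaxIter(8)
    .setRegParam(0.1)
    .setRank(10)
    .setUserCol("player_id")
    .setItemCol("game_id")
    .setRatingCol("rating")
    .setImplicitPrefs(true);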
However, the RMSE results are drastically different. With the implicit setting the error is always above 0.9, typically ~0.9371518015113052, while when I remove the setting and leave the model in the default (explicit) mode it is between 0.3 and 0.4, almost always ~0.31484808203384196.
If I understand correctly, when the setting is on, a confidence value is calculated, which basically multiplies the 1s by an alpha (setAlpha in the configuration), while the 0s become 1s, since the formula is Rui * alpha + 1, where Rui is the rating of a pair [game_id, player_id], which is always 1 here. I am not sure, but the loss function seems to be different as well, taking alpha into account.
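To make that concrete, here is a minimal plain-Java sketch of the mapping as I read it from the implicit-feedback formulation that the Spark ALS documentation points to. The class and method names are hypothetical and this is not Spark's internal code; it only assumes a binary preference p = 1 if Rui > 0, else 0, and a confidence weight c = 1 + alpha * Rui.

```java
public class ImplicitConfidence {
    // Binary preference: 1 if any interaction was observed, otherwise 0.
    static double preference(double rating) {
        return rating > 0 ? 1.0 : 0.0;
    }

    // Confidence weight c = 1 + alpha * rating.
    // Unobserved pairs (rating 0) keep the minimum confidence of 1.
    static double confidence(double rating, double alpha) {
        return 1.0 + alpha * rating;
    }

    public static void main(String[] args) {
        double alpha = 41.0; // the alpha value used as an example in this question
        double[] ratings = {1.0, 0.0}; // a played game and a missing pair
        for (double r : ratings) {
            System.out.println("rating=" + r
                + " preference=" + preference(r)
                + " confidence=" + confidence(r, alpha));
        }
    }
}
```

Under these assumptions a played game is not fitted as the value 42: it is fitted as preference 1 with weight 42, while a missing pair is fitted as preference 0 with weight 1.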
However, I don't understand why there is such a big difference between the errors of the two settings. I also fail to understand why this setting should be used, since in the end, whether the ratings are 1s or 1 * alpha + 1, matrix factorization creates two smaller matrices and tries to find the latent factors that best reproduce the ratings, so there should be no difference between 1 and 42 (if alpha is 41 and implicit mode is on). Also, should 0s be used where data is missing? If I use them, in the default explicit setting they might really break the results, and in the implicit setting I think they would make most users extremely similar. When should this setting be used?