
recommendation engine - Spark's ALS for collaborative filtering gives better RMSE when using explicit mode on implicit data


I have a dataset like this one:

+--------+---------+------+
| game_id|player_id|rating|
+--------+---------+------+
| 23681  |  1132   |     1|
| 16357  |  4700   |     1|
| 33324  |  3245   |     1|
| 33324  |  2324   |     1|
| 12575  |  3218   |     1|
| 12575  |  1140   |     1|
| 19252  |  7789   |     1|
| 19252  |  6255   |     1|
| 19252  |  4479   |     1|
| 19252  |  2357   |     1|
+--------+---------+------+

Here is some information about the dataset:

users: 192
games: 425
givenRatings: 1039
totalPossibleRatings: 81600
density: 0.012732843137254903
sparsity: 0.9872671568627451
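(In other words, totalPossibleRatings = users × games = 192 × 425 = 81600, density = givenRatings / totalPossibleRatings = 1039 / 81600 ≈ 0.0127, and sparsity = 1 − density ≈ 0.9873.)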

The dataset shows which games a user has played. It contains only 1s for the played games, so the games a user has not played are blank, not 0. I have been reading Spark's documentation, and it says that for implicit data I should use the .setImplicitPrefs(true) setting:

import org.apache.spark.ml.recommendation.ALS;

ALS als = new ALS()
        .setMaxIter(8)
        .setRegParam(0.1)
        .setRank(10)
        .setUserCol("player_id")
        .setItemCol("game_id")
        .setRatingCol("rating")
        .setImplicitPrefs(true); // removing this line leaves the model in the default explicit mode

However, the RMSE results are drastically different. The error when using the implicit setting is always above 0.9, typically ~0.9371518015113052, while when I remove the setting and leave the model in its default (explicit) mode it is between 0.3 and 0.4, almost always ~0.31484808203384196.
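For reference, the RMSE is measured the same way in both cases, roughly like this (the 80/20 split, the seed and the "drop" cold-start strategy here are placeholders for my setup, so treat this as a sketch):

import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.recommendation.ALSModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// ratings is the Dataset<Row> shown above (game_id, player_id, rating)
Dataset<Row>[] splits = ratings.randomSplit(new double[]{0.8, 0.2}, 42L);
Dataset<Row> training = splits[0];
Dataset<Row> test = splits[1];

// drop rows whose user/game was unseen during training,
// otherwise their NaN predictions make the RMSE itself NaN
als.setColdStartStrategy("drop");
ALSModel model = als.fit(training);
Dataset<Row> predictions = model.transform(test);

RegressionEvaluator evaluator = new RegressionEvaluator()
        .setMetricName("rmse")
        .setLabelCol("rating")
        .setPredictionCol("prediction");
double rmse = evaluator.evaluate(predictions);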

If I understand correctly, when the setting is on, a confidence value is calculated, which basically multiplies the 1s by an alpha (setAlpha in the configuration), while the missing 0s get a confidence of 1, since the formula is c_ui = 1 + alpha * r_ui, where r_ui is the rating of a pair [game_id, player_id], which is always 1 here. I am not sure, but the loss function seems to be different as well, taking alpha into account.
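Based on the paper Spark cites for implicit mode (Hu, Koren & Volinsky, "Collaborative Filtering for Implicit Feedback Datasets"), I believe the objective being minimized is roughly the following; I may be misreading it, so treat this as my understanding rather than a fact:

c_{ui} = 1 + \alpha \, r_{ui}, \qquad
p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases}

\min_{x,y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
+ \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

If that is right, the factors are fit to the binary preference p_{ui} rather than to r_{ui} itself, and alpha only controls how heavily each observed entry is weighted.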

However, I don't understand why there is such a big difference between the errors of the two settings. I also fail to understand why this setting should be used, since in the end, whether the ratings are 1 or 1 * alpha + 1, matrix factorization creates two smaller matrices and tries to find the latent factors that best reproduce the ratings, so there should be no difference between 1 and 42 (if alpha is 41 and implicit mode is on). Also, should 0s be used in the places where data is missing? If I use them, in the default explicit setting they might really break the results, and in the implicit setting I think they will make most users extremely similar. When should this setting be used?
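To be concrete about the last point, by "using 0s in the places where data is missing" I mean materializing all 81600 [player_id, game_id] pairs and filling the gaps with 0, something like this (just a sketch; allPairs, observed and dense are names I made up for the example):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

// all 192 * 425 = 81600 possible [player_id, game_id] combinations
Dataset<Row> allPairs = ratings.select("player_id").distinct()
        .crossJoin(ratings.select("game_id").distinct())
        .alias("p");
Dataset<Row> observed = ratings.alias("r");

// keep the observed 1s and turn every missing combination into an explicit 0
Dataset<Row> dense = allPairs
        .join(observed,
              col("p.player_id").equalTo(col("r.player_id"))
                      .and(col("p.game_id").equalTo(col("r.game_id"))),
              "left_outer")
        .select(col("p.player_id"), col("p.game_id"),
                coalesce(col("r.rating"), lit(0)).alias("rating"));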
