
python - `sklearn.metrics.r2_score` is giving wrong R2 value? - Stack Overflow


I notice that sklearn.metrics.r2_score is giving wrong R2 value.

from sklearn.metrics import r2_score

r2_score(y_true=[2,4,3,34,23], y_pred=[21,12,3,11,17])   # -0.17
r2_score(y_pred=[21,12,3,11,17], y_true=[2,4,3,34,23])   # -4.36

However, the true R2 value should be 0.002 according to the RSQ function in Excel. R2 should be between 0 and 1, and swapping "y_true" and "y_pred" should not affect the result. How can I fix this issue?

Also,

In simple linear regression (one predictor), the coefficient of determination is numerically equal to the square of the Pearson correlation coefficient.

I wonder why sklearn.metrics.r2_score is different to the squared Pearson correlation coefficient in this case?



1 Answer


Excel's RSQ function gives you a different r-squared. What it returns is the square of Pearson's correlation coefficient (r), which is also a commonly used metric.

The actual R2 as in coefficient of determination is calculated as

1 - (SS_res / SS_tot)

where SS_res is the sum of squared residuals and SS_tot is the total sum of squares.

Link to Wiki article

You can recreate the calculation and confirm that sklearn is correct; Excel is not wrong either, it is simply computing a different metric altogether:

import numpy as np

y_true = np.array([2, 4, 3, 34, 23])
y_pred = np.array([21, 12, 3, 11, 17])

ss_res = np.sum((y_true - y_pred) ** 2)           # sum of squared residuals
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares

r2 = 1 - (ss_res / ss_tot)  # Out: -0.174655908875178

For the second question (why the results differ when you switch True and Pred around): depending on your sklearn version, the y_true and y_pred arguments of sklearn.metrics.r2_score are treated positionally, i.e. the array passed first is taken as y_true and the second as y_pred.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

If you use the snippet above to run the calculation with the values flipped around, you will see that sklearn is correct again. In this case SS_res does not change, but SS_tot does, because only y_true and its mean enter that term.
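To make this concrete, here is the flipped calculation as a sketch (numpy only), reproducing the -4.36 from the question: SS_res stays at 990, but SS_tot is now computed from the mean of the other array.

```python
import numpy as np

# Flip the arrays: the predictions are now treated as y_true
y_true = np.array([21, 12, 3, 11, 17])
y_pred = np.array([2, 4, 3, 34, 23])

ss_res = np.sum((y_true - y_pred) ** 2)           # 990, same as before the flip
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # 184.8, based on the new y_true's mean

r2 = 1 - ss_res / ss_tot  # Out: -4.357142857142857
```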

UPDATE: In order to get the squared correlation coefficient that Excel returns (as per the discussion in the comments), you can use scipy instead:

from scipy.stats import pearsonr

r2 = pearsonr([2,4,3,34,23], [21,12,3,11,17])[0] ** 2  # Out: 0.002366878494073563

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
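As for why the quoted statement about simple linear regression does not apply here: the coefficient of determination equals the squared Pearson correlation only when the predictions come from a least-squares fit, and the y_pred in the question is not fitted to y_true. A sketch (numpy only, using np.polyfit as a stand-in for a simple linear regression) shows that scoring the fitted values does recover the squared Pearson correlation:

```python
import numpy as np

x = np.array([21, 12, 3, 11, 17])  # the "predictions", used as a single feature
y = np.array([2, 4, 3, 34, 23])    # the true values

# Fit a simple linear regression y ~ x and compute the fitted values
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept

# Coefficient of determination of the fitted values
r2_fit = 1 - np.sum((y - fitted) ** 2) / np.sum((y - np.mean(y)) ** 2)

# Squared Pearson correlation between x and y
r2_pearson = np.corrcoef(x, y)[0, 1] ** 2

# The two agree (~0.00237), unlike r2_score on the raw predictions
```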
