I noticed that sklearn.metrics.r2_score is giving a wrong R2 value:
from sklearn.metrics import r2_score
r2_score(y_true=[2,4,3,34,23], y_pred=[21,12,3,11,17]) # -0.17
r2_score(y_pred=[21,12,3,11,17], y_true=[2,4,3,34,23]) # -4.36
However, the true R2 value should be 0.002 according to the RSQ function in Excel. R2 should be between 0 and 1, and swapping "y_true" and "y_pred" should not affect the result. How can I fix this issue?
Also, in simple linear regression (one predictor), the coefficient of determination is numerically equal to the square of the Pearson correlation coefficient. Why is sklearn.metrics.r2_score different from the squared Pearson correlation coefficient in this case?
1 Answer
Excel's RSQ function gives you a different r-squared. What it gives you is the squared Pearson correlation coefficient (r), which is also a commonly used metric.
The actual R2, as in the coefficient of determination, is calculated as

1 - (SS_res / SS_tot)

where SS_res is the sum of squared residuals and SS_tot is the total sum of squares. Link to Wiki article: https://en.wikipedia.org/wiki/Coefficient_of_determination
You can recreate the calculation and confirm that sklearn is correct and Excel is "wrong" (technically, Excel is also correct; it just reports a different metric altogether):

import numpy as np

y_true = np.array([2, 4, 3, 34, 23])
y_pred = np.array([21, 12, 3, 11, 17])

ss_res = np.sum((y_true - y_pred) ** 2)           # sum of squared residuals
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                          # Out: -0.174655908875178
For the second question (why the results differ when we switch y_true and y_pred around): depending on the sklearn version, the arguments y_true and y_pred of sklearn.metrics.r2_score may be interpreted positionally, i.e. the one that goes first becomes y_true and the second becomes y_pred.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
If you use the snippet above to run the calculation with the values flipped around, you will see that sklearn is correct again. In this case SS_res does not change, but SS_tot does, because only y_true and its mean enter that part of the calculation.
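As a quick sketch of that, swapping the two arrays in the same manual calculation reproduces the second value from the question:

```python
import numpy as np

# Same data as above, but with the roles of the two arrays swapped.
y_true = np.array([21, 12, 3, 11, 17])
y_pred = np.array([2, 4, 3, 34, 23])

ss_res = np.sum((y_true - y_pred) ** 2)           # unchanged by the swap: 990
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # changes: now uses the other array's mean
r2 = 1 - ss_res / ss_tot
print(r2)  # roughly -4.357, matching r2_score with the arguments flipped
```

The asymmetry is entirely in SS_tot: R2 measures predictions against the spread of whichever array is treated as the ground truth, so it is not a symmetric function of its two inputs.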
UPDATE:
In order to get the squared correlation coefficient that Excel gives you (as per the discussion in the comments), you can use scipy instead:
from scipy.stats import pearsonr
r2 = pearsonr([2,4,3,34,23], [21,12,3,11,17])[0] ** 2 # Out: 0.002366878494073563
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
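To tie this back to the identity mentioned in the question: the squared Pearson r equals the R2 of the best least-squares line fitted through the data, whereas r2_score compares y_true to y_pred directly, with no fitting step. A sketch of that equivalence (the use of numpy.polyfit here is my own illustration, not part of the original answer):

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([21, 12, 3, 11, 17])  # the "y_pred" values from the question
y = np.array([2, 4, 3, 34, 23])    # the "y_true" values

# Fit a simple least-squares line y ~ a*x + b.
a, b = np.polyfit(x, y, 1)
fitted = a * x + b

# R2 of the fitted line: 1 - SS_res / SS_tot.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2_fit = 1 - ss_res / ss_tot

# ...which equals the squared Pearson correlation (what Excel's RSQ reports).
r2_pearson = pearsonr(x, y)[0] ** 2
```

So the identity from the question holds, but only for predictions that come out of the fitted regression itself; the raw y_pred values here are not on that line, which is why r2_score gives a (much lower) different number.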