I noticed that sklearn.metrics.r2_score is giving a wrong R2 value:
from sklearn.metrics import r2_score
r2_score(y_true=[2,4,3,34,23], y_pred=[21,12,3,11,17]) # -0.17
r2_score(y_pred=[21,12,3,11,17], y_true=[2,4,3,34,23]) # -4.36
However, the true R2 value should be 0.002 according to the RSQ function in Excel. R2 should be between 0 and 1, and swapping "y_true" and "y_pred" should not affect the result. How can I fix this issue?
Also, in simple linear regression (one predictor), the coefficient of determination is numerically equal to the square of the Pearson correlation coefficient. Why is sklearn.metrics.r2_score different from the squared Pearson correlation coefficient in this case?
1 Answer
Excel's RSQ function gives you a different r-squared. What it gives you is the squared Pearson correlation coefficient (r), which is also a commonly used metric.
The actual R2, as in the coefficient of determination, is calculated as

1 - (SS_res / SS_tot)

where SS_res is the sum of squared residuals and SS_tot is the total sum of squares. Link to Wiki article: https://en.wikipedia.org/wiki/Coefficient_of_determination
You can recreate the calculation and confirm that sklearn is correct and Excel is "wrong" (technically, Excel is also correct; it just reports a different metric altogether):

import numpy as np

y_true = np.array([2, 4, 3, 34, 23])
y_pred = np.array([21, 12, 3, 11, 17])

ss_res = np.sum((y_true - y_pred) ** 2)           # sum of squared residuals
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                          # Out: -0.174655908875178
For the second question (why the results differ when we switch y_true and y_pred around): depending on the sklearn version, the arguments y_true and y_pred of sklearn.metrics.r2_score may be interpreted positionally, i.e. the one that goes first becomes y_true and the second becomes y_pred.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
If you use the snippet above to run the calculation with the values flipped around, you will see that sklearn is correct again. In this case SS_res does not change, but SS_tot does, because only y_true and its mean enter that part of the calculation.
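As a quick sketch of that, swapping the two arrays in the same manual calculation reproduces the second value from the question:

```python
import numpy as np

# Same data as above, but with the roles of the two arrays swapped.
y_true = np.array([21, 12, 3, 11, 17])
y_pred = np.array([2, 4, 3, 34, 23])

ss_res = np.sum((y_true - y_pred) ** 2)           # unchanged by the swap: 990
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # changes: now uses the other array's mean
r2 = 1 - ss_res / ss_tot
print(r2)  # roughly -4.357, matching r2_score with the arguments flipped
```

The asymmetry is entirely in SS_tot: R2 measures predictions against the spread of whichever array is treated as the ground truth, so it is not a symmetric function of its two inputs.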
UPDATE:
In order to get the squared correlation coefficient that Excel gives you (as per the discussion in the comments), you can use scipy instead:
from scipy.stats import pearsonr
r2 = pearsonr([2,4,3,34,23], [21,12,3,11,17])[0] ** 2 # Out: 0.002366878494073563
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
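To tie this back to the identity mentioned in the question: the squared Pearson r equals the R2 of the best least-squares line fitted through the data, whereas r2_score compares y_true to y_pred directly, with no fitting step. A sketch of that equivalence (the use of numpy.polyfit here is my own illustration, not part of the original answer):

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([21, 12, 3, 11, 17])  # the "y_pred" values from the question
y = np.array([2, 4, 3, 34, 23])    # the "y_true" values

# Fit a simple least-squares line y ~ a*x + b.
a, b = np.polyfit(x, y, 1)
fitted = a * x + b

# R2 of the fitted line: 1 - SS_res / SS_tot.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2_fit = 1 - ss_res / ss_tot

# ...which equals the squared Pearson correlation (what Excel's RSQ reports).
r2_pearson = pearsonr(x, y)[0] ** 2
```

So the identity from the question holds, but only for predictions that come out of the fitted regression itself; the raw y_pred values here are not on that line, which is why r2_score gives a (much lower) different number.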