I'm looking to build out a multiple regression in Python and need to numerically encode my categorical data. I have fields such as gender (Male, Female, Prefer not to Say), education level (High School, Undergraduate, Graduate, MBA, PhD), etc. My dependent variable is employee salary. My independent variables are as stated before, along with job level. My question is, does it matter how I assign the numbers to these fields? Will the numbering affect the outcome or validity of my regression?
- I think it doesn't matter. If it mattered, you would see it in all tutorials. But you may ask people on similar portals: DataScience, CrossValidated, or the Kaggle forum. – furas Commented Apr 1 at 22:21
- Are you asking about encoding schemes like one-hot or ordinal classification and whether they'd be suitable for your model or whether the order of categorization matters regardless of encoding scheme used? – WyattBradley Commented Apr 1 at 23:03
- @WyattBradley The latter! I'm looking to understand whether the order of categorization matters regardless of the encoding scheme used. My dependent variable is employee salary. My independent variables are similar to what's in my post, along with job level. Does it matter in which order I assign my independent variables their numerical values? – learning to code Commented Apr 2 at 15:43
1 Answer
You don't say whether you're interested in prediction or inference.
In both cases, as long as the categorical variables are properly encoded (for example with dummy variables; coding schemes are discussed further at the end), the particular numbers you assign do not matter: the model will yield the same predictions and inferences. However, the coefficients (including the intercept) will differ depending on the chosen reference level.
For example, with a binary variable like Gender (Male, Female), if Male is the reference level, its effect is absorbed into the intercept. The coefficient for Female then represents the difference (i.e., the contrast) in the outcome between Female and Male.
Here is a simple example in Python using statsmodels:
import pandas as pd
import statsmodels.formula.api as smf
# Toy data: six observations with a binary categorical predictor
df = pd.DataFrame({
    'Age': [25, 30, 22, 35, 40, 29],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female']
})

# First coding – Male as 0, Female as 1
df['Gender_F1'] = df['Gender'].map({'Male': 0, 'Female': 1})
model1 = smf.ols('Age ~ Gender_F1', data=df).fit()
print(model1.summary())
# Now reverse the coding – Female as 0, Male as 1
df['Gender_F2'] = df['Gender'].map({'Male': 1, 'Female': 0})
model2 = smf.ols('Age ~ Gender_F2', data=df).fit()
print(model2.summary())
which yields (with extraneous details redacted):
Model 1:
OLS Regression Results
==============================================================================
Dep. Variable: Age R-squared: 0.038
Model: OLS Adj. R-squared: -0.202
Method: Least Squares F-statistic: 0.1581
Date: Thu, 03 Apr 2025 Prob (F-statistic): 0.711
Time: 10:44:15 Log-Likelihood: -19.132
No. Observations: 6 AIC: 42.26
Df Residuals: 4 BIC: 41.85
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 29.0000 4.150 6.988 0.002 17.478 40.522
Gender_F1 2.3333 5.869 0.398 0.711 -13.961 18.628
==============================================================================
Omnibus: nan Durbin-Watson: 1.861
Prob(Omnibus): nan Jarque-Bera (JB): 0.697
Skew: 0.790 Prob(JB): 0.706
Kurtosis: 2.460 Cond. No. 2.62
==============================================================================
Model 2:
OLS Regression Results
==============================================================================
Dep. Variable: Age R-squared: 0.038
Model: OLS Adj. R-squared: -0.202
Method: Least Squares F-statistic: 0.1581
Date: Thu, 03 Apr 2025 Prob (F-statistic): 0.711
Time: 10:44:15 Log-Likelihood: -19.132
No. Observations: 6 AIC: 42.26
Df Residuals: 4 BIC: 41.85
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 31.3333 4.150 7.550 0.002 19.811 42.855
Gender_F2 -2.3333 5.869 -0.398 0.711 -18.628 13.961
==============================================================================
Omnibus: nan Durbin-Watson: 1.861
Prob(Omnibus): nan Jarque-Bera (JB): 0.697
Skew: 0.790 Prob(JB): 0.706
Kurtosis: 2.460 Cond. No. 2.62
==============================================================================
Note that:
In Model 1 (Gender_F1: Male = 0, Female = 1):
Intercept (29.0) is the mean age for Males.
Coefficient for Female (2.33) means Females are on average 2.33 years older than Males.
In Model 2 (Gender_F2: Female = 0, Male = 1):
Intercept (31.33) is the mean age for Females.
Coefficient for Male (−2.33) means Males are on average 2.33 years younger than Females.
The predicted values, residuals, R², and fit statistics are all identical in both models.
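As a quick sanity check (a sketch that assumes model1 and model2 from the code above are still in scope), you can verify this numerically:

import numpy as np

# The two codings give the same fitted values, residuals, and R-squared
print(np.allclose(model1.fittedvalues, model2.fittedvalues))  # True
print(np.allclose(model1.resid, model2.resid))                # True
print(np.isclose(model1.rsquared, model2.rsquared))           # True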
One exception is an ordinal categorical variable, such as education level, where integer codes may be appropriate if the levels have a meaningful order and roughly equal spacing. Use this cautiously: a single integer code forces the effect on the outcome to change by the same amount between adjacent levels.
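For illustration, here is a minimal sketch of ordinal integer coding for an education-level field; the column name Education and the level ordering are assumptions based on the question, and the DataFrame is assumed to already contain that column.

# Assumed level ordering, lowest to highest (taken from the question)
education_order = ['High School', 'Undergraduate', 'Graduate', 'MBA', 'PhD']

# Ordered categorical, then integer codes 0-4; using these codes directly in a
# regression assumes equal spacing between adjacent education levels
df['Education'] = pd.Categorical(df['Education'], categories=education_order, ordered=True)
df['Education_code'] = df['Education'].cat.codes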
The examples above use treatment coding (also called dummy coding in some circles), where one category is chosen as the reference, and all other levels are compared against it. This is the default in most statistical software. However, other coding schemes exist, such as Helmert, sum-to-zero (effect coding), or orthogonal polynomial contrasts. These can be useful in specific contexts, for example when testing complex hypotheses or working with ordinal predictors. The choice of contrast affects the interpretation of coefficients but, again, not the overall model fit.
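If you let statsmodels/patsy handle the encoding inside the formula, you can switch coding schemes without re-mapping the data yourself. Here is a sketch reusing the toy df above (the choice of 'Female' as the reference level is just an illustration):

# Treatment (dummy) coding with an explicit reference level
m_treat = smf.ols("Age ~ C(Gender, Treatment(reference='Female'))", data=df).fit()

# Sum-to-zero (effect) coding: coefficients are deviations from the grand mean
m_sum = smf.ols("Age ~ C(Gender, Sum)", data=df).fit()

# Different coefficients, identical overall fit
print(m_treat.params)
print(m_sum.params)
print(m_treat.rsquared, m_sum.rsquared)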