If I fit an lm
or glm
with factors as explanatory variables, I get a model with estimates and SEs for each level of the factor (minus the baseline level). How can I reliably extract that info for estimates relating to each term in the formula?
Example:
> hec = reshape2::melt(HairEyeColor)
> head(hec)
Hair Eye Sex value
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
Then fit a model with two factors:
m = glm(value ~ -1 + Hair + Eye, data=hec)
The fitted coefficients are available in a named vector (and as a full table with these rownames via summary(m)$coef
):
> coef(m)
HairBlack HairBrown HairRed HairBlond EyeBlue EyeHazel EyeGreen
22.500 44.750 17.875 24.875 -0.625 -15.875 -19.500
but now I want to get the first four out since they relate to Hair, and the next three as they relate to Eye.
I could do this using string pattern matching (which I have been doing for specific examples, but I'm doing this often enough I want to get past this) but I am hoping there's a package or a function or an easy way to do this without recourse to string matching, which can possibly break if you construct your data perversely enough and get identical matching coefficient names...
(In the following example, both foo
and foob
terms generate a coefficient named "foobaz"
.)
> d = data.frame(foo=sample(c("bar","baz","qux","fnord"),100,TRUE), foob=sample(c("ar","az","sqq","moo"),100,TRUE), y=rnorm(100))
> coef(glm(y~foo+foob, data=d))
(Intercept) foobaz foofnord fooqux foobaz foobmoo
-0.37880701 0.40920067 0.10645248 0.19536840 0.21929754 -0.05912854
foobsqq
0.37137127
This is also fragile for interaction terms, where the order of interacted factors depends on the order the terms appear in the formula. You could have X:Y
in your formula and get a term out with Y:X
if Y
has already appeared in the formula. Great.
I've looked at the fitted model object for clues as to how outputs relate to formula terms, and I think something could be done using the $terms
component, then using that to see how many levels are in each factor, and getting that many elements from the coefficients at a time, but also that requires compensating for intercept terms, and then interaction terms add another level of complexity.
So before I code this all up, has it been done?
If I fit an lm
or glm
with factors as explanatory variables, I get a model with estimates and SEs for each level of the factor (minus the baseline level). How can I reliably extract that info for estimates relating to each term in the formula?
Example:
> hec = reshape2::melt(HairEyeColor)
> head(hec)
Hair Eye Sex value
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
Then fit a model with two factors:
m = glm(value ~ -1 + Hair + Eye, data=hec)
The fitted coefficients are available in a named vector (and as a full table with these rownames via summary(m)$coef
):
> coef(m)
HairBlack HairBrown HairRed HairBlond EyeBlue EyeHazel EyeGreen
22.500 44.750 17.875 24.875 -0.625 -15.875 -19.500
but now I want to get the first four out since they relate to Hair, and the next three as they relate to Eye.
I could do this using string pattern matching (which I have been doing for specific examples, but I'm doing this often enough I want to get past this) but I am hoping there's a package or a function or an easy way to do this without recourse to string matching, which can possibly break if you construct your data perversely enough and get identical matching coefficient names...
(In the following example, both foo
and foob
terms generate a coefficient named "foobaz"
.)
> d = data.frame(foo=sample(c("bar","baz","qux","fnord"),100,TRUE), foob=sample(c("ar","az","sqq","moo"),100,TRUE), y=rnorm(100))
> coef(glm(y~foo+foob, data=d))
(Intercept) foobaz foofnord fooqux foobaz foobmoo
-0.37880701 0.40920067 0.10645248 0.19536840 0.21929754 -0.05912854
foobsqq
0.37137127
This is also fragile for interaction terms, where the order of interacted factors depends on the order the terms appear in the formula. You could have X:Y
in your formula and get a term out with Y:X
if Y
has already appeared in the formula. Great.
I've looked at the fitted model object for clues as to how outputs relate to formula terms, and I think something could be done using the $terms
component, then using that to see how many levels are in each factor, and getting that many elements from the coefficients at a time, but also that requires compensating for intercept terms, and then interaction terms add another level of complexity.
So before I code this all up, has it been done?
Share Improve this question edited Apr 1 at 13:19 Darren Tsai 36.3k5 gold badges25 silver badges57 bronze badges asked Apr 1 at 8:37 SpacedmanSpacedman 94.5k12 gold badges147 silver badges230 bronze badges 02 Answers
Reset to default 10Example:
d <- data.frame(foo = sample(c("bar","baz","qux","fnord"), 100, TRUE),
foob = sample(c("ar","az","sqq","moo"), 100, TRUE),
y = rnorm(100))
mod <- glm(y ~ foo + foob, data = d)
model.matrix()
is a function that generates a design matrix for a model. Apart from this, it also contains an attribute "assign"
, an integer vector with an entry for each column in the matrix giving the term in the formula which gave rise to the column.
attr(model.matrix(mod), "assign")
# [1] 0 1 1 1 2 2 2
Value 0 corresponds to the intercept (if any), and positive values to terms in the order given by the "term.labels"
attribute of the terms
structure corresponding to the model.
attr(terms(mod), "term.labels") # or labels(terms(mod))
# [1] "foo" "foob"
By combining the above, we can write a function that extracts the coefficients of specific term(s) from a model.
getCoef <- function(term, model) {
ind <- attr(model.matrix(model), "assign")
lab <- labels(terms(model))
coef(model)[ind %in% match(term, lab)]
}
getCoef("foo", mod)
# foobaz foofnord fooqux
# 0.12634573 0.27871612 -0.04792015
getCoef("foob", mod)
# foobaz foobmoo foobsqq
# -0.1596896 -0.1452948 -0.0532562
getCoef(c("foo", "foob"), mod)
# foobaz foofnord fooqux foobaz foobmoo foobsqq
# 0.12634573 0.27871612 -0.04792015 -0.15968959 -0.14529477 -0.05325620
# (the two foobaz's from `foo` and `foob` are all extracted)
If a nonexistent term is provided, ...
getCoef("bar", mod)
# named numeric(0)
Try dummy.coef
. It will give a named list, one component per factor, with class "dummy_coef" having its own print
method. It will include all levels of the factors including ones omitted by coef
. When using treatment contrasts, as in the example, the extra level will always be zero. In this example the Brown level of Eye has no coef
component but dum$Eye
does have a zero dummy.coef
component.
hec <- reshape2::melt(HairEyeColor)
m <- glm(value ~ -1 + Hair + Eye, data=hec)
dum <- dummy.coef(m); dum
## Full coefficients are
##
## Hair: Black Brown Red Blond
## 22.500 44.750 17.875 24.875
## Eye: Brown Blue Hazel Green
## 0.000 -0.625 -15.875 -19.500
dum$Hair
## Black Brown Red Blond
## 22.500 44.750 17.875 24.875
dum$Eye
## Brown Blue Hazel Green
## 0.000 -0.625 -15.875 -19.500