02-20-2024
The following example is based on:
Pedhazur, E. J. (1997). //Multiple regression in behavioral research: Explanation and prediction// (3rd ed.). Wadsworth.
From the chapter: “Categorical Dependent Variable: Logistic Regression” (pp. 714-725).
The toy example here involves the gender difference in admission to an engineering program. Create the data in R, entering it as grouped data:
yes <- c(7, 3)   # admitted: 7 males, 3 females
no  <- c(3, 7)   # denied: 3 males, 7 females
log.ex <- rbind(yes, no)
dimnames(log.ex) <- list("Admit" = c("Yes", "No"),
                         "Gender" = c("Male", "Female"))
# Display the table
log.ex
The table:
|---------|--------|--------|
|         | Gender          |
| Admit   | Male   | Female |
|---------|--------|--------|
| Yes     | 7      | 3      |
| No      | 3      | 7      |
|---------|--------|--------|
where
|-------|-----|-----|
|       | X         |
| Y     | 1   | 0   |
|-------|-----|-----|
| 1     | a   | b   |
| 0     | c   | d   |
|-------|-----|-----|
Pedhazur shows how to calculate the odds ratio using two methods. First, using the table above:

$$\text{odds ratio} = \frac{ad}{bc} = \frac{7 \times 7}{3 \times 3} = \frac{49}{9} \approx 5.44$$

Second, as the ratio of the two groups' odds:

$$\text{odds ratio} = \frac{a/c}{b/d} \quad \left(\text{or } \frac{\text{odds}_{\text{males}}}{\text{odds}_{\text{females}}}\right)$$

where $\text{odds}_{\text{males}}$ equals $7/3 \approx 2.33$ and $\text{odds}_{\text{females}}$ equals $3/7 \approx 0.43$ (or $1/2.33$), then:

$$\text{odds ratio} = \frac{2.33}{0.43} \approx 5.44$$
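As a quick check, the same arithmetic in R, indexing the ''log.ex'' matrix by its dimnames:
# Odds of admission for each gender, then their ratio
odds.male <- log.ex["Yes", "Male"] / log.ex["No", "Male"]        # 7/3 = 2.3333
odds.female <- log.ex["Yes", "Female"] / log.ex["No", "Female"]  # 3/7 = 0.4286
odds.male / odds.female                                          # 49/9 = 5.4444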
Fitting a logistic regression yields the parameters of this model. When collecting data, it may be more natural to enter them by individual observations (i.e., ungrouped). Because these data were taken from a textbook table, they were entered as grouped data, which in R results in a matrix:
class(log.ex)
[1] "matrix"
log.ex
Gender
Admit Male Female
Yes 7 3
No 3 7
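As an aside, here is a minimal sketch of what the ungrouped (one row per applicant) entry would look like; the variable names are my own, not Pedhazur's. It fits the same model and returns the same coefficients as the grouped fit that follows:
# Hypothetical ungrouped layout: one row per applicant
admit <- c(rep(1, 7), rep(0, 3), rep(1, 3), rep(0, 7))  # 1 = admitted
male  <- c(rep(1, 10), rep(0, 10))                      # 1 = male, 0 = female
fit.u <- glm(admit ~ male, family = binomial)
coef(fit.u)  # -0.8473 and 1.6946, matching the grouped fit below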
Based on a tip from http://stats.stackexchange.com/questions/27400/logistic-regression-grouped-and-ungrouped-variables-using-r, the following code works on grouped data (note that the order of the Admit coding, c(1, 0), matches the row order of the //log.ex// table):
fit.1 <- glm(as.table(log.ex) ~ c(1,0), family = binomial)
fit.1
Call: glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)
Coefficients:
(Intercept) c(1, 0)
-0.8473 1.6946
Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
Null Deviance: 3.291
Residual Deviance: 8.882e-16 AIC: 9.285
A summary of the model:
summary(fit.1)
Call:
glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)
Deviance Residuals:
[1] 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8473 0.6901 -1.228 0.2195
c(1, 0) 1.6946 0.9759 1.736 0.0825 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3.2913e+00 on 1 degrees of freedom
Residual deviance: 8.8818e-16 on 0 degrees of freedom
AIC: 9.2846
Number of Fisher Scoring iterations: 3
Exponentiate the coefficients:
exp(coef(fit.1))
(Intercept) c(1, 0)
0.4285714 5.4444444
Now we have the parameters for the odds ratio (5.44) we calculated above. The interpretation of this model is: “the odds of being admitted to the program, rather than being denied, are about 5.44 times greater (more favorable) for males than they are for females” (Pedhazur, p. 718).
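If a confidence interval around this odds ratio is wanted (Pedhazur does not report one for this example), one way to get it is to exponentiate the profile-likelihood interval, just as we exponentiated the coefficients:
# 95% CI for the odds ratio, on the exponentiated scale
exp(confint(fit.1))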
On p. 729, Pedhazur describes a set of data where “the independent variable consists of more than two categories.” He presents a table on page 730 that looks like this:
|--------------------|-------------|-------------|---------|--------|
|                    | Training                            |        |
| Dependent Variable | Treatment 1 | Treatment 2 | Control | Totals |
|--------------------|-------------|-------------|---------|--------|
| Success            | 30          | 40          | 10      | 80     |
| Failure            | 20          | 10          | 40      | 70     |
| Totals             | 50          | 50          | 50      | 150    |
|--------------------|-------------|-------------|---------|--------|
Let’s call the dependent variable the outcome variable and the training variable the treatment variable. The cell counts are frequencies, so we create the data in R like so:
treatment <- c(rep(1, 50), rep(2, 50), rep(3, 50))  # 50 cases per group
outcome <- c(rep(1, 30), rep(0, 20),  # Treatment 1: 30 successes, 20 failures
             rep(1, 40), rep(0, 10),  # Treatment 2: 40 successes, 10 failures
             rep(1, 10), rep(0, 40))  # Control: 10 successes, 40 failures
test.set <- data.frame(outcome, treatment)
Next we make sure our treatment variable is a factor and is properly leveled, such that the Control level is the reference level:
test.set$treatment <- factor(test.set$treatment, levels = c(1, 2, 3),
labels = c("T1", "T2", "Control"))
test.set$treatment <- relevel(test.set$treatment, ref = "Control")
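A quick check that the releveling worked and that the cell counts match Pedhazur's table:
levels(test.set$treatment)  # "Control" "T1" "T2" -- Control is now first
xtabs(~ outcome + treatment, data = test.set)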
Here’s a glimpse of some of the rows in the data frame (viewable with, e.g., ''test.set[51:55, ]''):
|-----|---------|-----------|
| | outcome | treatment |
|-----|---------|-----------|
| 1 | 1 | T1 |
| 2 | 1 | T1 |
| ... | ... | ... |
| 51 | 1 | T2 |
| 52 | 1 | T2 |
| ... | ... | ... |
| 101 | 1 | Control |
| 102 | 1 | Control |
|-----|---------|-----------|
We regress outcome on treatment and take a look at the model and its parameters:
fit.1 <- glm(outcome ~ treatment, data = test.set, family = "binomial")
summary(fit.1)
The model (below) corresponds to Table 17.1 (p. 731) in Pedhazur, but neither ''summary'' nor ''anova'' provides all the needed information. Pedhazur also reports the Wald chi-square test and the degrees of freedom for the coefficients in the model. To run a Wald test, we need the aod package; we then test each coefficient and then all the levels of the treatment variable together (see http://stats.idre.ucla.edu/r/dae/logit-regression/ for more details):
library("aod")
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 1) # test constant
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2) # test treatmentT1
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 3) # test treatmentT2
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2:3) # test treatment entirely
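The single-coefficient Wald values can also be recovered by hand from the ''summary'' output, since each Wald chi-square is just the squared z value:
# Wald chi-square per coefficient: (estimate / std. error)^2
z <- coef(summary(fit.1))[, "z value"]
round(z^2, 1)  # 15.4, 15.4, 30.7 -- matches the wald.test output above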
Pedhazur also reports the exponentiated coefficients for the treatment levels. I’ll add the exponentiated confidence intervals too, at the 95% level:
round(exp(cbind(OR = coef(fit.1), confint(fit.1))), 3)
We can now reconstruct Pedhazur's table. R rounds differently, but the results are the same:
|-------------|------------|--------|------|----|-------|--------|-------|--------|
| Variable    | Estimate B | S.E.   | Wald | df | Sig   | Exp(B) | 2.5%  | 97.5%  |
|-------------|------------|--------|------|----|-------|--------|-------|--------|
| Intercept   | -1.3863    | 0.3536 | 15.4 | 1  | .0000 |        |       |        |
| treatmentT1 | 1.7918     | 0.4564 | 15.4 | 1  | .0000 | 6.00   | 2.523 | 15.260 |
| treatmentT2 | 2.7726     | 0.5000 | 30.7 | 1  | .0000 | 16.00  | 6.268 | 44.988 |
| treatment   |            |        | 31.9 | 2  | .0000 |        |       |        |
|-------------|------------|--------|------|----|-------|--------|-------|--------|
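As with the first example, the Exp(B) values can be verified directly from the frequency table: the odds of success are 30/20 = 1.5 under Treatment 1, 40/10 = 4 under Treatment 2, and 10/40 = 0.25 in the Control group, so the odds ratios against Control are 1.5/0.25 = 6 and 4/0.25 = 16 (and exp(-1.3863) = 0.25 is the Control group's own odds).
# Odds ratios against Control, straight from the frequency table
c(T1 = (30/20) / (10/40), T2 = (40/10) / (10/40))  # 6, 16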
The ''anova'' function, as well as the ''summary'' function, provides the log likelihoods in the form of the residual deviances and the chi-square improvement in the form of the deviance for the treatment variable.
anova(fit.1)
|-----------|----|----------|-----------|------------|
|           | Df | Deviance | Resid. Df | Resid. Dev |
|-----------|----|----------|-----------|------------|
| NULL      |    |          | 149       | 207.28     |
| treatment | 2  | 39.895   | 147       | 167.38     |
|-----------|----|----------|-----------|------------|
Field, Miles, and Field (2012, p. 332) provide another way of reporting this. Adding the treatment variable reduces the deviance, and that reduction is the chi-square statistic; if it is significant, we can reject the null hypothesis that the model is no better than chance at predicting the outcome. We can do this with the following code:
# The reduction in deviance; this is the chi-square statistic
fit.chi <- fit.1$null.deviance - fit.1$deviance
# The degrees of freedom for the chi-square statistic
chi.df <- fit.1$df.null - fit.1$df.residual
# The probability associated with the chi-square statistic
chisq.prob <- 1 - pchisq(fit.chi, chi.df)
# Display the results
fit.chi; chi.df; chisq.prob
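For this model, ''fit.chi'' works out to the same 39.895 on 2 df reported by ''anova'' above, with a correspondingly tiny p-value.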
Note: in this next example, Europe is the baseline/reference group and Low is the baseline/reference mean score. The objective is to make sense of the odds for any region relative to the odds for Europe, since that is how R's output should be interpreted. We have the following table:
|------------|--------|--------|------|------------|------------|---------|------|
|            | first_auth_geog                                                  |
| mean.score | Europe | Africa | Asia | Latin Amer | North Amer | Oceania | U.K. |
|------------|--------|--------|------|------------|------------|---------|------|
| Low        | 220    | 4      | 43   | 23         | 217        | 65      | 65   |
| Middle     | 260    | 11     | 64   | 30         | 221        | 68      | 43   |
| High       | 124    | 5      | 44   | 23         | 79         | 30      | 20   |
|------------|--------|--------|------|------------|------------|---------|------|
The odds are calculated by: Low / (Middle + High). For Europe, that means:

$$\text{odds}_{\text{Europe}} = \frac{220}{260 + 124} = \frac{220}{384} \approx 0.573$$

And for Asia, that means:

$$\text{odds}_{\text{Asia}} = \frac{43}{64 + 44} = \frac{43}{108} \approx 0.398$$

Thus, the odds ratio is:

$$\frac{0.573}{0.398} \approx 1.439$$

Interpretation: first_auth_geog in Europe are 1.439x more likely to score low (rather than middle or high) compared to first_auth_geog in Asia.

It works out a little differently when North America is considered the treatment group:

$$\text{odds}_{\text{North Amer}} = \frac{217}{221 + 79} = \frac{217}{300} \approx 0.723$$

Thus this odds ratio is:

$$\frac{0.573}{0.723} \approx 0.792$$

Interpretation: first_auth_geog in Europe are 0.792x as likely to score low (rather than middle or high) compared to first_auth_geog in North America. This, of course, means first_auth_geog in Europe are less likely, because the ratio is below 1.
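A minimal sketch of the same arithmetic in R, rebuilding the counts from the table above (the matrix name and labels here are my own):
# Counts from the table: rows = mean score, columns = region
geog <- matrix(c(220, 4, 43, 23, 217, 65, 65,
                 260, 11, 64, 30, 221, 68, 43,
                 124, 5, 44, 23, 79, 30, 20),
               nrow = 3, byrow = TRUE,
               dimnames = list(score = c("Low", "Middle", "High"),
                               region = c("Europe", "Africa", "Asia", "LatinAmer",
                                          "NorthAmer", "Oceania", "UK")))
# Odds of scoring Low (vs. Middle or High) in each region
odds <- geog["Low", ] / (geog["Middle", ] + geog["High", ])
round(odds["Europe"] / odds["Asia"], 3)       # 1.439
round(odds["Europe"] / odds["NorthAmer"], 3)  # 0.792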
The information is derived from the ''ftable'' command and can also be compared to the output of the ordered logistic regression:
library(MASS)  # provides polr()
ftable(xtabs(~ mean.rs + first_auth_geog, data = dec_sent))
m <- polr(mean.rs ~ first_auth_geog, data = dec_sent, Hess = TRUE)
ci <- confint(m)  # profile-likelihood confidence intervals
round(exp(cbind(OR = coef(m), ci)), 3)
Field, A., Miles, J., & Field, Z. (2012). //Discovering statistics using R//. Los Angeles: Sage.