02-20-2024
The following example is based on:
Pedhazur, E. J. (1997). //Multiple regression in behavioral research: Explanation and prediction// (3rd ed.). Wadsworth.
From the chapter: “Categorical Dependent Variable: Logistic Regression” (pp. 714-725).
The toy example here involves the gender difference in admission to an engineering program. Create the data in R, entering it as grouped data:
yes <- c(7, 3)   # admitted: 7 males, 3 females
no  <- c(3, 7)   # denied: 3 males, 7 females
log.ex <- rbind(yes, no)
dimnames(log.ex) <- list("Admit" = c("Yes", "No"),
                         "Gender" = c("Male", "Female"))
# Display the table
log.ex
The table:
|---------|--------|--------|
|         | Gender          |
| Admit   | Male   | Female |
|---------|--------|--------|
| Yes     | 7      | 3      |
| No      | 3      | 7      |
|---------|--------|--------|
where
|-------|-----|-----|
|       | X         |
| Y     | 1   | 0   |
|-------|-----|-----|
| 1     | a   | b   |
| 0     | c   | d   |
|-------|-----|-----|
Pedhazur shows how to calculate the odds ratio using two methods. First, using the table above:

$$\text{odds ratio} = \frac{ad}{bc} = \frac{7 \times 7}{3 \times 3} = \frac{49}{9} \approx 5.44$$

Second, as the ratio of the two groups' odds:

$$\text{odds ratio} = \frac{a/c}{b/d} \quad \left(\text{or } \frac{\text{odds}_{\text{males}}}{\text{odds}_{\text{females}}}\right)$$

where $\text{odds}_{\text{males}}$ equals $7/3 \approx 2.33$ and $\text{odds}_{\text{females}}$ equals $3/7 \approx 0.43$ (or $1/2.33$), then:

$$\text{odds ratio} = \frac{2.33}{0.43} \approx 5.44$$
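As a quick check, the same arithmetic in R, indexing the ''log.ex'' matrix by its dimnames:
# Odds of admission for each gender, then their ratio
odds.male <- log.ex["Yes", "Male"] / log.ex["No", "Male"]        # 7/3 = 2.3333
odds.female <- log.ex["Yes", "Female"] / log.ex["No", "Female"]  # 3/7 = 0.4286
odds.male / odds.female                                          # 49/9 = 5.4444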
Fitting a logistic regression yields the parameters of this model. When collecting data, it may be more natural to enter them by individual observations (i.e., ungrouped). Because these data were taken from a textbook table, they were entered as grouped data, which in R results in a matrix:
class(log.ex)
[1] "matrix"
log.ex
Gender
Admit Male Female
Yes 7 3
No 3 7
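As an aside, here is a minimal sketch of what the ungrouped (one row per applicant) entry would look like; the variable names are my own, not Pedhazur's. It fits the same model and returns the same coefficients as the grouped fit that follows:
# Hypothetical ungrouped layout: one row per applicant
admit <- c(rep(1, 7), rep(0, 3), rep(1, 3), rep(0, 7))  # 1 = admitted
male  <- c(rep(1, 10), rep(0, 10))                      # 1 = male, 0 = female
fit.u <- glm(admit ~ male, family = binomial)
coef(fit.u)  # -0.8473 and 1.6946, matching the grouped fit below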
Based on a tip from http://stats.stackexchange.com/questions/27400/logistic-regression-grouped-and-ungrouped-variables-using-r, the following code works on grouped data (note that the order of the Admit coding, c(1, 0), matches the row order of the //log.ex// table):
fit.1 <- glm(as.table(log.ex) ~ c(1,0), family = binomial)
fit.1
Call: glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)
Coefficients:
(Intercept) c(1, 0)
-0.8473 1.6946
Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
Null Deviance: 3.291
Residual Deviance: 8.882e-16 AIC: 9.285
A summary of the model:
summary(fit.1)
Call:
glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)
Deviance Residuals:
[1] 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8473 0.6901 -1.228 0.2195
c(1, 0) 1.6946 0.9759 1.736 0.0825 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3.2913e+00 on 1 degrees of freedom
Residual deviance: 8.8818e-16 on 0 degrees of freedom
AIC: 9.2846
Number of Fisher Scoring iterations: 3
Exponentiate the coefficients:
exp(coef(fit.1))
(Intercept) c(1, 0)
0.4285714 5.4444444
Now we have the parameters for the odds ratio (5.44) we calculated above. The interpretation of this model is: “the odds of being admitted to the program, rather than being denied, are about 5.44 times greater (more favorable) for males than they are for females” (Pedhazur, p. 718).
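If a confidence interval around this odds ratio is wanted (Pedhazur does not report one for this example), one way to get it is to exponentiate the profile-likelihood interval, just as we exponentiated the coefficients:
# 95% CI for the odds ratio, on the exponentiated scale
exp(confint(fit.1))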
On p. 729, Pedhazur describes a set of data where “the independent variable consists of more than two categories.” He presents a table on page 730 that looks like this:
|--------------------|-------------|-------------|---------|--------|
|                    | Training                            |        |
| Dependent Variable | Treatment 1 | Treatment 2 | Control | Totals |
|--------------------|-------------|-------------|---------|--------|
| Success            | 30          | 40          | 10      | 80     |
| Failure            | 20          | 10          | 40      | 70     |
| Totals             | 50          | 50          | 50      | 150    |
|--------------------|-------------|-------------|---------|--------|
Let’s call the dependent variable the outcome variable and the training variable the treatment variable. The cell counts are frequencies, so we create the data in R like so:
treatment <- c(rep(1, 50), rep(2, 50), rep(3, 50))  # 50 cases per group
outcome <- c(rep(1, 30), rep(0, 20),  # Treatment 1: 30 successes, 20 failures
             rep(1, 40), rep(0, 10),  # Treatment 2: 40 successes, 10 failures
             rep(1, 10), rep(0, 40))  # Control: 10 successes, 40 failures
test.set <- data.frame(outcome, treatment)
Next we make sure our treatment variable is a factor and is properly leveled, such that the Control level is the reference level:
test.set$treatment <- factor(test.set$treatment, levels = c(1, 2, 3),
labels = c("T1", "T2", "Control"))
test.set$treatment <- relevel(test.set$treatment, ref = "Control")
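A quick check that the releveling worked and that the cell counts match Pedhazur's table:
levels(test.set$treatment)  # "Control" "T1" "T2" -- Control is now first
xtabs(~ outcome + treatment, data = test.set)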
Here’s a glimpse of some of the rows in the data frame (viewable with, e.g., ''test.set[51:55, ]''):
|-----|---------|-----------|
| | outcome | treatment |
|-----|---------|-----------|
| 1 | 1 | T1 |
| 2 | 1 | T1 |
| ... | ... | ... |
| 51 | 1 | T2 |
| 52 | 1 | T2 |
| ... | ... | ... |
| 101 | 1 | Control |
| 102 | 1 | Control |
|-----|---------|-----------|
We regress outcome on treatment and take a look at the model and its parameters:
fit.1 <- glm(outcome ~ treatment, data = test.set, family = "binomial")
summary(fit.1)
The model (below) corresponds to Table 17.1 (p. 731) in Pedhazur, but neither ''summary'' nor ''anova'' provides all the needed information. Pedhazur also reports the Wald chi-square test and the degrees of freedom for the coefficients in the model. To run a Wald test, we need the aod package; we then test each coefficient and then all the levels of the treatment variable together (see http://stats.idre.ucla.edu/r/dae/logit-regression/ for more details):
library("aod")
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 1) # test constant
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2) # test treatmentT1
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 3) # test treatmentT2
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2:3) # test treatment entirely
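The single-coefficient Wald values can also be recovered by hand from the ''summary'' output, since each Wald chi-square is just the squared z value:
# Wald chi-square per coefficient: (estimate / std. error)^2
z <- coef(summary(fit.1))[, "z value"]
round(z^2, 1)  # 15.4, 15.4, 30.7 -- matches the wald.test output above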
Pedhazur also reports the exponentiated coefficients for the treatment levels. I’ll add the exponentiated confidence intervals too, at the 95% level:
round(exp(cbind(OR = coef(fit.1), confint(fit.1))), 3)
We can now reconstruct Pedhazur's table. R rounds differently, but the results are the same:
|-------------|------------|--------|------|----|-------|--------|-------|--------|
| Variable    | Estimate B | S.E.   | Wald | df | Sig   | Exp(B) | 2.5%  | 97.5%  |
|-------------|------------|--------|------|----|-------|--------|-------|--------|
| Intercept   | -1.3863    | 0.3536 | 15.4 | 1  | .0000 |        |       |        |
| treatmentT1 | 1.7918     | 0.4564 | 15.4 | 1  | .0000 | 6.00   | 2.523 | 15.260 |
| treatmentT2 | 2.7726     | 0.5000 | 30.7 | 1  | .0000 | 16.00  | 6.268 | 44.988 |
| treatment   |            |        | 31.9 | 2  | .0000 |        |       |        |
|-------------|------------|--------|------|----|-------|--------|-------|--------|
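As with the first example, the Exp(B) values can be verified directly from the frequency table: the odds of success are 30/20 = 1.5 under Treatment 1, 40/10 = 4 under Treatment 2, and 10/40 = 0.25 in the Control group, so the odds ratios against Control are 1.5/0.25 = 6 and 4/0.25 = 16 (and exp(-1.3863) = 0.25 is the Control group's own odds).
# Odds ratios against Control, straight from the frequency table
c(T1 = (30/20) / (10/40), T2 = (40/10) / (10/40))  # 6, 16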
The ''anova'' function, as well as the ''summary'' function, provides the log likelihoods in the form of the residual deviances and the chi-square improvement in the form of the deviance for the treatment variable.
anova(fit.1)
|-----------|----|----------|-----------|------------|
|           | Df | Deviance | Resid. Df | Resid. Dev |
|-----------|----|----------|-----------|------------|
| NULL      |    |          | 149       | 207.28     |
| treatment | 2  | 39.895   | 147       | 167.38     |
|-----------|----|----------|-----------|------------|
Field, Miles, and Field (2012, p. 332) provide another way of reporting this. Adding the treatment variable reduces the deviance, and that reduction is the chi-square statistic; if it is significant, we can reject the null hypothesis that the model is no better than chance at predicting the outcome. We can do this with the following code:
# The reduction in deviance; this is the chi-square statistic
fit.chi <- fit.1$null.deviance - fit.1$deviance
# The degrees of freedom for the chi-square statistic
chi.df <- fit.1$df.null - fit.1$df.residual
# The probability associated with the chi-square statistic
chisq.prob <- 1 - pchisq(fit.chi, chi.df)
# Display the results
fit.chi; chi.df; chisq.prob
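For this model, ''fit.chi'' works out to the same 39.895 on 2 df reported by ''anova'' above, with a correspondingly tiny p-value.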
Note: in this next example, Europe is the baseline/reference group and Low is the baseline/reference mean score. The objective is to make sense of the odds for any region relative to the odds for Europe, since that is how R's output should be interpreted. We have the following table:
|------------|--------|--------|------|------------|------------|---------|------|
|            | first_auth_geog                                                  |
| mean.score | Europe | Africa | Asia | Latin Amer | North Amer | Oceania | U.K. |
|------------|--------|--------|------|------------|------------|---------|------|
| Low        | 220    | 4      | 43   | 23         | 217        | 65      | 65   |
| Middle     | 260    | 11     | 64   | 30         | 221        | 68      | 43   |
| High       | 124    | 5      | 44   | 23         | 79         | 30      | 20   |
|------------|--------|--------|------|------------|------------|---------|------|
The odds are calculated by: Low / (Middle + High). For Europe, that means:

$$\text{odds}_{\text{Europe}} = \frac{220}{260 + 124} = \frac{220}{384} \approx 0.573$$

And for Asia, that means:

$$\text{odds}_{\text{Asia}} = \frac{43}{64 + 44} = \frac{43}{108} \approx 0.398$$

Thus, the odds ratio is:

$$\frac{0.573}{0.398} \approx 1.439$$

Interpretation: first_auth_geog in Europe are 1.439x more likely to score low (rather than middle or high) compared to first_auth_geog in Asia.

It works out a little differently when North America is considered the treatment group:

$$\text{odds}_{\text{North Amer}} = \frac{217}{221 + 79} = \frac{217}{300} \approx 0.723$$

Thus this odds ratio is:

$$\frac{0.573}{0.723} \approx 0.792$$

Interpretation: first_auth_geog in Europe are 0.792x as likely to score low (rather than middle or high) compared to first_auth_geog in North America. This, of course, means first_auth_geog in Europe are less likely, because the ratio is below 1.
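A minimal sketch of the same arithmetic in R, rebuilding the counts from the table above (the matrix name and labels here are my own):
# Counts from the table: rows = mean score, columns = region
geog <- matrix(c(220, 4, 43, 23, 217, 65, 65,
                 260, 11, 64, 30, 221, 68, 43,
                 124, 5, 44, 23, 79, 30, 20),
               nrow = 3, byrow = TRUE,
               dimnames = list(score = c("Low", "Middle", "High"),
                               region = c("Europe", "Africa", "Asia", "LatinAmer",
                                          "NorthAmer", "Oceania", "UK")))
# Odds of scoring Low (vs. Middle or High) in each region
odds <- geog["Low", ] / (geog["Middle", ] + geog["High", ])
round(odds["Europe"] / odds["Asia"], 3)       # 1.439
round(odds["Europe"] / odds["NorthAmer"], 3)  # 0.792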
The information is derived from the ''ftable'' command and can also be compared to the output of the ordered logistic regression:
library(MASS)  # provides polr()
ftable(xtabs(~ mean.rs + first_auth_geog, data = dec_sent))
m <- polr(mean.rs ~ first_auth_geog, data = dec_sent, Hess = TRUE)
ci <- confint(m)  # profile-likelihood confidence intervals
round(exp(cbind(OR = coef(m), ci)), 3)
Field, A., Miles, J., & Field, Z. (2012). //Discovering statistics using R//. Los Angeles: Sage.