02-20-2024

- Logistic Regression
- One Dichotomous Independent Variable
- Logistic regression in R
- Ordered Logistic Regression
- References

The following example is based on:

Pedhazur, E. J. (1997). *Multiple regression in behavioral research: Explanation and prediction* (3rd ed.). Wadsworth.

From the chapter: “Categorical Dependent Variable: Logistic Regression” (pp. 714-725).

The toy example here involves the gender difference in admission to an engineering program. Create the data in R, entering it as grouped data:

```
yes <- c(7, 3)
no <- c(3, 7)
log.ex <- rbind(yes, no)
dimnames(log.ex) <- list("Admit" = c("Yes", "No"),
                         "Gender" = c("Male", "Female"))
# Display the table
log.ex
```

The table:

```
|-------|------|--------|
|       |     Gender    |
| Admit | Male | Female |
|-------|------|--------|
| Yes   |    7 |      3 |
| No    |    3 |      7 |
|-------|------|--------|
```

*where*

```
|---|---|---|
|   |   X   |
| Y | 1 | 0 |
|---|---|---|
| 1 | a | b |
| 0 | c | d |
|---|---|---|
```

Pedhazur shows how to calculate the odds ratio using two methods. First, using the table above:

$OR = \dfrac{\dfrac{a}{c}}{\dfrac{b}{d}} = \dfrac{ad}{bc} = \dfrac{(7)(7)}{(3)(3)} = \dfrac{49}{9} = 5.44$

Second, using the odds within each group, where the odds are

$odds = \dfrac{P}{1 - P}$ (or $\dfrac{P}{Q}$)

With $P = .7$ for males and $P = .3$ for females:

$odds(M) = \dfrac{.7}{.3} = 2.333$; $odds(F) = \dfrac{.3}{.7} = 0.429$; and

$OR = \dfrac{odds(M)}{odds(F)} = \dfrac{2.333}{.429} = 5.44$
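
Both calculations are easy to verify in R. Here is a minimal sketch that reads the cells directly from the `log.ex` matrix created above:

```
# Method 1: OR = ad / bc, reading the cells from the 2 x 2 table
(log.ex["Yes", "Male"] * log.ex["No", "Female"]) /
  (log.ex["Yes", "Female"] * log.ex["No", "Male"])           # 5.444

# Method 2: the ratio of the within-group odds
odds.m <- log.ex["Yes", "Male"] / log.ex["No", "Male"]       # 2.333
odds.f <- log.ex["Yes", "Female"] / log.ex["No", "Female"]   # 0.429
odds.m / odds.f                                              # 5.444
```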

The logistic regression provides the parameters for the model. When collecting data, it may be natural to enter the data by individual observations (i.e., ungrouped). Because the data here come from a textbook, they were entered as grouped data, which results in a matrix in R:

```
class(log.ex)
[1] "matrix"
log.ex
     Gender
Admit Male Female
  Yes    7      3
  No     3      7
```

Based on a tip from http://stats.stackexchange.com/questions/27400/logistic-regression-grouped-and-ungrouped-variables-using-r, the following code works on grouped data (note the Admit coding (1, 0), which matches the row order in the `log.ex` table):

```
fit.1 <- glm(as.table(log.ex) ~ c(1, 0), family = binomial)
fit.1

Call:  glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)

Coefficients:
(Intercept)      c(1, 0)
    -0.8473       1.6946

Degrees of Freedom: 1 Total (i.e. Null);  0 Residual
Null Deviance:      3.291
Residual Deviance: 8.882e-16    AIC: 9.285
```

A summary of the model:

```
summary(fit.1)

Call:
glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)

Deviance Residuals:
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.8473     0.6901  -1.228   0.2195
c(1, 0)       1.6946     0.9759   1.736   0.0825 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.2913e+00  on 1  degrees of freedom
Residual deviance: 8.8818e-16  on 0  degrees of freedom
AIC: 9.2846

Number of Fisher Scoring iterations: 3
```

Exponentiate the coefficients:

```
exp(coef(fit.1))
(Intercept)     c(1, 0)
  0.4285714   5.4444444
```

Exponentiating the coefficients recovers the odds ratio (5.44) we calculated above. The interpretation of this model is: “the odds of being admitted to the program, rather than being denied, are about 5.44 times greater (more favorable) for males than they are for females” (Pedhazur, 1997, p. 718).
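
For comparison, here is a sketch of the same model fit to ungrouped data (one row per applicant); it should reproduce the same coefficients:

```
# Ungrouped version: 20 individual observations
gender <- c(rep(1, 10), rep(0, 10))                     # 1 = Male, 0 = Female
admit  <- c(rep(1, 7), rep(0, 3), rep(1, 3), rep(0, 7)) # admissions by gender
fit.u  <- glm(admit ~ gender, family = binomial)
exp(coef(fit.u))                                        # odds ratio = 5.444
```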

On p. 729, Pedhazur describes a set of data where “the independent variable consists of more than two categories.” He presents a table on page 730 that looks like this:

```
|--------------------|-------------|-------------|---------|--------|
|                    |            Training                 |        |
| Dependent Variable | Treatment 1 | Treatment 2 | Control | Totals |
|--------------------|-------------|-------------|---------|--------|
| Success            |          30 |          40 |      10 |     80 |
| Failure            |          20 |          10 |      40 |     70 |
| Totals             |          50 |          50 |      50 |    150 |
|--------------------|-------------|-------------|---------|--------|
```

Let’s call the dependent variable the **outcome** variable and the **training** variable the **treatment** variable. The cell counts are frequencies, and so we create the data in **R** like so:

```
treatment <- c(rep(1, 50), rep(2, 50), rep(3, 50))
outcome <- c(rep(1, 30), rep(0, 20), rep(1, 40), rep(0, 10),
             rep(1, 10), rep(0, 40))
test.set <- data.frame(outcome, treatment)
```

Next we make sure our **treatment** variable is a factor and that its levels are set so that **Control** is the reference level:

```
test.set$treatment <- factor(test.set$treatment, levels = c(1, 2, 3),
                             labels = c("T1", "T2", "Control"))
test.set$treatment <- relevel(test.set$treatment, ref = "Control")
```
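
As a quick sanity check, a cross-tabulation should reproduce the frequencies in Pedhazur’s table (with **Control** listed first because of the releveling):

```
# Verify the cell counts match the frequency table above
xtabs(~ outcome + treatment, data = test.set)
```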

Here’s a glimpse of some of the rows in the data frame (e.g., `test.set[51:55, ]`):

```
|-----|---------|-----------|
|     | outcome | treatment |
|-----|---------|-----------|
| 1   | 1       | T1        |
| 2   | 1       | T1        |
| ... | ...     | ...       |
| 51  | 1       | T2        |
| 52  | 1       | T2        |
| ... | ...     | ...       |
| 101 | 1       | Control   |
| 102 | 1       | Control   |
|-----|---------|-----------|
```

We regress **outcome** ($Y$) on **treatment** ($X$) and take a look at the model and its parameters:

```
fit.1 <- glm(outcome ~ treatment, data = test.set, family = "binomial")
summary(fit.1)
```

The model (below) corresponds to Table 17.1 (p. 731) in Pedhazur, but neither `summary()` nor `anova()` provides all the needed information. Pedhazur also reports the *Wald* chi-square test and the degrees of freedom for the coefficients in the model. To run a *Wald* test, we need the *aod* package; we then test each coefficient, and also all the levels of the treatment variable together (see http://stats.idre.ucla.edu/r/dae/logit-regression/ for more details):

```
library("aod")
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 1) # test constant
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2) # test treatmentT1
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 3) # test treatmentT2
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2:3) # test treatment entirely
```
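
If the model is set up as above, these four tests should reproduce the Wald statistics Pedhazur reports: roughly 15.4 for the intercept, 15.4 for **treatmentT1**, 30.7 for **treatmentT2**, and 31.9 (on 2 df) for the **treatment** variable as a whole.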

Pedhazur also reports the exponentiated coefficients for the treatment levels. I’ll add the exponentiated confidence intervals too, at the 95% level:

`round(exp(cbind(OR = coef(fit.1), confint(fit.1))), 3)`

We can now reconstruct Pedhazur’s table. **R** rounds differently, but the results are the same:

| Variable    | Estimate B | S.E.   | Wald | df | Sig   | Exp(B) | 2.5%  | 97.5%  |
|-------------|------------|--------|------|----|-------|--------|-------|--------|
| Intercept   | -1.3863    | 0.3536 | 15.4 | 1  | .0000 |        |       |        |
| treatmentT1 | 1.7918     | 0.4564 | 15.4 | 1  | .0000 | 6.00   | 2.523 | 15.260 |
| treatmentT2 | 2.7726     | 0.5000 | 30.7 | 1  | .0000 | 16.00  | 6.268 | 44.988 |
| treatment   |            |        | 31.9 | 2  | .0000 |        |       |        |

The `anova()` function, as well as the `summary()` function, provides us with the log likelihoods in the form of the residual deviances and the chi-square improvement in the form of the deviance reduction for the **treatment** variable:

`anova(fit.1)`

|           | Df | Deviance | Resid. Df | Resid. Dev |
|-----------|----|----------|-----------|------------|
| NULL      |    |          | 149       | 207.28     |
| treatment | 2  | 39.895   | 147       | 167.38     |

Field, Miles, and Field (2012, p. 332) provide another way of reporting this. Adding the **treatment** variable reduces the deviance, and that reduction is the chi-square statistic; if it is significant, we can reject the null hypothesis that the model is no better than chance at predicting the outcome. We can do this using the following code:

```
# The reduction in the deviance; results in the chi-square statistic
fit.chi <- fit.1$null.deviance - fit.1$deviance
# The degrees of freedom for the chi-square statistic
chi.df <- fit.1$df.null - fit.1$df.residual
# The probability associated with the chi-square statistic
chisq.prob <- 1 - pchisq(fit.chi, chi.df)
# Display the results
fit.chi; chi.df; chisq.prob
```
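
For `fit.1`, this should return a chi-square of 39.895 on 2 degrees of freedom, with a p-value well below .001, matching the `anova()` table above.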

We turn now to an ordered logistic regression example. Note: Europe is the baseline/reference group and Low is the baseline/reference mean score. The objective is to compare the odds for any region to the odds for Europe, since this is how R’s output should be interpreted. We have the following table:

```
|------------|--------|--------|------|------------|------------|---------|------|
|            |                      first_auth_geog                             |
| mean.score | Europe | Africa | Asia | Latin Amer | North Amer | Oceania | U.K. |
|------------|--------|--------|------|------------|------------|---------|------|
| Low        |    220 |      4 |   43 |         23 |        217 |      65 |   65 |
| Middle     |    260 |     11 |   64 |         30 |        221 |      68 |   43 |
| High       |    124 |      5 |   44 |         23 |         79 |      30 |   20 |
|------------|--------|--------|------|------------|------------|---------|------|
```

The odds are calculated as Low / (Middle + High). For Europe, that means:

$Odds = \dfrac{220}{260+124} = 0.573$

And for Asia, that means:

$Odds = \dfrac{43}{64+44} = 0.398$

Thus, the odds ratio is:

$OR = \dfrac{0.573}{0.398} = 1.439$

Interpretation: **first_auth_geog** in Europe are 1.439 times more likely to score low, rather than middle or high, compared to **first_auth_geog** in Asia.

It works out a little differently when North America is considered the treatment group:

$Odds = \dfrac{217}{221+79} = 0.723$

Thus this odds ratio is:

$OR = \dfrac{0.573}{0.723} = 0.792$

Interpretation: **first_auth_geog** in Europe are 0.792 times as likely to score low, rather than middle or high, compared to **first_auth_geog** in North America. This, of course, means **first_auth_geog** in Europe are less likely because $OR < 1.00$.
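
Here is a minimal sketch reproducing these hand calculations in R, using the counts from the frequency table above:

```
# Odds of scoring Low vs. (Middle + High), per region
low    <- c(Europe = 220, Asia = 43, NorthAmer = 217)
mid.hi <- c(Europe = 260 + 124, Asia = 64 + 44, NorthAmer = 221 + 79)
odds   <- low / mid.hi                  # 0.573, 0.398, 0.723
odds["Europe"] / odds["Asia"]           # OR = 1.439
odds["Europe"] / odds["NorthAmer"]      # OR = 0.792
```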

The information above is derived from the `ftable()` command and can also be compared to the output of the ordered logistic regression:

```
library(MASS)  # provides polr() for ordered logistic regression
ftable(xtabs(~ mean.rs + first_auth_geog, data = dec_sent))
m <- polr(mean.rs ~ first_auth_geog, data = dec_sent, Hess = TRUE)
ci <- confint(m)  # profile-likelihood confidence intervals
round(exp(cbind(OR = coef(m), ci)), 3)
```

Field, A., Miles, J., & Field, Z. (2012). *Discovering statistics using R*. Sage.