Logistic Regression

One Dichotomous Independent Variable

The following example is based on:

Pedhazur, E. J. (1997). Multiple regression in behavioral research: Explanation and prediction (3rd ed.). Wadsworth.

From the chapter: “Categorical Dependent Variable: Logistic Regression” (pp. 714-725).

The toy example here involves the gender difference in admission to an engineering program. Create the data in R, entering it as grouped data:

yes <- c(7, 3)
no  <- c(3, 7)
log.ex <- rbind(yes, no)
dimnames(log.ex) <- list("Admit"  = c("Yes", "No"),
                         "Gender" = c("Male", "Female"))
# Display the table
log.ex

The table:

     Gender
Admit Male Female
  Yes    7      3
  No     3      7

where

     X
Y     1  0
  1   a  b
  0   c  d

Pedhazur shows how to calculate the odds ratio using two methods. First, using the table above:

$OR = \dfrac{\dfrac{a}{c}}{\dfrac{b}{d}} = \dfrac{ad}{bc}$ $= \dfrac{(7)(7)}{(3)(3)} = \dfrac{49}{9} = 5.44$

Second, compute the odds for each group, where

$odds = \dfrac{P}{1 - P}$ (or $\dfrac{P}{Q}$)

and $P$ is the proportion admitted within a group. The odds ratio is then the ratio of the two odds:

$odds(M) = \dfrac{.7}{.3} = 2.333$; $odds(F) = \dfrac{.3}{.7} = 0.429$; and:

$OR = \dfrac{odds(M)}{odds(F)} = \dfrac{2.333}{.429} = 5.44$
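
Both calculations can be checked quickly in R using the log.ex matrix created above (a minimal sketch; the row and column names follow the table):

# Method 1: cross-product ratio from the 2 x 2 table
(log.ex["Yes", "Male"] * log.ex["No", "Female"]) /
  (log.ex["Yes", "Female"] * log.ex["No", "Male"])           # 5.444444

# Method 2: ratio of the two odds
odds.m <- log.ex["Yes", "Male"]   / log.ex["No", "Male"]     # 2.333
odds.f <- log.ex["Yes", "Female"] / log.ex["No", "Female"]   # 0.429
odds.m / odds.f                                              # 5.444444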

Logistic Regression in R

Logistic regression provides the parameters for the model. When collecting data, it may be natural to enter them by individual observations (i.e., ungrouped). Because this example is taken from a textbook, the data were entered as grouped data, which in R results in a matrix (an ungrouped version is sketched at the end of this section):

class(log.ex)
[1] "matrix"
log.ex
     Gender
Admit Male Female
  Yes    7      3
  No     3      7

Based on a tip (http://stats.stackexchange.com/questions/27400/logistic-regression-grouped-and-ungrouped-variables-using-r), the following code works on grouped data. Note that the predictor c(1, 0) follows the order of the Admit rows in log.ex (Yes = 1, No = 0):

fit.1 <- glm(as.table(log.ex) ~ c(1,0), family = binomial)
fit.1
 
Call:  glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)
 
Coefficients:
(Intercept)      c(1, 0)  
    -0.8473       1.6946  
 
Degrees of Freedom: 1 Total (i.e. Null);  0 Residual
Null Deviance:      3.291 
Residual Deviance: 8.882e-16    AIC: 9.285

A summary of the model:

summary(fit.1)
 
Call:
glm(formula = as.table(log.ex) ~ c(1, 0), family = binomial)
 
Deviance Residuals: 
[1]  0  0
 
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.8473     0.6901  -1.228   0.2195  
c(1, 0)       1.6946     0.9759   1.736   0.0825 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
(Dispersion parameter for binomial family taken to be 1)
 
    Null deviance: 3.2913e+00  on 1  degrees of freedom
Residual deviance: 8.8818e-16  on 0  degrees of freedom
AIC: 9.2846
 
Number of Fisher Scoring iterations: 3

Exponentiate the coefficients:

exp(coef(fit.1))
(Intercept)     c(1, 0) 
  0.4285714   5.4444444

Now we have the parameters for the odds ratio (5.44) we calculated above. The interpretation of this model is: “the odds of being admitted to the program, rather than being denied, are about 5.44 times greater (more favorable) for males than they are for females” (Pedhazur, p. 718).
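
If the same data had been entered ungrouped (one row per applicant), the coefficient estimates, and hence the exponentiated values, come out the same. A minimal sketch, using the hypothetical variable names admit and gender (coded 1 = admitted, 1 = male), which are not part of Pedhazur's example:

# Ungrouped equivalent: 20 applicants, one row each
admit  <- c(rep(1, 7), rep(0, 3), rep(1, 3), rep(0, 7))  # 7 yes/3 no (males), 3 yes/7 no (females)
gender <- c(rep(1, 10), rep(0, 10))                      # first 10 rows are the males
fit.2  <- glm(admit ~ gender, family = binomial)
exp(coef(fit.2))
# (Intercept)      gender
#   0.4285714   5.4444444   <- odds for females and the odds ratio, as above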

One Independent Variable with Multiple Categories

On p. 729, Pedhazur describes a set of data where “the independent variable consists of more than two categories.” He presents a table on page 730 that looks like this:

                           Training
Dependent Variable   Treatment 1   Treatment 2   Control   Totals
Success                       30            40        10       80
Failure                       20            10        40       70
Totals                        50            50        50      150

Let's call the dependent variable the outcome variable and the training variable the treatment variable. The cell counts are frequencies, and so we create the data in R like so:

treatment <- c(rep(1, 50), rep(2, 50), rep(3, 50))
outcome   <- c(rep(1, 30), rep(0, 20), rep(1, 40), rep(0, 10),
               rep(1, 10), rep(0, 40))
test.set <- as.data.frame(cbind(outcome, treatment))

Next we make sure our treatment variable is a factor and is properly leveled, such that the Control level is the reference level:

test.set$treatment <- factor(test.set$treatment, levels = c(1, 2, 3),
                             labels = c("T1", "T2", "Control"))
test.set$treatment <- relevel(test.set$treatment, ref = "Control")
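
As a quick check (not in Pedhazur), a cross-tabulation of the data frame should reproduce the frequency table above:

# Verify the cell counts against Pedhazur's table
with(test.set, table(outcome, treatment))
#        treatment
# outcome Control T1 T2
#       0      40 20 10
#       1      10 30 40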

Here's a glimpse of a few rows from each treatment group (e.g. test.set[c(1:2, 51:52, 101:102), ]):

    outcome treatment
1         1        T1
2         1        T1
51        1        T2
52        1        T2
101       1   Control
102       1   Control

We regress outcome $Y$ on treatment $X$ and take a look at the model and its parameters:

fit.1 <- glm(outcome ~ treatment, data = test.set, family = "binomial")
summary(fit.1)

The model (below) corresponds to Table 17.1 (p. 731) in Pedhazur, but neither summary nor anova provides all the needed information. Pedhazur also reports the Wald chi-square test and the degrees of freedom for the coefficients in the model. To obtain these, we use the aod package and run the Wald test on each coefficient and on all levels of the treatment variable (see http://stats.idre.ucla.edu/r/dae/logit-regression/ for more details):

library("aod")
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 1)      # test constant
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2)      # test treatmentT1
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 3)      # test treatmentT2
wald.test(b = coef(fit.1), Sigma = vcov(fit.1), Terms = 2:3)    # test treatment entirely
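
As a side note, the 2-df test of the treatment term can also be obtained without aod via a likelihood-ratio test in base R; the likelihood-ratio statistic (shown in the anova output below) differs from the Wald chi-square but tests the same hypothesis. A minimal sketch:

# Likelihood-ratio (deviance) test of the treatment term, as an alternative
anova(fit.1, test = "Chisq")
drop1(fit.1, test = "Chisq")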

Pedhazur also reports the exponentiated coefficients for the treatment levels. I'll add the exponentiated confidence intervals too, at the 95% level:

round(exp(cbind(OR = coef(fit.1), confint(fit.1))), 3)

We can now reconstruct Pedhazur's table. R rounds differently, but the results are the same:

Variable      Estimate (B)    S.E.   Wald   df    Sig   Exp(B)    2.5%    97.5%
Intercept          -1.3863  0.3536   15.4    1  .0000
treatmentT1         1.7918  0.4564   15.4    1  .0000     6.00   2.523   15.260
treatmentT2         2.7726  0.5000   30.7    1  .0000    16.00   6.268   44.988
treatment                            31.9    2  .0000

The anova function, as well as the summary function, provides the log likelihoods in the form of the residual deviances, and the chi-square improvement in the form of the deviance attributed to the treatment variable.

anova(fit.1)
          Df Deviance Resid. Df Resid. Dev
NULL                        149     207.28
treatment  2   39.895       147     167.38

Field, Miles, and Field (2012, p. 332) provide another way of reporting this. Adding the treatment variable reduces the deviance, and that reduction is a chi-square statistic; if it is significant, we can reject the null hypothesis that the model is no better than chance at predicting the outcome. We can do this using the following code:

# The reduction in the deviance; results in the chi square statistic
fit.chi <- fit.1$null.deviance - fit.1$deviance
# The degrees of freedom for the chi square statistic
chi.df  <- fit.1$df.null - fit.1$df.residual
# The probability associated with the chi-square statistic
chisq.prob <- 1 - pchisq(fit.chi, chi.df)
# Display the results
fit.chi; chi.df; chisq.prob
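
Using the values from the anova table above (a deviance drop of 39.895 on 2 degrees of freedom), the p-value can also be computed directly:

# Equivalent one-liner using the values from the anova table
pchisq(39.895, df = 2, lower.tail = FALSE)   # well below .001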

Ordered Logistic Regression

Note: Europe is the baseline/reference group and Low is the baseline/reference mean score. The objective is to make sense of the odds for any region compared to the odds for Europe, since this is how R's output should be interpreted. Given the following table:

             first_auth_geog
mean.score   Europe   Africa   Asia   Latin Amer   North Amer   Oceania   U.K.
Low             220        4     43           23          217        65     65
Middle          260       11     64           30          221        68     43
High            124        5     44           23           79        30     20

The odds are calculated by: Low / (Middle + High). For Europe, that means:

$Odds = \dfrac{220}{260+124} = 0.573$

And for Asia, that means:

$Odds = \dfrac{43}{64+44} = 0.398$

Thus, the odds ratio is:

$OR = \dfrac{0.573}{0.398} = 1.439$

Interpretation: the odds of scoring Low rather than Middle or High are 1.439 times greater for first_auth_geog in Europe than for first_auth_geog in Asia.

It works out a little differently when North America is considered the treatment group:

$Odds = \dfrac{217}{221+79} = 0.723$

Thus this odds ratio is:

$OR = \dfrac{0.573}{0.723} = 0.792$

Interpretation: the odds of scoring Low rather than Middle or High are 0.792 times as great for first_auth_geog in Europe as for first_auth_geog in North America. This, of course, means first_auth_geog in Europe are less likely to score Low because $OR < 1.00$.
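
The same arithmetic can be scripted. A minimal sketch that re-enters three columns of the frequency table above as a matrix (the object name geog.tab is invented for this example):

# Cumulative odds of scoring Low (vs. Middle or High), from the table above
geog.tab <- matrix(c(220, 260, 124,    # Europe
                      43,  64,  44,    # Asia
                     217, 221,  79),   # North Amer
                   nrow = 3,
                   dimnames = list(c("Low", "Middle", "High"),
                                   c("Europe", "Asia", "North Amer")))
odds.low <- geog.tab["Low", ] / (geog.tab["Middle", ] + geog.tab["High", ])
round(odds.low, 3)                                                # 0.573 0.398 0.723
round(odds.low["Europe"] / odds.low[c("Asia", "North Amer")], 3)  # 1.439 0.792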

The information above is derived from the ftable command and can be compared to the output of the ordered logistic regression (polr comes from the MASS package; confint supplies the confidence intervals):

library("MASS")
ftable(xtabs(~ mean.rs + first_auth_geog, data = dec_sent))
m  <- polr(mean.rs ~ first_auth_geog, data = dec_sent, Hess = TRUE)
ci <- confint(m)
round(exp(cbind(OR = coef(m), ci)), 3)

References

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. Los Angeles, CA: Sage.
