r:linear-regression

Examples adapted from the following work:

Pagano, R. R. (2002). *Understanding statistics in the behavioral sciences* (6th ed.). Belmont, CA: Wadsworth. See pages 135–136.

$Y' = {b_Y}X + a_Y$

*where*

- $Y'$ = predicted $Y$
- $b_Y$ = slope
- $a_Y$ = intercept

And:

$b_Y = \dfrac{\sum{XY} - \dfrac{(\sum{X})(\sum{Y})}{N}}{\sum{X^2} - \dfrac{(\sum{X})^2}{N}}$

And:

$a_Y = \bar{Y} - {b_Y}\bar{X}$

The goal is to predict height in inches at age 20 based on height at age 3.

Individual No. | Height at Age 3, X (in.) | Height at Age 20, Y (in.) |
---|---|---|
1 | 30 | 59 |
2 | 30 | 63 |
3 | 32 | 62 |
4 | 33 | 67 |
5 | 34 | 65 |
6 | 35 | 61 |
7 | 36 | 69 |
8 | 38 | 66 |
9 | 40 | 68 |
10 | 41 | 65 |
11 | 41 | 73 |
12 | 43 | 68 |
13 | 45 | 71 |
14 | 45 | 74 |
15 | 47 | 71 |
16 | 48 | 75 |

Enter the data into R:

```
x <- c(30, 30, 32, 33, 34, 35, 36, 38, 40, 41, 41, 43, 45, 45, 47, 48)
y <- c(59, 63, 62, 67, 65, 61, 69, 66, 68, 65, 73, 68, 71, 74, 71, 75)
```
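As a check on the computational formulas above, $b_Y$ and $a_Y$ can also be computed directly from the sums in base R. This is a sketch; the names `b_y` and `a_y` are chosen here for illustration:

```r
# Data from the table above
x <- c(30, 30, 32, 33, 34, 35, 36, 38, 40, 41, 41, 43, 45, 45, 47, 48)
y <- c(59, 63, 62, 67, 65, 61, 69, 66, 68, 65, 73, 68, 71, 74, 71, 75)
n <- length(x)

# Slope: the computational formula for b_Y
b_y <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)

# Intercept: a_Y = mean(Y) - b_Y * mean(X)
a_y <- mean(y) - b_y * mean(x)

round(b_y, 4)  # 0.6636
round(a_y, 4)  # 41.6792
```

These match the coefficients that `lm` reports below.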

Run the model using `lm`:

```
fit.1 <- lm(y ~ x)
summary(fit.1)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9068 -1.9569 -0.3841  1.7136  4.1113

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  41.6792     4.4698   9.325 2.21e-07 ***
x             0.6636     0.1144   5.799 4.61e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.654 on 14 degrees of freedom
Multiple R-squared:  0.7061,	Adjusted R-squared:  0.6851
F-statistic: 33.63 on 1 and 14 DF,  p-value: 4.611e-05
```

Use the coefficients to create the regression equation: $Y' = {b_Y}X + a_Y = 0.664X + 41.679$
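The fitted model applies this equation directly via `predict()`. As an illustration (the 36-in. case below is hypothetical, not from Pagano), predict height at age 20 for a child who is 36 in. tall at age 3:

```r
# Refit the model from the data above
x <- c(30, 30, 32, 33, 34, 35, 36, 38, 40, 41, 41, 43, 45, 45, 47, 48)
y <- c(59, 63, 62, 67, 65, 61, 69, 66, 68, 65, 73, 68, 71, 74, 71, 75)
fit.1 <- lm(y ~ x)

coef(fit.1)  # the same b_Y and a_Y as in the equation above

# Predicted height at age 20 for a hypothetical child 36 in. tall at age 3
predict(fit.1, newdata = data.frame(x = 36))  # about 65.57 in.
```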

Visualize the regression with a shaded 95% confidence region:

```
library(ggplot2)
dat <- data.frame(x, y)
p <- ggplot(dat, aes(x, y))
p + geom_point() + geom_smooth(method = lm)
```

`summary(fit.1)` reports the p-value for the F-statistic. Another way to test the regression of Y on X is to compare the observed F-statistic to its critical value in an F-distribution table, if for no other reason than to help develop the intuition involved. Although the F-statistic is reported by `summary(fit.1)`, per Pedhazur (1997) it can also be derived by dividing the *regression sum of squares* by its degrees of freedom, dividing the *residual sum of squares* by its degrees of freedom, and taking the ratio of the two. The sums of squares are not reported by `summary(fit.1)`, but they are reported by fitting *anova* to the model:

```
aov(fit.1)
Call:
   aov(formula = fit.1)

Terms:
                        x Residuals
Sum of Squares  236.83824  98.59926
Deg. of Freedom         1        14

Residual standard error: 2.653828
Estimated effects may be unbalanced
```

Use these values to confirm the F-statistic:

$F = \dfrac{\dfrac{SS_{reg}}{df_1}}{\dfrac{SS_{res}}{df_2}} = \dfrac{\dfrac{236.83824}{1}}{\dfrac{98.59926}{14}} = 33.63$
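The same confirmation, including the table-lookup step, can be sketched in R: `anova()` returns the sums of squares and degrees of freedom as a table, and `qf()` gives the critical value of F(1, 14) at the .05 level (the variable names here are illustrative):

```r
# Refit the model from the data above
x <- c(30, 30, 32, 33, 34, 35, 36, 38, 40, 41, 41, 43, 45, 45, 47, 48)
y <- c(59, 63, 62, 67, 65, 61, 69, 66, 68, 65, 73, 68, 71, 74, 71, 75)
fit.1 <- lm(y ~ x)

tab <- anova(fit.1)  # rows: x (regression) and Residuals

# F = (SS_reg / df_1) / (SS_res / df_2)
f_obs  <- (tab$`Sum Sq`[1] / tab$Df[1]) / (tab$`Sum Sq`[2] / tab$Df[2])
f_crit <- qf(0.95, df1 = 1, df2 = 14)  # critical value at alpha = .05

round(f_obs, 2)   # 33.63, matching summary(fit.1)
round(f_crit, 2)  # 4.60
f_obs > f_crit    # TRUE: reject the null hypothesis
```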

r/linear-regression.txt · Last modified: 2017/02/22 09:25 by seanburns