Examples adapted from the following work:
Pagano, R. R. (2002). Understanding statistics in the behavioral sciences (6th ed.). Belmont, CA: Wadsworth. See pages 135–136.
$Y' = {b_Y}X + a_Y$
where
$b_Y = \dfrac{\sum{XY} - \dfrac{(\sum{X})(\sum{Y})}{N}}{\sum{X^2} - \dfrac{(\sum{X})^2}{N}}$
and
$a_Y = \bar{Y} - {b_Y}\bar{X}$
The goal is to predict height in inches at age 20 based on height at age 3.
Individual No. | Height at Age 3, X (in.) | Height at Age 20, Y (in.) |
---|---|---|
1 | 30 | 59 |
2 | 30 | 63 |
3 | 32 | 62 |
4 | 33 | 67 |
5 | 34 | 65 |
6 | 35 | 61 |
7 | 36 | 69 |
8 | 38 | 66 |
9 | 40 | 68 |
10 | 41 | 65 |
11 | 41 | 73 |
12 | 43 | 68 |
13 | 45 | 71 |
14 | 45 | 74 |
15 | 47 | 71 |
16 | 48 | 75 |
Enter the data into R:
```r
x <- c(30, 30, 32, 33, 34, 35, 36, 38, 40, 41, 41, 43, 45, 45, 47, 48)
y <- c(59, 63, 62, 67, 65, 61, 69, 66, 68, 65, 73, 68, 71, 74, 71, 75)
```
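As a quick sanity check (not part of the original example), the slope and intercept can also be computed directly from the raw-score formulas above; the results should match the lm() output that follows:

```r
# Slope and intercept from the raw-score formulas above
N   <- length(x)
b.y <- (sum(x * y) - (sum(x) * sum(y)) / N) / (sum(x^2) - (sum(x)^2) / N)
a.y <- mean(y) - b.y * mean(x)
b.y  # ~0.664, the slope reported by lm() below
a.y  # ~41.679, the intercept reported by lm() below
```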
Fit the model using `lm()` and examine the summary:
```r
fit.1 <- lm(y ~ x)
summary(fit.1)
```

```
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9068 -1.9569 -0.3841  1.7136  4.1113 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.6792     4.4698   9.325 2.21e-07 ***
x             0.6636     0.1144   5.799 4.61e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.654 on 14 degrees of freedom
Multiple R-squared:  0.7061,	Adjusted R-squared:  0.6851 
F-statistic: 33.63 on 1 and 14 DF,  p-value: 4.611e-05
```
Use the coefficients to create the regression equation: $Y' = {b_Y}X + a_Y = 0.664X + 41.679$
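As an illustration, the fitted model can generate the prediction the example is after; the 36-inch value below is simply chosen for demonstration:

```r
# Predicted height at age 20 for a child 36 inches tall at age 3
predict(fit.1, newdata = data.frame(x = 36))

# Same result by hand with the rounded coefficients
0.664 * 36 + 41.679  # roughly 65.6 inches
```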
Visualize the regression with a shaded 95% confidence region:
```r
library(ggplot2)
dat <- data.frame(x, y)
p <- ggplot(dat, aes(x, y))
p + geom_point() + geom_smooth(method = lm)
```
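For comparison, a minimal base-graphics version of the same scatterplot with the fitted line (without the shaded confidence region that geom_smooth() adds):

```r
# Base R equivalent: scatterplot with the least-squares line overlaid
plot(x, y, xlab = "Height at age 3 (in.)", ylab = "Height at age 20 (in.)")
abline(fit.1)
```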
`summary(fit.1)` reports the p-value for the F-statistic. Another way to test the regression of Y on X is to compare the observed F-statistic to its critical value in an F-distribution table, if for no other reason than to help develop the intuition involved. Although the F-statistic is reported by `summary(fit.1)`, per Pedhazur (1997) it can also be derived by dividing the regression mean square (the regression sum of squares divided by its degrees of freedom) by the residual mean square (the residual sum of squares divided by its degrees of freedom). The sums of squares are not reported by `summary(fit.1)`, but they are reported by passing the fitted model to `aov()`:
```r
aov(fit.1)
```

```
Call:
   aov(formula = fit.1)

Terms:
                        x Residuals
Sum of Squares  236.83824  98.59926
Deg. of Freedom         1        14

Residual standard error: 2.653828
Estimated effects may be unbalanced
```
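The same sums of squares, along with the mean squares, F-statistic, and p-value, are also available from the ANOVA table for the fitted model; a brief alternative not used in the original example:

```r
# ANOVA table for the fitted model: Sum Sq, Mean Sq, F value, Pr(>F)
anova(fit.1)
```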
Use these values to confirm the F-statistic:
$F = \dfrac{\dfrac{SS_{reg}}{df_1}}{\dfrac{SS_{res}}{df_2}} = \dfrac{\dfrac{236.83824}{1}}{\dfrac{98.59926}{14}} = 33.63$
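The same confirmation can be done in R, including the comparison to the critical value mentioned above (assuming the conventional alpha = .05):

```r
# F-statistic from the sums of squares reported by aov(fit.1)
F.obs <- (236.83824 / 1) / (98.59926 / 14)
F.obs  # 33.63, matching summary(fit.1)

# Critical value of F with 1 and 14 degrees of freedom at alpha = .05
qf(0.95, df1 = 1, df2 = 14)  # about 4.60; the observed F far exceeds it

# p-value for the observed F
pf(F.obs, df1 = 1, df2 = 14, lower.tail = FALSE)  # 4.611e-05
```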