# C. Sean Burns: Notebook

### Linear Regression

Examples adapted from the following work:

Pagano, R. R. (2002). Understanding statistics in the behavioral sciences (6th ed.). Belmont, CA: Wadsworth. See pages 135 - 136.

#### Linear regression equation

$Y' = {b_Y}X + a_Y$

where

1. $Y'$ = predicted $Y$
2. $b_Y$ = slope
3. $a_Y$ = intercept

And:

$b_Y = \dfrac{\sum{XY} - \dfrac{(\sum{X})(\sum{Y})}{N}}{\sum{X^2} - \dfrac{(\sum{X})^2}{N}}$

And:

$a_Y = \bar{Y} - {b_Y}\bar{X}$

#### Sample Data

The goal is to predict height in inches at age 20 based on height at age 3.

| Individual No. | Height at Age 3, X (in.) | Height at Age 20, Y (in.) |
|---:|---:|---:|
| 1 | 30 | 59 |
| 2 | 30 | 63 |
| 3 | 32 | 62 |
| 4 | 33 | 67 |
| 5 | 34 | 65 |
| 6 | 35 | 61 |
| 7 | 36 | 69 |
| 8 | 38 | 66 |
| 9 | 40 | 68 |
| 10 | 41 | 65 |
| 11 | 41 | 73 |
| 12 | 43 | 68 |
| 13 | 45 | 71 |
| 14 | 45 | 74 |
| 15 | 47 | 71 |
| 16 | 48 | 75 |

Enter the data into R:

```r
x <- c(30, 30, 32, 33, 34, 35, 36, 38, 40, 41, 41, 43, 45, 45, 47, 48)
y <- c(59, 63, 62, 67, 65, 61, 69, 66, 68, 65, 73, 68, 71, 74, 71, 75)
```

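Before fitting the model, the slope and intercept can be computed directly from the sums in the formulas above, as a check (a sketch, assuming the `x` and `y` vectors just entered):

```r
# Hand-compute the slope and intercept from the sums in the formulas above
N   <- length(x)
b_y <- (sum(x * y) - (sum(x) * sum(y)) / N) / (sum(x^2) - (sum(x)^2) / N)
a_y <- mean(y) - b_y * mean(x)
b_y  # slope, about 0.6636
a_y  # intercept, about 41.6792
```

These values should match the coefficients reported by `lm()` below.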
#### Build the Model

Run the model using `lm()`:

```r
fit.1 <- lm(y ~ x)
summary(fit.1)
```

```
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-3.9068 -1.9569 -0.3841  1.7136  4.1113

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  41.6792     4.4698   9.325 2.21e-07 ***
x             0.6636     0.1144   5.799 4.61e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.654 on 14 degrees of freedom
Multiple R-squared:  0.7061,	Adjusted R-squared:  0.6851
F-statistic: 33.63 on 1 and 14 DF,  p-value: 4.611e-05
```

Use the coefficients to create the regression equation: $Y' = {b_Y}X + a_Y = 0.664X + 41.679$
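The fitted model can also generate predictions directly. As a sketch, `predict()` applied to `fit.1` gives the expected height at age 20 for a child who is 40 inches tall at age 3:

```r
# Predicted height at age 20 for a height of 40 in. at age 3
predict(fit.1, newdata = data.frame(x = 40))  # about 68.2 in.
```

This matches evaluating the equation by hand: $0.664(40) + 41.679 \approx 68.2$.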

#### Visualize the Model

Visualize the regression with a shaded 95% confidence region:

```r
library(ggplot2)
dat <- data.frame(x, y)
p   <- ggplot(dat, aes(x, y))
p + geom_point() + geom_smooth(method = lm)
```
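If ggplot2 is unavailable, a base R sketch of the same scatterplot with the fitted line uses `plot()` and `abline()`:

```r
# Base R alternative: scatterplot with the fitted regression line
plot(x, y,
     xlab = "Height at age 3 (in.)",
     ylab = "Height at age 20 (in.)")
abline(fit.1)  # draws the line Y' = 0.664X + 41.679
```

Unlike `geom_smooth()`, base R does not shade the confidence region; that band would have to be constructed separately with `predict(fit.1, interval = "confidence")`.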

#### Test $Y$ on $X$

`summary(fit.1)` reports the p-value for the F-statistic. Another way to test the regression of Y on X is to compare the observed F-statistic to its critical value in an F-distribution table, if for no other reason than to help develop the intuition involved. Although the F-statistic is reported by `summary(fit.1)`, per Pedhazur (1997) it can also be derived by dividing the regression sum of squares by its degrees of freedom, dividing the residual sum of squares by its degrees of freedom, and taking the ratio of the two. The sums of squares are not reported by `summary(fit.1)`, but they are reported by applying `aov()` to the model:

```r
aov(fit.1)
```

```
Call:
   aov(formula = fit.1)

Terms:
                        x Residuals
Sum of Squares  236.83824  98.59926
Deg. of Freedom         1        14

Residual standard error: 2.653828
Estimated effects may be unbalanced
```

Use these values to confirm the F-statistic:

$F = \dfrac{\dfrac{SS_{reg}}{df_1}}{\dfrac{SS_{res}}{df_2}} = \dfrac{\dfrac{236.83824}{1}}{\dfrac{98.59926}{14}} = 33.63$
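R can stand in for the F-distribution table here: `qf()` returns the critical value for $\alpha = .05$ with 1 and 14 degrees of freedom, and `pf()` recovers the p-value reported by `summary(fit.1)` (a sketch, using the sums of squares from `aov(fit.1)` above):

```r
# Observed F from the sums of squares reported by aov(fit.1)
F_obs <- (236.83824 / 1) / (98.59926 / 14)
F_obs                                             # about 33.63

# Critical value at alpha = .05 with df1 = 1, df2 = 14
qf(0.95, df1 = 1, df2 = 14)                       # about 4.60

# Since 33.63 > 4.60, reject the null hypothesis; pf() gives the p-value
pf(F_obs, df1 = 1, df2 = 14, lower.tail = FALSE)  # about 4.61e-05
```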