02-20-2024

If you need help with R, then a search engine is your friend.
However, since “R” is a common term, I precede many of my web searches
on R with the term **r stat [query]**, and that’s worked
well for me.

But R also has built-in help, documentation, and examples. To get help on
a command or function, use the question mark followed by the name of a
function or library, or use two question marks to search for a topic. In
the example that follows, I am pulling up the specific documentation for
the `lm` function. In the second statement, I’m searching the
documentation for the term `chisquare`.

```
?lm
??chisquare
```
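A few other built-in helpers complement `?` and `??`. The following is a minimal sketch; the search string `chisq` and the choice of `mean` are only illustrations:

```r
# List objects on the search path whose names contain "chisq"
matches <- apropos("chisq")
matches

# help() is the long form of the question mark: same page as ?lm
help("lm")

# Run the examples listed at the bottom of a function's help page
example("mean")
```

`apropos()` is handy when you remember part of a function name but not the whole thing.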

The main R website is incredibly useful, too. The site contains links to FAQs, manuals, books, and an academic journal dedicated to R.

R includes base libraries (functions and data) in the default
install. These are called **Base R** or **R
Base**. A comprehensive list of these packages is on the web at
the link below, and the links to each package on that site retrieve the
same documentation that you access when you use the `?` commands above:

https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html

Eventually you will want to install additional libraries to add
functionality. To do so, use the following command, and replace
`package` with the name of the package to install:

`install.packages("package")`

Once installed, to load a package into the workspace so that you can
use it, you use the `library` command:

`library("package")`
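A common pattern is to install a package only when it is missing, and then load it. This is a minimal sketch rather than the only way to do it. The name stored in `pkg` is a placeholder; I use `stats` here because it ships with R, so nothing is actually downloaded:

```r
pkg <- "stats"  # placeholder: replace with the package you need

# requireNamespace() returns FALSE if the package is not installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)
```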

Periodically you may need or want to update all packages that are installed:

`update.packages()`

Packages and R itself are written by people who work in different industries, but many work in academia. It’s thus appreciated if we cite R and the packages that we use in our analyses in our publications. To get citation information for packages (note the version number in the citation info):

```
citation() # to cite R itself
citation("package")
```
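If you manage references with BibTeX, the citation can be converted directly. A small sketch, using `stats` only as an example package name:

```r
# Convert the citation for R itself into a BibTeX entry
r_bib <- toBibtex(citation())
r_bib

# Report the installed version of a package alongside its citation
pkg_version <- packageVersion("stats")
pkg_version
```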

When we write code, we use a text editor (not a word processor). RStudio provides a built-in text editor, but otherwise, which one you choose is based on personal preference. Atom is a popular, cross-platform text editor that has good integration with Git / GitHub. GitHub is a site to manage and store code. It’s also a popular place to find additional R packages.

When writing in a text editor, use the `#` to write
comments to document what you’re doing along the way. For example, here
is a comment followed by a short mathematical statement:

```
# I will add 2 + 2 together:
2 + 2
```

R uses several types of data structures, typically referred to as
**objects**. The hardest part of R, in my opinion, is
learning how to deal with these data objects and applying them to messy
data to organize your data for analysis. It depends on the research and
the researcher, of course, but if I had to estimate the amount of time I
spend on a project that involves R, I’d say that, on average, I
spend:

- 50% of my time organizing and cleaning data
- 35% of my time analyzing the data
- 15% of my time preparing the data and the output (for example, polishing plots for publication)

The main data objects follow. Today we’ll introduce ourselves to a few of these:

- Vectors: cf. column of data of the same type
- Lists: “an object consisting of an ordered collection of objects known as its components”; or, a mixed bag of data
- Arrays: a multidimensional structure of the same data type
- Matrices: a two dimensional structure of the same data type
- Factors: “a vector object used to specify a discrete classification”
- Data Frames: “a list with a class ‘data.frame’”; or, kind of like a spreadsheet

For more details (and attribution for quotes), see: An Introduction to R
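To make the list above concrete, here is a minimal sketch that creates one object of each kind. All variable names and values are made up for illustration:

```r
v <- c(1.5, 2.5, 3.5)                               # vector: one data type
l <- list(name = "sample", values = v, flag = TRUE) # list: a mixed bag
m <- matrix(1:6, nrow = 2, ncol = 3)                # matrix: two dimensions
a <- array(1:24, dim = c(2, 3, 4))                  # array: three dimensions here
f <- factor(c("low", "high", "low"))                # factor: discrete categories
d <- data.frame(id = 1:3, score = v)                # data frame: like a spreadsheet

str(l)  # compact summary of the list's components
dim(m)  # rows and columns of the matrix
dim(a)  # extent of each array dimension
str(d)  # column names and types of the data frame
```

`str()` is a quick way to inspect the structure of any of these objects.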

An R vector may hold data of one of several **classes**,
such as numeric, character, and logical, but all elements of a single
vector share the same class. Vectors may also include
missing data, which are indicated by `NA`.
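A short sketch of what happens when classes are mixed in one vector, and how `NA` affects a calculation. The variable names are illustrative:

```r
# Mixing classes in c() coerces every element to a common class
mixed <- c(1, "two", TRUE)
class(mixed)  # "character"

# Missing values propagate through calculations unless removed
scores <- c(1, 2, NA, 3)
is.na(scores)               # FALSE FALSE TRUE FALSE
mean(scores)                # NA
mean(scores, na.rm = TRUE)  # 2
```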

In the statements below, I use `<-` to make an
assignment. It assigns the values on the right side of the statement to
the variable name on the left side of the statement. Thus, in the first
two statements below, we declare the variable `x` to hold the
value `1`, and the variable `y` to hold multiple
values from `1` to `4`, inclusive. Sometimes
you’ll see examples where the equal sign `=` is used instead
of `<-` for assignment. It’s a matter of preference,
but `<-` helps to distinguish assignment operations from
other operations where the equal sign has a different use.

```
x <- 1 # numeric vector with one element
y <- c(1,2,3,4) # numeric vector with multiple elements
z <- c("one", "two", "three") # character vector
x1 <- c(TRUE, FALSE, TRUE, FALSE) # logical vector
y1 <- c(1,2,NA,3) # numeric vector with one missing element
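x <- rnorm(20, 1, 0.5) # reassign x: 20 random normal values (mean 1, sd 0.5)
y <- rnorm(20, 2, 0.7) # reassign y: 20 random normal values (mean 2, sd 0.7)
z <- x * y # element-wise product of x and y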
```

All data objects, including vectors, are indexed. To retrieve a specific element in a vector by its index, use square brackets:

```
x[1] # retrieves the first element in the vector x
y[4] # retrieves the fourth element in the vector y
```
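Square brackets also accept ranges, negative indices, and logical conditions. A minimal sketch with a made-up vector `v`:

```r
v <- c(10, 20, 30, 40, 50)

v[2:4]      # elements 2 through 4: 20 30 40
v[c(1, 5)]  # first and fifth elements: 10 50
v[-1]       # everything except the first element: 20 30 40 50
v[v > 25]   # elements that satisfy a condition: 30 40 50
```

Note that R indexes from 1, not 0.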

Vectors may include data that belongs to the **factor**
class. Factors may be unordered or ordered. Whether factors need to be
ordered depends on the research and data, of course. For example, we
don’t order **female** or **male**, but we may
order grade levels.

Data is not necessarily in the desired class when entered into R, and
we use the `class` function to see how data is classed and
then change it if needed:

**unordered:**

```
attendance <- c("Yes", "No", "No", "No", "Yes", "Yes", "No")
class(attendance)
attendance <- as.factor(attendance)
```

**ordered:**

```
# enter data
grades <- c("C", "A", "B", "A", "B", "B", "A", "C", "B")
grades
class(grades)
table(grades)
# change data class
grades <- factor(grades, levels = c("A", "B", "C"), ordered = TRUE)
grades
class(grades)
table(grades)
```
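Ordering a factor is what makes comparisons between its levels meaningful. A small sketch with an invented `sizes` factor:

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

levels(sizes)      # "small" "medium" "large"
as.integer(sizes)  # underlying integer codes: 1 3 2 1
sizes < "large"    # TRUE FALSE TRUE TRUE
max(sizes)         # the highest level present
```

The comparison and `max()` calls would fail or return `NA` on an unordered factor.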

We can import data from spreadsheets, CSV files, etc., but sometimes we create data by combining separate vectors. For illustration, let’s create two vectors:

- `x` will be a numeric vector that contains 20 elements, normally distributed with a mean of 83 and an *sd* of 3.0.
- `y` will be a logical vector, TRUE if the corresponding element of `x` is greater than 83, FALSE otherwise.

```
x <- rnorm(20, 83, 3)
x
y <- x > 83
y
y <- as.factor(y)
y
z <- data.frame(x, y)
z
plot(z$x, z$y)
# Reversing the order of the variables or changing the syntax may produce
# different plots:
plot(z$x ~ z$y) # written in formula notation
plot(z$y, z$x) # same as above
```

**Note 1:** R often uses formula notation/syntax:
`response variable ~ predictor variable`, or:
`dependent variable ~ independent variable`, or: `y ~ x`.
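A formula is itself an R object that can be stored and inspected. This sketch assumes the built-in `cars` data set, whose columns are `speed` and `dist`:

```r
f <- dist ~ speed  # response ~ predictor
class(f)           # "formula"
all.vars(f)        # "dist" "speed"

# With data = ..., the variable names are looked up in the data frame
fit_f <- lm(f, data = cars)
coef(fit_f)
```

Passing `data =` avoids repeating the data frame name with `$` on both sides of the `~`.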

**Note 2:** The dollar sign is used to refer to a
variable in a data frame (think of a column in a spreadsheet). Above,
`x` is a variable (of any length) in the `z` data
frame.

**Note 3:** To retrieve a specific element in a data
frame, use the square brackets and the index for both row and
column:

```
z[1,1] # retrieve first element in first column
z[1,2] # retrieve first element in second column
z[3,2] # retrieve third element in second column
z[3,] # retrieve third row
z[,2] # retrieve second column
```
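Columns can also be selected by name, and rows by a logical condition. A minimal sketch with a made-up data frame `df`:

```r
df <- data.frame(score = c(80, 85, 90),
                 passed = c(FALSE, TRUE, TRUE))

df[2, "score"]              # row 2 of the score column: 85
df[["score"]]               # the same column that df$score returns
df[df$passed, ]             # only the rows where passed is TRUE
df[df$score > 82, "score"]  # scores above 82: 85 90
```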

We can do mathematical calculations on all elements of a vector (or variable in a data frame) at once:

```
z$x + 100
z$x - 100
z$x * 100
z$x / 100
(z$x - 1) / 100
```

A **natural log** (base *e*) transformation of a vector:

`log(z$x)`

A **log10** transformation of a vector:

`log10(z$x)`

When using the logarithm, be careful if there are zeroes in the vector, because `log(0)` returns `-Inf`:

```
n <- 0:10
n
log(n)
```

Instead:

```
n <- 0:10
log(n + 1)
log10(n + 1)
```
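Base R has a few related logarithm functions worth knowing. A brief sketch with an illustrative vector `n0`:

```r
n0 <- 0:3

log(0)               # -Inf, which is why zeroes need care
log1p(n0)            # computes log(1 + x); same result as log(n0 + 1)
log2(8)              # base-2 logarithm: 3
log(100, base = 10)  # any base via the base argument: 2
```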

Basic math on each element in a vector:

```
z$x - 1
z$x * 2
z$x / 3
(z$x + 100) * 3
exp(z$x) # Euler's number raised to x
pi * z$x
```

Some example statistical analyses:

```
set.seed(1) # used to control random generation :)
x <- rnorm(100, 50, 1)
summary(x)
round(x)
min(x)
max(x)
max(x) - min(x)
sum(x)
rank(x)
quantile(x)
var(x)
sd(x)
mean(x)
median(x)
length(x)
hist(x)
plot(density(x))
```
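To see what `var()` and `sd()` compute, here is the sample variance worked out by hand. The names `x_demo`, `var_manual`, and `sd_manual` are illustrative:

```r
set.seed(1)
x_demo <- rnorm(100, 50, 1)
n_obs <- length(x_demo)

# Sample variance: sum of squared deviations divided by n - 1
var_manual <- sum((x_demo - mean(x_demo))^2) / (n_obs - 1)

# Standard deviation is the square root of the variance
sd_manual <- sqrt(var_manual)

all.equal(var_manual, var(x_demo))  # TRUE
all.equal(sd_manual, sd(x_demo))    # TRUE
```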

Next, three types of correlation. For this example, we’ll first read in CSV data from Kaggle:

```
health <- read.table(file = "weight-height.csv",
                     header = TRUE,
                     sep = ",")
head(health)
cor(health$Height, health$Weight, method = "pearson")
cor(health$Height, health$Weight, method = "spearman")
cor(health$Height, health$Weight, method = "kendall")
boxplot(health$Weight ~ health$Gender)
boxplot(health$Height ~ health$Gender)
```

**Note 1**: When reading data from the file system, it
helps if the data is located in the same part of the file system (i.e.,
folder) that you used to start R. Otherwise, you need to specify the
path to the data in `read.table` (and similar commands).
Specifying the path is OS-dependent.

To see the current working directory or change it:

```
getwd()
setwd('/home/user/workspace/project1') # e.g., Linux
setwd('/Users/user/workspace/project1') # e.g., macOS
setwd('C:/file/path') # e.g., Windows
```
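One way to sidestep some of the OS differences is to build paths with `file.path()`. In this sketch the `data` folder is hypothetical:

```r
# file.path() joins path components with a forward slash,
# which R accepts on Linux, macOS, and Windows
csv_path <- file.path("data", "weight-height.csv")  # "data" is a hypothetical folder
csv_path

# Check that the file is where you expect before trying to read it
file.exists(csv_path)
```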

Linear regression statements take the form of
`model <- lm(DV ~ IV)` or `model <- lm(y ~ x)`, where “model” is simply
the name you choose to refer to the model.

```
head(cars)
plot(cars$dist ~ cars$speed)
abline(lm(cars$dist ~ cars$speed))
cor(cars$speed, cars$dist, method = "pearson")
fit <- lm(cars$dist ~ cars$speed)
plot(fit) # model diagnostics
summary(fit)
confint(fit)
anova(fit)
```

**Note 1:** `cars` is a data set that is part
of **base R**.
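Once a model is fitted, its coefficients can be extracted and used for prediction. A minimal sketch on the same `cars` data; the new speeds of 10 and 20 are arbitrary, and `fit_cars` is an illustrative name:

```r
fit_cars <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line
b <- coef(fit_cars)
b

# Predicted stopping distance for two new speeds
new_speeds <- data.frame(speed = c(10, 20))
pred <- predict(fit_cars, newdata = new_speeds)
pred

# The same prediction computed by hand: intercept + slope * speed
b[["(Intercept)"]] + b[["speed"]] * c(10, 20)
```

`predict()` needs `newdata` to have a column named exactly like the predictor in the formula, which is one reason to fit with `data =` rather than `cars$speed`.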

**Chi-square test of independence:**

```
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
                    party = c("Democrat","Independent", "Republican"))
(Xsq <- chisq.test(M)) # parentheses print the test summary
Xsq$observed # observed counts (same as M)
Xsq$expected # expected counts under the null
Xsq$residuals # Pearson residuals
Xsq$stdres # standardized residuals
```
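The expected counts under the null come from the row and column totals: each cell is (row total × column total) / grand total. A sketch that reproduces them by hand, using the same counts as above in an illustrative matrix `M_counts`:

```r
M_counts <- rbind(c(762, 327, 468), c(484, 239, 477))

# outer() multiplies every row total by every column total
expected_manual <- outer(rowSums(M_counts), colSums(M_counts)) / sum(M_counts)
expected_manual

# Compare with what chisq.test() reports
all.equal(expected_manual, chisq.test(M_counts)$expected,
          check.attributes = FALSE)  # TRUE
```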

Here we have prior data and so we don’t assume the null probabilities are equal across all categories of observations:

```
count <- c(399, 193, 63, 82, 13)
null_probs <- c(0.53, 0.32, 0.08, 0.05, 0.02)
chisq.test(count, p = null_probs)
```

If we do test for equal null probabilities:

```
new_count <- c(20, 39, 31)
chisq.test(new_count)
```

**One Sample t-test:**

```
mean(health$Weight)
t.test(health$Weight, mu = 150)
t.test(health$Weight, mu = 150, alternative = "greater")
```
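The object returned by `t.test` can be stored, and its parts pulled out by name. This sketch uses the built-in `cars` data so that it runs without the Kaggle file; the hypothesized mean of 15 is arbitrary:

```r
res <- t.test(cars$speed, mu = 15)

res$statistic  # the t statistic
res$p.value    # the p-value
res$conf.int   # 95% confidence interval for the mean
res$estimate   # the sample mean
```

The same components are available for the two sample and paired tests.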

**Two Sample t-test:**

```
L <- health$Weight - 10
t.test(health$Weight, L) # if variances are unequal
t.test(health$Weight, L, var.equal = TRUE)
```

**Paired t-test:**

```
btemp1 <- beaver1$temp
length(btemp1)
btemp2 <- beaver2$temp
length(btemp2)
set.seed(1)
btemp1a <- sample(btemp1, 100, replace = FALSE)
t.test(btemp1a, btemp2, paired = TRUE)
boxplot(btemp1a, btemp2)
```
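A paired t-test is equivalent to a one sample t-test on the pairwise differences. A sketch using the built-in `sleep` data set, in which the same ten subjects appear in both groups:

```r
a <- sleep$extra[sleep$group == 1]
b2 <- sleep$extra[sleep$group == 2]

paired_res <- t.test(a, b2, paired = TRUE)
diff_res <- t.test(a - b2, mu = 0)

# Both approaches give the same p-value
all.equal(paired_res$p.value, diff_res$p.value)  # TRUE
```

This is also why both vectors must have the same length and the same ordering of subjects.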

**One way ANOVA**

```
x <- c(2,3,7,2,6,10,8,7,5,10,10,13,14,13,15)
g <- c(rep("group1", 5), rep("group2", 5), rep("group3", 5))
xg <- data.frame(x, g)
xg
fit.a <- aov(x ~ g, data = xg)
summary(fit.a)
plot(xg$x ~ xg$g)
fit.tukey <- TukeyHSD(fit.a)
fit.tukey # print the pairwise comparisons
```

When you quit R, you are given the option to save your workspace. To quit, use the following command:

`q()`

But you may want to save your data to something like a CSV file, especially if you’ve organized and cleaned it up for analysis. To do so:

`write.table(beaver1, file = "beaver1.csv", sep = ",", quote = TRUE)`

**Note 1**: The `beaver1` and `beaver2` datasets are provided with **base
R**.
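A quick way to confirm that a file was written as intended is to read it back in. This sketch writes to a temporary file so that nothing in your working directory is touched. It uses `write.csv` and `read.csv`, which are convenience wrappers around `write.table` and `read.table`:

```r
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers out of the file
write.csv(beaver1, file = out_file, row.names = FALSE)

beaver1_copy <- read.csv(out_file)
dim(beaver1_copy)   # same number of rows and columns as beaver1
head(beaver1_copy)
```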

Read CSV from the web:

```
boston <- read.table("https://data.boston.gov/dataset/00c015a1-2b62-4072-a71e-79b292ce9670/resource/9fdbdcad-67c8-4b23-b6ec-861e77d56227/download/tmpilkz66jf.csv",
header = TRUE,
sep = ",",
quote = "\"")
```

Read from other sources, such as Excel, SPSS, SAS, and Stata.

Read from Google Sheets, relational databases, such as MySQL, and other sources.