02-20-2024
If you need help with R, then a search engine is your friend. However, since “R” is a common term, I precede many of my web searches on R with the term r stat [query], and that’s worked well for me.
But R also has built-in help, documentation, and examples. To get help on a command or function, use a question mark followed by the name of the function or library, or use two question marks to search for a topic. In the example that follows, the first statement pulls up the documentation for the lm function, and the second searches the documentation for the term chisquare.
?lm
??chisquare
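A few related helpers can be handy alongside ? and ??. The sketch below is only illustrative: the pattern "chisq" and the choice of mean are my own examples, not part of the original notes.

```r
# Long-form equivalents of ?lm and ??chisquare
help("lm")
help.search("chisquare")

# List objects on the search path whose names contain "chisq"
matches <- apropos("chisq")
matches

# Run the examples from a function's help page
example("mean")
```

apropos is useful when you remember only part of a function name.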
The main R website is incredibly useful, too. The site contains links to FAQs, manuals, books, and an academic journal dedicated to R. See:
R includes base libraries (functions and data) in the default install. These are called Base R or R Base. A comprehensive list of these packages is on the web at the link below, and the links to each package on that site retrieve the same documentation that you access when you use the ? commands above:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html
Eventually you will want to install additional libraries to add functionality. To do so, use the following command, and replace package with the name of the package to install:
install.packages("package")
Once installed, to load a package into the workspace so that you can use it, use the library command:
library("package")
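If a script may run on a machine where a package is not yet installed, one common pattern is to install it only when it is missing. This is a minimal sketch: the variable pkg is my own name, and "tools" (which ships with R) merely stands in for whatever package you need.

```r
# "tools" ships with R and stands in for any package name (illustrative choice)
pkg <- "tools"

# Install only when the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() accept a name stored in a variable
library(pkg, character.only = TRUE)
```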
Periodically you may need or want to update all packages that are installed:
update.packages()
Packages and R itself are written by people who work in different industries, but many work in academia. It’s thus appreciated if we cite R and the packages that we use in our analyses in our publications. To get citation information for packages (note the version number in the citation info):
citation() # to cite R itself
citation("package")
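If you manage references with BibTeX, the citation object can be converted directly. A small sketch follows; "stats" is used only because it is always installed.

```r
cit <- citation()         # citation object for R itself
toBibtex(cit)             # the same citation as a BibTeX entry

packageVersion("stats")   # version of an installed package
R.version.string          # version of R itself
```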
When we write code, we use a text editor (not a word processor). RStudio provides a builtin text editor, but otherwise, which one you choose is based on personal preferences. Atom is a popular, cross-platform text editor that has good integration with Git / GitHub. GitHub is a site to manage and store code. It’s also a popular place to find additional R packages.
When writing in a text editor, use the # character to write comments that document what you’re doing along the way. For example, here is a comment followed by a short mathematical statement:
# I will add 2 + 2 together:
2 + 2
R uses several types of data structures, typically referred to as objects. The hardest part of R, in my opinion, is learning how to deal with these data objects and applying them to messy data to organize your data for analysis. It depends on the research and the researcher, of course, but if I had to estimate the amount of time I spend on a project that involves R, I’d say that, on average, I spend:
The main data objects follow. Today we’ll introduce ourselves to a few of these:
For more details (and attribution for quotes), see: An Introduction to R
An R vector may hold data of one of several classes, such as numeric, character, or logical, but all elements of a given vector share the same class. Vectors may also include missing data, which are indicated by NA.
In the statements below, I use <- to make an assignment. It assigns the value on the right side of the statement to the variable name on the left side. Thus, in the first two statements below, we declare the variable x to hold the value 1 and the variable y to hold multiple values from 1 to 4, inclusive. Sometimes you’ll see examples where the equal sign = is used instead of <- for assignment. It’s a matter of preference, but <- helps to distinguish assignment operations from other operations where the equal sign has a different use.
x <- 1 # numeric vector with one element
y <- c(1,2,3,4) # numeric vector with multiple elements
z <- c("one", "two", "three") # character vector
x1 <- c(TRUE, FALSE, TRUE, FALSE) # logical vector
y1 <- c(1,2,NA,3) # numeric vector with one missing element
x <- rnorm(20, 1, 0.5)
y <- rnorm(20, 2, 0.7)
z <- x * y
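Two behaviors are worth seeing once: R coerces mixed input to a single class, and NA propagates through calculations unless you ask for it to be dropped. The variable names below (mixed, scores) are my own, chosen for illustration.

```r
# Mixing types in c() coerces everything to one class
mixed <- c(1, "two", TRUE)
class(mixed)                 # "character"

# NA marks a missing value and propagates through calculations
scores <- c(1, 2, NA, 3)
is.na(scores)                # FALSE FALSE TRUE FALSE
mean(scores)                 # NA
mean(scores, na.rm = TRUE)   # 2
```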
All data objects, including vectors, are indexed. To retrieve a specific element in a vector by its index, use square brackets:
x[1] # retrieves the first element in the vector x
y[4] # retrieves the fourth element in the vector y
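Square brackets accept more than a single index. The sketch below uses a small vector v of my own so that the results are easy to check by eye.

```r
v <- c(10, 20, 30, 40, 50)

v[2:4]       # elements 2 through 4: 20 30 40
v[c(1, 5)]   # first and fifth elements: 10 50
v[-1]        # everything except the first element: 20 30 40 50
v[v > 25]    # elements that meet a condition: 30 40 50
```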
Vectors may include data that belongs to the factor class. Factors may be unordered or ordered. Whether factors need to be ordered depends on the research and data, of course. For example, we don’t order female or male, but we may order grade levels.
Data is not necessarily in the desired class when entered into R, so we use the class function to see how data is classed and then change it if needed:
unordered:
attendance <- c("Yes", "No", "No", "No", "Yes", "Yes", "No")
class(attendance)
attendance <- as.factor(attendance)
ordered:
# enter data
grades <- c("C", "A", "B", "A", "B", "B", "A", "C", "B")
grades
class(grades)
table(grades)
# change data class
grades <- factor(grades, levels = c("A", "B", "C"), ordered = TRUE)
grades
class(grades)
table(grades)
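Once a factor is ordered, R can compare its values by level. Here is a small sketch with made-up data (the sizes vector is hypothetical):

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

levels(sizes)        # "small" "medium" "large"
as.integer(sizes)    # underlying level codes: 1 3 2 1
sizes < "large"      # TRUE FALSE TRUE TRUE
table(sizes)         # counts reported in level order
```

Comparisons like the third line raise a warning and return NA for unordered factors, which is one practical reason to set ordered = TRUE when the levels have a natural order.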
We can import data from spreadsheets, CSV files, etc, but sometimes we create data by combining separate vectors. For illustration, let’s create two vectors:
x will be a numeric vector that contains 20 elements, normally distributed with a mean of 83 and an sd of 3.0.
y will be a logical vector, TRUE if greater than 83, FALSE if less than 83.
x <- rnorm(20, 83, 3)
x
y <- x > 83
y
y <- as.factor(y)
y
z <- data.frame(x, y)
z
plot(z$x, z$y)
# Reversing the order of the variables or changing the syntax may produce
# different plots:
plot(z$x ~ z$y) # written in formula notation
plot(z$y, z$x) # same as above
Note 1: R often uses formula notation/syntax:
response variable ~ predictor variable, or:
dependent variable ~ independent variable, or:
y ~ x
Note 2: The dollar sign is used to refer to a variable in a data frame (think of a column in a spreadsheet). Above, x is a variable (of any length) in the z data frame.
Note 3: To retrieve a specific element in a data frame, use the square brackets and the index for both row and column:
z[1,1] # retrieve first element in first column
z[1,2] # retrieve first element in second column
z[3,2] # retrieve third element in second column
z[3,] # retrieve third row
z[,2] # retrieve second column
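The same bracket notation also filters rows by a condition. The data frame below (df, with columns score and passed) is a hypothetical example of my own, kept tiny so the results are easy to verify.

```r
df <- data.frame(score  = c(80, 85, 90, 78),
                 passed = c(FALSE, TRUE, TRUE, FALSE))

str(df)                   # compact summary of the columns and their classes
nrow(df)                  # 4
ncol(df)                  # 2

df[df$score > 82, ]       # rows where score is above 82
df[df$passed, "score"]    # scores for rows where passed is TRUE: 85 90
df$score[2]               # second element of the score column: 85
```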
We can do mathematical calculations on all elements of a vector (or variable in a data frame) at once:
z$x + 100
z$x - 100
z$x * 100
z$x / 100
(z$x - 1) / 100
A natural log transformation of a vector:
log(z$x)
A log10 transformation of a vector:
log10(z$x)
When using the logarithm, be careful if there are zeroes in the vector:
n <- 0:10
n
log(n)
Instead:
n <- 0:10
log(n + 1)
log10(n + 1)
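For reference, here is how the different log functions relate. This is a short sketch using round numbers of my own choosing:

```r
log(100)            # natural log (base e), about 4.61
log2(8)             # base 2
log10(1000)         # base 10
log(8, base = 2)    # any base via the base argument

log(0)              # -Inf, which is why zeroes need care

# log1p(n) computes log(n + 1) and is more accurate for very small n
n <- 0:10
all.equal(log1p(n), log(n + 1))
```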
Basic math on each element in a vector:
z$x - 1
z$x * 2
z$x / 3
(z$x + 100) * 3
exp(z$x) # Euler's number raised to x
pi * z$x
Some example statistical analyses:
set.seed(1) # used to control random generation :)
x <- rnorm(100, 50, 1)
summary(x)
round(x)
min(x)
max(x)
max(x) - min(x)
sum(x)
rank(x)
quantile(x)
var(x)
sd(x)
mean(x)
median(x)
length(x)
hist(x)
plot(density(x))
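Several of these functions are related, and the vectorized math from earlier lets us combine them. As a small sketch (the name z_scores is my own), we can check that the standard deviation is the square root of the variance and then standardize x:

```r
set.seed(1)
x <- rnorm(100, 50, 1)

# The standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))

# The range can also be computed from range()
diff(range(x))

# Standardize: subtract the mean, divide by the standard deviation
z_scores <- (x - mean(x)) / sd(x)
mean(z_scores)   # approximately 0
sd(z_scores)     # approximately 1
```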
Three types of correlation. Here we’ll also read in CSV data, from Kaggle:
health <- read.table(file = "weight-height.csv",
header = TRUE,
sep = ",")
head(health)
cor(health$Height, health$Weight, method = "pearson")
cor(health$Height, health$Weight, method = "spearman")
cor(health$Height, health$Weight, method = "kendall")
boxplot(health$Weight ~ health$Gender)
boxplot(health$Height ~ health$Gender)
Note 1: When reading data from the file system, it helps if the data is located in the same part of the file system (i.e., folder) that you used to start R. Otherwise, you need to specify the path to the data in read.table (and similar commands). Specifying the path is OS-dependent.
If you would like to see the current working directory and/or change it:
getwd()
setwd('/home/user/workspace/project1') # e.g., Linux
setwd('/Users/user/workspace/project1') # e.g., macOS
setwd('C:/file/path') # e.g., Windows
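Because path separators differ across operating systems, it may help to build paths with file.path rather than typing them by hand. A minimal sketch, assuming the CSV file from above sits in the working directory (the names wd and csv_path are my own):

```r
wd <- getwd()

# file.path() joins the pieces with a separator that R accepts on every OS
csv_path <- file.path(wd, "weight-height.csv")
csv_path

# Check that the file is really there before trying to read it
file.exists(csv_path)
```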
Linear regression statements take the form of model <- lm(DV ~ IV) or model <- lm(y ~ x), where “model” is simply the name you choose to refer to the model.
head(cars)
plot(cars$dist ~ cars$speed)
abline(lm(cars$dist ~ cars$speed))
cor(cars$speed, cars$dist, method = "pearson")
fit <- lm(cars$dist ~ cars$speed)
plot(fit) # model diagnostics
summary(fit)
confint(fit)
anova(fit)
Note 1: cars is a data set that is part of base R.
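An alternative to the dollar-sign syntax is to pass the data frame through the data argument of lm. This is a sketch of that style, not a replacement for the code above; the object new_speeds is hypothetical. One benefit is that predict can then match new data to the model by column name.

```r
fit <- lm(dist ~ speed, data = cars)

coef(fit)                  # intercept and slope
summary(fit)$r.squared     # proportion of variance explained

# Predict stopping distance for speeds of 10 and 20 mph
new_speeds <- data.frame(speed = c(10, 20))
predict(fit, newdata = new_speeds)
```

With a single predictor, the R-squared equals the squared Pearson correlation between the two variables.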
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
party = c("Democrat","Independent", "Republican"))
(Xsq <- chisq.test(M)) # Prints test summary
Xsq$observed # observed counts (same as M)
Xsq$expected # expected counts under the null
Xsq$residuals # Pearson residuals
Xsq$stdres # standardized residuals
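To see where the expected counts come from, we can compute them by hand as (row total × column total) / grand total and compare. The sketch below rebuilds the same table without labels; the names res and expected_by_hand are my own.

```r
M <- matrix(c(762, 327, 468,
              484, 239, 477), nrow = 2, byrow = TRUE)
res <- chisq.test(M)

# Expected count per cell: row total * column total / grand total
expected_by_hand <- outer(rowSums(M), colSums(M)) / sum(M)
all.equal(unname(res$expected), unname(expected_by_hand))

res$statistic   # chi-squared statistic
res$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value
```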
Here we have prior data and so we don’t assume the null probabilities are equal across all categories of observations:
count <- c(399, 193, 63, 82, 13)
null_probs <- c(0.53, 0.32, 0.08, 0.05, 0.02)
chisq.test(count, p = null_probs)
If we do test for equal null probabilities:
new_count <- c(20, 39, 31)
chisq.test(new_count)
One Sample t-test:
mean(health$Weight)
t.test(health$Weight, mu = 150)
t.test(health$Weight, mu = 150, alternative = "greater")
Two Sample t-test:
L <- health$Weight - 10
t.test(health$Weight, L) # if variances are unequal
t.test(health$Weight, L, var.equal = TRUE)
Paired t-test:
btemp1 <- beaver1$temp
length(btemp1)
btemp2 <- beaver2$temp
length(btemp2)
set.seed(1)
btemp1a <- sample(btemp1, 100, replace = FALSE)
t.test(btemp1a, btemp2, paired = TRUE)
boxplot(btemp1a, btemp2)
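The result of t.test is an object, so individual values can be pulled out for reporting rather than copied from the printed output. A small sketch using the built-in beaver data (tt is my own name):

```r
# Two-sample (Welch) t-test on the built-in beaver temperature data
tt <- t.test(beaver1$temp, beaver2$temp)

tt$statistic   # t value
tt$parameter   # degrees of freedom
tt$p.value
tt$conf.int    # confidence interval for the difference in means
tt$estimate    # the two sample means
```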
One-way ANOVA:
x <- c(2,3,7,2,6,10,8,7,5,10,10,13,14,13,15)
g <- c(rep("group1", 5), rep("group2", 5), rep("group3", 5))
xg <- data.frame(x, g = factor(g)) # store g as a factor; data.frame() keeps strings as character in R >= 4.0
xg
fit.a <- aov(x ~ g, data = xg)
summary(fit.a)
plot(xg$x ~ xg$g)
fit.tukey <- TukeyHSD(fit.a)
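Storing the Tukey result does not display it, so you will usually want to print or plot it too. Here is a self-contained sketch using PlantGrowth, another data set that ships with R (my choice for illustration; it is not used elsewhere in these notes):

```r
# PlantGrowth has a numeric weight column and a three-level group factor
fit.pg <- aov(weight ~ group, data = PlantGrowth)
summary(fit.pg)

tukey.pg <- TukeyHSD(fit.pg)
tukey.pg          # pairwise differences with adjusted p-values
tukey.pg$group    # the same results as a matrix
plot(tukey.pg)    # confidence intervals for each pairwise difference
```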
When you quit R, you are given the option to save your workspace. To quit, use the following command:
q()
But you may want to save your data to something like a CSV file, especially if you’ve organized and cleaned it up for analysis. To do so:
write.table(beaver1, file = "beaver1.csv", sep = ",", quote = TRUE)
Note 1: The beaver1 and beaver2 datasets are provided with base R.
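A quick way to confirm that a file was written as intended is to read it back. The sketch below writes to a temporary file so that nothing is left behind; write.csv and read.csv are convenience wrappers around write.table and read.table, and row.names = FALSE is my own choice to keep row numbers out of the file.

```r
tmp <- tempfile(fileext = ".csv")

# Write without row names so the file has exactly one column per variable
write.csv(beaver1, file = tmp, row.names = FALSE)

# Read it back and compare against the original
beaver1_back <- read.csv(tmp)
head(beaver1_back)
identical(names(beaver1_back), names(beaver1))
all.equal(beaver1_back$temp, beaver1$temp)

unlink(tmp)   # remove the temporary file
```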
Read CSV from the web:
boston <- read.table("https://data.boston.gov/dataset/00c015a1-2b62-4072-a71e-79b292ce9670/resource/9fdbdcad-67c8-4b23-b6ec-861e77d56227/download/tmpilkz66jf.csv",
header = TRUE,
sep = ",",
quote = "\"")
Read from other sources, such as Excel, SPSS, SAS, and Stata.
Read from Google Sheets, relational databases, such as MySQL, and other sources.