Working through the intraclass correlation coefficients (ICC) by reading:
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428. http://dx.doi.org/10.1037/0033-2909.86.2.420 http://www.ncbi.nlm.nih.gov/pubmed/18839484
And using the irr package along with its documentation:
Gamer, M., Lemon, J., Fellows, I., & Singh, P. (2012). irr: Various Coefficients of Interrater Reliability and Agreement. R package version 0.84. http://CRAN.R-project.org/package=irr
The toy data from the Shrout & Fleiss article, Table 2, p. 423:
Target | Judge 1 | Judge 2 | Judge 3 | Judge 4 |
---|---|---|---|---|
1 | 9 | 2 | 5 | 8 |
2 | 6 | 1 | 3 | 2 |
3 | 8 | 4 | 6 | 8 |
4 | 7 | 1 | 2 | 6 |
5 | 10 | 5 | 6 | 9 |
6 | 6 | 2 | 4 | 7 |
Add the above data to R in long format (one row per rating), which is the structure needed for running an ANOVA:

    # Scores are entered judge by judge (column by column of the table above)
    scores <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
    targets <- rep(c("target1", "target2", "target3", "target4", "target5", "target6"), 4)
    judges <- c(rep("judge1", 6), rep("judge2", 6), rep("judge3", 6), rep("judge4", 6))
    stj_df <- data.frame(scores, judges, targets)
Resulting data frame:
| | scores | judges | targets |
---|---|---|---|
1 | 9 | judge1 | target1 |
2 | 6 | judge1 | target2 |
3 | 8 | judge1 | target3 |
4 | 7 | judge1 | target4 |
5 | 10 | judge1 | target5 |
6 | 6 | judge1 | target6 |
7 | 2 | judge2 | target1 |
8 | 1 | judge2 | target2 |
9 | 4 | judge2 | target3 |
10 | 1 | judge2 | target4 |
11 | 5 | judge2 | target5 |
12 | 2 | judge2 | target6 |
13 | 5 | judge3 | target1 |
14 | 3 | judge3 | target2 |
15 | 6 | judge3 | target3 |
16 | 2 | judge3 | target4 |
17 | 6 | judge3 | target5 |
18 | 4 | judge3 | target6 |
19 | 8 | judge4 | target1 |
20 | 2 | judge4 | target2 |
21 | 8 | judge4 | target3 |
22 | 6 | judge4 | target4 |
23 | 9 | judge4 | target5 |
24 | 7 | judge4 | target6 |
Relevant summary statistics, by target and by judge:
Group | N | Mean | Var |
---|---|---|---|
Target 1 | 4 | 6.00 | 10.00 |
Target 2 | 4 | 3.00 | 4.67 |
Target 3 | 4 | 6.50 | 3.67 |
Target 4 | 4 | 4.00 | 8.67 |
Target 5 | 4 | 7.50 | 5.67 |
Target 6 | 4 | 4.75 | 4.92 |
Total | 24 | 5.29 | 7.35 |

Group | N | Mean | Var |
---|---|---|---|
Judge 1 | 6 | 7.67 | 2.67 |
Judge 2 | 6 | 2.50 | 2.70 |
Judge 3 | 6 | 4.33 | 2.67 |
Judge 4 | 6 | 6.67 | 6.27 |
Total | 24 | 5.29 | 7.35 |
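The summary tables above can be reproduced with base R. The following is a minimal sketch; `group_summary` is a hypothetical helper name introduced here for illustration, and `var()` is the usual sample variance (denominator N - 1).

```r
# Ratings from Table 2 of Shrout & Fleiss (1979), stacked judge by judge
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- rep(paste0("target", 1:6), times = 4)
judges  <- rep(paste0("judge", 1:4), each = 6)

# Hypothetical helper: N, mean and sample variance of x within each level of g
group_summary <- function(x, g) {
  n <- tapply(x, g, length)
  data.frame(
    N    = as.vector(n),
    Mean = round(as.vector(tapply(x, g, mean)), 2),
    Var  = round(as.vector(tapply(x, g, var)), 2),
    row.names = names(n)
  )
}

by_target <- group_summary(scores, targets)
by_judge  <- group_summary(scores, judges)
total     <- c(N = length(scores),
               Mean = round(mean(scores), 2),
               Var = round(var(scores), 2))

print(by_target)
print(by_judge)
print(total)
```

The judge means differ far more than the judge variances do, which is why the agreement and consistency forms of the ICC diverge so sharply for these data.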
Shrout & Fleiss document six versions of the intraclass correlation coefficient (ICC). In deciding which version to use, they state:
The guidelines for choosing the appropriate form of the ICC call for three decisions: (a) Is a one-way or two-way analysis of variance (ANOVA) appropriate for the analysis of the reliability study? (b) Are differences between the judges' mean ratings relevant to the reliability study? (c) Is the unit of analysis an individual rating or the mean of several ratings? (p. 420)
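This results in six forms: ICC(1,1), ICC(1,k), ICC(2,1), ICC(2,k), ICC(3,1), and ICC(3,k). The first index is the case (1, 2, or 3, described below) and the second is the number of ratings behind each score: 1 for a single rating, k for the mean of k ratings. With k = 4 judges in the toy data, the averaged forms are written ICC(1,4), ICC(2,4), and ICC(3,4).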
More specifically:
1. Each target is rated by a different set of k judges, randomly selected from a larger population of judges (p. 421).
    fit.1 <- aov(scores ~ targets, data = stj_df)
    summary(fit.1)

                Df Sum Sq Mean Sq F value Pr(>F)
    targets      5  56.21  11.242   1.795  0.165
    Residuals   18 112.75   6.264
The ICC(1,1) estimate (one-way, consistency, single):
$$ ICC(1,1) = \frac{BMS - WMS}{BMS + (k - 1)WMS} $$
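Where BMS is the between-targets mean square, WMS is the within-targets mean square (the Residuals row of the one-way ANOVA), and k is the number of judges rating each target (here k = 4).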
Therefore:
\begin{align*} ICC(1,1) & = \frac{11.24 - 6.26}{11.24 + (4 - 1)6.26} \\ & = 0.17 \end{align*}
The ICC(1,4) estimate (one-way, consistency, average):
$$ ICC(1,4) = \frac{BMS - WMS}{BMS} $$
Therefore:
\begin{align*} ICC(1,4) & = \frac{11.24 - 6.26}{11.24} \\ & = 0.44 \end{align*}
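The hand calculations for case 1 can be checked by pulling the mean squares straight out of the one-way ANOVA table. This is a sketch under the assumption that the rows of `summary(fit.1)[[1]]` are in the order targets, Residuals, as in the output shown above; the names `ms`, `icc_1_1`, and `icc_1_k` are introduced here for illustration.

```r
# Long-format data: one row per rating
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- factor(rep(paste0("target", 1:6), times = 4))
stj_df  <- data.frame(scores, targets)

fit.1 <- aov(scores ~ targets, data = stj_df)

# Mean squares from the ANOVA table (row 1 = targets, row 2 = Residuals)
ms  <- summary(fit.1)[[1]][["Mean Sq"]]
BMS <- ms[1]   # between-targets mean square
WMS <- ms[2]   # within-targets mean square
k   <- nrow(stj_df) / nlevels(targets)   # judges per target (4)

icc_1_1 <- (BMS - WMS) / (BMS + (k - 1) * WMS)   # single rating
icc_1_k <- (BMS - WMS) / BMS                     # mean of k ratings

print(round(c("ICC(1,1)" = icc_1_1, "ICC(1,4)" = icc_1_k), 3))
```

Working from the unrounded mean squares avoids the small rounding drift that comes from plugging in 11.24 and 6.26 by hand.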
2. A random sample of k judges is selected from a larger population, and each judge rates each target, that is, each judge rates n targets altogether (p. 421).

    fit.2 <- aov(scores ~ targets + judges, data = stj_df)
    summary(fit.2)

                Df Sum Sq Mean Sq F value   Pr(>F)
    targets      5  56.21   11.24   11.03 0.000135 ***
    judges       3  97.46   32.49   31.87 9.45e-07 ***
    Residuals   15  15.29    1.02
The ICC(2,1) estimate (two-way, agreement, single):
$$ ICC(2,1) = \frac{BMS - EMS}{BMS + (k - 1)EMS + \frac{k(JMS - EMS)}{n}} $$
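Where BMS is the between-targets mean square, JMS is the between-judges mean square, EMS is the residual (error) mean square, k is the number of judges (here k = 4), and n is the number of targets (here n = 6).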
Therefore:
\begin{align*} ICC(2,1) & = \frac{11.24 - 1.02}{11.24 + (4 - 1)1.02 + \frac{4(32.49 - 1.02)}{6}} \\ & = 0.29 \end{align*}
The ICC(2,4) estimate (two-way, agreement, average):
$$ ICC(2,4) = \frac{BMS - EMS}{BMS + \frac{(JMS - EMS)}{n}} $$
Therefore:
\begin{align*} ICC(2,4) & = \frac{11.24 - 1.02}{11.24 + \frac{(32.49 - 1.02)}{6}} \\ & = 0.62 \end{align*}
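The case 2 estimates can be computed the same way from the two-way ANOVA. A minimal sketch, assuming the rows of the ANOVA table follow the formula order (targets, judges, Residuals); the variable names are illustrative only.

```r
# Long-format data: one row per rating
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- factor(rep(paste0("target", 1:6), times = 4))
judges  <- factor(rep(paste0("judge", 1:4), each = 6))
stj_df  <- data.frame(scores, judges, targets)

fit.2 <- aov(scores ~ targets + judges, data = stj_df)

# Mean squares (row 1 = targets, row 2 = judges, row 3 = Residuals)
ms  <- summary(fit.2)[[1]][["Mean Sq"]]
BMS <- ms[1]   # between-targets mean square
JMS <- ms[2]   # between-judges mean square
EMS <- ms[3]   # residual mean square
k   <- nlevels(judges)    # 4 judges
n   <- nlevels(targets)   # 6 targets

# Agreement forms: judge variance (JMS - EMS) counts against reliability
icc_2_1 <- (BMS - EMS) / (BMS + (k - 1) * EMS + k * (JMS - EMS) / n)
icc_2_k <- (BMS - EMS) / (BMS + (JMS - EMS) / n)

print(round(c("ICC(2,1)" = icc_2_1, "ICC(2,4)" = icc_2_k), 3))
```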
3. Each target is rated by each of the same k judges, who are the only judges of interest (p. 421).
    fit.2 <- aov(scores ~ targets + judges, data = stj_df)
    summary(fit.2)

                Df Sum Sq Mean Sq F value   Pr(>F)
    targets      5  56.21   11.24   11.03 0.000135 ***
    judges       3  97.46   32.49   31.87 9.45e-07 ***
    Residuals   15  15.29    1.02
The ICC(3,1) estimate (two-way, consistency, single):
$$ ICC(3,1) = \frac{BMS - EMS}{BMS + (k - 1)EMS} $$
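Where BMS is the between-targets mean square, EMS is the residual (error) mean square, and k is the number of judges (here k = 4), all taken from the two-way ANOVA.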
Therefore:
\begin{align*} ICC(3,1) & = \frac{11.24 - 1.02}{11.24 + (4 - 1)1.02} \\ & = 0.71 \end{align*}
The ICC(3,4) estimate (two-way, consistency, average):
$$ ICC(3,4) = \frac{BMS - EMS}{BMS} $$
Therefore:
\begin{align*} ICC(3,4) & = \frac{11.24 - 1.02}{11.24} \\ & = 0.91 \end{align*}
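For case 3 the judge mean square drops out of the formulas, so only BMS and EMS from the two-way ANOVA are needed. The sketch below again assumes the ANOVA rows are ordered targets, judges, Residuals, and uses illustrative variable names.

```r
# Long-format data: one row per rating
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- factor(rep(paste0("target", 1:6), times = 4))
judges  <- factor(rep(paste0("judge", 1:4), each = 6))
stj_df  <- data.frame(scores, judges, targets)

fit.2 <- aov(scores ~ targets + judges, data = stj_df)

ms  <- summary(fit.2)[[1]][["Mean Sq"]]
BMS <- ms[1]            # between-targets mean square
EMS <- ms[3]            # residual mean square
k   <- nlevels(judges)  # 4 judges

# Consistency forms: systematic differences between judges are ignored
icc_3_1 <- (BMS - EMS) / (BMS + (k - 1) * EMS)
icc_3_k <- (BMS - EMS) / BMS

print(round(c("ICC(3,1)" = icc_3_1, "ICC(3,4)" = icc_3_k), 3))
```

Comparing these with the case 2 values shows how much of the disagreement in the toy data comes from judges using different parts of the scale rather than from ranking targets differently.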
These agree with the ICC scores in Table 4 from Shrout & Fleiss (p. 424):
Version | Estimate | Model | Type | Unit of Analysis |
---|---|---|---|---|
ICC(1,1) | 0.17 | One-way | Consistency | Single |
ICC(1,4) | 0.44 | One-way | Consistency | Average |
ICC(2,1) | 0.29 | Two-way | Agreement | Single |
ICC(2,4) | 0.62 | Two-way | Agreement | Average |
ICC(3,1) | 0.71 | Two-way | Consistency | Single |
ICC(3,4) | 0.91 | Two-way | Consistency | Average |
Using the irr package, the data has to be reshaped (here just re-added into R):
    library("irr")
    score_1 <- c(9,6,8,7,10,6)
    score_2 <- c(2,1,4,1,5,2)
    score_3 <- c(5,3,6,2,6,4)
    score_4 <- c(8,2,8,6,9,7)
Viewing the data (irr uses the data as it appears in the table at the top of this page):
    cbind(score_1, score_2, score_3, score_4)

         score_1 score_2 score_3 score_4
    [1,]       9       2       5       8
    [2,]       6       1       3       2
    [3,]       8       4       6       8
    [4,]       7       1       2       6
    [5,]      10       5       6       9
    [6,]       6       2       4       7
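Instead of re-entering the scores, the long data frame used for the ANOVA could be reshaped into the same targets-by-judges layout. This is one possible approach using base R's `reshape()`; the names `wide` and `ratings` are introduced here for illustration.

```r
# Long-format data as used for the ANOVA: one row per rating
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- rep(paste0("target", 1:6), times = 4)
judges  <- rep(paste0("judge", 1:4), each = 6)
stj_df  <- data.frame(scores, judges, targets)

# One row per target, one "scores.<judge>" column per judge
wide <- reshape(stj_df, idvar = "targets", timevar = "judges",
                direction = "wide")

# Keep only the rating columns, as a numeric matrix (targets x judges)
ratings <- as.matrix(wide[, paste0("scores.judge", 1:4)])
rownames(ratings) <- as.character(wide$targets)

print(ratings)
```

Because `reshape()` matches rows on the `targets` identifier, this does not depend on the long data being sorted in any particular order.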
Then:
ICC(1,1) (one-way, consistency, single):
    icc(cbind(score_1, score_2, score_3, score_4), model = "oneway", type = "consistency", unit = "single")

     Single Score Intraclass Correlation

       Model: oneway
       Type : consistency

       Subjects = 6
         Raters = 4
         ICC(1) = 0.166

     F-Test, H0: r0 = 0 ; H1: r0 > 0
       F(5,18) = 1.79 , p = 0.165

     95%-Confidence Interval for ICC Population Values:
      -0.133 < ICC < 0.723
ICC(1,4) (one-way, consistency, average):
    icc(cbind(score_1, score_2, score_3, score_4), model = "oneway", type = "consistency", unit = "average")

     Average Score Intraclass Correlation

       Model: oneway
       Type : consistency

       Subjects = 6
         Raters = 4
         ICC(4) = 0.443

     F-Test, H0: r0 = 0 ; H1: r0 > 0
       F(5,18) = 1.79 , p = 0.165

     95%-Confidence Interval for ICC Population Values:
      -0.884 < ICC < 0.912
ICC(2,1) (two-way, agreement, single):
    icc(cbind(score_1, score_2, score_3, score_4), model = "twoway", type = "agreement", unit = "single")

     Single Score Intraclass Correlation

       Model: twoway
       Type : agreement

       Subjects = 6
         Raters = 4
       ICC(A,1) = 0.29

     F-Test, H0: r0 = 0 ; H1: r0 > 0
     F(5,4.79) = 11 , p = 0.0113

     95%-Confidence Interval for ICC Population Values:
      0.019 < ICC < 0.761
ICC(2,4) (two-way, agreement, average):
    icc(cbind(score_1, score_2, score_3, score_4), model = "twoway", type = "agreement", unit = "average")

     Average Score Intraclass Correlation

       Model: twoway
       Type : agreement

       Subjects = 6
         Raters = 4
       ICC(A,4) = 0.62

     F-Test, H0: r0 = 0 ; H1: r0 > 0
     F(5,4.19) = 11 , p = 0.0165

     95%-Confidence Interval for ICC Population Values:
      0.039 < ICC < 0.929
ICC(3,1) (two-way, consistency, single):
    icc(cbind(score_1, score_2, score_3, score_4), model = "twoway", type = "consistency", unit = "single")

     Single Score Intraclass Correlation

       Model: twoway
       Type : consistency

       Subjects = 6
         Raters = 4
       ICC(C,1) = 0.715

     F-Test, H0: r0 = 0 ; H1: r0 > 0
       F(5,15) = 11 , p = 0.000135

     95%-Confidence Interval for ICC Population Values:
      0.342 < ICC < 0.946
ICC(3,4) (two-way, consistency, average):
    icc(cbind(score_1, score_2, score_3, score_4), model = "twoway", type = "consistency", unit = "average")

     Average Score Intraclass Correlation

       Model: twoway
       Type : consistency

       Subjects = 6
         Raters = 4
       ICC(C,4) = 0.909

     F-Test, H0: r0 = 0 ; H1: r0 > 0
       F(5,15) = 11 , p = 0.000135

     95%-Confidence Interval for ICC Population Values:
      0.676 < ICC < 0.986
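The six calls above differ only in their `model`, `type`, and `unit` arguments, so they can be collected into one small table. This is a sketch rather than a recommended workflow: `specs` is a hypothetical name, the code relies on the `value` element of the object returned by `irr::icc()`, and the `requireNamespace()` guard is only there so the snippet does nothing if the irr package is not installed.

```r
# Ratings matrix: rows are targets, columns are judges
ratings <- cbind(score_1 = c(9,6,8,7,10,6),
                 score_2 = c(2,1,4,1,5,2),
                 score_3 = c(5,3,6,2,6,4),
                 score_4 = c(8,2,8,6,9,7))

# The six (model, type, unit) combinations used in the calls above
specs <- data.frame(
  label = c("ICC(1,1)", "ICC(1,4)", "ICC(2,1)", "ICC(2,4)", "ICC(3,1)", "ICC(3,4)"),
  model = c("oneway", "oneway", "twoway", "twoway", "twoway", "twoway"),
  type  = c("consistency", "consistency", "agreement", "agreement",
            "consistency", "consistency"),
  unit  = c("single", "average", "single", "average", "single", "average"),
  stringsAsFactors = FALSE
)

if (requireNamespace("irr", quietly = TRUE)) {
  # Run icc() once per row and keep only the point estimate
  specs$estimate <- unname(mapply(
    function(model, type, unit) {
      irr::icc(ratings, model = model, type = type, unit = unit)$value
    },
    specs$model, specs$type, specs$unit
  ))
  print(specs[, c("label", "estimate")])
}
```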