02-21-2024
Working through the intraclass correlation coefficients (ICC) by reading:
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428. http://dx.doi.org/10.1037/0033-2909.86.2.420, http://www.ncbi.nlm.nih.gov/pubmed/18839484
And using the irr package along with its documentation:
Gamer, Matthias, Lemon, Jim, Fellows, Ian, & Singh, Puspendra. (2012). irr: Various Coefficients of Interrater Reliability and Agreement. R package version 0.84. http://CRAN.R-project.org/package=irr
The toy data from the Shrout & Fleiss article, Table 2, p. 423:
Target | Judge 1 | Judge 2 | Judge 3 | Judge 4 |
---|---|---|---|---|
1 | 9 | 2 | 5 | 8 |
2 | 6 | 1 | 3 | 2 |
3 | 8 | 4 | 6 | 8 |
4 | 7 | 1 | 2 | 6 |
5 | 10 | 5 | 6 | 9 |
6 | 6 | 2 | 4 | 7 |
Enter the data above in R in long format (one row per rating), ready for an ANOVA:
scores <- c(9,6,8,7,10,6,2,1,4,1,5,2,5,3,6,2,6,4,8,2,8,6,9,7)
targets <- rep(c("target1", "target2", "target3", "target4", "target5", "target6"), 4)
judges <- c(rep("judge1", 6), rep("judge2", 6), rep("judge3", 6), rep("judge4", 6))
stj_df <- data.frame(scores, judges, targets)
Resulting data frame:
Row | scores | judges | targets |
---|---|---|---|
1 | 9 | judge1 | target1 |
2 | 6 | judge1 | target2 |
3 | 8 | judge1 | target3 |
4 | 7 | judge1 | target4 |
5 | 10 | judge1 | target5 |
6 | 6 | judge1 | target6 |
7 | 2 | judge2 | target1 |
8 | 1 | judge2 | target2 |
9 | 4 | judge2 | target3 |
10 | 1 | judge2 | target4 |
11 | 5 | judge2 | target5 |
12 | 2 | judge2 | target6 |
13 | 5 | judge3 | target1 |
14 | 3 | judge3 | target2 |
15 | 6 | judge3 | target3 |
16 | 2 | judge3 | target4 |
17 | 6 | judge3 | target5 |
18 | 4 | judge3 | target6 |
19 | 8 | judge4 | target1 |
20 | 2 | judge4 | target2 |
21 | 8 | judge4 | target3 |
22 | 6 | judge4 | target4 |
23 | 9 | judge4 | target5 |
24 | 7 | judge4 | target6 |
Relevant summary statistics, by target and then by judge:
Group | N | Mean | Var |
---|---|---|---|
Target 1 | 4 | 6.00 | 10.00 |
Target 2 | 4 | 3.00 | 4.67 |
Target 3 | 4 | 6.50 | 3.67 |
Target 4 | 4 | 4.00 | 8.67 |
Target 5 | 4 | 7.50 | 5.67 |
Target 6 | 4 | 4.75 | 4.92 |
Total | 24 | 5.29 | 7.35 |
Group | N | Mean | Var |
---|---|---|---|
Judge 1 | 6 | 7.67 | 2.67 |
Judge 2 | 6 | 2.50 | 2.70 |
Judge 3 | 6 | 4.33 | 2.67 |
Judge 4 | 6 | 6.67 | 6.27 |
Total | 24 | 5.29 | 7.35 |
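The figures in both tables can be reproduced with base R. The sketch below is self-contained, so it re-enters the data; `summarise_by` is a hypothetical helper name introduced here for illustration, not something from the original analysis.

```r
# Re-enter the long-format data so the example runs on its own
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- rep(paste0("target", 1:6), times = 4)
judges  <- rep(paste0("judge", 1:4), each = 6)
stj_df  <- data.frame(scores, judges, targets)

# Hypothetical helper: N, mean and sample variance of the scores per group
summarise_by <- function(group) {
  cbind(N    = tapply(stj_df$scores, group, length),
        Mean = tapply(stj_df$scores, group, mean),
        Var  = tapply(stj_df$scores, group, var))
}

by_target <- round(summarise_by(stj_df$targets), 2)
by_judge  <- round(summarise_by(stj_df$judges), 2)
overall   <- round(c(N = length(scores), Mean = mean(scores), Var = var(scores)), 2)

print(by_target)
print(by_judge)
print(overall)
```

Note that `var()` returns the sample variance (denominator N - 1), which is what the tables report.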
Shrout & Fleiss document six versions of the intraclass correlation coefficient (ICC). In deciding which version to use, they state:
The guidelines for choosing the appropriate form of the ICC call for three decisions: (a) Is a one-way or two-way analysis of variance (ANOVA) appropriate for the analysis of the reliability study? (b) Are differences between the judges’ mean ratings relevant to the reliability study? (c) Is the unit of analysis an individual rating or the mean of several ratings? (p. 420)
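This results in the following six forms: ICC(1,1), ICC(1,k), ICC(2,1), ICC(2,k), ICC(3,1), and ICC(3,k). The first index is the case (model) described below; the second is the unit of analysis, 1 for a single rating or k for the mean of k ratings (k = 4 here).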
More specifically:
- Each target is rated by a different set of k judges, randomly selected from a larger population of judges (p. 421).
fit.1 <- aov(scores ~ targets, data = stj_df)
summary(fit.1)
Df Sum Sq Mean Sq F value Pr(>F)
targets 5 56.21 11.242 1.795 0.165
Residuals 18 112.75 6.264
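The ICC(1,1) estimate (one-way, consistency, single):
$$\mathrm{ICC}(1,1) = \frac{BMS - WMS}{BMS + (k - 1)\,WMS}$$
Where BMS is the between-targets mean square (11.242), WMS is the within-targets (residual) mean square (6.264), and k is the number of judges (4).
Therefore:
$$\mathrm{ICC}(1,1) = \frac{11.242 - 6.264}{11.242 + 3 \times 6.264} = \frac{4.978}{30.034} \approx 0.166$$
The ICC(1,4) estimate (one-way, consistency, average):
$$\mathrm{ICC}(1,k) = \frac{BMS - WMS}{BMS}$$
Therefore:
$$\mathrm{ICC}(1,4) = \frac{11.242 - 6.264}{11.242} = \frac{4.978}{11.242} \approx 0.443$$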
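The one-way estimates can also be computed directly from the mean squares of the ANOVA table. This is a minimal sketch that re-enters the data so it runs on its own. The names `BMS`, `WMS`, `icc_1_1` and `icc_1_k` are chosen here for illustration, and the formulas are the Shrout & Fleiss Case 1 formulas as I read them.

```r
# Long-format data (targets as a factor so aov() treats it as a grouping variable)
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- factor(rep(paste0("target", 1:6), times = 4))

fit.1 <- aov(scores ~ targets)

# Mean squares in row order of the ANOVA table: targets, Residuals
ms  <- summary(fit.1)[[1]][["Mean Sq"]]
BMS <- ms[1]  # between-targets mean square
WMS <- ms[2]  # within-targets mean square
k   <- 4      # number of judges rating each target

icc_1_1 <- (BMS - WMS) / (BMS + (k - 1) * WMS)  # single rating
icc_1_k <- (BMS - WMS) / BMS                    # mean of k ratings

print(round(c("ICC(1,1)" = icc_1_1, "ICC(1,4)" = icc_1_k), 3))
```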
- A random sample of k judges is selected from a larger population, and each judge rates each target, that is, each judge rates n targets altogether (p. 421).
fit.2 <- aov(scores ~ targets + judges, data = stj_df)
summary(fit.2)
Df Sum Sq Mean Sq F value Pr(>F)
targets 5 56.21 11.24 11.03 0.000135 ***
judges 3 97.46 32.49 31.87 9.45e-07 ***
Residuals 15 15.29 1.02
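The ICC(2,1) estimate (two-way, agreement, single):
$$\mathrm{ICC}(2,1) = \frac{BMS - EMS}{BMS + (k - 1)\,EMS + k\,(JMS - EMS)/n}$$
Where BMS is the between-targets mean square (11.24), JMS is the between-judges mean square (32.49), EMS is the residual mean square (1.02), k is the number of judges (4), and n is the number of targets (6).
Therefore:
$$\mathrm{ICC}(2,1) = \frac{11.24 - 1.02}{11.24 + 3 \times 1.02 + 4 \times (32.49 - 1.02)/6} = \frac{10.22}{35.28} \approx 0.290$$
The ICC(2,4) estimate (two-way, agreement, average):
$$\mathrm{ICC}(2,k) = \frac{BMS - EMS}{BMS + (JMS - EMS)/n}$$
Therefore:
$$\mathrm{ICC}(2,4) = \frac{11.24 - 1.02}{11.24 + (32.49 - 1.02)/6} = \frac{10.22}{16.49} \approx 0.620$$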
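As a check on the arithmetic, the agreement estimates can be computed from the two-way ANOVA mean squares. The sketch below is self-contained. The variable names (`BMS`, `JMS`, `EMS`, `icc_2_1`, `icc_2_k`) are my own, and the formulas are the Shrout & Fleiss Case 2 formulas as I understand them.

```r
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- factor(rep(paste0("target", 1:6), times = 4))
judges  <- factor(rep(paste0("judge", 1:4), each = 6))

fit.2 <- aov(scores ~ targets + judges)

# Mean squares in row order of the ANOVA table: targets, judges, Residuals
ms  <- summary(fit.2)[[1]][["Mean Sq"]]
BMS <- ms[1]  # between-targets mean square
JMS <- ms[2]  # between-judges mean square
EMS <- ms[3]  # residual mean square

n <- nlevels(targets)  # number of targets (6)
k <- nlevels(judges)   # number of judges (4)

# Judge variance (JMS) enters the denominator, so systematic
# differences between judges lower the agreement estimate
icc_2_1 <- (BMS - EMS) / (BMS + (k - 1) * EMS + k * (JMS - EMS) / n)
icc_2_k <- (BMS - EMS) / (BMS + (JMS - EMS) / n)

print(round(c("ICC(2,1)" = icc_2_1, "ICC(2,4)" = icc_2_k), 3))
```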
- Each target is rated by each of the same k judges, who are the only judges of interest (p. 421).
fit.2 <- aov(scores ~ targets + judges, data = stj_df)
summary(fit.2)
Df Sum Sq Mean Sq F value Pr(>F)
targets 5 56.21 11.24 11.03 0.000135 ***
judges 3 97.46 32.49 31.87 9.45e-07 ***
Residuals 15 15.29 1.02
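The ICC(3,1) estimate (two-way, consistency, single):
$$\mathrm{ICC}(3,1) = \frac{BMS - EMS}{BMS + (k - 1)\,EMS}$$
Where BMS is the between-targets mean square (11.24), EMS is the residual mean square (1.02), and k is the number of judges (4). The between-judges mean square does not enter, because the judges are treated as fixed.
Therefore:
$$\mathrm{ICC}(3,1) = \frac{11.24 - 1.02}{11.24 + 3 \times 1.02} = \frac{10.22}{14.30} \approx 0.715$$
The ICC(3,4) estimate (two-way, consistency, average):
$$\mathrm{ICC}(3,k) = \frac{BMS - EMS}{BMS}$$
Therefore:
$$\mathrm{ICC}(3,4) = \frac{11.24 - 1.02}{11.24} = \frac{10.22}{11.24} \approx 0.909$$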
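The consistency estimates use the same two-way ANOVA but ignore the judges' mean square. A minimal self-contained sketch, with illustrative variable names of my own and the Shrout & Fleiss Case 3 formulas as I read them:

```r
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- factor(rep(paste0("target", 1:6), times = 4))
judges  <- factor(rep(paste0("judge", 1:4), each = 6))

# Same model as for Case 2; only the way the mean squares are combined changes
ms  <- summary(aov(scores ~ targets + judges))[[1]][["Mean Sq"]]
BMS <- ms[1]  # between-targets mean square
EMS <- ms[3]  # residual mean square (ms[2], the judges' mean square, is not used)
k   <- nlevels(judges)

icc_3_1 <- (BMS - EMS) / (BMS + (k - 1) * EMS)  # single rating
icc_3_k <- (BMS - EMS) / BMS                    # mean of k ratings

print(round(c("ICC(3,1)" = icc_3_1, "ICC(3,4)" = icc_3_k), 3))
```

Because the large between-judges mean square is left out of the denominator, these estimates come out well above the Case 2 values for the same data.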
These agree with the ICC scores in Table 4 from Shrout & Fleiss (p. 424):
Version | Estimate | Model | Type | Unit of Analysis |
---|---|---|---|---|
ICC(1,1) | 0.17 | One-way | Consistency | Single |
ICC(1,4) | 0.44 | One-way | Consistency | Average |
ICC(2,1) | 0.29 | Two-way | Agreement | Single |
ICC(2,4) | 0.62 | Two-way | Agreement | Average |
ICC(3,1) | 0.71 | Two-way | Consistency | Single |
ICC(3,4) | 0.91 | Two-way | Consistency | Average |
The irr package expects the ratings in wide format, with one row per target and one column per judge, so the data has to be reshaped (here it is simply re-entered in R):
library("irr")
score_1 <- c(9,6,8,7,10,6)
score_2 <- c(2,1,4,1,5,2)
score_3 <- c(5,3,6,2,6,4)
score_4 <- c(8,2,8,6,9,7)
Viewing the data (irr uses the data as it appears in the table at the top of this page):
cbind(score_1, score_2, score_3, score_4)
score_1 score_2 score_3 score_4
[1,] 9 2 5 8
[2,] 6 1 3 2
[3,] 8 4 6 8
[4,] 7 1 2 6
[5,] 10 5 6 9
[6,] 6 2 4 7
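Instead of re-entering the scores, the wide matrix could be derived from the long-format data. The sketch below uses base R's `tapply()` for this; the object name `ratings` is my own, and the data is re-entered only so the example runs on its own.

```r
scores  <- c(9,6,8,7,10,6, 2,1,4,1,5,2, 5,3,6,2,6,4, 8,2,8,6,9,7)
targets <- rep(paste0("target", 1:6), times = 4)
judges  <- rep(paste0("judge", 1:4), each = 6)
stj_df  <- data.frame(scores, judges, targets)

# One row per target, one column per judge. Each target/judge cell holds
# exactly one score, so taking the mean of a cell simply returns that score.
ratings <- tapply(stj_df$scores, list(stj_df$targets, stj_df$judges), mean)

print(ratings)
```

The resulting matrix has the same layout as the `cbind()` result and could be passed to `icc()` in its place.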
Then:
ICC(1,1) (one-way, consistency, single):
icc(cbind(score_1, score_2, score_3, score_4),
model = "oneway",
type = "consistency",
unit = "single")
Single Score Intraclass Correlation
Model: oneway
Type : consistency
Subjects = 6
Raters = 4
ICC(1) = 0.166
F-Test, H0: r0 = 0 ; H1: r0 > 0
F(5,18) = 1.79 , p = 0.165
95%-Confidence Interval for ICC Population Values:
-0.133 < ICC < 0.723
ICC(1,4) (one-way, consistency, average):
icc(cbind(score_1, score_2, score_3, score_4),
model = "oneway",
type = "consistency",
unit = "average")
Average Score Intraclass Correlation
Model: oneway
Type : consistency
Subjects = 6
Raters = 4
ICC(4) = 0.443
F-Test, H0: r0 = 0 ; H1: r0 > 0
F(5,18) = 1.79 , p = 0.165
95%-Confidence Interval for ICC Population Values:
-0.884 < ICC < 0.912
ICC(2,1) (two-way, agreement, single):
icc(cbind(score_1, score_2, score_3, score_4),
model = "twoway",
type = "agreement",
unit = "single")
Single Score Intraclass Correlation
Model: twoway
Type : agreement
Subjects = 6
Raters = 4
ICC(A,1) = 0.29
F-Test, H0: r0 = 0 ; H1: r0 > 0
F(5,4.79) = 11 , p = 0.0113
95%-Confidence Interval for ICC Population Values:
0.019 < ICC < 0.761
ICC(2,4) (two-way, agreement, average):
icc(cbind(score_1, score_2, score_3, score_4),
model = "twoway",
type = "agreement",
unit = "average")
Average Score Intraclass Correlation
Model: twoway
Type : agreement
Subjects = 6
Raters = 4
ICC(A,4) = 0.62
F-Test, H0: r0 = 0 ; H1: r0 > 0
F(5,4.19) = 11 , p = 0.0165
95%-Confidence Interval for ICC Population Values:
0.039 < ICC < 0.929
ICC(3,1) (two-way, consistency, single):
icc(cbind(score_1, score_2, score_3, score_4),
model = "twoway",
type = "consistency",
unit = "single")
Single Score Intraclass Correlation
Model: twoway
Type : consistency
Subjects = 6
Raters = 4
ICC(C,1) = 0.715
F-Test, H0: r0 = 0 ; H1: r0 > 0
F(5,15) = 11 , p = 0.000135
95%-Confidence Interval for ICC Population Values:
0.342 < ICC < 0.946
ICC(3,4) (two-way, consistency, average):
icc(cbind(score_1, score_2, score_3, score_4),
model = "twoway",
type = "consistency",
unit = "average")
Average Score Intraclass Correlation
Model: twoway
Type : consistency
Subjects = 6
Raters = 4
ICC(C,4) = 0.909
F-Test, H0: r0 = 0 ; H1: r0 > 0
F(5,15) = 11 , p = 0.000135
95%-Confidence Interval for ICC Population Values:
0.676 < ICC < 0.986
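The six calls above differ only in their `model`, `type` and `unit` arguments, so the estimates can be gathered into one table. This is a sketch that assumes the irr package is installed and relies on the `value` element of the object returned by `icc()`; the data frame `combos` is a name introduced here.

```r
library(irr)

ratings <- cbind(score_1 = c(9,6,8,7,10,6),
                 score_2 = c(2,1,4,1,5,2),
                 score_3 = c(5,3,6,2,6,4),
                 score_4 = c(8,2,8,6,9,7))

# One row per ICC form, in the order ICC(1,1), ICC(1,4), ..., ICC(3,4)
combos <- data.frame(
  version = c("ICC(1,1)", "ICC(1,4)", "ICC(2,1)", "ICC(2,4)", "ICC(3,1)", "ICC(3,4)"),
  model   = c("oneway", "oneway", "twoway", "twoway", "twoway", "twoway"),
  type    = c("consistency", "consistency", "agreement", "agreement",
              "consistency", "consistency"),
  unit    = c("single", "average", "single", "average", "single", "average"),
  stringsAsFactors = FALSE
)

# Run icc() once per row and keep only the point estimate
combos$estimate <- round(unname(mapply(
  function(model, type, unit) {
    icc(ratings, model = model, type = type, unit = unit)$value
  },
  combos$model, combos$type, combos$unit
)), 3)

print(combos)
```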