Understanding heterogeneity in meta-analysis
Effect sizes may differ across studies
Let’s consider the following case. The three effect sizes look different.
Why may they differ?
Let’s say they are SMDs between cases and controls, or measures of treatment efficacy.
Equal-Effects (or “Fixed-Effects”) models
If we disregard possible heterogeneity across effects and assume that all studies reflect the same true underlying effect, with variation due only to sampling error, we get the following estimates.
Would you trust them?
fit = rma(eff, vi, method="EE")
forest(fit, xlab="Cohen's d", header=TRUE)
The point estimate (center of the diamond) lies outside the confidence intervals of all three effect sizes included in the analysis.
The point estimate seems “overconfident” (the diamond is very narrow): what would happen if one more study were added?
Although the Study 1 effect has the highest precision, the other two are more consistent with each other, so Study 1 seems to have too much leverage on the meta-analytic result.
As shown in the model summary below, the test for heterogeneity is statistically significant, and the I^2 suggests that most of the variability is due to heterogeneity rather than to sampling error.
summary(fit)
Equal-Effects Model (k = 3)
logLik deviance AIC BIC AICc
-75.1142 164.5625 152.2283 151.3269 156.2283
I^2 (total heterogeneity / total variability): 98.78%
H^2 (total variability / sampling variability): 82.28
Test for Heterogeneity:
Q(df = 2) = 164.5625, p-val < .0001
Model Results:
estimate se zval pval ci.lb ci.ub
0.5725 0.0173 33.0533 <.0001 0.5386 0.6064 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
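For an equal-effects fit, H^2 and I^2 are simple functions of the Q statistic reported above (H^2 = Q/df, I^2 = (Q - df)/Q), so the summary figures can be reproduced by hand:

```r
# Values reported in the equal-effects summary above
Q  = 164.5625   # heterogeneity test statistic
k  = 3          # number of studies
df = k - 1      # degrees of freedom of the Q test

H2 = Q / df               # total variability / sampling variability
I2 = 100 * (Q - df) / Q   # % of total variability due to heterogeneity

round(H2, 2)   # 82.28
round(I2, 2)   # 98.78
```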
Random-Effects models
Now, here is a more plausible estimate, obtained simply by adopting a random-effects model (the default in metafor):
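No code was shown for this fit, so here is a minimal sketch. The `eff` and `vi` values below are hypothetical, chosen only to mimic the pattern described above (one very precise study, two less precise but mutually consistent ones); with metafor, simply dropping `method="EE"` yields the default random-effects model (REML estimator for tau^2):

```r
library(metafor)

# Hypothetical data mimicking the three studies above:
# one very precise study, two less precise but consistent ones
eff = c(0.60, 0.15, 0.20)
vi  = c(0.0003, 0.0100, 0.0120)

# Without method="EE", rma() fits a random-effects model by default
fit_re = rma(eff, vi)
forest(fit_re, xlab="Cohen's d", header=TRUE)
summary(fit_re)
```

Because tau^2 is now estimated from the data, the pooled estimate sits between the studies and its confidence interval widens accordingly.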
Equal- and Random-Effects models are almost identical if there is no true heterogeneity
In this case, there is no true heterogeneity in effect size: fitting the model with equal or random effects yields virtually the same result.
- In the first case, we fit the model with random effects (the default):
fit1 = rma(yi=z, vi=vi, data=df1)
par(mar=c(5,0,0,0))
forest(fit1, xlim=c(-0.3,0.85), xlab="Pearson's r", transf=fisherz2r, at=seq(-0.2,.6,.1), header=TRUE)
summary(fit1)
Random-Effects Model (k = 15; tau^2 estimator: REML)
logLik deviance AIC BIC AICc
16.2754 -32.5508 -28.5508 -27.2727 -27.4599
tau^2 (estimated amount of total heterogeneity): 0.0012 (SE = 0.0016)
tau (square root of estimated tau^2 value): 0.0350
I^2 (total heterogeneity / total variability): 28.79%
H^2 (total variability / sampling variability): 1.40
Test for Heterogeneity:
Q(df = 14) = 20.0452, p-val = 0.1287
Model Results:
estimate se zval pval ci.lb ci.ub
0.1940 0.0174 11.1451 <.0001 0.1598 0.2281 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- In the second case, we fit it with equal effects… the results are virtually the same (not exactly identical because, by chance, a very small amount of heterogeneity was estimated in the random-effects model, although it is immaterial):
fit1 = rma(yi=z, vi=vi, data=df1, method = "EE")
par(mar=c(5,0,0,0))
forest(fit1, xlim=c(-0.3,0.85), xlab="Pearson's r", transf=fisherz2r, at=seq(-0.2,.6,.1), header=TRUE)
summary(fit1)
Equal-Effects Model (k = 15)
logLik deviance AIC BIC AICc
17.7685 20.0452 -33.5371 -32.8290 -33.2294
I^2 (total heterogeneity / total variability): 30.16%
H^2 (total variability / sampling variability): 1.43
Test for Heterogeneity:
Q(df = 14) = 20.0452, p-val = 0.1287
Model Results:
estimate se zval pval ci.lb ci.ub
0.1918 0.0140 13.6626 <.0001 0.1643 0.2193 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Equal- and Random-Effects models differ if there is true heterogeneity… and Random-Effects models should be preferred in this case!
- In the following case, the effect sizes of the studies are the same as before… but now each study is much more precise (has a larger sample). This means that the observed variability across effects cannot be attributed to sampling error alone under a single underlying effect common to all studies. Here, heterogeneity is real: each study seems to reflect a slightly different underlying “true” effect. If we fit a random-effects model, significant heterogeneity emerges, and the “tau” estimate is larger and non-negligible. Also, note that, “paradoxically”, the uncertainty of the meta-analytic estimate (the width of the diamond) slightly increases! This is because, in the presence of true heterogeneity, part of the uncertainty about the overall effect (the diamond) is due to heterogeneity itself.
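The fit that produced the output below is not shown; here is a self-contained sketch of the same idea, using simulated (hypothetical) data: 15 precise studies whose true effects genuinely differ (tau = 0.08 on the Fisher-z scale).

```r
library(metafor)
set.seed(1)

# Hypothetical simulation: precise studies with truly different effects
k  = 15
vi = rep(0.0005, k)                   # small sampling variances
z  = rnorm(k, mean=0.20, sd=0.08) +   # true effects vary across studies
     rnorm(k, sd=sqrt(vi))            # plus sampling error

fit2 = rma(yi=z, vi=vi)   # random-effects model (REML is the default)
summary(fit2)             # tau^2 well above zero; Q test significant here
```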
Random-Effects Model (k = 15; tau^2 estimator: REML)
logLik deviance AIC BIC AICc
15.0624 -30.1247 -26.1247 -24.8466 -25.0338
tau^2 (estimated amount of total heterogeneity): 0.0060 (SE = 0.0025)
tau (square root of estimated tau^2 value): 0.0774
I^2 (total heterogeneity / total variability): 95.22%
H^2 (total variability / sampling variability): 20.93
Test for Heterogeneity:
Q(df = 14) = 203.3087, p-val < .0001
Model Results:
estimate se zval pval ci.lb ci.ub
0.2092 0.0208 10.0388 <.0001 0.1684 0.2501 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Now, if we inappropriately fit an Equal-Effects model, the estimate looks more precise… but this is an illusion due to neglecting true heterogeneity, so the estimates (especially their precision) are invalid:
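A self-contained sketch of this comparison (the original data frame is not shown, so hypothetical heterogeneous data are simulated here): fitting the same data with equal vs. random effects makes the spurious precision of the equal-effects fit explicit.

```r
library(metafor)
set.seed(1)

# Hypothetical heterogeneous data (defined here so the snippet stands alone)
k  = 15
vi = rep(0.0005, k)
z  = rnorm(k, mean=0.20, sd=0.08) + rnorm(k, sd=sqrt(vi))

fit_re = rma(yi=z, vi=vi)                # random-effects: valid
fit_ee = rma(yi=z, vi=vi, method="EE")   # equal-effects: invalid here

# The EE standard error is much smaller, but only because
# true heterogeneity has been ignored
c(SE_RE = fit_re$se, SE_EE = fit_ee$se)
```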
Equal-Effects Model (k = 15)
logLik deviance AIC BIC AICc
-56.4788 203.3087 114.9576 115.6657 115.2653
I^2 (total heterogeneity / total variability): 93.11%
H^2 (total variability / sampling variability): 14.52
Test for Heterogeneity:
Q(df = 14) = 203.3087, p-val < .0001
Model Results:
estimate se zval pval ci.lb ci.ub
0.1920 0.0044 43.4135 <.0001 0.1833 0.2006 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In conclusion: always favor random-effects models in meta-analysis! As argued by Borenstein et al. (2009), if there is no true heterogeneity, using a random-effects model does not substantially affect the estimates; if true heterogeneity is present, only random-effects models are valid. In either case, random-effects models are the safer choice, and, quite appropriately, they are the default in metafor.
A note on multilevel models
In the case of multivariate/multilevel models (e.g., multiple effect sizes nested in samples, nested in studies, nested in research groups, etc.), it is advisable to test whether there is heterogeneity due to each level of the hierarchical structure of dependencies before dropping that level.
Let’s consider the first few rows of the following meta-analytic dataset, in which you have coded not only the effect sizes, but also the IDs of Samples nested within Studies nested within Research groups…
ResGroupID StudyID SampleID EffSizeID eff vi
1 1 1 1 1 0.23942781 0.004580593
2 1 1 1 2 0.58950537 0.009601082
3 1 1 1 3 0.58001234 0.006880145
4 1 1 2 4 0.34182781 0.003958693
5 1 1 2 5 0.51915177 0.002774320
6 1 1 2 6 0.31348237 0.002038081
7 1 2 3 7 0.23854172 0.009963689
8 1 2 3 8 0.37920708 0.004413491
9 1 2 3 9 0.24893764 0.006057893
10 1 2 4 10 0.35503237 0.007594462
11 1 2 4 11 0.32563916 0.008837250
12 1 2 4 12 0.45813567 0.006149532
13 1 3 5 13 -0.17057887 0.001099325
14 1 3 5 14 0.04321122 0.009156837
15 1 3 5 15 -0.11931147 0.007935883
16 1 3 6 16 0.46126941 0.004442542
17 1 3 6 17 0.25115346 0.001846413
18 1 3 6 18 0.28984681 0.001446882
19 2 4 7 19 0.91599232 0.008390461
20 2 4 7 20 1.17559046 0.008463919
Since effect sizes nested within the same “Sample” have correlated sampling errors (because they are computed on the same group of participants), a good idea is to impute a covariance structure across effect sizes nested within the same Sample. The “impute_covariance_matrix” function of the “clubSandwich” package does the job. However, you need to assume a correlation value between effect sizes (due to sampling error). Values between 0.50 and 0.70 may be reasonable, but a sensitivity analysis is always recommended.
library(clubSandwich)
V = impute_covariance_matrix(df$vi, cluster=df$SampleID, r=0.60)
Now you are ready to fit the multilevel model and see its summary:
library(metafor)
fit = rma.mv(yi=eff, V=V, random=~ 1|ResGroupID/StudyID/SampleID/EffSizeID, data=df)
summary(fit)
Multivariate Meta-Analysis Model (k = 180; method: REML)
logLik Deviance AIC BIC AICc
-9.5472 19.0944 29.0944 45.0313 29.4412
Variance Components:
estim sqrt nlvls fixed factor
sigma^2.1 0.0000 0.0000 10 no ResGroupID
sigma^2.2 0.0744 0.2728 30 no ResGroupID/StudyID
sigma^2.3 0.0022 0.0467 60 no ResGroupID/StudyID/SampleID
sigma^2.4 0.0386 0.1965 180 no ResGroupID/StudyID/SampleID/EffSizeID
Test for Heterogeneity:
Q(df = 179) = 5718.9770, p-val < .0001
Model Results:
estimate se zval pval ci.lb ci.ub
0.4348 0.0529 8.2196 <.0001 0.3312 0.5385 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
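A quick way to see which levels account for most heterogeneity is to express each sigma^2 as a share of the total estimated heterogeneity (values copied from the variance components above):

```r
# sigma^2 estimates from the variance components table above
sigma2 = c(ResGroupID = 0.0000, StudyID   = 0.0744,
           SampleID   = 0.0022, EffSizeID = 0.0386)

# Percent of total between-cluster heterogeneity per level
round(100 * sigma2 / sum(sigma2), 1)
#> ResGroupID    StudyID   SampleID  EffSizeID
#>        0.0       64.6        1.9       33.5
```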
What level(s) account for more heterogeneity in this particular case?
What interpretation would you offer in this case?
What may be your further steps of investigation?