
Understanding heterogeneity in meta-analysis

Author

Enrico Toffalini


Effect sizes may differ across studies

Let’s consider the following case: the three effect sizes look different.

Why might they differ?

Let’s say they are SMDs (standardized mean differences) between cases and controls, or measures of treatment efficacy.
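As a reminder of where such effect sizes come from, here is a minimal base-R sketch computing an SMD (Cohen's d) and its large-sample sampling variance from summary statistics; all the numbers are made up for illustration:

```r
# Illustrative only: SMD (Cohen's d) and its sampling variance
# from hypothetical summary statistics of one case-control study
m_cases <- 105; m_controls <- 100   # group means (hypothetical)
sd_cases <- 15; sd_controls <- 14   # group SDs (hypothetical)
n_cases <- 40; n_controls <- 45     # group sizes (hypothetical)

# pooled standard deviation
sp <- sqrt(((n_cases - 1) * sd_cases^2 + (n_controls - 1) * sd_controls^2) /
             (n_cases + n_controls - 2))
d <- (m_cases - m_controls) / sp

# large-sample approximation of the sampling variance of d
vi <- (n_cases + n_controls) / (n_cases * n_controls) +
  d^2 / (2 * (n_cases + n_controls))
round(c(d = d, vi = vi), 3)
```

In a meta-analysis, each study contributes one such pair (effect size and sampling variance), and the sampling variance determines the study's weight.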


Equal-Effects (or “Fixed-Effects”) models

If we disregard possible heterogeneity across effects and assume that all studies reflect the same true underlying effect, with variations only being due to sampling error, we get the following estimates.

Would you trust them?

library(metafor)
fit = rma(eff, vi, method="EE")
forest(fit, xlab="Cohen's d", header=TRUE)

  • The point estimate (center of the diamond) is outside the confidence interval of any of the 3 effect sizes involved in the analysis.

  • The point estimate seems “overconfident” (the width of the diamond is very narrow): what would happen if one more study were added?

  • Although the Study 1 effect has the highest precision, the other two are more consistent with each other, so Study 1 seems to have too much leverage on the meta-analytic result.

  • As shown below from the model summary, the test for heterogeneity is statistically significant, and the I^2 suggests that most variability is actually due to heterogeneity (rather than to sampling error).

summary(fit)

Equal-Effects Model (k = 3)

  logLik  deviance       AIC       BIC      AICc   
-75.1142  164.5625  152.2283  151.3269  156.2283   

I^2 (total heterogeneity / total variability):   98.78%
H^2 (total variability / sampling variability):  82.28

Test for Heterogeneity:
Q(df = 2) = 164.5625, p-val < .0001

Model Results:

estimate      se     zval    pval   ci.lb   ci.ub      
  0.5725  0.0173  33.0533  <.0001  0.5386  0.6064  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
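The heterogeneity statistics reported in the summary above (Q, I^2, H^2) can be reproduced from first principles. A base-R sketch with hypothetical effect sizes and variances (the actual data of this example are not shown here):

```r
# Hypothetical effect sizes and sampling variances, for illustration only
eff <- c(0.20, 0.55, 0.90)
vi  <- c(0.010, 0.020, 0.015)

w   <- 1 / vi                        # inverse-variance weights
est <- sum(w * eff) / sum(w)         # equal-effects pooled estimate
Q   <- sum(w * (eff - est)^2)        # Cochran's Q statistic
df  <- length(eff) - 1
H2  <- Q / df                        # total / sampling variability
I2  <- 100 * max(0, (Q - df) / Q)    # % of total variability due to heterogeneity

round(c(estimate = est, Q = Q, H2 = H2, I2 = I2), 3)
```

Under the null hypothesis of no heterogeneity, Q follows a chi-squared distribution with k - 1 degrees of freedom, which is the basis of the “Test for Heterogeneity” in the output.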

Random-Effects models

Now, here is a more plausible estimate, obtained simply by adopting a “random-effects model” (the default in metafor):
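With metafor, the random-effects fit is just `rma(eff, vi)` (method="REML" is the default tau^2 estimator). To show where the between-study variance comes from, here is the closed-form DerSimonian-Laird moment estimator in base R, again on hypothetical data (REML itself is iterative, so DL is shown instead):

```r
# Hypothetical data; with metafor the fit would simply be: fit_re <- rma(eff, vi)
eff <- c(0.20, 0.55, 0.90)
vi  <- c(0.010, 0.020, 0.015)

# DerSimonian-Laird moment estimator of tau^2
w      <- 1 / vi
est_ee <- sum(w * eff) / sum(w)
Q      <- sum(w * (eff - est_ee)^2)
df     <- length(eff) - 1
C      <- sum(w) - sum(w^2) / sum(w)
tau2   <- max(0, (Q - df) / C)       # truncated at zero

# random-effects weights add tau^2 to each sampling variance
w_re   <- 1 / (vi + tau2)
est_re <- sum(w_re * eff) / sum(w_re)
se_re  <- sqrt(1 / sum(w_re))
round(c(tau2 = tau2, est_re = est_re, se_re = se_re), 3)
```

Note how the random-effects weights, 1/(vi + tau^2), are more even across studies than the equal-effects weights 1/vi: a nonzero tau^2 reduces the leverage of the most precise study.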


Equal- and Random-Effects models are almost identical if there is no true heterogeneity

In this case, there is no true heterogeneity in effect size: whether we fit the model with equal or random effects, the results are virtually the same.

  • In the first case, we fit the model with random effects (the default):
fit1 = rma(yi=z, vi=vi, data=df1)
par(mar=c(5,0,0,0))
forest(fit1, xlim=c(-0.3,0.85), xlab="Pearson's r", transf=fisherz2r, at=seq(-0.2,.6,.1), header=T)

summary(fit1)

Random-Effects Model (k = 15; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC      AICc   
 16.2754  -32.5508  -28.5508  -27.2727  -27.4599   

tau^2 (estimated amount of total heterogeneity): 0.0012 (SE = 0.0016)
tau (square root of estimated tau^2 value):      0.0350
I^2 (total heterogeneity / total variability):   28.79%
H^2 (total variability / sampling variability):  1.40

Test for Heterogeneity:
Q(df = 14) = 20.0452, p-val = 0.1287

Model Results:

estimate      se     zval    pval   ci.lb   ci.ub      
  0.1940  0.0174  11.1451  <.0001  0.1598  0.2281  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • In the second case, we fit it with equal effects… the results are virtually the same (not exactly identical because, by chance, some very small heterogeneity was estimated, although it is immaterial):
fit1 = rma(yi=z, vi=vi, data=df1, method = "EE")
par(mar=c(5,0,0,0))
forest(fit1, xlim=c(-0.3,0.85), xlab="Pearson's r", transf=fisherz2r, at=seq(-0.2,.6,.1), header=T)

summary(fit1)

Equal-Effects Model (k = 15)

  logLik  deviance       AIC       BIC      AICc   
 17.7685   20.0452  -33.5371  -32.8290  -33.2294   

I^2 (total heterogeneity / total variability):   30.16%
H^2 (total variability / sampling variability):  1.43

Test for Heterogeneity:
Q(df = 14) = 20.0452, p-val = 0.1287

Model Results:

estimate      se     zval    pval   ci.lb   ci.ub      
  0.1918  0.0140  13.6626  <.0001  0.1643  0.2193  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Equal- and Random-Effects models differ if there is true heterogeneity… and Random-Effects models should be preferred in this case!

  • In the following case, the effect sizes of the studies are the same as before… but now each study is much more precise (has a larger sample). This means that the observed variability across effects can no longer be attributed to sampling error alone, with a single underlying effect valid for all studies. Here, heterogeneity is real: each study seems to reflect a slightly different underlying “true” effect. If we fit a random-effects model, significant heterogeneity emerges, and the “tau” estimate is larger and non-negligible. Also, note that “paradoxically” the uncertainty of the meta-analytic estimate (the width of the diamond) slightly increases! This is because, in the presence of true heterogeneity, part of the uncertainty about the overall effect (diamond) is due to heterogeneity itself.


Random-Effects Model (k = 15; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC      AICc   
 15.0624  -30.1247  -26.1247  -24.8466  -25.0338   

tau^2 (estimated amount of total heterogeneity): 0.0060 (SE = 0.0025)
tau (square root of estimated tau^2 value):      0.0774
I^2 (total heterogeneity / total variability):   95.22%
H^2 (total variability / sampling variability):  20.93

Test for Heterogeneity:
Q(df = 14) = 203.3087, p-val < .0001

Model Results:

estimate      se     zval    pval   ci.lb   ci.ub      
  0.2092  0.0208  10.0388  <.0001  0.1684  0.2501  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
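The “paradox” is easy to see in numbers. In a random-effects model the standard error of the pooled estimate is sqrt(1 / sum(1/(vi + tau^2))), so once tau^2 dominates the sampling variances, making the individual studies more precise barely narrows the confidence interval. A sketch with hypothetical values (tau^2 set to 0.006, in line with the output above):

```r
# Hypothetical: 15 very precise studies, plus non-negligible heterogeneity
vi_large_n <- rep(0.001, 15)   # tiny sampling variances (large samples)
tau2 <- 0.006                  # between-study variance

se_ee <- sqrt(1 / sum(1 / vi_large_n))           # equal-effects SE: ignores tau^2
se_re <- sqrt(1 / sum(1 / (vi_large_n + tau2)))  # random-effects SE: includes tau^2
round(c(se_ee = se_ee, se_re = se_re), 4)
```

The random-effects SE is several times larger here, even though both models see exactly the same data: the extra uncertainty reflects true heterogeneity, not a loss of information.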
  • Now, if we inappropriately fit an Equal-Effects model, we get the illusion that the estimate is more precise… but this is only due to neglecting true heterogeneity, so the estimates (especially their precision) are invalid:


Equal-Effects Model (k = 15)

  logLik  deviance       AIC       BIC      AICc   
-56.4788  203.3087  114.9576  115.6657  115.2653   

I^2 (total heterogeneity / total variability):   93.11%
H^2 (total variability / sampling variability):  14.52

Test for Heterogeneity:
Q(df = 14) = 203.3087, p-val < .0001

Model Results:

estimate      se     zval    pval   ci.lb   ci.ub      
  0.1920  0.0044  43.4135  <.0001  0.1833  0.2006  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In conclusion: always favor random-effects models in meta-analysis! As argued by Borenstein et al. (2009), if there is no true heterogeneity, using a random-effects model does not substantially affect the estimates; but if true heterogeneity is present, only random-effects models are valid. So, in either case, prefer random-effects models. Quite appropriately, random effects are the default in metafor.


A note on multilevel models

In the case of multivariate/multilevel models (e.g., multiple effect sizes nested in samples, nested in studies, nested in research groups, etc.), it is best to test whether there is heterogeneity due to each level of the hierarchical structure of dependencies before dropping that level.

Let’s consider the first few rows of the following meta-analytic dataset, in which you have coded not only the effect sizes, but also the IDs of Samples nested within Studies nested within Research groups…

   ResGroupID StudyID SampleID EffSizeID         eff          vi
1           1       1        1         1  0.23942781 0.004580593
2           1       1        1         2  0.58950537 0.009601082
3           1       1        1         3  0.58001234 0.006880145
4           1       1        2         4  0.34182781 0.003958693
5           1       1        2         5  0.51915177 0.002774320
6           1       1        2         6  0.31348237 0.002038081
7           1       2        3         7  0.23854172 0.009963689
8           1       2        3         8  0.37920708 0.004413491
9           1       2        3         9  0.24893764 0.006057893
10          1       2        4        10  0.35503237 0.007594462
11          1       2        4        11  0.32563916 0.008837250
12          1       2        4        12  0.45813567 0.006149532
13          1       3        5        13 -0.17057887 0.001099325
14          1       3        5        14  0.04321122 0.009156837
15          1       3        5        15 -0.11931147 0.007935883
16          1       3        6        16  0.46126941 0.004442542
17          1       3        6        17  0.25115346 0.001846413
18          1       3        6        18  0.28984681 0.001446882
19          2       4        7        19  0.91599232 0.008390461
20          2       4        7        20  1.17559046 0.008463919

Since effect sizes nested within the same “Sample” share the exact same sampling error (because they are collected on the exact same group of participants), a good idea is to impute a correlation structure across effect sizes nested within the same Sample. The “impute_covariance_matrix” function of the “clubSandwich” package does the job. You need to assume a correlation value between effect sizes (due to sampling error), however: values between 0.50 and 0.70 may be reasonable, but a sensitivity analysis is always recommended.

library(clubSandwich)
V = impute_covariance_matrix(df$vi, cluster=df$SampleID, r=0.60)
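To see what this imputation amounts to, here is a base-R sketch of the block that would be built for a single sample with three effect sizes (variances are hypothetical): within a cluster, each off-diagonal covariance is r * sqrt(vi * vj), and the full V is block-diagonal across clusters.

```r
# By-hand equivalent for one 3-effect cluster (hypothetical variances)
vi_sample <- c(0.0046, 0.0096, 0.0069)
r <- 0.60

# covariance implied by correlation r: cov(i, j) = r * sqrt(vi * vj)
V_block <- r * sqrt(outer(vi_sample, vi_sample))
diag(V_block) <- vi_sample   # diagonal keeps the original sampling variances
round(V_block, 4)
```

Re-running the model with, say, r = 0.30 and r = 0.80 and comparing the results is a simple way to carry out the sensitivity analysis mentioned above.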

Now you are ready to fit the multilevel model and see its summary:

library(metafor)
fit = rma.mv(yi=eff, V=V, random=~ 1|ResGroupID/StudyID/SampleID/EffSizeID, data=df)
summary(fit)

Multivariate Meta-Analysis Model (k = 180; method: REML)

  logLik  Deviance       AIC       BIC      AICc   
 -9.5472   19.0944   29.0944   45.0313   29.4412   

Variance Components:

            estim    sqrt  nlvls  fixed                                 factor 
sigma^2.1  0.0000  0.0000     10     no                             ResGroupID 
sigma^2.2  0.0744  0.2728     30     no                     ResGroupID/StudyID 
sigma^2.3  0.0022  0.0467     60     no            ResGroupID/StudyID/SampleID 
sigma^2.4  0.0386  0.1965    180     no  ResGroupID/StudyID/SampleID/EffSizeID 

Test for Heterogeneity:
Q(df = 179) = 5718.9770, p-val < .0001

Model Results:

estimate      se    zval    pval   ci.lb   ci.ub      
  0.4348  0.0529  8.2196  <.0001  0.3312  0.5385  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • What level(s) account for more heterogeneity in this particular case?

  • What interpretation would you offer in this case?

  • What may be your further steps of investigation?
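One way to start on the first question: express each variance component from the summary above as a share of the total estimated heterogeneity (the sigma^2 values are copied from the output; the level labels are shorthand).

```r
# Variance components from the model summary above
sigma2 <- c(ResGroup = 0.0000, Study = 0.0744,
            Sample = 0.0022, EffSize = 0.0386)

# each level's share of the total estimated heterogeneity (%)
round(100 * sigma2 / sum(sigma2), 1)
```

Shares like these indicate which level(s) of the hierarchy drive the heterogeneity and therefore where moderators are most worth investigating.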