Meta-analysis of Cohen’s d - From structuring the dataset to (not actually so) basic analyses

Author

Enrico Toffalini

Dataset for a meta-analysis of a Cohen’s d / Standardized Mean Difference

Here is an example of a small dataset for meta-analyzing a Standardized Mean Difference (Cohen’s d). Data from 20 studies are reported. These might be studies on the efficacy of a treatment (for simplicity, here considering only the post-test scores) or any other case involving the comparison of mean scores. “yearPub” and “weeks” might be moderators of interest.

 StudyID yearPub N_TG   M_TG SD_TG N_CG  M_CG SD_CG weeks
       1    2011   88 104.41  8.65   88 97.08  7.71    13
       2    2022   15  87.54  9.92   15 76.06  9.08    21
       3    2008   87   6.14  0.53   85  5.59  0.52    18
       4    2012   49  72.96  6.44   48 63.06  7.44    21
       5    2009   43  43.49  3.88   43 40.74  3.85    16
       6    2016  100  10.45  0.97  100  9.64  1.09    16
       7    2008  113  92.64  7.95  113 87.87  7.45    21
       8    2013   42  55.36  5.06   43 54.63  5.91     5
       9    2019   34   3.27  0.28   33  2.98  0.24    15
      10    2004   75  37.50  2.93   73 33.93  3.19    20
      11    2021  101  36.83  3.08  101 33.64  2.96    19
      12    2009   21   3.44  0.28   22  3.38  0.36    15
      13    2020  110  95.21  9.86  109 84.69  9.93    22
      14    2010  113  43.38  4.22  113 40.99  4.38    14
      15    2009  104  25.81  2.45  104 24.10  2.97    10
      16    2012  114   0.92  0.08  115  0.90  0.08     6
      17    2022  104  37.72  3.32  103 34.64  3.27     7
      18    2021   65  89.38  7.13   62 83.99  8.14    11
      19    2011  113  12.10  1.01  114 11.29  1.18    15
      20    2019   65   7.79  0.63   63  7.37  0.66    18

1. Compute the effect sizes

Typically, adequate estimates of Standardized Mean Differences are not already reported in the articles, so you have to compute them yourself. Luckily, they can be computed from descriptive statistics that are generally reported (N, M, SD).
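For reference, here is a sketch of the usual textbook formulas (metafor’s implementation is essentially equivalent, using a slightly more exact version of the small-sample correction factor J):

$$
d = \frac{M_{TG} - M_{CG}}{SD_{pooled}}, \qquad
SD_{pooled} = \sqrt{\frac{(N_{TG}-1)\,SD_{TG}^2 + (N_{CG}-1)\,SD_{CG}^2}{N_{TG}+N_{CG}-2}}
$$

$$
g = J \times d, \qquad J \approx 1 - \frac{3}{4(N_{TG}+N_{CG}-2)-1}, \qquad
v_g \approx \frac{1}{N_{TG}} + \frac{1}{N_{CG}} + \frac{g^2}{2(N_{TG}+N_{CG})}
$$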

Since it is very clear that different studies used completely different instruments (note the variability of the descriptive statistics across studies), effect sizes must be standardized. Also, since some studies have very small samples, Hedges’ correction of Cohen’s d (i.e., Hedges’ g) is preferred. Luckily, the “measure="SMD"” argument in the “escalc” function of the “metafor” package already applies Hedges’ correction for small samples.
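The call that produced the “yi” (effect size) and “vi” (sampling variance) columns shown below is a minimal sketch along these lines, assuming the dataset above is stored in a data frame named “df” (as in the models fitted later):

library(metafor)

# "escalc" appends the effect sizes (yi) and sampling variances (vi) to the data
df = escalc(measure="SMD",                    # Hedges' g (bias-corrected SMD)
            m1i=M_TG, sd1i=SD_TG, n1i=N_TG,   # treatment group
            m2i=M_CG, sd2i=SD_CG, n2i=N_CG,   # control group
            data=df)
df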


 StudyID yearPub N_TG   M_TG SD_TG N_CG  M_CG SD_CG weeks     yi     vi 
       1    2011   88 104.41  8.65   88 97.08  7.71    13 0.8907 0.0250 
       2    2022   15  87.54  9.92   15 76.06  9.08    21 1.1746 0.1563 
       3    2008   87   6.14  0.53   85  5.59  0.52    18 1.0428 0.0264 
       4    2012   49  72.96  6.44   48 63.06  7.44    21 1.4126 0.0515 
       5    2009   43  43.49  3.88   43 40.74  3.85    16 0.7051 0.0494 
       6    2016  100  10.45  0.97  100  9.64  1.09    16 0.7821 0.0215 
       7    2008  113  92.64  7.95  113 87.87  7.45    21 0.6171 0.0185 
       8    2013   42  55.36  5.06   43 54.63  5.91     5 0.1314 0.0472 
       9    2019   34   3.27  0.28   33  2.98  0.24    15 1.0979 0.0687 
      10    2004   75  37.50  2.93   73 33.93  3.19    20 1.1603 0.0316 
      11    2021  101  36.83  3.08  101 33.64  2.96    19 1.0521 0.0225 
      12    2009   21   3.44  0.28   22  3.38  0.36    15 0.1821 0.0935 
      13    2020  110  95.21  9.86  109 84.69  9.93    22 1.0595 0.0208 
      14    2010  113  43.38  4.22  113 40.99  4.38    14 0.5539 0.0184 
      15    2009  104  25.81  2.45  104 24.10  2.97    10 0.6258 0.0202 
      16    2012  114   0.92  0.08  115  0.90  0.08     6 0.2492 0.0176 
      17    2022  104  37.72  3.32  103 34.64  3.27     7 0.9313 0.0214 
      18    2021   65  89.38  7.13   62 83.99  8.14    11 0.7013 0.0334 
      19    2011  113  12.10  1.01  114 11.29  1.18    15 0.7348 0.0188 
      20    2019   65   7.79  0.63   63  7.37  0.66    18 0.6473 0.0329 

2. Intercept-only model

Traditionally, a good first step is to fit the “intercept-only model”. This computes nothing but a meta-analytic “weighted average” of the effect sizes, plus the estimated heterogeneity (unless you set “method="EE"” for an Equal-Effects [fixed-effects] model, which is generally suboptimal).
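Formally, the random-effects model can be sketched as

$$
y_i = \mu + u_i + \varepsilon_i, \qquad u_i \sim \mathcal{N}(0,\ \tau^2), \qquad \varepsilon_i \sim \mathcal{N}(0,\ v_i)
$$

so each study enters the “weighted average” with weight $w_i = 1/(v_i + \hat{\tau}^2)$.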

Just doing this is very simple:

fit0 = rma(yi, vi, data=df)

summary(fit0)

Random-Effects Model (k = 20; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC      AICc   
 -5.8247   11.6493   15.6493   17.5382   16.3993   

tau^2 (estimated amount of total heterogeneity): 0.0655 (SE = 0.0315)
tau (square root of estimated tau^2 value):      0.2558
I^2 (total heterogeneity / total variability):   69.87%
H^2 (total variability / sampling variability):  3.32

Test for Heterogeneity:
Q(df = 19) = 60.9459, p-val < .0001

Model Results:

estimate      se     zval    pval   ci.lb   ci.ub      
  0.7799  0.0703  11.0920  <.0001  0.6421  0.9177  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Any comment about…

… the overall mean effect?

… the heterogeneity?

Lastly, let’s have a look at the forest plot:

par(mar=c(5,0,0,0))

forest(fit0, header=T)

3. Testing moderator(s)

Would you proceed to test moderators in this case? Why?

Let’s fit a few alternative models with moderators. Practically, “moderators” are just (linear) predictors of the effect size. They are generally called “moderators” rather than “predictors” because the dependent variable is an effect size, not a score (if you “predict” an SMD, you are effectively saying that the predictor “moderates” the effect of Group -> Score).
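In model form, a mixed-effects meta-regression with both moderators can be sketched as

$$
y_i = \beta_0 + \beta_1\,\mathrm{weeks}_i + \beta_2\,\mathrm{yearPub}_i + u_i + \varepsilon_i, \qquad u_i \sim \mathcal{N}(0,\ \tau^2)
$$

where $\tau^2$ is now the residual heterogeneity left unexplained by the moderators.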

These may be some alternative models including moderators:

fit1 = rma(yi, vi, mods=~weeks+yearPub, data=df)
fit2 = rma(yi, vi, mods=~yearPub, data=df)
fit3 = rma(yi, vi, mods=~weeks, data=df)

First, let’s see whether adding both moderators “in block” improves the model, compared to the original intercept-only model. We need to set “refit=TRUE” because only models fitted with (full) Maximum Likelihood can be directly compared via a likelihood-ratio test; models fitted with Restricted Maximum Likelihood (the default) that differ in their fixed effects cannot, so “refit=TRUE” refits both of them with ML first.

anova(fit0, fit1, refit=T)

        df     AIC     BIC    AICc  logLik     LRT   pval      QE  tau^2      R^2 
Full     4  6.4718 10.4547  9.1385  0.7641                33.4510 0.0185          
Reduced  2 15.1431 17.1345 15.8490 -5.5715 12.6713 0.0018 60.9459 0.0599 69.1291% 

“Full” model means the one with more parameters (i.e., the one including the predictors/moderators). “Reduced” model means the one with fewer parameters (i.e., the one including no predictors/moderators, or fewer of them).

After having established that entering the moderators contributes meaningfully to explaining heterogeneity (BTW, how much?), we can see whether each of them is “significant” when taken alone. So, we separately remove each of them from the “full” model. IMPORTANT: this procedure is just one among many possible alternatives (one is sketched below)! You may test moderators in the exact same way as you would test predictors in any linear regression model!
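For instance, a Wald-type alternative that requires no ML refitting is to test individual coefficients directly on the REML fit, via the “btt” argument of “anova” (coefficient positions assumed from the order of terms in “fit1”: 1 = intercept, 2 = weeks, 3 = yearPub):

anova(fit1, btt=2)  # Wald-type test of the "weeks" coefficient alone
anova(fit1, btt=3)  # Wald-type test of the "yearPub" coefficient alone

Here, though, we proceed with likelihood-ratio comparisons between nested models: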

anova(fit1, fit2, refit=T)

        df     AIC     BIC    AICc  logLik     LRT   pval      QE  tau^2      R^2 
Full     4  6.4718 10.4547  9.1385  0.7641                33.4510 0.0185          
Reduced  3 16.2224 19.2096 17.7224 -5.1112 11.7506 0.0006 57.4867 0.0561 66.9883% 

In the above comparison, we tested “fit1” (with both “weeks” and “yearPub”) against “fit2” (with only “yearPub”). Any comment?

anova(fit1, fit3, refit=T)

        df    AIC     BIC   AICc  logLik    LRT   pval      QE  tau^2      R^2 
Full     4 6.4718 10.4547 9.1385  0.7641               33.4510 0.0185          
Reduced  3 6.8112  9.7984 8.3112 -0.4056 2.3394 0.1261 38.3251 0.0252 26.5672% 

In the above comparison, we tested “fit1” (with both “weeks” and “yearPub”) against “fit3” (with only “weeks”). Any comment?

Now, let’s see the summary of the final “best-fitting” model (note that some parameters may differ from those shown in the “anova” comparisons above, because those models were refitted via ML, while this model is fitted via the default REML):

summary(fit3)

Mixed-Effects Model (k = 20; tau^2 estimator: REML)

  logLik  deviance       AIC       BIC      AICc   
 -1.3988    2.7975    8.7975   11.4686   10.5118   

tau^2 (estimated amount of residual heterogeneity):     0.0318 (SE = 0.0206)
tau (square root of estimated tau^2 value):             0.1783
I^2 (residual heterogeneity / unaccounted variability): 52.62%
H^2 (unaccounted variability / sampling variability):   2.11
R^2 (amount of heterogeneity accounted for):            51.45%

Test for Residual Heterogeneity:
QE(df = 18) = 38.3251, p-val = 0.0035

Test of Moderators (coefficient 2):
QM(df = 1) = 11.6179, p-val = 0.0007

Model Results:

         estimate      se    zval    pval    ci.lb   ci.ub      
intrcpt    0.2166  0.1737  1.2465  0.2126  -0.1240  0.5571      
weeks      0.0376  0.0110  3.4085  0.0007   0.0160  0.0593  *** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Any comment about…

… the overall mean effect? (is it still there?!)

… the meaning of the “weeks” coefficient?

… the (residual) heterogeneity?
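One way to make the “weeks” coefficient concrete is to compute the model-implied effect size at a few specific treatment durations (values chosen here just for illustration, spanning the observed 5–22 weeks range):

predict(fit3, newmods=c(5, 15, 22))  # predicted SMD at 5, 15, and 22 weeks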


If your model is not too complex, and especially if your moderators are quantitative, showing the regression plot might be a good idea:

regplot(fit3)


And what about the forest plot now?

par(mar=c(5,0,0,0))

forest(fit3, header=T)