(Non)Normal distributions

toffa et al. x psicostat

a typical (mixed-effects) linear model

library(lme4)
fit <- lmer(y ~ group + cond + (1 | id), data = df)

Generally, random effects and residuals are taken as normally-distributed:

random intercepts ~ \(N(0,\tau)\)

residuals ~ \(N(0,\sigma)\)
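To make these assumptions concrete, here is a minimal base-R sketch that simulates data with the same structure as the model above. All numbers (sample sizes, fixed effects, and the values of tau and sigma) are hypothetical, and tau and sigma are treated here as standard deviations, which is an assumption about the notation.

```r
set.seed(1)

n_id  <- 200   # number of participants (hypothetical)
n_obs <- 10    # observations per participant (hypothetical)
tau   <- 1.0   # SD of the random intercepts (assumed value)
sigma <- 0.5   # SD of the residuals (assumed value)

id    <- rep(seq_len(n_id), each = n_obs)
group <- rep(rbinom(n_id, 1, 0.5), each = n_obs)           # between-participant factor
cond  <- rep(rep(0:1, length.out = n_obs), times = n_id)   # within-participant factor

u <- rnorm(n_id, mean = 0, sd = tau)            # random intercepts ~ N(0, tau)
e <- rnorm(n_id * n_obs, mean = 0, sd = sigma)  # residuals ~ N(0, sigma)

# Fixed effects of 0.3 and 0.5 are arbitrary illustration values
y  <- 0.3 * group + 0.5 * cond + u[id] + e
df <- data.frame(id = factor(id), group = group, cond = cond, y = y)
```

A data frame built this way has the columns the lmer() call expects (y, group, cond, id).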

But why normally-distributed?

tendency towards “normality”?

the underlying normality of responses (residuals) and/or individual differences (random intercepts) arises from the sum of a great number of independent factors/causes (e.g., environmental, genetic, contextual), which makes a lot of sense in psychology and beyond!

tendency towards “normality”?

this follows from the Central Limit Theorem (CLT), which applies to the sum of (many) independent and identically distributed variables (Lindeberg–Lévy CLT) and, under additional conditions, even to independent but non-identically distributed variables (Lyapunov CLT)
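A quick simulation can illustrate the idea. The sketch below assumes each "cause" is an independent exponential variable (a strongly skewed distribution chosen only for illustration) and compares the skewness of a single cause with that of the sum of 100 causes. The helper `skewness()` is defined here and is not from any package.

```r
set.seed(2)

# Simple moment-based skewness (helper defined for this example)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

n_people <- 10000
n_causes <- 100   # hypothetical number of independent causes

one_cause   <- rexp(n_people)   # a single, strongly skewed cause
many_causes <- rowSums(matrix(rexp(n_people * n_causes), nrow = n_people))

skewness(one_cause)    # strongly right-skewed
skewness(many_causes)  # much closer to 0 (symmetry)
```

In theory the skewness is 2 for a single exponential variable and 2 / sqrt(100) = 0.2 for the sum of 100 of them, so the sum is already close to symmetric.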

but then, why are so many variables skewed?

in many (most) cases, observed scores reflect the underlying dimension of interest; but we do not directly observe the dimension of interest, we only observe scores that generally:

are aggregates of non-normally distributed responses that may only approximate normality (e.g., sum scores from binomial/ordinal responses of a limited number of items)
have a lower bound (e.g., zero for times, errors), or
have both a lower and an upper bound (e.g., accuracies, sum scores)

TIME

while the underlying ability of interest might be normally distributed, observed times cannot be, because they have a lower bound at zero

IMPORTANT: equal intervals on the right panel do NOT reflect equal intervals on the left panel

TIME: mean vs variance

true underlying (maybe normally-distributed) scores: experimental condition (or group) purple is more difficult (less able) than experimental condition (group) orange; there is only a shift in mean value
observed scores: as the mean increases, the variance also increases
the LINK FUNCTION links the observed scores to the true underlying scores; in the case above it is the logarithm (link="log"; typical for times)
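The mean-variance relationship can be reproduced with a small simulation. This is only a sketch: it assumes normally distributed latent scores that are exponentiated to obtain times (a lognormal model), which is one simple way to produce the pattern and not exactly what a GLM with link = "log" assumes. The means and the SD are hypothetical.

```r
set.seed(3)

n     <- 10000
sigma <- 0.4   # same SD on the latent (log) scale in both conditions (assumed)

latent_orange <- rnorm(n, mean = 0, sd = sigma)
latent_purple <- rnorm(n, mean = 1, sd = sigma)   # only the mean is shifted

# Observed times: inverse of the log link, always positive
time_orange <- exp(latent_orange)
time_purple <- exp(latent_purple)

c(orange = sd(latent_orange), purple = sd(latent_purple))   # about equal
c(orange = mean(time_orange), purple = mean(time_purple))   # purple is slower
c(orange = sd(time_orange),   purple = sd(time_purple))     # purple is also more variable
```

On the latent scale the two conditions differ only in their mean; on the observed scale the slower condition also shows the larger spread.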

ERRORS

the case of errors is very similar: a lower bound at zero again exists, with the difference that the observations are discrete (not continuous)

IMPORTANT: equal intervals on the right panel do NOT reflect equal intervals on the left panel
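As an illustration, the sketch below assumes error counts follow a Poisson distribution with a log link (a common choice for counts, used here as an assumption). Equally spaced values on the link scale map to increasingly distant expected counts, and the variance grows with the mean. The values in `eta` are hypothetical.

```r
set.seed(4)

eta <- c(0, 1, 2, 3)   # equally spaced on the link (log) scale
mu  <- exp(eta)        # expected error counts on the observed scale

diff(eta)   # equal steps
diff(mu)    # steps that keep growing

# Simulated discrete, non-negative counts: one column per value of eta
counts <- sapply(mu, function(m) rpois(5000, lambda = m))

colMeans(counts)        # close to mu
apply(counts, 2, var)   # the variance increases together with the mean
```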

ACCURACIES, BOUNDED SUM SCORES

`link = "probit"`

this is very typical of distributions arising from binomial processes (e.g., accuracies) but also ordinal processes (e.g., sum scores of scales, questionnaires)

IMPORTANT: once again, note that equal intervals on the right panel do NOT reflect equal intervals on the left panel
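The same point can be checked numerically. In this small sketch, `pnorm()` is used as the inverse of the probit link, and the grid of latent values is arbitrary.

```r
eta <- seq(-3, 3, by = 1)   # equally spaced on the latent (probit) scale
p   <- pnorm(eta)           # expected accuracy, bounded between 0 and 1

round(diff(p), 3)   # steps are large near p = 0.5 and shrink towards both bounds
```

A one-unit step on the latent scale changes accuracy by about 0.34 around the middle of the scale but only by about 0.02 near the floor or the ceiling.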

`binomial(link = "probit")`

In binomial and ordinal processes, even if underlying individual differences (random intercepts) are normally-distributed, observed sum scores may not be “normal”.

When accuracies or sum scores are computed on 5, 10, or 20 items, the analysis must take into account that the underlying data-generating process is binomial.

Only with very many items/trials is the error term (residuals) approximately normally distributed (conceptually, family = gaussian(link = "probit"); note that base R's gaussian() accepts only the identity, log, and inverse links). This is the main reason why we (should) use family = binomial instead of family = gaussian when dealing with accuracies…
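Both points can be sketched with a simulation. The numbers below are hypothetical: a latent ability with mean 1.5 and SD 1 on the probit scale, tests of 10 or 1000 items, and a fixed success probability of 0.9 for the second part. The helper `skewness()` is defined here and is not from any package.

```r
set.seed(5)

# Simple moment-based skewness (helper defined for this example)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

n_people <- 5000

# 1) Normal individual differences, binomial sum scores on 10 items
ability  <- rnorm(n_people, mean = 1.5, sd = 1)   # latent scale (assumed values)
p        <- pnorm(ability)                        # inverse probit link
score_10 <- rbinom(n_people, size = 10, prob = p)

skewness(ability)    # about 0: the latent trait is symmetric
skewness(score_10)   # clearly negative: scores pile up near the ceiling

# 2) Binomial error alone, for a fixed success probability of 0.9
few_items  <- rbinom(n_people, size = 10,   prob = 0.9)
many_items <- rbinom(n_people, size = 1000, prob = 0.9)

skewness(few_items)    # still clearly skewed
skewness(many_items)   # close to 0: approximately normal
```

For a binomial variable the theoretical skewness is (1 - 2p) / sqrt(n p (1 - p)), which gives about -0.84 with 10 items and about -0.08 with 1000 items when p = 0.9.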

Differences between differences

In all previous cases, we noted that equal differences on the observed scores do NOT reflect equal differences on the underlying ability/trait.

➜ This may have devastating consequences when testing interactions, because interactions can be seen as tests of whether there are differences between differences (i.e., whether a difference is equal to another difference)

Note that the link function transforms equal intervals into unequal intervals
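A worked example with a hypothetical 2 x 2 design may help. The cell means below are made up: they are perfectly additive on the link (log) scale, so there is no interaction there, yet the "difference between differences" on the observed scale is far from zero. The helper `diff_of_diffs()` is defined only for this illustration.

```r
# Cell means on the link scale: the effect of cond is 0.5 in both groups
eta <- matrix(c(0, 0.5, 1, 1.5), nrow = 2,
              dimnames = list(cond = c("A", "B"), group = c("orange", "purple")))

# Difference between differences: (B - A in purple) - (B - A in orange)
diff_of_diffs <- function(m) {
  (m["B", "purple"] - m["A", "purple"]) - (m["B", "orange"] - m["A", "orange"])
}

observed <- exp(eta)   # back-transform with the inverse of the log link

diff_of_diffs(eta)        # 0: no interaction on the underlying scale
diff_of_diffs(observed)   # positive: an apparent interaction on the observed scale
```

Testing the interaction on the observed scores with a Gaussian model would therefore flag an effect that, in this example, is produced entirely by the link function.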