All Interactions Are Wrong (?)

part 2: family vs link function

toffa et al. x psicostat

Problems with interactions?

  • GLM family ensures the data distribution is well approximated, and the error term is correctly specified (so you avoid inflating coefficients precision)

  • …but it’s the link function that ensures that the linear predictor \(X\) is appropriately mapped on the response scale \(Y\), and determines how differences in \(X\) translate to differences in \(Y\), which is critical for interactions!

  • Each GLM family typical has an optimal choice, aka the canonical link function

a “time” (rt) scenario

Show code
set.seed(10)
library(ggplot2)
library(effects)

N = 5e4
A = rbinom(N,1,.5)
B = rbinom(N,1,.5)

b0 = 1
b1 = 1
b2 = 0.5
X = b0 + b1*A + b2*B

rt = rgamma(N, shape = 5, scale = 20*exp(X))
  • A e B sono fattori a 2 livelli

  • shape = 5 determina una skewness = 2/sqrt(shape) = 0.89 in ciascuna condizione

  • rgamma stabilisce che la distribuzione degli rt sarà una Gamma

  • exp(X) determina che la link function è "log"

observed distributions

Show code
df = data.frame(rt,A,B)
condition = paste(paste0("A",A),paste0("B",B))
df$condition = factor(condition)
df$A = as.factor(df$A)
df$B = as.factor(df$B)
ggplot(df)+
  geom_density(aes(x=rt,group=condition,color=condition),linewidth=2)+
  theme(text=element_text(size=24))+
  scale_color_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
  coord_cartesian(xlim=c(min(df$rt),quantile(df$rt,.9975)))

the model, Gamma(link="log")

fitGLog = glm(rt ~ A * B, family=Gamma(link="log"), data=df)
summary(fitGLog)

Call:
glm(formula = rt ~ A * B, family = Gamma(link = "log"), data = df)

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  5.602104   0.003988 1404.634   <2e-16 ***
A1           1.000393   0.005648  177.126   <2e-16 ***
B1           0.500826   0.005661   88.464   <2e-16 ***
A1:B1       -0.004565   0.008020   -0.569    0.569    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.2009792)

    Null deviance: 25370  on 49999  degrees of freedom
Residual deviance: 10331  on 49996  degrees of freedom
AIC: 689236

Number of Fisher Scoring iterations: 4

predicted effects, Gamma(link="log")

Show code
eff = data.frame(allEffects(fitGLog)$"A:B")
ggplot(eff,aes(x=A,y=fit,group=B,color=B,shape=B,linetype=B))+
  geom_point(size=6)+
  geom_line(linewidth=1.5)+
  geom_errorbar(aes(ymin=lower,ymax=upper),width=0)+
  theme(text=element_text(size=24))+
  ylab("rt")

reproduced data, Gamma(link="log")

Show code
df$sim = simulate(fitGLog)$sim_1
ggplot(df)+
  geom_density(aes(x=rt,group=condition,color=condition),linewidth=1)+
  geom_density(aes(x=sim,group=condition,fill=condition),color=NA,alpha=.4)+
  scale_color_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
  scale_fill_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
  theme(text=element_text(size=24))+
  coord_cartesian(xlim=c(min(df$rt),quantile(df$rt,.9975)))

the model, Gamma(link="identity")

fitGId = glm(rt ~ A * B, family=Gamma(link="identity"), data=df)
summary(fitGId)

Call:
glm(formula = rt ~ A * B, family = Gamma(link = "identity"), 
    data = df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  270.996      1.081  250.73   <2e-16 ***
A1           465.937      3.139  148.44   <2e-16 ***
B1           176.170      2.097   84.02   <2e-16 ***
A1:B1        297.360      6.077   48.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.2009792)

    Null deviance: 25370  on 49999  degrees of freedom
Residual deviance: 10331  on 49996  degrees of freedom
AIC: 689236

Number of Fisher Scoring iterations: 3

predicted effects, Gamma(link="identity")

Show code
eff = data.frame(allEffects(fitGId)$"A:B")
ggplot(eff,aes(x=A,y=fit,group=B,color=B,shape=B,linetype=B))+
  geom_point(size=6)+
  geom_line(linewidth=1.5)+
  geom_errorbar(aes(ymin=lower,ymax=upper),width=0)+
  theme(text=element_text(size=24))+
  ylab("rt")

reproduced data, Gamma(link="identity")

Show code
df$sim = simulate(fitGId)$sim_1
ggplot(df)+
  geom_density(aes(x=rt,group=condition,color=condition),linewidth=1)+
  geom_density(aes(x=sim,group=condition,fill=condition),color=NA,alpha=.4)+
  scale_color_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
  scale_fill_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
  theme(text=element_text(size=24))+
  coord_cartesian(xlim=c(min(df$rt),quantile(df$rt,.9975)))

BIC doesn’t tell you

Model with Gamma(link="log")

BIC(fitGLog) # ok, but could be further improved removing interaction
[1] 689279.7

Model with Gamma(link="identity")

BIC(fitGId) # ok, but you must keep a non-existent interaction term
[1] 689279.7

comments, Gamma(link="identity")

  • … so, setting the wrong link function leads to discovering a false interaction (because different differences on the response are taken as reflecting different differences on the linear predictor \(X\), needing an interaction term to accomodate them)

  • however, using the correct distribution (Gamma) leads to perfectly fitting the data anyways!

  • problem is: yes, you fit the data, but you need an interaction term that was NOT in the data-generating process

comments, Gamma(link="identity")

IMPORTANT!

As shown here, wrong link function can still be detected because it causes poor fit with the data and a larger BIC if:

  1. random effects are there (and they are fitted)

  2. some predictors are continuous instead of categorical

but even in these cases… how many observations do we need to confidently tell the link function? Worst scenario is if N that leads to discovering false interaction is lower than N needed to tell the correct link function. Not tested this yet.

the model, gaussian(link="log")

fitNLog = glm(rt ~ A * B, family=gaussian(link="log"), data=df)
summary(fitNLog)

Call:
glm(formula = rt ~ A * B, family = gaussian(link = "log"), data = df)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.602104   0.011043 507.314   <2e-16 ***
A1           1.000393   0.011769  84.999   <2e-16 ***
B1           0.500826   0.012938  38.709   <2e-16 ***
A1:B1       -0.004565   0.013792  -0.331    0.741    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 113148.9)

    Null deviance: 1.1948e+10  on 49999  degrees of freedom
Residual deviance: 5.6570e+09  on 49996  degrees of freedom
AIC: 723723

Number of Fisher Scoring iterations: 5

predicted effects, gaussian(link="log")

Show code
eff = data.frame(allEffects(fitNLog)$"A:B")
ggplot(eff,aes(x=A,y=fit,group=B,color=B,shape=B,linetype=B))+
  geom_point(size=6)+
  geom_line(linewidth=1.5)+
  geom_errorbar(aes(ymin=lower,ymax=upper),width=0)+
  theme(text=element_text(size=24))+
  ylab("rt")

reproduced data, gaussian(link="log")

Show code
df$sim = simulate(fitNLog)$sim_1
ggplot(df)+
  geom_density(aes(x=rt,group=condition,color=condition),linewidth=1)+
  geom_density(aes(x=sim,group=condition,fill=condition),color=NA,alpha=.4)+
  scale_color_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
  scale_fill_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
  theme(text=element_text(size=24))+
  coord_cartesian(xlim=c(min(df$rt),quantile(df$rt,.9975)))

comments, gaussian(link="log")

  • … so, setting the correct link function avoids discovering false interactions even if the distribution is wrong

  • however, using the wrong distribution (gaussian) leads to poorly fitting the observed data!

  • problems are 1) poor fit with the data; 2) residuals will be skewed and heteroscedastic

  • BIC cannot be compared across families

Paradox - Interaction or not?

Show code
set.seed(11)
N = 2e4
b0 = 0
b1 = 1.00
b2 = 0.70
b1b2 = -0.347
A = rbinom(N,1,.5)
B = rbinom(N,1,.5)
X = b0 + b1*A + b2*B + b1b2*A*B
rt = rgamma(N,shape=5,scale=100*exp(X))
df = data.frame(rt,A,B)
df$A = as.factor(df$A)
df$B = as.factor(df$B)

fitGLog = glm(rt ~ A * B, family=Gamma(link="log"), data=df)
eff = data.frame(allEffects(fitGLog)$"A:B")

ggplot(eff,aes(x=A,y=fit,group=B,color=B,shape=B,linetype=B))+
  geom_point(size=6)+
  geom_line(linewidth=1.5)+
  geom_errorbar(aes(ymin=lower,ymax=upper),width=0)+
  theme(text=element_text(size=24))+
  ylab("rt")

Gamma(link="identity")positive interaction

fitGId = glm(rt ~ A * B, family=Gamma(link="identity"), data=df)
summary(fitGId)

Call:
glm(formula = rt ~ A * B, family = Gamma(link = "identity"), 
    data = df)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  509.253      3.246 156.910  < 2e-16 ***
A1           856.896      9.228  92.856  < 2e-16 ***
B1           500.933      7.172  69.842  < 2e-16 ***
A1:B1         45.467     16.496   2.756  0.00585 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.2007665)

    Null deviance: 8448.1  on 19999  degrees of freedom
Residual deviance: 4152.8  on 19996  degrees of freedom
AIC: 301261

Number of Fisher Scoring iterations: 3

Gamma(link="log")negative interaction

fitGLog = glm(rt ~ A * B, family=Gamma(link="log"), data=df)
summary(fitGLog)

Call:
glm(formula = rt ~ A * B, family = Gamma(link = "log"), data = df)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.232945   0.006373  978.01   <2e-16 ***
A1           0.986806   0.008978  109.92   <2e-16 ***
B1           0.684945   0.008984   76.24   <2e-16 ***
A1:B1       -0.348504   0.012674  -27.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.2007665)

    Null deviance: 8448.1  on 19999  degrees of freedom
Residual deviance: 4152.8  on 19996  degrees of freedom
AIC: 301261

Number of Fisher Scoring iterations: 4