Show code
set.seed(10)
library(ggplot2)
library(effects)
= 5e4
N = rbinom(N,1,.5)
A = rbinom(N,1,.5)
B
= 1
b0 = 1
b1 = 0.5 b2
part 2: family vs link function
GLM family ensures the data distribution is well approximated, and the error term is correctly specified (so you avoid inflating coefficients precision)
…but it’s the link function that ensures that the linear predictor \(X\) is appropriately mapped on the response scale \(Y\), and determines how differences in \(X\) translate to differences in \(Y\), which is critical for interactions!
Each GLM family typical has an optimal choice, aka the canonical link function
A
e B
sono fattori a 2 livelli
shape = 5
determina una skewness = 2/sqrt(shape)
= 0.89
in ciascuna condizione
rgamma
stabilisce che la distribuzione degli rt
sarà una Gamma
exp(X)
determina che la link function è "log"
df = data.frame(rt,A,B)
condition = paste(paste0("A",A),paste0("B",B))
df$condition = factor(condition)
df$A = as.factor(df$A)
df$B = as.factor(df$B)
ggplot(df)+
geom_density(aes(x=rt,group=condition,color=condition),linewidth=2)+
theme(text=element_text(size=24))+
scale_color_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
coord_cartesian(xlim=c(min(df$rt),quantile(df$rt,.9975)))
Gamma(link="log")
Call:
glm(formula = rt ~ A * B, family = Gamma(link = "log"), data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.602104 0.003988 1404.634 <2e-16 ***
A1 1.000393 0.005648 177.126 <2e-16 ***
B1 0.500826 0.005661 88.464 <2e-16 ***
A1:B1 -0.004565 0.008020 -0.569 0.569
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.2009792)
Null deviance: 25370 on 49999 degrees of freedom
Residual deviance: 10331 on 49996 degrees of freedom
AIC: 689236
Number of Fisher Scoring iterations: 4
Gamma(link="log")
Gamma(link="log")
df$sim = simulate(fitGLog)$sim_1
ggplot(df)+
geom_density(aes(x=rt,group=condition,color=condition),linewidth=1)+
geom_density(aes(x=sim,group=condition,fill=condition),color=NA,alpha=.4)+
scale_color_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
scale_fill_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
theme(text=element_text(size=24))+
coord_cartesian(xlim=c(min(df$rt),quantile(df$rt,.9975)))
Gamma(link="identity")
Call:
glm(formula = rt ~ A * B, family = Gamma(link = "identity"),
data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 270.996 1.081 250.73 <2e-16 ***
A1 465.937 3.139 148.44 <2e-16 ***
B1 176.170 2.097 84.02 <2e-16 ***
A1:B1 297.360 6.077 48.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.2009792)
Null deviance: 25370 on 49999 degrees of freedom
Residual deviance: 10331 on 49996 degrees of freedom
AIC: 689236
Number of Fisher Scoring iterations: 3
Gamma(link="identity")
Gamma(link="identity")
df$sim = simulate(fitGId)$sim_1
ggplot(df)+
geom_density(aes(x=rt,group=condition,color=condition),linewidth=1)+
geom_density(aes(x=sim,group=condition,fill=condition),color=NA,alpha=.4)+
scale_color_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
scale_fill_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
theme(text=element_text(size=24))+
coord_cartesian(xlim=c(min(df$rt),quantile(df$rt,.9975)))
BIC
doesn’t tell youModel with Gamma(link="log")
Gamma(link="identity")
IMPORTANT!
As shown here, wrong link function can still be detected because it causes poor fit with the data and a larger BIC if:
random effects are there (and they are fitted)
some predictors are continuous instead of categorical
but even in these cases… how many observations do we need to confidently tell the link function? Worst scenario is if N that leads to discovering false interaction is lower than N needed to tell the correct link function. Not tested this yet.
gaussian(link="log")
Call:
glm(formula = rt ~ A * B, family = gaussian(link = "log"), data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.602104 0.011043 507.314 <2e-16 ***
A1 1.000393 0.011769 84.999 <2e-16 ***
B1 0.500826 0.012938 38.709 <2e-16 ***
A1:B1 -0.004565 0.013792 -0.331 0.741
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 113148.9)
Null deviance: 1.1948e+10 on 49999 degrees of freedom
Residual deviance: 5.6570e+09 on 49996 degrees of freedom
AIC: 723723
Number of Fisher Scoring iterations: 5
gaussian(link="log")
gaussian(link="log")
df$sim = simulate(fitNLog)$sim_1
ggplot(df)+
geom_density(aes(x=rt,group=condition,color=condition),linewidth=1)+
geom_density(aes(x=sim,group=condition,fill=condition),color=NA,alpha=.4)+
scale_color_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
scale_fill_manual(values=c("darkorange","darkturquoise","coral3","deepskyblue4"))+
theme(text=element_text(size=24))+
coord_cartesian(xlim=c(min(df$rt),quantile(df$rt,.9975)))
gaussian(link="log")
… so, setting the correct link function avoids discovering false interactions even if the distribution is wrong
however, using the wrong distribution (gaussian
) leads to poorly fitting the observed data!
problems are 1) poor fit with the data; 2) residuals will be skewed and heteroscedastic
BIC
cannot be compared across families
set.seed(11)
N = 2e4
b0 = 0
b1 = 1.00
b2 = 0.70
b1b2 = -0.347
A = rbinom(N,1,.5)
B = rbinom(N,1,.5)
X = b0 + b1*A + b2*B + b1b2*A*B
rt = rgamma(N,shape=5,scale=100*exp(X))
df = data.frame(rt,A,B)
df$A = as.factor(df$A)
df$B = as.factor(df$B)
fitGLog = glm(rt ~ A * B, family=Gamma(link="log"), data=df)
eff = data.frame(allEffects(fitGLog)$"A:B")
ggplot(eff,aes(x=A,y=fit,group=B,color=B,shape=B,linetype=B))+
geom_point(size=6)+
geom_line(linewidth=1.5)+
geom_errorbar(aes(ymin=lower,ymax=upper),width=0)+
theme(text=element_text(size=24))+
ylab("rt")
Gamma(link="identity")
→ positive interaction
Call:
glm(formula = rt ~ A * B, family = Gamma(link = "identity"),
data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 509.253 3.246 156.910 < 2e-16 ***
A1 856.896 9.228 92.856 < 2e-16 ***
B1 500.933 7.172 69.842 < 2e-16 ***
A1:B1 45.467 16.496 2.756 0.00585 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.2007665)
Null deviance: 8448.1 on 19999 degrees of freedom
Residual deviance: 4152.8 on 19996 degrees of freedom
AIC: 301261
Number of Fisher Scoring iterations: 3
Gamma(link="log")
→ negative interaction
Call:
glm(formula = rt ~ A * B, family = Gamma(link = "log"), data = df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.232945 0.006373 978.01 <2e-16 ***
A1 0.986806 0.008978 109.92 <2e-16 ***
B1 0.684945 0.008984 76.24 <2e-16 ***
A1:B1 -0.348504 0.012674 -27.50 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.2007665)
Null deviance: 8448.1 on 19999 degrees of freedom
Residual deviance: 4152.8 on 19996 degrees of freedom
AIC: 301261
Number of Fisher Scoring iterations: 4
comments,
Gamma(link="identity")
… so, setting the wrong link function leads to discovering a false interaction (because different differences on the response are taken as reflecting different differences on the linear predictor \(X\), needing an interaction term to accomodate them)
however, using the correct distribution (
Gamma
) leads to perfectly fitting the data anyways!problem is: yes, you fit the data, but you need an interaction term that was NOT in the data-generating process