The identical output of lm() and aov(), and of lm() and t.test(), in R

2022/11/21

This one hardly counts as a blog post; it is more a reference for myself.

For some time, I’ve known on a conceptual level that a linear regression model (one “fit” using lm() in the statistical software and programming language R), ANOVA (aov()), and the t-test (t.test()) are closely related, though they’re often taught as separate techniques.

I’m going to add the code now, hoping to return to and expand on this later.

Let’s use a data set about penguins.

library(tidyverse)
library(palmerpenguins)

penguins %>% 
    glimpse()
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
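The glimpse above shows some NA values, and the model summaries further down report 2 and 11 observations deleted due to missingness. A minimal sketch of how those counts could be checked with base R (the variable names n_missing_bill and n_missing_sex_model are my own, chosen for illustration):

```r
library(palmerpenguins)

# Rows with a missing outcome are dropped by any model of bill_length_mm.
n_missing_bill <- sum(is.na(penguins$bill_length_mm))

# Rows missing either the outcome or sex are dropped by the sex model.
n_missing_sex_model <- sum(!complete.cases(penguins[, c("bill_length_mm", "sex")]))

n_missing_bill
n_missing_sex_model
```

Since island has no missing values in this data set, the first count is also the number of rows dropped from the island model.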

lm() and aov()

Note how the F-value, df, and p-value are identical.

summary(lm(bill_length_mm ~ island, data = penguins))
## 
## Call:
## lm(formula = bill_length_mm ~ island, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0677  -3.8559   0.2958   3.8175  14.3425 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      45.2575     0.3897 116.127  < 2e-16 ***
## islandDream      -1.0897     0.5970  -1.825   0.0688 .  
## islandTorgersen  -6.3065     0.8057  -7.827 6.44e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.036 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.154,  Adjusted R-squared:  0.149 
## F-statistic: 30.86 on 2 and 339 DF,  p-value: 4.86e-13
summary(aov(bill_length_mm ~ island, data = penguins))
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## island        2   1566   782.8   30.86 4.86e-13 ***
## Residuals   339   8599    25.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
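Rather than comparing the printed output by eye, the match can also be checked programmatically. This is only a sketch: it pulls the F-statistic out of the lm() summary and the first row of the aov() table, then compares them with all.equal(). The object names (fit_lm, fit_aov, f_lm, aov_tab, and the *_match flags) are my own.

```r
library(palmerpenguins)

fit_lm  <- lm(bill_length_mm ~ island, data = penguins)
fit_aov <- aov(bill_length_mm ~ island, data = penguins)

# summary.lm() stores the F-statistic as a named vector: value, numdf, dendf.
f_lm <- summary(fit_lm)$fstatistic
# The p-value is not stored, so recompute it from the F distribution.
p_lm <- pf(f_lm[["value"]], f_lm[["numdf"]], f_lm[["dendf"]], lower.tail = FALSE)

# summary.aov() returns a list; its first element is the ANOVA table.
aov_tab <- summary(fit_aov)[[1]]

f_match  <- isTRUE(all.equal(f_lm[["value"]], aov_tab[["F value"]][1]))
p_match  <- isTRUE(all.equal(p_lm, aov_tab[["Pr(>F)"]][1]))
df_match <- isTRUE(all.equal(c(f_lm[["numdf"]], f_lm[["dendf"]]),
                             as.numeric(aov_tab[["Df"]])))

c(f = f_match, p = p_match, df = df_match)
```

All three flags should come back TRUE, which is the same thing the two printed summaries show.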

lm() and t.test()

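Note that the t-value, its df, and the p-value are identical for the sex variable, apart from the sign of t: lm() reports the male minus female difference, while t.test() reports female minus male.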

summary(lm(bill_length_mm ~ sex, data = penguins))
## 
## Call:
## lm(formula = bill_length_mm ~ sex, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2548  -4.7548   0.8452   4.3030  15.9030 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.0970     0.4003 105.152  < 2e-16 ***
## sexmale       3.7578     0.5636   6.667 1.09e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.143 on 331 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.1184, Adjusted R-squared:  0.1157 
## F-statistic: 44.45 on 1 and 331 DF,  p-value: 1.094e-10
t.test(bill_length_mm ~ sex, var.equal = TRUE, data = penguins)
## 
##  Two Sample t-test
## 
## data:  bill_length_mm by sex
## t = -6.667, df = 331, p-value = 1.094e-10
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -4.866557 -2.649027
## sample estimates:
## mean in group female   mean in group male 
##             42.09697             45.85476
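The same comparison can be sketched in code. The snippet below extracts the t-value for sexmale from the lm() coefficient table and the statistic from t.test(), and checks that they agree once the sign is flipped. It also checks that the squared t-value equals the F-statistic that lm() prints (6.667 squared is about 44.45). Object names such as fit_lm, tt, t_lm, and t_tt are mine, not anything the functions require.

```r
library(palmerpenguins)

fit_lm <- lm(bill_length_mm ~ sex, data = penguins)
tt     <- t.test(bill_length_mm ~ sex, var.equal = TRUE, data = penguins)

coefs <- coef(summary(fit_lm))          # matrix of estimates, SEs, t, p
t_lm  <- coefs["sexmale", "t value"]    # male minus female
t_tt  <- unname(tt$statistic)           # female minus male

# Same magnitude, opposite sign.
t_match  <- isTRUE(all.equal(t_lm, -t_tt))
# Same residual degrees of freedom and p-value.
df_match <- isTRUE(all.equal(as.numeric(df.residual(fit_lm)), unname(tt$parameter)))
p_match  <- isTRUE(all.equal(coefs["sexmale", "Pr(>|t|)"], tt$p.value))
# With a single two-level predictor, F is the square of t.
f_match  <- isTRUE(all.equal(t_lm^2, summary(fit_lm)$fstatistic[["value"]]))

c(t = t_match, df = df_match, p = p_match, f = f_match)
```

The match depends on var.equal = TRUE. By default, t.test() runs Welch’s test, which uses a different standard error and degrees of freedom, so its output would generally not line up with lm().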

I may come back to this later to flesh this out more…