This one hardly counts as a blog post; it is more a reference for myself.
For some time, I’ve known on a conceptual level that a linear regression model (one “fit” using lm() in the statistical software and programming language R), ANOVA (aov()), and t-test (t.test()) are closely related, though they’re often taught as separate techniques.
I’m going to add the code now, hoping to return to and expand on this later.
Let’s use a data set about penguins.
library(tidyverse)
library(palmerpenguins)
penguins %>%
  glimpse()
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
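One way to see why these techniques are related is to look at how lm() encodes a factor. The sketch below assumes R's default treatment contrasts. The object names (X, baseline, fit, baseline_mean) are only illustrative.

```r
library(palmerpenguins)

# Design matrix that lm() would build for a one-factor model:
# an intercept plus one 0/1 indicator column per non-baseline level.
X <- model.matrix(~ island, data = penguins)
colnames(X)

# The baseline is the first factor level, so it gets no column of its own.
baseline <- levels(penguins$island)[1]

# With this coding, the intercept is the baseline group's mean
# (rows with a missing bill length are dropped in both calculations).
fit <- lm(bill_length_mm ~ island, data = penguins)
baseline_mean <- mean(penguins$bill_length_mm[penguins$island == baseline],
                      na.rm = TRUE)
all.equal(unname(coef(fit)["(Intercept)"]), baseline_mean)
```

The remaining coefficients are then differences between each other group's mean and the baseline mean, which is what ANOVA and the t-test compare.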
lm() and aov()
Note how the F-value, df, and p-value are identical: the F-statistic line at the bottom of the lm() summary matches the island row of the aov() table.
summary(lm(bill_length_mm ~ island, data = penguins))
##
## Call:
## lm(formula = bill_length_mm ~ island, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0677 -3.8559 0.2958 3.8175 14.3425
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.2575 0.3897 116.127 < 2e-16 ***
## islandDream -1.0897 0.5970 -1.825 0.0688 .
## islandTorgersen -6.3065 0.8057 -7.827 6.44e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.036 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.154, Adjusted R-squared: 0.149
## F-statistic: 30.86 on 2 and 339 DF, p-value: 4.86e-13
summary(aov(bill_length_mm ~ island, data = penguins))
## Df Sum Sq Mean Sq F value Pr(>F)
## island 2 1566 782.8 30.86 4.86e-13 ***
## Residuals 339 8599 25.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
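Rather than comparing the printed output by eye, the match can be checked programmatically. This is a minimal sketch, with object names (fit_lm, fit_aov, f_lm, p_lm, tab_aov) chosen only for illustration. It pulls the overall F-test out of the lm() summary and compares it with the first row of the aov() table.

```r
library(palmerpenguins)

fit_lm  <- lm(bill_length_mm ~ island, data = penguins)
fit_aov <- aov(bill_length_mm ~ island, data = penguins)

# Overall F-test from the lm() summary: a named vector (value, numdf, dendf).
f_lm <- summary(fit_lm)$fstatistic
p_lm <- pf(f_lm[["value"]], f_lm[["numdf"]], f_lm[["dendf"]],
           lower.tail = FALSE)

# The first row of the aov() table is the island term.
tab_aov <- summary(fit_aov)[[1]]

all.equal(f_lm[["value"]], tab_aov[["F value"]][1])
all.equal(p_lm, tab_aov[["Pr(>F)"]][1])

# anova() applied to the lm() fit prints the same ANOVA table.
anova(fit_lm)
```

The agreement is no coincidence: aov() fits its model through lm(), so both calls rest on the same least-squares fit.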
lm() and t.test()
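Note that the t-value (up to its sign), its df, and its p-value are identical for the sex variable. The sign flips because the lm() coefficient is male minus female while t.test() computes female minus male, and var.equal = TRUE is needed because t.test() defaults to the Welch test, which does not assume equal variances.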
summary(lm(bill_length_mm ~ sex, data = penguins))
##
## Call:
## lm(formula = bill_length_mm ~ sex, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2548 -4.7548 0.8452 4.3030 15.9030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.0970 0.4003 105.152 < 2e-16 ***
## sexmale 3.7578 0.5636 6.667 1.09e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.143 on 331 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.1184, Adjusted R-squared: 0.1157
## F-statistic: 44.45 on 1 and 331 DF, p-value: 1.094e-10
t.test(bill_length_mm ~ sex, var.equal = TRUE, data = penguins)
##
## Two Sample t-test
##
## data: bill_length_mm by sex
## t = -6.667, df = 331, p-value = 1.094e-10
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
## -4.866557 -2.649027
## sample estimates:
## mean in group female mean in group male
## 42.09697 45.85476
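The same kind of check works for the t-test. This is a minimal sketch with illustrative object names (fit, tt, t_lm, t_tt). It compares the sexmale coefficient from lm() with the pooled-variance t-test, and also shows that with a single two-level predictor the squared t-value equals the model's F-statistic.

```r
library(palmerpenguins)

fit <- lm(bill_length_mm ~ sex, data = penguins)
tt  <- t.test(bill_length_mm ~ sex, var.equal = TRUE, data = penguins)

# t-value for the sexmale coefficient and for the two-sample test.
t_lm <- coef(summary(fit))["sexmale", "t value"]
t_tt <- unname(tt$statistic)

# Same magnitude, opposite sign (male - female vs. female - male).
all.equal(t_lm, -t_tt)

# Same degrees of freedom.
all.equal(df.residual(fit), unname(tt$parameter))

# The coefficient is the difference between the two group means.
all.equal(unname(coef(fit)["sexmale"]),
          unname(tt$estimate[2] - tt$estimate[1]))

# With one two-level predictor, t squared equals the overall F-statistic.
all.equal(t_lm^2, summary(fit)$fstatistic[["value"]])
```

Dropping var.equal = TRUE switches t.test() to the Welch test, and the exact match with lm() would no longer be expected.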
I may come back later to flesh this out more…