T-tests in R
T-test for one mean
The simplest form of the t-test compares the mean of one sample against a predetermined value. For example, say we designed a study to determine if the average student at Penn State has a normal blood cholesterol value (<200 mg/dL). At the beginning of the semester we recruited 12 students who donated blood, and we measured their cholesterol levels. We can manually input the results into R as follows:
PSU_student <- c(170, 230, 175, 145, 149, 165, 164, 150, 167, 151, 178, 183)
The stripchart on the left shows the distribution of our data. We will use a t-test to assess whether the mean blood cholesterol of these 12 students at the beginning of the semester (blue dashed line) is significantly lower than the normal threshold of 200 mg/dL (red dotted line). We will use a significance threshold of α = 0.05.
To do so, we will wrap our sample, PSU_student, in the t.test() function. Because we are interested in testing if the blood cholesterol levels from these students are significantly below 200, we set the mu option to 200 (the default is 0). Additionally, since we are specifically interested in testing whether the population mean student blood cholesterol level (μ), which we estimate with the sample mean x̄, is below that threshold, we set the alternative option to "less". This option sets our null (H0) and alternative (HA) hypotheses to:
H0: μ ≥ 200 mg/dL
HA: μ < 200 mg/dL
If we instead hypothesized that the mean blood cholesterol would be either more or less than 200, we would set alternative to "two.sided" (the default value). We can then run our code as follows:
t.test(PSU_student, mu = 200, alternative = "less")
##
## One Sample t-test
##
## data: PSU_student
## t = -4.7133, df = 11, p-value = 0.0003182
## alternative hypothesis: true mean is less than 200
## 95 percent confidence interval:
## -Inf 180.7602
## sample estimates:
## mean of x
## 168.9167
From the printout of results we see that the p-value for our t-test (p = 3.18 × 10⁻⁴) is well below our threshold of α = 0.05. Therefore, we can reject the null hypothesis in favor of the alternative hypothesis and conclude that the mean blood cholesterol level for PSU students at the beginning of the semester is less than 200 mg/dL. The t.test() function also returns the mean of our sample, 168.9 mg/dL.
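As a cross-check, the statistic that t.test() prints can be reproduced by hand from the one-sample formula t = (x̄ - μ0) / (s / √n). The following is a minimal sketch using only base R; the variable names (mu0, x_bar, se, t_stat, p_less, res) are our own choices for illustration and are not part of the original analysis.

```r
# Student cholesterol values (mg/dL) from the example above
PSU_student <- c(170, 230, 175, 145, 149, 165, 164, 150, 167, 151, 178, 183)

mu0   <- 200                         # hypothesized mean under the null
n     <- length(PSU_student)         # sample size
x_bar <- mean(PSU_student)           # sample mean
se    <- sd(PSU_student) / sqrt(n)   # standard error of the mean

# t statistic and lower-tail p-value with n - 1 degrees of freedom
t_stat <- (x_bar - mu0) / se
p_less <- pt(t_stat, df = n - 1)

# The same quantities are stored in the object returned by t.test()
res <- t.test(PSU_student, mu = mu0, alternative = "less")
c(manual = t_stat, t.test = unname(res$statistic))
c(manual = p_less, t.test = res$p.value)
```

Storing the result in an object, as with res here, also makes individual components such as res$estimate and res$conf.int available for later use.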
Two-sample t-test
Now let us say that we want to compare the average PSU student's blood cholesterol with that of the faculty at Penn State. We similarly input the faculty data into R as follows:
PSU_faculty <- c(189, 165, 191, 193, 177, 200, 179, 189, 202, 235, 178, 226)
In the above stripchart it appears that the distribution and mean of blood cholesterol levels for students (blue) are mostly below those of faculty (red), although there is some overlap. To determine statistically whether these two groups differ, we can employ the two-sample t-test, which can be performed in R using the same function as before, but this time we pass both data sets instead of defining mu. Additionally, because we are hypothesizing that the two groups differ, but not in which direction, we leave the alternative option at its default, "two.sided", to test the null and alternative hypotheses:
H0: μ_student = μ_faculty
HA: μ_student ≠ μ_faculty
Another thing we may want to consider before performing a t-test is whether the samples have equal variance. We can choose whether R runs an equal-variance t-test or an unequal-variance t-test (known as Welch's t-test) using the var.equal option (FALSE by default, which gives Welch's t-test). If the variances are in fact equal, we will receive very similar results regardless of which option we set, and if they are not equal, setting var.equal to TRUE would violate the assumption of the equal-variance t-test. Therefore, we can choose to leave the option alone and run our code as follows:
t.test(PSU_student, PSU_faculty)
##
## Welch Two Sample t-test
##
## data: PSU_student and PSU_faculty
## t = -2.8134, df = 21.667, p-value = 0.01022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -43.010549 -6.489451
## sample estimates:
## mean of x mean of y
## 168.9167 193.6667
From the output we can see that there is a statistically significant difference in the blood cholesterol levels between Penn State students and faculty (p = 0.01022). Additionally, the output provides a 95% confidence interval for the difference in means (students minus faculty), which runs from about -43.0 to -6.5 mg/dL. Because this interval excludes 0, it is consistent with the significant test result. Higher (or lower) confidence levels can be set through the conf.level option.
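To make the var.equal and conf.level options concrete, the sketch below runs the Welch and equal-variance versions side by side and then requests a 99% interval. The object names (welch, pooled, ci99) are ours, chosen only for this illustration. Because both groups here have 12 observations, the two versions produce the same t statistic and differ only in their degrees of freedom, and therefore slightly in the p-value and interval.

```r
# Data from the examples above (mg/dL)
PSU_student <- c(170, 230, 175, 145, 149, 165, 164, 150, 167, 151, 178, 183)
PSU_faculty <- c(189, 165, 191, 193, 177, 200, 179, 189, 202, 235, 178, 226)

# Welch's t-test (the default) and the equal-variance (pooled) t-test
welch  <- t.test(PSU_student, PSU_faculty)
pooled <- t.test(PSU_student, PSU_faculty, var.equal = TRUE)

# Compare t statistic, degrees of freedom, and p-value for the two versions
rbind(
  welch  = c(t = unname(welch$statistic),  df = unname(welch$parameter),  p = welch$p.value),
  pooled = c(t = unname(pooled$statistic), df = unname(pooled$parameter), p = pooled$p.value)
)

# Request a 99% confidence interval instead of the default 95%
ci99 <- t.test(PSU_student, PSU_faculty, conf.level = 0.99)$conf.int
ci99
```

With unequal group sizes the two t statistics would generally differ as well, which is one reason leaving var.equal at its default is a reasonable habit.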
Paired t-test
Now let us say that we collected samples from the same group of students during finals week to see if their blood cholesterol levels might have changed due to the stress of the semester. We can compare the means from the beginning of the semester and those at finals week, but now we will be performing a paired t-test because our samples are coming from the same individuals. Note that it is important that we put the data in the same order!
PSU_student2 <- c(179, 224, 175, 140, 148, 181, 179, 142, 180, 169, 170, 189)
In the stripchart above there is overlap between our samples from the beginning of the semester (blue) and finals week (brown), but the mean appears to have increased slightly, so perhaps the stress of the semester has raised the students' blood cholesterol overall.
We can again use the same t.test() function, setting the paired option to TRUE. Again, we are only hypothesizing that the blood cholesterol levels are different but not in any specific direction, so we keep the alternative option at its default, "two.sided". We then perform a paired t-test as follows:
t.test(PSU_student, PSU_student2, paired = TRUE)
##
## Paired t-test
##
## data: PSU_student and PSU_student2
## t = -1.4269, df = 11, p-value = 0.1814
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -10.381676 2.215009
## sample estimates:
## mean difference
## -4.083333
The test results suggest that there is not a statistically significant difference in student blood cholesterol levels from the beginning of the semester to finals week. The p-value is above our threshold of α = 0.05, and the 95% confidence interval around the mean of the differences (-4.08) includes 0. Therefore, we fail to reject the null hypothesis that there is no difference in student blood cholesterol between the beginning of the semester and finals week.
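A paired t-test is equivalent to a one-sample t-test on the within-person differences, which is also why the order of the two vectors matters. The following sketch illustrates both points with base R; the names diffs, paired_res, and diff_res are our own and are used only for this illustration.

```r
# Cholesterol (mg/dL) for the same 12 students, in the same order
PSU_student  <- c(170, 230, 175, 145, 149, 165, 164, 150, 167, 151, 178, 183)
PSU_student2 <- c(179, 224, 175, 140, 148, 181, 179, 142, 180, 169, 170, 189)

# Within-student differences (beginning of semester minus finals week)
diffs <- PSU_student - PSU_student2

# Paired t-test and the equivalent one-sample t-test on the differences
paired_res <- t.test(PSU_student, PSU_student2, paired = TRUE)
diff_res   <- t.test(diffs, mu = 0)

c(paired = unname(paired_res$statistic), one_sample = unname(diff_res$statistic))
c(paired = paired_res$p.value, one_sample = diff_res$p.value)

# Reversing one vector breaks the pairing: the mean difference is unchanged,
# but the spread of the differences (and so the test result) is not
t.test(PSU_student, rev(PSU_student2), paired = TRUE)$p.value
```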
Full code block
# Input student blood cholesterol data from the beginning of the semester
PSU_student <- c(170, 230, 175, 145, 149, 165, 164, 150, 167, 151, 178, 183)
# Perform a one-mean t-test to determine if the average blood cholesterol is less than 200
t.test(PSU_student, mu = 200, alternative = "less")
# Add the blood cholesterol data from faculty
PSU_faculty <- c(189, 165, 191, 193, 177, 200, 179, 189, 202, 235, 178, 226)
# Compare student and faculty cholesterol levels with a two-sample t-test
t.test(PSU_student, PSU_faculty)
# Add cholesterol levels from the same students during finals week
PSU_student2 <- c(179, 224, 175, 140, 148, 181, 179, 142, 180, 169, 170, 189)
# Use a paired t-test to compare student cholesterol levels at the beginning and end of
# the semester
t.test(PSU_student, PSU_student2, paired = TRUE)
T-tests using tidyverse and rstatix
library(rstatix)
library(tidyverse)
T-test for one mean
bps <- tibble(stu_bp = c(170, 230, 175, 145, 149, 165, 164, 150, 167, 151, 178, 183))
bps %>% t_test(stu_bp ~ 1, mu = 200, alternative = "less")
## # A tibble: 1 × 7
## .y. group1 group2 n statistic df p
## * <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 stu_bp 1 null model 12 -4.71 11 0.000318
Two-sample t-test
bps <- bps %>%
  mutate(fac_bp = c(189, 165, 191, 193, 177, 200, 179, 189, 202, 235, 178, 226))
bps %>% pivot_longer(cols = everything()) %>%
  t_test(value ~ name, detailed = TRUE)
## # A tibble: 1 × 15
## estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic p df conf.low
## * <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 24.8 194. 169. value fac_bp stu_bp 12 12 2.81 0.0102 21.7 6.49
## # … with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
Paired t-test
bps <- bps %>%
  mutate(stu_bp2 = c(179, 224, 175, 140, 148, 181, 179, 142, 180, 169, 170, 189))
bps %>% pivot_longer(cols = c(stu_bp, stu_bp2)) %>%
  t_test(value ~ name, paired = TRUE)
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 value stu_bp stu_bp2 12 12 -1.43 11 0.181
Full code block
# T-test for one mean
## Input student blood cholesterol data from the beginning of the semester as a tibble
bps <- tibble(stu_bp = c(170, 230, 175, 145, 149, 165, 164, 150, 167, 151, 178, 183))
## Perform a one-mean t-test to determine if the average blood cholesterol is less than 200
bps %>% t_test(stu_bp ~ 1, mu = 200, alternative = "less")
## # A tibble: 1 × 7
## .y. group1 group2 n statistic df p
## * <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 stu_bp 1 null model 12 -4.71 11 0.000318
# Two-sample t-test
## Add the blood cholesterol data from faculty
bps <- bps %>%
  mutate(fac_bp = c(189, 165, 191, 193, 177, 200, 179, 189, 202, 235, 178, 226))
## Compare student and faculty cholesterol levels with a two-sample t-test
bps %>% pivot_longer(cols = everything()) %>%
  t_test(value ~ name, detailed = TRUE)
## # A tibble: 1 × 15
## estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic p df conf.low
## * <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 24.8 194. 169. value fac_bp stu_bp 12 12 2.81 0.0102 21.7 6.49
## # … with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
# Paired t-test
## Add cholesterol levels from the same students during finals week
bps <- bps %>%
  mutate(stu_bp2 = c(179, 224, 175, 140, 148, 181, 179, 142, 180, 169, 170, 189))
## Use a paired t-test to compare student cholesterol levels at the beginning and end of
## the semester
bps %>% pivot_longer(cols = c(stu_bp, stu_bp2)) %>%
  t_test(value ~ name, paired = TRUE)
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 value stu_bp stu_bp2 12 12 -1.43 11 0.181