One-way Analysis of Variance (ANOVA1)
For this tutorial we will be using a sample dataset from the PSU STAT 500 Applied Statistics course. The dataset contains measurements of tar content as measured by Lab Precise from three cigarette brands. A barplot presenting the means and standard deviations of the tar content from each brand is shown below.
From briefly observing the barplot, there does appear to be a difference in tar content among the three cigarette brands. However, we should use a statistical test, in this case a one-way ANOVA, to quantitatively determine if there are differences in tar content among the brands based on the following statistical hypotheses and assumptions. We will use a threshold of 0.05 (sometimes shown as α = 0.05) to determine statistical significance.
Since each of the tar measurements comes from a different brand of
cigarette, we could safely assume that there is no dependence between
the three brands and accept the assumption of independent observations.
We can check the assumption of normally distributed observations after
we fit the model, but for the assumption of equal variances we can apply
the Bartlett Test of Homogeneity of Variances using
bartlett.test()
. Before we can do so however, we will need
to read our data into R.
Entering the data in R
Typically we would load our data into R from a file, however for this example we will manually input the data into a data frame with the following code.
<- data.frame(Tar_mg = c(10.21, 10.25, 10.24, 9.80, 9.77, 9.73,
lab_precise 11.32, 11.20, 11.40, 10.50, 10.68, 10.90,
11.60, 11.90, 11.80, 12.30, 12.20, 12.20),
Brand = factor(c(rep("Brand A", 6),
rep("Brand B", 6),
rep("Brand C", 6))))
Using the data.frame()
function we created a data frame
with two columns, one labeled Tar_mg with the tar content from
each sample and the second titled Brand that indicates which
cigarette brand each measurement came from. We will also assign the
Brand column as a factor using factor()
which is
not always necessary but a good habit. (Note that the data frame is
currently in long format which is preferred in many R functions compared
to the wide format as shown in the PSU STAT 500
example.)
Now that our data has been read into R we can use the
bartlett.test()
function to determine if the tar content
among the three cigarette brands have equal variances.
bartlett.test(Tar_mg ~ Brand, data = lab_precise)
##
## Bartlett test of homogeneity of variances
##
## data: Tar_mg by Brand
## Bartlett's K-squared = 0.6594, df = 2, p-value = 0.7191
The test returned a p-value of 0.7191 that is well above our threshold of 0.05, therefore we can go ahead and assume that the variances of the tar content are equal among the three brands and continue with our analysis.
Fitting an one-way ANOVA
Now that we have our data, we can fit an ANOVA model with
Tar_mg as the response to the factor Brand using the
aov()
function as follows.
<- aov(Tar_mg ~ Brand, data = lab_precise) aov.fit
Before we begin interpreting the results of the model we should first
check to see how well the model fits. A simple qualitative way to do so
is to check diagnostic plots of
the residuals. Using the plot()
function we can observe
four of these types of diagnostic plots. We can observe them separately
or put them together in one plot using
par(mfrow = c(2,2))
.
par(mfrow = c(2,2))
plot(aov.fit)
The four plots provided are the Residuals vs Fitted, Normal Q-Q, Scale-Location, and Leverage plots. While each plot gives us important information on how our data fit in the ANOVA model, the important one to assess our assumption of normality is the Normal Q-Q plot. Ideally, the points will sit along the dashed line. However, in our plot there does seem to be a number of points that may be deviating away from that line. This could be a cause for concern if we wanted to be certain that this assumption holds.
To quantitatively confirm if our data meets the assumption of
normality we could employ a statistical test, such as the Shapiro-Wilk
Normality Test using the shapiro.test()
function. However,
we have some leeway in this assumption and for now can assume that the
data was sampled from an approximately normal distribution and fits the
ANOVA model well to move on to interpreting the results of the
model.
Interpreting the ANOVA results
To summarize the results of the ANOVA model we can use the
summary()
, or alternatively the anova()
,
function.
summary(aov.fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Brand 2 12.000 6.000 65.46 3.89e-08 ***
## Residuals 15 1.375 0.092
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the summary of the ANOVA model we see that our factor,
Brand, has a p-value well below our threshold of 0.05. This
tells us that the mean tar content for at least one of the brands of
cigarettes is different from the others. To determine which of these
means are significantly different we follow up the ANOVA with a post-hoc
test. In this case we will use a Tukey’s honestly significant difference
test (Tukey HSD) with the TukeyHSD()
function.
TukeyHSD(aov.fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Tar_mg ~ Brand, data = lab_precise)
##
## $Brand
## diff lwr upr p adj
## Brand B-Brand A 1 0.5459914 1.454009 0.0001132
## Brand C-Brand A 2 1.5459914 2.454009 0.0000000
## Brand C-Brand B 1 0.5459914 1.454009 0.0001132
The TukeyHSD()
command provides the difference between
each comparison of means (diff), lower and upper confidence
intervals (lwr and upr), and the adjusted p-value of
those comparisons (p adj). From the adjusted p-values, we can
see that for each group-wise comparison the p-values are below 0.05,
meaning that each brand of cigarettes is significantly different from
one another. Therefore, we can conclude that there are differences in
the tar content among these three brands of cigarettes, namely with
Brand A having the lowest, Brand C having the highest,
and Brand B being between the two other brands in milligrams of
tar.
Full code block
# Enter the data as a dataframe
<- data.frame(Tar_mg = c(10.21, 10.25, 10.24, 9.80, 9.77, 9.73,
lab_precise 11.32, 11.20, 11.40, 10.50, 10.68, 10.90,
11.60, 11.90, 11.80, 12.30, 12.20, 12.20),
Brand = factor(c(rep("Brand A", 6),
rep("Brand B", 6),
rep("Brand C", 6))))
# Check assumption of equal variances among the factors
bartlett.test(Tar_mg ~ Brand, data = lab_precise)
# Fit a one-way ANOVA model
<- aov(Tar_mg ~ Brand, data = lab_precise)
aov.fit
# Generate diagnostic plots to assess the fit of the ANOVA model
par(mfrow = c(2,2))
plot(aov.fit)
# Summarize the results of the ANOVA model
summary(aov.fit)
# Perform and summarize a Tukey HSD post-hoc test
TukeyHSD(aov.fit)