Multiple Linear Regression (MLR)
In this example we will be using this data set provided by the Example 5-1 of the STAT 501 Regression Methods course course that includes data on participant’s performance IQ score (PIQ), brain size (Brain) measured from MRI scans and given as counts/10,000, and two variables on body size, Height in inches and Weight in pounds. Plotting each of the variables against one another we can see that Height, Weight, and Brain each appear to have some positive relationship with one another, but what about with PIQ?
Using MLR we can determine what relationships, if any, these three
physical features have with the PIQ scores. Our data set is in
a tabular .txt format, so we will first load the data into R using the
read.table()
function. We can view summary statistics for
each of the variables and the structure of the dataset with the
summary()
and str()
functions.
<- read.table("../../dat/iqsize.txt", header = TRUE)
iqsize
summary(iqsize)
## PIQ Brain Height Weight
## Min. : 72.00 Min. : 79.06 Min. :62.00 Min. :106.0
## 1st Qu.: 89.25 1st Qu.: 85.48 1st Qu.:66.00 1st Qu.:135.2
## Median :115.00 Median : 90.54 Median :68.00 Median :146.5
## Mean :111.34 Mean : 90.68 Mean :68.42 Mean :151.1
## 3rd Qu.:128.00 3rd Qu.: 94.95 3rd Qu.:70.38 3rd Qu.:172.0
## Max. :150.00 Max. :107.95 Max. :77.00 Max. :192.0
str(iqsize)
## 'data.frame': 38 obs. of 4 variables:
## $ PIQ : int 124 150 128 134 110 131 98 84 147 124 ...
## $ Brain : num 81.7 103.8 96.5 95.2 92.9 ...
## $ Height: num 64.5 73.3 68.8 65 69 64.5 66 66.3 68.8 64.5 ...
## $ Weight: int 118 143 172 147 146 138 175 134 172 118 ...
Each of the variables appear to have an appropriate spread of values with no obvious outliers or issues such as missing data or incorrectly coded values. Although PIQ and Weight are of the “integer” type we can treat them as continuous and do not need to make any changes, so we can continue with fitting our model.
To fit a MLR model we can use the lm()
function where
the response variable, PIQ, is to the left of of a
~
and the predictor variables, Brain,
Height, and Weight, are to the right with each
separated by a +
.
<- lm(PIQ ~ Brain + Height + Weight, data = iqsize) iqsize.fit
Alternatively, instead of listing each predictor variable we could
use .
on the right of the ~
to tell R to
include all other variables in the data frame as a predictor variable.
This becomes very helpful when we have a lot of predictors.
<- lm(PIQ ~ ., data = iqsize) iqsize.fit
Before jumping in and interpreting the results of the model we should assess how well the MLR fit our data and if any of the assumptions have been violated. One of these assumptions, little to no multicollinearity, can be assessed by calculating variance inflation factors (VIFs). VIFs measure how much the variance of the coefficient for each variable is increased due to collinearity with other variables and range from a minimum of 1 to infinity. A VIF greater than 10 (or some suggest even 5) is indicative of a high degree of multicollinearity and a violation of our assumption.
Unlike some other statistical software base R does not automatically
calculate VIFs, so we can turn to the vif()
function from
the car
package to do it for us.
library(car)
vif(iqsize.fit)
## Brain Height Weight
## 1.578524 2.276641 2.021541
&emps;Each of the VIFs are quite low, therefore was can accept that we have little to no multicollinearity in our model.
Next, we can use diagnostic plots of the residuals to assess many of
the other assumptions. Using the plot()
command we can
graph four different diagnostic plots for our model. Using
par(mfrow=c(2,2))
we can set the graphical parameters to
have 2 rows and 2 columns so that we can see all four diagnostic plots
in one figure.
par(mfrow=c(2,2))
plot(iqsize.fit)
There are many assessments that can be made from these plots, but the important features for this example are:
- The points in the Residuals vs Fitted and the Scale-Location plots do not appear to have any patterns and randomly bounce along the horizonal 0-line
- In the Normal Q-Q plot the points line mostly along the diagnoal line
- None of the points in the Residuals vs Leverage plot are extreme (outside the dashed bands)
Together, we can be confident that our assumptions of no outliers or
high leverage points, normally distributed errors, and equal variances
are valid. We can now use the summary()
function to print a
summary of the results from our MLR model.
summary(iqsize.fit)
##
## Call:
## lm(formula = PIQ ~ ., data = iqsize)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.74 -12.09 -3.84 14.17 51.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.114e+02 6.297e+01 1.768 0.085979 .
## Brain 2.060e+00 5.634e-01 3.657 0.000856 ***
## Height -2.732e+00 1.229e+00 -2.222 0.033034 *
## Weight 5.599e-04 1.971e-01 0.003 0.997750
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared: 0.2949, Adjusted R-squared: 0.2327
## F-statistic: 4.741 on 3 and 34 DF, p-value: 0.007215
From the summary of the results we can first note that we have a
statistical significant MLR fit as the p-value for the MLR model is
below 0.05 (p-value: 0.007215
). Further, the Brain
and Height variables are statistically significant predictors
of PIQ given by the p-values for their coefficients
(Pr(>|t|)
column) being below 0.05. Conversely, the
Weight variable is not statistically significant and even close
to 1. We can conclude that the weight of the participant has no
relationship, their height has a negative relationship, and brain size
has a positive relationship with their performance IQ score.
The summary()
function also gives us the multiple
R-squared and adjusted R-squared. Note that the multiple R-squared can
be misleading in MLR models as it will always increase (or stay
relatively the same) as we add independent variables, even if they are
not significant predictors of our response varaible. The adjusted
R-squared corrects for this by penalizing additional variables without
relationships to the response, but its interpretation is more useful
when we are comparing models to determine which has the better
predictive fit.
Full code block
# Load the data set and view its structure and summary statistics of the variables
<- read.table("../../dat/iqsize.txt", header = TRUE)
iqsize
str(iqsize)
summary(iqsize)
# Fit a linear model with PIQ as the response and Brain, Height, and Weight as the predictor variables
<- lm(PIQ ~ ., data = iqsize)
iqsize.fit
# Load the car package to calculate the VIFs to assess multicollinearity
library(car)
vif(iqsize.fit)
# View diagnostic plots of the residuals from the fit MLR model
par(mfrow=c(2,2))
plot(iqsize.fit)
# Print a summary of the results for the MLR model
summary(fit.lm)