multiple-linear-regression.knit

Multiple Linear Regression (MLR)

In this example we will be using this data set provided by the Example 5-1 of the STAT 501 Regression Methods course course that includes data on participant’s performance IQ score (PIQ), brain size (Brain) measured from MRI scans and given as counts/10,000, and two variables on body size, Height in inches and Weight in pounds. Plotting each of the variables against one another we can see that Height, Weight, and Brain each appear to have some positive relationship with one another, but what about with PIQ?

Using MLR we can determine what relationships, if any, these three physical features have with the PIQ scores. Our data set is in a tabular .txt format, so we will first load the data into R using the read.table() function. We can view summary statistics for each of the variables and the structure of the dataset with the summary() and str() functions.

iqsize <- read.table("../../dat/iqsize.txt", header = TRUE)

summary(iqsize)

##       PIQ             Brain            Height          Weight     
##  Min.   : 72.00   Min.   : 79.06   Min.   :62.00   Min.   :106.0  
##  1st Qu.: 89.25   1st Qu.: 85.48   1st Qu.:66.00   1st Qu.:135.2  
##  Median :115.00   Median : 90.54   Median :68.00   Median :146.5  
##  Mean   :111.34   Mean   : 90.68   Mean   :68.42   Mean   :151.1  
##  3rd Qu.:128.00   3rd Qu.: 94.95   3rd Qu.:70.38   3rd Qu.:172.0  
##  Max.   :150.00   Max.   :107.95   Max.   :77.00   Max.   :192.0

str(iqsize)

## 'data.frame':    38 obs. of  4 variables:
##  $ PIQ   : int  124 150 128 134 110 131 98 84 147 124 ...
##  $ Brain : num  81.7 103.8 96.5 95.2 92.9 ...
##  $ Height: num  64.5 73.3 68.8 65 69 64.5 66 66.3 68.8 64.5 ...
##  $ Weight: int  118 143 172 147 146 138 175 134 172 118 ...

Each of the variables appear to have an appropriate spread of values with no obvious outliers or issues such as missing data or incorrectly coded values. Although PIQ and Weight are of the “integer” type we can treat them as continuous and do not need to make any changes, so we can continue with fitting our model.

To fit a MLR model we can use the lm() function where the response variable, PIQ, is to the left of of a ~ and the predictor variables, Brain, Height, and Weight, are to the right with each separated by a +.

iqsize.fit <- lm(PIQ ~ Brain + Height + Weight, data = iqsize)

Alternatively, instead of listing each predictor variable we could use . on the right of the ~ to tell R to include all other variables in the data frame as a predictor variable. This becomes very helpful when we have a lot of predictors.

iqsize.fit <- lm(PIQ ~ ., data = iqsize)

Before jumping in and interpreting the results of the model we should assess how well the MLR fit our data and if any of the assumptions have been violated. One of these assumptions, little to no multicollinearity, can be assessed by calculating variance inflation factors (VIFs). VIFs measure how much the variance of the coefficient for each variable is increased due to collinearity with other variables and range from a minimum of 1 to infinity. A VIF greater than 10 (or some suggest even 5) is indicative of a high degree of multicollinearity and a violation of our assumption.

Unlike some other statistical software base R does not automatically calculate VIFs, so we can turn to the vif() function from the car package to do it for us.

library(car)

vif(iqsize.fit)

##    Brain   Height   Weight 
## 1.578524 2.276641 2.021541

&emps;Each of the VIFs are quite low, therefore was can accept that we have little to no multicollinearity in our model.

Next, we can use diagnostic plots of the residuals to assess many of the other assumptions. Using the plot() command we can graph four different diagnostic plots for our model. Using par(mfrow=c(2,2)) we can set the graphical parameters to have 2 rows and 2 columns so that we can see all four diagnostic plots in one figure.

par(mfrow=c(2,2))
plot(iqsize.fit)

There are many assessments that can be made from these plots, but the important features for this example are:

The points in the Residuals vs Fitted and the Scale-Location plots do not appear to have any patterns and randomly bounce along the horizonal 0-line
In the Normal Q-Q plot the points line mostly along the diagnoal line
None of the points in the Residuals vs Leverage plot are extreme (outside the dashed bands)

Together, we can be confident that our assumptions of no outliers or high leverage points, normally distributed errors, and equal variances are valid. We can now use the summary() function to print a summary of the results from our MLR model.

summary(iqsize.fit)

## 
## Call:
## lm(formula = PIQ ~ ., data = iqsize)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.74 -12.09  -3.84  14.17  51.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.114e+02  6.297e+01   1.768 0.085979 .  
## Brain        2.060e+00  5.634e-01   3.657 0.000856 ***
## Height      -2.732e+00  1.229e+00  -2.222 0.033034 *  
## Weight       5.599e-04  1.971e-01   0.003 0.997750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared:  0.2949, Adjusted R-squared:  0.2327 
## F-statistic: 4.741 on 3 and 34 DF,  p-value: 0.007215

From the summary of the results we can first note that we have a statistical significant MLR fit as the p-value for the MLR model is below 0.05 (p-value: 0.007215). Further, the Brain and Height variables are statistically significant predictors of PIQ given by the p-values for their coefficients (Pr(>|t|) column) being below 0.05. Conversely, the Weight variable is not statistically significant and even close to 1. We can conclude that the weight of the participant has no relationship, their height has a negative relationship, and brain size has a positive relationship with their performance IQ score.

The summary() function also gives us the multiple R-squared and adjusted R-squared. Note that the multiple R-squared can be misleading in MLR models as it will always increase (or stay relatively the same) as we add independent variables, even if they are not significant predictors of our response varaible. The adjusted R-squared corrects for this by penalizing additional variables without relationships to the response, but its interpretation is more useful when we are comparing models to determine which has the better predictive fit.

Full code block

# Load the data set and view its structure and summary statistics of the variables
iqsize <- read.table("../../dat/iqsize.txt", header = TRUE)

str(iqsize)
summary(iqsize)

# Fit a linear model with PIQ as the response and Brain, Height, and Weight as the predictor variables
iqsize.fit <- lm(PIQ ~ ., data = iqsize)

# Load the car package to calculate the VIFs to assess multicollinearity
library(car)

vif(iqsize.fit)

# View diagnostic plots of the residuals from the fit MLR model
par(mfrow=c(2,2))
plot(iqsize.fit)

# Print a summary of the results for the MLR model
summary(fit.lm)