Multinomial Logistic Regression in R
For this guide we will assess what factors influence an employee’s choice among three health insurance plans offered by a company. The data set, which is a reduced version of the data set provided in the Handbook of Regression Modeling in People Analytics, has the following 4 variables:
- plan - the health insurance plan chosen (plan A, B, or C).
- age - the person’s age when they chose their healthcare plan.
- gender - the gender of the person (Male or Female).
- absent - the number of absentee days from work the year prior to choosing their health insurance plan.
There are 50 observations evenly split between Male and Female employees.
To begin, we will load in the data using read.csv(), then take a quick look at the first few rows with head() and get a summary of the data set with summary().
health <- read.csv("../../dat/health_insurance.csv")
head(health)
## age gender absent plan
## 1 58 Female 14 B
## 2 26 Female 19 A
## 3 41 Female 1 B
## 4 31 Female 20 A
## 5 43 Female 9 B
## 6 34 Female 23 A
summary(health)
## age gender absent plan
## Min. :21.00 Length:50 Min. : 0.00 Length:50
## 1st Qu.:30.00 Class :character 1st Qu.: 9.00 Class :character
## Median :35.00 Mode :character Median :14.50 Mode :character
## Mean :38.42 Mean :14.68
## 3rd Qu.:43.75 3rd Qu.:21.00
## Max. :66.00 Max. :31.00
In the output from summary() we can see that the gender and plan variables were read as character variables; however, we want them to be factors, where gender has 2 levels (Male and Female) and plan has 3 levels (A, B, and C). We can easily change these variables to factors using factor() to reassign each variable. We can then use str() to see the structure of the data, which shows the variable types and some other basic information on those variables.
health$gender <- factor(health$gender)
health$plan <- factor(health$plan)
str(health)
## 'data.frame': 50 obs. of 4 variables:
## $ age : int 58 26 41 31 43 34 35 24 50 30 ...
## $ gender: Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
## $ absent: int 14 19 1 20 9 23 28 13 9 9 ...
## $ plan : Factor w/ 3 levels "A","B","C": 2 1 2 1 2 1 3 1 3 1 ...
We see that gender and plan are now factors with 2 and 3 levels, respectively. We can also note that the age and absent variables were set as integers rather than numeric, which is fine for our purposes here.
Now we will fit our multinomial model with the health insurance plan, plan, as our dependent variable and age, gender, and absent as predictors. R does not have a base multinomial regression function, so we will need to load a package to fit our model. Two popular packages for multinomial logistic regression are nnet and VGAM. In this example we will use VGAM, which works well with some of the functions we used to assess model results in our other guides.
After installing and loading the VGAM package, we will use vglm(), a function that fits many types of generalized linear models. In vglm() we write the model formula in the same format as other models: the dependent variable on the left and the predictor variables on the right of a ~, with . as a shortcut to include all variables not on the left side of the formula. We then assign our data set to data and set family = multinomial to specify that we want to fit a multinomial regression model.
In addition, we will want to fit a null model (one without our predictor variables and only an intercept) that we can use to test the fit of our model. We could either write a null model in vglm() or use update(), where we include the object holding our original model followed by the desired model formula; for a null model we put a 1 on the right side of the ~ to indicate we only want an intercept. After both models are fit, we will use lrtest(), a function provided with VGAM, to perform a likelihood ratio test comparing the two models.
library(VGAM)

health_mlr <- vglm(plan ~ .,
                   data = health,
                   family = multinomial)

health_mlr0 <- update(health_mlr, . ~ 1)

lrtest(health_mlr0, health_mlr)
## Likelihood ratio test
##
## Model 1: plan ~ 1
## Model 2: plan ~ .
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 98 -53.418
## 2 92 -29.460 -6 47.916 1.228e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
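As a quick sanity check, the Chisq statistic reported by lrtest() is simply twice the difference in log-likelihoods between the two models. A minimal sketch, assuming the two fitted models from above are still in the workspace:

```r
# Likelihood-ratio statistic computed by hand;
# this should match the Chisq value reported by lrtest()
as.numeric(2 * (logLik(health_mlr) - logLik(health_mlr0)))
```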
From the results of the likelihood ratio test we can see that the log-likelihood (LogLik) is higher in the model that includes our variables (the second model in the output) than in the null model. Importantly, this difference is significant, with a very low p-value (Pr(>Chisq) = 1.228e-08). We therefore have evidence that our model explains some of the variance relative to the null model, and we follow up with the summary() function to assess the model’s results.
summary(health_mlr)
##
## Call:
## vglm(formula = plan ~ ., family = multinomial, data = health)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 14.14238 3.91706 3.610 0.000306 ***
## (Intercept):2 4.15785 2.38323 1.745 0.081050 .
## age:1 -0.35782 0.10373 -3.449 0.000562 ***
## age:2 -0.05263 0.04522 -1.164 0.244489
## genderMale:1 -0.11025 1.08986 -0.101 0.919426
## genderMale:2 -2.36652 0.99481 -2.379 0.017366 *
## absent:1 -0.06022 0.06557 -0.918 0.358409
## absent:2 -0.07664 0.05873 -1.305 0.191932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
##
## Residual deviance: 58.9196 on 92 degrees of freedom
##
## Log-likelihood: -29.4598 on 92 degrees of freedom
##
## Number of Fisher scoring iterations: 7
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1', 'age:1'
##
##
## Reference group is level 3 of the response
In the output we should first note the bottom line, Reference group is level 3 of the response, which corresponds to plan C. This line tells us that the other levels, plan A (level 1) and plan B (level 2), are being compared to plan C in the Coefficients table. In that table we see that two coefficients are statistically significant, age:1 and genderMale:2. As both coefficients are negative, we can interpret these results as:
- As age increases, the log-odds of an employee choosing plan A over plan C decrease.
- Male employees are less likely to choose plan B over plan C compared to Female employees.
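To make these effects easier to read, we can put the coefficients on the odds-ratio scale by exponentiating them. A quick sketch, assuming the fitted health_mlr object from above:

```r
# Exponentiated coefficients are odds ratios; values below 1 indicate
# lower odds of that plan relative to the reference group (plan C)
exp(coef(health_mlr))
```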
We can look at plots of the data to see these differences more clearly. In the plot on the left we see how employees chose each health insurance plan by age, where the black points represent the means, the bars the standard deviations for each plan, and each grey point an individual employee. It is clear from the plot, and from the statistical evidence of our multinomial model, that older employees favor plan C over plan A, while plans B and C are relatively similar in their age distributions. In the bar plot on the right we can also clearly see, with statistical support, that Male employees tend to choose plan C over plan B compared to Female employees.
What about our other variable, absent, which did not have a significant coefficient? We can also graph it against plan to see if our statistical results hold up to observational scrutiny.
Again, the plot of our variables against one another agrees with our statistical results: the number of days absent does not significantly influence the choice of health insurance plan.
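The figures themselves are not reproduced here, but similar plots can be sketched with base R graphics. This is only an approximation, assuming the health data frame from above; the original figures may have been drawn differently:

```r
# Age and absentee-day distributions by plan, plus plan choice by gender
par(mfrow = c(1, 3))
boxplot(age ~ plan, data = health, xlab = "Plan", ylab = "Age")
barplot(table(health$gender, health$plan), beside = TRUE,
        legend.text = TRUE, xlab = "Plan", ylab = "Count")
boxplot(absent ~ plan, data = health, xlab = "Plan", ylab = "Days absent")
```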
We should note that so far we have only compared two of our plan levels, A and B, against C; we have not compared plans A and B directly on any of the three predictor variables. Since our levels are unordered in multinomial regression, we likely want to compare each level of our response variable against all other levels. To do so, we can change the contrasts, which are linear combinations of parameters (summing to 0) that allow different comparisons of treatments.
To change our contrasts to compare plans A and B, we will make a vector of parameters for each variable, then use the glht() function from the multcomp package to print out test results for those comparisons. First, to make the vector of parameters we wrap c(), which lists each parameter separated by commas, in matrix() with 1 to indicate we want one row. The parameters we include depend on which levels we want to compare (A vs. B), the predictor variable to compare on, and where those coefficients are ‘located.’ If we look back at the coefficients table from the summary of our multinomial regression model, we see a list of 8 coefficients (the intercepts are included). So, to compare age between plans A and B, we need to change the 3rd and 4th contrast parameters, which we will make -1 and 1, respectively, so that their sum is 0. All of the other parameters we leave at 0, so our final vector to compare plans A and B by age is c(0, 0, -1, 1, 0, 0, 0, 0).
After loading the multcomp package, we will make vectors to compare plans A and B on the gender and absent variables as well, then assign each to linfct in the glht() function, which also wraps our original model. We also wrap glht() in summary() to print out the results of each test.
library(multcomp)

con_age    <- matrix(c(0, 0, -1, 1, 0, 0, 0, 0), 1)
con_gender <- matrix(c(0, 0, 0, 0, -1, 1, 0, 0), 1)
con_absent <- matrix(c(0, 0, 0, 0, 0, 0, -1, 1), 1)
summary(glht(health_mlr, linfct = con_age))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: vglm(formula = plan ~ ., family = multinomial, data = health)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## 1 == 0 0.3052 0.1007 3.029 0.00245 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
summary(glht(health_mlr, linfct = con_gender))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: vglm(formula = plan ~ ., family = multinomial, data = health)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## 1 == 0 -2.256 1.136 -1.987 0.0469 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
summary(glht(health_mlr, linfct = con_absent))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: vglm(formula = plan ~ ., family = multinomial, data = health)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## 1 == 0 -0.01642 0.06406 -0.256 0.798
## (Adjusted p values reported -- single-step method)
From these results we can see that only the first two comparisons, for age and gender, are statistically significant, and we can therefore conclude:
- Younger employees have higher log-odds of choosing plan A over plan B.
- Male employees are more likely to choose plan A over plan B.
- The number of days absent does not affect how likely an employee is to choose between plans A and B.
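Finally, if we want the model’s fitted choice probabilities for an individual employee, we can use predict() on the vglm object with type = "response". A quick sketch, continuing with the health_mlr model from above; the new employee’s values are made up purely for illustration:

```r
# Predicted probability of choosing each plan for a hypothetical
# 45-year-old Male employee with 10 absentee days
new_emp <- data.frame(age = 45,
                      gender = factor("Male", levels = levels(health$gender)),
                      absent = 10)
predict(health_mlr, newdata = new_emp, type = "response")
```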