
Data Learning Center Statistical Guides


Multinomial Logistic Regression in R

For this guide we will assess what factors influence an employee’s choice of one of three health insurance plans offered by a company. The data set, which is a reduced version of the data set provided in the Handbook of Regression Modeling in People Analytics, contains the following 4 variables:

  • plan - the health insurance plan chosen (plan A, B, or C).
  • age - the person’s age when they chose their healthcare plan.
  • gender - the gender of the person (Male or Female).
  • absent - the number of absentee days from work the year prior to choosing their health insurance plan.

There are 50 observations evenly split between Male and Female employees.

To begin, we will load in the data using read.csv(), then take a quick look at the first few rows with head() and get a summary of the data set with summary().

health <- read.csv("../../dat/health_insurance.csv")

head(health)
##   age gender absent plan
## 1  58 Female     14    B
## 2  26 Female     19    A
## 3  41 Female      1    B
## 4  31 Female     20    A
## 5  43 Female      9    B
## 6  34 Female     23    A
summary(health)
##       age           gender              absent          plan          
##  Min.   :21.00   Length:50          Min.   : 0.00   Length:50         
##  1st Qu.:30.00   Class :character   1st Qu.: 9.00   Class :character  
##  Median :35.00   Mode  :character   Median :14.50   Mode  :character  
##  Mean   :38.42                      Mean   :14.68                     
##  3rd Qu.:43.75                      3rd Qu.:21.00                     
##  Max.   :66.00                      Max.   :31.00

In the output from summary() we can see that the gender and plan variables were read in as character variables; however, we want them to be factors, with 2 levels in gender (Male and Female) and 3 levels in plan (A, B, and C). We can easily change these variables to factors by using factor() to reassign each variable. We can then use str() to see the structure of the data, which will allow us to see the variable types and some other basic information on those variables.

health$gender <- factor(health$gender)
health$plan <- factor(health$plan)

str(health)
## 'data.frame':    50 obs. of  4 variables:
##  $ age   : int  58 26 41 31 43 34 35 24 50 30 ...
##  $ gender: Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ absent: int  14 19 1 20 9 23 28 13 9 9 ...
##  $ plan  : Factor w/ 3 levels "A","B","C": 2 1 2 1 2 1 3 1 3 1 ...

We see that gender and plan are now factors with 2 and 3 levels, respectively. We can also note that the age and absent variables were set as integers rather than numeric, which is fine for our purposes here.

Now we will fit our multinomial model with the health insurance plan, plan, as our dependent variable and age, gender, and absent as predictors. R does not have a base multinomial regression function, so we will need to load a package to fit our model. Two popular packages for multinomial logistic regression are nnet and VGAM. In this example we will use VGAM, which works more smoothly with some of the functions we used to assess model results in our other guides.

After installing and loading the VGAM package we will use vglm(), a function that fits multiple types of Generalized Linear Models. In vglm() we put the model formula in the same format as other models, where the dependent variable is on the left and predictor variables are on the right of a ~, with . as a shortcut to include all variables not on the left side of the formula. We then assign our data set to data and assign multinomial to family to specify that we aim to fit a multinomial regression model.

In addition, we will want to fit a null model (one without our predictor variables and with only an intercept) that we can use to test the fit of our model. We could either write out a null model in vglm() or use update(), where we include the object that is our original model followed by the desired model formula; for a null model we put a 1 on the right side of the ~ to indicate we only want an intercept. After both models are fit we will use lrtest(), a function provided with VGAM, to perform a likelihood ratio test comparing the two models.

library(VGAM)

health_mlr <- vglm(plan ~ .,
                   data = health,
                   family = multinomial)

health_mlr0 <- update(health_mlr, . ~ 1)

lrtest(health_mlr0, health_mlr)
## Likelihood ratio test
## 
## Model 1: plan ~ 1
## Model 2: plan ~ .
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  98 -53.418                         
## 2  92 -29.460 -6 47.916  1.228e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the results of the likelihood ratio test we can see that the log-likelihood (LogLik) is higher in the model that includes our predictors (the second model in the output) than in the null model. Importantly, this difference is significant, with a very low p-value (Pr(>Chisq) = 1.228e-08). We therefore have evidence that our model explains some of the variance compared to the null model, so we follow up with the summary() function to assess some of the model’s results.
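As a quick sanity check, the chi-square statistic and p-value above can be reproduced by hand from the two log-likelihoods and degrees of freedom that lrtest() reports (a sketch using only the values printed in the output):

```r
# Log-likelihoods reported by lrtest() for the null and full models
ll_null <- -53.418
ll_full <- -29.460

# The likelihood ratio statistic is twice the difference in log-likelihoods
chisq <- 2 * (ll_full - ll_null)  # 47.916

# Degrees of freedom is the difference in residual df (98 - 92)
df_diff <- 98 - 92

# Upper-tail p-value from the chi-square distribution
pchisq(chisq, df = df_diff, lower.tail = FALSE)  # 1.228e-08, as reported
```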

summary(health_mlr)
## 
## Call:
## vglm(formula = plan ~ ., family = multinomial, data = health)
## 
## Coefficients: 
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1 14.14238    3.91706   3.610 0.000306 ***
## (Intercept):2  4.15785    2.38323   1.745 0.081050 .  
## age:1         -0.35782    0.10373  -3.449 0.000562 ***
## age:2         -0.05263    0.04522  -1.164 0.244489    
## genderMale:1  -0.11025    1.08986  -0.101 0.919426    
## genderMale:2  -2.36652    0.99481  -2.379 0.017366 *  
## absent:1      -0.06022    0.06557  -0.918 0.358409    
## absent:2      -0.07664    0.05873  -1.305 0.191932    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 58.9196 on 92 degrees of freedom
## 
## Log-likelihood: -29.4598 on 92 degrees of freedom
## 
## Number of Fisher scoring iterations: 7 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1', 'age:1'
## 
## 
## Reference group is level  3  of the response

In the output we should first note the bottom line that says Reference group is level 3 of the response, which corresponds to plan C. This line is telling us that the other levels, plans A (level 1) and B (level 2), are being compared to plan C in the Coefficients table. In that table we see that two coefficients are statistically significant, age:1 and genderMale:2. As both have negative coefficients we can interpret these results as:

  • As age increases, the log-odds of an employee choosing plan A over plan C decrease.
  • Male employees are less likely to choose plan B over plan C compared to Female employees.
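Since these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. A quick sketch using the two significant estimates from the summary above:

```r
# age:1 -- effect of age on choosing plan A over plan C
exp(-0.35782)   # ~0.70: each additional year of age multiplies the odds
                # of choosing plan A over plan C by about 0.70

# genderMale:2 -- effect of being Male on choosing plan B over plan C
exp(-2.36652)   # ~0.09: Male employees have about 9% of the odds that
                # Female employees have of choosing plan B over plan C
```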

We can look at plots of the data to see these differences more clearly.

In the plot to the left we see how employees choose each health insurance plan by their age, where the black points represent the means and the bars the standard deviations for each plan and each grey point is an individual employee. It is pretty clear from the plot and with statistical evidence from our multinomial model that older employees favor plan C over plan A, while plans B and C are relatively similar in their age distributions. In the right bar plot, we can also clearly see, with statistical evidence, that Male employees tend to choose plan C over plan B compared to Female employees.

What about our other variable, absent, which did not have a significant coefficient? We can also graph it against plan to see if our statistical results hold up to observational scrutiny.

Again, it appears that the plot of our variables against one another agree with our statistical results, where the number of days absent does not significantly influence the choice of health insurance plan.

We should note that so far we have only compared two of our plan levels, A and B, against C; we have not compared plans A and B on any of the three predictor variables. Since our levels are unordered in multinomial regression we likely want to compare each level of our response variable against all other levels. To do so, we can change the contrasts: a contrast is a linear combination of parameters (with weights that sum to 0) that allows different comparisons of treatments.

To change our contrasts to compare plans A and B we will make a vector of parameters for each variable, then use the glht() function from the multcomp package to print out test results for those comparisons.

First, to make the vector of parameters we will wrap c(), which lists each parameter separated by a comma, in matrix() with 1 to indicate we want one row. The parameters we include are based on which levels we want to compare (A vs. B), the predictor variable to compare on, and where they are ‘located.’ If we look back at the coefficients table from the summary of our multinomial regression model we can see a list of 8 coefficients (the intercepts are included). So, to compare age between plans A and B we will need to change the 3rd and 4th contrast parameters, which we will make -1 and 1, respectively, so that their sum is 0. All of the other parameters we leave at 0, so our final vector to compare plans A and B by age is: c(0, 0, -1, 1, 0, 0, 0, 0).
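To see what this contrast actually computes, we can multiply it against the coefficient vector: it simply takes the difference between the age:2 and age:1 estimates. A quick check, with the coefficient values copied from the summary output above:

```r
# Coefficients in the order printed by summary(health_mlr)
coefs <- c(14.14238, 4.15785,    # (Intercept):1, (Intercept):2
           -0.35782, -0.05263,   # age:1, age:2
           -0.11025, -2.36652,   # genderMale:1, genderMale:2
           -0.06022, -0.07664)   # absent:1, absent:2

con_age <- matrix(c(0, 0, -1, 1, 0, 0, 0, 0), 1)

# The contrast picks out age:2 - age:1
con_age %*% coefs  # 0.30519, the estimate that glht() reports
```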

After loading the multcomp package, we will then make vectors to compare plans A and B on the gender and absent variables as well, then assign each to linfct in the glht() function, which also wraps our original model. We will also wrap glht() in summary() to print out the results of each test.

library(multcomp)

con_age <- matrix(c(0, 0, -1, 1, 0, 0, 0, 0), 1)
con_gender <- matrix(c(0, 0, 0, 0, -1, 1, 0, 0), 1)
con_absent <- matrix(c(0, 0, 0, 0, 0, 0, -1, 1), 1)

summary(glht(health_mlr, linfct = con_age))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: vglm(formula = plan ~ ., family = multinomial, data = health)
## 
## Linear Hypotheses:
##        Estimate Std. Error z value Pr(>|z|)   
## 1 == 0   0.3052     0.1007   3.029  0.00245 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
summary(glht(health_mlr, linfct = con_gender))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: vglm(formula = plan ~ ., family = multinomial, data = health)
## 
## Linear Hypotheses:
##        Estimate Std. Error z value Pr(>|z|)  
## 1 == 0   -2.256      1.136  -1.987   0.0469 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
summary(glht(health_mlr, linfct = con_absent))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Fit: vglm(formula = plan ~ ., family = multinomial, data = health)
## 
## Linear Hypotheses:
##        Estimate Std. Error z value Pr(>|z|)
## 1 == 0 -0.01642    0.06406  -0.256    0.798
## (Adjusted p values reported -- single-step method)

From these results we can see that only the first two comparisons, for age and gender, are statistically significant, and we can therefore conclude:

  • Younger employees have higher log-odds of choosing plan A over plan B.
  • Male employees are more likely to choose plan A over plan B.
  • The number of days absent does not affect how likely an employee will choose between plans A and B.
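As with the age contrast, the estimates glht() reported for the gender and absent comparisons are just the differences between the corresponding pairs of coefficients from the model summary:

```r
# Estimates copied from summary(health_mlr)
genderMale_1 <- -0.11025   # gender effect, plan A vs. C
genderMale_2 <- -2.36652   # gender effect, plan B vs. C
absent_1     <- -0.06022   # absent effect, plan A vs. C
absent_2     <- -0.07664   # absent effect, plan B vs. C

genderMale_2 - genderMale_1  # -2.25627, matching glht()'s -2.256
absent_2 - absent_1          # -0.01642, matching glht()'s -0.01642
```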