We will explore logistic regression by examining the classic mtcars
data set.
library(car)
library(broom)
library(mosaic)
library(caret)
mtcars
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Consider the am variable that indicates automatic or manual transmission.
tally(~ am, data = mtcars)
am
0 1
19 13
Note that the variable is not coded as a factor variable. We would normally rectify this situation using the factor command, but we will delay that step for the time being: the scatterplots below require the variable to be coded numerically.
Let’s try to predict am (the transmission type—automatic or manual) from mpg. (Remember, this was 1974!) What about linear regression?
ggplot(mtcars, aes(y = am, x = mpg)) +
geom_jitter(height = 0) +
geom_smooth(method = "lm")
This clearly doesn’t work.
The plot below uses a loess curve to try to “fit” the data.
ggplot(mtcars, aes(y = am, x = mpg)) +
geom_jitter(height = 0) +
geom_smooth()
Finally, we fit a “logistic regression” curve.
ggplot(mtcars, aes(y = am, x = mpg)) +
geom_jitter(height = 0) +
geom_smooth(method = "glm", method.args = list(family = binomial))
Logistic regression doesn’t have too many assumptions (unlike standard linear regression). Obviously, the response variable must be binary and the observations in the data must be independent. Other than that, there’s not much else! One additional requirement is a “non-separability” condition: there must be some “overlap” in the outcomes relative to the explanatory variable. As an example, here is some fake data:
fake_data <- data.frame(y = c(rep(0, 10), rep(1, 10)), x = 1:20)
ggplot(fake_data, aes(y = y, x = x)) + geom_point()
We can see that there is a boundary between 10 and 11 for which everything to the left is classified as “0” and everything to the right is “1”. This will break logistic regression:
glm(y ~ x, data = fake_data, family = binomial)
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
Call: glm(formula = y ~ x, family = binomial, data = fake_data)
Coefficients:
(Intercept) x
-433.39 41.28
Degrees of Freedom: 19 Total (i.e. Null); 18 Residual
Null Deviance: 27.73
Residual Deviance: 4.357e-09 AIC: 4
The glm function boldly tried to calculate the regression anyway (with warnings), but the answer makes no sense.
For this section, we will go ahead and change am to a factor variable.
am <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
mtcars2 <- data.frame(am, mpg = mtcars$mpg)
Running the model…
am_mpg <- glm(am ~ mpg, data = mtcars2, family = binomial)
am_mpg_tidy <- tidy(am_mpg)
am_mpg_tidy
Mathematically, the model is
\[\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = -6.6 + 0.3 mpg,\] or equivalently,
\[\hat{p} = \frac{1}{1 + e^{-(-6.6 + 0.3 mpg)}}\]
Another mathematically equivalent way to express it:
\[\hat{p} = \frac{e^{-6.6 + 0.3 mpg}}{1 + e^{-6.6 + 0.3 mpg}}\]
The intercept -6.6035267 is the log odds of having manual transmission for a car with 0 mpg. This clearly makes no sense.
The coefficient of mpg, 0.3070282, is the predicted change in the log odds corresponding to an increase of one mpg. (But how easy is it to understand this?)
An argument to the tidy command will give the coefficients in the much more convenient exponentiated form:
am_mpg_tidy_exp <- tidy(am_mpg, exponentiate = TRUE)
am_mpg_tidy_exp
Mathematically, we have exponentiated both sides, giving us the odds instead of the log odds.
\[\frac{\hat{p}}{1 - \hat{p}} = e^{-6.6 + 0.3 mpg} = e^{-6.6} e^{0.3 mpg}.\]
Note that \(e^{-6.6}\) is 0.0013556 and \(e^{0.3}\) is 1.3593793.
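As a quick check, we can recover both of these numbers directly from the tidied coefficients computed above:
# Exponentiate the coefficients from the original (log-odds) fit;
# these should be approximately 0.0013556 and 1.3593793, the values quoted above
exp(am_mpg_tidy$estimate)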
If we substitute these exponentiated values into the above expression, we get
\[\frac{\hat{p}}{1 - \hat{p}} = 0.001 \left(1.36^{mpg}\right).\]
The exponentiated intercept is the odds of a car having manual transmission if mpg is zero. Again, this makes no sense. (However, it does make sense that this is what the model would predict since the logistic function will be asymptotically close to zero that far to the left of the data.)
The exponentiated coefficient tells us what happens to the odds if we increase mpg by one unit. Here’s why. Suppose \(\hat{p}'\) is the probability corresponding to adding one to mpg:
\[\frac{\hat{p}'}{1 - \hat{p}'} = 0.001 \left(1.36^{(mpg + 1)}\right).\]
The right hand side is equal to
\[0.001 \left(1.36^{mpg}\right)(1.36),\]
which is just 1.36 times the odds for the original mpg value. In other words, adding one to mpg simply multiplies the odds by 1.36 (the value of the exponentiated coefficient).
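We can verify this multiplicative behavior numerically. Here is a small sketch using the am_mpg model fit above; predict returns log odds by default, so exponentiating gives the odds.
# Odds of manual transmission at 20 mpg and at 21 mpg
odds_20 <- exp(predict(am_mpg, data.frame(mpg = 20)))
odds_21 <- exp(predict(am_mpg, data.frame(mpg = 21)))
odds_21 / odds_20               # about 1.36: one extra mpg multiplies the odds
exp(am_mpg_tidy$estimate[2])    # the exponentiated coefficient, the same value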
Mean centering can help by giving an interpretable intercept.
mtcars2 <- mtcars2 %>%
mutate(mpg_cent = mpg - mean(mpg))
am_mpg_mc <- glm(am ~ mpg_cent, data = mtcars2, family = binomial)
am_mpg_mc_tidy <- tidy(am_mpg_mc)
am_mpg_mc_tidy
The coefficient for mpg has not changed, but the intercept now has a reasonable interpretation. When the mean-centered mpg is zero—in other words, when a car gets 20.090625 mpg—the predicted log odds of having manual transmission are -0.4351385, or, alternatively, the odds are 0.647175 (\(e^{-0.4351385}\)).
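As a quick sanity check (using the two models already fit above), the centered intercept should agree with the original model’s predicted log odds at the mean mpg:
# Log odds from the original model, evaluated at the average mpg
predict(am_mpg, data.frame(mpg = mean(mtcars$mpg)))
# Intercept from the mean-centered model (should also be about -0.435)
am_mpg_mc_tidy$estimate[1]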
The coefficient of an explanatory variable has a convenient interpretation on the log-odds scale and on the odds scale, even if neither is especially intuitive: a one-unit increase in the explanatory variable predicts an additive effect on the log odds, or, equivalently, a multiplicative effect on the odds. However, no such simple interpretation exists for the probability of success.
For example, let’s see what happens to the probability of having manual transmission when we go from 10 mpg to 11 mpg:
This will give us the log odds at 10 mpg:
log_odds_10 <- predict(am_mpg, data.frame(mpg = 10))
log_odds_10
1
-3.533245
We convert it to a probability by applying the logistic function. There are two forms that are mathematically equivalent:
\[\frac{1}{1 + e^{-x}} = \frac{e^{x}}{1 + e^{x}}.\] We’ll use the first one since we only have to put in the value of \(x\) in one spot.
1/(1 + exp(-log_odds_10))
1
0.02838097
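Just to confirm, the second form of the logistic function gives the same answer:
exp(log_odds_10)/(1 + exp(log_odds_10))   # also 0.02838097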
This appears in R as plogis.
prob_10 <- plogis(log_odds_10)
prob_10
1
0.02838097
An easier way to do this is to use an optional argument to the predict function:
prob_10 <- predict(am_mpg, data.frame(mpg = 10), type = "response")
prob_10
1
0.02838097
We do the same for 11 mpg:
prob_11 <- predict(am_mpg, data.frame(mpg = 11), type = "response")
prob_11
1
0.03819098
The change in probability is
prob_11 - prob_10
1
0.009810004
Okay, now look at a change from 20 mpg to 21 mpg:
prob_20 <- predict(am_mpg, data.frame(mpg = 20), type = "response")
prob_21 <- predict(am_mpg, data.frame(mpg = 21), type = "response")
prob_21 - prob_20
1
0.07481195
This is a much larger change. That’s because the logistic function is much steeper toward the middle of the graph than at the edges.
If you do happen to need a prediction near the middle of the graph, though, it is nice that the graph is approximately linear through a fairly wide range of values (here, maybe from 15 to 25 or so). The \(\beta/4\) rule tells us that the slope of the straight-line approximation near the middle of the graph is \(\beta/4\), where \(\beta\) is the coefficient of the explanatory variable in the model.¹ Let’s calculate that here:
am_mpg_tidy$estimate[2]/4
[1] 0.07675705
That’s pretty close to the answer we got when predicting the difference in probability between 20 and 21 mpg.
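As a quick illustration of where the rule comes from (a sketch using base R’s dlogis, the derivative of the logistic function plogis), the slope of the fitted curve at its midpoint is the coefficient times dlogis(0) = 0.25:
# Slope of the logistic curve at the 50% point: beta * dlogis(0) = beta/4
dlogis(0) * am_mpg_tidy$estimate[2]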
Since the goal of logistic regression is to make predictions of a binary response variable, we can evaluate the model by comparing the predictions made by the model to the actual values of the response variable. For example, a sensible thing to do would be to say that if the model predicts a probability of greater than 50% (log odds greater than 0), then the car is more likely to have manual transmission, whereas if the probability is less than 50% (log odds less than 0), then the car is more likely to have automatic transmission.
Another way to look at it is to find the mpg value where the model crosses the 50% probability mark. Since 50% probability is the same as log odds of zero, all we need to do is set log odds equal to zero in the model:
\[0 = -6.6 + 0.3 mpg\] Solving this precisely, we get
-am_mpg_tidy$estimate[1]/am_mpg_tidy$estimate[2]
[1] 21.50788
You can check the graph above to see that the 50% mark does, indeed, correspond to about 21 or 22 mpg. Therefore, any car that gets less than 21.5 mpg will be predicted to have automatic transmission, while any car that gets greater than 21.5 mpg will be predicted to have manual transmission.
We can easily make a variable that is “Automatic” when the predicted log odds are negative and “Manual” when they are positive.
am_pred <- factor(ifelse(predict(am_mpg) < 0, "Automatic", "Manual"))
am_pred
1 2 3 4 5 6 7 8 9 10 11
Automatic Automatic Manual Automatic Automatic Automatic Automatic Manual Manual Automatic Automatic
12 13 14 15 16 17 18 19 20 21 22
Automatic Automatic Automatic Automatic Automatic Automatic Manual Manual Manual Automatic Automatic
23 24 25 26 27 28 29 30 31 32
Automatic Automatic Automatic Manual Manual Manual Automatic Automatic Automatic Automatic
Levels: Automatic Manual
Compare this to the actual values of am from the original data:
mtcars2$am
[1] Manual Manual Manual Automatic Automatic Automatic Automatic Automatic Automatic Automatic
[11] Automatic Automatic Automatic Automatic Automatic Automatic Automatic Manual Manual Manual
[21] Automatic Automatic Automatic Automatic Automatic Manual Manual Manual Manual Manual
[31] Manual Manual
Levels: Automatic Manual
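Eyeballing two long vectors is error-prone, so a simple cross-tabulation (plain base R here) summarizes the agreement more compactly:
# Rows are the model's predictions, columns are the actual transmission types
table(predicted = am_pred, actual = mtcars2$am)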
The caret package has a convenient function called confusionMatrix that creates a table of predicted values versus actual (reference) values.
confusionMatrix(am_pred, mtcars2$am)
Confusion Matrix and Statistics
Reference
Prediction Automatic Manual
Automatic 17 6
Manual 2 7
Accuracy : 0.75
95% CI : (0.566, 0.8854)
No Information Rate : 0.5938
P-Value [Acc > NIR] : 0.04978
Kappa : 0.4553
Mcnemar's Test P-Value : 0.28884
Sensitivity : 0.8947
Specificity : 0.5385
Pos Pred Value : 0.7391
Neg Pred Value : 0.7778
Prevalence : 0.5938
Detection Rate : 0.5312
Detection Prevalence : 0.7188
Balanced Accuracy : 0.7166
'Positive' Class : Automatic
Therefore, the logistic regression model has a 75% accuracy rate in classifying cars as automatic or manual based solely on mpg. Since most of the cars are automatics (19/32, or about 60%), a completely naive approach would be to predict that every car has an automatic transmission, for an accuracy of about 60%. This is called the “No Information Rate” (NIR). But using mpg as a predictor improves the accuracy to 75%, and the P-Value [Acc > NIR] of 0.04978 above tells us that this is a statistically significant improvement!
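For the curious, that P-value is essentially a one-sided binomial test of the 24 correct classifications against the no-information rate; here is a quick sketch of that comparison in base R:
# 17 + 7 = 24 correct out of 32; is that better than always guessing "Automatic" (19/32)?
binom.test(24, 32, p = 19/32, alternative = "greater")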
We can use more than one predictor variable. Let’s try predicting am from mpg_cent and vs. (Note that vs is a dichotomous categorical variable indicating the engine type, V-shaped vs. straight.) First, we’ll include vs in our mtcars2 data frame.
mtcars2 <- mtcars2 %>%
mutate(vs = mtcars$vs)
str(mtcars2)
'data.frame': 32 obs. of 4 variables:
$ am : Factor w/ 2 levels "Automatic","Manual": 2 2 2 1 1 1 1 1 1 1 ...
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ mpg_cent: num 0.909 0.909 2.709 1.309 -1.391 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
am_mpg_vs <- glm(am ~ mpg_cent + vs, data = mtcars2, family = binomial)
am_mpg_vs_tidy <- tidy(am_mpg_vs)
am_mpg_vs_tidy
am_mpg_vs_tidy_exp <- tidy(am_mpg_vs, exponentiate = TRUE)
am_mpg_vs_tidy_exp
The interpretations of the intercept and coefficients do not change except that now we must take care—as in any multiple regression setting—to mention that they predict changes assuming all other variables are held constant.
Note that the coefficient of vs is not statistically significant.
Here is the confusion matrix for this model:
am_pred2 <- factor(ifelse(predict(am_mpg_vs) < 0, "Automatic", "Manual"))
confusionMatrix(am_pred2, mtcars2$am)
Confusion Matrix and Statistics
Reference
Prediction Automatic Manual
Automatic 16 4
Manual 3 9
Accuracy : 0.7812
95% CI : (0.6003, 0.9072)
No Information Rate : 0.5938
P-Value [Acc > NIR] : 0.02102
Kappa : 0.541
Mcnemar's Test P-Value : 1.00000
Sensitivity : 0.8421
Specificity : 0.6923
Pos Pred Value : 0.8000
Neg Pred Value : 0.7500
Prevalence : 0.5938
Detection Rate : 0.5000
Detection Prevalence : 0.6250
Balanced Accuracy : 0.7672
'Positive' Class : Automatic
The accuracy has improved a little to about 78%, but that’s not much higher than 75%. Due to the small sample size, this actually means that only one additional car was predicted correctly. (In fact, one fewer automatic was classified correctly by this model, but because two more manuals were correctly classified, the net effect is a slightly better accuracy score.)
As with other forms of regression, ANOVA can tell us if the addition of predictors is significant.
# This is Type II ANOVA
Anova(am_mpg_vs)
Analysis of Deviance Table (Type II tests)
Response: am
LR Chisq Df Pr(>Chisq)
mpg_cent 17.3788 1 3.062e-05 ***
vs 4.7314 1 0.02962 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Although the coefficient of vs was not significant, the ANOVA table tells us that the addition of vs to the model adds some predictive power.
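Equivalently, since the two models are nested, base R’s anova can run the same likelihood ratio test by comparing them directly (a quick sketch using the models fit above):
# Compare the mpg-only (mean-centered) model with the model that adds vs
anova(am_mpg_mc, am_mpg_vs, test = "LRT")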
¹ For those with some calculus background, \(\beta/4\) is the derivative of the logistic function evaluated at the point where the function is equal to 0.5.