Load the mtcars data
In the past we have seen factor levels assigned to the wrong values in a data frame. The “two table” approach, calling table() before the conversion and again after it and confirming the counts line up, is aimed at preventing such errors. I am omitting several of the tables for brevity; you will get the idea.
data("mtcars")
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
mtcars$cyl <- as.factor(mtcars$cyl)
levels(mtcars$cyl) <- c('four','six','eight')
table(mtcars$cyl)
##
## four six eight
## 11 7 14
#
table(mtcars$carb)
##
## 1 2 3 4 6 8
## 7 10 3 10 1 1
mtcars$carb <- as.factor(mtcars$carb)
levels(mtcars$carb) <- c('one','two','three','four','six','eight')
table(mtcars$carb)
##
## one two three four six eight
## 7 10 3 10 1 1
Use glm() to predict ‘am’ with the predictors carb, mpg, and disp. It is important to use the factor versions of the categorical variables. As an example not from this problem: male could be coded 1 and female coded 2, but these are still categorical values; there is no numeric significance to female = 2. Running the regression with these values as factors accounts for their role as categorical variables.
trans.prediction.2 <- glm(am ~ carb+mpg+disp,data = mtcars,
family = 'binomial')
names(trans.prediction.2)
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
The project directions specified a glm() to predict ‘am’ using carb, mpg, and disp as predictors. The variable ‘carb’ must be treated as a factor.
The output from glm() contains the probabilities in “fitted.values” and the log(odds) in “linear.predictors”.
Are we trying to predict auto or manual? That is, if a fitted value is 0.98, which transmission type is likely? Use contrasts(mtcars$am): the level coded ‘1’ is the one being predicted. Probabilities are stored in the glm output component ‘fitted.values’ and the log(odds) in ‘linear.predictors’. Probabilities can also be recreated with the function predict().
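As a quick sketch (assuming the model object trans.prediction.2 fit above), the three routes to the probabilities agree: the stored ‘fitted.values’, the inverse logit of ‘linear.predictors’, and predict() with type = "response".

```r
# Three equivalent ways to get P(manual) for the training data:
p1 <- trans.prediction.2$fitted.values              # stored probabilities
p2 <- plogis(trans.prediction.2$linear.predictors)  # inverse logit of log(odds)
p3 <- predict(trans.prediction.2, type = "response")
all.equal(unname(p1), unname(p2))   # TRUE
all.equal(unname(p1), unname(p3))   # TRUE
```

plogis() is the logistic distribution function, i.e. the inverse of the logit link, which is why it converts the log(odds) back into probabilities.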
You can also use the model to predict transmission type for any combination of carb, mpg, and disp of your choice. Below we put the values carb = 'one', mpg = 35, and disp = 90 into a data frame “newdata” and let the model predict the probability of a manual transmission.
contrasts(mtcars$am)
## manual
## auto 0
## manual 1
# probabilities are for manual transmission, see '?contrasts'
summary(trans.prediction.2)
##
## Call:
## glm(formula = am ~ carb + mpg + disp, family = "binomial", data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.35673 -0.35073 -0.05023 0.18404 1.91985
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.420e+01 9.596e+00 -1.480 0.1389
## carbtwo -4.890e-01 1.616e+00 -0.303 0.7622
## carbthree -1.454e+01 3.701e+03 -0.004 0.9969
## carbfour 2.757e+00 1.856e+00 1.485 0.1374
## carbsix 2.060e+01 6.523e+03 0.003 0.9975
## carbeight 2.349e+01 6.523e+03 0.004 0.9971
## mpg 6.169e-01 3.712e-01 1.662 0.0966 .
## disp 8.067e-05 1.051e-02 0.008 0.9939
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 17.576 on 24 degrees of freedom
## AIC: 33.576
##
## Number of Fisher Scoring iterations: 17
newdata <- data.frame(carb='one',mpg=35,disp=90)
predict(trans.prediction.2,newdata,type="response")
## 1
## 0.9993871
Residuals: In a “good” model the residuals will be centered on zero, symmetrical, and have a small spread. We can look for these characteristics, or the lack of them, in the “Deviance Residuals:” entry of the model summary. Notice also that one level of each categorical variable is selected as the “base” level and the other levels receive coefficient values. Note too that none of the predictors is statistically significant, which suggests that we may be able to find a better set of predictors.
That said, look at the reported “Null deviance” and “Residual deviance” values. The null deviance is based on the intercept only and in this model is 43.23. The residual deviance is based on the intercept plus all of the predictors; at 17.576 it represents a very worthwhile reduction, indicating that the predictors have value even if their individual p-values are not significant.
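A rough sketch of how to turn that deviance drop into a formal test (using the model object fit above): the difference in deviances is approximately chi-squared, with degrees of freedom equal to the difference in degrees of freedom.

```r
# Likelihood-ratio test of the full model against the intercept-only model.
# Null deviance 43.230 on 31 df; residual deviance 17.576 on 24 df.
dev.drop <- trans.prediction.2$null.deviance - trans.prediction.2$deviance
df.drop  <- trans.prediction.2$df.null - trans.prediction.2$df.residual
pchisq(dev.drop, df = df.drop, lower.tail = FALSE)
# a small p-value: the predictors jointly add value
```

The sequential version of the same idea is anova(trans.prediction.2, test = "Chisq"), which tests each predictor as it is added.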
#
# SEE BELOW: trans.prediction.2$fitted.values
# will give the same results.
mtcars$prob.manual <- round(predict(trans.prediction.2,mtcars,type = 'response'),4)
pobserv <- as.factor(mtcars$am=='manual')
# prob.manual is P(manual); pobserv is the actual
# transmission type (TRUE if manual)
#
# If P(manual)>0.5 say the prediction is for "manual transmission"
# how many right, how many wrong
# I am cutting output to ten lines
head(mtcars[,c('am','prob.manual')],10)
## am prob.manual
## Mazda RX4 manual 0.8208
## Mazda RX4 Wag manual 0.8208
## Datsun 710 manual 0.4680
## Hornet 4 Drive auto 0.2729
## Hornet Sportabout auto 0.0420
## Valiant auto 0.0466
## Duster 360 auto 0.0694
## Merc 240D auto 0.5922
## Merc 230 auto 0.3510
## Merc 280 auto 0.6016
A ‘confusion matrix’, produced here by the confusionMatrix() function from the caret package loaded earlier, will be a big help in deciding how useful the glm classifications really are.
“We can not rely on a single value of accuracy in classification when the classes are imbalanced. For example, suppose we have a data set of 100 patients in which 5 have diabetes and 95 are healthy. If our model only predicts the majority class i.e. predicts all 100 people are healthy, we will have a classification accuracy of 95% but no real useful information. Therefore, we need a confusion matrix.”
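As a cross-check that needs no extra package (a sketch using the prob.manual column and pobserv created above), the raw counts behind the confusion matrix are just a two-way table:

```r
# Base-R cross-tabulation of predicted vs. observed transmission type
table(predicted = mtcars$prob.manual > 0.5,
      observed  = pobserv)
```

The counts match the caret output below; confusionMatrix() adds the accuracy, kappa, sensitivity, and related statistics on top of this table.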
confusionMatrix(data=as.factor(mtcars$prob.manual >0.5),
reference = pobserv,
dnn =
c('Manual Predicted','Actual Type\n\t\t\t\tis manual'))
## Confusion Matrix and Statistics
##
## Actual Type
## is manual
## Manual Predicted FALSE TRUE
## FALSE 17 3
## TRUE 2 10
##
## Accuracy : 0.8438
## 95% CI : (0.6721, 0.9472)
## No Information Rate : 0.5938
## P-Value [Acc > NIR] : 0.002273
##
## Kappa : 0.6721
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.8947
## Specificity : 0.7692
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.8333
## Prevalence : 0.5938
## Detection Rate : 0.5312
## Detection Prevalence : 0.6250
## Balanced Accuracy : 0.8320
##
## 'Positive' Class : FALSE
##
Now try predicting a second variable. The techniques used are a repeat of what was done above for the variable ‘am’.
Predict ‘vs’ using predictors mpg, am, and hp. Be sure you are using the factor ‘am’.
Are we trying to predict V or S? That is, if a fitted value is 0.98, which block type is likely? Use contrasts(mtcars$vs).
contrasts(mtcars$vs)
## S
## V 0
## S 1
# indicates we are predicting Straight block.
block.model <- glm(vs ~ mpg+am+hp,data = mtcars,
family = 'binomial')
# the package "caret" contains the function 'confusionMatrix'
# We ran library(caret) above.
summary(block.model)
##
## Call:
## glm(formula = vs ~ mpg + am + hp, family = "binomial", data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.09166 -0.13017 -0.00135 0.09913 1.66052
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.96351 9.33802 1.067 0.2860
## mpg 0.26884 0.25554 1.052 0.2928
## ammanual -5.17703 2.82366 -1.833 0.0667 .
## hp -0.10668 0.05455 -1.955 0.0505 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.860 on 31 degrees of freedom
## Residual deviance: 10.844 on 28 degrees of freedom
## AIC: 18.844
##
## Number of Fisher Scoring iterations: 8
# probabilities are for straight block, see 'contrasts'
# create variables useful for contrast matrix
predict.block<-round(predict(block.model,mtcars,type = 'response'),4)
observed.block <- as.factor(mtcars$vs=='S')
# predict.block is P(straight), observed.block is actual block type,
# TRUE for Straight, FALSE for V block.
# If P(straight)>0.5 the prediction is for "Straight Block"
# How many right, how many wrong?
confusionMatrix(data=as.factor(predict.block>0.5),
reference = observed.block,
dnn =
c('Straight Predicted','Actual Type\n\t\t\t\tis Straight'))
## Confusion Matrix and Statistics
##
## Actual Type
## is Straight
## Straight Predicted FALSE TRUE
## FALSE 17 1
## TRUE 1 13
##
## Accuracy : 0.9375
## 95% CI : (0.7919, 0.9923)
## No Information Rate : 0.5625
## P-Value [Acc > NIR] : 3.289e-06
##
## Kappa : 0.873
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9444
## Specificity : 0.9286
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.9286
## Prevalence : 0.5625
## Detection Rate : 0.5312
## Detection Prevalence : 0.5625
## Balanced Accuracy : 0.9365
##
## 'Positive' Class : FALSE
##
#
mtcars$predict.block <- predict.block
# I am cutting output to ten lines
head(mtcars[,c('vs','predict.block')],10)
## vs predict.block
## Mazda RX4 V 0.2137
## Mazda RX4 Wag V 0.2137
## Datsun 710 S 0.7300
## Hornet 4 Drive S 0.9817
## Hornet Sportabout V 0.0246
## Valiant S 0.9741
## Duster 360 V 0.0000
## Merc 240D S 1.0000
## Merc 230 S 0.9974
## Merc 280 S 0.8812
Both of these models do a fairly good job of classification. I suggest you read the confusion-matrix reference quoted above; it does a good job of defining the confusion matrix and explaining why it is needed and how to use it.
-jts 11-01-2020