Load the mtcars data
In the past we have seen factor levels assigned to the wrong values in a data frame. The “two table” approach, calling table() before the conversion and again after it and confirming the counts line up, is aimed at preventing such errors. I am omitting several of the tables for brevity; you will get the idea.
data("mtcars")
table(mtcars$cyl)
##
## 4 6 8
## 11 7 14
mtcars$cyl <- as.factor(mtcars$cyl)
levels(mtcars$cyl) <- c('four','six','eight')
table(mtcars$cyl)
##
## four six eight
## 11 7 14
#
table(mtcars$carb)
##
## 1 2 3 4 6 8
## 7 10 3 10 1 1
mtcars$carb <- as.factor(mtcars$carb)
levels(mtcars$carb) <- c('one','two','three','four','six','eight')
table(mtcars$carb)
##
## one two three four six eight
## 7 10 3 10 1 1
Use glm() to predict ‘am’ with the predictors carb, mpg, and disp. It is important to use the factor versions of the categorical variables. As an example not from this problem: male could be coded 1 and female coded 2, but these are still categorical values; there is no numeric significance to female = 2. Running the regression with these values as factors accounts for their role as categorical variables.
trans.prediction.2 <- glm(am ~ carb+mpg+disp,data = mtcars,
family = 'binomial')
names(trans.prediction.2)
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
The project directions specified a glm() to predict ‘am’ using carb, mpg, and disp as predictors. The variable ‘carb’ must be treated as a factor.
The output from glm() contains the probabilities in “fitted.values” and the log(odds) in “linear.predictors”.
Are we trying to predict auto or manual? That is, if a fitted value is 0.98, which transmission type is likely? Use contrasts(mtcars$am): the level coded ‘1’ is the one being predicted. Probabilities are stored in the glm output component ‘fitted.values’ and the log(odds) in ‘linear.predictors’. Probabilities can also be recreated with the function predict().
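As a quick sketch (assuming the model object trans.prediction.2 fit above), the three routes to the probabilities agree: the stored ‘fitted.values’, the inverse logit of ‘linear.predictors’, and predict() with type = "response".

```r
# Three equivalent ways to get P(manual) for the training data:
p1 <- trans.prediction.2$fitted.values              # stored probabilities
p2 <- plogis(trans.prediction.2$linear.predictors)  # inverse logit of log(odds)
p3 <- predict(trans.prediction.2, type = "response")
all.equal(unname(p1), unname(p2))   # TRUE
all.equal(unname(p1), unname(p3))   # TRUE
```

plogis() is the logistic distribution function, i.e. the inverse of the logit link, which is why it converts the log(odds) back into probabilities.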
You can also use the model to predict transmission type for any combination of carb, mpg, and disp of your choice. Below we put the values carb = 'one', mpg = 35, and disp = 90 into a data frame “newdata” and let the model predict the probability of a manual transmission.
contrasts(mtcars$am)
## manual
## auto 0
## manual 1
# probabilities are for manual transmission, see '?contrasts'
summary(trans.prediction.2)
##
## Call:
## glm(formula = am ~ carb + mpg + disp, family = "binomial", data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.35673 -0.35073 -0.05023 0.18404 1.91985
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.420e+01 9.596e+00 -1.480 0.1389
## carbtwo -4.890e-01 1.616e+00 -0.303 0.7622
## carbthree -1.454e+01 3.701e+03 -0.004 0.9969
## carbfour 2.757e+00 1.856e+00 1.485 0.1374
## carbsix 2.060e+01 6.523e+03 0.003 0.9975
## carbeight 2.349e+01 6.523e+03 0.004 0.9971
## mpg 6.169e-01 3.712e-01 1.662 0.0966 .
## disp 8.067e-05 1.051e-02 0.008 0.9939
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 17.576 on 24 degrees of freedom
## AIC: 33.576
##
## Number of Fisher Scoring iterations: 17
newdata <- data.frame(carb='one',mpg=35,disp=90)
predict(trans.prediction.2,newdata,type="response")
## 1
## 0.9993871
Residuals: In a “good” model the residuals will be centered on zero, symmetrical, and have a small spread. We can look for these characteristics, or the lack of them, in the “Deviance Residuals:” entry of the model summary. Notice also that one level of each categorical variable is selected as the “base” level and the other levels receive coefficient values. Note too that none of the predictors is statistically significant, which suggests that we may be able to find a better set of predictors.
That said, look at the reported “Null deviance” and “Residual deviance” values. The null deviance is based on the intercept only and in this model is 43.23. The residual deviance is based on the intercept plus all of the predictors; at 17.576 it represents a very worthwhile reduction, indicating that the predictors have value even if their individual p-values are not significant.
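A rough sketch of how to turn that deviance drop into a formal test (using the model object fit above): the difference in deviances is approximately chi-squared, with degrees of freedom equal to the difference in degrees of freedom.

```r
# Likelihood-ratio test of the full model against the intercept-only model.
# Null deviance 43.230 on 31 df; residual deviance 17.576 on 24 df.
dev.drop <- trans.prediction.2$null.deviance - trans.prediction.2$deviance
df.drop  <- trans.prediction.2$df.null - trans.prediction.2$df.residual
pchisq(dev.drop, df = df.drop, lower.tail = FALSE)
# a small p-value: the predictors jointly add value
```

The sequential version of the same idea is anova(trans.prediction.2, test = "Chisq"), which tests each predictor as it is added.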
#
# SEE BELOW: trans.prediction.2$fitted.values
# will give the same results.
mtcars$prob.manual <- round(predict(trans.prediction.2,mtcars,type = 'response'),4)
pobserv <- as.factor(mtcars$am=='manual')
# prob.manual is P(manual); pobserv is the actual
# transmission type (TRUE if manual)
#
# If P(manual)>0.5 say the prediction is for "manual transmission"
# how many right, how many wrong
# I am cutting output to ten lines
head(mtcars[,c('am','prob.manual')],10)
## am prob.manual
## Mazda RX4 manual 0.8208
## Mazda RX4 Wag manual 0.8208
## Datsun 710 manual 0.4680
## Hornet 4 Drive auto 0.2729
## Hornet Sportabout auto 0.0420
## Valiant auto 0.0466
## Duster 360 auto 0.0694
## Merc 240D auto 0.5922
## Merc 230 auto 0.3510
## Merc 280 auto 0.6016
A ‘confusion matrix’, produced here by the confusionMatrix() function from the caret package loaded earlier, will be a big help in deciding how useful the glm classifications really are.
“We can not rely on a single value of accuracy in classification when the classes are imbalanced. For example, suppose we have a data set of 100 patients in which 5 have diabetes and 95 are healthy. If our model only predicts the majority class i.e. predicts all 100 people are healthy, we will have a classification accuracy of 95% but no real useful information. Therefore, we need a confusion matrix.”
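As a cross-check that needs no extra package (a sketch using the prob.manual column and pobserv created above), the raw counts behind the confusion matrix are just a two-way table:

```r
# Base-R cross-tabulation of predicted vs. observed transmission type
table(predicted = mtcars$prob.manual > 0.5,
      observed  = pobserv)
```

The counts match the caret output below; confusionMatrix() adds the accuracy, kappa, sensitivity, and related statistics on top of this table.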
confusionMatrix(data=as.factor(mtcars$prob.manual >0.5),
reference = pobserv,
dnn =
c('Manual Predicted','Actual Type\n\t\t\t\tis manual'))
## Confusion Matrix and Statistics
##
## Actual Type
## is manual
## Manual Predicted FALSE TRUE
## FALSE 17 3
## TRUE 2 10
##
## Accuracy : 0.8438
## 95% CI : (0.6721, 0.9472)
## No Information Rate : 0.5938
## P-Value [Acc > NIR] : 0.002273
##
## Kappa : 0.6721
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.8947
## Specificity : 0.7692
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.8333
## Prevalence : 0.5938
## Detection Rate : 0.5312
## Detection Prevalence : 0.6250
## Balanced Accuracy : 0.8320
##
## 'Positive' Class : FALSE
##
Now try predicting a second variable. The techniques used are a repeat of what was done above for the variable ‘am’.
Predict ‘vs’ using predictors mpg, am, and hp. Be sure you are using the factor ‘am’.
Are we trying to predict V or S? That is, if a fitted value is 0.98, which block type is likely? Use contrasts(mtcars$vs).
contrasts(mtcars$vs)
## S
## V 0
## S 1
# indicates we are predicting Straight block.
block.model <- glm(vs ~ mpg+am+hp,data = mtcars,
family = 'binomial')
# the package "caret" contains the function 'confusionMatrix'
# We ran library(caret) above.
summary(block.model)
##
## Call:
## glm(formula = vs ~ mpg + am + hp, family = "binomial", data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.09166 -0.13017 -0.00135 0.09913 1.66052
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.96351 9.33802 1.067 0.2860
## mpg 0.26884 0.25554 1.052 0.2928
## ammanual -5.17703 2.82366 -1.833 0.0667 .
## hp -0.10668 0.05455 -1.955 0.0505 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.860 on 31 degrees of freedom
## Residual deviance: 10.844 on 28 degrees of freedom
## AIC: 18.844
##
## Number of Fisher Scoring iterations: 8
# probabilities are for straight block, see 'contrasts'
# create variables useful for contrast matrix
predict.block<-round(predict(block.model,mtcars,type = 'response'),4)
observed.block <- as.factor(mtcars$vs=='S')
# predict.block is P(straight), observed.block is actual block type,
# TRUE for Straight, FALSE for V block.
# If P(straight)>0.5 the prediction is for "Straight Block"
# How many right, how many wrong?
confusionMatrix(data=as.factor(predict.block>0.5),
reference = observed.block,
dnn =
c('Straight Predicted','Actual Type\n\t\t\t\tis Straight'))
## Confusion Matrix and Statistics
##
## Actual Type
## is Straight
## Straight Predicted FALSE TRUE
## FALSE 17 1
## TRUE 1 13
##
## Accuracy : 0.9375
## 95% CI : (0.7919, 0.9923)
## No Information Rate : 0.5625
## P-Value [Acc > NIR] : 3.289e-06
##
## Kappa : 0.873
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9444
## Specificity : 0.9286
## Pos Pred Value : 0.9444
## Neg Pred Value : 0.9286
## Prevalence : 0.5625
## Detection Rate : 0.5312
## Detection Prevalence : 0.5625
## Balanced Accuracy : 0.9365
##
## 'Positive' Class : FALSE
##
#
mtcars$predict.block <- predict.block
# I am cutting output to ten lines
head(mtcars[,c('vs','predict.block')],10)
## vs predict.block
## Mazda RX4 V 0.2137
## Mazda RX4 Wag V 0.2137
## Datsun 710 S 0.7300
## Hornet 4 Drive S 0.9817
## Hornet Sportabout V 0.0246
## Valiant S 0.9741
## Duster 360 V 0.0000
## Merc 240D S 1.0000
## Merc 230 S 0.9974
## Merc 280 S 0.8812
Both of these models do a fairly good job of classification. I suggest you read the confusion-matrix reference quoted above; it does a good job of defining the confusion matrix and explaining why it is needed and how to use it.
-jts 11-01-2020