Finding Optimal Threshold using ROC Curve for Classification


Let’s assume a two-class (A and B) classification problem. Suppose also that a classification algorithm predicts that a given observation belongs to class A with probability 0.8. To decide whether it’s class A or B, we need a threshold parameter (i.e., a cutoff).

If 0.8 is higher than the threshold θ, we predict that the observation is class A. Otherwise, we predict class B.
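
For illustration, here is that decision rule in R (prob.A and theta are made-up values):

> prob.A <- 0.8  # predicted probability of class A (made-up value)
> theta <- 0.5   # decision threshold (made-up value)
> ifelse(prob.A > theta, "A", "B")
[1] "A"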

An ROC curve can be used to choose this threshold. If we consider sensitivity and specificity equally important, we can simply set the threshold to the point on the curve closest to the top-left corner of the graph.
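
Conceptually, "closest to the top-left corner" means minimizing the squared distance to the perfect classifier at sensitivity = 1 and specificity = 1. A hypothetical helper illustrating the criterion (pROC computes this internally; topleft.dist is not part of its API):

> # Squared distance from a (sensitivity, specificity) point to the top-left corner.
> topleft.dist <- function(sens, spec) (1 - sens)^2 + (1 - spec)^2
> topleft.dist(1, 1)  # a perfect classifier has distance 0
[1] 0

Below is a complete example in R.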

> library(caret)
> library(pROC)
> data(iris)

> # Make this two class classification.
> iris <- iris[iris$Species == "virginica" | iris$Species == "versicolor", ]
> iris$Species <- factor(iris$Species)  # drop the now-unused "setosa" level
> levels(iris$Species)
[1] "versicolor" "virginica" 

> # Get train and test datasets.
> # I'm intentionally keeping the training set small so that we get some wrong
> # classification results. This will give the ROC graph some curves.
> samples <- sample(NROW(iris), NROW(iris) * .5)
> data.train <- iris[samples, ]
> data.test <- iris[-samples, ]
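
One caveat: sample() is random, so the exact numbers below will differ between runs. For a reproducible split you would set a seed first (the value 123 is arbitrary):

> set.seed(123)  # arbitrary seed; call before sample() to fix the split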

> # Random forest
> forest.model <- train(Species ~ ., data.train)
> forest.model
50 samples
 4 predictors
 2 classes: 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrap (25 reps) 

Summary of sample sizes: 50, 50, 50, 50, 50, 50, ... 

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.928     0.855  0.0532       0.105   
  3     0.924     0.849  0.0569       0.113   
  4     0.917     0.834  0.0625       0.123   

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2. 
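
As an aside, the 25-rep bootstrap above is just caret's default resampling. A sketch of requesting 10-fold cross-validation instead (forest.model.cv is a name introduced here; trainControl is standard caret):

> forest.model.cv <- train(Species ~ ., data.train,
+                          trControl = trainControl(method = "cv", number = 10))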

> # Prediction.
> result.predicted.prob <- predict(forest.model, data.test, type="prob")
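
With type="prob", predict() returns a data frame with one probability column per class level; you can peek at it before feeding it to roc():

> head(result.predicted.prob)  # columns "versicolor" and "virginica"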

> # Draw ROC curve.
> result.roc <- roc(data.test$Species, result.predicted.prob$versicolor)
> plot(result.roc, print.thres="best", print.thres.best.method="closest.topleft")

> result.roc

Call:
roc.default(response = data.test$Species, predictor = result.predicted.prob$versicolor)

Data: result.predicted.prob$versicolor in 25 controls (data.test$Species versicolor) > 25 cases (data.test$Species virginica).
Area under the curve: 0.9872
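
The AUC shown above can also be extracted programmatically with pROC's auc():

> auc(result.roc)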

> # Get some more values.
> result.coords <- coords(
+   result.roc, "best", best.method="closest.topleft", ret=c("threshold", "accuracy"))

> print(result.coords)
threshold  accuracy 
    0.752     0.960 
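
coords() can return more than the threshold and accuracy; for example, the sensitivity and specificity at the chosen cutoff (see help(coords) for the full list of ret values):

> coords(result.roc, "best", best.method="closest.topleft",
+        ret=c("threshold", "sensitivity", "specificity"))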

> # Make prediction using the best top-left cutoff.
> result.predicted.label <- factor(
+   ifelse(result.predicted.prob$versicolor > result.coords["threshold"],
+          "versicolor", "virginica"))

> xtabs(~ result.predicted.label + data.test$Species)
                      data.test$Species
result.predicted.label versicolor virginica
            versicolor         24         1
            virginica           1        24
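
As a cross-check, caret's confusionMatrix() produces the same table together with accuracy and related statistics:

> confusionMatrix(result.predicted.label, data.test$Species)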

The accuracy was 0.960 when the threshold was set to 0.752, which matches (24+24)/(24+1+1+24) from the cross tabulation. Note that this accuracy is different from the accuracy shown by ‘forest.model’: that estimate comes from bootstrap resampling of data.train, hence it is smaller. Below is the ROC curve.

Instead of using the top-left corner, we may pick the threshold that maximizes ‘sensitivity + specificity’ or some weighted version of it. See help(coords) for the details.
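
For instance, pROC implements the ‘sensitivity + specificity’ criterion (Youden's J statistic) as best.method="youden", and best.weights lets the two be weighted differently. A sketch:

> coords(result.roc, "best", best.method="youden",
+        ret=c("threshold", "sensitivity", "specificity"))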