Bagged tree imputation for missing values using caret

library(caret)
library(doMC)  # For parallelism.

data(iris)
# 80% for training and 20% for verification.
# createDataPartition takes stratified samples, i.e., it preserves
# the class proportions of Species in both subsets. iris is balanced,
# so each species contributes the same number of rows.
inTrain <- createDataPartition(iris$Species, p=0.8, list=FALSE)
training <- iris[inTrain, ]
verification <- iris[-inTrain, ]

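# Make some data (incl. verification) missing on purpose:
# in about 10% of the rows, set one randomly chosen predictor to NA.
fillInNa <- function(d) {
  naCount <- floor(NROW(d) * 0.1)
  for (i in sample(NROW(d), naCount)) {
    d[i, sample(4, 1)] <- NA  # Columns 1-4 are the numeric predictors.
  }
  return(d)
}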

training <- fillInNa(training)
verification <- fillInNa(verification)

# Impute the missing values with bagged trees: for each predictor,
# preProcess fits a bagged tree model on the other predictors.
# knnImpute is a faster alternative, but it also centers and scales
# the data. Also, note that preProcess is fitted only on training.
# For verification, we reuse the preProc generated from training.
preProc <- preProcess(training[, 1:4], method="bagImpute")
training[, 1:4] <- predict(preProc, training[, 1:4])
verification[, 1:4] <- predict(preProc, verification[, 1:4])

# SVM
# I have a quad-core processor. By default, train tries three
# candidate parameter settings and fits each one on many resamples
# to pick the best. registerDoMC() lets those model fits run in
# parallel across the cores.
registerDoMC(cores=4)
model <- train(training[, 1:4], training[, 5], method="svmRadial")
confusionMatrix(predict(model, verification[, 1:4]), verification[, 5])

Output:

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          0        10

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : 4.857e-15  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333

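Reaching 100% accuracy here was partly luck. The train/verification split, the injected NAs, and the resampling inside train are all random, so other runs can misclassify a sample; a single error out of the 30 verification rows already gives roughly 96-97% accuracy.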
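
Since the results vary from run to run, it can help to fix the random seed while experimenting. The following is a minimal sketch in base R only, not part of the original script: it repeats the fillInNa helper on the full iris data and shows that the same seed puts the NAs in the same cells. The seed value 42 is an arbitrary choice for illustration.

```r
# Standalone sketch using base R only; the seed value 42 is arbitrary.
data(iris)

# Same idea as fillInNa above: blank out one of the four numeric
# columns in about 10% of the rows.
fillInNa <- function(d) {
  naCount <- floor(NROW(d) * 0.1)
  for (i in sample(NROW(d), naCount)) {
    d[i, sample(4, 1)] <- NA
  }
  d
}

set.seed(42)
first <- fillInNa(iris)
set.seed(42)
second <- fillInNa(iris)

# Same seed, same random draws, so the NAs land in the same cells.
print(identical(is.na(first), is.na(second)))  # TRUE

# Missing values per predictor column.
print(colSums(is.na(first[, 1:4])))

# 15 rows are picked without replacement and each loses one value.
print(sum(is.na(first)))  # 15
```

In the full script, the analogous step would be a set.seed() call before createDataPartition and before train. With parallel workers, making the resampling fully repeatable may additionally require the seeds argument of trainControl; that part is an assumption about the setup, not something tested here.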