library(caret)
library(doMC)  # For parallelism.

data(iris)

# 80% for training and 20% for verification.
# createDataPartition takes stratified samples, i.e., it takes
# an equal number of samples from each Species.
inTrain <- createDataPartition(iris$Species, p=0.8, list=FALSE)
training <- iris[inTrain, ]
verification <- iris[-inTrain, ]

# Make some data (incl. verification) missing on purpose.
fillInNa <- function(d) {
  naCount <- NROW(d) * 0.1
  for (i in sample(NROW(d), naCount)) {
    d[i, sample(4, 1)] <- NA
  }
  return(d)
}

training <- fillInNa(training)
verification <- fillInNa(verification)

# Because we have missing values across all columns, we need to
# use bagged trees. If only one column had NAs, we could use
# knnImpute, which is faster. Also note that preProcess is fit
# only on the training data; for verification, we reuse the
# preProc object generated from training.
preProc <- preProcess(training[, 1:4], method="bagImpute")
training[, 1:4] <- predict(preProc, training[, 1:4])
verification[, 1:4] <- predict(preProc, verification[, 1:4])

# SVM
# I have a quad-core processor, and train builds three models with
# different tuning parameters in order to pick the best one.
# registerDoMC() makes this parallel, i.e., the three models are
# built at the same time.
registerDoMC(cores=4)
model <- train(training[, 1:4], training[, 5], method="svmRadial")
confusionMatrix(predict(model, verification[, 1:4]), verification[, 5])
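As an aside on the imputation comment above: if only a single predictor column had NAs, the same preProcess call could use the faster knnImpute method instead. A minimal sketch of that variant (one caveat I'm aware of: caret centers and scales the predictors whenever knnImpute is used, so the imputed columns come back standardized):

# Hypothetical variant: only one column has NAs, so the faster
# knnImpute method suffices. caret adds centering and scaling as
# part of knnImpute, so the returned values are standardized.
preProcKnn <- preProcess(training[, 1:4], method="knnImpute")
training[, 1:4] <- predict(preProcKnn, training[, 1:4])
verification[, 1:4] <- predict(preProcKnn, verification[, 1:4])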
Output of the script above:
            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          0        10

Overall Statistics

               Accuracy : 1
                 95% CI : (0.8843, 1)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : 4.857e-15

                  Kappa : 1
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
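The "three models" mentioned in the script comment come from train's default tuning grid: for method="svmRadial", the kernel width sigma is estimated once and three values of the cost parameter C are tried, with the winner chosen by resampled accuracy. If you want to check this yourself, the candidates can be inspected on the model object from above:

# Resampled accuracy for each candidate value of C that train tried.
print(model$results)
# The parameter combination that train selected.
print(model$bestTune)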
Reaching 100% accuracy here was just luck: the train/verification split and the injected NAs are random, so running this multiple times may show 96% accuracy, too.
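To get a feel for that variability, one option is to wrap the whole split/impute/train cycle in a helper and repeat it. This is just a sketch (runOnce is my own wrapper around the steps of the script above, and it assumes fillInNa is already defined); set.seed makes the splits reproducible, though the parallel model fits may still introduce some randomness:

# Repeat the random split, NA injection, imputation, and model fit
# a few times, collecting the verification accuracy of each run.
runOnce <- function() {
  inTrain <- createDataPartition(iris$Species, p=0.8, list=FALSE)
  training <- fillInNa(iris[inTrain, ])
  verification <- fillInNa(iris[-inTrain, ])
  preProc <- preProcess(training[, 1:4], method="bagImpute")
  training[, 1:4] <- predict(preProc, training[, 1:4])
  verification[, 1:4] <- predict(preProc, verification[, 1:4])
  model <- train(training[, 1:4], training[, 5], method="svmRadial")
  cm <- confusionMatrix(predict(model, verification[, 1:4]),
                        verification[, 5])
  cm$overall["Accuracy"]
}

set.seed(1)  # Fix the seed so the splits are reproducible.
accuracies <- replicate(5, runOnce())
summary(accuracies)  # Shows the spread of verification accuracies.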