SMOTE for handling class imbalance

The DMwR package, which accompanies the book Data Mining with R: Learning with Case Studies and is described at length there, provides several interesting functions.

Among them, SMOTE is an easy-to-use function for handling class imbalance. Following the example from the package documentation, first build a small sample data set:

> library(DMwR)
> data(iris)
> data <- iris[, c(1, 2, 5)]
> data$Species <- factor(ifelse(data$Species == "setosa", "rare", "common"))
> head(data)
  Sepal.Length Sepal.Width Species
1          5.1         3.5    rare
2          4.9         3.0    rare
3          4.7         3.2    rare
4          4.6         3.1    rare
5          5.0         3.6    rare
6          5.4         3.9    rare
> table(data$Species)

common   rare 
   100     50

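Then we generate a new data set by 1) adding synthetic examples for the minority class, interpolated between each case and its k nearest neighbours, and 2) under-sampling the majority class. With perc.over=600, six synthetic cases are created for every original rare case (300 new ones, 350 in total); with perc.under=100, one common case is sampled for every synthetic case (300):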

> newData <- SMOTE(Species ~., data, perc.over=600, perc.under=100)
> table(newData$Species)

common   rare 
   300    350
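
To make the over-sampling step more concrete, here is a minimal base-R sketch of the k-nn interpolation idea for numeric columns only. The helper name smote_sketch and its arguments are hypothetical and written just for this illustration; this is not DMwR's code, and the real SMOTE function may differ in details such as the handling of nominal attributes. The value k = 5 is chosen to match the default of DMwR's SMOTE.

```r
# Illustrative sketch only: smote_sketch() is a hypothetical helper,
# not part of DMwR. It handles numeric columns and Euclidean distance only.
smote_sketch <- function(X, n_new_per_case = 6, k = 5) {
  X <- as.matrix(X)
  d <- as.matrix(dist(X))   # pairwise Euclidean distances between cases
  diag(d) <- Inf            # a case must not count as its own neighbour
  synthetic <- vector("list", nrow(X) * n_new_per_case)
  idx <- 1
  for (i in seq_len(nrow(X))) {
    nn <- order(d[i, ])[seq_len(k)]   # indices of the k nearest neighbours
    for (j in seq_len(n_new_per_case)) {
      nb  <- nn[sample.int(k, 1)]     # pick one of those neighbours at random
      gap <- runif(1)                 # random position between case and neighbour
      synthetic[[idx]] <- X[i, ] + gap * (X[nb, ] - X[i, ])
      idx <- idx + 1
    }
  }
  do.call(rbind, synthetic)           # one row per synthetic case
}

set.seed(1)
rare <- iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")]
new_rare <- smote_sketch(rare, n_new_per_case = 6, k = 5)
dim(new_rare)   # 300 rows (50 cases x 6), 2 columns
```

Each synthetic point lies on the line segment between an original rare case and one of its neighbours, so the new cases stay inside the region already occupied by the minority class rather than being exact duplicates.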