The k-nearest neighbor (kNN) algorithm is a non-parametric classification method: a test point is classified by looking at its nearest neighbors in the training data. The idea is simple, and we can simply use knn() in the 'class' package. But one thing to remember is that we should normalize the data before measuring distances. For example, suppose we measured variable x in cm and variable y in mm. Then the same physical difference of 1 cm appears as 1 unit in x but as 10 units in y, so y would dominate the distance, which is not desirable.
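To make this concrete, here is a toy sketch (the numbers are hypothetical, just for illustration): the same pair of points gives very different Euclidean distances depending on whether y is recorded in cm or mm.

a_cm <- c(x = 1, y = 0.5)    # y measured in cm
b_cm <- c(x = 2, y = 0.7)
a_mm <- c(x = 1, y = 5)      # same points, y now in mm
b_mm <- c(x = 2, y = 7)
sqrt(sum((a_cm - b_cm)^2))   # 1.02
sqrt(sum((a_mm - b_mm)^2))   # 2.24 -- y now dominates the distance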
Thankfully, the DMwR package has an implementation that handles this:
> install.packages("DMwR")
> library(DMwR)
> kNN
function (form, train, test, norm = T, norm.stats = NULL, ...)
{
    require(class, quietly = TRUE)
    tgtCol <- which(colnames(train) == as.character(form[[2]]))
    if (norm) {
        if (is.null(norm.stats))
            tmp <- scale(train[, -tgtCol], center = T, scale = T)    # (1)
        else
            tmp <- scale(train[, -tgtCol], center = norm.stats[[1]],
                scale = norm.stats[[2]])
        train[, -tgtCol] <- tmp
        ms <- attr(tmp, "scaled:center")
        ss <- attr(tmp, "scaled:scale")
        test[, -tgtCol] <- scale(test[, -tgtCol], center = ms,
            scale = ss)                                              # (2)
    }
    knn(train[, -tgtCol], test[, -tgtCol], train[, tgtCol], ...)
}
Looking at the code: if norm.stats is not given, kNN scales the training data at (1). It then stores the scaling factors (ms and ss) and applies them to the test data at (2). This way, the data are normalized using statistics computed from the training data only. See ?kNN for an example using iris.
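For reference, a minimal usage sketch along the lines of the iris example in ?kNN (the 100-row split and k = 3 are my own choices, not from the help page):

library(DMwR)
set.seed(1)
idx <- sample(1:nrow(iris), 100)
tr <- iris[idx, ]
ts <- iris[-idx, ]
preds <- kNN(Species ~ ., tr, ts, norm = TRUE, k = 3)  # k is passed on to class::knn via "..."
table(preds, ts$Species)                               # confusion matrix on the test set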
One drawback of kNN() is that scale() assumes the features are independent, since it standardizes each feature separately. In other words, if the data are spread in, say, two-dimensional space as a circle, scale() is fine. But if the data are spread like an ellipse rotated 45 degrees, the assumption does not hold. In such cases, one may use PCA to rotate the data first, as sketched below.
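A minimal sketch of that idea (this workflow is my own, not part of DMwR): fit prcomp() on the training features only, then apply the same rotation to the test features before calling knn().

library(class)
set.seed(1)
idx  <- sample(1:nrow(iris), 100)
tr_x <- iris[idx, 1:4];  tr_y <- iris$Species[idx]
ts_x <- iris[-idx, 1:4]; ts_y <- iris$Species[-idx]
pca  <- prcomp(tr_x, center = TRUE, scale. = TRUE)  # rotation learned from training data only
tr_p <- predict(pca, tr_x)                          # rotated training features
ts_p <- predict(pca, ts_x)                          # same centering/scaling/rotation applied to test data
preds <- knn(tr_p, ts_p, tr_y, k = 3)
mean(preds == ts_y)                                 # test-set accuracy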
Reference: Luis Torgo, Data Mining with R, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series.