Random Forest in R

Random Forest is an ensemble classifier of decision trees.

Wikipedia has a nice explanation of the learning algorithm. The key idea is bagging: m models are built, each fit to n’ samples drawn with replacement from data of size n. In addition, at each decision tree node the split is chosen from a random subset of the features rather than from all of them. Each decision tree is fully grown (no pruning).
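To make the two sources of randomness concrete, here is a minimal base-R sketch. It is not how the randomForest package is implemented internally; it only illustrates the sampling steps. It assumes n’ = n (the usual bootstrap choice), and the variable names (in_bag, out_of_bag, candidates) are made up for illustration.

```r
# Sketch of the two random choices made for a single tree / node split.
set.seed(1)                      # arbitrary seed, only for reproducibility

n <- nrow(iris)                  # data of size n
p <- ncol(iris) - 1              # number of predictors (Species is the response)

# Bagging: draw n' = n row indices with replacement for one tree.
in_bag <- sample(n, size = n, replace = TRUE)
out_of_bag <- setdiff(seq_len(n), in_bag)   # rows this tree never sees

# Random feature subset: the candidates considered at one node split.
mtry <- floor(sqrt(p))           # randomForest's default for classification
candidates <- sample(names(iris)[1:p], size = mtry)

length(in_bag)                   # 150
length(out_of_bag) / n           # typically around one third
candidates                       # two of the four measurement columns
```

The rows left out of a tree's bootstrap sample are what the "OOB" (out-of-bag) error reported later is computed on.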

When making a prediction, the mode of the predictions of all trees (for classification) or their average (for regression) is used.
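As a small illustration of this aggregation step, the sketch below combines the outputs of five hypothetical trees. The vote and output values are invented for the example and do not come from a fitted model.

```r
# Hypothetical predictions from five trees for one flower (made-up values).
votes <- c("versicolor", "virginica", "versicolor", "versicolor", "virginica")

# Classification: take the mode, i.e. the most frequent class.
majority <- names(which.max(table(votes)))
majority                 # "versicolor"

# Regression: average the numeric predictions (made-up values).
tree_outputs <- c(1.2, 1.5, 1.4, 1.3, 1.6)
average <- mean(tree_outputs)
average                  # 1.4
```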

R has a randomForest package for it.

Here, we’ll use the built-in iris data set.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

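Prepare the training and test sets. The first 10 rows of each species go to the test set, and the remaining 40 rows per species go to the training set.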

> test = iris[ c(1:10, 51:60, 101:110), ]
> train = iris[ c(11:50, 61:100, 111:150), ]
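The split above can be sanity-checked with a few lines of base R. The sketch below rebuilds the same split from a single index vector and confirms the class balance. The second half shows a random stratified split as an alternative; that part is an assumption for illustration and is not what this post uses, and the names test_idx, rand_idx, rand_test and rand_train are hypothetical.

```r
# The fixed split from the text: first 10 rows of each species go to test.
test_idx <- c(1:10, 51:60, 101:110)
test  <- iris[test_idx, ]
train <- iris[-test_idx, ]       # same rows as c(11:50, 61:100, 111:150)

table(train$Species)             # 40 rows per species
table(test$Species)              # 10 rows per species

# Alternative (not used in this post): a random split, stratified by species.
set.seed(42)                     # arbitrary seed
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
rand_idx <- unlist(lapply(rows_by_species, function(rows) sample(rows, 10)))
rand_test  <- iris[rand_idx, ]
rand_train <- iris[-rand_idx, ]
```

A random split avoids any dependence on the row order of the data, at the cost of results that change with the seed.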

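Load the package and build a random forest using randomForest(). With do.trace=100, the out-of-bag (OOB) error is printed every 100 trees; columns 1 to 3 are the per-class errors for setosa, versicolor and virginica. No seed is set, so the exact numbers vary slightly from run to run.

> library(randomForest)
> r = randomForest(Species ~., data=train, importance=TRUE, do.trace=100)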
ntree      OOB      1      2      3
  100:   5.83%  0.00%  7.50% 10.00%
  200:   5.83%  0.00% 10.00%  7.50%
  300:   5.83%  0.00% 10.00%  7.50%
  400:   5.00%  0.00% 10.00%  5.00%
  500:   5.83%  0.00% 10.00%  7.50%

> print(r)
Call:
 randomForest(formula = Species ~ ., data = train, importance = TRUE,      do.trace = 100) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.83%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         40          0         0       0.000
versicolor      0         36         4       0.100
virginica       0          3        37       0.075
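The class.error column and the OOB error rate in this output follow directly from the confusion matrix counts. The sketch below re-enters the counts by hand (copied from the output above) and reproduces the arithmetic; the names conf, class_error and oob_error are made up for the example.

```r
# Confusion matrix copied from the print(r) output (rows = observed class).
classes <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(40,  0,  0,
                  0, 36,  4,
                  0,  3, 37),
               nrow = 3, byrow = TRUE,
               dimnames = list(observed = classes, predicted = classes))

# Per-class error: misclassified share of each observed class.
class_error <- 1 - diag(conf) / rowSums(conf)
class_error                      # 0.000, 0.100, 0.075

# Overall OOB error: off-diagonal counts divided by all samples (7 / 120).
oob_error <- 1 - sum(diag(conf)) / sum(conf)
round(100 * oob_error, 2)        # 5.83
```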

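Setosa is classified without any OOB errors, while versicolor and virginica are occasionally confused with each other (class errors of 10% and 7.5%). Let’s see how the model does on the test set.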

> iris.predict = predict(r, test)
> iris.predict
         1          2          3          4          5          6          7 
    setosa     setosa     setosa     setosa     setosa     setosa     setosa 
         8          9         10         51         52         53         54 
    setosa     setosa     setosa versicolor versicolor versicolor versicolor 
        55         56         57         58         59         60        101 
versicolor versicolor versicolor versicolor versicolor versicolor  virginica 
       102        103        104        105        106        107        108 
 virginica  virginica  virginica  virginica  virginica versicolor  virginica 
       109        110 
 virginica  virginica 
Levels: setosa versicolor virginica

> t = table(observed=test[,'Species'], predict=iris.predict)
> t
            predict
observed     setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          1         9

> prop.table(t, 1)
            predict
observed     setosa versicolor virginica
  setosa        1.0        0.0       0.0
  versicolor    0.0        1.0       0.0
  virginica     0.0        0.1       0.9

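As you can see, one of the ten virginica samples in the test set (row 107) was predicted as versicolor, which is 10% of that class. All setosa and versicolor samples were predicted correctly.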
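The same table also gives the overall test accuracy. The sketch below re-enters the counts by hand so that it runs on its own; the name tab is hypothetical and is used instead of t to avoid shadowing R's transpose function. In the session above, the same expression could be applied to the table t directly.

```r
# Test-set confusion table copied from the output above (rows = observed).
classes <- c("setosa", "versicolor", "virginica")
tab <- matrix(c(10,  0, 0,
                 0, 10, 0,
                 0,  1, 9),
              nrow = 3, byrow = TRUE,
              dimnames = list(observed = classes, predict = classes))

# Accuracy: correctly predicted samples (the diagonal) over all samples.
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)               # 0.967, i.e. 29 of 30 correct
```

With only 10 test samples per class, each misclassified sample moves the per-class rate by 10 percentage points, so these figures should be read as rough estimates.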

References
Package ‘randomForest’, Breiman and Cutler’s random forests for classification and regression: reference manual.
Andy Liaw and Matthew Wiener, Classification and Regression by randomForest, R News, pp. 18-22: good examples of classification, regression and clustering (really!), and some practical advice.