Random Forest is an ensemble classifier built from many decision trees.
Wikipedia has a nice explanation of the learning algorithm. The key idea is bagging: m models are built, each fit on n' samples drawn with replacement from the data of size n. In addition, at each decision tree node split, the best split is chosen from a random subset of the features. Each decision tree is fully grown (no pruning).
When making a prediction, the mode of the trees' predictions (for classification) or their average (for regression) is used.
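To make the bagging idea concrete, here is a minimal sketch of bagging by hand, using the rpart package for the individual trees (rpart and the ensemble size of 25 are just illustrative choices); randomForest does this internally and additionally restricts each split to a random feature subset:

library(rpart)

n <- nrow(iris)
m <- 25                                # illustrative number of trees
trees <- lapply(1:m, function(i) {
  idx <- sample(n, n, replace = TRUE)  # bootstrap sample of size n
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Predict by majority vote -- the "mode" of the individual trees' predictions
votes <- sapply(trees, function(fit) as.character(predict(fit, iris, type = "class")))
majority <- apply(votes, 1, function(v) names(which.max(table(v))))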
R provides the randomForest package for this.
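If you don't have it yet, install and load it first:

> install.packages("randomForest")   # only needed once
> library(randomForest)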
Here, we’ll use iris.
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
Prepare the training and test sets.
> test = iris[ c(1:10, 51:60, 101:110), ]
> train = iris[ c(11:50, 61:100, 111:150), ]
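Random forests involve randomness, so results vary slightly between runs. If you want your own runs to be reproducible, set a seed first (your numbers may still differ from the output shown below):

> set.seed(42)   # any fixed seed works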
Build a random forest using randomForest().
> r = randomForest(Species ~ ., data=train, importance=TRUE, do.trace=100)
ntree      OOB      1      2      3
  100:   5.83%  0.00%  7.50% 10.00%
  200:   5.83%  0.00% 10.00%  7.50%
  300:   5.83%  0.00% 10.00%  7.50%
  400:   5.00%  0.00% 10.00%  5.00%
  500:   5.83%  0.00% 10.00%  7.50%
> print(r)

Call:
 randomForest(formula = Species ~ ., data = train, importance = TRUE, do.trace = 100)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.83%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         40          0         0       0.000
versicolor      0         36         4       0.100
virginica       0          3        37       0.075
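Since we passed importance=TRUE, we can also check which variables the forest relies on: importance(r) prints the importance measures and varImpPlot(r) plots them (on iris, the petal measurements typically come out on top).

> importance(r)
> varImpPlot(r)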
We see that predicting setosa works great. Let's see how it performs on the test set.
> iris.predict = predict(r, test)
> iris.predict
         1          2          3          4          5          6          7
    setosa     setosa     setosa     setosa     setosa     setosa     setosa
         8          9         10         51         52         53         54
    setosa     setosa     setosa versicolor versicolor versicolor versicolor
        55         56         57         58         59         60        101
versicolor versicolor versicolor versicolor versicolor versicolor  virginica
       102        103        104        105        106        107        108
 virginica  virginica  virginica  virginica  virginica versicolor  virginica
       109        110
 virginica  virginica
Levels: setosa versicolor virginica
> t = table(observed=test[,'Species'], predict=iris.predict)
> t
            predict
observed     setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          1         9
> prop.table(t, 1)
            predict
observed     setosa versicolor virginica
  setosa        1.0        0.0       0.0
  versicolor    0.0        1.0       0.0
  virginica     0.0        0.1       0.9
As you can see, 10% of the virginica samples (1 of 10) were predicted as versicolor.
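If you prefer a single overall accuracy figure to the per-class table, you can sum the diagonal of the confusion table; for the table above that is 29 of 30 correct:

> sum(diag(t)) / sum(t)
[1] 0.9666667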
References
Package ‘randomForest’: Breiman and Cutler’s Random Forests for Classification and Regression. The package reference manual.
Andy Liaw and Matthew Wiener, Classification and Regression by randomForest, R News 2(3), pp. 18–22, 2002: good examples of classification, regression and clustering (really!), plus some practical advice.