## Random Forest in R

Random Forest is an ensemble classifier built from many decision trees.

Wikipedia has a nice explanation of the learning algorithm. The key idea is bagging: m models are built, each fit to n’ samples drawn with replacement from the data of size n. In addition, when splitting a decision tree node, the best split is chosen from a random subset of the features rather than from all of them. Each decision tree is fully grown (no pruning).
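To make the bootstrap step concrete, here is a minimal base-R sketch (variable names and the seed are my own) of drawing one bagged sample from iris:

```r
# A sketch of the bagging resample step, using base R only.
# Each tree gets n rows drawn from the data *with replacement*.
set.seed(7)                               # hypothetical seed, for reproducibility
n <- nrow(iris)
boot_idx <- sample(n, size = n, replace = TRUE)

# Rows never drawn are "out-of-bag" (OOB) for this tree -- roughly a
# third of the data -- and can score the tree without a separate holdout.
oob <- setdiff(seq_len(n), boot_idx)
length(oob) / n
```

This out-of-bag fraction is what the OOB error estimates shown later are computed from.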

When making a prediction, the mode of all the trees’ predictions (for classification) or their average (for regression) is used.
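A toy sketch of the classification case, taking the per-row mode over a hand-made matrix of tree votes (the matrix and class labels are made up for illustration):

```r
# Rows = observations, columns = individual trees' predicted classes.
votes <- matrix(c("a", "a", "b",
                  "b", "b", "b"),
                nrow = 2, byrow = TRUE)

# Majority vote per observation: the most frequent class wins.
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
majority  # "a" "b"
```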

R provides this via the randomForest package.

Here, we’ll use the iris dataset.

```
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
```

Prepare training and test set.

```
> test = iris[ c(1:10, 51:60, 101:110), ]
> train = iris[ c(11:50, 61:100, 111:150), ]
```
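The fixed index ranges above hold out the first 10 rows of each species. An alternative (my own sketch, not from the original) is a random 80/20 split with sample():

```r
# Random 80/20 split instead of fixed row ranges; the seed is hypothetical.
set.seed(42)
idx    <- sample(nrow(iris), size = 0.8 * nrow(iris))
train2 <- iris[idx, ]            # 120 rows
test2  <- iris[-idx, ]           # 30 rows
```

A random split avoids any bias from the ordering of the rows, at the cost of reproducibility unless a seed is set.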

Build a random forest using randomForest().

```
> library(randomForest)
> r = randomForest(Species ~ ., data=train, importance=TRUE, do.trace=100)
ntree      OOB      1      2      3
100:   5.83%  0.00%  7.50% 10.00%
200:   5.83%  0.00% 10.00%  7.50%
300:   5.83%  0.00% 10.00%  7.50%
400:   5.00%  0.00% 10.00%  5.00%
500:   5.83%  0.00% 10.00%  7.50%

> print(r)
Call:
randomForest(formula = Species ~ ., data = train, importance = TRUE,      do.trace = 100)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2

OOB estimate of  error rate: 5.83%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         40          0         0       0.000
versicolor      0         36         4       0.100
virginica       0          3        37       0.075
```
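Since importance=TRUE was passed, the fit also records variable importance measures, which importance() and varImpPlot() from the same package report. A self-contained sketch (the seed is my own, so exact numbers will differ from the run above):

```r
library(randomForest)

set.seed(1)                      # hypothetical seed
train <- iris[c(11:50, 61:100, 111:150), ]
r <- randomForest(Species ~ ., data = train, importance = TRUE)

# Per-variable mean decrease in accuracy and in Gini impurity.
importance(r)

# The same measures as a dot chart.
varImpPlot(r)
```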

We see that predicting setosa works great. Let’s see how it performs on the test set.

```
> iris.predict = predict(r, test)
> iris.predict
1          2          3          4          5          6          7
setosa     setosa     setosa     setosa     setosa     setosa     setosa
8          9         10         51         52         53         54
setosa     setosa     setosa versicolor versicolor versicolor versicolor
55         56         57         58         59         60        101
versicolor versicolor versicolor versicolor versicolor versicolor  virginica
102        103        104        105        106        107        108
virginica  virginica  virginica  virginica  virginica versicolor  virginica
109        110
virginica  virginica
Levels: setosa versicolor virginica

> t = table(observed=test[,'Species'], predict=iris.predict)
> t
            predict
observed     setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          1         9

> prop.table(t, 1)
            predict
observed     setosa versicolor virginica
  setosa        1.0        0.0       0.0
  versicolor    0.0        1.0       0.0
  virginica     0.0        0.1       0.9
```

As you can see, 10% of the virginica samples were predicted as versicolor.
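If the roughly 6% OOB error needs improving, the same package ships tuneRF(), which searches over mtry (the number of variables tried at each split) using the OOB error. A sketch with my own parameter choices:

```r
library(randomForest)

set.seed(1)                      # hypothetical seed
train <- iris[c(11:50, 61:100, 111:150), ]

# Search over mtry, scored by OOB error; stepFactor/improve/ntreeTry
# values here are illustrative, not tuned recommendations.
tuned <- tuneRF(train[, 1:4], train$Species,
                stepFactor = 1.5, improve = 0.01, ntreeTry = 200)
tuned
```

tuneRF() returns the mtry values it tried together with their OOB error, so the best setting can be passed back to randomForest() explicitly.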

References:
- Package ‘randomForest’ (Breiman and Cutler’s random forests for classification and regression): reference manual.
- Andy Liaw and Matthew Wiener, “Classification and Regression by randomForest”, R News 2(3), pp. 18-22: good examples of classification, regression, and clustering (really!), plus some practical advice.
