Random Forest in R

Random Forest is an ensemble classifier of decision trees.

Wikipedia has a nice explanation of the learning algorithm. The key idea is bagging: m models are fit, each trained on n′ samples drawn with replacement from the original data of size n. In addition, when choosing the split at each decision tree node, only a random subset of the features is considered. Each decision tree is fully grown (no pruning).
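As an illustration of the bagging step, here is a minimal base-R sketch (not the actual randomForest internals) of drawing one bootstrap sample:

```r
# Draw one bootstrap sample of the same size as the data (n' = n here).
# Sampling with replacement leaves some rows "out of bag" (OOB).
n <- nrow(iris)
set.seed(1)
boot_idx <- sample(n, n, replace = TRUE)   # indices for one tree's training set
oob_idx  <- setdiff(seq_len(n), boot_idx)  # rows this tree never saw
c(unique_rows = length(unique(boot_idx)), oob_rows = length(oob_idx))
```

On average roughly a third of the rows end up out of bag for each tree, which is what the OOB error estimate printed by randomForest() is computed from.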

When making a prediction, the mode of the trees' predictions (for classification) or their average (for regression) is used.

R has the randomForest package for this.
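Loading it looks like this (the install.packages() line is only needed once; setting a seed makes the bootstrap sampling reproducible):

```r
# install.packages("randomForest")   # one-time install from CRAN
library(randomForest)
set.seed(42)   # reproducible bootstrap samples across runs
```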

Here, we’ll use the iris dataset.

```
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
```

Prepare training and test set.

```
> test = iris[ c(1:10, 51:60, 101:110), ]
> train = iris[ c(11:50, 61:100, 111:150), ]
```
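The fixed indices above keep 40 rows of each species for training and 10 for testing. A random split via sample() is a common alternative (a sketch with the same 120/30 sizes; the variable names here are illustrative):

```r
# Random 30-row hold-out instead of fixed indices.
set.seed(123)
test_idx <- sample(nrow(iris), 30)
test2  <- iris[test_idx, ]    # 30 held-out rows
train2 <- iris[-test_idx, ]   # remaining 120 rows
nrow(train2)
```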

Build a random forest using randomForest().

```
> r = randomForest(Species ~ ., data=train, importance=TRUE, do.trace=100)
ntree      OOB      1      2      3
100:   5.83%  0.00%  7.50% 10.00%
200:   5.83%  0.00% 10.00%  7.50%
300:   5.83%  0.00% 10.00%  7.50%
400:   5.00%  0.00% 10.00%  5.00%
500:   5.83%  0.00% 10.00%  7.50%

> print(r)
Call:
randomForest(formula = Species ~ ., data = train, importance = TRUE,      do.trace = 100)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2

OOB estimate of  error rate: 5.83%
Confusion matrix:
setosa versicolor virginica class.error
setosa         40          0         0       0.000
versicolor      0         36         4       0.100
virginica       0          3        37       0.075
```

We can see that setosa is predicted perfectly. Let’s see how the model does on the test set.

```
> iris.predict = predict(r, test)
> iris.predict
1          2          3          4          5          6          7
setosa     setosa     setosa     setosa     setosa     setosa     setosa
8          9         10         51         52         53         54
setosa     setosa     setosa versicolor versicolor versicolor versicolor
55         56         57         58         59         60        101
versicolor versicolor versicolor versicolor versicolor versicolor  virginica
102        103        104        105        106        107        108
virginica  virginica  virginica  virginica  virginica versicolor  virginica
109        110
virginica  virginica
Levels: setosa versicolor virginica

> t = table(observed=test[,'Species'], predict=iris.predict)
> t
predict
observed     setosa versicolor virginica
setosa         10          0         0
versicolor      0         10         0
virginica       0          1         9

> prop.table(t, 1)
predict
observed     setosa versicolor virginica
setosa        1.0        0.0       0.0
versicolor    0.0        1.0       0.0
virginica     0.0        0.1       0.9
>
```

As you can see, 10% of the virginica samples (one out of ten) were predicted as versicolor.
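The overall test accuracy can be read off the diagonal of that table. Reconstructing it here as a standalone matrix:

```r
# Confusion matrix from the test set above: rows = observed, cols = predicted.
# matrix() fills column-wise, so each c(...) below is one predicted-class column.
cm <- matrix(c(10,  0, 0,    # predicted setosa
                0, 10, 1,    # predicted versicolor
                0,  0, 9),   # predicted virginica
             nrow = 3,
             dimnames = list(observed  = c("setosa", "versicolor", "virginica"),
                             predicted = c("setosa", "versicolor", "virginica")))
accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
accuracy  # 29/30
```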

References)
Package ‘randomForest’ reference manual: Breiman and Cutler’s random forests for classification and regression.
Andy Liaw and Matthew Wiener, “Classification and Regression by randomForest”, R News, pp. 18–22: good examples of classification, regression, and clustering (really!), plus some practical advice.


1. kalyani wrote:

Can we find an R-squared value for a classification problem with random forest? If so, how can it be done in R?

Posted 29 Oct 2012 at 4:48 am
2. Minkoo Seo wrote:

Sorry, I don’t know an answer to that.

But, if you mean regression, a pseudo R-squared is returned by randomForest. See http://cran.r-project.org/web/packages/randomForest/randomForest.pdf

If you mean pseudo R-squared as the improvement in likelihood compared to the null model, sorry, I don’t know how to do that with randomForest().

If you’re considering R-squared as a measure for comparing a linear model (like glm) and a random forest, why don’t you use cross-validation? Doesn’t that work?

Posted 02 Nov 2012 at 2:14 pm
3. Dave Tang wrote:

Hi!

I was trying to look for a vignette for the randomForest package but couldn’t. I’m glad I came across this page and your blog! Thanks for sharing!

Dave

Posted 19 Dec 2012 at 3:35 pm
4. Minkoo Seo wrote:

Glad that it is a useful article!

Posted 20 Dec 2012 at 10:19 am
5. Charles wrote:

Thanks for this! Like Dave above, I am new to Random Forests, and am looking for useful vignettes. This helped!

Feel free to post more ;)

Posted 30 Jun 2013 at 7:02 pm
6. daniel wrote:

Hi, I was wondering how we can use this to see how much of an influence the individual variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) have on the species class.

Posted 07 Nov 2013 at 1:53 pm
7. Minkoo Seo wrote:

Daniel. As you know, you can use functions like varImpPlot() for showing variable importance. You may be able to measure the Gini or accuracy gain for each variable. However, I’m not sure if that’s what you mean by “influence”.

Posted 27 Nov 2013 at 11:57 am
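A minimal sketch of what Minkoo describes, assuming the randomForest package is installed (fitting a fresh forest on the full iris data just for illustration):

```r
library(randomForest)
set.seed(7)
r <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(r)   # per-class importance plus MeanDecreaseAccuracy / MeanDecreaseGini
varImpPlot(r)   # dot chart of the same importance measures
```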
8. Berk wrote:

Thank you very much for this post.

Would you please help me understand the first argument you used when training the random forest (i.e. “Species ~.”). I would be grateful if you would kindly explain what the three terms within the first argument (“Species”, “~”, “.”) stand for?

Kind Regards,
Berk

Posted 10 Apr 2014 at 9:09 am
9. Sanjiv wrote:

Hi Minkoo..

Your post is very helpful, but I have not understood this part: “t = table(observed=test[,’Species’], predict=iris.predict)”

Is it calculating the error in prediction while running the model on the test data set, or is it something else?

Thanks

Posted 27 May 2014 at 7:45 am
10. Minkoo Seo wrote:

Sorry for late response. Sanjiv, yes, I’m calculating error for test data.

Posted 27 Jun 2014 at 3:28 pm
11. Minkoo Seo wrote:

@Berk: “Species ~ .” is a formula. Its format is “lhs ~ rhs”.

The LHS is the target variable to predict.

~ is the symbol that connects the lhs and the rhs.

The RHS here is ‘.’, which means all the variables except the one specified on the lhs.

Posted 27 Jun 2014 at 3:30 pm
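A quick base-R check of this expansion: terms() shows which columns the ‘.’ pulls in from the data frame passed via data=.

```r
# What "Species ~ ." expands to for the iris data frame.
tl <- attr(terms(Species ~ ., data = iris), "term.labels")
tl  # "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
```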
12. sravan wrote:

Hi Minkoo.

Do you have any other use case, for example on retail data, where this algorithm is applied, for better understanding? I am confused about random forest vs. gradient-boosted decision trees.

thanks

Posted 10 Jul 2014 at 1:01 pm
13. Minkoo Seo wrote:

@sravan: Gradient boosting is iterative. It adds trees one by one: each new tree improves performance by correcting the mistakes of the previous trees, and the final output is the sum of the outputs of all trees.

In a random forest, trees are built independently, each on a bootstrap sample of the input data, and the final output is a vote over all trees.

Posted 29 Jul 2014 at 3:54 pm
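A toy numeric contrast of the two combination rules Minkoo describes (the values are made up for illustration, not outputs of fitted trees):

```r
# Boosting: each stage predicts a correction added to the running total.
stage_outputs <- c(0.60, 0.25, 0.10)   # hypothetical per-stage contributions
boosted <- sum(stage_outputs)          # additive combination

# Random forest (regression flavor): independent trees, averaged.
tree_outputs <- c(0.90, 1.00, 0.95)    # hypothetical per-tree predictions
forest <- mean(tree_outputs)           # averaged combination
c(boosted = boosted, forest = forest)
```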