Random forest for variable selection

The randomForest package provides importance() to extract variable importance measures from a fitted forest.

The example in the reference manual looks like this (the forest is fit with importance=TRUE so that the permutation-based measure is computed):

> library(randomForest)
> data(mtcars)
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000, keep.forest=FALSE, importance=TRUE)
> importance(mtcars.rf)
       %IncMSE IncNodePurity
cyl  16.050788     171.09822
disp 18.868236     232.56372
hp   17.031602     198.29501
drat  7.728328      64.23068
wt   18.595598     260.77604
qsec  5.607246      33.88488
vs    5.124934      26.49292
am    3.938463      13.72707
gear  4.482608      18.85271
carb  7.823431      33.94279
> importance(mtcars.rf, type=1)
       %IncMSE
cyl  16.050788
disp 18.868236
hp   17.031602
drat  7.728328
wt   18.595598
qsec  5.607246
vs    5.124934
am    3.938463
gear  4.482608
carb  7.823431

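In importance(), type=1 is the permutation measure. For each tree, the prediction error on the out-of-bag data (MSE for regression) is recorded, then recomputed after randomly permuting the values of one predictor. The difference is averaged over all trees and, by default, divided by its standard deviation. The variable is permuted, not removed from the model. type=2 is the total decrease in node impurity from splits on the variable, averaged over all trees. For regression, impurity is measured by the residual sum of squares.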
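As a rough sketch of how the two measures can be pulled out separately, the snippet below refits the same forest and calls importance() three ways. The seed value and the object names (rf, imp.scaled, imp.raw, imp.purity) are arbitrary choices for this illustration; the numbers will not match the transcript above, which was run without a fixed seed.

```r
library(randomForest)

set.seed(4543)  # arbitrary seed, only so that repeated runs give the same forest
data(mtcars)
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 1000, importance = TRUE)

# Type 1: permutation importance, divided by its standard deviation (default)
imp.scaled <- importance(rf, type = 1)

# Same measure without the scaling step: raw mean increase in out-of-bag MSE
imp.raw <- importance(rf, type = 1, scale = FALSE)

# Type 2: total decrease in node impurity (residual sum of squares here)
imp.purity <- importance(rf, type = 2)

# Each result is a one-column matrix with the predictors as row names
print(cbind(imp.scaled, imp.raw, imp.purity))
```

Comparing the scaled and unscaled columns shows how much the normalization changes the ranking, if at all, for a given forest.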

To visualize the importance measures as dot charts:

> varImpPlot(mtcars.rf)
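varImpPlot() also accepts type and n.var arguments, so the plot can be limited to one measure and to the highest-ranked variables. A minimal sketch, assuming the same model as above; the seed, the choice of five variables, and the plot title are illustrative only:

```r
library(randomForest)

set.seed(4543)  # arbitrary seed for a repeatable forest
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data = mtcars, ntree = 1000,
                          keep.forest = FALSE, importance = TRUE)

# Plot only the permutation measure (type = 1) for the five top-ranked variables
varImpPlot(mtcars.rf, type = 1, n.var = 5,
           main = "mtcars: permutation importance")
```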

To get the three most important variables:

> mtcars.imp <- importance(mtcars.rf, type=1)
> mtcars.imp[order(mtcars.imp, decreasing=TRUE),]
     disp        wt        hp       cyl      carb      drat      qsec        vs 
18.868236 18.595598 17.031602 16.050788  7.823431  7.728328  5.607246  5.124934 
     gear        am 
 4.482608  3.938463 
> names(mtcars.imp[order(mtcars.imp, decreasing=TRUE),])[1:3]
[1] "disp" "wt"   "hp" 

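So the three most important variables by %IncMSE are disp, wt, and hp. Because the forest is grown randomly, the values (and possibly the ranking) change from run to run unless a seed is fixed with set.seed().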
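The ranking step can be wrapped in a small function and fed back into a reduced model. This is only a sketch: top_vars() is a hypothetical helper written for this note, not part of randomForest, and keeping three variables is an arbitrary cutoff rather than a recommended one.

```r
library(randomForest)

set.seed(4543)  # arbitrary seed for a repeatable forest
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data = mtcars, ntree = 1000,
                          keep.forest = FALSE, importance = TRUE)

# Hypothetical helper: names of the n highest-ranked predictors
top_vars <- function(rf, n = 3, type = 1) {
  imp <- importance(rf, type = type)           # one-column matrix
  ranked <- sort(imp[, 1], decreasing = TRUE)  # named numeric vector
  head(names(ranked), n)
}

selected <- top_vars(mtcars.rf, n = 3)
print(selected)

# Refit using only the selected predictors
reduced.formula <- reformulate(selected, response = "mpg")
mtcars.rf.small <- randomForest(reduced.formula, data = mtcars, ntree = 1000)
print(mtcars.rf.small)
```

Printing the reduced forest shows its out-of-bag mean of squared residuals and percentage of variance explained, which can be compared against the full model to see whether dropping the other predictors costs anything.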