Automatic model selection

The leaps package provides the regsubsets function, which automatically finds the best model for each model size.

> install.packages("leaps")
> library(leaps)

Here’s an example from ?regsubsets.

> a <- regsubsets(as.matrix(swiss[,-1]), swiss[,1])
> a.summary <- summary(a)
> a.summary
Subset selection object
5 Variables  (and intercept)
                 Forced in Forced out
Agriculture          FALSE      FALSE
Examination          FALSE      FALSE
Education            FALSE      FALSE
Catholic             FALSE      FALSE
Infant.Mortality     FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: exhaustive
         Agriculture Examination Education Catholic Infant.Mortality
1  ( 1 ) " "         " "         "*"       " "      " "             
2  ( 1 ) " "         " "         "*"       "*"      " "             
3  ( 1 ) " "         " "         "*"       "*"      "*"             
4  ( 1 ) "*"         " "         "*"       "*"      "*"             
5  ( 1 ) "*"         "*"         "*"       "*"      "*"

Let’s find the best model using adjusted R-squared.

> a.summary$adjr2
[1] 0.4281849 0.5551665 0.6390004 0.6707140 0.6709710
> a.summary$which[which.max(a.summary$adjr2), ]
     (Intercept)      Agriculture      Examination        Education 
            TRUE             TRUE             TRUE             TRUE 
        Catholic Infant.Mortality 
            TRUE             TRUE 
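
Adjusted R-squared is only one of the criteria that the summary object reports. As a rough sketch, the following puts adjusted R-squared, Mallows' Cp and BIC side by side for each model size, so one can see whether they agree on the best size. The data frame name criteria is my own choice for illustration and is not part of leaps.

```r
library(leaps)

# Same exhaustive search as above: predictors as a matrix, Fertility as response
a <- regsubsets(as.matrix(swiss[, -1]), swiss[, 1])
a.summary <- summary(a)

# One row per model size; "criteria" is an illustrative name
criteria <- data.frame(
  size  = seq_along(a.summary$adjr2),
  adjr2 = a.summary$adjr2,  # larger is better
  cp    = a.summary$cp,     # smaller is better
  bic   = a.summary$bic     # smaller is better
)
print(criteria)

# Model size preferred by each criterion
which.max(criteria$adjr2)
which.min(criteria$cp)
which.min(criteria$bic)
```

The criteria penalize model size differently, so they need not pick the same model.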

Build a model using the selected variables.

> b <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, swiss)
> summary(b)

Call:
lm(formula = Fertility ~ Agriculture + Examination + Education + 
    Catholic + Infant.Mortality, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2743  -5.2617   0.5032   4.1198  15.3213 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
Examination      -0.25801    0.25388  -1.016  0.31546    
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 ** 
Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067,	Adjusted R-squared: 0.671 
F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10 
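
Typing the selected variables by hand is fine for five predictors, but it becomes error-prone with many. A minimal sketch of building the same formula from the selection row follows; the names sel, vars, f and b2 are introduced here for illustration only.

```r
library(leaps)

a <- regsubsets(as.matrix(swiss[, -1]), swiss[, 1])
a.summary <- summary(a)

# Logical row marking which terms are in the model with the highest adjusted R-squared
sel <- a.summary$which[which.max(a.summary$adjr2), ]

# Keep the selected predictor names, dropping the intercept column
vars <- names(sel)[sel & names(sel) != "(Intercept)"]

# Build Fertility ~ <selected predictors> and fit it
f <- reformulate(vars, response = "Fertility")
b2 <- lm(f, data = swiss)
summary(b2)
```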

When doing model selection, rather than applying automatic methods blindly, one should consider whether the selected model makes sense given prior knowledge and whether it can be explained reasonably. Statistics such as Cp or adjusted R-squared are computed on the same data used for fitting, so they do not directly measure prediction accuracy on unseen data. It is therefore better to hold out a test set for evaluating candidate models.
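
One way to act on this advice is sketched below, under assumptions of my own: a single random split with 12 held-out rows, a fixed seed, and test RMSE as the measure. None of these choices come from the leaps documentation, and with only 47 observations in swiss the result will vary noticeably from split to split.

```r
library(leaps)

set.seed(1)                                # arbitrary seed, for reproducibility only
test.idx <- sample(nrow(swiss), size = 12) # arbitrary hold-out size
train <- swiss[-test.idx, ]
test  <- swiss[test.idx, ]

# Best subset of each size, chosen on the training rows only
fit <- regsubsets(Fertility ~ ., data = train)
n.sizes <- length(summary(fit)$adjr2)

# Design matrix for the held-out rows, with an "(Intercept)" column
X.test <- model.matrix(Fertility ~ ., data = test)

# Test RMSE of the best model of each size
test.rmse <- sapply(seq_len(n.sizes), function(k) {
  beta <- coef(fit, id = k)                               # coefficients of the size-k model
  pred <- X.test[, names(beta), drop = FALSE] %*% beta    # predictions on held-out rows
  sqrt(mean((test$Fertility - pred)^2))
})

test.rmse
which.min(test.rmse)   # model size with the lowest held-out error for this split
```

Repeating the split several times, or using cross-validation, would give a steadier picture than a single split.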

References:
Variable selection and model building by Tiejun (Ty) Tong
Leaps package
Julian J. Faraway, Linear Models with R, Chapman & Hall/CRC