## Automatic model selection

The leaps package provides the regsubsets function, which automatically finds the best model for each model size.

```
> install.packages("leaps")
> library(leaps)
```

Here’s an example from ?regsubsets.

```
> a <- regsubsets(as.matrix(swiss[,-1]), swiss[,1])
> a.summary <- summary(a)
> a.summary
Subset selection object
5 Variables  (and intercept)
                 Forced in Forced out
Agriculture          FALSE      FALSE
Examination          FALSE      FALSE
Education            FALSE      FALSE
Catholic             FALSE      FALSE
Infant.Mortality     FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: exhaustive
         Agriculture Examination Education Catholic Infant.Mortality
1  ( 1 ) " "         " "         "*"       " "      " "
2  ( 1 ) " "         " "         "*"       "*"      " "
3  ( 1 ) " "         " "         "*"       "*"      "*"
4  ( 1 ) "*"         " "         "*"       "*"      "*"
5  ( 1 ) "*"         "*"         "*"       "*"      "*"
```
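To make concrete what "best model for each size" means, here is a minimal base-R sketch that brute-forces the same search: for every subset size it fits each candidate with lm and keeps the one with the smallest residual sum of squares. The names predictors and best_by_size are made up for this illustration, and this is not how leaps works internally (it uses a much more efficient search); it is only meant to show the idea on a data set small enough to enumerate.

```
# Brute-force best-subset search in base R (illustration only).
predictors <- setdiff(names(swiss), "Fertility")

best_by_size <- lapply(seq_along(predictors), function(k) {
  # Every subset of k predictors, as a list of character vectors
  subsets <- combn(predictors, k, simplify = FALSE)
  # Residual sum of squares of the least-squares fit for each subset
  rss <- sapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = "Fertility"), data = swiss)
    sum(resid(fit)^2)
  })
  # Keep the subset with the smallest RSS for this size
  subsets[[which.min(rss)]]
})

best_by_size
```

With only 5 predictors there are 31 non-empty subsets, so full enumeration is cheap; the count doubles with every added predictor, which is why a dedicated package is preferable for larger problems.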

Let’s find the best model using adjusted R-squared.

```
> a.summary$adjr2
[1] 0.4281849 0.5551665 0.6390004 0.6707140 0.6709710
> a.summary$which[which.max(a.summary$adjr2), ]
     (Intercept)      Agriculture      Examination        Education
            TRUE             TRUE             TRUE             TRUE
        Catholic Infant.Mortality
            TRUE             TRUE
```
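Adjusted R-squared penalizes R-squared for the number of predictors: with n observations and p predictors it is 1 - (1 - R²)(n - 1)/(n - p - 1). The following sketch recomputes it by hand for the five-variable model in base R and compares it with the value lm reports; the variable names (full, r2, adj_r2) are only for illustration.

```
# Recompute adjusted R-squared by hand for the full model.
full <- lm(Fertility ~ ., data = swiss)

r2 <- summary(full)$r.squared
n  <- nrow(swiss)               # number of observations
p  <- length(coef(full)) - 1    # number of predictors, excluding the intercept

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_r2                          # hand-computed value
summary(full)$adj.r.squared     # value reported by lm
```

The two numbers should agree, and should also agree with the last entry of the adjusted R-squared vector above.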

Build a linear model using the selected variables.

```
> b <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, swiss)
> summary(b)

Call:
lm(formula = Fertility ~ Agriculture + Examination + Education +
    Catholic + Infant.Mortality, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max
-15.2743  -5.2617   0.5032   4.1198  15.3213

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *
Examination      -0.25801    0.25388  -1.016  0.31546
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 **
Infant.Mortality  1.07705    0.38172   2.822  0.00734 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
```

When doing model selection, don’t apply automatic methods blindly. Consider whether the model makes sense given prior knowledge and whether you can reasonably explain it. Statistics like Cp or adjusted R-squared are computed from the same data used to fit the models, so they do not directly tell us about prediction accuracy on unseen data. It is therefore better to hold out a test set for evaluating the models.
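As a rough sketch of the hold-out idea, the code below splits swiss into a training and a test set, fits the five-variable model and the four-variable model (without Examination) on the training rows only, and compares their root mean squared error on the test rows. The 70/30 split, the seed, and the helper rmse are arbitrary choices made for this illustration; with only 47 observations the result depends heavily on the split, so treat it as a demonstration of the procedure rather than evidence for either model.

```
set.seed(1)                                   # arbitrary seed, for a reproducible split
n <- nrow(swiss)
test_idx <- sample(n, size = round(0.3 * n))  # hold out roughly 30% of the rows

train <- swiss[-test_idx, ]
test  <- swiss[test_idx, ]

# Fit both candidate models on the training rows only
full    <- lm(Fertility ~ ., data = train)
reduced <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality,
              data = train)

# Root mean squared prediction error on new data
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$Fertility - predict(fit, newdata))^2))
}

test_rmse <- c(full = rmse(full, test), reduced = rmse(reduced, test))
test_rmse
```

For a data set this small, repeating the split several times or using cross-validation would give a more stable comparison.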

References:
Variable selection and model building by Tiejun (Ty) Tong
Leaps package
Julian J. Faraway, Linear Models with R, Chapman & Hall/CRC
