## Automatic model selection

The leaps package provides the regsubsets function, which automatically finds the best model for each model size.

> install.packages("leaps")
> library(leaps)

Here’s an example from ?regsubsets.

> a <- regsubsets(as.matrix(swiss[,-1]), swiss[,1])
> a.summary <- summary(a)
> a.summary
Subset selection object
5 Variables (and intercept)
                 Forced in Forced out
Agriculture          FALSE      FALSE
Examination          FALSE      FALSE
Education            FALSE      FALSE
Catholic             FALSE      FALSE
Infant.Mortality     FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: exhaustive
         Agriculture Examination Education Catholic Infant.Mortality
1  ( 1 ) " "         " "         "*"       " "      " "
2  ( 1 ) " "         " "         "*"       "*"      " "
3  ( 1 ) " "         " "         "*"       "*"      "*"
4  ( 1 ) "*"         " "         "*"       "*"      "*"
5  ( 1 ) "*"         "*"         "*"       "*"      "*"
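The same search can also be written with the formula interface, and regsubsets takes arguments that control how many models are kept per size (nbest) and which search algorithm is used (method). The following is a minimal sketch, assuming the leaps package is installed; the object names a.formula, a.top2 and a.forward are made up for illustration.

```r
library(leaps)

# Formula interface: same exhaustive search as the matrix call above
a.formula <- regsubsets(Fertility ~ ., data = swiss)

# Keep the two best models of each size instead of one
a.top2 <- regsubsets(Fertility ~ ., data = swiss, nbest = 2)

# Forward stepwise search instead of the exhaustive search
a.forward <- regsubsets(Fertility ~ ., data = swiss, method = "forward")

# Logical matrix: one row per candidate model, one column per term
summary(a.formula)$which
summary(a.forward)$which
```

Each row of the which matrix corresponds to one line of the starred table printed by summary().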

Let’s find the best model by adjusted R-squared.

> a.summary$adjr2
[1] 0.4281849 0.5551665 0.6390004 0.6707140 0.6709710
> a.summary$which[which.max(a.summary$adjr2), ]
     (Intercept)      Agriculture      Examination        Education
            TRUE             TRUE             TRUE             TRUE
        Catholic Infant.Mortality
            TRUE             TRUE
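Adjusted R-squared is only one of the criteria stored in the summary object; it also holds Mallows' Cp (cp) and BIC (bic). The sketch below, which assumes leaps is installed and uses the made-up names criteria and best.size, puts them side by side so the preferred model size under each criterion can be compared.

```r
library(leaps)

a <- regsubsets(as.matrix(swiss[, -1]), swiss[, 1])
a.summary <- summary(a)

# One row per model size; adjr2 is maximized, cp and bic are minimized
criteria <- data.frame(
  size  = seq_along(a.summary$adjr2),
  adjr2 = a.summary$adjr2,
  cp    = a.summary$cp,
  bic   = a.summary$bic
)
criteria

# Model size preferred by each criterion
best.size <- c(
  adjr2 = which.max(a.summary$adjr2),
  cp    = which.min(a.summary$cp),
  bic   = which.min(a.summary$bic)
)
best.size
```

The criteria need not agree. BIC penalizes model size more heavily than adjusted R-squared, so it tends to prefer smaller models.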

Build a model using the selected variables.

> b <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, swiss)
> summary(b)
Call:
lm(formula = Fertility ~ Agriculture + Examination + Education +
    Catholic + Infant.Mortality, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max
-15.2743  -5.2617   0.5032   4.1198  15.3213

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *
Examination      -0.25801    0.25388  -1.016  0.31546
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 **
Infant.Mortality  1.07705    0.38172   2.822  0.00734 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared:  0.7067,  Adjusted R-squared:  0.671
F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
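Typing the formula by hand works for five variables, but the selected terms can also be taken from the which matrix directly. The sketch below assumes leaps is installed; the names selected, vars, f and adj.by.hand are illustrative. It builds the formula with reformulate() and then recomputes adjusted R-squared from its usual definition, 1 - (1 - R^2)(n - 1)/(n - p - 1), to check that lm and regsubsets report the same quantity.

```r
library(leaps)

a <- regsubsets(as.matrix(swiss[, -1]), swiss[, 1])
a.summary <- summary(a)

# Names of the terms in the model with the highest adjusted R-squared
selected <- a.summary$which[which.max(a.summary$adjr2), ]
vars <- setdiff(names(selected)[selected], "(Intercept)")

# Build Fertility ~ var1 + var2 + ... and fit it
f <- reformulate(vars, response = "Fertility")
b <- lm(f, data = swiss)

# Adjusted R-squared by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n  <- nrow(swiss)   # number of observations
p  <- length(vars)  # number of predictors, intercept excluded
r2 <- summary(b)$r.squared
adj.by.hand <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both comparisons should report agreement up to numerical tolerance
all.equal(adj.by.hand, summary(b)$adj.r.squared)
all.equal(adj.by.hand, max(a.summary$adjr2))
```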

Automatic methods should not be applied blindly. When selecting a model, check whether it makes sense given prior knowledge and whether you can explain it reasonably. Statistics such as Cp or adjusted R-squared are computed on the data used for fitting and do not directly measure prediction accuracy on unseen data, so it is better to hold out a test set for evaluating models.
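A hold-out evaluation along these lines could look like the sketch below. It is only an illustration: the seed, the 12-row test split, and the names test.idx, train, test, X.test and test.rmse are arbitrary choices, not part of the original example. Subsets are selected on the training rows only, and each model size is then scored by its root mean squared error on the held-out rows.

```r
library(leaps)

set.seed(1)                          # arbitrary seed, for reproducibility only
test.idx <- sample(nrow(swiss), 12)  # hold out 12 of the 47 rows (arbitrary)
train <- swiss[-test.idx, ]
test  <- swiss[test.idx, ]

# Select the best subset of each size using the training rows only
fit <- regsubsets(Fertility ~ ., data = train)
n.sizes <- nrow(summary(fit)$which)

# Predictions are built by hand from coef() and the test design matrix
X.test <- model.matrix(Fertility ~ ., data = test)
test.rmse <- sapply(seq_len(n.sizes), function(k) {
  beta <- coef(fit, id = k)  # coefficients of the best k-variable model
  pred <- X.test[, names(beta), drop = FALSE] %*% beta
  sqrt(mean((test$Fertility - pred)^2))
})

test.rmse
which.min(test.rmse)  # model size with the lowest held-out error
```

With only 47 observations a single split is noisy, so the preferred size may change with the seed. Repeating the split or using cross-validation would give a steadier picture.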

References:

Variable selection and model building by Tiejun (Ty) Tong

Leaps package

Julian J. Faraway, Linear Models with R, Chapman & Hall/CRC
