The leaps package provides the regsubsets function, which automatically finds the best model for each model size.
> install.packages("leaps")
> library(leaps)
Here’s an example from ?regsubsets.
> a <- regsubsets(as.matrix(swiss[,-1]), swiss[,1])
> a.summary <- summary(a)
> a.summary
Subset selection object
5 Variables (and intercept)
                 Forced in Forced out
Agriculture          FALSE      FALSE
Examination          FALSE      FALSE
Education            FALSE      FALSE
Catholic             FALSE      FALSE
Infant.Mortality     FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: exhaustive
         Agriculture Examination Education Catholic Infant.Mortality
1  ( 1 ) " "         " "         "*"       " "      " "
2  ( 1 ) " "         " "         "*"       "*"      " "
3  ( 1 ) " "         " "         "*"       "*"      "*"
4  ( 1 ) "*"         " "         "*"       "*"      "*"
5  ( 1 ) "*"         "*"         "*"       "*"      "*"
Let’s find the best model using adjusted R-squared.
> a.summary$adjr
[1] 0.4281849 0.5551665 0.6390004 0.6707140 0.6709710
> a.summary$which[which.max(a.summary$adjr), ]
     (Intercept)      Agriculture      Examination        Education
            TRUE             TRUE             TRUE             TRUE
        Catholic Infant.Mortality
            TRUE             TRUE
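Adjusted R-squared is only one of the criteria stored in the summary object; it also carries Mallows' Cp and BIC in the `cp` and `bic` components. As a rough sketch (the names `fit`, `s`, and `best_size` are introduced here for illustration and are not from the original example), the model size preferred by each criterion can be compared like this:

```r
library(leaps)

# Same exhaustive search as above, on the built-in swiss data.
fit <- regsubsets(as.matrix(swiss[, -1]), swiss[, 1])
s <- summary(fit)

# Model size preferred by each criterion:
# larger adjusted R-squared is better; smaller Cp and BIC are better.
best_size <- c(adjr2 = which.max(s$adjr2),
               cp    = which.min(s$cp),
               bic   = which.min(s$bic))
print(best_size)

# Variables included at the size preferred by BIC.
print(s$which[best_size["bic"], ])
```

The criteria need not agree. BIC penalizes model size more heavily than adjusted R-squared, so it may prefer a smaller model when the larger ones are nearly tied, as the 4- and 5-variable models are here.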
Build a model using the selected variables.
> b <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, swiss)
> summary(b)

Call:
lm(formula = Fertility ~ Agriculture + Examination + Education +
    Catholic + Infant.Mortality, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max
-15.2743  -5.2617   0.5032   4.1198  15.3213

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *
Examination      -0.25801    0.25388  -1.016  0.31546
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 **
Infant.Mortality  1.07705    0.38172   2.822  0.00734 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
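Typing the formula by hand is fine for five variables, but it gets error-prone with more. One possible way to build it from the selection result is base R's `reformulate`. This is a sketch, and the names `fit`, `s`, `chosen`, `vars`, `f`, and `m` are illustrative rather than part of the original example:

```r
library(leaps)

fit <- regsubsets(as.matrix(swiss[, -1]), swiss[, 1])
s <- summary(fit)

# Logical row for the size with the largest adjusted R-squared.
chosen <- s$which[which.max(s$adjr2), ]

# Drop the intercept entry and keep the names of the selected predictors.
vars <- names(chosen)[chosen & names(chosen) != "(Intercept)"]

# Assemble Fertility ~ <selected predictors> and fit it.
f <- reformulate(vars, response = "Fertility")
m <- lm(f, data = swiss)
print(f)
```

Because the response was passed to regsubsets as a bare vector, its name has to be supplied explicitly when the formula is rebuilt.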
When doing model selection, rather than applying automatic methods blindly, one should consider whether the model makes sense given prior knowledge and whether the selected model can be reasonably explained. Statistics such as Cp or adjusted R-squared are computed on the data used for fitting and do not directly measure prediction accuracy on unseen data. Thus, it’s better to hold out a test set for evaluating models.
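A minimal sketch of such a hold-out evaluation, using base R only, might look like the following. The 30% test fraction, the seed, the helper `rmse`, and the choice of the two candidate models are all assumptions made for illustration:

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hold out roughly 30% of the rows of swiss as a test set (an assumed ratio).
n <- nrow(swiss)
test_idx <- sample(n, size = round(0.3 * n))
train <- swiss[-test_idx, ]
test  <- swiss[test_idx, ]

# Two candidate models, fitted on the training rows only.
m5 <- lm(Fertility ~ ., data = train)
m4 <- lm(Fertility ~ . - Examination, data = train)

# Root mean squared error on the held-out rows.
rmse <- function(model, newdata) {
  sqrt(mean((newdata$Fertility - predict(model, newdata))^2))
}
test_rmse <- c(five_vars = rmse(m5, test), four_vars = rmse(m4, test))
print(test_rmse)
```

With only 47 observations the result depends heavily on the particular split, so repeating the split or using cross-validation would probably give a steadier picture. For an honest estimate, the subset search itself should also be run on the training rows only.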
References:
Variable selection and model building by Tiejun (Ty) Tong
Leaps package
Julian J. Faraway, Linear Models with R, Chapman & Hall/CRC