Following my comment, I suggest using the `leaps` package, which provides an algorithm to exhaustively test every combination of variables in a model formula and returns several indicators (R-squared, BIC, etc.).
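The steps below assume a data frame `df` with a response `y` and 20 numeric predictors. If you want to follow along, a hypothetical stand-in (my own simulated data, not the asker's) could be:

```r
# toy data frame with the shape assumed below: y plus predictors X1..X20
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
y <- 1 + 0.3 * X[, 3] + 1.5 * X[, 7] + 1.8 * X[, 8] + 2.4 * X[, 9] + rnorm(n)
df <- data.frame(y = y, X)
```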

You can then filter the results to get the list of models that meet your criteria (here I used an R² threshold of 0.85 to keep the list short). First, fit the model, limiting it to 4 variables, and keep only the 20 best models of each size (there are in total 19380 possible models of 4 variables...):

```
library(leaps)
fit <- regsubsets(y~., df, nvmax=4, nbest=20)
```

Subset the output table (a logical table with TRUE/FALSE for each variable kept in a model) using the R² threshold (stored in another element of the summary):

```
# the last 20 rows of $which are the 20 best 4-variable models
mytable <- data.frame(tail(summary(fit)$which, 20)[which(tail(summary(fit)$rsq, 20) > 0.85), ])
```

Rearrange it to get the variable names of the best models in transposed format, the first row being the model with the best R²:

```
output <- t(apply(mytable, 1, function(x) names(mytable)[x]))
#### [,1] [,2] [,3] [,4] [,5]
#### [1,] "X.Intercept." "X3" "X7" "X8" "X9"
#### [2,] "X.Intercept." "X2" "X7" "X8" "X9"
#### [3,] "X.Intercept." "X1" "X7" "X8" "X9"
#### [4,] "X.Intercept." "X7" "X8" "X9" "X15"
#### [5,] "X.Intercept." "X4" "X7" "X8" "X9"
```

If you need to refit one of these models, you can retrieve its formula like this:

```
as.formula(paste("y ~ ", paste(output[1, -1], collapse=" + ")))
#### y ~ X3 + X7 + X8 + X9
```

Or simply use `reformulate`, thanks to @Ben Bolker's suggestion:

```
reformulate(output[1, -1], response="y")
```
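Either way, the resulting formula can be passed straight to `lm` to refit the selected model. A self-contained sketch with toy data (the data frame and variable names here are placeholders; in practice you would use your `df` and `output[1, -1]`):

```r
set.seed(1)
df <- data.frame(y = rnorm(20), X3 = rnorm(20), X7 = rnorm(20),
                 X8 = rnorm(20), X9 = rnorm(20))
vars <- c("X3", "X7", "X8", "X9")          # e.g. output[1, -1] from above
form <- reformulate(vars, response = "y")  # y ~ X3 + X7 + X8 + X9
refit <- lm(form, data = df)
coef(refit)                                # coefficients of the refitted model
```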

**EDIT: modeling your data with lasso regression.**

I use a script adapted from Hastie & Tibshirani, and you must also load these helper functions here. I suggest you first familiarize yourself with this technique. First, build the model matrix and split off a training set:

```
library(glmnet); set.seed(6)
train.ratio <- 0.75
x <- model.matrix(y~., df)[, -1]
y <- df$y
train.ind <- sample(1:nrow(x), floor(train.ratio * nrow(x)))
x.train <- x[ train.ind, ]; y.train <- y[ train.ind ]
x.test <- x[-train.ind, ]; y.test <- y[-train.ind ]
n <- nrow(x); n.train <- nrow(x.train); n.test <- nrow(x.test)
```

Then calibrate the model over a grid of lambda values and compute the test error:

```
grid <- 10^seq(4, -2, length=100) # increase range if needed
lasso.mod <- glmnet(x.train, y.train, alpha=1, lambda=grid)
# note: lasso.mod$a0 holds one intercept per lambda, so use predict()
# rather than a single a0 value
pred <- predict(lasso.mod, newx = x.test)
err.lasso <- colMeans((y.test - pred)^2)
```

Plot the results with the helper functions loaded earlier:

```
par(mfrow = c(2, 2))
frac.lasso <- plot.path(t(lasso.mod$beta), err = err.lasso)
plot.coef(t(lasso.mod$beta), lasso.mod$lambda, err.lasso)
plot.err(err.lasso, frac.lasso)
plot.err(err.lasso, lasso.mod$lambda)
par(mfrow = c(1, 1))
```

Finally, find out what the best model is (cool, it's the same as previously! but with different coefficients).

```
best <- lasso.mod$beta[, which.min(err.lasso)]; best[best!=0]
#### X3 X7 X8 X9
#### 0.2572322 1.4933962 1.7868181 2.4447500
```
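As an alternative to the manual grid search over a held-out test set, `cv.glmnet` chooses lambda by k-fold cross-validation on the training data. A sketch on toy data (my own simulation; your `x.train`/`y.train` would go in place of `x`/`y`):

```r
library(glmnet)
set.seed(6)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- x[, 7] + 2 * x[, 8] + rnorm(100)
cv <- cv.glmnet(x, y, alpha = 1)      # 10-fold CV by default
cv$lambda.min                         # lambda with the smallest CV error
coef(cv, s = "lambda.min")            # sparse coefficients at that lambda
```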

You can also ask for all cases containing 4 variables (here they select the same variables; only the lambda values and coefficients differ):

```
lasso.mod$beta[, which(colSums(lasso.mod$beta!=0)==4)]
```
