We consider a linear regression model in which many of the observed regressors are assumed to be irrelevant for prediction. To avoid overfitting, we perform variable selection and aim to include only the true predictors in the least-squares fit. Best subset selection has attracted considerable interest in recent years as a way to address this objective. This method solves a mixed-integer optimization problem that finds the subset of at most k variables, for a given natural number k, that is optimal with respect to the in-sample error. In practice, a best subset selection is computed for each k, and the ideal k is then chosen via validation. We argue that the notion of best subset selection might be misaligned with the statistical intention. Instead, we propose a subset selection formulation based on the cross-validation loss function. We present a discrete optimization formulation that fits coefficients to training data and decides which variables to include or exclude so as to minimize the cross-validation error. Hence, we do not require a fixed sparsity bound and do not have to solve successive discrete optimization problems. Moreover, we present bounds for the regression coefficients, which allow us to construct a tighter mixed-integer formulation. Finally, we conduct a simulation study and provide evidence that the novel mixed-integer formulation yields excellent predictions, surpassing the results of competing state-of-the-art approaches.
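To make the idea of selecting a subset directly by cross-validation error concrete, the following is a minimal sketch, not the mixed-integer formulation itself. It replaces the solver with brute-force enumeration of all subsets, which is only feasible for a handful of regressors. The function names `cv_error` and `exhaustive_cv_selection`, the contiguous fold split, the absence of an intercept, and the synthetic data are all assumptions made for illustration.

```python
import itertools
import numpy as np

def cv_error(X, y, subset, folds):
    """Mean squared validation error of least squares restricted to `subset`."""
    total = 0.0
    for val_idx in folds:
        # Train on every observation outside the current validation fold.
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        if subset:
            cols = list(subset)
            beta, *_ = np.linalg.lstsq(
                X[train_mask][:, cols], y[train_mask], rcond=None
            )
            pred = X[val_idx][:, cols] @ beta
        else:
            # Empty model (no intercept assumed): predict zero.
            pred = np.zeros(len(val_idx))
        total += np.sum((y[val_idx] - pred) ** 2)
    return total / len(y)

def exhaustive_cv_selection(X, y, n_folds=5):
    """Enumerate all subsets and return the one with the smallest CV error."""
    n, p = X.shape
    folds = np.array_split(np.arange(n), n_folds)  # contiguous folds (assumption)
    best_subset, best_err = (), np.inf
    for k in range(p + 1):  # no fixed sparsity bound: every size is a candidate
        for subset in itertools.combinations(range(p), k):
            err = cv_error(X, y, subset, folds)
            if err < best_err:
                best_subset, best_err = subset, err
    return best_subset, best_err

# Synthetic example: only regressors 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.standard_normal(60)
subset, err = exhaustive_cv_selection(X, y)
print(subset, err)
```

The enumeration visits 2^p subsets, so it scales only to very small p; it is meant to show the objective being minimized, under the stated assumptions, rather than how a mixed-integer solver would search for it.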