Best Subset Selection via Cross-validation Criterion

This paper is concerned with the cross-validation criterion for best subset selection in a linear regression model. In contrast with statistical criteria (e.g., Mallows' $C_p$, AIC, BIC, and various other information criteria), cross-validation requires only mild assumptions, namely, that samples are identically distributed and that training and validation samples are independent. For this reason, the cross-validation criterion is expected to work well in most situations and for any predictive method. The purpose of this paper is to establish a mixed-integer optimization (MIO) approach to selecting the best subset of explanatory variables via the cross-validation criterion. This subset selection problem can be formulated as a bilevel MIO problem. We then reduce it to a mixed-integer quadratic optimization problem, which can be solved exactly using optimization software. The efficacy of our method is evaluated through simulation experiments by comparison with statistical-criterion-based exhaustive search algorithms and $L_1$-regularized regression. The simulation results demonstrate that our method performed well in both subset selection accuracy and predictive performance when the signal-to-noise ratio was low.
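
To make the cross-validation criterion concrete, the following is a minimal sketch that evaluates it by brute-force enumeration over all subsets. It is not the MIO reformulation proposed in the paper, and it is only practical for a handful of variables. The function names `cv_error` and `best_subset_by_cv`, the intercept term, the default of five folds, and the random fold assignment are all assumptions made for illustration.

```python
import itertools

import numpy as np


def cv_error(X, y, subset, folds):
    """Sum of squared validation errors of least squares restricted to `subset`."""
    cols = np.array(subset, dtype=int)
    total = 0.0
    for val_idx in folds:
        # Every sample outside the current validation fold is used for training.
        train = np.ones(len(y), dtype=bool)
        train[val_idx] = False
        # Design matrices: an intercept column plus the selected variables.
        A_train = np.column_stack([np.ones(train.sum()), X[train][:, cols]])
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx][:, cols]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        total += float(np.sum((y[val_idx] - A_val @ coef) ** 2))
    return total


def best_subset_by_cv(X, y, n_folds=5, seed=0):
    """Return the subset (tuple of column indices) with the smallest CV error."""
    rng = np.random.default_rng(seed)
    # The same folds are reused for every candidate subset so errors are comparable.
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    p = X.shape[1]
    candidates = (
        s for k in range(p + 1) for s in itertools.combinations(range(p), k)
    )
    return min(candidates, key=lambda s: cv_error(X, y, s, folds))
```

The enumeration visits all $2^p$ subsets, which is the cost that an exact optimization formulation is meant to avoid.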


