Recognition: 2 theorem links · Lean theorem
Cross-validation-based optimal feature selection for linear SVM classification
Pith reviewed 2026-05-11 01:18 UTC · model grok-4.3
The pith
Cross-validation can select optimal feature subsets for linear SVM classification by reformulating the problem as a single-level mixed-integer program.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that feature subset selection for SVM classification can be performed by directly minimizing the cross-validation criterion. Substituting the least-squares SVM for the standard SVM allows the inner optimization to be expressed in closed form, reducing the original bilevel mixed-integer program to a single-level mixed-integer program that standard solvers can handle directly. Experiments confirm that the resulting selections achieve favorable classification accuracy and feature selection accuracy relative to regularization-based, sequential, and statistical-criterion alternatives.
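In generic notation (ours, not necessarily the paper's), the bilevel program behind this claim can be sketched as: choose a binary feature mask that minimizes cross-validated misclassification, subject to each fold's classifier being an inner SVM optimum.

```latex
\begin{align*}
\min_{z \in \{0,1\}^p} \quad
  & \sum_{k=1}^{K} \sum_{i \in V_k}
    \mathbf{1}\!\left[\, y_i \bigl( w_k^\top (z \circ x_i) + b_k \bigr) \le 0 \,\right]
  && \text{(validation errors over folds } V_k\text{)} \\
\text{s.t.} \quad
  & (w_k, b_k) \in \operatorname*{arg\,min}_{w,\,b}\;
    \tfrac{1}{2}\lVert w \rVert^2
    + C \sum_{i \in T_k} \max\bigl(0,\, 1 - y_i ( w^\top (z \circ x_i) + b ) \bigr),
  \quad k = 1, \dots, K,
\end{align*}
```

where $z \circ x$ masks the features and $T_k$ is fold $k$'s training set. Substituting the LS-SVM replaces the inner hinge-loss problem with equality-constrained squared errors, making the arg min the solution of a linear system that can be written as explicit constraints of a single-level MIO.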
What carries the argument
The closed-form optimality conditions of the least-squares SVM, which replace the inner SVM training step and convert the bilevel feature-selection problem into a single-level mixed-integer optimization program.
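Concretely, the LS-SVM inner training step reduces to one linear system (the standard form from Suykens & Vandewalle; notation assumed here rather than taken from the paper):

```latex
\begin{bmatrix} 0 & y^\top \\ y & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ \mathbf{1}_n \end{bmatrix},
\qquad
\Omega_{ij} = y_i \, y_j \, x_i^\top x_j,
\qquad
w = \sum_{i=1}^{n} \alpha_i \, y_i \, x_i .
```

Because $(b, \alpha)$ is characterized by linear equations rather than an arg min, the inner optimum can be embedded directly as constraints of the outer mixed-integer program.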
If this is right
- Cross-validation becomes a practical criterion for feature selection in SVM classification without relying on asymptotic assumptions required by AIC or BIC.
- The reformulated problem can be solved with off-the-shelf mixed-integer optimization software rather than specialized bilevel solvers.
- The method yields feature subsets that recover relevant variables more accurately than L1 regularization or recursive feature elimination in the tested simulations.
- Classification performance remains competitive while producing sparser, more interpretable models.
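As a toy illustration of the criterion being optimized (not the paper's MIO: a brute-force enumeration sketch with the LS-SVM solved via its standard closed-form linear system; function names and the synthetic data are our own):

```python
import numpy as np
from itertools import combinations

def fit_ls_svm(X, y, gamma=1.0):
    """Solve the LS-SVM KKT linear system (Suykens & Vandewalle form)."""
    n = len(y)
    Omega = (y[:, None] * y[None, :]) * (X @ X.T)   # linear-kernel Gram matrix
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(n))))
    b, alpha = sol[0], sol[1:]
    w = X.T @ (alpha * y)                            # recover primal weights
    return w, b

def cv_error(X, y, n_folds=5, gamma=1.0):
    """K-fold CV misclassification rate of the LS-SVM on the given features."""
    n = len(y)
    errors = 0
    for fold in np.array_split(np.arange(n), n_folds):
        train = np.setdiff1d(np.arange(n), fold)
        w, b = fit_ls_svm(X[train], y[train], gamma)
        errors += np.sum(np.sign(X[fold] @ w + b) != y[fold])
    return errors / n

def best_subset(X, y, max_size=2):
    """Enumerate feature subsets (the MIO searches these implicitly) and
    return the CV-error minimizer."""
    best_err, best_sub = np.inf, None
    for size in range(1, max_size + 1):
        for sub in combinations(range(X.shape[1]), size):
            err = cv_error(X[:, list(sub)], y)
            if err < best_err:
                best_err, best_sub = err, sub
    return best_sub, best_err

# Synthetic check: feature 0 carries the label, features 1-2 are pure noise.
rng = np.random.default_rng(0)
n = 60
y = rng.choice([-1.0, 1.0], size=n)
X = np.column_stack([2 * y + rng.normal(0, 0.5, n),
                     rng.normal(0, 1.0, (n, 2))])
subset, err = best_subset(X, y)
print(subset, err)   # the informative feature 0 should be selected
```

The exhaustive loop stands in for the branch-and-bound search a MIO solver performs; it only scales to small p, which is exactly why the single-level reformulation matters.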
Where Pith is reading between the lines
- The same reformulation strategy could be tested on other convex surrogate losses that admit closed-form solutions, potentially broadening CV-based selection beyond SVMs.
- In very high-dimensional regimes the approach might reduce the need for ad-hoc regularization parameters by directly optimizing predictive error.
- If the LS-SVM proxy holds, similar bilevel-to-single-level reductions may apply to other margin-based classifiers where feature selection is desired.
Load-bearing premise
The least-squares SVM provides a sufficiently accurate proxy for the standard SVM when the goal is to select features via the cross-validation criterion.
What would settle it
On small instances with known optimal feature subsets, an exact bilevel solver for the original SVM problem would select a different subset than the proposed single-level reformulation, and that difference would produce measurably worse held-out classification accuracy.
original abstract
This paper addresses feature subset selection for Support Vector Machines (SVMs) based on the cross-validation criterion. Unlike statistical criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), cross-validation requires only the mild assumption that samples are independently and identically distributed (i.i.d.). For this reason, the cross-validation criterion is expected to work well across a wide range of prediction problems, and it has already demonstrated its usefulness as a feature subset selection method for regression. The objective of this paper is to extend the framework of best feature subset selection via the cross-validation criterion to SVM classification problems. This subset-selection problem can be formulated as a bilevel mixed-integer optimization problem. Because bilevel optimization problems are generally hard to solve, we introduce the Least Squares Support Vector Machine (LS-SVM), whose optimality conditions admit a closed-form expression, and reduce the problem to a single-level mixed-integer optimization problem. This reformulation allows us to solve the problem using standard optimization software. We evaluate the proposed framework through simulation experiments that compare it with a regularization-based method (L1-regularization), a sequential search method (recursive feature elimination), and mixed-integer optimization (MIO) based on statistical criteria. The results show that the proposed framework achieves favorable performance both in classification accuracy and feature selection accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a cross-validation-based framework for optimal feature subset selection in linear SVM classification. It formulates the task as a bilevel mixed-integer optimization problem and replaces the inner SVM with its least-squares variant (LS-SVM) to obtain closed-form optimality conditions, thereby reducing the problem to a single-level MIO solvable by standard optimization software. Simulation experiments compare the approach to L1-regularized SVM, recursive feature elimination, and MIO using statistical criteria such as AIC/BIC, with the claim that it achieves favorable performance in both classification accuracy and feature selection accuracy.
Significance. If the LS-SVM surrogate is shown to produce feature subsets that are near-optimal under the true hinge-loss SVM cross-validation criterion, the work would provide a computationally tractable exact method for CV-based feature selection in classification, extending prior regression results and offering an alternative to regularization or heuristic search methods under the i.i.d. assumption.
major comments (2)
- [Reformulation section and bilevel formulation] The core reformulation (bilevel to single-level MIO via LS-SVM substitution) optimizes the cross-validation criterion under squared loss rather than the hinge loss of the target linear SVM. Because the CV error landscape differs, the selected subsets are optimal only for the LS-SVM surrogate; the manuscript evaluates final performance by retraining a standard SVM on those subsets but provides no direct evidence (e.g., comparison of true SVM CV error on LS-SVM-selected vs. hinge-loss-selected subsets) that the subsets would be chosen by intractable direct CV on the claimed SVM.
- [Simulation experiments] The simulation experiments section reports only that the proposed framework achieves 'favorable performance' without providing quantitative metrics (accuracy values, feature selection precision/recall), dataset characteristics (dimensions, sample sizes, noise levels), number of replications, or statistical tests comparing against the baselines. This absence prevents verification of the central empirical claim.
minor comments (1)
- [Abstract and Introduction] The abstract and introduction would benefit from a brief statement of the computational complexity of the resulting MIO and the range of problem sizes for which it remains tractable.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond to each major comment below, indicating where revisions will be made.
point-by-point responses
Referee: [Reformulation section and bilevel formulation] The core reformulation (bilevel to single-level MIO via LS-SVM substitution) optimizes the cross-validation criterion under squared loss rather than the hinge loss of the target linear SVM. Because the CV error landscape differs, the selected subsets are optimal only for the LS-SVM surrogate; the manuscript evaluates final performance by retraining a standard SVM on those subsets but provides no direct evidence (e.g., comparison of true SVM CV error on LS-SVM-selected vs. hinge-loss-selected subsets) that the subsets would be chosen by intractable direct CV on the claimed SVM.
Authors: We acknowledge that the reformulation selects subsets optimal under the LS-SVM squared-loss CV criterion rather than the hinge-loss CV of standard SVM. The LS-SVM substitution is introduced specifically to obtain closed-form optimality conditions that reduce the bilevel problem to a tractable single-level MIO. The manuscript then evaluates performance by retraining a standard linear SVM on the selected subsets, and the reported simulations indicate competitive accuracy relative to direct SVM baselines. We do not claim equivalence to the (intractable) hinge-loss CV optimum. In the revision we will add an explicit discussion of the surrogate approximation, its motivation, and its limitations.
revision: yes
Referee: [Simulation experiments] The simulation experiments section reports only that the proposed framework achieves 'favorable performance' without providing quantitative metrics (accuracy values, feature selection precision/recall), dataset characteristics (dimensions, sample sizes, noise levels), number of replications, or statistical tests comparing against the baselines. This absence prevents verification of the central empirical claim.
Authors: We agree that the current presentation of the simulation results is insufficiently detailed. The revised manuscript will report quantitative metrics including mean classification accuracy, feature-selection precision and recall (with standard deviations), full dataset specifications (dimensions, sample sizes, noise levels), the number of replications, and statistical comparisons (e.g., paired t-tests or Wilcoxon tests) against L1-SVM, RFE, and AIC/BIC-based MIO.
revision: yes
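The promised feature-selection metrics follow standard definitions and are simple to compute; a minimal sketch (function and variable names are ours, not the paper's):

```python
def selection_precision_recall(selected, relevant):
    """Precision/recall of a selected feature set against the truly relevant set."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 2 of the 3 chosen features are truly relevant; 1 relevant feature is missed.
p, r = selection_precision_recall(selected={0, 1, 3}, relevant={0, 1, 2})
print(p, r)  # 2/3 precision, 2/3 recall
```

Averaging these over replications, with standard deviations, is what would make the "feature selection accuracy" claim checkable.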
- Not offered in the rebuttal: direct empirical verification that LS-SVM-selected subsets coincide with those from intractable hinge-loss SVM cross-validation, which cannot be computed for the problem sizes considered.
Circularity Check
No circularity: bilevel-to-single-level reformulation via explicit LS-SVM surrogate is a standard modeling choice, not a definitional reduction
full rationale
The paper formulates CV-based feature selection for linear SVM as a bilevel MIO, then substitutes LS-SVM (with its closed-form optimality conditions) to obtain a tractable single-level MIO. This substitution is presented as an approximation to enable computation; the final reported performance is measured by training a standard hinge-loss SVM on the selected subsets and comparing against baselines. No equation reduces to its own input by construction, no parameter is fitted and then relabeled as a prediction, and no load-bearing premise rests on self-citation. The derivation chain therefore remains self-contained against external benchmarks and the chosen surrogate.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Samples are independently and identically distributed (i.i.d.)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · "we introduce the Least Squares Support Vector Machine (LS-SVM), whose optimality conditions admit a closed-form expression, and reduce the problem to a single-level mixed-integer optimization problem"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · relevance unclear · "min ½∥w∥² + (γ/2) Σ e_i² s.t. y_i(wᵀx_i + b) = 1 − e_i"
Reference graph
Works this paper leans on
- [2]
- [3] Suykens, J. A. K. and Vandewalle, J. (1999) Least Squares Support Vector Machine Classifiers. Neural Processing Letters.
- [4] Guyon, Isabelle and Weston, Jason and Barnhill, Stephen and Vapnik, Vladimir (2002) Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning.
- [5] Weston, Jason and Elisseeff, André and Schölkopf, Bernhard and Tipping, Michael (2003) Use of the Zero-Norm with Linear Models and Kernel Methods. Journal of Machine Learning Research.
- [6] Maldonado, Sebastián and Pérez, Jorge and Weber, Richard and Labbé, Martín (2014) Feature Selection for Support Vector Machines via Mixed Integer Linear Programming. Information Sciences.
- [8] Pedregosa, Fabian and Varoquaux, Gaël and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and Vanderplas, Jake and Passos, Alexandre and Cournapeau, David and Brucher, Matthieu and Perrot, Matthieu and Duchesnay, Édouard (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research.
- [10] Guyon, Isabelle and Elisseeff, André (2003) An Introduction to Variable and Feature Selection. Journal of Machine Learning Research.
- [12] Schwarz, Gideon (1978) Estimating the Dimension of a Model. The Annals of Statistics.
- [13] Allen, David M. (1974) The Relationship between Variable Selection and Data Augmentation and a Method for Prediction. Technometrics.
- [14] Stone, Mervyn (1974) Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society: Series B.
- [19] Powers, David M. W. (2011) Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. Journal of Machine Learning Technologies.
- [20]
- [21] Tibshirani, Robert (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological).
- [22] Bertsimas, Dimitris and King, Angela and Mazumder, Rahul (2016) Best Subset Selection via a Modern Optimization Lens. The Annals of Statistics.
- [24] Bennett, Kristin P. and Hu, Jing and Ji, Xiaoyun and Kunapuli, Gautam and Pang, Jong-Shi (2006) Model Selection via Bilevel Optimization. The 2006 IEEE International Joint Conference on Neural Network Proceedings.
- [25] Kunapuli, Gautam and Bennett, Kristin P. and Hu, Jing and Pang, Jong-Shi (2008) Classification Model Selection via Bilevel Programming. Optimization Methods and Software.
- [26] Colson, Benoît and Marcotte, Patrice and Savard, Gilles (2007) An Overview of Bilevel Optimization. Annals of Operations Research.
- [27] Sinha, Ankur and Malo, Pekka and Deb, Kalyanmoy (2018) A Review on Bilevel Optimization: From Classical to Evolutionary Approaches and Applications. IEEE Transactions on Evolutionary Computation.
- [28] Akaike H (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6):716–723. doi:10.1109/TAC.1974.1100705
- [29] Allen DM (1974) The relationship between variable selection and data augmentation and a method for prediction. Technometrics 16(1):125–127
- [30] Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Statistics Surveys 4:40–79. doi:10.1214/09-SS054
- [31] Bennett KP, Hu J, Ji X, et al (2006) Model selection via bilevel optimization. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, pp 1922–1929
- [32] Bertsimas D, King A, Mazumder R (2016) Best subset selection via a modern optimization lens. The Annals of Statistics 44(2):813–852
- [33] Cervantes J, Garcia-Lamont F, Rodríguez-Mazahua L, et al (2020) A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing 408:189–215. doi:10.1016/j.neucom.2019.10.118
- [34] Colson B, Marcotte P, Savard G (2007) An overview of bilevel optimization. Annals of Operations Research 153(1):235–256
- [35] Cortes C, Vapnik V (1995) Support-vector networks. Machine Learning 20(3):273–297. doi:10.1007/BF00994018
- [36] Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874. doi:10.1016/j.patrec.2005.10.010
- [37] Geisser S (1975) The predictive sample reuse method with applications. Journal of the American Statistical Association 70(350):320–328. doi:10.1080/01621459.1975.10479865
- [38] Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. Journal of Machine Learning Research 3:1157–1182
- [39] Guyon I, Weston J, Barnhill S, et al (2002) Gene selection for cancer classification using support vector machines. Machine Learning 46(1–3):389–422
- [40] Hastie T, Tibshirani R, Tibshirani RJ (2020) Best subset, forward stepwise or lasso? Analysis and recommendations based on extensive comparisons. Statistical Science 35(4):579–592. doi:10.1214/19-STS733
- [41] Kunapuli G, Bennett KP, Hu J, et al (2008) Classification model selection via bilevel programming. Optimization Methods and Software 23(4):475–489
- [42] Maldonado S, Pérez J, Weber R, et al (2014) Feature selection for support vector machines via mixed integer linear programming. Information Sciences 279:163–175
- [43] Miller A (2002) Subset Selection in Regression, 2nd edn. Chapman and Hall/CRC, Boca Raton
- [44] Pedregosa F, Varoquaux G, Gramfort A, et al (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830
- [45] Powers DMW (2011) Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies 2(1):37–63
- [46] Sato T, Takano Y, Miyashiro R, et al (2016) Feature subset selection for logistic regression via mixed integer optimization. Computational Optimization and Applications 64(3):865–880. doi:10.1007/s10589-016-9832-2
- [47] Schwarz G (1978) Estimating the dimension of a model. The Annals of Statistics 6(2):461–464
- [48] Sinha A, Malo P, Deb K (2018) A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation 22(2):276–295
- [49] Stone M (1974) Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B 36(2):111–147
- [50] Suykens JAK, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Processing Letters 9(3):293–300
- [51] Takano Y, Miyashiro R (2020) Best subset selection via cross-validation criterion. TOP 28(2):475–488. doi:10.1007/s11750-020-00538-1
- [52] Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1):267–288
- [53]
- [54] Weston J, Elisseeff A, Schölkopf B, et al (2003) Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 3:1439–1461