A Mathematical Optimization Approach for Expert-Informed Bayesian Best Subset Selection
Pith reviewed 2026-06-30 07:35 UTC · model grok-4.3
The pith
Expert probability estimates of feature relevance are incorporated into the MIO best-subsets problem as a log-odds penalty in a MAP framework.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The EBBS model formulates best subset selection as a maximum a posteriori problem in which aggregated expert probabilities appear as additive log-odds penalty terms in the MIO objective, thereby allowing expert knowledge to influence the globally optimal sparse solution without altering the underlying optimization structure.
What carries the argument
The MAP objective augmented with a log-odds penalty derived from expert prior probabilities, which is optimized via mixed-integer optimization.
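One way the log-odds penalty can arise is sketched below. This is a reconstruction under stated assumptions, not the paper's own derivation: it assumes a Gaussian likelihood with known noise variance \sigma^2, a flat prior on the coefficients, independent Bernoulli(p_i) priors on hypothetical binary inclusion indicators z_i, and a cardinality bound k as in the classical best-subsets formulation.

```latex
\begin{aligned}
% Independent Bernoulli prior on the inclusion indicators (assumption)
p(z) &= \prod_{i=1}^{d} p_i^{z_i} (1 - p_i)^{1 - z_i}, \qquad z \in \{0,1\}^d \\
% Negative log-prior: a linear term in z plus a constant
-\log p(z) &= -\sum_{i=1}^{d} z_i \log\frac{p_i}{1 - p_i} \;-\; \sum_{i=1}^{d} \log(1 - p_i) \\
% MAP problem after dropping constants (Gaussian likelihood assumed)
\min_{\beta,\; z \in \{0,1\}^d} \quad
  & \frac{1}{2\sigma^2} \lVert y - X\beta \rVert_2^2
    \;-\; \sum_{i=1}^{d} z_i \log\frac{p_i}{1 - p_i} \\
\text{s.t.} \quad & \beta_i (1 - z_i) = 0 \;\; \forall i, \qquad \sum_{i=1}^{d} z_i \le k
\end{aligned}
```

Under these assumptions the expert term is linear in z, so it adds only a linear cost to the binary variables already present in an MIO encoding. When every p_i = 1/2, each log-odds coefficient equals \log 1 = 0 and the objective is the ordinary least-squares best-subsets objective.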
Load-bearing premise
Expert assessments, once aggregated via Poisson binomial, win rates, or normalized ranks, constitute a valid prior that improves the MAP objective without introducing systematic bias that the data cannot correct.
What would settle it
Generate synthetic data where the true relevant features directly contradict the supplied expert priors; if EBBS returns subsets with higher validation error than standard best subsets, the benefit of the expert term is refuted.
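The test described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's procedure: it replaces the MIO solver with exhaustive enumeration (feasible only for a handful of features), assumes a Gaussian likelihood with `sigma2 = 1`, and uses hypothetical names (`penalized_best_subset`, `validation_mse`, `p_wrong`) and arbitrary prior values of 0.05 and 0.95.

```python
import itertools
import numpy as np


def penalized_best_subset(X, y, k, p, sigma2=1.0):
    """Enumerate all subsets S with |S| <= k and minimise
    RSS(S) / (2 * sigma2) - sum_{i in S} log(p_i / (1 - p_i)).
    Exhaustive search stands in for the MIO solver (assumption)."""
    d = X.shape[1]
    log_odds = np.log(p / (1.0 - p))
    best_obj, best_S = np.inf, ()
    for size in range(k + 1):
        for S in itertools.combinations(range(d), size):
            cols = list(S)
            if cols:
                beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
                resid = y - X[:, cols] @ beta
            else:
                resid = y
            obj = resid @ resid / (2.0 * sigma2) - sum(log_odds[i] for i in S)
            if obj < best_obj:
                best_obj, best_S = obj, S
    return best_S


def validation_mse(X_tr, y_tr, X_va, y_va, S):
    """Refit least squares on the training split, score on the validation split."""
    cols = list(S)
    if not cols:
        return float(np.mean(y_va ** 2))
    beta, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    return float(np.mean((y_va - X_va[:, cols] @ beta) ** 2))


rng = np.random.default_rng(0)
n, d, k = 60, 8, 3
beta_true = np.zeros(d)
beta_true[[0, 1, 2]] = 1.0                      # true relevant features
X = rng.standard_normal((2 * n, d))
y = X @ beta_true + rng.standard_normal(2 * n)
X_tr, y_tr, X_va, y_va = X[:n], y[:n], X[n:], y[n:]

p_flat = np.full(d, 0.5)                        # uninformative: plain best subsets
p_wrong = np.where(beta_true != 0, 0.05, 0.95)  # priors that contradict the truth

S_bs = penalized_best_subset(X_tr, y_tr, k, p_flat)
S_wrong = penalized_best_subset(X_tr, y_tr, k, p_wrong)
mse_bs = validation_mse(X_tr, y_tr, X_va, y_va, S_bs)
mse_wrong = validation_mse(X_tr, y_tr, X_va, y_va, S_wrong)
print("best subsets:", S_bs, mse_bs)
print("contradicting priors:", S_wrong, mse_wrong)
```

A single draw like this settles nothing on its own. The comparison would need to be repeated over many seeds, signal-to-noise ratios, and prior strengths before the validation errors could be read as evidence either way.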
Original abstract
A central challenge in statistical modeling is identifying the subset of features that belong in the true regression model. The classical best subset selection problem, recently made tractable via mixed-integer optimization (MIO), finds the globally optimal sparse solution. It does not, however, make use of any information beyond the observed data. In many applied settings, domain experts can meaningfully rank or score the relevance of candidate predictors, yet no existing framework integrates such probabilistic expert assessments directly into the best-subsets objective. This paper presents Expert-Implied Bayesian Best Subsets (EBBS), a method that incorporates domain-expert probability estimates of feature relevance into the MIO best-subsets problem through a maximum a posteriori (MAP) framework. Expert views from multiple respondents are aggregated into a single prior probability per feature using the Poisson binomial distribution for marginal probability estimates, the pairwise win rate for pairwise comparisons, or the normalized mean rank for ordinal rankings. This probability enters the objective function as a log-odds penalty term that smoothly encourages or discourages the selection of each feature consistent with the expert consensus. This paper provides analytic derivations of the MAP formulation and characterizes its theoretical properties. The proposed model reduces to Best Subsets when experts all have no views. Empirical results on synthetic and real datasets are forthcoming.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Expert-Implied Bayesian Best Subsets (EBBS), a MAP formulation that augments the MIO encoding of best-subset selection with a linear log-odds penalty term derived from per-feature inclusion probabilities p_i. These probabilities are obtained by aggregating expert assessments via the Poisson binomial distribution (for marginals), pairwise win rates, or normalized mean ranks. The resulting objective is claimed to be equivalent to -log p(data|subset) - log p(subset|expert p), reducing exactly to ordinary best subsets when all p_i = 1/2. Analytic derivations of the MAP equivalence and theoretical properties are stated to be supplied, with empirical results on synthetic and real data noted as forthcoming.
Significance. If the claimed MAP equivalence and theoretical properties can be verified, the construction supplies a direct, MIO-compatible mechanism for injecting external expert probabilities into sparse regression. This could be useful in applied domains where domain knowledge is available but currently ignored by best-subsets solvers. The reduction to the classical problem when experts are uninformative is a clean special case. However, because no derivations, closed-form checks, or empirical results appear in the manuscript, the practical significance and improvement over standard best subsets remain unevaluated.
major comments (2)
- [Abstract] Abstract: the statement that 'This paper provides analytic derivations of the MAP formulation and characterizes its theoretical properties' is unsupported; the manuscript contains no derivations, proofs, or closed-form verifications of the claimed equivalence between the penalized MIO objective and the MAP problem.
- [Abstract] Abstract: the central modeling choice (expert probabilities aggregated via Poisson binomial, win rates, or normalized ranks entering as a log-odds penalty) is presented without any analysis of bias, consistency, or conditions under which the prior improves rather than degrades subset recovery; this is load-bearing because the paper explicitly defers all empirical validation of solution quality.
minor comments (2)
- The reduction to best subsets when p_i = 1/2 is asserted but not written out explicitly as an equation; adding the simplified objective would improve clarity.
- Notation for the aggregated prior (e.g., how the three aggregation methods map to a single p_i vector) is described only at a high level; a short algorithmic box or explicit formula would help readers implement the penalty term.
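The mapping requested in the second minor comment might look like the sketch below. The exact formulas are assumptions, since the manuscript describes the three aggregation methods only at a high level: the Poisson binomial route here takes the expected fraction of endorsing experts, the win rate is wins over total comparisons, and ranks are rescaled linearly so that rank 1 maps to 1 and rank d maps to 0. All function names are hypothetical.

```python
import numpy as np


def poisson_binomial_pmf(q):
    """PMF of the number of successes in independent Bernoulli(q_j) trials,
    built by repeated convolution."""
    pmf = np.array([1.0])
    for qj in q:
        pmf = np.convolve(pmf, [1.0 - qj, qj])
    return pmf


def prior_from_marginals(Q):
    """Q[j, i] is expert j's probability that feature i is relevant.
    ASSUMPTION: p_i is the expected fraction of experts endorsing feature i
    under the Poisson binomial count (this equals the column mean of Q)."""
    m, d = Q.shape
    counts = np.arange(m + 1)
    return np.array([poisson_binomial_pmf(Q[:, i]) @ counts / m for i in range(d)])


def prior_from_pairwise(W):
    """W[a, b] counts how often feature a was preferred over feature b.
    ASSUMPTION: every feature appears in at least one comparison."""
    wins = W.sum(axis=1)
    losses = W.sum(axis=0)
    return wins / (wins + losses)


def prior_from_ranks(R):
    """R[j, i] is expert j's rank of feature i (1 = most relevant, d = least).
    ASSUMPTION: linear rescaling of the mean rank onto [0, 1]."""
    d = R.shape[1]
    return (d - R.mean(axis=0)) / (d - 1)


def clip_prior(p, eps=1e-3):
    """Keep p away from 0 and 1 so the log-odds stay finite."""
    return np.clip(p, eps, 1.0 - eps)


Q = np.array([[0.9, 0.5, 0.2],
              [0.7, 0.5, 0.1]])
W = np.array([[0, 3, 2],
              [1, 0, 1],
              [2, 3, 0]])
R = np.array([[1, 2, 3],
              [1, 3, 2]])

p_marginal = clip_prior(prior_from_marginals(Q))
p_pairwise = clip_prior(prior_from_pairwise(W))
p_rank = clip_prior(prior_from_ranks(R))
log_odds = np.log(p_marginal / (1.0 - p_marginal))  # coefficients of the penalty
```

The clipping step is a design choice of this sketch: a unanimous top rank or a perfect win record would otherwise give p_i = 1 and an infinite log-odds coefficient, which would force the feature into every solution regardless of the data.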
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. We address the two major comments point by point below, indicating the revisions we will undertake.
- Referee: [Abstract] Abstract: the statement that 'This paper provides analytic derivations of the MAP formulation and characterizes its theoretical properties' is unsupported; the manuscript contains no derivations, proofs, or closed-form verifications of the claimed equivalence between the penalized MIO objective and the MAP problem.
  Authors: We agree that the submitted manuscript does not contain the promised derivations or closed-form verifications of the MAP equivalence. This was an omission in the current draft. In the revised version we will insert a dedicated section deriving the log-odds penalty from the expert-aggregated prior (via Poisson binomial marginals, win rates, and normalized ranks), proving the exact reduction to ordinary best-subsets when all p_i = 1/2, and characterizing the resulting MAP objective. The abstract will be rewritten to describe only the material that is actually present after revision.
  Revision: yes
- Referee: [Abstract] Abstract: the central modeling choice (expert probabilities aggregated via Poisson binomial, win rates, or normalized ranks entering as a log-odds penalty) is presented without any analysis of bias, consistency, or conditions under which the prior improves rather than degrades subset recovery; this is load-bearing because the paper explicitly defers all empirical validation of solution quality.
  Authors: The current manuscript concentrates on the formulation and its MIO encoding. We concur that an examination of the statistical properties of the expert prior is necessary. The revision will add a theoretical section analyzing bias and consistency of the MAP estimator under the aggregated prior, together with sufficient conditions under which the expert term improves subset recovery relative to the uninformative case. While the original submission deferred full empirical results, we will also include a short set of synthetic experiments that illustrate the effect of the prior on recovery rates, thereby addressing the load-bearing concern raised.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper presents EBBS as a direct application of the standard MAP objective -log p(data|subset) - log p(subset|expert p_i) where p_i are external expert-derived inclusion probabilities aggregated via Poisson binomial, win rates, or ranks. The log-odds penalty is defined from these independent inputs and vanishes when all p_i = 1/2, recovering ordinary best subsets. No self-citations, fitted parameters, or self-definitional reductions appear; the analytic derivations referenced are of this standard equivalence and its MIO encoding, which remain self-contained against external expert data and the classical best-subsets problem.
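The claim that the penalty vanishes at p_i = 1/2 can be checked numerically. The snippet below is a small sketch, assuming the penalty enters a minimisation objective as -log(p / (1 - p)) for each selected feature; the function name `selection_cost` is hypothetical.

```python
import math


def selection_cost(p):
    """Amount added to a minimisation objective when a feature with
    expert prior p is selected (assumed sign convention)."""
    return -math.log(p / (1.0 - p))


# Priors below, at, and above one half.
costs = {p: selection_cost(p) for p in (0.1, 0.5, 0.9)}
for p, c in costs.items():
    print(f"p = {p}: cost of selecting = {c:+.4f}")
```

Under this convention a prior above one half gives a negative cost (a reward for selection), a prior below one half gives a positive cost of equal magnitude for the complementary probability, and a prior of exactly one half contributes nothing.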
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: expert assessments can be aggregated into a single marginal probability per feature that functions as a legitimate prior for MAP estimation.