AutoNFS: Automatic Neural Feature Selection
Pith reviewed 2026-05-22 23:35 UTC · model grok-4.3
The pith
AutoNFS automatically identifies the smallest feature set needed for a tabular task by training a Gumbel-Sigmoid mask jointly with the predictor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutoNFS combines a feature selection (FS) module based on Gumbel-Sigmoid sampling with a predictive model that evaluates the relevance of the selected attributes. The model is trained end-to-end using a differentiable loss and automatically determines the minimal set of features essential to solve a given downstream task. Unlike many wrapper-style approaches, AutoNFS introduces a low and predictable training overhead and avoids repeated model retraining across feature budgets.
What carries the argument
A Gumbel-Sigmoid sampling module that produces differentiable binary feature masks and is trained jointly with the predictor under a loss that trades task performance against the number of active features.
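As a rough illustration of what such a masking module computes, here is a minimal NumPy sketch of Gumbel-Sigmoid sampling. It is not the paper's implementation: the function name `gumbel_sigmoid_mask`, the temperature value, and the example logits are all assumptions made for illustration.

```python
import numpy as np

def gumbel_sigmoid_mask(logits, tau=0.5, rng=None, hard=False):
    """Sample a relaxed (or thresholded) feature mask from per-feature logits."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Logistic noise equals the difference of two independent Gumbel(0, 1) draws.
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=logits.shape)
    noise = np.log(u) - np.log1p(-u)
    # The temperature tau controls how close the soft mask is to binary.
    soft = 1.0 / (1.0 + np.exp(-(logits + noise) / tau))
    return (soft > 0.5).astype(float) if hard else soft

rng = np.random.default_rng(0)
logits = np.array([4.0, -4.0, 0.0])  # hypothetical learned logits, one per feature
x = np.array([[1.0, 2.0, 3.0]])      # one sample with three features
mask = gumbel_sigmoid_mask(logits, tau=0.5, rng=rng)
masked_x = x * mask                  # the predictor would receive the masked input
```

With `hard=True` the mask is thresholded at 0.5, so a feature with logit `l` is kept with probability `sigmoid(l)`. In a real training loop the soft mask (or a straight-through variant) would sit inside an autodiff framework so that the logits receive gradients from the task loss.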
If this is right
- The method scales to high-dimensional tabular data because the additional cost of the masking module is largely independent of the number of input features.
- AutoNFS eliminates the need for user-specified feature budgets or repeated retraining across different budgets.
- It produces models that remain competitive with classical and neural feature-selection baselines on both classification and regression tasks.
- The approach directly applies to real-world metagenomic datasets while returning sparser solutions on average.
Where Pith is reading between the lines
- The automatic sparsity induced by joint training may act as an implicit regularizer that improves generalization on high-dimensional inputs.
- Similar differentiable masking layers could be inserted into non-tabular architectures to achieve automatic feature selection without manual tuning.
- In deployed systems the reduced feature count would lower both inference latency and the cost of collecting new data.
Load-bearing premise
The Gumbel-Sigmoid masking module, when trained jointly with the predictor, will reliably identify the minimal set of essential features without requiring user-specified budgets or multiple retrainings.
What would settle it
On a dataset where exhaustive search over all subsets finds a strictly smaller feature set that matches the task performance achieved by AutoNFS, the claim that the method finds the minimal set would be refuted.
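The refutation test above can be sketched on a toy problem. The code below is an illustrative assumption throughout: the scoring function (R² of a least-squares fit), the tolerance, the synthetic data, and the "claimed" subset standing in for a selector's output are all hypothetical, and exhaustive search is only feasible for a handful of features because it visits up to 2^d subsets.

```python
from itertools import combinations
import numpy as np

def r2_of_subset(X, y, subset):
    """R^2 of an ordinary least-squares fit (with intercept) on the given columns."""
    A = np.column_stack([X[:, list(subset)], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def smallest_matching_subset(X, y, target_score, tol=1e-3):
    """Return the first smallest subset scoring within tol of target_score."""
    n_features = X.shape[1]
    for k in range(n_features + 1):            # try subset sizes in increasing order
        for subset in combinations(range(n_features), k):
            if r2_of_subset(X, y, subset) >= target_score - tol:
                return subset
    return tuple(range(n_features))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2]              # only features 0 and 2 matter
claimed = (0, 2, 4)                            # hypothetical selector output
best = smallest_matching_subset(X, y, r2_of_subset(X, y, claimed))
```

Here `best` is strictly smaller than `claimed` while matching its score, which is the situation that would count against a minimality claim.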
Original abstract
Feature selection (FS) is a fundamental challenge in machine learning, particularly for high-dimensional tabular data, where interpretability and computational efficiency are critical. Existing FS methods often cannot automatically detect the number of attributes required to solve a given task and involve user intervention or model retraining with different feature budgets. Additionally, they either neglect feature relationships (filter methods) or require time-consuming optimization (wrapper methods). To address these limitations, we propose AutoNFS, which combines the FS module based on Gumbel-Sigmoid sampling with a predictive model evaluating the relevance of the selected attributes. The model is trained end-to-end using a differentiable loss and automatically determines the minimal set of features essential to solve a given downstream task. Unlike many wrapper-style approaches, AutoNFS introduces a low and predictable training overhead and avoids repeated model retraining across feature budgets. In practice, the additional cost of the masking module is largely independent of the number of input features (beyond the unavoidable cost of processing the input itself), making the method scalable to high-dimensional tabular data. We evaluate AutoNFS on well-established classification and regression benchmarks as well as real-world metagenomic datasets. The results show that AutoNFS is competitive with, and often improves upon, strong classical and neural FS baselines while selecting fewer features on average across the evaluated benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AutoNFS, which augments a predictor with a Gumbel-Sigmoid masking module trained jointly end-to-end via a differentiable task loss. It claims this automatically identifies the minimal essential feature set for tabular classification and regression without user-specified budgets or repeated retraining, while incurring low overhead independent of input dimensionality and yielding competitive or superior performance with fewer features on standard benchmarks and metagenomic data.
Significance. If the end-to-end procedure reliably enforces minimality, the method would remove a practical barrier in existing wrapper and neural FS approaches by eliminating the need for budget search or multiple trainings, while remaining scalable to high-dimensional inputs.
major comments (1)
- [Abstract] The central claim that the model 'automatically determines the minimal set of features essential to solve a given downstream task' and 'select[s] fewer features on average' rests on the Gumbel-Sigmoid module alone; the described objective is solely the task loss, which is satisfied by any sufficient feature set and supplies no explicit pressure (e.g., an L0-style or expected-cardinality penalty) toward minimality.
minor comments (1)
- The abstract supplies no equations, training details, error bars, dataset sizes, or statistical tests, preventing verification of the performance and feature-count claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the abstract. We respond point-by-point below.
- Referee: [Abstract] The central claim that the model 'automatically determines the minimal set of features essential to solve a given downstream task' and 'select[s] fewer features on average' rests on the Gumbel-Sigmoid module alone; the described objective is solely the task loss, which is satisfied by any sufficient feature set and supplies no explicit pressure (e.g., an L0-style or expected-cardinality penalty) toward minimality.
Authors: We agree that the objective is the task loss with no explicit L0 or cardinality penalty, so there is no theoretical guarantee of minimality from the loss alone. The Gumbel-Sigmoid module permits the predictor to learn which features can be masked while still achieving low task loss; our experiments demonstrate that the resulting selections are compact (fewer features on average than strong baselines) and that further removal degrades performance. We will revise the abstract to qualify the language (e.g., change 'automatically determines the minimal set' to 'learns a compact feature set sufficient for the task') and add a short discussion paragraph noting the absence of an explicit sparsity term and the empirical nature of the observed minimality. This revision does not change the method or results.
Revision: yes
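For readers unfamiliar with the expected-cardinality penalty the referee mentions, the sketch below shows one common form of it. It is not part of the method as the rebuttal describes it; the function names and the weight `lam` are assumptions. It relies on the fact that, under hard thresholding of a Gumbel-Sigmoid sample, a feature with logit `l` is active with probability `sigmoid(l)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_loss(task_loss, logits, lam=0.01):
    """Task loss plus lam times the expected number of active features."""
    expected_active = sigmoid(np.asarray(logits, dtype=float)).sum()
    return task_loss + lam * expected_active

def penalty_grad(logits, lam=0.01):
    """Gradient of the penalty term with respect to each mask logit."""
    s = sigmoid(np.asarray(logits, dtype=float))
    return lam * s * (1.0 - s)   # always positive, so it pushes logits down

logits = np.array([3.0, 0.0, -3.0])   # hypothetical mask logits
loss = penalized_loss(task_loss=0.25, logits=logits, lam=0.1)
grad = penalty_grad(logits, lam=0.1)
```

In this example the expected number of active features is 1.5, so the penalized loss is 0.25 + 0.1 × 1.5 = 0.40. A term of this kind gives every logit a downward push that the task loss must counteract, which is the explicit pressure toward sparsity the referee finds missing.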
Circularity Check
No circularity: new end-to-end module with empirical evaluation
The paper introduces AutoNFS as a novel trainable Gumbel-Sigmoid masking module jointly optimized with a predictor via task loss. Claims of automatic minimal feature selection and competitive performance rest on experimental benchmarks rather than any derivation that reduces outputs to fitted inputs or prior self-citations by construction. No equations or steps equate a 'prediction' to its own training data, and the central method is presented as an independent architectural contribution without load-bearing self-referential uniqueness theorems or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Gumbel-Sigmoid sampling produces usable gradients for discrete feature selection during back-propagation.
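One way to see why this assumption is plausible: for a fixed noise draw, the relaxed mask is a smooth function of the logits, with derivative s(1 - s)/tau, where s is the soft mask value. The finite-difference check below is only a sketch; the helper names, logits, and temperature are assumptions, and it says nothing about how informative the gradients are in actual training.

```python
import numpy as np

def soft_mask(logits, noise, tau):
    """Relaxed Gumbel-Sigmoid mask for a fixed logistic noise draw."""
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))

def soft_mask_grad(logits, noise, tau):
    """Analytic derivative of the soft mask with respect to each logit."""
    s = soft_mask(logits, noise, tau)
    return s * (1.0 - s) / tau

rng = np.random.default_rng(1)
logits = np.array([1.5, -0.5, 0.0])        # hypothetical mask logits
u = rng.uniform(1e-6, 1.0 - 1e-6, size=3)
noise = np.log(u) - np.log1p(-u)           # logistic noise, held fixed
tau, eps = 0.5, 1e-6

# Central finite difference as an independent check of the analytic gradient.
numeric = (soft_mask(logits + eps, noise, tau)
           - soft_mask(logits - eps, noise, tau)) / (2.0 * eps)
analytic = soft_mask_grad(logits, noise, tau)
```

The gradient shrinks as the soft mask saturates toward 0 or 1, and it scales with 1/tau, so the temperature trades off how binary the mask looks against how large the gradient signal is.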