
arxiv: 2503.13304 · v3 · submitted 2025-03-17 · 💻 cs.LG

AutoNFS: Automatic Neural Feature Selection

Pith reviewed 2026-05-22 23:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords feature selection · Gumbel-Sigmoid · neural networks · tabular data · end-to-end training · automatic sparsity · metagenomic data

The pith

AutoNFS automatically identifies the smallest feature set needed for a tabular task by training a Gumbel-Sigmoid mask jointly with the predictor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Feature selection for high-dimensional tabular data often forces users to pick a feature budget in advance or to retrain models many times. AutoNFS attaches a differentiable Gumbel-Sigmoid masking layer to any downstream predictor and optimizes both pieces together under a single loss. The loss penalizes the use of extra features while preserving task accuracy, so the mask learns to drop everything that is not essential. Because the masking overhead stays roughly constant regardless of input width, the method stays practical on datasets with thousands of columns. Experiments on classification, regression, and metagenomic benchmarks show that the resulting models use fewer features than strong baselines while matching or exceeding their performance.

Core claim

AutoNFS combines the FS module based on Gumbel-Sigmoid sampling with a predictive model evaluating the relevance of the selected attributes. The model is trained end-to-end using a differentiable loss and automatically determines the minimal set of features essential to solve a given downstream task. Unlike many wrapper-style approaches, AutoNFS introduces a low and predictable training overhead and avoids repeated model retraining across feature budgets.

What carries the argument

Gumbel-Sigmoid sampling module that produces differentiable binary feature masks trained jointly with the predictor under a loss that trades task performance against the number of active features.
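
To make the masking step concrete, here is a minimal sketch of Gumbel-Sigmoid sampling in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau=0.5`, and the "keep features with positive logit" rule at inference are all hypothetical choices made for this example.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gumbel_sigmoid_mask(logits, tau=0.5, rng=random):
    """Draw one relaxed binary mask with one entry per feature.

    The difference of two independent Gumbel(0, 1) draws follows a
    Logistic(0, 1) distribution, so a single logistic sample
    log(u) - log(1 - u) serves as the noise term.
    """
    mask = []
    for a in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        noise = math.log(u) - math.log(1.0 - u)
        mask.append(sigmoid((a + noise) / tau))
    return mask


def hard_mask(logits):
    """Assumed inference rule: keep a feature when its logit is positive."""
    return [1 if a > 0 else 0 for a in logits]


def apply_mask(x, mask):
    """Element-wise product of an input row and a mask."""
    return [xi * mi for xi, mi in zip(x, mask)]


rng = random.Random(0)
logits = [4.0, -4.0, 0.0]           # hypothetical learned mask logits
soft = gumbel_sigmoid_mask(logits, tau=0.5, rng=rng)  # used during training
kept = hard_mask(logits)            # used at inference
masked = apply_mask([1.5, -2.0, 3.0], kept)
```

In a real training loop the logits would be parameters updated by the same optimizer as the predictor; lowering `tau` pushes the soft mask entries closer to 0 or 1.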

If this is right

  • The method scales to high-dimensional tabular data because the additional cost of the masking module is largely independent of the number of input features.
  • AutoNFS eliminates the need for user-specified feature budgets or repeated retraining across different budgets.
  • It produces models that remain competitive with classical and neural feature-selection baselines on both classification and regression tasks.
  • The approach directly applies to real-world metagenomic datasets while returning sparser solutions on average.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The automatic sparsity induced by joint training may act as an implicit regularizer that improves generalization on high-dimensional inputs.
  • Similar differentiable masking layers could be inserted into non-tabular architectures to achieve automatic feature selection without manual tuning.
  • In deployed systems the reduced feature count would lower both inference latency and the cost of collecting new data.

Load-bearing premise

The Gumbel-Sigmoid masking module, when trained jointly with the predictor, will reliably identify the minimal set of essential features without requiring user-specified budgets or multiple retrainings.

What would settle it

On a dataset where exhaustive search over all subsets finds a strictly smaller feature set that matches the task performance achieved by AutoNFS, the claim that the method finds the minimal set would be refuted.
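
The test described above can be sketched for a small problem as follows. This is a toy illustration only: `toy_score` is a hypothetical stand-in for validation performance, `selected_by_method` is an invented selection, and the search costs 2^n evaluations, so it is feasible only for a handful of features.

```python
from itertools import combinations


def smallest_sufficient_subset(n_features, score, target, tol=1e-9):
    """Return the smallest feature subset whose score reaches `target`.

    Subsets are enumerated in order of increasing size, so the first
    hit is a minimum-cardinality solution. Returns None if no subset
    reaches the target.
    """
    for k in range(n_features + 1):
        for subset in combinations(range(n_features), k):
            if score(subset) >= target - tol:
                return subset
    return None


def toy_score(subset):
    """Hypothetical score: only features 0 and 3 carry signal."""
    s = set(subset)
    return 0.5 + 0.2 * (0 in s) + 0.3 * (3 in s)


selected_by_method = (0, 2, 3)      # hypothetical output of a selector
target = toy_score(selected_by_method)
best = smallest_sufficient_subset(5, toy_score, target)

# The minimality claim is refuted when a strictly smaller subset matches the score.
refuted = best is not None and len(best) < len(selected_by_method)
```

Here the exhaustive search returns features 0 and 3, one fewer than the hypothetical selection, so the minimality claim would fail on this toy problem.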

Figures

Figures reproduced from arXiv: 2503.13304 by Marek Śmieja, Witold Wydmański.

Figure 1: Architecture of GFSNetwork. Our method consists of two parts - masking ...
Figure 2: Feature selection analysis showing the feature space represen...
Figure 3: The time requirements of GFSNetwork does not substantially increase ...
Figure 4: Average entropy of selected features is significantly higher than the en...
Figure 5: Analysis of sample features (top-left) from MNIST dataset shows that ...
Original abstract

Feature selection (FS) is a fundamental challenge in machine learning, particularly for high-dimensional tabular data, where interpretability and computational efficiency are critical. Existing FS methods often cannot automatically detect the number of attributes required to solve a given task and involve user intervention or model retraining with different feature budgets. Additionally, they either neglect feature relationships (filter methods) or require time-consuming optimization (wrapper methods). To address these limitations, we propose AutoNFS, which combines the FS module based on Gumbel-Sigmoid sampling with a predictive model evaluating the relevance of the selected attributes. The model is trained end-to-end using a differentiable loss and automatically determines the minimal set of features essential to solve a given downstream task. Unlike many wrapper-style approaches, AutoNFS introduces a low and predictable training overhead and avoids repeated model retraining across feature budgets. In practice, the additional cost of the masking module is largely independent of the number of input features (beyond the unavoidable cost of processing the input itself), making the method scalable to high-dimensional tabular data. We evaluate AutoNFS on well-established classification and regression benchmarks as well as real-world metagenomic datasets. The results show that AutoNFS is competitive with, and often improves upon, strong classical and neural FS baselines while selecting fewer features on average across the evaluated benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces AutoNFS, which augments a predictor with a Gumbel-Sigmoid masking module trained jointly end-to-end via a differentiable task loss. It claims this automatically identifies the minimal essential feature set for tabular classification and regression without user-specified budgets or repeated retraining, while incurring low overhead independent of input dimensionality and yielding competitive or superior performance with fewer features on standard benchmarks and metagenomic data.

Significance. If the end-to-end procedure reliably enforces minimality, the method would remove a practical barrier in existing wrapper and neural FS approaches by eliminating the need for budget search or multiple trainings, while remaining scalable to high-dimensional inputs.

major comments (1)
  1. [Abstract] The central claim that the model 'automatically determines the minimal set of features essential to solve a given downstream task' and 'select[s] fewer features on average' rests on the Gumbel-Sigmoid module alone; the described objective is solely the task loss, which is satisfied by any sufficient feature set and supplies no explicit pressure (e.g., L0-style or expected-cardinality penalty) toward minimality.
minor comments (1)
  1. The abstract supplies no equations, training details, error bars, dataset sizes, or statistical tests, preventing verification of the performance and feature-count claims.
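
For reference, one common way to write the kind of expected-cardinality penalty named in the major comment is sketched below. All symbols are notation introduced here for illustration and are not taken from the paper: $\alpha_i$ is the mask logit of feature $i$, $d$ the number of features, $f_\theta$ the predictor, $q_\alpha$ the mask distribution, and $\lambda \ge 0$ a hypothetical trade-off weight.

```latex
\mathcal{L}(\theta, \alpha)
  = \mathbb{E}_{m \sim q_\alpha}\!\left[
      \mathcal{L}_{\text{task}}\big(f_\theta(x \odot m),\, y\big)
    \right]
  + \lambda \sum_{i=1}^{d} \sigma(\alpha_i),
\qquad
\sigma(\alpha_i) = \frac{1}{1 + e^{-\alpha_i}} .
```

If each gate is open with probability $\sigma(\alpha_i)$, the sum equals the expected number of active features, so $\lambda > 0$ supplies direct pressure toward smaller sets; with $\lambda = 0$ the objective reduces to the task loss alone.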

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We respond point-by-point below.

  1. Referee: [Abstract] The central claim that the model 'automatically determines the minimal set of features essential to solve a given downstream task' and 'select[s] fewer features on average' rests on the Gumbel-Sigmoid module alone; the described objective is solely the task loss, which is satisfied by any sufficient feature set and supplies no explicit pressure (e.g., L0-style or expected-cardinality penalty) toward minimality.

    Authors: We agree that the objective is the task loss with no explicit L0 or cardinality penalty, so there is no theoretical guarantee of minimality from the loss alone. The Gumbel-Sigmoid module permits the predictor to learn which features can be masked while still achieving low task loss; our experiments demonstrate that the resulting selections are compact (fewer features on average than strong baselines) and that further removal degrades performance. We will revise the abstract to qualify the language (e.g., change 'automatically determines the minimal set' to 'learns a compact feature set sufficient for the task') and add a short discussion paragraph noting the absence of an explicit sparsity term and the empirical nature of the observed minimality. This revision does not change the method or results.

    revision: yes

Circularity Check

0 steps flagged

No circularity: new end-to-end module with empirical evaluation


The paper introduces AutoNFS as a novel trainable Gumbel-Sigmoid masking module jointly optimized with a predictor via task loss. Claims of automatic minimal feature selection and competitive performance rest on experimental benchmarks rather than any derivation that reduces outputs to fitted inputs or prior self-citations by construction. No equations or steps equate a 'prediction' to its own training data, and the central method is presented as an independent architectural contribution without load-bearing self-referential uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that Gumbel-Sigmoid sampling yields useful gradients for feature selection.

axioms (1)
  • domain assumption: Gumbel-Sigmoid sampling produces usable gradients for discrete feature selection during back-propagation
    Invoked by the description of the FS module trained end-to-end.
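
The assumption can be checked numerically on a single gate. The sketch below, with hypothetical names and values chosen for illustration, compares the analytic derivative of a relaxed gate m = sigmoid((alpha + noise) / tau), which is m(1 - m) / tau, against a central finite difference, and contrasts it with a hard threshold gate whose derivative is zero almost everywhere.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_gate(alpha, noise, tau):
    """Relaxed (soft) gate for one feature with fixed noise."""
    return sigmoid((alpha + noise) / tau)


def gate_grad(alpha, noise, tau):
    """Analytic derivative of the relaxed gate with respect to alpha."""
    m = relaxed_gate(alpha, noise, tau)
    return m * (1.0 - m) / tau


def hard_gate(alpha, noise):
    """Non-differentiable threshold gate, shown for comparison."""
    return 1.0 if alpha + noise > 0 else 0.0


alpha, noise, tau, h = 0.3, -0.1, 0.5, 1e-5   # illustrative values

# Central finite differences around alpha.
numeric = (relaxed_gate(alpha + h, noise, tau)
           - relaxed_gate(alpha - h, noise, tau)) / (2 * h)
analytic = gate_grad(alpha, noise, tau)
hard_numeric = (hard_gate(alpha + h, noise)
                - hard_gate(alpha - h, noise)) / (2 * h)

# A saturated gate (large logit) has a near-zero gradient.
saturated = gate_grad(10.0, 0.0, tau)
```

The relaxed gate passes a non-zero gradient to its logit while the hard gate passes none, which is the property the axiom relies on. The same formula also shows the limit of the assumption: once a gate saturates, its gradient shrinks toward zero.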

pith-pipeline@v0.9.0 · 5763 in / 1208 out tokens · 48816 ms · 2026-05-22T23:35:25.417107+00:00 · methodology

