AutoNFS: Automatic Neural Feature Selection

Marek Śmieja; Witold Wydmański

arxiv: 2503.13304 · v3 · submitted 2025-03-17 · 💻 cs.LG

Pith reviewed 2026-05-22 23:35 UTC · model grok-4.3

classification 💻 cs.LG

keywords: feature selection, Gumbel-Sigmoid, neural networks, tabular data, end-to-end training, automatic sparsity, metagenomic data

The pith

AutoNFS automatically identifies the smallest feature set needed for a tabular task by training a Gumbel-Sigmoid mask jointly with the predictor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Feature selection for high-dimensional tabular data often forces users to pick a feature budget in advance or to retrain models many times. AutoNFS attaches a differentiable Gumbel-Sigmoid masking layer to any downstream predictor and optimizes both pieces together under a single loss. The loss penalizes the use of extra features while preserving task accuracy, so the mask learns to drop everything that is not essential. Because the masking overhead stays roughly constant regardless of input width, the method stays practical on datasets with thousands of columns. Experiments on classification, regression, and metagenomic benchmarks show that the resulting models use fewer features than strong baselines while matching or exceeding their performance.

Core claim

AutoNFS combines a feature selection (FS) module based on Gumbel-Sigmoid sampling with a predictive model that evaluates the relevance of the selected attributes. The model is trained end-to-end using a differentiable loss and automatically determines the minimal set of features essential to solve a given downstream task. Unlike many wrapper-style approaches, AutoNFS introduces a low and predictable training overhead and avoids repeated model retraining across feature budgets.

What carries the argument

Gumbel-Sigmoid sampling module that produces differentiable binary feature masks trained jointly with the predictor under a loss that trades task performance against the number of active features.
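To make the masking idea concrete, here is a minimal sketch of Gumbel-Sigmoid gating in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value, the example logits, and the rule of keeping features whose logit is positive at inference time are all hypothetical, since the abstract gives no equations.

```python
import math
import random


def stable_sigmoid(x):
    """Logistic function that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gumbel_sigmoid_gate(logit, temperature, rng):
    """Draw one relaxed (soft) binary gate in [0, 1] for a single feature."""
    # The difference of two independent Gumbel(0, 1) draws follows a
    # Logistic(0, 1) distribution, so one logistic noise term is enough.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
    noise = math.log(u) - math.log(1.0 - u)
    return stable_sigmoid((logit + noise) / temperature)


def soft_mask(features, logits, temperature, rng):
    """Training-time masking: scale each feature by its sampled gate."""
    return [x * gumbel_sigmoid_gate(l, temperature, rng)
            for x, l in zip(features, logits)]


def hard_mask(logits):
    """Inference-time selection (assumed rule): keep features with logit > 0."""
    return [1 if l > 0 else 0 for l in logits]


rng = random.Random(0)
logits = [3.0, -3.0, 0.0]  # hypothetical learned selection logits
print(soft_mask([1.0, 1.0, 1.0], logits, temperature=0.5, rng=rng))
print(hard_mask(logits))
```

In a real model the logits would be trainable parameters and the masked features would feed the predictor; lowering the temperature pushes the sampled gates closer to 0 or 1.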

If this is right

The method scales to high-dimensional tabular data because the additional cost of the masking module is largely independent of the number of input features.
AutoNFS eliminates the need for user-specified feature budgets or repeated retraining across different budgets.
It produces models that remain competitive with classical and neural feature-selection baselines on both classification and regression tasks.
The approach directly applies to real-world metagenomic datasets while returning sparser solutions on average.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The automatic sparsity induced by joint training may act as an implicit regularizer that improves generalization on high-dimensional inputs.
Similar differentiable masking layers could be inserted into non-tabular architectures to achieve automatic feature selection without manual tuning.
In deployed systems the reduced feature count would lower both inference latency and the cost of collecting new data.

Load-bearing premise

The Gumbel-Sigmoid masking module, when trained jointly with the predictor, will reliably identify the minimal set of essential features without requiring user-specified budgets or multiple retrainings.

What would settle it

On a dataset where exhaustive search over all subsets finds a strictly smaller feature set that matches the task performance achieved by AutoNFS, the claim that the method finds the minimal set would be refuted.
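The falsification test described above can be sketched as a brute-force search over subsets. This is only a toy illustration: `toy_score`, the feature indices, and the `selected` tuple are hypothetical stand-ins, and in practice `score_fn` would have to retrain and evaluate the predictor on each subset, which is feasible only for a handful of features.

```python
from itertools import combinations


def smallest_sufficient_subset(n_features, score_fn, target_score):
    """Return the first smallest feature subset whose score reaches target_score.

    Subsets are tried in order of increasing size, so the cost grows as
    2**n_features; this is only practical for very small n_features.
    """
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if score_fn(subset) >= target_score:
                return subset
    return None


def toy_score(subset):
    """Hypothetical scorer: the task needs features 0 and 2 and nothing else."""
    return 1.0 if {0, 2} <= set(subset) else 0.5


selected = (0, 2, 3)           # hypothetical selection returned by the method
target = toy_score(selected)   # performance that selection achieves
best = smallest_sufficient_subset(5, toy_score, target)
refuted = best is not None and len(best) < len(selected)
print(best, refuted)  # prints: (0, 2) True
```

Here the search finds a two-feature subset matching the score of the three-feature selection, which is the situation that would refute a minimality claim.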

Figures

Figures reproduced from arXiv: 2503.13304 by Marek Śmieja, Witold Wydmański.

**Figure 2.** Feature selection analysis showing the feature space represen…

**Figure 3.** The time requirements of GFSNetwork does not substantially increase…

**Figure 4.** Average entropy of selected features is significantly higher than the en…

**Figure 5.** Analysis of sample features (top-left) from MNIST dataset shows that…

Original abstract

Feature selection (FS) is a fundamental challenge in machine learning, particularly for high-dimensional tabular data, where interpretability and computational efficiency are critical. Existing FS methods often cannot automatically detect the number of attributes required to solve a given task and involve user intervention or model retraining with different feature budgets. Additionally, they either neglect feature relationships (filter methods) or require time-consuming optimization (wrapper methods). To address these limitations, we propose AutoNFS, which combines the FS module based on Gumbel-Sigmoid sampling with a predictive model evaluating the relevance of the selected attributes. The model is trained end-to-end using a differentiable loss and automatically determines the minimal set of features essential to solve a given downstream task. Unlike many wrapper-style approaches, AutoNFS introduces a low and predictable training overhead and avoids repeated model retraining across feature budgets. In practice, the additional cost of the masking module is largely independent of the number of input features (beyond the unavoidable cost of processing the input itself), making the method scalable to high-dimensional tabular data. We evaluate AutoNFS on well-established classification and regression benchmarks as well as real-world metagenomic datasets. The results show that AutoNFS is competitive with, and often improves upon, strong classical and neural FS baselines while selecting fewer features on average across the evaluated benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AutoNFS adds a Gumbel-Sigmoid masking layer trained jointly with the predictor, but the task loss alone gives no direct pressure toward the smallest feature set.

The paper's main move is to wrap a Gumbel-Sigmoid sampler around the input features and train the whole thing end-to-end so the selector and the downstream model optimize together. This removes the need for the user to pick a feature budget in advance or run the model repeatedly at different sizes. That part lines up with a real workflow complaint in tabular work, and the claim that the extra cost stays roughly constant with dimension is worth checking because it would matter for metagenomic or similar high-dim sets. The abstract also says the method ends up with fewer features than the baselines while staying competitive, which would be the practical payoff if it holds.

The stress-test note is on target: the described loss is just the usual predictive objective, so nothing explicitly penalizes extra features. Any set that is good enough satisfies the gradient signal; minimality is not enforced by the math as stated. If the full paper adds a cardinality term or shows that the sampling dynamics reliably prune to the essential subset, that would close the gap; otherwise the “minimal set” language rests on the empirical outcome rather than the objective. The abstract supplies no equations, dataset sizes, error bars, or statistical tests, so the competitive claim and the “fewer features on average” result cannot be assessed yet.

This is straightforward applied work aimed at people who run feature selection on tabular data and want less manual tuning. It engages the standard filter/wrapper distinction without obvious internal contradictions. I would bring the full version to a reading group focused on neural feature selection to see the actual training details and ablations. It is not a foundational result, but the idea is clear enough that a serious editor should send it out for review rather than desk-reject; the experiments and any missing regularizer can be sorted in revision.

Referee Report

1 major / 1 minor

Summary. The paper introduces AutoNFS, which augments a predictor with a Gumbel-Sigmoid masking module trained jointly end-to-end via a differentiable task loss. It claims this automatically identifies the minimal essential feature set for tabular classification and regression without user-specified budgets or repeated retraining, while incurring low overhead independent of input dimensionality and yielding competitive or superior performance with fewer features on standard benchmarks and metagenomic data.

Significance. If the end-to-end procedure reliably enforces minimality, the method would remove a practical barrier in existing wrapper and neural FS approaches by eliminating the need for budget search or multiple trainings, while remaining scalable to high-dimensional inputs.

major comments (1)

[Abstract] The central claim that the model 'automatically determines the minimal set of features essential to solve a given downstream task' and 'select[s] fewer features on average' rests on the Gumbel-Sigmoid module alone; the described objective is solely the task loss, which is satisfied by any sufficient feature set and supplies no explicit pressure (e.g., L0-style or expected-cardinality penalty) toward minimality.
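For readers unfamiliar with the kind of term the referee has in mind, the following sketch shows one possible expected-cardinality penalty. It is an assumption for illustration, not the paper's objective: the penalty form, the weight `lam`, the step size `lr`, and the example logits are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_active(logits):
    """Expected number of open gates if gate i is on with probability sigmoid(logit_i)."""
    return sum(sigmoid(l) for l in logits)


def penalized_loss(task_loss, logits, lam):
    """Task loss plus lam times the expected feature count (assumed form)."""
    return task_loss + lam * expected_active(logits)


def penalty_grad(logits, lam):
    """Gradient of the penalty term with respect to each logit."""
    return [lam * sigmoid(l) * (1.0 - sigmoid(l)) for l in logits]


logits = [2.0, 0.0, -1.0]  # hypothetical selection logits
lam, lr = 0.1, 1.0         # hypothetical penalty weight and step size
before = expected_active(logits)
for _ in range(200):       # descend on the penalty alone, task loss held fixed
    grads = penalty_grad(logits, lam)
    logits = [l - lr * g for l, g in zip(logits, grads)]
after = expected_active(logits)
print(before, after)
```

The penalty gradient is positive for every logit, so without a counteracting task-loss gradient each gate is pushed toward closing; a feature stays selected only when the task loss pushes back harder than `lam`.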

minor comments (1)

The abstract supplies no equations, training details, error bars, dataset sizes, or statistical tests, preventing verification of the performance and feature-count claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the abstract. We respond point-by-point below.

Referee: [Abstract] The central claim that the model 'automatically determines the minimal set of features essential to solve a given downstream task' and 'select[s] fewer features on average' rests on the Gumbel-Sigmoid module alone; the described objective is solely the task loss, which is satisfied by any sufficient feature set and supplies no explicit pressure (e.g., L0-style or expected-cardinality penalty) toward minimality.

Authors: We agree that the objective is the task loss with no explicit L0 or cardinality penalty, so there is no theoretical guarantee of minimality from the loss alone. The Gumbel-Sigmoid module permits the predictor to learn which features can be masked while still achieving low task loss; our experiments demonstrate that the resulting selections are compact (fewer features on average than strong baselines) and that further removal degrades performance. We will revise the abstract to qualify the language (e.g., change 'automatically determines the minimal set' to 'learns a compact feature set sufficient for the task') and add a short discussion paragraph noting the absence of an explicit sparsity term and the empirical nature of the observed minimality. This revision does not change the method or results.

revision: yes

Circularity Check

0 steps flagged

No circularity: new end-to-end module with empirical evaluation

The paper introduces AutoNFS as a novel trainable Gumbel-Sigmoid masking module jointly optimized with a predictor via task loss. Claims of automatic minimal feature selection and competitive performance rest on experimental benchmarks rather than any derivation that reduces outputs to fitted inputs or prior self-citations by construction. No equations or steps equate a 'prediction' to its own training data, and the central method is presented as an independent architectural contribution without load-bearing self-referential uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that Gumbel-Sigmoid sampling yields useful gradients for feature selection.

axioms (1)

Domain assumption: Gumbel-Sigmoid sampling produces usable gradients for discrete feature selection during back-propagation.
Invoked by the description of the FS module trained end-to-end.
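The assumption can be checked on a single gate. The sketch below holds the noise fixed (the reparameterized view) and compares the analytic derivative of the relaxed gate with a finite-difference estimate. The gate formula and all numeric values are illustrative assumptions, not taken from the paper.

```python
import math


def relaxed_gate(logit, noise, temperature):
    """Gumbel-Sigmoid gate with the noise held fixed (reparameterized form)."""
    return 1.0 / (1.0 + math.exp(-(logit + noise) / temperature))


def gate_grad(logit, noise, temperature):
    """Analytic derivative of the gate with respect to the logit."""
    s = relaxed_gate(logit, noise, temperature)
    return s * (1.0 - s) / temperature


logit, noise, temperature = 0.4, -0.7, 0.5  # hypothetical values
eps = 1e-6
numeric = (relaxed_gate(logit + eps, noise, temperature)
           - relaxed_gate(logit - eps, noise, temperature)) / (2 * eps)
analytic = gate_grad(logit, noise, temperature)
print(numeric, analytic)
```

The derivative is nonzero whenever the gate is not saturated, which is what lets the selection logits receive gradient; it shrinks toward zero as the gate approaches 0 or 1, so very low temperatures or extreme logits are where the assumption is most likely to strain.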

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Chen, T., Guestrin, C.: XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 785–794. KDD ’16, ACM (2016)

[2] Cheng, X.: A Comprehensive Study of Feature Selection Techniques in Machine Learning Models. Artificial Intelligence and Digital Technology 1(1), 65–78 (Nov 2024)

[3] Cherepanova, V., Levin, R., Somepalli, G., Geiping, J., Bruss, C.B., Wilson, A.G., Goldstein, T., Goldblum, M.: A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning

[4] Covert, I.C., Qiu, W., Lu, M., Kim, N.Y., White, N.J., Lee, S.I.: Learning to maximize mutual information for dynamic feature selection. In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) Proceedings of the 40th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 202, pp. 6424...

[5] Gorishniy, Y., Rubachev, I., Khrulkov, V., Babenko, A.: Revisiting Deep Learning Models for Tabular Data (Oct 2023)

[6] Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection

[7] Ho, L.S.T., Richardson, N., Tran, G.: Adaptive Group Lasso Neural Network Models for Functions of Few Variables and Time-Dependent Data (Dec 2021)

[8] Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-Softmax (2017)

[9] Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1), 273–324 (Dec 1997)

[10] Lemhadri, I., Ruan, F., Abraham, L., Tibshirani, R.: LassoNet: A Neural Network with Feature Sparsity (Jun 2021)

[11] Maldonado, S., Weber, R.: A wrapper method for feature selection using Support Vector Machines. Information Sciences 179(13), 2208–2217 (Jun 2009)

[12] Pasolli, E., Schiffer, L., Manghi, P., Renson, A., Obenchain, V., Truong, D.T., Beghini, F., Malik, F., Ramos, M., Dowd, J.B., Huttenhower, C., Morgan, M., Segata, N., Waldron, L.: Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14(11), 1023–1024 (Oct 2017)

[13] Quinzan, F., Khanna, R., Hershcovitch, M., Cohen, S., Waddington, D., Friedrich, T., Mahoney, M.W.: Fast feature selection with fairness constraints. In: Ruiz, F., Dy, J., van de Meent, J.W. (eds.) Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 206, pp. 7800–7823. ...

[14] Śmieja, M., Warszycki, D., Tabor, J., Bojarski, A.J.: Asymmetric clustering index in a case study of 5-HT1A receptor ligands. PLoS ONE 9(7), e102069 (2014)

[15] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288 (1996)

[16] Yasuda, T., Bateni, M., Chen, L., Fahrbach, M., Fu, G., Mirrokni, V.: (Apr 2023), arXiv:2209.14881 [cs]

[17] Yu, L., Liu, H.: Efficient Feature Selection via Analysis of Relevance and Redundancy

[18] Zou, H., Hastie, T.: Regularization and Variable Selection Via the Elastic Net. Journal of the Royal Statistical Society Series B: Statistical Methodology 67(2), 301–320 (Mar 2005)
