Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-11 01:58 UTC · model grok-4.3
The pith
Sample-wise activation patterns yield a zero-shot score that ranks neural networks by their eventual trained performance across CNNs and Transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWAP-Score, computed directly from the sample-wise activation patterns that arise when a network processes an unlabeled mini-batch, serves as a training-free predictor of a network's performance after full supervised training. The same formulation applies to both convolutional and transformer architectures and produces higher rank correlation with ground-truth accuracy than prior zero-shot proxies on CIFAR-10, ImageNet, and GLUE tasks.
What carries the argument
Sample-Wise Activation Patterns (SWAP), which record per-sample binary activation states across layers to quantify how distinctly a network separates the inputs in a mini-batch.
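A minimal sketch of what such a score could look like, assuming (per the abstract and the Definition III.3 summary linked below) that the score is the number of distinct sample-wise binarized activation patterns produced by one unlabeled mini-batch. The function name `swap_score` and the choice to hook ReLU/GELU outputs are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a SWAP-style score: count distinct per-sample binary activation
# patterns over one unlabeled mini-batch. Illustrative only.
import torch
import torch.nn as nn

def swap_score(model: nn.Module, batch: torch.Tensor) -> int:
    """Number of distinct binarized activation patterns across the batch."""
    patterns = []  # one (N, features) boolean tensor per hooked activation layer

    def hook(_module, _inputs, output):
        # 1 where the unit fires (output > 0), flattened to one row per sample.
        patterns.append((output.detach() > 0).flatten(start_dim=1))

    handles = [m.register_forward_hook(hook)
               for m in model.modules()
               if isinstance(m, (nn.ReLU, nn.GELU))]
    try:
        model.eval()
        with torch.no_grad():
            model(batch)
    finally:
        for h in handles:
            h.remove()

    # One concatenated binary code per sample; the score is how many of the
    # samples receive distinct codes (more distinct codes = more expressive).
    codes = torch.cat(patterns, dim=1).to(torch.uint8)
    return torch.unique(codes, dim=0).shape[0]
```

On this reading the score is bounded above by the mini-batch size, so batch choice directly caps the metric's resolution, which is one reason the mini-batch ablations raised in the referee exchange below matter.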
If this is right
- Networks can be ranked for Neural Architecture Search using only minutes of GPU time on CIFAR-10 and ImageNet while still reaching competitive final accuracy (a screening sketch follows this list).
- Language models can be screened for downstream-task potential during pre-training because the metric does not require task labels.
- A single formulation works for both CNNs and Transformers, removing the need to maintain separate zero-shot proxies for each family.
- The metric supplies a concrete numerical signal for how much a network distinguishes individual samples, which can be inspected layer by layer.
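To make the NAS-screening implication concrete, here is a hypothetical screening loop under the same assumptions: `candidate_specs`, `build_model`, and `top_k` are stand-ins for a search space and are not the paper's SWAP-NAS procedure, which the abstract characterizes only by its runtime (roughly 6 and 9 GPU-minutes on CIFAR-10 and ImageNet).

```python
# Hypothetical screening loop: score each untrained candidate on one unlabeled
# mini-batch and keep the highest-scoring architectures for full training.
# `candidate_specs` and `build_model` are placeholders for a NAS search space;
# `swap_score` is the sketch given earlier in this review.
def screen_candidates(candidate_specs, build_model, batch, top_k=5):
    """Return the top_k candidates ranked by zero-shot score, best first."""
    scored = [(spec, swap_score(build_model(spec), batch)) for spec in candidate_specs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```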
Where Pith is reading between the lines
- If activation diversity on random samples is the dominant driver, then simple data-augmentation strategies that increase sample distinctness might raise SWAP-Score without changing the architecture.
- The same patterns could be monitored during training to detect when a network has stopped gaining new expressivity and training can be stopped early.
- Because the computation is label-independent, the score might serve as a quick filter for transfer-learning candidates before any fine-tuning begins.
Load-bearing premise
The activation patterns produced by an unlabeled mini-batch already encode enough information about a network's overall expressivity to forecast its accuracy after complete training on a specific downstream dataset.
What would settle it
A new architecture family or dataset in which the SWAP-Score ordering of networks disagrees sharply with the ordering obtained after full training and validation.
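A minimal sketch of that settling test, assuming the comparison is made with Spearman's rank correlation as in the abstract; the score and accuracy lists below are made-up placeholders, not reported results.

```python
# Sketch of the settling test: does the zero-shot ordering of a candidate pool
# agree with the ordering by validation accuracy after full training?
from scipy.stats import spearmanr

def rank_agreement(zero_shot_scores, trained_accuracies):
    """Spearman's rho (and p-value) between zero-shot scores and accuracies."""
    rho, p_value = spearmanr(zero_shot_scores, trained_accuracies)
    return rho, p_value

# Placeholder numbers: a rho near the paper's 0.93 would support the claim;
# a near-zero or negative rho on a new architecture family would undercut it.
rho, p = rank_agreement([51, 64, 58, 70, 47], [88.1, 92.3, 90.0, 93.5, 86.7])
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```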
Original abstract
Zero-shot proxies, also known as training-free metrics, are widely adopted to reduce the computational overhead in neural network evaluation for scenarios such as Neural Architecture Search (NAS), as they do not require any training. Existing zero-shot metrics have several limitations, including weak correlation with the true performance and poor generalisation across different networks or downstream tasks. For example, most of these metrics apply only to either convolutional neural networks (CNNs) or Transformers, but not both. To address these limitations, we propose Sample-Wise Activation Patterns (SWAP), and its derivative, SWAP-Score, a novel and highly effective zero-shot metric. SWAP-Score is broadly applicable across both architecture families and task domains, demonstrating strong predictive performance in the majority of tasks. This metric measures the expressivity of neural networks over a mini-batch of samples, showing a high correlation with the neural networks' ground-truth performance. For both CNNs and Transformers, the SWAP-Score outperforms existing zero-shot metrics across computer vision and natural language processing tasks. For instance, Spearman's correlation coefficient between the SWAP-Score and CIFAR-10 validation accuracy for DARTS CNNs is 0.93, and 0.71 for FlexiBERT Transformers on GLUE tasks. Moreover, SWAP-Score is label-independent, hence can be applied at the pre-training stage of language models to estimate their performance for downstream tasks. When applied to NAS, SWAP-empowered NAS, SWAP-NAS can achieve competitive performance using only approximately 6 and 9 minutes of GPU time, on CIFAR-10 and ImageNet respectively. Our code is available at: https://github.com/pym1024/SWAP_Universal
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Sample-Wise Activation Patterns (SWAP) and its derivative SWAP-Score as a novel zero-shot, label-independent metric for neural network evaluation. It claims that activation patterns computed on an unlabeled mini-batch capture network expressivity, yielding high Spearman correlations with post-training accuracy (e.g., 0.93 for DARTS CNNs on CIFAR-10 and 0.71 for FlexiBERT Transformers on GLUE), outperforming prior zero-shot proxies across CNNs and Transformers in CV and NLP, and enabling competitive NAS in minutes of GPU time.
Significance. A reliable, architecture- and task-agnostic zero-shot metric would be a substantial contribution to NAS and model selection, as it could eliminate training costs while maintaining predictive power; the reported NAS timings (6 min on CIFAR-10, 9 min on ImageNet) and cross-domain applicability illustrate the potential practical impact if the correlations prove robust.
major comments (3)
- [Abstract] The stated correlations (0.93 on DARTS/CIFAR-10, 0.71 on FlexiBERT/GLUE) and the outperformance claims are presented without any definition of how SWAP is computed (which layers, how activation patterns are encoded or aggregated, mini-batch size), which samples are used, or any statistical tests or controls, making it impossible to evaluate whether the numbers support the expressivity claim.
- [Abstract] No ablations or sensitivity analyses are referenced for mini-batch choice or data distribution, which is load-bearing for the central claim that SWAP-Score measures intrinsic network expressivity rather than alignment with the particular unlabeled samples; without such checks the reported correlations could be inflated by implicit data overlap.
- [Abstract] The claim that SWAP-Score is 'broadly applicable across both architecture families and task domains' and 'outperforms existing zero-shot metrics' requires explicit baseline comparisons and quantitative tables in the results; the abstract alone supplies no such evidence.
minor comments (1)
- The abstract states that code is available at a GitHub link; ensure the repository contains the exact implementation used for the reported numbers to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. The comments correctly identify that the abstract is highly condensed and would benefit from additional context to make the central claims more evaluable on their own. We will revise the abstract to incorporate brief definitions, robustness notes, and references to the quantitative comparisons while preserving its length. Point-by-point responses are provided below.
Point-by-point responses
- Referee: [Abstract] The stated correlations (0.93 on DARTS/CIFAR-10, 0.71 on FlexiBERT/GLUE) and the outperformance claims are presented without any definition of how SWAP is computed (which layers, how activation patterns are encoded or aggregated, mini-batch size), which samples are used, or any statistical tests or controls, making it impossible to evaluate whether the numbers support the expressivity claim.
Authors: We agree that the abstract does not define the computation of SWAP-Score. The full methodological details—including the layers from which activations are taken, the binary encoding and aggregation of sample-wise patterns, the unlabeled mini-batch size and sampling procedure, and the statistical controls—are provided in Section 3 of the manuscript. To address the concern, we will revise the abstract to include a concise, high-level definition of how SWAP is computed and note that statistical significance is reported via Spearman correlations with associated p-values. revision: yes
- Referee: [Abstract] No ablations or sensitivity analyses are referenced for mini-batch choice or data distribution, which is load-bearing for the central claim that SWAP-Score measures intrinsic network expressivity rather than alignment with the particular unlabeled samples; without such checks the reported correlations could be inflated by implicit data overlap.
Authors: This observation is fair; the abstract does not reference sensitivity checks. The manuscript contains ablations on mini-batch size and data distribution in Section 4.3 that demonstrate stable correlations across reasonable ranges of batch sizes and sampling strategies, supporting the expressivity interpretation. We will add a short clause to the revised abstract referencing these robustness results. revision: yes
- Referee: [Abstract] The claim that SWAP-Score is 'broadly applicable across both architecture families and task domains' and 'outperforms existing zero-shot metrics' requires explicit baseline comparisons and quantitative tables in the results; the abstract alone supplies no such evidence.
Authors: We accept that the abstract summarizes the outperformance without citing the supporting tables. Section 4 and Tables 1–3 provide the explicit comparisons to prior zero-shot proxies across CNN and Transformer benchmarks in both CV and NLP, with the reported Spearman values. We will revise the abstract to briefly note the outperformance relative to baselines and direct readers to the quantitative results. revision: yes
Circularity Check
No circularity: SWAP-Score is constructed from activation patterns alone, independently of the accuracies it is validated against
Full rationale
The metric is constructed solely from sample-wise activation patterns over an unlabeled mini-batch, with no reference to downstream accuracy values or labels in its definition. Reported correlations (e.g., 0.93 on DARTS/CIFAR-10) are computed afterward as external validation, not embedded in the construction. No equations reduce the score to a fit on target performance, no self-citations bear the central claim, and the derivation does not rename or smuggle in prior results by construction. The chain remains self-contained against the stated inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · Definition III.3 (SWAP-Score Ψ): cardinality of the set of sample-wise binarized activation vectors Â_N,θ
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · Section III-A: linear regions and expressivity for ReLU/GELU networks