arxiv: 2605.07256 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords transformer architecture searchLoRAmixture of expertsfeature collapsevision transformersneural architecture searchparameter-efficient adaptation

0 comments

The pith

TAS-LoRA equips transformer architecture search with a mixture of LoRA experts so that subnets learn distinct features instead of collapsing to shared representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision transformer architecture search builds a supernet from which many candidate subnets can be sampled, yet shared weights cause the subnets to converge on nearly identical features. TAS-LoRA counters this by attaching low-rank adaptation modules that are treated as experts and routed on the fly according to each subnet's architecture. A lightweight router network makes the assignment, and a group-wise initialization step ensures the experts start learning different directions before training proceeds. The resulting subnets therefore extract architecture-specific representations while the search itself remains computationally light. Experiments on ImageNet and multiple transfer-learning datasets show clear accuracy gains over prior TAS baselines.

Core claim

TAS-LoRA mitigates feature collapse in transformer architecture search by introducing parameter-efficient LoRA modules organized as a Mixture-of-LoRAExperts. A lightweight router dynamically selects which expert to apply based on the sampled subnet architecture, and a group-wise router initialization encourages early diversity among the experts. This combination allows each subnet to learn its own features despite weight sharing, producing higher-performing architectures on ImageNet classification and on transfer tasks including CIFAR-10/100, Flowers, Cars, and iNat-19.

What carries the argument

Mixture-of-LoRAExperts (MoLE) router that assigns specialized low-rank adaptation modules to subnets according to their architectures, augmented by group-wise initialization to promote expert diversity.

If this is right

Each sampled subnet extracts architecture-specific features, raising its standalone accuracy without increasing inference cost.
The search remains efficient because LoRA adds only a small number of trainable parameters per expert.
Gains observed on ImageNet translate directly to improved transfer performance on smaller image-classification benchmarks.
The approach can be grafted onto existing supernet-based TAS pipelines without redesigning the search algorithm itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same router-plus-expert pattern could be tested in convolutional or hybrid supernets to see whether feature collapse is equally alleviated outside pure transformers.
Increasing the number of experts or making the router architecture-aware at multiple depths might further separate the learned representations.
If the initialization technique proves critical, similar grouping strategies could be applied to other mixture-of-experts modules in large-scale model training.

Load-bearing premise

The router will actually route different experts to different subnets in a manner that produces genuinely distinct features rather than allowing the supernet to absorb the extra parameters without changing its collapsed behavior.

What would settle it

Train an ablation in which the router is replaced by a fixed or random assignment of the same LoRA experts; if the resulting subnets still exhibit high feature similarity and no accuracy gain, the dynamic routing mechanism is not responsible for the claimed improvement.

Figures

Figures reproduced from arXiv: 2605.07256 by Bumsub Ham, Hyunju Lee, Jeimin Jeon.

**Figure 1.** Figure 1: Feature similarities between subnets trained with different strategies. Six subnets are randomly sampled from each supernet, and [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Left: An overview of TAS-LoRA. We exploit a MoLE for TAS to learn subnet-specific feature representations effectively and efficiently. The router dynamically assigns expert weights to each subnet based on its architectural properties. Right: Illustration of our router design. Both block-level and subnet-level attributes are processed by a learnable block embedding layer, and passed through an LSTM [19], wh… view at source ↗

**Figure 3.** Figure 3: Comparison of router initialization strategies. (a) Ran [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 5.** Figure 5: Cosine similarities of features from LoRA experts in the [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing human effort to manually design ViTs. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning, while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRAExperts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA mitigates feature collapse effectively, improving performance over state-of-the-art TAS methods significantly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TAS-LoRA layers routed LoRA experts onto a ViT supernet to tackle feature collapse, but without checks on whether the router actually diversifies features the gains may just come from added capacity.

read the letter

The core move is to take an existing supernet for transformer architecture search and add a small set of LoRA experts whose selection is controlled by a lightweight router that sees the subnet architecture. They also initialize the router in groups to push early diversity. That specific combination has not been tried in the TAS literature they cite, so the integration itself is the new piece. The experiments report better ImageNet accuracy and stronger transfer results on CIFAR, Flowers, Cars, and iNat than prior TAS baselines, which is the main empirical support they offer. The approach keeps the supernet memory footprint reasonable because LoRA is low-rank, so the efficiency claim holds up on paper. The soft spot is the missing link between the router and the claimed reduction in collapse. There is no reported measurement of expert activation per architecture, no cosine-similarity check across subnet features, and no ablation that holds the total LoRA parameters fixed while turning the router off or randomizing it. If performance stays similar under those controls, the improvement is explained by extra trainable parameters rather than architecture-specific feature learning. The abstract also gives no error bars or run-to-run variance, which makes it harder to judge how reliable the reported margins are. This paper sits inside the narrow NAS-for-ViT community. A reader already working on supernet training or parameter-efficient adaptation for search would find the router-plus-group-init trick worth testing, but it does not change the broader picture of how we do architecture search. The idea is coherent enough and the problem it targets is real, so it should go to peer review; the referees can ask for the missing ablations and variance numbers.

Referee Report

3 major / 2 minor

Summary. The paper proposes TAS-LoRA for vision transformer architecture search. It augments a supernet with parameter-efficient LoRA modules and a Mixture-of-LoRAExperts (MoLE) router that conditions expert assignment on subnet architecture, together with a group-wise router initialization scheme. The central claim is that this combination mitigates feature collapse (subnets failing to learn distinct representations under shared weights), yielding substantial accuracy gains over prior TAS methods on ImageNet and transfer tasks (CIFAR-10/100, Flowers, CARS, INAT-19).

Significance. If the mechanism is verified, the work would be significant for supernet-based NAS: it offers a lightweight, trainable way to encourage architecture-specific representations without inflating inference cost. The MoLE-plus-initialization design is a concrete, reproducible idea that could be adopted in other weight-sharing NAS pipelines for transformers.

major comments (3)

[Abstract, §4] Abstract and §4 (Experiments): the claim that TAS-LoRA 'mitigates feature collapse effectively' is unsupported by any reported measurement of collapse (e.g., inter-subnet feature cosine similarity, expert activation histograms conditioned on architecture, or diversity metrics). Without these, it is impossible to confirm that the router produces subnet-specific features rather than simply adding capacity.
[§3, §4.2] §3 (Method) and §4.2 (Ablations): no ablation isolates the router's contribution from the mere addition of multiple LoRA experts. An experiment with random or uniform expert assignment (keeping total LoRA parameters fixed) is required to test whether the architecture-conditioned routing is load-bearing for the reported gains.
[§4] §4 (Experiments): results are presented without error bars, multiple random seeds, or a full experimental protocol (hyper-parameters for router training, supernet sampling strategy, and exact transfer-learning fine-tuning settings). This prevents assessment of statistical reliability and reproducibility of the claimed improvements over SOTA TAS baselines.

minor comments (2)

[§3] Notation for the router output probabilities and the group-wise initialization is introduced without an accompanying equation or pseudocode block, making the precise initialization procedure hard to replicate.
[§4] Table captions and axis labels in the experimental figures should explicitly state the number of subnets evaluated and whether the reported numbers are top-1 accuracy on the validation or test split.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed review and valuable suggestions. We appreciate the opportunity to clarify and strengthen our work on TAS-LoRA. Below, we provide point-by-point responses to the major comments.

read point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (Experiments): the claim that TAS-LoRA 'mitigates feature collapse effectively' is unsupported by any reported measurement of collapse (e.g., inter-subnet feature cosine similarity, expert activation histograms conditioned on architecture, or diversity metrics). Without these, it is impossible to confirm that the router produces subnet-specific features rather than simply adding capacity.

Authors: We thank the referee for pointing this out. While the performance gains on ImageNet and transfer tasks provide indirect evidence for the mitigation of feature collapse, we agree that direct measurements would strengthen the claim. In the revised manuscript, we will include analyses such as expert activation histograms conditioned on architecture and inter-subnet feature diversity metrics to better support the mechanism. revision: yes
Referee: [§3, §4.2] §3 (Method) and §4.2 (Ablations): no ablation isolates the router's contribution from the mere addition of multiple LoRA experts. An experiment with random or uniform expert assignment (keeping total LoRA parameters fixed) is required to test whether the architecture-conditioned routing is load-bearing for the reported gains.

Authors: We acknowledge the need for this ablation study. To isolate the contribution of the architecture-conditioned router, we will add an experiment comparing our MoLE router against random and uniform expert assignments, while keeping the total number of LoRA parameters constant. This will be included in the revised §4.2. revision: yes
Referee: [§4] §4 (Experiments): results are presented without error bars, multiple random seeds, or a full experimental protocol (hyper-parameters for router training, supernet sampling strategy, and exact transfer-learning fine-tuning settings). This prevents assessment of statistical reliability and reproducibility of the claimed improvements over SOTA TAS baselines.

Authors: We apologize for not including these details in the initial submission. In the revised version, we will report all results with error bars from multiple random seeds (e.g., 3-5 runs), and provide a comprehensive experimental protocol including hyper-parameters for router training, supernet sampling, and transfer learning settings to ensure reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method proposal with no derivation chain

full rationale

The paper introduces TAS-LoRA as an architectural modification to existing supernet-based TAS, using LoRA experts and a lightweight MoLE router with group-wise initialization. No equations, derivations, or first-principles predictions are present that could reduce to inputs by construction. Performance claims rest entirely on experimental results across ImageNet and transfer benchmarks, which are independent falsifiable measurements rather than self-referential fits or renamings. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted constants, or new physical entities are described; the contribution is an algorithmic combination of existing techniques.

pith-pipeline@v0.9.0 · 5481 in / 994 out tokens · 34316 ms · 2026-05-11T01:30:09.691059+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights... TAS-LoRA incorporates a Mixture-of-LoRAExperts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

group-wise router initialization... biases each group toward the designated expert, encouraging exploiting diverse experts early in training

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 3 internal anchors

[1]

Understanding and simplifying one-shot architecture search

Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. InICML, pages 550–559, 2018. 2

work page 2018
[2]

ProxylessNAS: Direct neural architecture search on target task and hardware

Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019

work page 2019
[3]

Once-for-All: Train one network and specialize it for efficient deployment

Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-All: Train one network and specialize it for efficient deployment. InICLR, 2020. 2

work page 2020
[4]

Glit: Neural architecture search for global and local image transformer

Boyu Chen, Peixia Li, Chuming Li, Baopu Li, Lei Bai, Chen Lin, Ming Sun, Junjie Yan, and Wanli Ouyang. Glit: Neural architecture search for global and local image transformer. In ICCV, 2021. 6, 7

work page 2021
[5]

AutoFormer: Searching transformers for visual recognition

Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. AutoFormer: Searching transformers for visual recognition. InICCV, pages 12270–12280, 2021. 1, 2, 3, 4, 5, 6, 7, 8, 12, 13

work page 2021
[6]

Dearkd: data-efficient early knowledge distillation for vision transformers

Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, and Dacheng Tao. Dearkd: data-efficient early knowledge distillation for vision transformers. InCVPR,

work page
[7]

Empirical evaluation of gated recurrent neural networks on sequence modeling.NeurIPSW, 2014

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling.NeurIPSW, 2014. 11

work page 2014
[8]

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y . Wu, Zhenda Xie, Y . K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeek- MoE: Towards ultimate expert specialization in mixture-of- experts language models.CoRR, abs/2401.06066, 2024. 5

work page internal anchor Pith review arXiv 2024
[9]

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. InCVPR, pages 248–255, 2009. 2, 6, 8, 12

work page 2009
[10]

Understanding and exploring the network with stochastic architectures

Zhijie Deng, Yinpeng Dong, Shifeng Zhang, and Jun Zhu. Understanding and exploring the network with stochastic architectures. InNeurIPS, 2020. 1, 3

work page 2020
[11]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. InICLR, 2021. 1, 6, 7

work page 2021
[12]

Convit: Improving vision transformers with soft convolutional inductive biases

St´ephane d’Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, and Levent Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. InICML, 2021. 6

work page 2021
[13]

Buehler E.L

M.J. Buehler E.L. Buehler. X-lora: Mixture of low-rank adapter experts, a flexible framework for large language mod- els with applications in protein mechanics and design.ArXiv,

work page
[14]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.JMLR, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.JMLR, 2022. 5

work page 2022
[15]

Higher layers need more lora experts.arXiv preprint arXiv:2402.08562,

Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie Yang, and VS Subrahmanian. Higher layers need more lora experts.arXiv preprint arXiv:2402.08562, 2024. 13

work page arXiv 2024
[16]

A comparative anal- ysis of selection schemes used in genetic algorithms

David E Goldberg and Kalyanmoy Deb. A comparative anal- ysis of selection schemes used in genetic algorithms. In Foundations of genetic algorithms. 1991. 5

work page 1991
[17]

Transformer in transformer.NeurIPS, 2021

Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer.NeurIPS, 2021. 6

work page 2021
[18]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InCVPR, pages 770–778, 2016. 1

work page 2016
[19]

Long short-term memory.Neural computation, 1997

Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term memory.Neural computation, 1997. 4, 5, 11

work page 1997
[20]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco An- dreetto, and Hartwig Adam. MobileNets: Efficient convolu- tional neural networks for mobile vision applications.arXiv preprint arXiv:1704.04861, 2017. 1

work page internal anchor Pith review arXiv 2017
[21]

Lora: Low-rank adaptation of large language models.ICLR, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 2022. 2

work page 2022
[22]

Ganesh Jawahar, Haichuan Yang, Yunyang Xiong, Zechun Liu, Dilin Wang, Fei Sun, Meng Li, Aasish Pappu, Barlas Oguz, Muhammad Abdul-Mageed, Laks V . S. Lakshmanan, Raghuraman Krishnamoorthi, and Vikas Chandra. Mixture- of-supernets: Improving weight-sharing supernet training with architecture-routed mixture-of-experts. InACL, 2024. 2

work page 2024
[23]

Subnet- aware dynamic supernet training for neural architecture search

Jeimin Jeon, Youngmin Oh, Junghyup Lee, Donghyeon Baek, Dohyung Kim, Chanho Eom, and Bumsub Ham. Subnet- aware dynamic supernet training for neural architecture search. InCVPR, 2025. 2, 4, 6, 7

work page 2025
[24]

Similarity of neural network representations revisited

Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. InICML, 2019. 5

work page 2019
[25]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCVW, 2013. 2, 6, 7, 8

work page 2013
[26]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009. 2, 6, 7, 8

work page 2009
[27]

AZ-NAS: Assembling zero- cost proxies for network architecture search

Junghyup Lee and Bumsub Ham. AZ-NAS: Assembling zero- cost proxies for network architecture search. InCVPR, 2024. 1, 2 9

work page 2024
[28]

Gshard: Scaling giant models with conditional computation and automatic sharding, 2020

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding, 2020. 5

work page 2020
[29]

DARTS: Differentiable architecture search

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. InICLR, 2019. 2

work page 2019
[30]

Focusformer: Focus- ing on what we need via architecture sampler.arXiv preprint arXiv:2208.10861, 2022

Jing Liu, Jianfei Cai, and Bohan Zhuang. Focusformer: Focus- ing on what we need via architecture sampler.arXiv preprint arXiv:2208.10861, 2022. 1, 2, 3, 6, 7

work page arXiv 2022
[31]

Dora: Weight-decomposed low-rank adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. InICML, 2024. 2

work page 2024
[32]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021. 1

work page 2021
[33]

Decoupled weight decay regularization.ICLR, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.ICLR, 2019. 6, 7

work page 2019
[34]

Pissa: Prin- cipal singular values and singular vectors adaptation of large language models.NeurIPS, 2025

Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Prin- cipal singular values and singular vectors adaptation of large language models.NeurIPS, 2025. 2

work page 2025
[35]

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. 7

work page internal anchor Pith review arXiv 2013
[36]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing. IEEE. 2, 6, 7, 8

work page
[37]

Efficient few-shot neural architecture search by counting the number of nonlinear functions

Youngmin Oh, Hyunju Lee, and Bumsub Ham. Efficient few-shot neural architecture search by counting the number of nonlinear functions. InAAAI, 2025. 2

work page 2025
[38]

Pi-nas: Improving neural archi- tecture search by reducing supernet training consistency shift

Jiefeng Peng, Jiqi Zhang, Changlin Li, Guangrun Wang, Xi- aodan Liang, and Liang Lin. Pi-nas: Improving neural archi- tecture search by reducing supernet training consistency shift. InICCV, 2021. 3

work page 2021
[39]

Glove: Global vectors for word representation

Jeffrey Pennington, Richard Socher, and Christopher D Man- ning. Glove: Global vectors for word representation. In EMNLP, 2014. 7

work page 2014
[40]

MobileNetV2: Inverted residuals and linear bottlenecks

Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh- moginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. InCVPR, pages 4510–4520,

work page
[41]

Outra- geously large neural networks: The sparsely-gated mixture- of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outra- geously large neural networks: The sparsely-gated mixture- of-experts layer. InICLR, 2017. 5

work page 2017
[42]

Vitas: Vision transformer architecture search

Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, and Chang Xu. Vitas: Vision transformer architecture search. InECCV,

work page
[43]

Unleashing the power of gradient signal-to-noise ratio for zero-shot NAS

Zihao Sun, Yu Sun, Longxing Yang, Shun Lu, Jilin Mei, Wenxiao Zhao, and Yu Hu. Unleashing the power of gradient signal-to-noise ratio for zero-shot NAS. InCVPR, pages 5763–5773, 2023. 2

work page 2023
[44]

EfficientNet: Rethinking model scaling for convolutional neural networks

Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. InICML, pages 6105–6114, 2019. 1

work page 2019
[45]

Training data-efficient image transformers & distillation through atten- tion

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herv´e J´egou. Training data-efficient image transformers & distillation through atten- tion. InICML, 2021. 1, 6, 7

work page 2021
[46]

The inaturalist species classification and detection dataset

Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. InCVPR, 2018. 2, 6, 7, 8

work page 2018
[47]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InNeurIPS, 2017. 1

work page 2017
[48]

Prenas: Preferred one-shot learning towards efficient neural architec- ture search

Haibin Wang, Ce Ge, Hesen Chen, and Xiuyu Sun. Prenas: Preferred one-shot learning towards efficient neural architec- ture search. InICML, 2023. 2, 3, 6, 7

work page 2023
[49]

Pyra- mid vision transformer: A versatile backbone for dense pre- diction without convolutions

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyra- mid vision transformer: A versatile backbone for dense pre- diction without convolutions. InICCV, pages 568–578, 2021. 6

work page 2021
[50]

Auto- prox: Training-free vision transformer architecture search via automatic proxy discovery

Zimian Wei, Peijie Dong, Zheng Hui, Anggeng Li, Lujun Li, Menglong Lu, Hengyue Pan, and Dongsheng Li. Auto- prox: Training-free vision transformer architecture search via automatic proxy discovery. InAAAI, 2024. 1, 6

work page 2024
[51]

Mixture of loRA experts

Xun Wu, Shaohan Huang, and Furu Wei. Mixture of loRA experts. InICLR, 2024. 3

work page 2024
[52]

Vitae: Vision transformer advanced by exploring intrinsic inductive bias.NeurIPS, 2021

Yufei Xu, Qiming Zhang, Jing Zhang, and Dacheng Tao. Vitae: Vision transformer advanced by exploring intrinsic inductive bias.NeurIPS, 2021. 7

work page 2021
[53]

Multi-task dense prediction via mixture of low-rank experts

Yuqi Yang, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, and Bo Li. Multi-task dense prediction via mixture of low-rank experts. InCVPR, 2024. 3

work page 2024
[54]

Adalora: Adaptive budget allocation for parameter- efficient fine-tuning.ICLR, 2023

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adalora: Adaptive budget allocation for parameter- efficient fine-tuning.ICLR, 2023. 3, 13

work page 2023
[55]

Few-shot neural architecture search

Yiyang Zhao, Linnan Wang, Yuandong Tian, Rodrigo Fon- seca, and Tian Guo. Few-shot neural architecture search. In ICML, pages 12707–12718, 2021. 2

work page 2021
[56]

Training-free transformer architecture search

Qinqin Zhou, Kekai Sheng, Xiawu Zheng, Ke Li, Xing Sun, Yonghong Tian, Jie Chen, and Rongrong Ji. Training-free transformer architecture search. InCVPR, pages 10894– 10903, 2022. 2, 6 10 TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts Supplementary Material In this supplementary material, we provide more detailed analyses on the des...

work page 2022