TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts
Pith reviewed 2026-05-11 01:30 UTC · model grok-4.3
Recognition: 2 theorem links
The pith
TAS-LoRA equips transformer architecture search with a mixture of LoRA experts so that subnets learn distinct features instead of collapsing to shared representations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TAS-LoRA mitigates feature collapse in transformer architecture search by introducing parameter-efficient LoRA modules organized as a Mixture-of-LoRA Experts. A lightweight router dynamically selects which expert to apply based on the sampled subnet architecture, and a group-wise router initialization encourages early diversity among the experts. This combination allows each subnet to learn its own features despite weight sharing, producing higher-performing architectures on ImageNet classification and on transfer tasks including CIFAR-10/100, Flowers, Cars, and iNat-19.
What carries the argument
A Mixture-of-LoRA Experts (MoLE) router that assigns specialized low-rank adaptation modules to subnets according to their architectures, augmented by group-wise initialization to promote expert diversity.
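The mechanism can be illustrated with a minimal sketch. This is not the paper's implementation: the 3-d architecture encoding, the router weights `W_route`, and the hard top-1 selection are all assumptions made for illustration, and the supernet is reduced to a single shared linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, E = 16, 4, 3              # hidden dim, LoRA rank, number of experts

# One shared supernet weight plus E low-rank expert deltas (B_e @ A_e).
W = rng.standard_normal((D, D)) * 0.02
A = rng.standard_normal((E, R, D)) * 0.02   # expert down-projections
B = np.zeros((E, D, R))                     # up-projections start at zero, as in standard LoRA

# Hypothetical router: a linear map from a 3-d architecture encoding
# (say, normalized depth / embed dim / head count) to expert logits.
W_route = rng.standard_normal((E, 3))

def route(arch_vec):
    """Pick the expert for this subnet via hard top-1 on the logits."""
    return int(np.argmax(W_route @ arch_vec))

def subnet_forward(x, arch_vec):
    """Apply the shared weight plus the routed subnet-specific LoRA delta."""
    e = route(arch_vec)
    return x @ (W + B[e] @ A[e]).T

x = rng.standard_normal((2, D))
y = subnet_forward(x, np.array([0.5, 1.0, 0.25]))   # shape (2, 16)
```

Because different architecture encodings can select different experts, two subnets sharing `W` still apply distinct deltas, which is the collapse-mitigation mechanism the review describes.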
If this is right
- Each sampled subnet extracts architecture-specific features, raising its standalone accuracy without increasing inference cost.
- The search remains efficient because LoRA adds only a small number of trainable parameters per expert.
- Gains observed on ImageNet translate directly to improved transfer performance on smaller image-classification benchmarks.
- The approach can be grafted onto existing supernet-based TAS pipelines without redesigning the search algorithm itself.
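The parameter-efficiency claim in the second bullet is easy to check with back-of-the-envelope arithmetic. The dimensions below (a 768-wide square projection, rank-8 LoRA, four experts) are hypothetical, not taken from the paper:

```python
# Hypothetical sizes: a 768-wide square projection, rank-8 LoRA, 4 experts.
D, r, n_experts = 768, 8, 4

full_params = D * D                      # fine-tuning the dense weight directly
per_expert = 2 * D * r                   # A is (r x D), B is (D x r)
total_lora = n_experts * per_expert

ratio = total_lora / full_params         # fraction of the dense weight's size
```

Under these assumptions, even four experts together cost one twelfth of the dense weight, which is why adding a mixture of experts need not blow up the search cost.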
Where Pith is reading between the lines
- The same router-plus-expert pattern could be tested in convolutional or hybrid supernets to see whether feature collapse is equally alleviated outside pure transformers.
- Increasing the number of experts or making the router architecture-aware at multiple depths might further separate the learned representations.
- If the initialization technique proves critical, similar grouping strategies could be applied to other mixture-of-experts modules in large-scale model training.
Load-bearing premise
The router will actually route different experts to different subnets in a manner that produces genuinely distinct features rather than allowing the supernet to absorb the extra parameters without changing its collapsed behavior.
What would settle it
Train an ablation in which the router is replaced by a fixed or random assignment of the same LoRA experts; if the resulting subnets still exhibit high feature similarity and no accuracy gain, the dynamic routing mechanism is not responsible for the claimed improvement.
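That ablation hinges on a diagnostic for feature similarity across subnets. A minimal sketch of one such diagnostic, mean pairwise cosine similarity over per-subnet pooled features; the "collapsed" and "diverse" feature sets here are fabricated for illustration, not measured from any model:

```python
import numpy as np

def mean_pairwise_cosine(feats):
    """Mean cosine similarity over all subnet pairs; feats is (n_subnets, dim)."""
    F = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    iu = np.triu_indices(len(F), k=1)
    return float((F @ F.T)[iu].mean())

rng = np.random.default_rng(1)
base = rng.standard_normal(32)

# Synthetic stand-ins: "collapsed" subnets share one representation up to noise,
# "diverse" subnets have independent representations.
collapsed = np.stack([base + 1e-3 * rng.standard_normal(32) for _ in range(5)])
diverse = rng.standard_normal((5, 32))
```

Running both the routed model and the fixed/random-assignment ablation through the same diagnostic would show whether routing, and not merely the extra parameters, is what drives similarity down.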
Original abstract
Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing human effort to manually design ViTs. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning, while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRA Experts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA mitigates feature collapse effectively, improving performance over state-of-the-art TAS methods significantly.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TAS-LoRA for vision transformer architecture search. It augments a supernet with parameter-efficient LoRA modules and a Mixture-of-LoRA Experts (MoLE) router that conditions expert assignment on subnet architecture, together with a group-wise router initialization scheme. The central claim is that this combination mitigates feature collapse (subnets failing to learn distinct representations under shared weights), yielding substantial accuracy gains over prior TAS methods on ImageNet and transfer tasks (CIFAR-10/100, Flowers, CARS, INAT-19).
Significance. If the mechanism is verified, the work would be significant for supernet-based NAS: it offers a lightweight, trainable way to encourage architecture-specific representations without inflating inference cost. The MoLE-plus-initialization design is a concrete, reproducible idea that could be adopted in other weight-sharing NAS pipelines for transformers.
Major comments (3)
- [Abstract, §4] Abstract and §4 (Experiments): the claim that TAS-LoRA 'mitigates feature collapse effectively' is unsupported by any reported measurement of collapse (e.g., inter-subnet feature cosine similarity, expert activation histograms conditioned on architecture, or diversity metrics). Without these, it is impossible to confirm that the router produces subnet-specific features rather than simply adding capacity.
- [§3, §4.2] §3 (Method) and §4.2 (Ablations): no ablation isolates the router's contribution from the mere addition of multiple LoRA experts. An experiment with random or uniform expert assignment (keeping total LoRA parameters fixed) is required to test whether the architecture-conditioned routing is load-bearing for the reported gains.
- [§4] §4 (Experiments): results are presented without error bars, multiple random seeds, or a full experimental protocol (hyper-parameters for router training, supernet sampling strategy, and exact transfer-learning fine-tuning settings). This prevents assessment of statistical reliability and reproducibility of the claimed improvements over SOTA TAS baselines.
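The collapse measurement the first major comment asks for has a standard instantiation: linear centered kernel alignment (CKA) between subnet feature matrices. A minimal sketch following the linear-CKA formula of Kornblith et al. (2019); the random feature matrices are placeholders, not features from any actual subnet:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n x d1) and Y (n x d2):
    ||Y'X||_F^2 / (||X'X||_F * ||Y'Y||_F) on column-centered features."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return num / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 32))   # placeholder features from subnet A
Z = rng.standard_normal((100, 32))   # placeholder features from subnet B
```

High inter-subnet CKA would indicate collapse; a drop in CKA when MoLE is enabled (but not when experts are randomly assigned) would directly support the paper's mechanism.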
Minor comments (2)
- [§3] Notation for the router output probabilities and the group-wise initialization is introduced without an accompanying equation or pseudocode block, making the precise initialization procedure hard to replicate.
- [§4] Table captions and axis labels in the experimental figures should explicitly state the number of subnets evaluated and whether the reported numbers are top-1 accuracy on the validation or test split.
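Since the first minor comment notes that the group-wise initialization lacks pseudocode, here is one hedged reconstruction of what it could look like: partition the router's architecture codes into one group per expert and bias each group's logits toward its designated expert. This is an assumption for illustration, not the paper's actual procedure.

```python
import numpy as np

def groupwise_router_init(n_arch_codes, n_experts, bias=2.0, seed=0):
    """Hypothetical reconstruction: split architecture codes into one group per
    expert and add a large bias toward that group's designated expert, so early
    routing is spread across experts instead of collapsing onto one."""
    rng = np.random.default_rng(seed)
    logits = 0.01 * rng.standard_normal((n_arch_codes, n_experts))  # small noise
    for e, idx in enumerate(np.array_split(np.arange(n_arch_codes), n_experts)):
        logits[idx, e] += bias
    return logits

L = groupwise_router_init(12, 3)   # each block of 4 codes prefers its own expert
```

An equation or pseudocode block of this shape in §3 would make the initialization replicable.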
Simulated Author's Rebuttal
Thank you for the detailed review and valuable suggestions. We appreciate the opportunity to clarify and strengthen our work on TAS-LoRA. Below, we provide point-by-point responses to the major comments.
Point-by-point responses
-
Referee: [Abstract, §4] Abstract and §4 (Experiments): the claim that TAS-LoRA 'mitigates feature collapse effectively' is unsupported by any reported measurement of collapse (e.g., inter-subnet feature cosine similarity, expert activation histograms conditioned on architecture, or diversity metrics). Without these, it is impossible to confirm that the router produces subnet-specific features rather than simply adding capacity.
Authors: We thank the referee for pointing this out. While the performance gains on ImageNet and transfer tasks provide indirect evidence for the mitigation of feature collapse, we agree that direct measurements would strengthen the claim. In the revised manuscript, we will include analyses such as expert activation histograms conditioned on architecture and inter-subnet feature diversity metrics to better support the mechanism. revision: yes
-
Referee: [§3, §4.2] §3 (Method) and §4.2 (Ablations): no ablation isolates the router's contribution from the mere addition of multiple LoRA experts. An experiment with random or uniform expert assignment (keeping total LoRA parameters fixed) is required to test whether the architecture-conditioned routing is load-bearing for the reported gains.
Authors: We acknowledge the need for this ablation study. To isolate the contribution of the architecture-conditioned router, we will add an experiment comparing our MoLE router against random and uniform expert assignments, while keeping the total number of LoRA parameters constant. This will be included in the revised §4.2. revision: yes
-
Referee: [§4] §4 (Experiments): results are presented without error bars, multiple random seeds, or a full experimental protocol (hyper-parameters for router training, supernet sampling strategy, and exact transfer-learning fine-tuning settings). This prevents assessment of statistical reliability and reproducibility of the claimed improvements over SOTA TAS baselines.
Authors: We apologize for not including these details in the initial submission. In the revised version, we will report all results with error bars from multiple random seeds (e.g., 3-5 runs), and provide a comprehensive experimental protocol including hyper-parameters for router training, supernet sampling, and transfer learning settings to ensure reproducibility. revision: yes
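For reference, aggregating seeded runs into the promised error bars is a one-liner; the accuracy values below are fabricated placeholders, not results from the paper:

```python
import numpy as np

# Fabricated top-1 accuracies from five hypothetical seeds of the same search.
accs = np.array([82.1, 82.4, 81.9, 82.3, 82.2])

mean = accs.mean()
std = accs.std(ddof=1)              # sample std across seeds, for error bars
print(f"top-1: {mean:.2f} +/- {std:.2f} (n={len(accs)})")
```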
Circularity Check
No significant circularity; empirical method proposal with no derivation chain
Full rationale
The paper introduces TAS-LoRA as an architectural modification to existing supernet-based TAS, using LoRA experts and a lightweight MoLE router with group-wise initialization. No equations, derivations, or first-principles predictions are present that could reduce to inputs by construction. Performance claims rest entirely on experimental results across ImageNet and transfer benchmarks, which are independent falsifiable measurements rather than self-referential fits or renamings. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights... TAS-LoRA incorporates a Mixture-of-LoRAExperts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "group-wise router initialization... biases each group toward the designated expert, encouraging exploiting diverse experts early in training"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.