arxiv: 2605.13161 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.LG

Recognition: unknown

A₃B₂: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning

Chang Yao, Jingyuan Chen, Kunxi Li, Mingjing Xu, Wenkang Han, Yiyun Zhou, Zhonghua Jiang

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:11 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords few-shot learningvision-language modelsadapter tuningbranch biasCLIPimage classificationuncertainty estimation

0 comments

The pith

An adaptive asymmetric adapter suppresses image-branch updates in vision-language models when uncertainty is high, improving few-shot classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models like CLIP often suffer from branch bias in few-shot image classification, where adapting the image encoder can hurt performance on out-of-distribution data. The paper shows that this bias arises because the two branches are not equally important across tasks. A3B2 counters it with an uncertainty-aware dampening mechanism that reduces image adaptation automatically when predictions are uncertain. This leads to consistent gains over standard prompt and adapter methods on multiple datasets. A sympathetic reader would care because it offers a data-driven way to balance adaptation without extra hyperparameters or manual checks.

Core claim

The central discovery is that branch bias in vision-language image classification can be alleviated by an adaptive asymmetric adapter called A3B2, which uses uncertainty-aware adapter dampening to suppress image-branch adaptation when prediction uncertainty is high. This is implemented through a lightweight design inspired by mixture-of-experts with load balancing regularization. Experiments confirm it outperforms baselines across three few-shot tasks on 11 datasets.

What carries the argument

Uncertainty-Aware Adapter Dampening (UAAD), which automatically reduces the influence of image-branch adaptations based on prediction uncertainty to balance the branches.

Load-bearing premise

Prediction uncertainty reliably signals when image-branch adaptation should be reduced, without creating new errors or needing per-dataset adjustments.

What would settle it

A dataset where high uncertainty predictions still benefit from full image-branch adaptation, or where the dampening mechanism reduces accuracy compared to fixed adaptation.

Figures

Figures reproduced from arXiv: 2605.13161 by Chang Yao, Jingyuan Chen, Kunxi Li, Mingjing Xu, Wenkang Han, Yiyun Zhou, Zhonghua Jiang.

**Figure 2.** Figure 2: Overview of the proposed A3B2 architecture. The asymmetric adapters are integrated into each Transformer layer of the CLIP. Down Matrix W𝒅𝒐𝒘𝒏 Up Expert Matrix W𝒖𝒑 𝟏 Up Expert Matrix W𝒖𝒑 𝟐 Up Expert Matrix W𝒖𝒑 𝒏 Softmax Linear Dynamic Router ⋯ ReLU Adapter Input z Gating Weights 𝝎 ℒ𝒃𝒂𝒍 Uniform Probability 𝟏/𝒏 ∆𝝂 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Detailed structure of the A3B2 adapter. The module consists of a shared down-projection layer and a dynamic router that adaptively weights multiple up-projection experts. additional parameters on the image encoder may harm the transferability of VLMs on non-distribution data. Task-adaptive and Structure-asymmetric Adapter Based on the insights above, we propose an asymmetric architecture where adapters a… view at source ↗

**Figure 4.** Figure 4: Comparison (HM) of A3B2 and 7 leading methods on few-shot learning, with results on all datasets provided in the Appendix D. 4.3 Cross-Dataset Evaluation We have compared the top 7 methods in the base-to-novel generalization task with the proposed A3B2 in the crossdataset evaluation task, as shown in [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 6.** Figure 6: Performance comparison of A3 and A3 in terms of the base setting in base-to-novel generalization. ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101 Average 20 40 60 80 100 70.4 94.1 97.7 73.4 73.9 91.1 34.8 77.2 62.9 68.8 78.7 74.8 70.5 94.7 98.1 74.7 75.1 92.1 36.5 78.1 63.3 67.6 80.4 75.6 A3 A3 [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Performance comparison of A3 and A3 in terms of the novel setting in base-to-novel generalization. ImageNet Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101 Average 20 40 60 80 100 73.7 96.2 96.7 77.3 84.3 90.1 39.2 79.4 71.5 79.7 82.6 79.4 73.8 96.6 96.7 78.2 85.1 90.6 41.2 79.8 72.3 79.2 83.5 80.1 A3 A3 [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Performance comparison of A3 and A3 in terms of the hm setting in base-to-novel generalization. Caltech101 OxfordPets StanfordCars Flowers102 Food101 FGVCAircraft SUN397 DTD EuroSAT UCF101 Average 20 30 40 50 60 70 80 90 100 94.3 89.5 62.9 69.6 85.6 24.5 66.3 43.9 45.5 68.9 65.1 94.0 91.0 65.5 71.3 86.0 24.5 67.2 45.6 45.9 68.8 66.0 A3 A3 [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Performance comparison of A3 and A3 in cross-dataset evaluation [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗

**Figure 10.** Figure 10: Performance comparison of A3 and A3 in domain generalization. Let us define the bottleneck variable as the output of the shared projection: Z ≜ Wdown(X). The IB objective for this architecture is to learn the parameters of Wdown (which define the mapping p(z|x)) that minimize LIB from Eq. 17. Theoretical Analysis. The one-down-many-ups architecture imposes a single shared bottleneck: all information fro… view at source ↗

**Figure 11.** Figure 11: The performance of symmetric (both) and asymmetric (text and image) adapters in the Base-to-Novel Generalization task across [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: The performance of symmetric (both) and asymmetric (text and image) adapters in the Cross-Dataset Evaluation task across 10 [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: The performance of symmetric (both) and asymmetric (text and image) adapters in the Domain Generalization task across 4 [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

read the original abstract

Efficient transfer learning methods for large-scale vision-language models ($e.g.$, CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification. Through extensive analysis, we reveal a Branch Bias issue in vision-language image classification: adapting the image encoder does not always improve performance under out-of-distribution settings. Motivated by this observation, we propose A$_3$B$_2$, an Adaptive Asymmetric Adapter that alleviates Branch Bias in few-shot learning. A$_3$B$_2$ introduces Uncertainty-Aware Adapter Dampening (UAAD), which automatically suppresses image-branch adaptation when prediction uncertainty is high, enabling soft and data-driven control without manual intervention. Architecturally, A$_3$B$_2$ adopts a lightweight asymmetric design inspired by mixture-of-experts with Load Balancing Regularization. Extensive experiments on three few-shot image classification tasks across 11 datasets demonstrate that A$_3$B$_2$ consistently outperforms 11 competitive prompt- and adapter-based baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags branch bias in CLIP few-shot adaptation and offers an asymmetric adapter with uncertainty dampening, but the supporting analysis and robustness evidence are thin.

read the letter

The quick read is that this work identifies a practical issue: in vision-language models like CLIP, adapting the image branch does not reliably help few-shot image classification under distribution shift, and can sometimes hurt. A3B2 counters this with Uncertainty-Aware Adapter Dampening to automatically reduce image-branch updates when predictions are uncertain, paired with a lightweight asymmetric adapter inspired by mixture-of-experts and regularized for load balance. That combination is the main novelty over symmetric prompt or adapter baselines. The experiments run across 11 datasets in three tasks and report consistent gains over 11 prior methods, which gives the empirical side decent breadth. The design choices keep the added parameters small and avoid manual thresholds, which is a plus for practical use. The soft spots sit in the missing pieces. The branch-bias observation is stated as coming from analysis, yet that analysis is not shown, so it is difficult to assess how general or reproducible the finding is. The results claim outperformance without error bars, run-to-run variance, or significance tests, which leaves the size of the gains unclear. The uncertainty proxy itself is the load-bearing assumption, and the stress-test concern is fair: if high uncertainty stems from label noise, imbalance, or text-branch problems rather than image-branch bias, the dampening could suppress useful adaptation on in-distribution cases. No checks against miscalibration or alternative uncertainty measures are described, so the automatic claim rests on an unverified link. This paper is for researchers working on efficient adaptation of vision-language models for few-shot or robust classification. A reader could extract the adapter architecture and the experimental protocol for their own comparisons even if they re-verify the bias claim themselves. It deserves a serious referee because the problem is real for many CLIP-based pipelines and the method is simple enough to test quickly, but the review should focus on adding the missing analysis, variance numbers, and proxy validation.

Referee Report

3 major / 2 minor

Summary. The paper claims that vision-language models exhibit a 'Branch Bias' in few-shot image classification, where image-encoder adaptation does not always improve performance under out-of-distribution conditions. Motivated by this, it introduces A₃B₂, an adaptive asymmetric adapter that uses Uncertainty-Aware Adapter Dampening (UAAD) to automatically suppress image-branch adaptation when prediction uncertainty is high. The design incorporates a lightweight mixture-of-experts-inspired asymmetry and load-balancing regularization. Experiments across three few-shot tasks on 11 datasets show consistent outperformance over 11 prompt- and adapter-based baselines.

Significance. If the branch-bias observation holds and UAAD provides a reliable, dataset-agnostic control without new failure modes, the work would strengthen few-shot adaptation for CLIP-style models by replacing fixed fine-tuning paradigms with a data-driven branch-balancing mechanism. The scale of the evaluation (11 datasets, 11 baselines) is a clear strength that would support adoption if the uncertainty proxy is shown to be robust.

major comments (3)

[§3.2] §3.2 (UAAD definition): the claim that prediction uncertainty serves as a faithful proxy for branch bias is load-bearing for the 'no manual intervention' guarantee, yet the manuscript provides no ablation or diagnostic showing that high uncertainty correlates specifically with image-branch harm rather than label noise, class imbalance, or text-branch issues; without this, suppression could degrade in-distribution performance.
[§4] §4 (Experiments): the abstract states 'consistent outperformance' across 11 datasets, but no error bars, statistical significance tests, or exact few-shot sampling protocols (e.g., number of seeds, class-balanced splits) are reported; this prevents assessment of whether reported gains exceed variance and undermines the cross-dataset claim.
[§2] §2 (Branch Bias Analysis): the motivation depends on an 'extensive analysis' revealing when image adaptation hurts, but the specific figures, tables, or quantitative thresholds linking uncertainty to performance drop are not shown; this leaves the UAAD design choice under-motivated relative to its centrality.

minor comments (2)

[Figure 1] Figure 1 or 2 (architecture diagram): the asymmetric MoE routing and dampening factor should be annotated with the exact mathematical form of the uncertainty-based gate to improve reproducibility.
[Table 1] Table 1 (baseline comparison): ensure all 11 baselines include their original citation and hyper-parameter settings used in the re-implementation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below, outlining the specific revisions we will implement in the next version of the paper.

read point-by-point responses

Referee: [§3.2] §3.2 (UAAD definition): the claim that prediction uncertainty serves as a faithful proxy for branch bias is load-bearing for the 'no manual intervention' guarantee, yet the manuscript provides no ablation or diagnostic showing that high uncertainty correlates specifically with image-branch harm rather than label noise, class imbalance, or text-branch issues; without this, suppression could degrade in-distribution performance.

Authors: We agree that additional diagnostics are needed to confirm that high uncertainty specifically signals image-branch harm rather than confounding factors. In the revised manuscript, we will add a dedicated ablation subsection with new experiments and plots that measure performance change when forcing image-branch adaptation at varying uncertainty levels, while controlling for label noise and class balance. We will also report in-distribution results to verify that UAAD does not degrade performance when uncertainty is low. revision: yes
Referee: [§4] §4 (Experiments): the abstract states 'consistent outperformance' across 11 datasets, but no error bars, statistical significance tests, or exact few-shot sampling protocols (e.g., number of seeds, class-balanced splits) are reported; this prevents assessment of whether reported gains exceed variance and undermines the cross-dataset claim.

Authors: We acknowledge that the current presentation lacks the necessary statistical details. The revised version will report standard deviation error bars over 5 random seeds, specify the exact few-shot protocol (class-balanced random sampling of k examples per class with no overlap across seeds), and include paired t-test p-values comparing A₃B₂ against each baseline on every dataset. These additions will appear in Section 4 and the corresponding tables. revision: yes
Referee: [§2] §2 (Branch Bias Analysis): the motivation depends on an 'extensive analysis' revealing when image adaptation hurts, but the specific figures, tables, or quantitative thresholds linking uncertainty to performance drop are not shown; this leaves the UAAD design choice under-motivated relative to its centrality.

Authors: Section 2 presents the branch-bias observation, but we agree that more explicit quantitative support would strengthen the motivation. We will expand Section 2 with new figures and a table that report performance deltas as a function of uncertainty bins, along with concrete thresholds (e.g., uncertainty > 0.7 correlates with >3% drop when image adaptation is applied). These will directly link the observed bias to the UAAD design. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical design with external validation

full rationale

The paper motivates A3B2 from an observed Branch Bias phenomenon and introduces UAAD as an empirical, uncertainty-driven suppression mechanism without any shown equations, derivations, or fitted parameters that reduce to the inputs by construction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text. The method is presented as a lightweight asymmetric adapter with load-balancing regularization, validated through experiments on 11 datasets against 11 baselines. This keeps the central claim independent of its own fitted values or prior self-citations, qualifying as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the high-level design choices of uncertainty measurement and load-balancing regularization.

pith-pipeline@v0.9.0 · 5521 in / 1014 out tokens · 19277 ms · 2026-05-14T19:11:01.039898+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 18 canonical work pages · 4 internal anchors

[1]

Flamingo: a visual language model for few-shot learning.Advances in neural information pro- cessing systems, 35:23716–23736,

[Alayracet al., 2022 ] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information pro- cessing systems, 35:23716–23736,

2022
[2]

Food-101–mining discriminative com- ponents with random forests

[Bossardet al., 2014 ] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative com- ponents with random forests. InComputer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part VI 13, pages 446–461. Springer,

2014
[3]

Language models are few-shot learners.Advances in neural information processing sys- tems, 33:1877–1901,

[Brownet al., 2020 ] Tom Brown, Benjamin Mann, Nick Ry- der, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing sys- tems, 33:1877–1901,

2020
[4]

Markov chains.Springer- Verlag, New York,

[Chung, 1967] Kai Lai Chung. Markov chains.Springer- Verlag, New York,

1967
[5]

Describing textures in the wild

[Cimpoiet al., 2014 ] Mircea Cimpoi, Subhransu Maji, Ia- sonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 3606–3613,

2014
[6]

Imagenet: A large-scale hierarchical image database

[Denget al., 2009 ] Jia Deng, Wei Dong, Richard Socher, Li- Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee,

2009
[7]

Switch transformers: Scaling to trillion param- eter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39,

[Feduset al., 2022 ] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion param- eter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39,

2022
[8]

Learning generative visual models from few train- ing examples: An incremental bayesian approach tested on 101 object categories

[Fei-Feiet al., 2004 ] Li Fei-Fei, Rob Fergus, and Pietro Per- ona. Learning generative visual models from few train- ing examples: An incremental bayesian approach tested on 101 object categories. In2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE,

2004
[9]

Hidden in plain sight: Vlms overlook their visual representations.arXiv preprint arXiv:2506.08008,

[Fuet al., 2025 ] Stephanie Fu, Tyler Bonnen, Devin Guil- lory, and Trevor Darrell. Hidden in plain sight: Vlms overlook their visual representations.arXiv preprint arXiv:2506.08008,

work page arXiv 2025
[10]

Higher layers need more lora experts.arXiv preprint arXiv:2402.08562,

[Gaoet al., 2024a ] Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie Yang, and VS Subrahmanian. Higher layers need more lora experts.arXiv preprint arXiv:2402.08562,

work page arXiv
[11]

Kernel-based unsupervised embedding alignment for enhanced visual representation in vision- language models.arXiv preprint arXiv:2506.02557,

[Gonget al., 2025 ] Shizhan Gong, Yankai Jiang, Qi Dou, and Farzan Farnia. Kernel-based unsupervised embedding alignment for enhanced visual representation in vision- language models.arXiv preprint arXiv:2506.02557,

work page arXiv 2025
[12]

Mmrl: Multi-modal representation learning for vision- language models.arXiv preprint arXiv:2503.08497,

[Guo and Gu, 2025a] Yuncheng Guo and Xiaodong Gu. Mmrl: Multi-modal representation learning for vision- language models.arXiv preprint arXiv:2503.08497,

work page arXiv
[13]

Mmrl++: Parameter-efficient and interaction-aware rep- resentation learning for vision-language models.arXiv preprint arXiv:2505.10088,

[Guo and Gu, 2025b] Yuncheng Guo and Xiaodong Gu. Mmrl++: Parameter-efficient and interaction-aware rep- resentation learning for vision-language models.arXiv preprint arXiv:2505.10088,

work page arXiv
[14]

[Helberet al., 2019 ] Patrick Helber, Benjamin Bischke, An- dreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification.IEEE Journal of Selected Top- ics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226,

2019
[15]

A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

[Hendrycks and Gimpel, 2016] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of- distribution examples in neural networks.arXiv preprint arXiv:1610.02136,

work page internal anchor Pith review Pith/arXiv arXiv 2016
[16]

Purekv: Plug-and-play kv cache optimization with spatial-temporal sparse attention for vision-language large models.arXiv preprint arXiv:2510.25600,

[Jianget al., 2025 ] Zhonghua Jiang, Kunxi Li, Yiyun Zhou, Sihao Liu, Zhaode Wang, Shengyu Zhang, et al. Purekv: Plug-and-play kv cache optimization with spatial-temporal sparse attention for vision-language large models.arXiv preprint arXiv:2510.25600,

work page arXiv 2025
[17]

Acckv: Towards efficient audio-video llms inference via adaptive-focusing and cross-calibration kv cache optimization

[Jianget al., 2026 ] Zhonghua Jiang, Kui Chen, Kunxi Li, Keting Yin, Yiyun Zhou, Zhaode Wang, Chengfei Lv, and Shengyu Zhang. Acckv: Towards efficient audio-video llms inference via adaptive-focusing and cross-calibration kv cache optimization. InProceedings of the AAAI Con- ference on Artificial Intelligence, volume 40, pages 5494– 5502,

2026
[18]

Maple: Multi-modal prompt learning

[Khattaket al., 2023 ] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fa- had Shahbaz Khan. Maple: Multi-modal prompt learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19113–19122,

2023
[19]

Shifts in selective visual attention: towards the un- derlying neural circuitry

[Koch and Ullman, 1987] Christof Koch and Shimon Ull- man. Shifts in selective visual attention: towards the un- derlying neural circuitry. InMatters of intelligence: Con- ceptual structures in cognitive neuroscience, pages 115–

1987
[20]

3d object representations for fine- grained categorization

[Krauseet al., 2013 ] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine- grained categorization. InProceedings of the IEEE inter- national conference on computer vision workshops, pages 554–561,

2013
[21]

Read-only prompt optimization for vision-language few- shot learning

[Leeet al., 2023 ] Dongjun Lee, Seokwon Song, Jihee Suh, Joonmyeong Choi, Sanghyeok Lee, and Hyunwoo J Kim. Read-only prompt optimization for vision-language few- shot learning. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 1401–1411,

2023
[22]

Language-driven semantic segmentation.arXiv preprint arXiv:2201.03546,

[Liet al., 2022 ] Boyi Li, Kilian Q Weinberger, Serge Be- longie, Vladlen Koltun, and Ren´e Ranftl. Language-driven semantic segmentation.arXiv preprint arXiv:2201.03546,

work page arXiv 2022
[23]

Scaling language-image pre-training via masking

[Liet al., 2023 ] Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. InProceedings of the IEEE/CVF conference on computer vision and pat- tern recognition, pages 23390–23400,

2023
[24]

Vision- language model fine-tuning via simple parameter-efficient modification.arXiv preprint arXiv:2409.16718,

[Liet al., 2024 ] Ming Li, Jike Zhong, Chenxin Li, Li- uzhuozheng Li, Nie Lin, and Masashi Sugiyama. Vision- language model fine-tuning via simple parameter-efficient modification.arXiv preprint arXiv:2409.16718,

work page arXiv 2024
[25]

Flowmm: Cross-modal information flow guided kv cache merging for efficient multimodal context infer- ence.arXiv preprint arXiv:2511.05534,

[Liet al., 2025a ] Kunxi Li, Yufan Xiong, Zhonghua Jiang, Yiyun Zhou, Zhaode Wang, Chengfei Lv, and Shengyu Zhang. Flowmm: Cross-modal information flow guided kv cache merging for efficient multimodal context infer- ence.arXiv preprint arXiv:2511.05534,

work page arXiv
[26]

Open-vocabulary se- mantic segmentation with mask-adapted clip

[Lianget al., 2023 ] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary se- mantic segmentation with mask-adapted clip. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070,

2023
[27]

Fine-Grained Visual Classification of Aircraft

[Majiet al., 2013 ] Subhransu Maji, Esa Rahtu, Juho Kan- nala, Matthew Blaschko, and Andrea Vedaldi. Fine- grained visual classification of aircraft.arXiv preprint arXiv:1306.5151,

work page internal anchor Pith review Pith/arXiv arXiv 2013
[28]

A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications

[Mu and Lin, 2025] Siyuan Mu and Sen Lin. A comprehen- sive survey of mixture-of-experts: Algorithms, theory, and applications.arXiv preprint arXiv:2503.07137,

work page arXiv 2025
[29]

Automated flower classification over a large number of classes

[Nilsback and Zisserman, 2008] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE,

2008
[30]

Cats and dogs

[Parkhiet al., 2012 ] Omkar M Parkhi, Andrea Vedaldi, An- drew Zisserman, and CV Jawahar. Cats and dogs. In2012 IEEE conference on computer vision and pattern recogni- tion, pages 3498–3505. IEEE,

2012
[31]

Understanding fine-tuning clip for open- vocabulary semantic segmentation in hyperbolic space

[Penget al., 2025 ] Zelin Peng, Zhengqin Xu, Zhilin Zeng, Changsong Wen, Yu Huang, Menglin Yang, Feilong Tang, and Wei Shen. Understanding fine-tuning clip for open- vocabulary semantic segmentation in hyperbolic space. In Proceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 4562–4572,

2025
[32]

Learning transferable visual models from nat- ural language supervision

[Radfordet al., 2021 ] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from nat- ural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR,

2021
[33]

Do imagenet classi- fiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400

[Rechtet al., 2019 ] Benjamin Recht, Rebecca Roelofs, Lud- wig Schmidt, and Vaishaal Shankar. Do imagenet classi- fiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400. PMLR,

2019
[34]

Inter-module credit assignment in modular reinforcement learning.Neural Networks, 16(7):985–994,

[Samejimaet al., 2003 ] Kazuyuki Samejima, Kenji Doya, and Mitsuo Kawato. Inter-module credit assignment in modular reinforcement learning.Neural Networks, 16(7):985–994,

2003
[35]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

[Shazeeret al., 2017 ] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hin- ton, and Jeff Dean. Outrageously large neural net- works: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538,

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

[Soomroet al., 2012 ] Khurram Soomro, Amir Roshan Za- mir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild.arXiv preprint arXiv:1212.0402,

work page internal anchor Pith review Pith/arXiv arXiv 2012
[37]

Hydralora: An asymmet- ric lora architecture for efficient fine-tuning.Advances in Neural Information Processing Systems, 37:9565–9584,

[Tianet al., 2024 ] Chunlin Tian, Zhan Shi, Zhijiang Guo, Li Li, and Cheng-Zhong Xu. Hydralora: An asymmet- ric lora architecture for efficient fine-tuning.Advances in Neural Information Processing Systems, 37:9565–9584,

2024
[38]

Attention is all you need.Advances in neural information processing systems, 30,

[Vaswaniet al., 2017 ] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30,

2017
[39]

Learning robust global repre- sentations by penalizing local predictive power.Advances in neural information processing systems, 32,

[Wanget al., 2019 ] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global repre- sentations by penalizing local predictive power.Advances in neural information processing systems, 32,

2019
[40]

Sun database: Large-scale scene recognition from abbey to zoo

[Xiaoet al., 2010 ] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE,

2010
[41]

Side adapter network for open- vocabulary semantic segmentation

[Xuet al., 2023 ] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open- vocabulary semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2945–2954,

2023
[42]

Go wider instead of deeper

[Xueet al., 2022 ] Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, and Yang You. Go wider instead of deeper. InProceedings of the AAAI Conference on Artificial Intel- ligence, volume 36, pages 8779–8787,

2022
[43]

Mma: Multi-modal adapter for vision-language models

[Yanget al., 2024 ] Lingxiao Yang, Ru-Yuan Zhang, Yanchen Wang, and Xiaohua Xie. Mma: Multi-modal adapter for vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23826–23837,

2024
[44]

Language-image alignment with fixed text en- coders.arXiv preprint arXiv:2506.04209,

[Yanget al., 2025 ] Jingfeng Yang, Ziyang Wu, Yue Zhao, and Yi Ma. Language-image alignment with fixed text en- coders.arXiv preprint arXiv:2506.04209,

work page arXiv 2025
[45]

Visual-language prompt tuning with knowledge- guided context optimization

[Yaoet al., 2023 ] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge- guided context optimization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6757–6767,

2023
[46]

Tcp: Textual-based class-aware prompt tuning for visual-language model

[Yaoet al., 2024 ] Hantao Yao, Rui Zhang, and Changsheng Xu. Tcp: Textual-based class-aware prompt tuning for visual-language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23438–23448,

2024
[47]

Spurious correla- tions in machine learning: A survey.arXiv preprint arXiv:2402.12715,

[Yeet al., 2024 ] Wenqian Ye, Guangtao Zheng, Xu Cao, Yunsheng Ma, and Aidong Zhang. Spurious correla- tions in machine learning: A survey.arXiv preprint arXiv:2402.12715,

work page arXiv 2024
[48]

Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence,

[Zhanget al., 2024 ] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence,

2024
[49]

More: A mixture of low-rank experts for adaptive multi-task learning.arXiv preprint arXiv:2505.22694,

[Zhanget al., 2025 ] Dacao Zhang, Kun Zhang, Shimao Chu, Le Wu, Xin Li, and Si Wei. More: A mixture of low-rank experts for adaptive multi-task learning.arXiv preprint arXiv:2505.22694,

work page arXiv 2025
[50]

Multimodal graph-based variational mixture of experts network for zero-shot multi- modal information extraction

[Zhouet al., 2025a ] Baohang Zhou, Ying Zhang, Yu Zhao, Xuhui Sui, and Xiaojie Yuan. Multimodal graph-based variational mixture of experts network for zero-shot multi- modal information extraction. InProceedings of the ACM on Web Conference 2025, pages 4823–4831,

2025
[51]

Disentangled knowledge tracing for alleviating cognitive bias

[Zhouet al., 2025c ] Yiyun Zhou, Zheqi Lv, Shengyu Zhang, and Jingyuan Chen. Disentangled knowledge tracing for alleviating cognitive bias. InProceedings of the ACM on Web Conference 2025, pages 2633–2645,

2025
[52]

Cola: Collaborative low-rank adaptation

[Zhouet al., 2025d ] Yiyun Zhou, Chang Yao, and Jingyuan Chen. Cola: Collaborative low-rank adaptation. InFind- ings of the Association for Computational Linguistics: ACL 2025, pages 14115–14130,

2025
[53]

Beyond student: An asymmetric network for neural network inheritance.arXiv preprint arXiv:2602.09509,

[Zhouet al., 2026a ] Yiyun Zhou, Jingwei Shi, Mingjing Xu, Zhonghua Jiang, and Jingyuan Chen. Beyond student: An asymmetric network for neural network inheritance.arXiv preprint arXiv:2602.09509,

work page arXiv
[54]

This strongly demonstrates the effectiveness of the proposed fixed asymmetric design

From these results, we observe thatA3 generally performs worse than A3 across different tasks. This strongly demonstrates the effectiveness of the proposed fixed asymmetric design. In the following, we analyze the underlying reasons behind this outcome. A.2 Theoretical Support We build upon the theoretical analysis developed in our previ- ous work [Zhouet...

1987
[55]

one-down-many-ups

Theoretical Analysis.The one-down-many-ups architec- ture imposes a single shared bottleneck: all information fromXtoYmust pass through the same low-dimensional Z. This meansZmust serve as the representation for the entire mixtureofHexperts. Consequently, to maximize the predictive informationI(Z;Y),Zis forced to encode only those features ofXthat are sal...

2024
[56]

The second term penalizes large updates and is non-negative, hence: ∥∇V(x) ℓ′∥ ≤ ∥∇ V(x) ℓ∥

Then: ∇V(x) ℓ′ =∇ V(x) ℓ+ (1−κ(x))∇ V(x) ∥∆v(x)∥2. The second term penalizes large updates and is non-negative, hence: ∥∇V(x) ℓ′∥ ≤ ∥∇ V(x) ℓ∥. Taking expectation: Ceff V (T)≤C V (T). Method ImageNetCaltech101OxfordPetsStanfordCarsFlowers102Food101FGVCAircraftSUN397DTDEuroSATUCF101Average CoOpOp 70.62 94.52 90.47 65.91 71.92 86.02 23.34 66.54 45.51 44.43 ...

2024