Pith · machine review for the scientific record

arXiv: 2604.07965 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

Gyanendra Das, Sai Satyam Jena

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords model editing · vision language models · lifelong editing · subspace decomposition · concept alignment · continual learning · catastrophic forgetting

The pith

Decomposing VLM representation spaces into orthogonal subspaces enables precise lifelong concept editing without interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve interference during sequential edits in vision-language models by turning concept isolation into a structural feature rather than an optimization goal. It decomposes the shared representation space into orthogonal semantic subspaces using incremental clustering and PCA on combined vision and language features. Edits are then applied only inside the relevant subspace while the base model stays frozen. A reader would care because this promises stable updates over thousands of changes without degrading reasoning or causing forgetting, unlike methods that edit inside the original entangled space.

Core claim

DSCA decomposes the joint vision-language representation space into a set of orthogonal semantic subspaces obtained through incremental clustering and PCA. Surgical edits are performed only in these transformed spaces, which structurally isolates concepts and prevents cross-interference. A multi-term loss maintains task fidelity, edit locality, and cross-modal alignment. With the base model frozen, this yields 98 percent single-edit success that remains above 95 percent after 1000 sequential edits while lowering hallucination by 3 to 5 percent and producing the best backward-transfer scores on continual instruction-tuning benchmarks.

What carries the argument

Dynamic Subspace Concept Alignment (DSCA), which decomposes representations into orthogonal subspaces via incremental clustering and PCA so that edits target isolated concept regions without affecting others.
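
Read literally, the mechanism described here can be sketched in a few lines of NumPy. This is an editorial sketch under simplifying assumptions, not the authors' code: cluster labels are given up front and PCA is a one-shot SVD, standing in for the paper's incremental clustering and incremental PCA.

```python
import numpy as np

def concept_bases(features, labels, rank):
    """One PCA basis per concept cluster (columns orthonormal).

    Simplified stand-in for the paper's incremental clustering + PCA:
    cluster labels are given up front and PCA is a one-shot SVD.
    """
    bases = {}
    for k in np.unique(labels):
        X = features[labels == k]
        Xc = X - X.mean(axis=0)
        # Top right-singular vectors span the principal subspace.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        bases[k] = Vt[:rank].T  # shape (d, rank)
    return bases

def subspace_edit(h, delta, R):
    """Apply an edit direction only inside the concept subspace span(R)."""
    return h + R @ (R.T @ delta)

rng = np.random.default_rng(0)
d = 32
feats = rng.normal(size=(200, d))
labels = rng.integers(0, 4, size=200)
bases = concept_bases(feats, labels, rank=4)

h = rng.normal(size=d)
delta = rng.normal(size=d)
h_new = subspace_edit(h, delta, bases[0])

# The applied change has no component outside span(R_0).
change = h_new - h
R0 = bases[0]
residual = change - R0 @ (R0.T @ change)
```

The point of the sketch is structural: confinement of the edit to one subspace is a property of the projection, not of any loss term.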

If this is right

  • Edits stay localized and do not degrade performance on unrelated concepts or tasks.
  • The frozen base model retains cross-modal alignment across many sequential updates.
  • Hallucination rates fall by 3 to 5 percent relative to prior editing approaches.
  • Best-in-class backward transfer scores indicate strong retention on continual instruction-tuning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structural separation might extend to editing other multimodal architectures if clustering remains stable on different feature types.
  • Independent subspaces could in principle support simultaneous edits to multiple concepts without added interference.
  • This approach might reduce reliance on heavy regularization terms in other lifelong-learning settings.
  • One could verify subspace quality by measuring residual correlations between subspaces after large numbers of edits.
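
The last extension is easy to make concrete. A minimal sketch of that diagnostic, assuming column-orthonormal bases R_i and using the squared Frobenius overlap that Figure 2(a) reports:

```python
import numpy as np

def pairwise_overlap(bases):
    """Mean squared Frobenius overlap ||R_i^T R_j||_F^2 over pairs i < j.

    Zero for mutually orthogonal subspaces; grows toward min(r_i, r_j)
    as subspaces coincide.
    """
    keys = list(bases)
    vals = []
    for a in range(len(keys)):
        for b in range(a + 1, len(keys)):
            M = bases[keys[a]].T @ bases[keys[b]]
            vals.append(float(np.sum(M ** 2)))
    return float(np.mean(vals))

# Two orthogonal 2-D subspaces of R^6 versus two identical ones.
I = np.eye(6)
eps_ortho = pairwise_overlap({0: I[:, :2], 1: I[:, 2:4]})  # 0.0
eps_same = pairwise_overlap({0: I[:, :2], 1: I[:, :2]})    # 2.0
```

Tracking this quantity after each batch of edits would directly test whether the claimed isolation survives at scale.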

Load-bearing premise

Incremental clustering and PCA on joint vision-language representations will create subspaces that isolate distinct concepts without meaningful information loss or leakage between subspaces.

What would settle it

A drop below 95 percent edit success or measurable interference with non-target concepts after 1000 sequential edits would show the subspaces fail to deliver the claimed isolation.
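
One way to operationalize this falsification test is a harness that tracks edit success and collateral drift over a sequence of edits. The `apply_edit`/`predict` interfaces and the lookup-table model below are hypothetical stand-ins, not the paper's API:

```python
import numpy as np

def run_sequential_edits(apply_edit, predict, edits, probes):
    """Edit-success rate and worst-case probe drift over sequential edits.

    `probes` are inputs unrelated to any edit; if isolation holds,
    their outputs should remain fixed throughout.
    """
    baseline = [predict(p) for p in probes]
    successes, worst_drift = [], 0.0
    for x, target in edits:
        apply_edit(x, target)
        successes.append(bool(np.allclose(predict(x), target)))
        for p, base in zip(probes, baseline):
            worst_drift = max(worst_drift,
                              float(np.linalg.norm(predict(p) - base)))
    return float(np.mean(successes)), worst_drift

# Toy stand-in model: a lookup table over input vectors.
memo = {}
predict = lambda x: memo.get(tuple(x), np.zeros(2))
apply_edit = lambda x, t: memo.__setitem__(tuple(x), t)

edits = [(np.array([1.0, 0.0]), np.array([1.0, 1.0]))]
probes = [np.array([9.0, 9.0])]
rate, drift = run_sequential_edits(apply_edit, predict, edits, probes)
# rate is the fraction of successful edits; drift measures interference.
```

A success rate below 0.95 or a nonzero drift on unrelated probes after 1000 edits would be the failure signature described above.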

Figures

Figures reproduced from arXiv: 2604.07965 by Gyanendra Das, Sai Satyam Jena.

Figure 1: Conceptual comparison of knowledge-editing paradigms. (a) The initial concept space, where concepts are well-separated. (b) Global fine-tuning perturbs the entire representation space, distorting unrelated concepts. (c) LoRA / local adapters constrain edits but still produce coupled interference. (d) DSCA performs subspace-confined, concept-specific interventions, maintaining isolation and preserving all o…

Figure 2: Diagnostic analysis of DSCA. (a) Mean pairwise subspace overlap (ε = ‖R_i^T R_j‖_F^2) as a function of the number of sequential edits. DSCA with residualized incremental PCA keeps the overlap essentially flat at ≈ 3 × 10⁻³ across 1,000 edits, comparable to a globally orthonormal baseline, whereas a variant without orthogonalization drifts to more than 10⁻¹. (b) Relationship between mean subspace overlap and…

Figure 3: Routing sparsity analysis. (a) Histogram of routing weights shows that over 95% are below 0.05, indicating highly selective module activation. (b) Trade-off between the sparsity coefficient λ_sparse and the average number of active DSAMs. The chosen operating point (blue dot) yields ≈ 3 active DSAMs per input without degrading performance.

Figure 4: Concept-wise subspace visualization. t-SNE of fused representations projected through their assigned semantic subspaces R_k^T h_f for a subset of concepts (e.g., "car", "truck", "bicycle"). DSCA yields compact, well-separated clusters, indicating that edits remain confined to localized regions of the representation space. This empirically validates the conceptual illustration provided in …
Original abstract

Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA) which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision language representations. This process structurally isolates concepts, enabling precise, non interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi term loss function for maintaining task fidelity, edit locality, and cross modal alignment. With the base model frozen, our method achieves 98 percent single edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA state of the art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Dynamic Subspace Concept Alignment (DSCA) for lifelong editing of Vision-Language Models. It decomposes the joint vision-language representation space into a set of orthogonal semantic subspaces obtained via incremental clustering and PCA, then performs surgical edits only within the relevant transformed subspaces while keeping the base model frozen. A multi-term loss maintains task fidelity, edit locality, and cross-modal alignment. The central claim is that this structural separation (rather than optimization-based control) prevents interference and catastrophic forgetting, yielding 98% single-edit success, >95% success after 1000 sequential edits, 3-5% lower hallucination, and the best backward transfer (BWT) scores on continual instruction tuning benchmarks.

Significance. If the claimed subspace isolation can be shown to hold with negligible cross-subspace leakage, DSCA would offer a promising architectural route to stable lifelong VLM editing that sidesteps the entanglement problems of shared representation spaces. The reported retention of performance over 1000 edits and superior BWT would constitute a notable empirical advance over existing gated-adapter, activation-edit, and parameter-merging baselines.

major comments (2)
  1. [Abstract] Abstract: the assertion that incremental clustering plus PCA 'structurally isolates concepts' and converts isolation from a training objective into an architectural property is load-bearing for the non-interference claim, yet no quantitative verification (e.g., measured inter-subspace orthogonality, residual cross-subspace norms, or linear separability tests on held-out concepts) is referenced. PCA on per-cluster variance does not guarantee orthogonality across clusters or absence of leakage in the entangled VLM feature space, directly undermining the guarantee that edits remain confined even with the base model frozen.
  2. [Abstract] Abstract: the multi-term loss is described only at a high level ('maintaining task fidelity, edit locality, and cross-modal alignment') with no explicit formulation, weighting schedule, or ablation of individual terms. Without these details it is impossible to assess whether the reported stability after 1000 edits is attributable to the subspace mechanism or to the loss design, and whether the orthogonality assumption was ever stress-tested.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'lowers hallucination by 3 to 5 percent' should specify the exact metric (e.g., hallucination rate on which benchmark) and the baseline against which the reduction is measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of how our claims are presented in the abstract. We have revised the manuscript to strengthen the substantiation of the subspace isolation mechanism and to provide clearer details on the loss function.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that incremental clustering plus PCA 'structurally isolates concepts' and converts isolation from a training objective into an architectural property is load-bearing for the non-interference claim, yet no quantitative verification (e.g., measured inter-subspace orthogonality, residual cross-subspace norms, or linear separability tests on held-out concepts) is referenced. PCA on per-cluster variance does not guarantee orthogonality across clusters or absence of leakage in the entangled VLM feature space, directly undermining the guarantee that edits remain confined even with the base model frozen.

    Authors: We agree that the abstract would benefit from explicit quantitative verification to support the structural isolation claim. While the incremental clustering combined with per-cluster PCA is designed to produce separated subspaces (with intra-cluster orthogonality guaranteed by the PCA step itself), we acknowledge that cross-cluster leakage metrics were not quantified there. In the revised manuscript we will add these measurements—inter-subspace orthogonality, residual cross-subspace norms, and linear separability on held-out concepts—and reference them in the abstract to directly address the concern about potential leakage in the original VLM space. revision: yes

  2. Referee: [Abstract] Abstract: the multi-term loss is described only at a high level ('maintaining task fidelity, edit locality, and cross-modal alignment') with no explicit formulation, weighting schedule, or ablation of individual terms. Without these details it is impossible to assess whether the reported stability after 1000 edits is attributable to the subspace mechanism or to the loss design, and whether the orthogonality assumption was ever stress-tested.

    Authors: The referee correctly observes that the abstract summarizes the loss at a high level. The full manuscript contains the explicit multi-term loss formulation, weighting schedule, and corresponding ablations. To improve accessibility, we have expanded the abstract to briefly describe the loss terms and now include a direct reference to the detailed formulation and ablation studies in the main text. This revision clarifies the respective roles of the subspace architecture and the loss in achieving the reported stability. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural method with empirical validation

Full rationale

The paper defines DSCA via incremental clustering and PCA on joint representations to create orthogonal subspaces, then measures performance empirically (98% single-edit success, >95% after 1000 edits, improved BWT). No equations, derivations, or claims reduce by construction to fitted parameters or prior self-citations; isolation is presented as a design choice whose benefits are tested externally rather than assumed tautologically. The derivation chain is self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven domain assumption that PCA-derived subspaces on clustered joint representations will remain sufficiently orthogonal and stable under sequential edits; no free parameters or new entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption Incremental clustering and PCA on joint vision-language representations produce subspaces that structurally isolate semantic concepts without significant cross-concept interference.
    Invoked to justify why edits in the transformed spaces remain non-interfering; this is the load-bearing premise that converts isolation from a soft objective into an architectural guarantee.
invented entities (1)
  • Dynamic Subspace Concept Alignment (DSCA) mechanism · no independent evidence
    purpose: To enforce structural separation of concepts for lifelong editing
    New architectural component introduced by the paper; no independent evidence outside the reported experiments is provided in the abstract.

pith-pipeline@v0.9.0 · 5591 in / 1359 out tokens · 54998 ms · 2026-05-10T17:35:49.920249+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

50 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthi...

  2. [2]

    CoIN: A benchmark of continual instruction tuning for multimodal large language models

    Cheng Chen, Junchen Zhu, Xu Luo, Heng Tao Shen, Jingkuan Song, and Lianli Gao. CoIN: A benchmark of continual instruction tuning for multimodal large language models, 2024. arXiv:2403.08350v2 [cs.CV].

  3. [4]

    arXiv:2411.15432v2 [cs.CL].

  4. [5]

    Attribution analysis meets model editing: Advancing knowledge correction in vision language models with VisEdit

    Qizhou Chen, Taolin Zhang, Chengyu Wang, Xiaofeng He, Dakan Wang, and Tingting Liu. Attribution analysis meets model editing: Advancing knowledge correction in vision language models with VisEdit. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.

  5. [6]

    Can we edit multimodal large language models?

    Siyuan Cheng, Bozhong Tian, Qingbin Liu, Xi Chen, Yongheng Wang, Huajun Chen, and Ningyu Zhang. Can we edit multimodal large language models? arXiv preprint arXiv:2310.08475, 2023.

  6. [7]

    CGIL: CLIP with generative latent replay: a strong baseline for incremental learning

    Emanuele Frascaroli, Aniello Panariello, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, and Simone Calderara. CGIL: CLIP with generative latent replay: a strong baseline for incremental learning. In Proceedings of the British Machine Vision Conference (BMVC), 2024.

  7. [8]

    MME: A comprehensive evaluation benchmark for multimodal large language models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

  8. [9]

    Enhanced continual learning of vision-language models with model fusion

    Haoyuan Gao, Zicong Zhang, Yuqi Wei, Linglan Zhao, Guilin Li, Yexin Li, Linghe Kong, and Weiran Huang. Enhanced continual learning of vision-language models with model fusion, 2025. Workshop paper at SCOPE, ICLR 2025.

  9. [10]

    Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  10. [11]

    LoRA: Low-rank adaptation of large language models

    Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. Version 2.

  11. [12]

    VLKEB: A large vision-language model knowledge editing benchmark

    Han Huang, Haitian Zhong, Tao Yu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. VLKEB: A large vision-language model knowledge editing benchmark, 2024.

  12. [13]

    CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models

    Saurav Jha, Dong Gong, and Lina Yao. CLAP4CLIP: Continual learning with probabilistic finetuning for vision-language models. In Advances in Neural Information Processing Systems (NeurIPS 2024), pages 1–35.

  13. [14]

    Learning to edit: Aligning LLMs with knowledge editing

    Yuxin Jiang, Yufei Wang, Chuhan Wu, Wanjun Zhong, Xingshan Zeng, Jiahui Gao, Liangyou Li, Xin Jiang, Lifeng Shang, Ruiming Tang, Qun Liu, and Wei Wang. Learning to edit: Aligning LLMs with knowledge editing, 2024.

  14. [15]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755, Cham.

  15. [16]

    Springer International Publishing.

  16. [17]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.

  17. [18]

    Unlocking efficient, scalable, and continual knowledge editing with basis-level representation fine-tuning

    Tianci Liu, Ruirui Li, Yunzhe Qi, Hui Liu, Xianfeng Tang, Tianqi Zheng, Qingyu Yin, Monica Cheng, Jun Huan, Haoyu Wang, and Jing Gao. Unlocking efficient, scalable, and continual knowledge editing with basis-level representation fine-tuning. In International Conference on Learning Representations (ICLR), 2025.

  18. [19]

    Continual learning for VLMs: A survey and taxonomy beyond forgetting

    Yuyang Liu, Qiuhe Hong, Linlan Huang, Alexandra Gomez-Villa, Dipam Goswami, Xialei Liu, Joost van de Weijer, and Yonghong Tian. Continual learning for VLMs: A survey and taxonomy beyond forgetting. arXiv preprint arXiv:2508.04227, 2025.

  19. [20]

    Re-imagining multimodal instruction tuning: A representation view

    Yiyang Liu, James Liang, Ruixiang Tang, Yugyung Lee, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Lifu Huang, Dongfang Liu, Qifan Wang, and Cheng Han. Re-imagining multimodal instruction tuning: A representation view. In International Conference on Learning Representations (ICLR), 2025.

  20. [21]

    C-CLIP: Contrastive learning improves knowledge editing in large vision-language models

    Ziyang Liu, Yichen Wu, Zhiyi Shi, Binjie Wang, Junsik Kim, and Hanspeter Pfister. C-CLIP: Contrastive learning improves knowledge editing in large vision-language models. In International Conference on Learning Representations (ICLR).

  21. [22]

    Gradient episodic memory for continual learning

    David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6470–6479, Red Hook, NY, USA.

  22. [23]

    Curran Associates Inc.

  23. [24]

    Adaptive rank, reduced forgetting: Knowledge retention in continual learning vision-language models with dynamic rank-selective LoRA

    Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, and Dong Gong. Adaptive rank, reduced forgetting: Knowledge retention in continual learning vision-language models with dynamic rank-selective LoRA, 2024. Version 6, last revised 8 Oct 2025.

  24. [25]

    MagMax: Leveraging model merging for seamless continual learning

    Daniel Marczak, Bartłomiej Twardowski, Tomasz Trzciński, and Sebastian Cygert. MagMax: Leveraging model merging for seamless continual learning, 2024.

  25. [26]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 36, 2022. arXiv:2202.05262.

  26. [27]

    Mass-editing memory in a transformer

    Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. The Eleventh International Conference on Learning Representations (ICLR), 2023.

  27. [28]

    Fast model editing at scale

    Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. Fast model editing at scale. In International Conference on Learning Representations.

  28. [29]

    Memory-based model editing at scale

    Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. Memory-based model editing at scale. In International Conference on Machine Learning.

  29. [30]

    Continual vision-language representation learning with off-diagonal information

    Zixuan Ni, Longhui Wei, Siliang Tang, Yueting Zhuang, and Qi Tian. Continual vision-language representation learning with off-diagonal information. In Proceedings of the 40th International Conference on Machine Learning, pages 26129–26149. PMLR, 2023.

  30. [31]

    Object hallucination in image captioning

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium, 2018. Association for Computational Linguistics.

  31. [32]

    Exposing hallucinations to suppress them: VLMs representation editing with generative anchors

    Youxu Shi, Suorong Yang, and Dong Liu. Exposing hallucinations to suppress them: VLMs representation editing with generative anchors, 2025. arXiv:2509.21997 [cs.CV].

  32. [33]

    DualEdit: Dual editing for knowledge updating in vision-language models

    Zhiyi Shi, Binjie Wang, Chongjie Si, Yichen Wu, Junsik Kim, and Hanspeter Pfister. DualEdit: Dual editing for knowledge updating in vision-language models. In Proceedings of the Conference on Language Modeling (COLM).

  33. [34]

    Towards VQA models that can read

    Amanpreet Singh, Vivek Natarjan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.

  34. [35]

    Continual learning in vision-language models via aligned model merging

    Ghada Sokar. Continual learning in vision-language models via aligned model merging, 2025.

  35. [36]

    Representation learning with contrastive predictive coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2018.

  36. [37]

    CIDEr: Consensus-based image description evaluation

    Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2015.

  37. [38]

    ReFT: Representation finetuning for language models

    Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. ReFT: Representation finetuning for language models, 2024.

  38. [39]

    Generative negative text replay for continual vision-language pretraining

    Shipeng Yan, Lanqing Hong, Hang Xu, Jianhua Han, Tinne Tuytelaars, Zhenguo Li, and Xuming He. Generative negative text replay for continual vision-language pretraining. In Computer Vision – ECCV 2022, pages 22–38. Springer.

  39. [40]

    Boosting continual learning of vision-language models via mixture-of-experts adapters

    Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Ping Hu, Dong Wang, Huchuan Lu, and You He. Boosting continual learning of vision-language models via mixture-of-experts adapters. arXiv preprint arXiv:2403.11549, 2024.

  40. [41]

    MM-Vet: Evaluating large multimodal models for integrated capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities, 2023.

  41. [42]

    Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models

    Yu-Chu Yu, Chi-Pin Huang, Jr-Jen Chen, Kai-Po Chang, Yung-Hsuan Lai, Fu-En Yang, and Yu-Chiang Frank Wang. Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models. In European Conference on Computer Vision (ECCV), 2024.

  42. [43]

    VQACL: A novel visual question answering continual learning setting

    Xi Zhang, Feifei Zhang, and Changsheng Xu. VQACL: A novel visual question answering continual learning setting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19102–19112, 2023.

  43. [44]

    Preventing zero-shot transfer degradation in continual learning of vision-language models

    Zangwei Zheng, Mingyuan Ma, Kai Wang, Ziheng Qin, Xiangyu Yue, and Yang You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19125–19136, 2023.

  44. [45]

    Supplementary contents: Theoretical Analysis of Non-Interference in DSCA · Additional Methodology Details · Evaluation Metrics · Implementation Details and Hyperparameters · Extended Experimental Results

  45. [46]

    Theoretical Analysis of Non-Interference in DSCA (Preliminaries)

    8.1. Preliminaries. Let the frozen VLM encoder produce fused representations h_f ∈ R^{d_f} (as defined in Sec. 3.1). For each discovered concept C_k, DSCA maintains a low-dimensional semantic subspace with basis matrix R_k ∈ R^{r_k × d_f}, where r_k ≪ d_f (Sec. 3.3). We view the rows of R_k as an orthonormal basis for the c...
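
These preliminaries support a one-line non-interference argument. The following is a hedged reconstruction consistent with the excerpt's notation (rows of R_k orthonormal), under the additional assumption of exact pairwise orthogonality between subspaces, which the excerpt truncates before stating:

```latex
h' = h_f + R_k^\top \delta, \qquad \delta \in \mathbb{R}^{r_k},
\qquad
R_j h' = R_j h_f + R_j R_k^\top \delta = R_j h_f
\quad \text{if } R_j R_k^\top = 0 \ (j \neq k).
```

That is, an edit written in subspace k leaves every other subspace's projection of the representation unchanged exactly when the cross-subspace products vanish.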

  46. [47]

    Additional Methodology Details (Gating Implementation)

    9.1. Gating Implementation Details. As discussed in Sec. 3.3, the component-wise gating vector γ_k(h_f) ∈ [0, 1]^{d_f} is implemented via a lightweight neural layer γ_k(h_f) = σ(W_{g,k} h_f + b_{g,k}), where σ is the element-wise sigmoid. To avoid quadratic parameter growth in d_f, we factorize W_{g,k} as a low-rank bottleneck: W_{g,k} = U_k V_k, with U_k...
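
The factorized gate is simple to write down. A minimal forward-pass sketch, assuming the low-rank factorization W_{g,k} = U_k V_k from the excerpt; the random initialization is a placeholder and training of U, V, b is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LowRankGate:
    """Component-wise gate gamma_k(h_f) = sigma(U_k (V_k h_f) + b_{g,k}).

    Implements the low-rank bottleneck W_{g,k} = U_k V_k described in the
    supplementary excerpt, avoiding a dense d_f x d_f weight matrix.
    """
    def __init__(self, d_f, bottleneck, rng):
        self.U = rng.normal(scale=0.1, size=(d_f, bottleneck))
        self.V = rng.normal(scale=0.1, size=(bottleneck, d_f))
        self.b = np.zeros(d_f)

    def __call__(self, h_f):
        # Parameter count: 2 * d_f * bottleneck + d_f instead of d_f**2 + d_f.
        return sigmoid(self.U @ (self.V @ h_f) + self.b)

rng = np.random.default_rng(0)
gate = LowRankGate(d_f=64, bottleneck=8, rng=rng)
gamma = gate(rng.normal(size=64))  # values strictly in (0, 1)
```

The bottleneck is the design choice that keeps per-concept gating cheap as the number of concepts grows.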

  47. [48]

    Evaluation Metrics

    In this section, we provide the formal definitions of all evaluation metrics referenced in Sec. 4.2 of the main paper. Let f_θ0 denote the original (unedited) model and f_θt the model after t sequential edits. Each edit request is represented as a tuple e = (v, p, o), consisting of a visual input v, textual prompt p, and desired target output...

  48. [49]

    Implementation Details and Hyperparameters

    In this section, we detail the experimental setup and hyperparameter configurations used to train and evaluate DSCA. All experiments were conducted using the PyTorch framework on 8× NVIDIA A100 (80GB) GPUs with mixed-precision training. Backbone Models. We apply DSCA to two distinct vision-language architect...

  49. [50]

    Extended Experimental Results

    We provide expanded comparisons against a wider range of baselines in Tables 8, 9, and 10. 12.1. Expanded Single-Edit Performance. Table 8 provides a comprehensive single-edit success comparison on the E-VQA and E-IC benchmarks. All baseline numbers, including standard fine-tuning variants and retrieval-based methods, are so...

  50. [51]

    Extended Experimental Results (continued)

    ...are taken directly from the Sequential Editing benchmarks reported in LiveEdit [3]. 12.3. Expanded Continual Learning on CoIN. Table 10 reports results on the CoIN benchmark using the PaliGemma-3B backbone [1]. All baseline numbers are sourced from PAM [32]. To establish performance bounds, we include three foundational setups defined in their work: Zero-sh...