Pith · machine review for the scientific record

arXiv:2605.10765 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI · cs.LG

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

Da-Wei Zhou, Tao Hu

Pith reviewed 2026-05-12 03:23 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords multimodal continual instruction tuning · dynamic prompt generation · catastrophic forgetting · cross-modal prompts · instance-specific adaptation · null-space projection · prototype routing

The pith

DRAPE generates instance-specific soft prompts for each query-image pair by deriving queries from the textual instruction and cross-attending to visual patches, prepends them to a frozen LLM, and curbs forgetting with null-space projection on the shared projector and CLIP-based prototype routing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve multimodal continual instruction tuning, where models must learn new tasks from sequential data without losing prior skills. Current approaches rely on task-level prompts or LoRA modules that are selected or combined at inference, but they overlook large variations inside each task in scenes, questions, and reasoning needs. DRAPE instead builds fresh soft prompts on the fly for every individual instruction and image, using cross-modal attention to condition them, while protecting the shared projector with null-space gradient projection and routing via CLIP prototypes so no task labels are required at test time. If this holds, continual expansion of multimodal capabilities becomes feasible in open deployment without the usual overwriting of earlier knowledge. Experiments on standard MCIT benchmarks position it ahead of prompt-based and LoRA-based continual baselines.

Core claim

DRAPE creates continuous instance-specific soft prompts by extracting query features from the textual instruction and cross-attending them to visual patch features from the image, then prepending the resulting prompts to the frozen LLM; forgetting is controlled by projecting gradients into the null space of the shared projector during updates and by selecting the appropriate generator at inference through CLIP-based prototype routing without task labels.
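
As a concrete reading of this mechanism, the sketch below shows text-derived queries cross-attending to visual patch features to produce per-instance soft prompts. It is a minimal PyTorch illustration under assumed shapes; the class and parameter names are ours, not the authors' implementation.

```python
# Minimal sketch of instance-specific cross-modal prompt generation.
# Assumptions: d_model, n_prompts, and all module names are illustrative;
# the paper's exact generator architecture is not reproduced here.
import torch
import torch.nn as nn

class CrossModalPromptGenerator(nn.Module):
    def __init__(self, d_model=512, n_prompts=16, n_heads=8):
        super().__init__()
        # Learnable prompt seeds, shifted per instance by the instruction.
        self.prompt_seeds = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)
        self.text_proj = nn.Linear(d_model, d_model)
        # Prompt queries attend to visual patch features (cross-attention).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, text_feats, visual_patches):
        # text_feats: (B, T, d) instruction token features
        # visual_patches: (B, P, d) projected visual patch features
        pooled = text_feats.mean(dim=1, keepdim=True)                      # (B, 1, d)
        queries = self.prompt_seeds.unsqueeze(0) + self.text_proj(pooled)  # (B, n_prompts, d)
        prompts, _ = self.cross_attn(queries, visual_patches, visual_patches)
        return self.out_proj(prompts)  # soft prompts prepended to the frozen LLM

# Example: 16 prompts conditioned jointly on a 7-token instruction and 196 patches.
gen = CrossModalPromptGenerator()
prompts = gen(torch.randn(2, 7, 512), torch.randn(2, 196, 512))  # (2, 16, 512)
```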

What carries the argument

Dynamic cross-modal prompt generation that produces query-image-conditioned soft prompts via text-derived queries cross-attended to visual patches, protected by null-space gradient projection and CLIP prototype routing.

If this is right

  • Intra-task sample differences in visuals and reasoning are handled by per-instance prompts rather than task-level selection.
  • No task identity is needed at inference because routing uses CLIP prototypes (a minimal routing sketch follows this list).
  • The shared projector remains stable across updates through null-space projection.
  • Performance exceeds representative prompt and LoRA continual baselines on MCIT benchmarks.
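
A minimal sketch of that label-free routing, assuming (per the abstract and Figure 2) one prototype per task registered in a frozen CLIP embedding space; fusing the image and instruction embeddings by addition is our assumption, not the paper's stated rule.

```python
# Minimal sketch of CLIP-prototype routing at inference (names illustrative).
import torch
import torch.nn.functional as F

@torch.no_grad()
def route_to_generator(clip_image_emb, clip_text_emb, prototypes):
    """Select the task-specific generator for one query-image pair.

    clip_image_emb, clip_text_emb: (d,) frozen-CLIP embeddings of the image
        and the instruction.
    prototypes: (T, d) one registered prototype per task seen so far.
    Returns the index of the generator to use; no task label is needed.
    """
    query = F.normalize(clip_image_emb + clip_text_emb, dim=-1)  # assumed fusion
    sims = F.normalize(prototypes, dim=-1) @ query               # cosine sims, (T,)
    return int(sims.argmax())
```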

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce the need for storing separate modules per task, lowering memory growth in long task sequences.
  • Instance-level conditioning might improve robustness when test distributions shift within a known task.
  • If the projection and routing generalize, similar dynamic generation could apply to other frozen-backbone continual setups.

Load-bearing premise

Null-space gradient projection on the shared projector together with CLIP-based prototype routing will keep forgetting low across any sequence of tasks even when no task labels are supplied at inference.
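
To make the premise concrete, here is a minimal sketch of the projection step as Figure 2 summarizes it: SVD of accumulated feature statistics yields a retained principal subspace, and projector gradients are pushed into its orthogonal complement. The energy threshold and all shapes are assumptions, not the authors' code.

```python
# Minimal sketch of null-space gradient projection on the shared projector.
import torch

def null_space_projection(feature_stats, energy=0.99):
    """Build Pi from accumulated statistics (an assumed form of M(t)).

    feature_stats: (d_in, d_in) uncentered covariance of projector inputs
    from tasks 1..t. Returns Pi projecting onto the orthogonal complement
    of the principal subspace that carries `energy` of the spectrum.
    """
    U, S, _ = torch.linalg.svd(feature_stats)
    k = int((torch.cumsum(S, 0) / S.sum() < energy).sum()) + 1
    U_k = U[:, :k]                                   # retained subspace
    eye = torch.eye(feature_stats.size(0), dtype=feature_stats.dtype)
    return eye - U_k @ U_k.T

def project_projector_grad(weight_grad, Pi):
    # For an nn.Linear projector with weight (d_out, d_in), each row of
    # dL/dW is a combination of input vectors, so right-multiplying by Pi
    # removes components along directions earlier tasks rely on.
    return weight_grad @ Pi
```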

What would settle it

A sequential task stream in which accuracy on earlier tasks falls sharply below the best baseline after several updates despite applying the null-space projection and prototype routing.
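
Such a stream would be scored with the standard forgetting measure; a minimal sketch, assuming an accuracy matrix acc[i][j] = accuracy on task j after training through task i (the usual MCIT evaluation protocol):

```python
# Minimal sketch of average forgetting over a task stream (names illustrative).
def average_forgetting(acc):
    """acc[i][j]: accuracy on task j measured after training task i (i >= j)."""
    T = len(acc)
    if T < 2:
        return 0.0  # forgetting is undefined for a single task
    drops = []
    for j in range(T - 1):  # each task except the last has later checkpoints
        best_past = max(acc[i][j] for i in range(j, T - 1))
        drops.append(best_past - acc[T - 1][j])
    return sum(drops) / len(drops)
```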

Figures

Figures reproduced from arXiv: 2605.10765 by Da-Wei Zhou, Tao Hu.

Figure 1: Why instance-specific prompts are needed.

Figure 2: Illustration of DRAPE. Left: Training on task t. A task-specific generator synthesizes soft prompts, while the shared visual projector is regularized by projection onto the complement of the retained principal subspace. Feature statistics M(t) are decomposed via SVD to obtain a projection matrix Π(t) for the next task. After training, a task prototype c_t is registered in a frozen CLIP embedding space. Top-…

Figure 3: Routing ablation and generator hidden-dimension sensitivity on the CoIN benchmark. Left: …

Figure 4: Prompt-to-image attention visualizations on OCR-VQA examples. For each example, …

Figure 5: Impact of prompt and LoRA expert numbers on the CoIN benchmark. We vary the …

Figure 6: Row-normalized routing confusion matrix on the final task. Each row corresponds to a …

Figure 7: Case studies on GQA. The left example is a relatively simple case where both variants are …

Figure 8: Case studies on VQAv2. The left example is a relatively simple case where both variants …

Figure 9: OCR-VQA case study with a fixed image and different queries. Although the visual …

Figure 10: OCR-VQA case study with a fixed query type and different images. Both variants …
Original abstract

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework for Multimodal Continual Instruction Tuning (MCIT) in MLLMs. It generates instance-specific soft prompts by deriving queries from the textual instruction and cross-attending to visual patch features, prepending these to the frozen LLM. Forgetting is mitigated via null-space gradient projection on the shared projector during sequential updates, combined with CLIP-based prototype routing to enable task-label-free generator selection at inference. The central claim is that extensive experiments on MCIT benchmarks demonstrate state-of-the-art performance relative to representative prompt-based and LoRA-based continual-learning baselines.

Significance. If the empirical results hold, the work offers a meaningful advance by shifting from task-level module composition to instance-level prompt synthesis, better accommodating intra-task variability in visual scenes and reasoning demands. The combination of null-space projection with CLIP prototypes is a practical synthesis of established techniques that avoids circular fitting and supports label-free inference. This could inform more flexible continual adaptation strategies for large multimodal models in deployment scenarios.

major comments (2)
  1. Abstract and §4: The headline claim of SOTA performance is stated without accompanying quantitative tables, exact metric values, baseline implementation details, ablation studies, or error bars. This prevents direct verification of the magnitude and statistical reliability of the reported gains over prompt-based and LoRA baselines.
  2. §3.2: The null-space gradient projection is applied to the shared projector, but the manuscript does not specify how the null-space basis is maintained or updated across sequential tasks when new visual-textual distributions arrive; without this, it is unclear whether the projection remains effective at preventing interference in later tasks.
minor comments (2)
  1. The expansion of the DRAPE acronym is implicit from the title but should be stated explicitly on first use in the abstract and introduction for clarity.
  2. Notation for the cross-attention operation between prompt queries and visual patches could be formalized with an equation to improve reproducibility.
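
One way the requested formalization could read, in standard scaled dot-product form; the symbols are our notation, not the paper's:

```latex
% Hypothetical notation: X_t = text-derived prompt queries, X_v = visual
% patch features, W_Q, W_K, W_V = learned projections, d = key dimension.
P \;=\; \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad Q = X_t W_Q, \quad K = X_v W_K, \quad V = X_v W_V .
```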

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment. We address each major comment below with clarifications and commitments to revisions that strengthen the presentation without altering the core contributions.

Point-by-point responses
  1. Referee: Abstract and §4: The headline claim of SOTA performance is stated without accompanying quantitative tables, exact metric values, baseline implementation details, ablation studies, or error bars. This prevents direct verification of the magnitude and statistical reliability of the reported gains over prompt-based and LoRA baselines.

    Authors: We appreciate this point. Section 4 already contains the full quantitative tables with exact metric values, baseline implementation details, and ablation studies. To address the concern about immediate verifiability in the abstract and opening of §4, we will revise the abstract to include a concise summary of key performance deltas and add error bars (computed over multiple random seeds) to all relevant tables and figures in the revised manuscript. This improves accessibility while preserving the existing experimental content. revision: partial

  2. Referee: §3.2: The null-space gradient projection is applied to the shared projector, but the manuscript does not specify how the null-space basis is maintained or updated across sequential tasks when new visual-textual distributions arrive; without this, it is unclear whether the projection remains effective at preventing interference in later tasks.

    Authors: Thank you for identifying this gap in clarity. The current description in §3.2 focuses on the projection step but does not explicitly detail the cross-task maintenance procedure. In the revision we will expand §3.2 with the following specification: after each task t, the null-space basis is updated by computing the orthogonal complement (via SVD) to the accumulated gradient matrix formed from all prior tasks 1…t; the new basis is then used for projection in task t+1. This incremental orthogonalization ensures the protected subspace grows without circular fitting and remains effective against interference from future distributions. revision: yes
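
A minimal sketch of the maintenance procedure this response commits to, assuming statistics are accumulated by row-concatenation and re-orthogonalized by SVD after each task; class and variable names are ours, not the authors':

```python
# Minimal sketch of cross-task null-space basis maintenance (illustrative).
import torch

class NullSpaceTracker:
    def __init__(self, dim, energy=0.99):
        self.dim = dim
        self.energy = energy
        self.M = torch.zeros(0, dim)   # accumulated rows from tasks 1..t

    def finish_task(self, task_rows):
        """task_rows: (n_t, dim) statistics gathered while training task t."""
        self.M = torch.cat([self.M, task_rows], dim=0)
        _, S, Vh = torch.linalg.svd(self.M, full_matrices=False)
        # Retain the principal row space carrying `energy` of the variance.
        ratios = torch.cumsum(S**2, 0) / (S**2).sum()
        keep = int((ratios < self.energy).sum()) + 1
        B = Vh[:keep]                  # (keep, dim) retained basis
        # Projection onto its orthogonal complement, applied in task t+1.
        return torch.eye(self.dim) - B.T @ B
```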

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an empirical method, DRAPE, combining cross-modal attention for instance-specific prompts, null-space projection on the shared projector, and CLIP prototype routing at inference. These components build on established prior techniques; no equation or claim reduces by construction to the method's own fitted parameters or self-citations. Performance claims rest on benchmark experiments rather than derivations, and no load-bearing step equates a prediction with its input definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on standard assumptions of prompt tuning and continual learning; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5536 in / 976 out tokens · 49297 ms · 2026-05-12T03:23:21.116985+00:00 · methodology

