arxiv: 2607.02010 · v1 · pith:C6VZI4AR · new · submitted 2026-07-02 · 💻 cs.AI

InduceKV: Fixed-Footprint Continual Adaptation of Multimodal LLMs via Inducing KV Memories

Pith reviewed 2026-07-03 13:39 UTC · model grok-4.3

classification 💻 cs.AI
keywords continual learning, multimodal LLMs, KV cache, fixed footprint, retrieval-based adaptation, instruction tuning, visual question answering, domain adaptation

The pith

InduceKV stores selected training prefixes as compact KV memories to enable continual adaptation of multimodal LLMs under a fixed memory budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models face the challenge of adapting to new tasks and domains without their memory footprint growing over time through repeated updates or expanding replay stores. InduceKV addresses this by externalizing adaptation into a fixed set of attention-ready memory entries, each consisting of a frozen retrieval key and compact layerwise key-value payloads that append directly to the model's self-attention cache. A bilevel selection procedure first fits a lightweight calibration for retrieval, then chooses the inducing set to balance current-task likelihood, anchor-based retention, and coverage within the frozen space. This yields consistent gains over PEFT, MoE, replay, and prompt-retrieval baselines across task-incremental instruction tuning, continual VQA, domain-incremental adaptation, and lifelong multimodal instruction tuning when memory budgets are matched.
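The memory-entry idea in the paragraph above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `MemoryEntry`, `normalize`, `retrieve`, and `extend_cache` are hypothetical, retrieval is assumed to be plain cosine-similarity top-k, and the KV cache is modelled as nested Python lists rather than real attention tensors.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    # Unit-norm retrieval key for the stored prefix (hypothetical layout).
    key: list
    # One (K_rows, V_rows) pair per layer; each row is a list of floats.
    payload: list


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def retrieve(query, memory, k):
    """Return the k entries whose keys have the highest cosine similarity to the query."""
    q = normalize(query)
    ranked = sorted(
        memory,
        key=lambda e: sum(a * b for a, b in zip(q, e.key)),
        reverse=True,
    )
    return ranked[:k]


def extend_cache(cache, entries):
    """Append each retrieved entry's per-layer K/V rows to a copy of the cache.

    `cache` is a list over layers of (K_rows, V_rows); the input is not modified.
    """
    extended = []
    for layer, (keys, values) in enumerate(cache):
        mem_k = [row for e in entries for row in e.payload[layer][0]]
        mem_v = [row for e in entries for row in e.payload[layer][1]]
        extended.append((keys + mem_k, values + mem_v))
    return extended
```

Because the backbone only ever sees extra rows in its cache, nothing in this sketch touches model weights, which is the property the fixed-footprint framing relies on.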

Core claim

InduceKV constructs a compact inducing set of KV memories through bilevel selection, where a lightweight calibration fits retrieval while the selected memories balance current-task likelihood, anchor-based retention, and coverage in the frozen retrieval space, allowing the backbone to remain frozen and the adaptation state to stay bounded.

What carries the argument

Bilevel selection procedure that produces a compact inducing set of attention-ready KV memory entries from training prefixes.
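To make the budgeted selection concrete, here is a deliberately simplified stand-in. It is not the paper's bilevel optimizer, whose details this page does not give. The sketch swaps in a plain greedy rule: each candidate has a scalar `utility` (a hypothetical proxy for the likelihood and retention terms) and earns a facility-location style coverage bonus for candidates it represents better than the current selection. The function name, the `coverage_weight` parameter, and the scoring rule are all assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def greedy_select(keys, utility, budget, coverage_weight=1.0):
    """Greedily pick up to `budget` candidate indices.

    keys: unit-norm retrieval keys, one per candidate.
    utility: per-candidate scalar score (stand-in for likelihood + retention).
    Coverage gain of candidate i = average increase, over all candidates j,
    of the best similarity between j and the selected set.
    """
    n = len(keys)
    best_sim = [0.0] * n  # best similarity of each candidate to the selected set
    selected = []
    for _ in range(min(budget, n)):
        best_i, best_gain = None, float("-inf")
        for i in range(n):
            if i in selected:
                continue
            coverage = sum(
                max(dot(keys[i], keys[j]) - best_sim[j], 0.0) for j in range(n)
            ) / n
            gain = utility[i] + coverage_weight * coverage
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        # Update how well each candidate is now covered.
        for j in range(n):
            best_sim[j] = max(best_sim[j], dot(keys[best_i], keys[j]))
    return selected
```

With a zero coverage weight the rule degenerates to picking the highest-utility candidates, which may be near-duplicates; a larger weight trades some utility for spread in the retrieval space. That is the kind of tension the likelihood, retention, and coverage balance is described as managing.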

If this is right

  • Consistent outperformance over PEFT, MoE, replay, and prompt-retrieval baselines under matched memory budgets holds across task-incremental instruction tuning, continual VQA, domain-incremental adaptation, and lifelong multimodal instruction tuning.
  • Gains remain after controlling for backbone strength, stage-1 CoIN, compute matching, and candidate-pool size.
  • Adaptation state stays bounded while the backbone model itself receives no updates.
  • The method externalizes task-specific state into retrieval-ready KV payloads rather than parameter changes or growing replay buffers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of adaptation state into fixed KV memories could simplify deployment across hardware with strict memory limits.
  • If the inducing set remains effective as task count grows, the approach may reduce reliance on replay buffers in other continual-learning regimes.
  • The bilevel balancing of likelihood, retention, and coverage might generalize to selecting memories for non-multimodal instruction streams.

Load-bearing premise

The bilevel selection reliably produces a compact inducing set whose performance holds when the backbone model stays frozen and no parameters are updated.

What would settle it

A head-to-head comparison under identical memory budgets in which InduceKV fails to improve over at least one of the PEFT, MoE, replay, or prompt-retrieval baselines in any of the four reported continual adaptation settings.

Figures

Figures reproduced from arXiv: 2607.02010 by Canran Xiao, Qianyu Chen, Runxuan Tang, Ziteng Feng.

Figure 1: INDUCEKV for budgeted continual MLLM adaptation.
Our contributions are as follows: (i) We reframe continual MLLM adaptation as budgeted online inducing-set selection for retrieval-based memory, making the stability–plasticity tension explicit as a memory allocation problem rather than an ever-growing parameter update process. (ii) We introduce a retrieval-driven adaptation mechanism that stores task increm…
Figure 2: INDUCEKV pipeline for continual adaptation. For each incoming task $t$, the frozen MLLM extracts from each prefix $x$ a unit-norm retrieval key $r(x)$ and compressed layerwise KV payloads $\{(\bar{K}_\ell(x), \bar{V}_\ell(x))\}_{\ell=1}^{L}$, forming new entries that are merged with the previous memory into a candidate pool $U_t$. Under a fixed budget $B$, a bilevel optimizer constructs a compact inducing set: the inner level fits a minima…
Figure 3: Memory attention utilization. Rows are tasks/domains and columns are layers; heatmap values show memory-attention mass, while the last column reports normalized gain over NO-MEM.
Does inducing-set selection reduce redundancy? We next test whether the bilevel selection objective produces a genuinely compact and diverse inducing set under the fixed memory budget. We compare full INDUCEKV with a no-coverage v…
Figure 4: Inducing-set diversity.
6 Conclusion
We address continual adaptation of MLLMs under strict footprint constraints without updating backbone parameters, and propose INDUCEKV to externalize task increments into attention-compatible KV memory with budgeted inducing-set selection. Our empirical and theoretical results support online inducing-set selection as a principled alternative to parameter-updating pipeli…
Figure 5: Hyperparameter sensitivity of INDUCEKV (mean±std).
C.3 Effect of Task Order
We test whether INDUCEKV is order-robust. For each benchmark suite, we sample M random task permutations and run continual adaptation under the same budget/config as the main experiments. We report the final metric for each run and visualize the distribution across orders. Specifically, we measure: UCIT Avg (6 tasks), CoIN Avg (8 t…
Figure 6: Order sensitivity of continual adaptation. Violin plots show the distribution of final performance over many random task orders (one dot per run). Stars indicate the default-order scores reported in Table ??. Across all suites, INDUCEKV achieves higher means and noticeably smaller variances than strong baselines, suggesting improved robustness to task ordering.
between the top-1 and top-2 retrieved keys: …
Figure 7: When does INDUCEKV help most? Each point is one evaluation example with x-axis retrieval ambiguity g(x) = s(1) − s(2) (smaller means more ambiguous retrieval) and y-axis per-example gain ∆(x) (Eq. (70)). Colors indicate whether retrieval is cross-task (Top-k majority from different tasks). Solid lines show binned mean gain for each group.
Figure 8: Calibration dynamics. The trace shows learned retrieval temperature $\tau_t$, mean value gate $\bar{\lambda}_t$, and per-task gain of Full over Fixed-τ, λ.
We use task-/dataset-ID as a weak supervision signal. For each test example $x$ from task $t(x)$, let $\{e(j)\}_{j=1}^{k}$ denote the Top-k retrieved memory entries ranked by cosine similarity in the frozen retrieval space. We define the Top-k hit rate as $\mathrm{hit}_k(x) \triangleq \frac{1}{k} \sum_{j=1}^{k} \mathbb{I}[\,$…
Figure 9: Retrieval hit rate vs. performance. x-axis: hit-rate bins of $\mathrm{hit}_8(x)$ (Eq. (71)). Left y-axis: mean gain E[∆(x)] (INDUCEKV − NO-MEM) with ±1 s.e. bands. Right y-axis: mean accuracy of INDUCEKV (solid) and NO-MEM (dashed) with ±1 s.e. bands.
with INDUCEKV’s soft retrieval (temperature τ) and value-gated injection ($\lambda_\ell$), which down-weights uncertain memory contributions and prevents noisy retrieval from desta…
Figure 10: Cross-backbone reproducibility. Each bar shows the mean gain $\Delta_{\mathrm{avg}}$ (Eq. (72)) of INDUCEKV over the best baseline under the same footprint budget, averaged over five settings: (UCIT Avg, CoIN Avg, VQAv2 AP, Domain Overall, LiIT AvgAcc). Error bars denote std across seeds. Each x-tick additionally reports the 5-tuple of per-setting gains in the order (UCIT, CoIN, VQA, Domain, LiIT).
Figure 11: Memory–Compute–Quality trade-off on continual VQA. Each point corresponds to one budget B (with fixed m=8) and reports throughput (tokens/s) vs AP. Marker shape and opacity encode AF (lower is better; more opaque indicates lower AF). INDUCEKV traces a stronger Pareto frontier: at matched throughput it achieves higher AP, and at matched AP it runs faster, while maintaining low forgetting at moderate budget…
Original abstract

Multimodal large language models must adapt to evolving tasks and domains, yet continual improvement under bounded deployment footprint remains difficult because repeated parameter updates or growing replay stores can accumulate adaptation state over time. We study fixed-footprint continual adaptation: the deployed adaptation state is kept under a fixed memory budget, while the backbone model is left unchanged and task-specific updates are externalized. We propose InduceKV, a retrieval-based method that stores each selected training prefix as an attention-ready memory entry, consisting of a frozen retrieval key and compact layerwise key-value (KV) payloads that can be appended to the model's self-attention cache. Under a strict memory budget, InduceKV constructs a compact inducing set through bilevel selection: a lightweight calibration is fit for retrieval, while the selected memory balances current-task likelihood, anchor-based retention, and coverage in the frozen retrieval space. Across task-incremental instruction tuning, continual VQA, domain-incremental adaptation, and lifelong multimodal instruction tuning, InduceKV consistently improves over PEFT, MoE, replay, and prompt-retrieval baselines under matched memory budgets. We further report backbone-matched, stage-1 CoIN, compute-matched, and scalability diagnostics, showing that the gains are not due to a stronger backbone, replay alone, or an unbounded candidate pool.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InduceKV, a retrieval-based approach for fixed-footprint continual adaptation of multimodal LLMs. It externalizes task-specific state by storing selected training prefixes as attention-ready memory entries consisting of a frozen retrieval key and compact layerwise KV payloads that append to the self-attention cache. A bilevel selection procedure (lightweight calibration for retrieval keys plus balancing of current-task likelihood, anchor-based retention, and coverage) constructs a compact inducing set under a strict memory budget while leaving the backbone frozen. The central claim is that InduceKV yields consistent gains over PEFT, MoE, replay, and prompt-retrieval baselines across task-incremental instruction tuning, continual VQA, domain-incremental adaptation, and lifelong multimodal instruction tuning, with additional backbone-matched, compute-matched, and scalability diagnostics ruling out alternative explanations.

Significance. If the results hold under the reported controls, the work would provide a concrete mechanism for bounded-memory continual adaptation without parameter growth or unbounded replay, which is practically relevant for deployed multimodal models. The explicit use of inducing KV payloads and the suite of matched-budget diagnostics are positive features that strengthen the fixed-footprint framing.

major comments (2)
  1. [Bilevel selection procedure] Bilevel selection (described in the method section): the headline fixed-footprint claim rests on the selected inducing set improving performance with a frozen backbone and no additional parameters. No ablation isolating the three balancing terms (current-task likelihood, anchor-based retention, coverage) or testing stability of the selected set across task orderings is referenced, leaving the reliability of the compact set under-specified.
  2. [Results and diagnostics] Experimental claims (abstract and results): the assertion of consistent gains under matched memory budgets is load-bearing, yet the manuscript text supplies no quantitative deltas, error bars, or per-setting tables that would allow verification of the data-to-claim link against the listed baselines.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one concrete performance number or effect size rather than the qualitative statement 'consistently improves'.
  2. [Method and experimental setup] Clarify how the memory budget is computed for the KV payloads versus the replay and prompt-retrieval baselines to make the 'matched' comparison fully transparent.
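The budget-accounting question in the last minor comment comes down to simple arithmetic. The sketch below shows one hypothetical way to count bytes per memory entry and convert a byte budget into an entry count. The formula, the function names, and every parameter (`slots_per_layer`, `kv_dim`, `key_dim`, `bytes_per_value`) are assumptions for illustration. The paper's actual accounting is not given on this page.

```python
def entry_bytes(num_layers, slots_per_layer, kv_dim, key_dim, bytes_per_value=2):
    """Bytes for one memory entry under an assumed layout.

    Assumed layout: one retrieval key of `key_dim` values, plus, for each layer,
    `slots_per_layer` key rows and as many value rows of width `kv_dim`.
    """
    key_bytes = key_dim * bytes_per_value
    payload_bytes = num_layers * slots_per_layer * 2 * kv_dim * bytes_per_value
    return key_bytes + payload_bytes


def max_entries(budget_bytes, **shape):
    """Largest number of whole entries that fits in `budget_bytes`."""
    return budget_bytes // entry_bytes(**shape)
```

A matched comparison would apply the same byte budget to each baseline's stored state (for example, replay samples or prompt pools), which is the transparency the comment asks for.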

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript to incorporate the suggested additions for greater clarity and verifiability.

Point-by-point responses
  1. Referee: [Bilevel selection procedure] Bilevel selection (described in the method section): the headline fixed-footprint claim rests on the selected inducing set improving performance with a frozen backbone and no additional parameters. No ablation isolating the three balancing terms (current-task likelihood, anchor-based retention, coverage) or testing stability of the selected set across task orderings is referenced, leaving the reliability of the compact set under-specified.

    Authors: We agree that additional analysis would strengthen the reliability of the bilevel selection procedure. In the revised manuscript we will add a dedicated ablation subsection that isolates the contribution of each of the three balancing terms by reporting performance when each term is removed in turn. We will also include results across multiple task orderings to demonstrate stability of the selected inducing set. These experiments will use the same memory budgets and evaluation protocols as the main results.
    Revision: yes

  2. Referee: [Results and diagnostics] Experimental claims (abstract and results): the assertion of consistent gains under matched memory budgets is load-bearing, yet the manuscript text supplies no quantitative deltas, error bars, or per-setting tables that would allow verification of the data-to-claim link against the listed baselines.

    Authors: We acknowledge that the main text would benefit from more explicit quantitative reporting. Although detailed per-setting tables appear in the appendix, we will add a consolidated summary table to the main results section that reports average deltas, standard errors across runs, and direct head-to-head comparisons against all baselines under matched memory budgets. Error bars will also be added to the primary figures. These changes will make the claimed gains directly verifiable from the main text.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: method is a described algorithm with external empirical claims

The paper presents InduceKV as a retrieval-based procedure whose bilevel selection (calibration plus balancing of likelihood, retention, and coverage) is an explicit algorithmic construction, not a fitted quantity renamed as a prediction. No equations or steps in the abstract reduce the reported gains to the inputs by definition; the central claims are comparative improvements over PEFT/MoE/replay baselines under matched budgets, which are externally falsifiable. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5779 in / 1046 out tokens · 44806 ms · 2026-07-03T13:39:56.463407+00:00 · methodology

Reference graph

Works this paper leans on

220 extracted references · 24 canonical work pages · 5 internal anchors

  1. [1]

    Langley , title =

    P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

  2. [2]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Are transformers effective for time series forecasting? , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  3. [3]

    International Conference on Machine Learning , pages=

    MOMENT: A Family of Open Time-series Foundation Models , author=. International Conference on Machine Learning , pages=. 2024 , organization=

  4. [4]

    Timer-XL: Long-Context Transformers for Unified Time Series Forecasting , author=

  5. [5]

    ICLR 2025: The Thirteenth International Conference on Learning Representations , year=

    Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts , author=. ICLR 2025: The Thirteenth International Conference on Learning Representations , year=

  6. [6]

    arXiv preprint arXiv:2507.14507 , year=

    Diffusion models for time series forecasting: A survey , author=. arXiv preprint arXiv:2507.14507 , year=

  7. [7]

    International conference on machine learning , pages=

    Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting , author=. International conference on machine learning , pages=. 2021 , organization=

  8. [8]

    Advances in neural information processing systems , volume=

    Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting , author=. Advances in neural information processing systems , volume=

  9. [9]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Informer: Beyond efficient transformer for long sequence time-series forecasting , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  10. [10]

    The eleventh international conference on learning representations , year=

    Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting , author=. The eleventh international conference on learning representations , year=

  11. [11]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Construct-vl: Data-free continual structured vl concepts learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  12. [12]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Continual learning for visual search with backward consistent feature embedding , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  13. [13]

    Advances in neural information processing systems , volume=

    Dark experience for general continual learning: a strong, simple baseline , author=. Advances in neural information processing systems , volume=

  14. [14]

    Workshop on Multi-Task and Lifelong Reinforcement Learning , year=

    Continual learning with tiny episodic memories , author=. Workshop on Multi-Task and Lifelong Reinforcement Learning , year=

  15. [15]

    IEEE Transactions on Instrumentation and Measurement , year=

    GALMOR: Memory-Constrained Continual Learning With Efficient Replay for Fault Diagnosis of Rotating Machinery , author=. IEEE Transactions on Instrumentation and Measurement , year=

  16. [16]

    Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

    Guilin Zhu and Dongyue Wu and Changxin Gao and Runmin Wang and Weidong Yang and Nong Sang , title =. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

  17. [17]

    arXiv preprint arXiv:2503.06683 , year =

    Dynamic Dictionary Learning for Remote Sensing Image Segmentation , author =. arXiv preprint arXiv:2503.06683 , year =

  18. [18]

    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =

    SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year =

  19. [19]

    arXiv preprint arXiv:2501.13925 , year =

    GeoPixel: Pixel Grounding Large Multimodal Model in Remote Sensing , author =. arXiv preprint arXiv:2501.13925 , year =

  20. [20]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

    SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

  21. [21]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

    SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

  22. [22]

    Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

    Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation , author =. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

  23. [23]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

    Learning at a glance: Towards interpretable data-limited continual semantic segmentation via semantic-invariance modelling , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2024 , publisher=

  24. [24]

    IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium , pages=

    Self-training and curriculum learning guided dynamic refined network for remote sensing class-incremental semantic segmentation , author=. IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium , pages=. 2024 , month=

  25. [25]

    IEEE Transactions on Geoscience and Remote Sensing , volume=

    Domain-Incremental Learning for Remote Sensing Semantic Segmentation With Multifeature Constraints in Graph Space , author=. IEEE Transactions on Geoscience and Remote Sensing , volume=. 2024 , publisher=

  26. [26]

    Science China Information Sciences , volume=

    Mitigating representation bias for class-incremental semantic segmentation of remote sensing images , author=. Science China Information Sciences , volume=. 2025 , doi=

  27. [27]

    IEEE Transactions on Geoscience and Remote Sensing , volume=

    MiSSNet: Memory-inspired semantic segmentation augmentation network for class-incremental learning in remote sensing images , author=. IEEE Transactions on Geoscience and Remote Sensing , volume=. 2024 , publisher=

  28. [28]

    Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

    Yirui Wu and Yuhang Xia and Hao Li and Lixin Yuan and Junyang Chen and Jun Liu and Tong Lu and Shaohua Wan , title =. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

  29. [29]

    Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

    Zhidong Yu and Xiaoman Liu and Jiajun Hu and Zhenbo Shi and Wei Yang , title =. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

  30. [30]

    Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

    Cheng Xu and Weiwen Zhang and Hongrui Zhang and Xuemiao Xu and Huaidong Zhang and Jing Zou and Jing Qin , title =. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , year =

  31. [31]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  32. [32]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

    Hongmei Yin and Tingliang Feng and Fan Lyu and Fanhua Shang and Hongying Liu and Wei Feng and Liang Wan , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

  33. [33]

    IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) , volume=

    Understanding video events: A survey of methods for automatic interpretation of semantic occurrences in video , author=. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) , volume=. 2009 , publisher=

  34. [34]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Revisiting the" video" in video-language understanding , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  35. [35]

    Advances in Neural Information Processing Systems , volume=

    A Practitioner's Guide to Real-World Continual Multimodal Pretraining , author=. Advances in Neural Information Processing Systems , volume=

  36. [36]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

    Yuchen Zhu and Cheng Shi and Dingyou Wang and Jiajin Tang and Zhengxuan Wei and Yu Wu and Guanbin Li and Sibei Yang , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

  37. [37]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

    Kai Fang and Anqi Zhang and Guangyu Gao and Jianbo Jiao and Chi Harold Liu and Yunchao Wei , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year =

  38. [38]

    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

    Maoxian Wan and Kaige Li and Qichuan Geng and Weimin Shi and Zhong Zhou , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

  39. [39]

    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

    Ruitao Wu and Yifan Zhao and Jia Li , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

  40. [40]

    Advances in Neural Information Processing Systems , year=

    OVS Meets Continual Learning: Towards Sustainable Open-Vocabulary Segmentation , author=. Advances in Neural Information Processing Systems , year=

  41. [41]

    Advances in Neural Information Processing Systems , year=

    Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation , author=. Advances in Neural Information Processing Systems , year=

  42. [42]

    Advances in Neural Information Processing Systems , year=

    Open-Vocabulary Part Segmentation via Progressive and Boundary-Aware Strategy , author=. Advances in Neural Information Processing Systems , year=

  43. [43]

    Advances in Neural Information Processing Systems , year=

    Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers , author=. Advances in Neural Information Processing Systems , year=

  44. [44]

    Advances in Neural Information Processing Systems , year=

    OPMapper: Enhancing Open-Vocabulary Semantic Segmentation with Multi-Guidance Information , author=. Advances in Neural Information Processing Systems , year=

  45. [45]

    Advances in Neural Information Processing Systems , year=

    LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation , author=. Advances in Neural Information Processing Systems , year=

  46. [46]

    Advances in Neural Information Processing Systems , year=

    Continual Gaussian Mixture Distribution Modeling for Class Incremental Semantic Segmentation , author=. Advances in Neural Information Processing Systems , year=

  47. [47]

    Advances in Neural Information Processing Systems , year=

    Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation , author=. Advances in Neural Information Processing Systems , year=

  48. [48]

    Forty-second International Conference on Machine Learning , year=

    Divide and Conquer: Exploring Language-centric Tree Reasoning for Video Question-Answering , author=. Forty-second International Conference on Machine Learning , year=

  49. [49]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  50. [50]

    arXiv preprint arXiv:2507.00469 , year=

    Bisecle: Binding and Separation in Continual Learning for Video Language Understanding , author=. arXiv preprint arXiv:2507.00469 , year=

  51. [51]

    Proceedings of the 32nd ACM International Conference on Multimedia , pages=

    Gpt4video: A unified multimodal large language model for lnstruction-followed understanding and safety-aware generation , author=. Proceedings of the 32nd ACM International Conference on Multimedia , pages=

  52. [52]

    2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=

    Dam: Dynamic adapter merging for continual video qa learning , author=. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=. 2025 , organization=

  53. [53]

    arXiv preprint arXiv:2502.00843 , year=

    VLM-assisted continual learning for visual question answering in self-driving , author=. arXiv preprint arXiv:2502.00843 , year=

  54. [54]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    VQAGuider: Guiding Multimodal Large Language Models to Answer Complex Video Questions , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  55. [55]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Feature Decomposition-Recomposition in Large Vision-Language Model for Few-Shot Class-Incremental Learning , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  56. [56]

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    Longvu: Spatiotemporal adaptive compression for long video-language understanding , author=. arXiv preprint arXiv:2410.17434 , year=

  57. [57]

    arXiv preprint arXiv:2503.14963 , year=

    Continual multimodal contrastive learning , author=. arXiv preprint arXiv:2503.14963 , year=

  58. [58]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

    Learning without forgetting for vision-language models , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

  59. [59]

    International conference on machine learning , pages=

    Open-vclip: Transforming clip to an open-vocabulary video model via interpolated weight optimization , author=. International conference on machine learning , pages=. 2023 , organization=

  60. [60]

    Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.

  61. [61]

    BMU-MoCo: Bidirectional momentum update for continual video-language modeling. Advances in Neural Information Processing Systems.

  62. [62]

    ViLCo-Bench: Video language continual learning benchmark. Advances in Neural Information Processing Systems.

  63. [63]

    Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting. arXiv preprint arXiv:2508.04227.

  64. [64]

    Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning. Forty-second International Conference on Machine Learning.

  65. [65]

    Deep canonical correlation analysis. International Conference on Machine Learning, 2013.

  66. [66]

    LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv preprint arXiv:2110.08733.

  67. [67]

    Historical information-guided class-incremental semantic segmentation in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 2022.

  68. [68]

    Automated high-resolution earth observation image interpretation: Outcome of the 2020 Gaofen challenge. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021.

  69. [69]

    iSAID: A large-scale dataset for instance segmentation in aerial images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.

  70. [70]

    Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, Paris, France, August 22-27, 2010, Keynote, Invited and Contributed Papers, 2010.

  71. [71]

    ISPRS semantic labeling contest. ISPRS: Leopoldsh

  72. [72]

    DeepGlobe 2018: A challenge to parse the earth through satellite images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.

  73. [73]

    Dynamic Multi-Layer Null Space Projection for Vision-Language Continual Learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  74. [74]

    Context-aware vision-language model agent enriched with domain-specific ontology for construction site safety monitoring. Automation in Construction, 2025.

  75. [75]

    Overcoming Dual Drift for Continual Long-Tailed Visual Question Answering. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  76. [76]

    Pretrained language models as visual planners for human assistance. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  77. [77]

    Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models. Proceedings of the Nineteenth ACM Conference on Recommender Systems.

  78. [78]

    Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  79. [79]

    MaPLe: Multi-modal prompt learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  80. [80]

    Learning to prompt for vision-language models. International Journal of Computer Vision, 2022.

Showing first 80 references.