CRANE: Knowledge Editing for Reasoning MLLMs

Han Huang; Hao Wang; Liang Wang; Mengqi Zhang; Qiang Liu; Shu Wu

arxiv: 2606.09033 · v1 · pith:ND5RHZ3Wnew · submitted 2026-06-08 · 💻 cs.CV · cs.CL

CRANE: Knowledge Editing for Reasoning MLLMs

Han Huang , Hao Wang , Mengqi Zhang , Shu Wu , Qiang Liu , Liang Wang This is my paper

Pith reviewed 2026-06-27 17:04 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords knowledge editingreasoning MLLMschain-of-thoughtmultimodal modelsretrieval-augmented editingcognitive routing reward

0 comments

The pith

CRANE uses retrieval plus a routing reward to edit facts in reasoning MLLMs without changing base weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning multimodal models that output explicit chain-of-thought steps resist standard knowledge editing because weight updates destroy the required format while non-updating methods fail to override visual evidence. The paper identifies structural collapse, cognitive dissonance, and shallow internalization as interacting failure modes on these models. CRANE counters them with a retrieval-augmented system that surfaces edit facts at inference time and trains the model via GRPO under a Cognitive Routing Reward to prefer the new fact over its original visual prior. This yields 96.9 percent grounded success on conflict cases and strong results on multi-hop and out-of-distribution tests.

Core claim

CRANE is a retrieval-augmented framework that requires no per-edit parameter modification to the base MLLM. It pairs a modality-aware dual-library retrieval system with a two-phase process of supervised fine-tuning for structural initialization followed by GRPO optimization under a Cognitive Routing Reward that teaches the model to arbitrate between visual priors and injected edit facts. On ReasonEdit-Bench this produces 96.9 percent Grounded Success on conflict scenarios, 96.9 percent intermediate entity usage in multi-hop chains, 97.6 percent text-locality and 68.1 percent image-locality edit independence, plus 87.0 percent on the out-of-distribution MMEVOKE benchmark under gold retrieval.

What carries the argument

The Cognitive Routing Reward inside the GRPO training phase, which trains the model to select the retrieved edit fact over conflicting visual evidence.

If this is right

Knowledge edits can be applied to reasoning MLLMs while preserving explicit chain-of-thought format.
The model incorporates the edit at intermediate steps of multi-hop reasoning chains.
Edit independence is maintained separately for text and image queries.
Performance carries to out-of-distribution benchmarks when retrieval supplies the correct fact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If retrieval accuracy drops, grounded success on conflict scenarios would fall sharply.
The same routing reward might improve editing success on other modalities or longer reasoning traces.
Stronger retrieval backends could raise the current image-locality score without further training changes.

Load-bearing premise

The dual-library retrieval will always surface the correct edit fact and the Cognitive Routing Reward will reliably train the model to prefer that fact over visual evidence without any base-model weight updates.

What would settle it

A test where the retrieval system returns an incorrect fact or where the trained model continues to reject the edit in its reasoning chain despite the reward signal.

Figures

Figures reproduced from arXiv: 2606.09033 by Han Huang, Hao Wang, Liang Wang, Mengqi Zhang, Qiang Liu, Shu Wu.

**Figure 1.** Figure 1: The Spurious Success problem. (a) Teacher-forcing feeds target tokens as context; the model never generates freely and its <think> chain is never invoked, yielding TF-Acc ≈ 100%. (b) Under free generation, the <think> trace explicitly rejects the edit fact based on visual evidence. Our LLM Judge inspects the full chain and reports GS = 0%. that directly output answers, not for models that generate an expli… view at source ↗

**Figure 2.** Figure 2: The Metric Gap between TF-Acc and GS on the VLKEB eval set [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of ReasonEdit-Bench and CRANE. Top: three pipelines filter, stratify, and expand raw VLKEB data; the CoT-aware evaluation protocol [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Shallow Internalization (GRACE + Vision-R1). Same question, same [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Multi-hop reasoning: IU (edit knowledge propagated, higher is [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear successful under traditional metrics (teacher-forcing accuracy up to 100%) can fail severely when the model's reasoning process is examined (Grounded Success as low as 0%). We identify three failure modes: (1) Structural Collapse, where weight-modifying methods destroy the CoT format; (2) Cognitive Dissonance, where the model's reasoning chain actively rejects the injected edit fact based on visual evidence; and (3) Shallow Internalization, where methods succeed on exact queries but fail on rephrase or multi-hop variants. On reasoning MLLMs, these modes interact: methods that generalize (FT, LoRA) trigger format collapse, while methods without deep modification cannot generalize. To expose these failures, we propose a CoT-aware evaluation protocol and construct ReasonEdit-Bench, with conflict stratification, multi-level probes, and multi-hop portability tests. We propose CRANE, a retrieval-augmented framework that requires no per-edit parameter modification. CRANE combines a modality-aware dual-library retrieval system with a two-phase training strategy: Supervised Fine-Tuning (SFT) for structural initialization, followed by GRPO with a Cognitive Routing Reward that trains the model to arbitrate between visual priors and injected edit facts. On ReasonEdit-Bench, CRANE achieves 96.9% Grounded Success on conflict scenarios and 96.9% intermediate entity usage in multi-hop chains, with 97.6% text-locality and 68.1% image-locality Edit Independence. On the out-of-distribution MMEVOKE benchmark, CRANE reaches 87.0% under gold retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CRANE shows that retrieval plus targeted training can preserve CoT format during knowledge edits in reasoning MLLMs, but the OOD gains rest on gold retrieval and the abstract gives almost no experimental detail.

read the letter

The core claim is that existing editing methods destroy reasoning chains in MLLMs through format collapse or rejection of the edit, and CRANE sidesteps this with modality-aware retrieval plus SFT followed by GRPO using a routing reward. That framing is new relative to the cited editing literature.

The paper does a useful job naming the three failure modes and building ReasonEdit-Bench with conflict cases, multi-hop probes, and locality checks. The idea of training the model to arbitrate between visual evidence and an injected fact without touching base weights is a reasonable response to the observed problems.

The soft spots are straightforward. The 87% MMEVOKE number is reported only under gold retrieval, so any real-world miss in the dual-library step leaves the model with no fact to route to. Image locality sits at 68%, which undercuts the independence claim. The abstract states performance numbers but supplies no baselines, statistical tests, or verification that the three failure modes were actually measured rather than asserted. Without those, the soundness is hard to judge.

This is for groups already working on multimodal editing or CoT reliability. It deserves a serious referee to see the full methods, benchmark construction, and whether the retrieval assumption holds in the experiments.

Referee Report

1 major / 2 minor

Summary. The paper identifies three failure modes (Structural Collapse, Cognitive Dissonance, Shallow Internalization) that arise when applying knowledge editing to reasoning MLLMs, introduces ReasonEdit-Bench with conflict stratification and multi-hop tests, and proposes CRANE: a retrieval-augmented framework using modality-aware dual-library retrieval, SFT followed by GRPO with a Cognitive Routing Reward, that achieves high Grounded Success and Edit Independence without any per-edit parameter updates to the base MLLM. Results include 96.9% Grounded Success on conflict scenarios and 87.0% on out-of-distribution MMEVOKE under gold retrieval.

Significance. If the central empirical claims hold under non-oracle retrieval, the work would be significant for demonstrating a parameter-free editing approach that preserves CoT structure and arbitrates between visual priors and injected facts via reward-based training. The CoT-aware evaluation protocol and stratified benchmark construction are concrete contributions that could be adopted more broadly.

major comments (1)

[Abstract] Abstract: the 87.0% OOD result on MMEVOKE is explicitly conditioned on 'gold retrieval.' This assumption is load-bearing for the no-per-edit-update claim, because the Cognitive Routing Reward has no injected fact to prefer when the dual-library system fails to surface the edit (a realistic risk in conflict or multi-hop cases). The manuscript must either report retrieval success rates on the same splits or provide an ablation without gold retrieval to substantiate the arbitration mechanism.

minor comments (2)

The abstract states performance numbers (96.9%, 97.6%, 68.1%) but provides no experimental details, baselines, or statistical tests; these must be fully specified in the methods and results sections with clear protocol descriptions.
Ensure the three failure modes are operationally defined with concrete metrics (e.g., how Structural Collapse is quantified via CoT format deviation) rather than left at the level of the abstract description.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the conditioning of the OOD result and its implications for the parameter-free editing claim. We address the concern directly below and will revise the manuscript to provide the requested evidence.

read point-by-point responses

Referee: [Abstract] Abstract: the 87.0% OOD result on MMEVOKE is explicitly conditioned on 'gold retrieval.' This assumption is load-bearing for the no-per-edit-update claim, because the Cognitive Routing Reward has no injected fact to prefer when the dual-library system fails to surface the edit (a realistic risk in conflict or multi-hop cases). The manuscript must either report retrieval success rates on the same splits or provide an ablation without gold retrieval to substantiate the arbitration mechanism.

Authors: We agree that the 87.0% MMEVOKE figure is reported under gold retrieval, as already stated in the abstract. The primary empirical claims rest on ReasonEdit-Bench, where the complete CRANE pipeline (modality-aware dual-library retrieval + SFT + GRPO with Cognitive Routing Reward) is evaluated end-to-end and achieves 96.9% Grounded Success on conflict edits without any per-edit parameter updates. MMEVOKE is presented strictly as an out-of-distribution sanity check under idealized retrieval. To directly address the load-bearing concern, the revised manuscript will include retrieval success rates of the dual-library system on the identical MMEVOKE splits used for the 87.0% figure. This will quantify how often the injected facts are surfaced and thereby substantiate that the arbitration mechanism operates under the framework's actual retrieval conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks with no self-referential derivations

full rationale

The paper describes an empirical framework (modality-aware retrieval + SFT/GRPO training with a routing reward) and reports measured performance numbers on ReasonEdit-Bench and MMEVOKE. No equations, uniqueness theorems, or first-principles derivations are presented that reduce to fitted inputs or self-citations by construction. All central claims rest on observed success rates under stated evaluation conditions rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; full paper required to audit any hidden modeling choices such as retrieval thresholds or reward scaling.

pith-pipeline@v0.9.1-grok · 5870 in / 1148 out tokens · 16884 ms · 2026-06-27T17:04:47.021433+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 9 canonical work pages · 9 internal anchors

[1]

Locating and editing factual associations in gpt,

K. Meng, D. Bau, A. Andonian, and Y . Belinkov, “Locating and editing factual associations in gpt,”Advances in neural information processing systems, vol. 35, pp. 17 359–17 372, 2022

2022
[2]

Fast model editing at scale,

E. Mitchell, C. Lin, A. Bosselut, C. Finn, and C. D. Manning, “Fast model editing at scale,” inInternational Conference on Learning Rep- resentations, 2022

2022
[3]

OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carneyet al., “Openai o1 system card,”arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” Iclr, vol. 1, no. 2, p. 3, 2022

2022
[6]

Wise: Rethinking the knowledge memory for lifelong model editing of large language models,

P. Wang, Z. Li, N. Zhang, Z. Xu, Y . Yao, Y . Jiang, P. Xie, F. Huang, and H. Chen, “Wise: Rethinking the knowledge memory for lifelong model editing of large language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 53 764–53 797, 2024

2024
[7]

Aging with grace: Lifelong model editing with discrete key-value adaptors,

T. Hartvigsen, S. Sankaranarayanan, H. Palangi, Y . Kim, and M. Ghas- semi, “Aging with grace: Lifelong model editing with discrete key-value adaptors,”Advances in Neural Information Processing Systems, vol. 36, pp. 47 934–47 959, 2023

2023
[8]

The mirage of model editing: Revisiting evaluation in the wild,

W. Yang, F. Sun, J. Tan, X. Ma, Q. Cao, D. Yin, H. Shen, and X. Cheng, “The mirage of model editing: Revisiting evaluation in the wild,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 15 336– 15 354

2025
[9]

Vlkeb: A large vision-language model knowledge editing benchmark,

H. Huang, H. Zhong, T. Yu, Q. Liu, S. Wu, L. Wang, and T. Tan, “Vlkeb: A large vision-language model knowledge editing benchmark,” Advances in Neural Information Processing Systems, vol. 37, pp. 9257– 9280, 2024

2024
[10]

When large multimodal models confront evolving knowledge: Challenges and explorations,

K. Jiang, Y . Du, Y . Ding, Y . Ren, N. Jiang, Z. Gao, Z. Zheng, L. Liu, B. Li, and Q. Li, “When large multimodal models confront evolving knowledge: Challenges and explorations,” inThe Fourteenth International Conference on Learning Representations, 2026

2026
[11]

Mass-editing memory in a transformer,

K. Meng, A. S. Sharma, A. J. Andonian, Y . Belinkov, and D. Bau, “Mass-editing memory in a transformer,” inThe Eleventh International Conference on Learning Representations, 2023

2023
[12]

Can we edit factual knowledge by in-context learning?

C. Zheng, L. Li, Q. Dong, Y . Fan, Z. Wu, J. Xu, and B. Chang, “Can we edit factual knowledge by in-context learning?” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 4862–4876

2023
[13]

Can we edit multimodal large language models?

S. Cheng, B. Tian, Q. Liu, X. Chen, Y . Wang, H. Chen, and N. Zhang, “Can we edit multimodal large language models?” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 13 877–13 888

2023
[14]

Lifelong knowledge editing for vision language models with low-rank mixture-of- experts,

Q. Chen, C. Wang, D. Wang, T. Zhang, W. Li, and X. He, “Lifelong knowledge editing for vision language models with low-rank mixture-of- experts,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 9455–9466

2025
[15]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wuet al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

Visual-rft: Visual reinforcement fine-tuning,

Z. Liu, Z. Sun, Y . Zang, X. Dong, Y . Cao, H. Duan, D. Lin, and J. Wang, “Visual-rft: Visual reinforcement fine-tuning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 2034–2044

2025
[18]

R1-v: Reinforcing super generalization ability in vision-language models with less than $3,

L. Chen, L. Li, H. Zhao, Y . Song, and Vinci, “R1-v: Reinforcing super generalization ability in vision-language models with less than $3,” https: //github.com/Deep-Agent/R1-V, 2025, accessed: 2025-02-02

2025
[19]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, X. Tang, Y . Hu, and S. Lin, “Vision-r1: Incentivizing reasoning capability in multimodal large language models,”arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Expllm: Towards chain of thought for facial expression recognition,

X. Lan, J. Xue, J. Qi, D. Jiang, K. Lu, and T.-S. Chua, “Expllm: Towards chain of thought for facial expression recognition,”IEEE Transactions on Multimedia, 2025

2025
[21]

Visual context and commonsense- guided causal chain-of-thoughts for visual commonsense reasoning,

X. Li, J. Zhao, T. Wei, and S. Sun, “Visual context and commonsense- guided causal chain-of-thoughts for visual commonsense reasoning,” IEEE Transactions on Multimedia, 2026

2026
[22]

Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting

J. Fu, Y . Bai, L. He, Y . Lou, Y . Zhao, L. Sun, and S. Su, “Learning to edit knowledge via instruction-based chain-of-thought prompting,”arXiv preprint arXiv:2604.05540, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Object hallucination in image captioning,

A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko, “Object hallucination in image captioning,” inProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4035–4045

2018
[24]

Evaluating object hallucination in large vision-language models,

Y . Li, Y . Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 292–305

2023
[25]

Mitigating hallucinations in large vision-language models via visual-enhanced contrastive decoding,

P. Qiang, H. Tan, H. Zhang, X. Li, R. Li, and J. Liang, “Mitigating hallucinations in large vision-language models via visual-enhanced contrastive decoding,”IEEE Transactions on Multimedia, 2026

2026
[26]

Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback,

T. Yu, Y . Yao, H. Zhang, T. He, Y . Han, G. Cui, J. Hu, Z. Liu, H.-T. Zheng, M. Sunet al., “Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 807–13 816

2024
[27]

Gemini: A Family of Highly Capable Multimodal Models

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millicanet al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Deep reversible consistency learning for cross-modal retrieval,

R. Pu, Y . Qin, D. Peng, X. Song, and H. Zheng, “Deep reversible consistency learning for cross-modal retrieval,”IEEE Transactions on Multimedia, 2025

2025
[29]

Billion-scale similarity search with gpus,

J. Johnson, M. Douze, and H. J ´egou, “Billion-scale similarity search with gpus,”IEEE transactions on big data, vol. 7, no. 3, pp. 535–547, 2019

2019
[30]

Sigmoid loss for language image pre-training,

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 11 975–11 986

2023
[31]

Representation Learning with Contrastive Predictive Coding

A. v. d. Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Geet al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Improved baselines with visual instruction tuning,

H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 26 296–26 306

2024

[1] [1]

Locating and editing factual associations in gpt,

K. Meng, D. Bau, A. Andonian, and Y . Belinkov, “Locating and editing factual associations in gpt,”Advances in neural information processing systems, vol. 35, pp. 17 359–17 372, 2022

2022

[2] [2]

Fast model editing at scale,

E. Mitchell, C. Lin, A. Bosselut, C. Finn, and C. D. Manning, “Fast model editing at scale,” inInternational Conference on Learning Rep- resentations, 2022

2022

[3] [3]

OpenAI o1 System Card

A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carneyet al., “Openai o1 system card,”arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025. 10

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” Iclr, vol. 1, no. 2, p. 3, 2022

2022

[6] [6]

Wise: Rethinking the knowledge memory for lifelong model editing of large language models,

P. Wang, Z. Li, N. Zhang, Z. Xu, Y . Yao, Y . Jiang, P. Xie, F. Huang, and H. Chen, “Wise: Rethinking the knowledge memory for lifelong model editing of large language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 53 764–53 797, 2024

2024

[7] [7]

Aging with grace: Lifelong model editing with discrete key-value adaptors,

T. Hartvigsen, S. Sankaranarayanan, H. Palangi, Y . Kim, and M. Ghas- semi, “Aging with grace: Lifelong model editing with discrete key-value adaptors,”Advances in Neural Information Processing Systems, vol. 36, pp. 47 934–47 959, 2023

2023

[8] [8]

The mirage of model editing: Revisiting evaluation in the wild,

W. Yang, F. Sun, J. Tan, X. Ma, Q. Cao, D. Yin, H. Shen, and X. Cheng, “The mirage of model editing: Revisiting evaluation in the wild,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 15 336– 15 354

2025

[9] [9]

Vlkeb: A large vision-language model knowledge editing benchmark,

H. Huang, H. Zhong, T. Yu, Q. Liu, S. Wu, L. Wang, and T. Tan, “Vlkeb: A large vision-language model knowledge editing benchmark,” Advances in Neural Information Processing Systems, vol. 37, pp. 9257– 9280, 2024

2024

[10] [10]

When large multimodal models confront evolving knowledge: Challenges and explorations,

K. Jiang, Y . Du, Y . Ding, Y . Ren, N. Jiang, Z. Gao, Z. Zheng, L. Liu, B. Li, and Q. Li, “When large multimodal models confront evolving knowledge: Challenges and explorations,” inThe Fourteenth International Conference on Learning Representations, 2026

2026

[11] [11]

Mass-editing memory in a transformer,

K. Meng, A. S. Sharma, A. J. Andonian, Y . Belinkov, and D. Bau, “Mass-editing memory in a transformer,” inThe Eleventh International Conference on Learning Representations, 2023

2023

[12] [12]

Can we edit factual knowledge by in-context learning?

C. Zheng, L. Li, Q. Dong, Y . Fan, Z. Wu, J. Xu, and B. Chang, “Can we edit factual knowledge by in-context learning?” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 4862–4876

2023

[13] [13]

Can we edit multimodal large language models?

S. Cheng, B. Tian, Q. Liu, X. Chen, Y . Wang, H. Chen, and N. Zhang, “Can we edit multimodal large language models?” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 13 877–13 888

2023

[14] [14]

Lifelong knowledge editing for vision language models with low-rank mixture-of- experts,

Q. Chen, C. Wang, D. Wang, T. Zhang, W. Li, and X. He, “Lifelong knowledge editing for vision language models with low-rank mixture-of- experts,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 9455–9466

2025

[15] [15]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wuet al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,”arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[17] [17]

Visual-rft: Visual reinforcement fine-tuning,

Z. Liu, Z. Sun, Y . Zang, X. Dong, Y . Cao, H. Duan, D. Lin, and J. Wang, “Visual-rft: Visual reinforcement fine-tuning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 2034–2044

2025

[18] [18]

R1-v: Reinforcing super generalization ability in vision-language models with less than $3,

L. Chen, L. Li, H. Zhao, Y . Song, and Vinci, “R1-v: Reinforcing super generalization ability in vision-language models with less than $3,” https: //github.com/Deep-Agent/R1-V, 2025, accessed: 2025-02-02

2025

[19] [19]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, X. Tang, Y . Hu, and S. Lin, “Vision-r1: Incentivizing reasoning capability in multimodal large language models,”arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Expllm: Towards chain of thought for facial expression recognition,

X. Lan, J. Xue, J. Qi, D. Jiang, K. Lu, and T.-S. Chua, “Expllm: Towards chain of thought for facial expression recognition,”IEEE Transactions on Multimedia, 2025

2025

[21] [21]

Visual context and commonsense- guided causal chain-of-thoughts for visual commonsense reasoning,

X. Li, J. Zhao, T. Wei, and S. Sun, “Visual context and commonsense- guided causal chain-of-thoughts for visual commonsense reasoning,” IEEE Transactions on Multimedia, 2026

2026

[22] [22]

Learning to Edit Knowledge via Instruction-based Chain-of-Thought Prompting

J. Fu, Y . Bai, L. He, Y . Lou, Y . Zhao, L. Sun, and S. Su, “Learning to edit knowledge via instruction-based chain-of-thought prompting,”arXiv preprint arXiv:2604.05540, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Object hallucination in image captioning,

A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko, “Object hallucination in image captioning,” inProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 4035–4045

2018

[24] [24]

Evaluating object hallucination in large vision-language models,

Y . Li, Y . Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen, “Evaluating object hallucination in large vision-language models,” inProceedings of the 2023 conference on empirical methods in natural language processing, 2023, pp. 292–305

2023

[25] [25]

Mitigating hallucinations in large vision-language models via visual-enhanced contrastive decoding,

P. Qiang, H. Tan, H. Zhang, X. Li, R. Li, and J. Liang, “Mitigating hallucinations in large vision-language models via visual-enhanced contrastive decoding,”IEEE Transactions on Multimedia, 2026

2026

[26] [26]

Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback,

T. Yu, Y . Yao, H. Zhang, T. He, Y . Han, G. Cui, J. Hu, Z. Liu, H.-T. Zheng, M. Sunet al., “Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 807–13 816

2024

[27] [27]

Gemini: A Family of Highly Capable Multimodal Models

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millicanet al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

Deep reversible consistency learning for cross-modal retrieval,

R. Pu, Y . Qin, D. Peng, X. Song, and H. Zheng, “Deep reversible consistency learning for cross-modal retrieval,”IEEE Transactions on Multimedia, 2025

2025

[29] [29]

Billion-scale similarity search with gpus,

J. Johnson, M. Douze, and H. J ´egou, “Billion-scale similarity search with gpus,”IEEE transactions on big data, vol. 7, no. 3, pp. 535–547, 2019

2019

[30] [30]

Sigmoid loss for language image pre-training,

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 11 975–11 986

2023

[31] [31]

Representation Learning with Contrastive Predictive Coding

A. v. d. Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[32] [32]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Geet al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

Improved baselines with visual instruction tuning,

H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 26 296–26 306

2024