arxiv: 2605.08031 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

Kaidi Jia , Yujie Lin , Chengyi Yang , Jiayao Ma , Jinsong Su

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords machine unlearningvision-language modelsobject hallucinationreinforcement learningvision encoderprivacy protectionGRPO optimizationsemantic removal

0 comments

The pith

Reinforcement optimization on the vision encoder removes sensitive visual knowledge from VLMs without introducing object hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current unlearning methods for vision-language models adjust only the language decoder, which leaves visual representations largely intact and often triggers object hallucinations where the model fabricates details. The paper introduces HFRU to shift unlearning to the vision encoder through a two-stage process of alignment disruption followed by GRPO reinforcement optimization. A composite reward guides this process, with an abstraction component that steers the model toward semantically valid substitutions rather than the sensitive content. If the approach holds, models could comply with privacy or copyright requests by erasing specific visual concepts at a deep level while preserving overall accuracy and avoiding new errors. Experiments on object recognition and face identity tasks support this by showing strong forgetting paired with high retention and minimal hallucination.

Core claim

HFRU performs deep semantic removal of sensitive visual representations by operating directly on the vision encoder with GRPO-based optimization and a composite reward that includes an abstraction term to encourage valid substitutions, thereby achieving over 98 percent forgetting and retention performance on object and face tasks while introducing negligible object hallucination.

What carries the argument

GRPO-based reinforcement optimization applied to the vision encoder, directed by a composite reward function whose abstraction component promotes semantically valid substitutions in place of forbidden visual content.

If this is right

The approach attains over 98 percent success in both forgetting specific sensitive objects or identities and retaining general model performance on recognition tasks.
Object hallucination remains negligible, in contrast to prior methods that fine-tune only the language decoder.
It outperforms existing unlearning techniques across object recognition and face identity benchmarks.
Sensitive visual knowledge such as private images or biased content can be removed while maintaining overall utility of the VLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward structure could be tested for unlearning textual knowledge by adapting it to the language decoder.
This vision-encoder focus might extend to unlearning stylistic or copyrighted elements in image generation tasks.
Scaling the method to larger VLMs would clarify whether performance holds without added training overhead.
Integration with regulatory compliance workflows could become feasible if the forgetting proves robust across diverse data types.

Load-bearing premise

That reinforcement optimization of the vision encoder using the composite reward will produce deep removal of sensitive semantic representations without degrading general capabilities or creating new failure modes such as increased hallucinations.

What would settle it

Demonstrating that the vision encoder after HFRU still yields embeddings permitting accurate identification of the target sensitive objects or faces on downstream tasks, or that hallucination rates on unrelated prompts exceed those of the original model.

Figures

Figures reproduced from arXiv: 2605.08031 by Chengyi Yang, Jiayao Ma, Jinsong Su, Kaidi Jia, Yujie Lin.

**Figure 2.** Figure 2: Ablation study across datasets. vision encoder. By operating directly on visual-semantic representations, HFRU avoids superficial token-level suppression and achieves more reliable and robust unlearning across diverse settings. In addition, we conduct an ablation study on the reward design to verify the contribution of each reward component, with detailed results reported in Appendix D.2. 6 Related Work Ma… view at source ↗

read the original abstract

Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods.Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HFRU shifts unlearning to the vision encoder with GRPO and an abstraction reward to limit hallucinations, but all its strong forgetting numbers come only from the same narrow object and face tasks used for training.

read the letter

The paper's core move is to handle unlearning inside the vision encoder instead of the language decoder. They run a two-stage process: first break the alignment between image and text, then apply GRPO with a composite reward that adds an abstraction term meant to push the model toward valid semantic substitutions rather than blank or hallucinatory outputs. This is presented as a way to achieve deeper removal of sensitive visual features while keeping hallucinations low. They report over 98 percent forgetting and retention on object recognition and face identity tasks, claim negligible added hallucination, and say the method beats prior decoder-focused baselines. Releasing the code on GitHub is a concrete plus for anyone who wants to inspect or extend the implementation.

Referee Report

2 major / 1 minor

Summary. The paper proposes HFRU, a reinforcement unlearning framework for vision-language models that operates directly on the vision encoder (rather than the language decoder) to achieve deep semantic removal of sensitive knowledge. It uses a two-stage approach combining alignment disruption with GRPO-based optimization driven by a composite reward that includes an abstraction reward to encourage semantically valid substitutions and reduce hallucinations. Experiments on object recognition and face identity tasks are reported to yield over 98% forgetting and retention performance with negligible object hallucination, outperforming prior methods; code is released.

Significance. If the central claims hold under broader evaluation, the work would advance machine unlearning for VLMs by addressing the superficial forgetting and hallucination problems of decoder-only fine-tuning through vision-encoder optimization and a hallucination-mitigating reward. The release of code supports reproducibility.

major comments (2)

[Experiments] Experiments section: retention and forgetting metrics (and the 'negligible object hallucination' claim) are reported exclusively on the same object recognition and face identity tasks used for unlearning. No results appear on held-out general VLM benchmarks (e.g., VQA, captioning, or retrieval) that would test whether non-sensitive visual features and overall model capabilities remain intact, leaving the claim of deep semantic removal without capability degradation unverified.
[Abstract] Abstract and Experiments: the abstract asserts >98% forgetting/retention and 'significantly outperforming prior methods' with 'negligible' hallucination, yet supplies no definition of the exact metrics, baselines, statistical significance tests, or ablation studies on the composite reward components (including the abstraction reward weight). This absence makes it impossible to assess whether the quantitative outcomes support the central claim of hallucination-free deep unlearning.

minor comments (1)

[Abstract] Abstract: missing space between sentences ('methods.Our code').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications from the manuscript and outlining targeted revisions to improve clarity and strengthen the empirical support for our claims.

read point-by-point responses

Referee: [Experiments] Experiments section: retention and forgetting metrics (and the 'negligible object hallucination' claim) are reported exclusively on the same object recognition and face identity tasks used for unlearning. No results appear on held-out general VLM benchmarks (e.g., VQA, captioning, or retrieval) that would test whether non-sensitive visual features and overall model capabilities remain intact, leaving the claim of deep semantic removal without capability degradation unverified.

Authors: We agree that evaluation on additional held-out general VLM benchmarks would provide stronger evidence for preserved non-sensitive capabilities. Our retention metrics are specifically constructed to measure preservation of non-sensitive visual features within the evaluated domains, and the negligible hallucination is quantified via standard object hallucination benchmarks on the same tasks. To directly address this concern, we will incorporate results on held-out VQA, captioning, and retrieval benchmarks in the revised experiments section, demonstrating that overall model performance remains intact outside the unlearning targets. revision: yes
Referee: [Abstract] Abstract and Experiments: the abstract asserts >98% forgetting/retention and 'significantly outperforming prior methods' with 'negligible' hallucination, yet supplies no definition of the exact metrics, baselines, statistical significance tests, or ablation studies on the composite reward components (including the abstraction reward weight). This absence makes it impossible to assess whether the quantitative outcomes support the central claim of hallucination-free deep unlearning.

Authors: The forgetting and retention metrics are formally defined in Section 3.2, the baselines and comparisons appear in Tables 1–2 of the experiments, and the composite reward (including the abstraction component) is detailed in Section 3.3 with the weight set to 0.3. However, we acknowledge that these elements could be more explicitly restated for readers. In the revision we will (i) add concise metric definitions to the abstract and experiments, (ii) report statistical significance for the >98% results, and (iii) include a dedicated ablation table on reward components (with explicit variation of the abstraction reward weight) in the main text rather than only the appendix. These changes will make the quantitative support for hallucination-free unlearning fully transparent. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical validation of proposed unlearning method

full rationale

The paper introduces HFRU as a two-stage reinforcement unlearning approach operating on the vision encoder with a composite reward including an abstraction term. All reported results (>98% forgetting/retention, negligible hallucination) are direct empirical measurements on object recognition and face identity tasks, not quantities derived from or fitted to the method's own definitions. No equations, predictions, or uniqueness theorems are presented that reduce to self-citation chains, ansatzes, or renamed inputs. The work is self-contained as an experimental proposal with external benchmarks for comparison, satisfying the criteria for zero circularity.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The method rests on standard reinforcement-learning assumptions about reward design guiding behavior change; the abstract implies but does not enumerate specific reward-component weights or learning-rate choices.

free parameters (1)

composite reward weights
Weights balancing the abstraction reward against other terms are typically chosen or tuned during optimization and directly affect the reported forgetting-hallucination tradeoff.

pith-pipeline@v0.9.0 · 5449 in / 1142 out tokens · 76182 ms · 2026-05-11T02:29:19.286346+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 4 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman

URL https: //openreview.net/forum?id=6Ofb0cGXb5. Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE,

work page 2018
[3]

URL https://proceedings.neurips.cc/paper_files/paper/ 2024/file/2f8ee6a3d766b426d2618e555b5aeb39-Paper-Conference.pdf

doi: 10.52202/079017-0850. URL https://proceedings.neurips.cc/paper_files/paper/ 2024/file/2f8ee6a3d766b426d2618e555b5aeb39-Paper-Conference.pdf. Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of large language models: A survey.ACM Computing Surveys, 57(6):1–39,

work page doi:10.52202/079017-0850 2024
[4]

UNDIAL: Self-distillation with adjusted logits for robust unlearning in large language models

Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, and Ivan Vuli ´c. UNDIAL: Self-distillation with adjusted logits for robust unlearning in large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Huma...

work page 2025
[5]

In: NAACL (Long Papers)

Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025. naacl-long.444. URLhttps://aclanthology.org/2025.naacl-long.444/. Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplic- ity prevails: Rethinking negative preference optimization for LLM unlearning. InNeurips Safe Gen- era...

work page doi:10.18653/v1/2025 2025
[6]

Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee

URL https: //rlhfbook.com. Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee. Do language models plagiarize? In Proceedings of the ACM Web Conference 2023, pages 3637–3647,

work page 2023
[7]

Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-V oss, Cort B. Breuer, Andy ...

work page 2023
[8]

Bi-directional bias attribution: Debiasing large language models without modifying prompts.arXiv preprint arXiv:2602.04398,

Yujie Lin, Kunquan Li, Yixuan Liao, Xiaoxin Chen, and Jinsong Su. Bi-directional bias attribution: Debiasing large language models without modifying prompts.arXiv preprint arXiv:2602.04398,

work page arXiv
[9]

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai

URLhttps://proceedings.mlr.press/v199/liu22a.html. Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models.arXiv preprint arXiv:2402.00253, 2024a. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instructio...

work page doi:10.1007/s11432-024-4235-6 1919
[10]

Supervised algorithmic fairness in distribution shifts: A survey.arXiv preprint arXiv:2402.01327, 2024a

Minglai Shao, Dong Li, Chen Zhao, Xintao Wu, Yujie Lin, and Qin Tian. Supervised algorithmic fairness in distribution shifts: A survey.arXiv preprint arXiv:2402.01327, 2024a. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathemat- ical reason...

work page arXiv
[11]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Aligning large multimodal models with factually augmented rlhf

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110,

work page 2024
[13]

How large language models are transforming machine-paraphrase plagiarism

Jan Philip Wahle, Terry Ruas, Frederic Kirstein, and Bela Gipp. How large language models are transforming machine-paraphrase plagiarism. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 952–963,

work page 2022
[14]

URLhttps://x.ai/blog/grok-1.5v. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, K...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.52202/079017-3346 2024
[15]

URL https://arxiv.org/abs/2308.06721. Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal underst...

work page internal anchor Pith review Pith/arXiv arXiv
[16]

In: CVPR

doi: 10.1109/CVPR52733.2024.00913. Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-eval: Reality check on the evaluation of large multimodal models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Lin...

work page doi:10.1109/cvpr52733.2024.00913 2024
[17]

LMMs-eval: Reality check on the evaluation of large multimodal models

Association for Computational Linguistics. ISBN 979-8- 89176-195-7. doi: 10.18653/v1/2025.findings-naacl.51. URL https://aclanthology.org/ 2025.findings-naacl.51/. Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. InFirst Conference on Language Modeling,

work page doi:10.18653/v1/2025.findings-naacl.51 2025
[18]

Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization,

URL https: //openreview.net/forum?id=MXLBXjQkmb. Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization.arXiv preprint arXiv:2311.16839,

work page arXiv
[19]

doi: 10.18653/v1/2024.acl-demos.38

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-demos.38. URL https://aclanthology. org/2024.acl-demos.38/. 13 A Visualization Method IP-Adapter [Ye et al., 2023] introduces image prompts into diffusion models, enabling them to generate images that are semantically aligned with a reference image. Inspired by this idea, we aim to visua...

work page doi:10.18653/v1/2024.acl-demos.38 2024
[20]

Therefore, their numerators remain unchanged: πcomp(yh |x, p) = πref(yh |x, p) exp Rpen(yh) β Zcomp(x, p) .(19) SinceZ comp(x, p)> Z pen(x, p), we obtain πcomp(yh |x, p)< π pen(yh |x, p),∀(x, p)∈ D f .(20) Summing over all hallucinated sequences yields Phallu(πcomp |x, p)< P hallu(πpen |x, p),∀(x, p)∈ D f .(21) Taking expectation over the forget setD f co...

work page 2017
[21]

For the cold-start stage, we employ the LlamaFactory framework [Zheng et al., 2024] for supervised fine-tuning (SFT)

C.2 Training Details In this section, we provide detailed training configurations. For the cold-start stage, we employ the LlamaFactory framework [Zheng et al., 2024] for supervised fine-tuning (SFT). Subsequently, the 15 Table 5: Detailed training hyperparameters for different models. Model Dataset Stage lr Training Module Epochλ 1 λ2 β Qwen2.5-VL-3B-Ins...

work page 2024
[22]

I’m sorry, but I’m unable to identify the person in the image

Notably, during the cold-start SFT of Qwen3-VL-4B-Instruct on theVGGFace2 dataset, we performed full-parameter fine-tuning on the entire model, whereas in all other experimental setups, only the vision module was fine-tuned. This specific design was motivated by the observation that Qwen3- VL-4B-Instruct exhibits a hallucination tendency in face recogniti...

work page 2025
[23]

Table 7: General Capability Evaluation Benchmark Statistics. Benchmark Split Size Label Metric MMStar [Chen et al., 2024] mmstar 1500 vision-indispensable, VQA average OCRBench [Liu et al., 2024c] ocrbench 1000 OCR, VQA ocrbench_accuracy MMMU [Yue et al., 2024] mmmu_val 900 college-level reasoning, VQA mmmu_acc RealWorldQA [xAI, 2024] realworldqa 765 real...

work page 2024
[24]

Limitations

tasks show that our method consistently outperforms baselines, mirroring the trends observed with Qwen2.5-VL-3B-Instruct. This confirms our method’s dual capacity to induce targeted forgetting and preserve general knowledge, ensuring robustness across both in-distribution and OOD samples. Sample of Reward Hacking in VGGFace2 The person in the image is Jea...

work page arXiv 2022