Recognition: no theorem link
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
Pith reviewed 2026-05-11 02:29 UTC · model grok-4.3
The pith
Reinforcement optimization on the vision encoder removes sensitive visual knowledge from VLMs without introducing object hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HFRU performs deep semantic removal of sensitive visual representations by operating directly on the vision encoder with GRPO-based optimization and a composite reward that includes an abstraction term to encourage valid substitutions, thereby achieving over 98 percent forgetting and retention performance on object and face tasks while introducing negligible object hallucination.
What carries the argument
GRPO-based reinforcement optimization applied to the vision encoder, directed by a composite reward function whose abstraction component promotes semantically valid substitutions in place of forbidden visual content.
If this is right
- The approach attains over 98 percent success in both forgetting specific sensitive objects or identities and retaining general model performance on recognition tasks.
- Object hallucination remains negligible, in contrast to prior methods that fine-tune only the language decoder.
- It outperforms existing unlearning techniques across object recognition and face identity benchmarks.
- Sensitive visual knowledge such as private images or biased content can be removed while maintaining overall utility of the VLM.
Where Pith is reading between the lines
- The same reward structure could be tested for unlearning textual knowledge by adapting it to the language decoder.
- This vision-encoder focus might extend to unlearning stylistic or copyrighted elements in image generation tasks.
- Scaling the method to larger VLMs would clarify whether performance holds without added training overhead.
- Integration with regulatory compliance workflows could become feasible if the forgetting proves robust across diverse data types.
Load-bearing premise
That reinforcement optimization of the vision encoder using the composite reward will produce deep removal of sensitive semantic representations without degrading general capabilities or creating new failure modes such as increased hallucinations.
What would settle it
Demonstrating that the vision encoder after HFRU still yields embeddings permitting accurate identification of the target sensitive objects or faces on downstream tasks, or that hallucination rates on unrelated prompts exceed those of the original model.
Figures
read the original abstract
Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods.Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HFRU, a reinforcement unlearning framework for vision-language models that operates directly on the vision encoder (rather than the language decoder) to achieve deep semantic removal of sensitive knowledge. It uses a two-stage approach combining alignment disruption with GRPO-based optimization driven by a composite reward that includes an abstraction reward to encourage semantically valid substitutions and reduce hallucinations. Experiments on object recognition and face identity tasks are reported to yield over 98% forgetting and retention performance with negligible object hallucination, outperforming prior methods; code is released.
Significance. If the central claims hold under broader evaluation, the work would advance machine unlearning for VLMs by addressing the superficial forgetting and hallucination problems of decoder-only fine-tuning through vision-encoder optimization and a hallucination-mitigating reward. The release of code supports reproducibility.
major comments (2)
- [Experiments] Experiments section: retention and forgetting metrics (and the 'negligible object hallucination' claim) are reported exclusively on the same object recognition and face identity tasks used for unlearning. No results appear on held-out general VLM benchmarks (e.g., VQA, captioning, or retrieval) that would test whether non-sensitive visual features and overall model capabilities remain intact, leaving the claim of deep semantic removal without capability degradation unverified.
- [Abstract] Abstract and Experiments: the abstract asserts >98% forgetting/retention and 'significantly outperforming prior methods' with 'negligible' hallucination, yet supplies no definition of the exact metrics, baselines, statistical significance tests, or ablation studies on the composite reward components (including the abstraction reward weight). This absence makes it impossible to assess whether the quantitative outcomes support the central claim of hallucination-free deep unlearning.
minor comments (1)
- [Abstract] Abstract: missing space between sentences ('methods.Our code').
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications from the manuscript and outlining targeted revisions to improve clarity and strengthen the empirical support for our claims.
read point-by-point responses
-
Referee: [Experiments] Experiments section: retention and forgetting metrics (and the 'negligible object hallucination' claim) are reported exclusively on the same object recognition and face identity tasks used for unlearning. No results appear on held-out general VLM benchmarks (e.g., VQA, captioning, or retrieval) that would test whether non-sensitive visual features and overall model capabilities remain intact, leaving the claim of deep semantic removal without capability degradation unverified.
Authors: We agree that evaluation on additional held-out general VLM benchmarks would provide stronger evidence for preserved non-sensitive capabilities. Our retention metrics are specifically constructed to measure preservation of non-sensitive visual features within the evaluated domains, and the negligible hallucination is quantified via standard object hallucination benchmarks on the same tasks. To directly address this concern, we will incorporate results on held-out VQA, captioning, and retrieval benchmarks in the revised experiments section, demonstrating that overall model performance remains intact outside the unlearning targets. revision: yes
-
Referee: [Abstract] Abstract and Experiments: the abstract asserts >98% forgetting/retention and 'significantly outperforming prior methods' with 'negligible' hallucination, yet supplies no definition of the exact metrics, baselines, statistical significance tests, or ablation studies on the composite reward components (including the abstraction reward weight). This absence makes it impossible to assess whether the quantitative outcomes support the central claim of hallucination-free deep unlearning.
Authors: The forgetting and retention metrics are formally defined in Section 3.2, the baselines and comparisons appear in Tables 1–2 of the experiments, and the composite reward (including the abstraction component) is detailed in Section 3.3 with the weight set to 0.3. However, we acknowledge that these elements could be more explicitly restated for readers. In the revision we will (i) add concise metric definitions to the abstract and experiments, (ii) report statistical significance for the >98% results, and (iii) include a dedicated ablation table on reward components (with explicit variation of the abstraction reward weight) in the main text rather than only the appendix. These changes will make the quantitative support for hallucination-free unlearning fully transparent. revision: partial
Circularity Check
No circularity: empirical validation of proposed unlearning method
full rationale
The paper introduces HFRU as a two-stage reinforcement unlearning approach operating on the vision encoder with a composite reward including an abstraction term. All reported results (>98% forgetting/retention, negligible hallucination) are direct empirical measurements on object recognition and face identity tasks, not quantities derived from or fitted to the method's own definitions. No equations, predictions, or uniqueness theorems are presented that reduce to self-citation chains, ansatzes, or renamed inputs. The work is self-contained as an experimental proposal with external benchmarks for comparison, satisfying the criteria for zero circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- composite reward weights
Reference graph
Works this paper leans on
-
[1]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman
URL https: //openreview.net/forum?id=6Ofb0cGXb5. Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE,
work page 2018
-
[3]
doi: 10.52202/079017-0850. URL https://proceedings.neurips.cc/paper_files/paper/ 2024/file/2f8ee6a3d766b426d2618e555b5aeb39-Paper-Conference.pdf. Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of large language models: A survey.ACM Computing Surveys, 57(6):1–39,
-
[4]
UNDIAL: Self-distillation with adjusted logits for robust unlearning in large language models
Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, and Ivan Vuli ´c. UNDIAL: Self-distillation with adjusted logits for robust unlearning in large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Huma...
work page 2025
-
[5]
Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025. naacl-long.444. URLhttps://aclanthology.org/2025.naacl-long.444/. Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. Simplic- ity prevails: Rethinking negative preference optimization for LLM unlearning. InNeurips Safe Gen- era...
-
[6]
Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee
URL https: //rlhfbook.com. Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee. Do language models plagiarize? In Proceedings of the ACM Web Conference 2023, pages 3637–3647,
work page 2023
-
[7]
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-V oss, Cort B. Breuer, Andy ...
work page 2023
-
[8]
Yujie Lin, Kunquan Li, Yixuan Liao, Xiaoxin Chen, and Jinsong Su. Bi-directional bias attribution: Debiasing large language models without modifying prompts.arXiv preprint arXiv:2602.04398,
-
[9]
URLhttps://proceedings.mlr.press/v199/liu22a.html. Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models.arXiv preprint arXiv:2402.00253, 2024a. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instructio...
-
[10]
Minglai Shao, Dong Li, Chen Zhao, Xintao Wu, Yujie Lin, and Qin Tian. Supervised algorithmic fairness in distribution shifts: A survey.arXiv preprint arXiv:2402.01327, 2024a. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathemat- ical reason...
-
[11]
Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Aligning large multimodal models with factually augmented rlhf
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110,
work page 2024
-
[13]
How large language models are transforming machine-paraphrase plagiarism
Jan Philip Wahle, Terry Ruas, Frederic Kirstein, and Bela Gipp. How large language models are transforming machine-paraphrase plagiarism. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 952–963,
work page 2022
-
[14]
URLhttps://x.ai/blog/grok-1.5v. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, K...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.52202/079017-3346 2024
-
[15]
URL https://arxiv.org/abs/2308.06721. Xiang Yue, Yuansheng Ni, Tianyu Zheng, Kai Zhang, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal underst...
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
doi: 10.1109/CVPR52733.2024.00913. Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. LMMs-eval: Reality check on the evaluation of large multimodal models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Lin...
-
[17]
LMMs-eval: Reality check on the evaluation of large multimodal models
Association for Computational Linguistics. ISBN 979-8- 89176-195-7. doi: 10.18653/v1/2025.findings-naacl.51. URL https://aclanthology.org/ 2025.findings-naacl.51/. Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. InFirst Conference on Language Modeling,
-
[18]
Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization,
URL https: //openreview.net/forum?id=MXLBXjQkmb. Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization.arXiv preprint arXiv:2311.16839,
-
[19]
doi: 10.18653/v1/2024.acl-demos.38
Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-demos.38. URL https://aclanthology. org/2024.acl-demos.38/. 13 A Visualization Method IP-Adapter [Ye et al., 2023] introduces image prompts into diffusion models, enabling them to generate images that are semantically aligned with a reference image. Inspired by this idea, we aim to visua...
-
[20]
Therefore, their numerators remain unchanged: πcomp(yh |x, p) = πref(yh |x, p) exp Rpen(yh) β Zcomp(x, p) .(19) SinceZ comp(x, p)> Z pen(x, p), we obtain πcomp(yh |x, p)< π pen(yh |x, p),∀(x, p)∈ D f .(20) Summing over all hallucinated sequences yields Phallu(πcomp |x, p)< P hallu(πpen |x, p),∀(x, p)∈ D f .(21) Taking expectation over the forget setD f co...
work page 2017
-
[21]
C.2 Training Details In this section, we provide detailed training configurations. For the cold-start stage, we employ the LlamaFactory framework [Zheng et al., 2024] for supervised fine-tuning (SFT). Subsequently, the 15 Table 5: Detailed training hyperparameters for different models. Model Dataset Stage lr Training Module Epochλ 1 λ2 β Qwen2.5-VL-3B-Ins...
work page 2024
-
[22]
I’m sorry, but I’m unable to identify the person in the image
Notably, during the cold-start SFT of Qwen3-VL-4B-Instruct on theVGGFace2 dataset, we performed full-parameter fine-tuning on the entire model, whereas in all other experimental setups, only the vision module was fine-tuned. This specific design was motivated by the observation that Qwen3- VL-4B-Instruct exhibits a hallucination tendency in face recogniti...
work page 2025
-
[23]
Table 7: General Capability Evaluation Benchmark Statistics. Benchmark Split Size Label Metric MMStar [Chen et al., 2024] mmstar 1500 vision-indispensable, VQA average OCRBench [Liu et al., 2024c] ocrbench 1000 OCR, VQA ocrbench_accuracy MMMU [Yue et al., 2024] mmmu_val 900 college-level reasoning, VQA mmmu_acc RealWorldQA [xAI, 2024] realworldqa 765 real...
work page 2024
-
[24]
tasks show that our method consistently outperforms baselines, mirroring the trends observed with Qwen2.5-VL-3B-Instruct. This confirms our method’s dual capacity to induce targeted forgetting and preserve general knowledge, ensuring robustness across both in-distribution and OOD samples. Sample of Reward Hacking in VGGFace2 The person in the image is Jea...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.