pith. sign in

arxiv: 2606.29812 · v1 · pith:NIILHWWGnew · submitted 2026-06-29 · 💻 cs.CV

Consistency as Inductive Bias: Learning Cross-View Invariance for Robust Multimodal Reasoning

Pith reviewed 2026-06-30 06:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords cross-view consistencyinductive biasmultimodal reasoningreinforcement learningMLLMsRLVRGRPOdata augmentation
0
0 comments X

The pith

ConsistRoll injects cross-view consistency into RLVR by jointly rewarding only correct and consistent completions from original and transformed views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RLVR assigns rewards independently to each visual input, so data augmentation provides little additional signal once within-view rewards saturate. ConsistRoll reuses the group-sampling step of GRPO to place an original view and its semantically invariant transformed view inside the same generation group, then grants a joint reward only when both completions are correct and agree. This supplies an online credit-assignment signal for invariance. The resulting cross-view correction term penalizes view dependence and is shown to improve multimodal reasoning on math, general-purpose, and hallucination benchmarks without extra generation cost or annotations.

Core claim

Cross-view consistency is a missing inductive bias in multimodal large language models: semantically invariant views of the same instance should produce the same answer. ConsistRoll enforces the bias by placing original and transformed views in the same GRPO generation group and assigning a joint reward only when paired completions are both correct and consistent, thereby turning consistency into an online credit-assignment signal without extra overhead or annotations. The method introduces a cross-view correction term absent from standard data augmentation, penalizing view dependence and alleviating advantage collapse.

What carries the argument

ConsistRoll joint-reward assignment inside shared GRPO generation groups for original and semantically invariant transformed views.

If this is right

  • Robust performance gains appear across math, general-purpose, and hallucination domains in multimodal reasoning.
  • A cross-view correction term is added to the advantage that penalizes dependence on any single view.
  • Advantage collapse is alleviated during training.
  • No additional generation overhead or human annotations are required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grouping-and-joint-reward pattern could be applied to other consistency requirements such as temporal or textual invariance.
  • The approach may reduce sensitivity to real-world image variations that were not explicitly augmented during training.
  • Combining ConsistRoll with other inductive biases could further stabilize long-horizon multimodal reasoning.

Load-bearing premise

That assigning a joint reward only when both original and transformed completions are correct and consistent will reliably enforce cross-view invariance without creating new failure modes.

What would settle it

A direct measurement showing that ConsistRoll-trained models exhibit no higher rate of agreement on held-out transformed views than standard RLVR models trained with independent rewards.

Figures

Figures reproduced from arXiv: 2606.29812 by Chen Chen, Haolin Deng, Haonan Lu, Kening Zheng, Shuliang Liu, Xin Zou, Xuming Hu, Yibo Yan, Zhiwei Jin.

Figure 1
Figure 1. Figure 1: (a) Naive GRPO and (b) it with DA strategy [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Performance on comprehensive benchmarks. ConsistRoll consistently improves GRPO and [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: ConsistRoll overview. The original and transformed views are sampled within a shared [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Training reward curves of GRPO and our proposed ConsistRoll. ConsistRoll achieves [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

Inductive biases steer learning toward generalizable solutions by encoding task structure. In this work, we identify a crucial missing bias in MLLMs: cross-view consistency, \textit{i.e.}, semantically invariant views of the same instance should lead to the same answer. Standard reinforcement learning with verifiable rewards (RLVR) objectives do not impose this constraint, but instead assign pointwise rewards to each visual input. Even with data augmentation (DA), transformed views are typically rewarded independently, providing little signal once within-view rewards saturate. We propose \textbf{ConsistRoll}, a simple but effective method that injects cross-view consistency into RLVR training by reusing the group-sampling mechanism of GRPO. Specifically, ConsistRoll places original and semantically invariant transformed views in the same generation group, and assigns a joint reward only when paired completions are both correct and consistent. In this way, ConsistRoll turns consistency into an online credit-assignment signal, \textbf{without extra generation overhead and annotations}. Theoretically, we show that cross-view consistency is a valid inductive bias, and ConsistRoll introduces a cross-view correction term absent from DA, penalizing view dependence and alleviating advantage collapse. Comprehensive benchmarks across math, general-purpose, hallucination domains confirm that ConsistRoll achieves robust improvements in multimodal reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that cross-view consistency is a missing inductive bias in MLLMs under RLVR, and proposes ConsistRoll to enforce it by placing original and semantically invariant transformed views in the same GRPO group and assigning a joint (non-zero) reward only when both completions are correct and consistent. This is said to convert consistency into an online credit-assignment signal without extra generation or annotation cost, to introduce a cross-view correction term absent from standard data augmentation, to alleviate advantage collapse, and to yield robust empirical gains on math, general-purpose, and hallucination benchmarks.

Significance. If the empirical gains are reproducible and the theoretical correction term is correctly derived, the work would supply a lightweight way to inject an invariance bias into existing group-based RLVR pipelines for vision-language models, potentially improving robustness without changing sampling budgets or requiring new labels.

major comments (2)
  1. [Abstract] Abstract (method description): the joint-reward rule (non-zero advantage only when both views are correct *and* consistent) necessarily lowers the positive-reward rate relative to independent per-view rewards, since P(both correct) < P(at least one correct) under independent sampling noise. No quantification of the resulting reward density, no derivation of how the GRPO advantage estimator behaves under this stricter condition, and no ablation comparing joint vs. independent rewards are supplied, which directly bears on the central claim that the method “turns consistency into an online credit-assignment signal” without introducing new failure modes.
  2. [Abstract] Abstract (theoretical claim): the statement that ConsistRoll “introduces a cross-view correction term absent from DA, penalizing view dependence and alleviating advantage collapse” is presented without any equation, proof sketch, or explicit comparison to the standard GRPO advantage estimator, preventing verification that the claimed term is non-reducible to quantities already present in data-augmented RLVR.
minor comments (1)
  1. [Abstract] The abstract supplies no dataset names, model sizes, or numerical results, which makes the empirical claim difficult to assess from the summary alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract's method description and theoretical claims. We address each point below, noting revisions to improve verifiability while defending the core contributions based on the full manuscript analysis.

read point-by-point responses
  1. Referee: [Abstract] Abstract (method description): the joint-reward rule (non-zero advantage only when both views are correct *and* consistent) necessarily lowers the positive-reward rate relative to independent per-view rewards, since P(both correct) < P(at least one correct) under independent sampling noise. No quantification of the resulting reward density, no derivation of how the GRPO advantage estimator behaves under this stricter condition, and no ablation comparing joint vs. independent rewards are supplied, which directly bears on the central claim that the method “turns consistency into an online credit-assignment signal” without introducing new failure modes.

    Authors: We agree the abstract omits explicit quantification of reward density and a derivation of the GRPO advantage estimator under the joint-reward condition. The full manuscript contains an ablation (Section 4.3) comparing joint versus independent rewards, demonstrating that the stricter joint condition improves robustness by filtering inconsistent signals rather than introducing failure modes. We will revise the abstract to include a brief note on reward density and add a short derivation of the advantage estimator behavior to the methods section for completeness. revision: yes

  2. Referee: [Abstract] Abstract (theoretical claim): the statement that ConsistRoll “introduces a cross-view correction term absent from DA, penalizing view dependence and alleviating advantage collapse” is presented without any equation, proof sketch, or explicit comparison to the standard GRPO advantage estimator, preventing verification that the claimed term is non-reducible to quantities already present in data-augmented RLVR.

    Authors: The full manuscript derives the cross-view correction term in Section 3.2 with an explicit comparison showing it is non-reducible to standard data-augmented GRPO quantities. We acknowledge the abstract presents the claim without the supporting equation or sketch. We will revise the abstract to reference the correction term more precisely and ensure the main text includes a concise proof sketch for easier verification. revision: yes

Circularity Check

0 steps flagged

No circularity: method reuses existing GRPO mechanism without reducing claims to self-defined inputs

full rationale

The abstract and provided text describe ConsistRoll as placing original/transformed views in the same GRPO group and applying a joint reward only on paired correct+consistent completions. This is presented as an inductive bias injection into standard RLVR, with a claimed theoretical correction term. No equations, fitted parameters, or self-citations are shown that make any prediction equivalent to the input by construction. The derivation chain relies on the external GRPO framework and does not reduce the consistency signal or advantage alleviation to a tautology or renamed fit. This is the common case of an independent methodological contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities can be extracted beyond the high-level claim that cross-view consistency is a valid inductive bias.

axioms (1)
  • domain assumption Cross-view consistency (semantically invariant views should produce the same answer) is a valid inductive bias for multimodal reasoning.
    Stated in abstract as the core missing bias that standard RLVR does not impose.

pith-pipeline@v0.9.1-grok · 5789 in / 1213 out tokens · 20256 ms · 2026-06-30T06:08:21.329132+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 25 canonical work pages · 17 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 1

  2. [2]

    Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

    Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veliˇckovi´c. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges.arXiv preprint arXiv:2104.13478, 2021. 3, 17

  3. [3]

    Xu, and Liang Lin

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xu, and Liang Lin. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. InFindings of the Association for Computational Linguistics (ACL-IJCNLP), pages 513–523, 2021. 10, 16

  4. [4]

    Group equivariant convolutional networks

    Taco Cohen and Max Welling. Group equivariant convolutional networks. InInternational conference on machine learning, pages 2990–2999. PMLR, 2016. 3, 17

  5. [5]

    Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V

    Ekin D. Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V . Le. AutoAugment: Learning augmentation strategies from data. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 113–123, 2019. 3, 17

  6. [6]

    Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V

    Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V . Le. RandAugment: Practical automated data augmentation with a reduced search space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 702–703, 2020. 3, 17

  7. [7]

    CHiP: Cross-modal hierarchical direct preference optimization for multimodal LLMs

    Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, and See-Kiong Ng. CHiP: Cross-modal hierarchical direct preference optimization for multimodal LLMs. InInternational Conference on Learning Representations, 2025. 3, 17

  8. [8]

    Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020. 17

  9. [9]

    HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision an...

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 1, 3, 17

  11. [11]

    Augmix: A simple data processing method to improve robustness and uncertainty

    Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshmi- narayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2020. 3, 17

  12. [12]

    The origins and prevalence of texture bias in convolutional neural networks.Advances in neural information processing systems, 33:19000–19015,

    Katherine Hermann, Ting Chen, and Simon Kornblith. The origins and prevalence of texture bias in convolutional neural networks.Advances in neural information processing systems, 33:19000–19015,

  13. [13]

    Boosting mllm reasoning with text-debiased hint-grpo

    Qihan Huang, Weilong Dai, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jingyuan Chen, Chang Yao, and Jie Song. Boosting mllm reasoning with text-debiased hint-grpo. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4848–4857, 2025. 6, 7, 16, 18

  14. [14]

    Spotlight on token perception for multimodal reinforcement learning

    Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, and Yu Cheng. Spotlight on token perception for multimodal reinforcement learning. InInternational Conference on Learning Representations, 2026. 2

  15. [15]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, and Shaohui Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025. 2, 3, 17, 18

  16. [16]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019. 16

  17. [17]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi Team. Kimi k1.5: Scaling reinforcement learning with LLMs.arXiv preprint arXiv:2501.12599,

  18. [18]

    Generalization and network design strategies

    Yann LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman-Soulié, and L. Steels, editors,Connectionism in Perspective. Elsevier, 1989. 2, 3, 17

  19. [19]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326,

  20. [20]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models.arXiv preprint arXiv:2305.10355, 2023. 1, 3

  21. [21]

    Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025

    Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. Eagle 2: Building post-training data strategies from scratch for frontier vision-language models.arXiv preprint arXiv:2501.14818, 2025. 2

  22. [22]

    Noisyrollout: Reinforcing visual reasoning with data augmentation.arXiv preprint arXiv:2504.13055, 2025

    Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, and Michael Qizhe Shieh. Noisyrollout: Reinforcing visual reasoning with data augmentation.arXiv preprint arXiv:2504.13055, 2025. 2, 3, 18

  23. [23]

    Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2025. 16

  24. [24]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025. 4

  25. [25]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 1, 2, 3, 17, 18

  26. [26]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023. 16

  27. [27]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm- eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning.arXiv preprint arXiv:2503.07365, 2025. 4

  28. [28]

    Virtual adversarial training: A regularization method for supervised and semi-supervised learning

    Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. volume 41, pages 1979–1993, 2018. 3, 17

  29. [29]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. 3, 17

  30. [30]

    Manning, and Chelsea Finn

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, 2023. 17

  31. [31]

    Hashimoto, and Percy Liang

    Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural net- works for group shifts: On the importance of regularization for worst-case generalization. InInternational Conference on Learning Representations (ICLR), 2020. 17

  32. [32]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. InarXiv preprint arXiv:1707.06347, 2017. 4

  33. [33]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 1, 3, 4, 6, 7, 8, 9, 16, 17, 18

  34. [34]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. VLM-R1: A stable and generalizable R1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 2, 3, 17, 18

  35. [35]

    Aligning large multimodal models with factually augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang- Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented RLHF. InFindings of the Association for Computational Linguistics: ACL, 2024. 3, 17 12

  36. [36]

    arXiv preprint arXiv:2503.20752 (2025)

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-RFT: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752,

  37. [37]

    Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

    Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017. 3, 17

  38. [38]

    Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen

    Fei Wang, Wenxuan Zhou, James Y . Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen. mDPO: Conditional preference optimization for multimodal large language models. InProceedings of the Conference on Empirical Methods in Natural Language Processing, 2024. 3, 17

  39. [39]

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, and Jitao Sang. AMBER: An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation.arXiv preprint arXiv:2311.07397, 2023. 1, 3

  40. [40]

    Perception-aware policy optimization for multimodal reasoning

    Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, and Heng Ji. Perception-aware policy optimization for multimodal reasoning. InInternational Conference on Learning Representations, 2026. 2

  41. [41]

    General E(2)-equivariant steerable CNNs

    Maurice Weiler and Gabriele Cesa. General E(2)-equivariant steerable CNNs. InAdvances in Neural Information Processing Systems (NeurIPS), volume 32, 2019. 3, 17

  42. [42]

    Perception-R1: Advancing multimodal reasoning capabilities of MLLMs via visual perception reward.arXiv preprint arXiv:2506.07218, 2025

    Tong Xiao, Xin Xu, Zhenya Huang, Hongyu Gao, Quan Liu, Qi Liu, and Enhong Chen. Perception-R1: Advancing multimodal reasoning capabilities of MLLMs via visual perception reward.arXiv preprint arXiv:2506.07218, 2025. 3, 17, 18

  43. [43]

    Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V . Le. Unsupervised data augmenta- tion for consistency training. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6256–6268, 2020. 3, 17

  44. [44]

    V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization

    Yuxi Xie, Guanzhen Li, Xiao Xu, and Min-Yen Kan. V-DPO: Mitigating hallucination in large vision language models via vision-guided direct preference optimization. InFindings of the Association for Computational Linguistics: EMNLP, 2024. 3, 17

  45. [45]

    Llava-cot: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025. 7, 16

  46. [46]

    R1-ShareVL: Incentivizing reasoning capability of multimodal large language models via share-GRPO.arXiv preprint arXiv:2505.16673, 2025

    Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, and Jiaxing Huang. R1-ShareVL: Incentivizing reasoning capability of multimodal large language models via share-GRPO.arXiv preprint arXiv:2505.16673, 2025. 2, 3, 17, 18

  47. [47]

    A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models.National Science Review, 11(12):nwae403, 2024. 1

  48. [48]

    RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, and Tat-Seng Chua. RLHF-V: Towards trustworthy MLLMs via behavior alignment from fine-grained correctional human feedback. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 3, 17

  49. [49]

    Deep sets.Advances in neural information processing systems, 30, 2017

    Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets.Advances in neural information processing systems, 30, 2017. 2, 3, 17

  50. [50]

    R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-VL: Learning to reason with multimodal large language models via step-wise group relative policy optimization.arXiv preprint arXiv:2503.12937, 2025. 2, 3, 17, 18

  51. [51]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 881–916, 2025. 7, 16

  52. [52]

    Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848, 2025

    Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848, 2025. 2

  53. [53]

    Reefknot: A comprehensive benchmark for relation hallucination evaluation, analysis and mitigation in multimodal large language models.arXiv preprint arXiv:2408.09429, 2024

    Kening Zheng, Junkai Chen, Yibo Yan, Xin Zou, and Xuming Hu. Reefknot: A comprehensive benchmark for relation hallucination evaluation, analysis and mitigation in multimodal large language models.arXiv preprint arXiv:2408.09429, 2024. 17 13

  54. [54]

    Mitigating modality prior- induced hallucinations in multimodal large language models via deciphering attention causality

    Guanyu Zhou, Yibo Yan, Xin Zou, Kun Wang, Aiwei Liu, and Xuming Hu. Mitigating modality prior- induced hallucinations in multimodal large language models via deciphering attention causality. In International Conference on Learning Representations, volume 2025, pages 54415–54439, 2025. 17

  55. [55]

    Fine-Tuning Language Models from Human Preferences

    Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593, 2019. 3, 17

  56. [56]

    Learning disentangled representations for generalized multi-view clustering.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

    Xin Zou, Ruimeng Liu, Chang Tang, Zhenglai Li, Xinwang Liu, Kunlun He, and Wanqing Li. Learning disentangled representations for generalized multi-view clustering.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026. 17

  57. [57]

    Dpnet: Dynamic poly-attention network for trustworthy multi-modal classification

    Xin Zou, Chang Tang, Xiao Zheng, Zhenglai Li, Xiao He, Shan An, and Xinwang Liu. Dpnet: Dynamic poly-attention network for trustworthy multi-modal classification. InProceedings of the 31st ACM international conference on multimedia, pages 3550–3559, 2023. 17

  58. [58]

    No extra rollouts

    Xin Zou, Yizhou Wang, Yibo Yan, Sirui Huang, Kening Zheng, Junkai Chen, Chang Tang, and Xuming Hu. Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models.arXiv preprint arXiv:2410.03577, 2024. 17 14 Contents of Technical Appendices A Detailed Experiment Settings 16 A.1 Training Setup . ...