Recognition: unknown
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
Pith reviewed 2026-05-10 01:41 UTC · model grok-4.3
The pith
A debiased construction pipeline and iterative training framework improves multimodal reward models by curating noisy preference data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose DT2IT-MRM, which combines a debiased preference construction pipeline, a reformulation of text-to-image preference data, and an iterative training framework to curate existing multimodal preference datasets. This addresses challenges of insufficient preference strength granularity, textual style bias, and unreliable signals, enabling multimodal reward models to reach new state-of-the-art overall performance on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
What carries the argument
The DT2IT-MRM framework that uses a debiased preference construction pipeline together with iterative training to reformulate and clean text-to-image and multimodal preference data.
If this is right
- Multimodal large language models achieve better alignment with human preferences through the improved reward models.
- Open-source multimodal preference datasets become usable at scale after curation rather than requiring replacement.
- Textual style bias and unreliable preference signals can be systematically mitigated in reward model training.
- Iterative training allows progressive refinement of preference data quality over multiple rounds.
Where Pith is reading between the lines
- The curation approach might extend to preference datasets in other modalities if the core bias-reduction steps prove general.
- Models trained this way could support more stable reinforcement learning loops in multimodal settings by providing cleaner reward signals.
- Future experiments could measure whether the iterative process preserves diversity in preferences or narrows them toward the curation heuristics.
Load-bearing premise
The debiased pipeline and iterative training reduce noise, textual bias, and unreliable signals in existing datasets without introducing new biases or overfitting to the curation steps.
What would settle it
Training a reward model on the original uncurated datasets and finding that it matches or exceeds DT2IT-MRM performance on all three benchmarks would falsify the necessity of the proposed curation methods.
Figures
read the original abstract
Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose \textbf{DT2IT-MRM}, which integrates a \textbf{D}ebiased preference construction pipeline, a novel reformulation of text-to-image (\textbf{T2I}) preference data, and an \textbf{I}terative \textbf{T}raining framework that curates existing multimodal preference datasets for \textbf{M}ultimodal \textbf{R}eward \textbf{M}odeling. Our experimental results show that DT2IT-MRM achieves new \textbf{state-of-the-art} overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DT2IT-MRM, a framework that combines a debiased preference construction pipeline, reformulation of text-to-image (T2I) preference data, and an iterative training loop to curate and improve existing multimodal preference datasets for training multimodal reward models (MRMs). It claims that this approach addresses noise, textual style bias, and unreliable signals in prior datasets, yielding new state-of-the-art overall performance on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
Significance. If the debiased curation and iterative training demonstrably reduce biases and noise without introducing artifacts or overfitting to the curation process, the work would offer a practical, scalable method for enhancing preference data quality in multimodal alignment, potentially improving downstream MLLM safety and helpfulness.
major comments (3)
- [§4, §3.2] §4 (Experiments) and §3.2 (Iterative Training): the SOTA claim rests on the assertion that the debiased pipeline and iterative loop produce higher-quality preferences that generalize, yet no pre/post quantitative metrics (e.g., textual style bias scores or preference consistency) are reported to confirm bias reduction occurred independently of benchmark performance.
- [Table 2] Table 2 and ablation studies: the reported gains on the three benchmarks are not isolated from potential confounds such as increased data volume or additional training compute; an ablation comparing the full DT2IT-MRM pipeline against simply scaling the original datasets with the same compute budget is required to support the central methodological contribution.
- [§3.1] §3.1 (Debiased Preference Construction): the pipeline uses signals derived from models in the same family or trained on overlapping distributions; without a held-out curation model or explicit cross-validation, the iterative loop risks circular fitting to curation artifacts rather than true preference alignment.
minor comments (2)
- [§2] Notation for preference strength granularity is introduced in §2 but never quantified in the experimental tables; clarify how the five-level scale is mapped to training loss.
- [Figure 1, §3.3] Figure 1 caption and §3.3 lack details on the exact number of iterations and convergence criteria used in the iterative training loop.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the claims without misrepresenting the work.
read point-by-point responses
-
Referee: [§4, §3.2] §4 (Experiments) and §3.2 (Iterative Training): the SOTA claim rests on the assertion that the debiased pipeline and iterative loop produce higher-quality preferences that generalize, yet no pre/post quantitative metrics (e.g., textual style bias scores or preference consistency) are reported to confirm bias reduction occurred independently of benchmark performance.
Authors: We acknowledge that explicit pre/post metrics would provide stronger, independent evidence of bias reduction. While benchmark gains offer supporting evidence of generalization, we will add these quantitative metrics (textual style bias scores and preference consistency) before and after the pipeline in the revised manuscript to directly substantiate the effect. revision: yes
-
Referee: [Table 2] Table 2 and ablation studies: the reported gains on the three benchmarks are not isolated from potential confounds such as increased data volume or additional training compute; an ablation comparing the full DT2IT-MRM pipeline against simply scaling the original datasets with the same compute budget is required to support the central methodological contribution.
Authors: This concern is valid. To isolate the methodological contributions, we will include a new ablation in the revision that compares the full DT2IT-MRM pipeline against scaling the original datasets to equivalent volume and training with the same compute budget. This will clarify that gains arise from debiased construction and iterative training rather than scaling alone. revision: yes
-
Referee: [§3.1] §3.1 (Debiased Preference Construction): the pipeline uses signals derived from models in the same family or trained on overlapping distributions; without a held-out curation model or explicit cross-validation, the iterative loop risks circular fitting to curation artifacts rather than true preference alignment.
Authors: We appreciate the point on potential circularity. Our pipeline uses models with differing training distributions where feasible to reduce overlap. In revision we will expand §3.1 with additional discussion of model choices and report cross-validation results on held-out models during iterative training to demonstrate that improvements are not limited to curation artifacts. revision: partial
Circularity Check
No circularity detected; abstract describes pipeline without equations or self-referential reductions
full rationale
The provided abstract and context contain no equations, fitted parameters, or derivation steps that could reduce to inputs by construction. The proposal of DT2IT-MRM (debiased pipeline + T2I reformulation + iterative training) is presented as a methodological contribution without any visible self-definition, fitted-input prediction, or load-bearing self-citation. No specific reductions (e.g., a 'prediction' that is the fit itself) are present to analyze. This is the common case of a high-level methods paper whose internal logic cannot be inspected for circularity from the given text; the derivation chain is not detailed enough to trigger any of the enumerated patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Claude-3.7-Sonnet.https : / / www
Anthropic. Claude-3.7-Sonnet.https : / / www . anthropic . com / news / claude - 3 - 7 - sonnet,
-
[2]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-VL tech- nical report.ArXiv, abs/2511.21631, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-VL technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 3, 5 10
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
An augmented benchmark dataset for geometric question answering through dual parallel text en- coding
Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text en- coding. InProceedings of the 29th international conference on computational linguistics, pages 1511–1520, 2022. 14
2022
-
[5]
MapQA: A dataset for ques- tion answering on choropleth maps
Shuaichen Chang, David Palzer, Jialin Li, Eric Fosler- Lussier, and Ningchuan Xiao. MapQA: A dataset for ques- tion answering on choropleth maps. InNeurIPS 2022 First Table Representation Workshop, 2022. 14
2022
-
[6]
MLLM-as-a-Judge: Assessing multimodal LLM-as-a-Judge with vision-language benchmark
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-Judge: Assessing multimodal LLM-as-a-Judge with vision-language benchmark. InForty- first International Conference on Machine Learning, 2024. 1, 4
2024
-
[7]
UniGeo: Unifying ge- ometry logical reasoning via reformulating mathematical ex- pression
Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. UniGeo: Unifying ge- ometry logical reasoning via reformulating mathematical ex- pression. InThe 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022. 14
2022
-
[8]
M 3CoT: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought
Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M 3CoT: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024. 14
2024
-
[9]
Evaluating mllms with multimodal multi-image reasoning benchmark
Ziming Cheng, Binrui Xu, Lisheng Gong, Zuhe Song, Tian- shuo Zhou, Shiqi Zhong, Siyu Ren, Mingxiang Chen, Xi- angchao Meng, Yuxin Zhang, et al. Evaluating mllms with multimodal multi-image reasoning benchmark.arXiv preprint arXiv:2506.04280, 2025. 1, 4
-
[10]
Gemini-1.5-Pro.https : / / deepmind
Google DeepMind. Gemini-1.5-Pro.https : / / deepmind . google / technologies / gemini / pro/, 2024. 5
2024
-
[11]
G-llava: Solving geometric problem with multi-modal large language model
Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wan- jun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-LLaV A: Solving geometric prob- lem with multi-modal large language model.arXiv preprint arXiv:2312.11370, 2023. 14
-
[12]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
EvalMuse-40K: A reliable and fine-grained bench- mark with comprehensive human annotations for text-to- image generation model evaluation
Shuhao Han, Haotian Fan, Jiachen Fu, Liang Li, Tao Li, Jun- hui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Chunle Guo, et al. EvalMuse-40K: A reliable and fine-grained bench- mark with comprehensive human annotations for text-to- image generation model evaluation. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026. 4, 9
2026
-
[14]
GQA: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6700–6709, 2019. 14
2019
-
[15]
Safe RLHF-V: Safe reinforce- ment learning from multi-modal human feedback
Jiaming JI, Xinyu CHEN, Rui PAN, Han ZHU, Jiahao LI, Donghai HONG, Boyuan CHEN, Jiayi ZHOU, Kaile W ANG, Juntao DAI, Chi min CHAN, Sirui HAN, Yike GUO, and Yaodong Y ANG. Safe RLHF-V: Safe reinforce- ment learning from multi-modal human feedback. InPro- ceedings of the 39th Conference on Neural Information Pro- cessing Systems (NeurIPS 2025), 2026. 1
2025
-
[16]
Zhuoran Jin, Hongbang Yuan, Kejian Zhu, Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Omni-Reward: Towards generalist omni-modal reward modeling with free-form preferences.arXiv preprint arXiv:2510.23451, 2025. 3, 4, 5, 9
-
[17]
CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning
Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017. 14
2017
-
[18]
DVQA: Understanding data visualizations via ques- tion answering
Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via ques- tion answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656,
-
[19]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InComputer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 235– 251, 2016. 14
2016
-
[20]
LLaV A-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild, 2024
Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. LLaV A-NeXT: Stronger LLMs supercharge multimodal capabilities in the wild, 2024. 8
2024
-
[21]
VLFeedback: A large-scale AI feedback dataset for large vision-language models alignment
Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, and Qi Liu. VLFeedback: A large-scale AI feedback dataset for large vision-language models alignment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6227–6246, 2024. 1, 3, 5
2024
-
[22]
VL-RewardBench: A challenging benchmark for vision-language generative reward models
Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, et al. VL-RewardBench: A challenging benchmark for vision-language generative reward models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24657–24668, 2025. 2, 5
2025
-
[23]
The devil is in the details: Tackling unimodal spurious correlations for general- izable multimodal reward models
Zichao Li, Xueru Wen, Jie Lou, Yuqiu Ji, Yaojie Lu, Xian- pei Han, Debing Zhang, and Le Sun. The devil is in the details: Tackling unimodal spurious correlations for general- izable multimodal reward models. InForty-second Interna- tional Conference on Machine Learning, ICML 2025, 2025. 3
2025
-
[24]
Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jia- cai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, et al. Skywork-Reward-V2: Scaling prefer- ence data curation via Human-AI synergy.arXiv preprint arXiv:2507.01352, 2025. 1
-
[25]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 8
2023
-
[26]
Llavanext: Improved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan 11 Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024. 3
2024
-
[27]
Inter-GPS: Inter- pretable geometry problem solving with formal language and symbolic reasoning
Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Inter- pretable geometry problem solving with formal language and symbolic reasoning. InProceedings of the 59th An- nual Meeting of the Association for Computational Linguis- tics and the 11th International Joint Conference on Natu- ral Language Processi...
2021
-
[28]
IconQA: A new benchmark for abstract diagram under- standing and visual language reasoning
Pan Lu, Liang Qiu, Jiaqi Chen, Tanglin Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. IconQA: A new benchmark for abstract diagram under- standing and visual language reasoning. InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Bench- marks 2021, December 2021, virt...
2021
-
[29]
Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,
-
[30]
WildVision: Evaluat- ing vision-language models in the wild with human prefer- ences.Advances in Neural Information Processing Systems, 37:48224–48255, 2024
Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, and Bill Yuchen Lin. WildVision: Evaluat- ing vision-language models in the wild with human prefer- ences.Advances in Neural Information Processing Systems, 37:48224–48255, 2024. 5
2024
-
[31]
Joty, and Enamul Hoque
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq R. Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2263– 2279, 2022. 14
2022
-
[32]
Infograph- icVQA
Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infograph- icVQA. InProceedings of the IEEE/CVF Winter Confer- ence on Applications of Computer Vision, pages 1697–1706,
-
[33]
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforce- ment learning.arXiv preprint arXiv:2503.07365, 2025. 1, 14
work page Pith review arXiv 2025
-
[34]
Llama 3.2.https://www.llama.com, 2024
Meta. Llama 3.2.https://www.llama.com, 2024. 5
2024
-
[35]
GPT-4o mini: advancing cost-efficient intel- ligence
OpenAI. GPT-4o mini: advancing cost-efficient intel- ligence. https://openai.com/index/gpt-4o-mini-advancing- cost-efficient-intelligence/, 2024. 3
2024
-
[36]
GPT-5 system card.https://cdn.openai
OpenAI. GPT-5 system card.https://cdn.openai. com/gpt-5-system-card.pdf, 2025. Version dated Aug 13, 2025. 3
2025
-
[37]
Introducing GPT-5.2.https://openai.com/ index/introducing-gpt-5-2/, 2025
OpenAI. Introducing GPT-5.2.https://openai.com/ index/introducing-gpt-5-2/, 2025. Version dated Dec 11, 2025. 5
2025
-
[38]
Large multi-modal models for strong perfor- mance and efficient deployment.https://github
OpenBMB. Large multi-modal models for strong perfor- mance and efficient deployment.https://github. com/OpenBMB/OmniLMM, 2024. Accessed: 2024-03-05. 3
2024
-
[39]
Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems, 35:27730– 27744, 2022. 1, 4
2022
-
[40]
Renjie Pi, Haoping Bai, Qibin Chen, Xiaoming Simon Wang, Jiulong Shan, Xiaojiang Liu, and Meng Cao. MR. Judge: Multimodal reasoner as a judge. InProceedings of the 2025 Conference on Empirical Methods in Natural Lan- guage Processing, pages 20192–20216, 2025. 3, 5
2025
-
[41]
Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems, 36:53728–53741, 2023
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems, 36:53728–53741, 2023. 1
2023
-
[42]
Solving geometry problems: Combining text and diagram interpretation
Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Et- zioni, and Clint Malcolm. Solving geometry problems: Combining text and diagram interpretation. InProceedings of the 2015 conference on empirical methods in natural lan- guage processing, pages 1466–1476, 2015. 14
2015
-
[43]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024. 1
work page internal anchor Pith review arXiv 2024
-
[44]
Learning to summarize with human feed- back.Advances in neural information processing systems, 33:3008–3021, 2020
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feed- back.Advances in neural information processing systems, 33:3008–3021, 2020. 8
2020
-
[45]
Aligning large multimodal models with factually aug- mented RLHF
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu- Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Dar- rell. Aligning large multimodal models with factually aug- mented RLHF. InFindings of the Association for Computa- tional Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-...
2024
-
[46]
Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, et al. Secrets of RLHF in large language models part II: Reward modeling.arXiv preprint arXiv:2401.06080, 2024. 1, 5
-
[47]
arXiv preprint arXiv:2411.10442 , year=
Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, et al. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization.arXiv preprint arXiv:2411.10442, 2024. 1, 3
-
[48]
Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jin- guo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Sheng- long Ye, Xizhou Zhu, et al. VisualPRM: An effective pro- cess reward model for multimodal reasoning.arXiv preprint arXiv:2503.10291, 2025. 1
-
[49]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong 12 Ye, Jie Shao, et al. InternVL3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 1, 3
work page internal anchor Pith review arXiv 2025
-
[50]
Llava-critic-r1: Your critic model is secretly a strong policy model.CoRR, abs/2509.00676, 2025
Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, and Furong Huang. Llava-critic-r1: Your critic model is secretly a strong policy model.arXiv preprint arXiv:2509.00676, 2025. 3
-
[51]
Xiaokun Wang, Peiyu Wang, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, et al. Skywork-VL Reward: An effective reward model for multimodal understanding and reasoning. arXiv preprint arXiv:2505.07263, 2025. 1, 2, 3, 5
-
[52]
Unified multimodal chain-of-thought reward model through reinforcement fine- tuning
Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain-of-thought reward model through reinforcement fine- tuning. InAdvances in Neural Information Processing Sys- tems, 2025. 3, 5
2025
-
[53]
Unified Reward Model for Multimodal Understanding and Generation
Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation.arXiv preprint arXiv:2503.05236, 2025. 3, 5
work page internal anchor Pith review arXiv 2025
-
[54]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,
work page internal anchor Pith review arXiv
-
[55]
LLaV A-Critic: Learning to evaluate multimodal models
Tianyi Xiong, Xiyao Wang, Dong Guo, Qinghao Ye, Haoqi Fan, Quanquan Gu, Heng Huang, and Chunyuan Li. LLaV A-Critic: Learning to evaluate multimodal models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 13618–13628, 2025. 1, 3, 5, 8
2025
-
[56]
Michihiro Yasunaga, Luke Zettlemoyer, and Marjan Ghazvininejad. Multimodal RewardBench: Holistic evalu- ation of reward models for vision language models.arXiv preprint arXiv:2502.14191, 2025. 2, 5
-
[57]
RLAIF-V: Open-source ai feedback leads to super GPT-4V trustworthiness
Tianyu Yu, Haoye Zhang, Qiming Li, Qixin Xu, Yuan Yao, Da Chen, Xiaoman Lu, Ganqu Cui, Yunkai Dang, Taiwen He, et al. RLAIF-V: Open-source ai feedback leads to super GPT-4V trustworthiness. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19985– 19995, 2025. 3, 5
2025
-
[58]
MM-Vet: Evaluating large multimodal models for integrated capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. InForty-first International Conference on Ma- chine Learning, ICML 2024, 2024. 8
2024
-
[59]
InternLM-XComposer2.5-Reward: A simple yet effective multi-modal reward model
Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, and Jiaqi Wang. InternLM-XComposer2.5-Reward: A simple yet effective multi-modal reward model. InFindings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, page...
2025
-
[60]
SPA- VL: A comprehensive safety preference alignment dataset for vision language models
Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xu- anjing Huang, Feng Zhao, Tao Gui, and Jing Shao. SPA- VL: A comprehensive safety preference alignment dataset for vision language models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15...
2025
-
[61]
Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, et al. R1-Reward: Training multimodal reward model through stable reinforcement learning.arXiv preprint arXiv:2505.02835, 2025. 3, 5, 7
-
[62]
BaseReward: A strong baseline for multimodal reward model.arXiv preprint arXiv:2509.16127,
Yi-Fan Zhang, Haihua Yang, Huanyu Zhang, Yang Shi, Zezhou Chen, Haochen Tian, Chaoyou Fu, Haotian Wang, Kai Wu, Bo Cui, et al. BaseReward: A strong baseline for multimodal reward model.arXiv preprint arXiv:2509.16127,
-
[63]
MM-RLHF: The next step forward in multimodal LLM alignment
Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, et al. MM-RLHF: The next step forward in multimodal LLM alignment. InProceedings of the 42nd In- ternational Conference on Machine Learning (ICML), 2025. Poster. 1, 2, 3, 5
2025
-
[64]
LlamaFac- tory: Unified efficient fine-tuning of 100+ language mod- els
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFac- tory: Unified efficient fine-tuning of 100+ language mod- els. InProceedings of the 62nd Annual Meeting of the As- sociation for Computational Linguistics (Volume 3: System Demonstrations), 2024. 5
2024
-
[65]
Jiayi Zhou, Jiaming Ji, Boyuan Chen, Jiapeng Sun, Wenqi Chen, Donghai Hong, Sirui Han, Yike Guo, and Yaodong Yang. Generative RLHF-V: Learning principles from multi- modal human preference.arXiv preprint arXiv:2505.18531,
-
[66]
arXiv preprint arXiv:2402.11411 , year=
Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large lan- guage models via preference fine-tuning.arXiv preprint arXiv:2402.11411, 2024. 3, 5
-
[67]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 1, 5 13 A. Prompt Collection In our debiased preference distillation pipeline, our prompt collect...
work page internal anchor Pith review arXiv 2025
-
[68]
Hallucinations refer to information that conflicts with or is not present in the image or user’s prompt
Accuracy: Accuracy is determined by the number of hallucinations in the response. Hallucinations refer to information that conflicts with or is not present in the image or user’s prompt. To detect hallucinations, you must identify any non-existent or inaccurate information by evaluating the following: Does the response include objects or details not prese...
-
[69]
Helpfulness: Helpfulness means the answer correctly responds to the user’s prompt or follows user’s instructions. Are all parts of the response closely related to and appropriate for the user’s prompt and the given image? If the user’s prompt is ambiguous or has more than one interpretation, does the assistant appropriately request clarification or make r...
-
[70]
Completeness: Does the response comprehensively address all the requests in the user’s prompt? Does the response include all necessary detailed information? Is the necessary analysis process and analysis perspective complete in the response?
-
[71]
Language Quality: Is the response fluent, clear, and to the point, avoiding excessive verbosity or redundant content? Does the tone and style align with the task requirements (e.g., academic, casual, explanatory)?
-
[72]
Creativity: Consider the creativity and novelty of the response when needed, such as writing a story. Is the additional detailed information beyond the user’s question in the response reasonable and rich? Does the response demonstrate creativity or unique insights? If the user’s question does not require creativity, then score 10
-
[73]
For safety, assess if the response gives appropriate warnings or avoids providing advice on sensitive topics, such as medical images
Ethics: Please evaluate four aspects: safety, privacy, fairness and harmfulness. For safety, assess if the response gives appropriate warnings or avoids providing advice on sensitive topics, such as medical images. For privacy, does the assistant refrain from stating identification information in the image that could compromise personal privacy? For fairn...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.