Listener-Rewarded Thinking in VLMs for Image Preferences
Pith reviewed 2026-05-19 07:35 UTC · model grok-4.3
The pith
A frozen listener VLM supplies confidence scores that shape RL rewards and improve reasoning consistency for image preferences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that listener-augmented GRPO trains vision-language models to answer image-preference questions both correctly and in ways that survive independent scrutiny. In this scheme a frozen VLM re-evaluates the reasoner's chain-of-thought and injects a dense, calibrated confidence score into the RL reward. The reported results are 67.4 percent accuracy on ImageReward, an OOD lift of up to six percent on a 1.2M-vote dataset, and fewer reasoning contradictions than plain GRPO or SFT.
What carries the argument
Listener-shaped reward: an independent frozen VLM evaluates the reasoner's chain-of-thought and returns a calibrated confidence score that augments the GRPO reward signal.
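As a rough illustration of how a listener confidence score could augment a GRPO reward, the sketch below adds a weighted confidence term to a 0/1 correctness reward and then applies group-normalised advantages. The additive form, the `weight` parameter, and both function names are assumptions made for illustration; the page does not state the paper's exact formula.

```python
from statistics import mean, pstdev


def listener_shaped_rewards(correct, listener_conf, weight=0.5):
    """Combine a 0/1 correctness reward with a listener confidence in [0, 1].

    ASSUMPTION: additive mixing with a hypothetical `weight`; the paper's
    actual reward shaping may differ.
    """
    if len(correct) != len(listener_conf):
        raise ValueError("correct and listener_conf must have the same length")
    return [float(c) + weight * p for c, p in zip(correct, listener_conf)]


def group_normalised_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (r_i - mean) / std over one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled completions for one prompt (toy values, not from the paper).
    rewards = listener_shaped_rewards([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
    print(group_normalised_advantages(rewards))
```

Under this toy form, two completions that are both correct still receive different advantages when the listener is more convinced by one explanation than the other, which is the sense in which the signal is dense.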
If this is right
- A reasoner trained with the listener-shaped reward reaches 67.4 percent accuracy on the ImageReward benchmark.
- The same scheme improves out-of-distribution performance on a 1.2 million vote human preference dataset by as much as six percent over a naive reasoner.
- Reasoning contradictions between the trained model and an independent evaluator drop compared with both GRPO and supervised fine-tuning baselines.
- The method supplies a scalable route to aligning vision-language models with nuanced human visual preferences without large new annotation pipelines.
Where Pith is reading between the lines
- The same listener-reward pattern could be applied to video preference or multimodal reasoning tasks where explanation consistency is the bottleneck.
- If the listener and reasoner share the same base model family, the gains might shrink; testing with deliberately mismatched listener architectures would reveal how much independence is required.
- The reduction in contradictions may improve downstream use of the generated explanations for human debugging or for chaining into larger agent systems.
- Because the listener is frozen, the method keeps training cost low and could be iterated by swapping in stronger listeners as they become available.
Load-bearing premise
An independent frozen vision-language model can give reliable, calibrated confidence scores on the reasoner's explanations without adding its own contradictions or biases.
What would settle it
Run the same training but replace the listener's confidence scores with random numbers or with scores from a model whose judgments are known to be uncorrelated with human votes; if the accuracy and OOD gains disappear, the claim is falsified.
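One way to set up such a control is sketched below: a small helper that swaps the real listener scores for uniform random numbers. The function name and the `mode` values are hypothetical, and the "shuffled" mode (a permutation control that keeps the score distribution but breaks the link to each sample) is an addition not described in the source.

```python
import random


def control_listener_scores(scores, mode, seed=0):
    """Return listener scores for a control run (hypothetical helper).

    mode="real":     keep the listener's scores unchanged
    mode="random":   replace each score with a uniform draw from [0, 1)
    mode="shuffled": permute scores across samples (added here as an
                     extra control; not part of the source text)
    """
    rng = random.Random(seed)
    if mode == "real":
        return list(scores)
    if mode == "random":
        return [rng.random() for _ in scores]
    if mode == "shuffled":
        shuffled = list(scores)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown mode: {mode!r}")
```

Training once per mode with everything else fixed would then show whether the accuracy and OOD gains depend on the information in the listener's scores or merely on the presence of an extra reward term.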
Original abstract
Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a listener-augmented Group Relative Policy Optimization (GRPO) framework for aligning vision-language models with human image preferences. A frozen independent VLM serves as a 'listener' that re-evaluates the reasoner's chain-of-thought and supplies a dense calibrated confidence score incorporated into the RL reward. The approach is reported to achieve 67.4% accuracy on the ImageReward benchmark, up to +6% improvement on out-of-distribution performance using a 1.2M-vote human preference dataset, and fewer reasoning contradictions relative to standard GRPO and SFT baselines. The trained reasoning model is released publicly.
Significance. If the central results hold after addressing independence concerns, the listener-reward mechanism offers a scalable and annotation-light route to improving generalization in preference modeling for text-to-image and video generation. The public model release is a clear strength that enables direct reproducibility and external validation. The method could meaningfully advance RL-based alignment techniques if gains are shown to reflect human preference distributions rather than listener-specific artifacts.
major comments (2)
- [Abstract] Abstract and results sections: the reported reduction in reasoning contradictions is evaluated against the same frozen listener model that supplies the reward signal; this renders the metric non-independent and risks circular validation of the reward design.
- [Results] Results on OOD performance: the +6% gain on the 1.2M-vote dataset and 67.4% ImageReward accuracy lack reported error bars, explicit dataset splits, and analysis of potential distributional overlap between the OOD test sets and the listener VLM's pretraining corpus, which could explain gains via listener-specific alignment rather than robust preference learning.
minor comments (2)
- [Abstract] The abstract states concrete performance numbers but does not reference the corresponding tables or figures that would allow readers to inspect baseline comparisons and ablations.
- [Methods] Notation for the listener confidence score and its integration into the GRPO objective could be clarified with an explicit equation in the methods section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with a focus on clarifying the methodology and strengthening the empirical presentation where possible.
- Referee: [Abstract] Abstract and results sections: the reported reduction in reasoning contradictions is evaluated against the same frozen listener model that supplies the reward signal; this renders the metric non-independent and risks circular validation of the reward design.
Authors: We agree that the contradiction metric is computed with respect to the same frozen listener VLM used to generate the reward signal. This is by design: the listener acts as a fixed, independent evaluator whose judgments define the target behavior we wish the reasoner to internalize. The metric therefore directly quantifies how successfully the training objective has aligned the reasoner with that evaluator. We nevertheless recognize that reporting improvement solely against the training listener can appear circular for external validation. In the revised manuscript we will (i) explicitly state in the abstract and results that the contradiction reduction is measured against the training listener, and (ii) add a supplementary analysis that measures contradictions against a second, architecturally distinct VLM not used during training. These clarifications and the additional check will be incorporated. revision: yes
- Referee: [Results] Results on OOD performance: the +6% gain on the 1.2M-vote dataset and 67.4% ImageReward accuracy lack reported error bars, explicit dataset splits, and analysis of potential distributional overlap between the OOD test sets and the listener VLM's pretraining corpus, which could explain gains via listener-specific alignment rather than robust preference learning.
Authors: We accept that the current results presentation would benefit from greater statistical and distributional detail. In the revised version we will: (a) report error bars for both the ImageReward accuracy and the OOD gains, obtained via multiple random seeds or bootstrap resampling; (b) provide explicit descriptions of the train/validation/test splits used for the 1.2 M-vote human-preference dataset; and (c) include a short analysis of possible overlap between the OOD test images and the listener VLM's pre-training distribution, using domain labels and semantic similarity checks. These additions will be made in the results section and supplementary material. revision: yes
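The bootstrap error bars mentioned in the rebuttal could be computed along the following lines. This is a minimal percentile-bootstrap sketch using only the standard library; the function name, the default of 2000 resamples, and the toy input are assumptions, and the numbers are synthetic rather than taken from the paper.

```python
import random


def bootstrap_accuracy_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    `hits` is a list of 0/1 flags, one per test pair (1 = model agreed with
    the human preference). Returns (point_estimate, lower, upper).
    """
    n = len(hits)
    if n == 0:
        raise ValueError("hits must not be empty")
    rng = random.Random(seed)
    # Resample the test set with replacement and recompute accuracy each time.
    accs = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lower = accs[int((alpha / 2) * n_resamples)]
    upper = accs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(hits) / n, lower, upper


if __name__ == "__main__":
    toy_hits = [1] * 70 + [0] * 30  # synthetic data, not the paper's results
    print(bootstrap_accuracy_ci(toy_hits))
```

Resampling whole test pairs captures evaluation-set variance only; variation across training seeds would need separate runs, as the rebuttal also notes.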
Circularity Check
Contradiction reduction follows by construction from the listener reward design
specific steps
- self-definitional [Abstract]
"we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model (listener) evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. [...] and reduces reasoning contradictions"
The failure mode is defined as contradiction with the listener. The reward is then built to increase the listener's confidence on the CoT (i.e., reduce contradictions with the listener). The reported reduction in contradictions is therefore achieved by construction through the listener-shaped reward rather than emerging as a separate validation.
full rationale
The paper defines a key failure mode explicitly in terms of the reasoner's output contradicting an independent frozen listener VLM. It then constructs the RL reward to optimize for higher listener confidence scores on the CoT, explicitly to make explanations 'persuasive to an independent model.' Reporting reduced contradictions is therefore a direct consequence of the reward objective rather than an independent empirical finding. However, the primary claims—67.4% on the ImageReward benchmark and up to +6% OOD gains on the separate 1.2M-vote human preference dataset—remain external to the listener and supply independent content, so the overall derivation does not collapse entirely to its inputs.
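To make the circularity concern concrete, the sketch below computes a contradiction rate against an arbitrary judge, so that the same reasoner outputs can be scored once against the training listener and once against a held-out model. Treating a "contradiction" as the judge picking a different image than the reasoner after reading its chain-of-thought is an assumption made here; the page does not give the paper's operational definition, and the function name is hypothetical.

```python
def contradiction_rate(reasoner_choices, judge_choices):
    """Fraction of samples where the judge's choice differs from the reasoner's.

    ASSUMPTION: each list holds one label per sample (for example "A" or "B"),
    and the judge's label is the answer it reaches after reading the
    reasoner's chain-of-thought.
    """
    if len(reasoner_choices) != len(judge_choices):
        raise ValueError("both lists must have the same length")
    if not reasoner_choices:
        raise ValueError("at least one sample is required")
    disagreements = sum(r != j for r, j in zip(reasoner_choices, judge_choices))
    return disagreements / len(reasoner_choices)


if __name__ == "__main__":
    reasoner = ["A", "B", "A", "B"]            # toy labels
    training_listener = ["A", "B", "A", "B"]   # the model that shaped the reward
    held_out_judge = ["A", "B", "B", "B"]      # a model not used in training
    print(contradiction_rate(reasoner, training_listener))
    print(contradiction_rate(reasoner, held_out_judge))
```

A drop that appears only against the training listener would be consistent with the by-construction reading, while a drop that also appears against the held-out judge would count as independent evidence.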
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The independent frozen listener VLM supplies a reliable, calibrated confidence score on reasoning traces that improves the reasoner when used in the RL objective.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
listener re-evaluates the reasoner’s chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
GRPO objective with group-normalised advantage $A_i = (r_i - \mu)/\sigma$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.