arxiv: 2506.22832 · v3 · submitted 2025-06-28 · 💻 cs.CV · cs.AI

Listener-Rewarded Thinking in VLMs for Image Preferences

Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets

Pith reviewed 2026-05-19 07:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords listener reward · vision language models · image preferences · reinforcement learning · GRPO · chain of thought · preference alignment · reasoning consistency

The pith

A frozen listener VLM supplies confidence scores that shape RL rewards and improve reasoning consistency for image preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that adding an independent frozen vision-language model as a listener to evaluate and score a reasoner's chain-of-thought produces better rewards in group relative policy optimization. This listener-shaped reward pushes the reasoner to generate explanations that an outside model finds persuasive, which reduces internal contradictions while lifting accuracy and out-of-distribution performance. A sympathetic reader would care because existing reward models for aligning text-to-image generators often memorize training data and fail on new human preferences. The approach offers a data-efficient alternative that avoids heavy new annotation by reusing the listener's existing calibration. The reported gains are 67.4 percent accuracy on the ImageReward benchmark and up to six percent better results on a 1.2 million vote human preference set.

Core claim

The central claim is that listener-augmented GRPO, in which a frozen VLM re-evaluates the reasoner's chain-of-thought and injects a dense calibrated confidence score into the RL reward, trains vision-language models to answer image-preference questions both correctly and in ways that survive independent scrutiny, yielding 67.4 percent accuracy on ImageReward, up to six percent OOD lift on a 1.2M-vote dataset, and fewer reasoning contradictions than plain GRPO or SFT.

What carries the argument

Listener-shaped reward: an independent frozen VLM evaluates the reasoner's chain-of-thought and returns a calibrated confidence score that augments the GRPO reward signal.
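
As a rough illustration of the mechanism described above, the sketch below adds a listener confidence term to a binary correctness reward and then normalises rewards within a rollout group, GRPO style. The additive form, the `weight` coefficient, the [0, 1] range of the confidence, and both function names are assumptions made for illustration; the paper's exact reward is not specified in the material reviewed here.

```python
from statistics import mean, pstdev


def shaped_rewards(correct, listener_conf, weight=0.5):
    """Combine a correctness reward with a listener confidence score.

    correct:       list of bools, one per rollout in the group
    listener_conf: list of floats, assumed to lie in [0, 1], giving the
                   frozen listener's confidence after reading each rollout's
                   chain-of-thought
    weight:        hypothetical mixing coefficient (not taken from the paper)
    """
    if len(correct) != len(listener_conf):
        raise ValueError("one listener score is needed per rollout")
    return [float(c) + weight * p for c, p in zip(correct, listener_conf)]


def group_normalised_advantages(rewards, eps=1e-8):
    """Group-normalised advantage A_i = (r_i - mu) / sigma over one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps guards against division by zero when all rewards are identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one image pair: two correct, two incorrect.
rewards = shaped_rewards([True, False, True, False], [0.9, 0.2, 0.6, 0.4])
advantages = group_normalised_advantages(rewards)
```

With this form, two rollouts that both answer correctly still receive different advantages when the listener finds one explanation more convincing, which is the dense signal the listener is meant to supply.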

If this is right

The listener reward reaches 67.4 percent accuracy on the ImageReward benchmark.
The same scheme improves out-of-distribution performance on a 1.2 million vote human preference dataset by as much as six percent over a naive reasoner.
Reasoning contradictions between the trained model and an independent evaluator drop compared with both GRPO and supervised fine-tuning baselines.
The method supplies a scalable route to aligning vision-language models with nuanced human visual preferences without large new annotation pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same listener-reward pattern could be applied to video preference or multimodal reasoning tasks where explanation consistency is the bottleneck.
If the listener and reasoner share the same base model family, the gains might shrink; testing with deliberately mismatched listener architectures would reveal how much independence is required.
The reduction in contradictions may improve downstream use of the generated explanations for human debugging or for chaining into larger agent systems.
Because the listener is frozen, the method keeps training cost low and could be iterated by swapping in stronger listeners as they become available.

Load-bearing premise

An independent frozen vision-language model can give reliable, calibrated confidence scores on the reasoner's explanations without adding its own contradictions or biases.

What would settle it

Run the same training but replace the listener's confidence scores with random numbers or with scores from a model whose judgments are known to be uncorrelated with human votes; if the accuracy and OOD gains disappear, the claim is falsified.
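
A minimal sketch of how such a control could be wired in, assuming the listener scores for a batch are available as a plain list of floats. The function name and the `shuffle` mode are hypothetical additions: shuffling keeps the score distribution intact while breaking its link to individual rollouts, which is one way to obtain uninformative scores besides drawing fresh random numbers.

```python
import random


def control_scores(listener_conf, mode="shuffle", seed=0):
    """Return uninformative stand-ins for the listener's confidence scores.

    mode="shuffle": permute the real scores across rollouts, keeping their
                    distribution but destroying the rollout-score pairing
    mode="uniform": draw independent random numbers in [0, 1)
    """
    rng = random.Random(seed)  # seeded so the control run is reproducible
    if mode == "shuffle":
        scores = list(listener_conf)
        rng.shuffle(scores)
        return scores
    if mode == "uniform":
        return [rng.random() for _ in listener_conf]
    raise ValueError(f"unknown mode: {mode}")
```

Training once with the real scores and once with `control_scores(...)` in their place, under otherwise identical settings, would give the comparison described above.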

Figures

Figures reproduced from arXiv: 2506.22832 by Alexander Gambashidze, Andrey Galichin, Andrey Kuznetsov, Anton Gusarov, Ivan Oseledets, Konstantin Sobolev, Li Pengyi, Matvey Skripkin.

**Figure 2.** Listener–reasoner disagreement is a strong error signal. Each point aggregates ImageReward test pairs whose $\ell_2$ distance $\lVert (s^{\text{instr}}_1, s^{\text{instr}}_2) - (s^{\text{reason}}_1, s^{\text{reason}}_2) \rVert_2$ falls in a bin. Accuracy drops as the two score vectors diverge.

4.2 Soft rewards. Let $C = \{V, P, T, A_{<t}\}$ denote the conditioning context (visual input $V$, prompt $P$, reasoning tokens $T$, and partial answer $A_{<t}$). The policy $\pi_\theta$ outputs log…

**Figure 3.** Accuracy on the high-quality modern dataset [23] at different human agreement thresholds. The listener mechanism consistently improves generalization beyond the strong GRPO baseline. Supervised Fine-Tuning and Reasoners are initialized from the same Qwen2.5-VL-7B-Instruct checkpoint.

5 Experiments. We initialize our models with Qwen2.5-VL-7B-Instruct [26] and evaluate on the ImageReward test set and a large, …

**Figure 4.** Majority voting across multiple reasoning rollouts yields only an insignificant improvement on OOD data.
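
Two of the analyses behind these figures are simple enough to sketch. The first function bins test pairs by the ℓ2 distance between two score vectors and reports accuracy per bin, in the spirit of Figure 2; the second takes a majority vote over the answers from several rollouts, as in Figure 4. The data layout, the bin width, and the function names are assumptions for illustration, not the authors' code.

```python
import math
from collections import Counter, defaultdict


def accuracy_by_disagreement(pairs, bin_width=0.1):
    """Accuracy per disagreement bin.

    pairs: iterable of (instr_scores, reason_scores, is_correct), where each
           score entry is a 2-tuple (s1, s2) for the two candidate images.
    Returns {bin_index: accuracy}; bin k covers distances in
    [k * bin_width, (k + 1) * bin_width).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for instr, reason, is_correct in pairs:
        dist = math.dist(instr, reason)  # Euclidean (l2) distance
        k = int(dist // bin_width)
        totals[k] += 1
        hits[k] += int(is_correct)
    return {k: hits[k] / totals[k] for k in sorted(totals)}


def majority_vote(answers):
    """Most common answer across rollouts; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one rollout answer")
    return Counter(answers).most_common(1)[0][0]
```

Returning bin indices rather than floating-point bin edges avoids awkward dictionary keys such as 0.30000000000000004.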

Original abstract

Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: https://huggingface.co/alexgambashidze/qwen2.5vl_image_preference_reasoner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Listener-augmented rewards improve VLM reasoning consistency on preferences but may align more to the listener than to humans.

The punchline here is that adding a listener reward to GRPO helps VLMs produce more consistent reasoning for image preferences, leading to top scores on benchmarks and better out-of-distribution results. What the paper does is identify that standard approaches suffer when the reasoner's chain-of-thought clashes with an independent model's view. They then use the listener to give a confidence score on that reasoning trace and fold it into the reward. This encourages explanations that hold up under external scrutiny.

The results include 67.4 percent accuracy on ImageReward, which beats the baselines, and up to six percent gains on a large human preference dataset with over a million votes. They also show fewer contradictions overall. This approach is new in its specific reward shaping around listener agreement, and it builds on existing GRPO methods without overcomplicating things. Releasing the model on Hugging Face is a plus for anyone wanting to build on it.

The soft spot is the one raised in the stress test. Since the reward comes directly from the listener's evaluation, the reasoner could end up exploiting patterns that the listener likes, which might not align perfectly with human preferences. Measuring reduced contradictions against the same listener adds some circularity to that claim. The abstract does not detail how the listener was chosen or provide ablations that rule out listener-specific effects, so the generalization story needs more support from the full methods and experiments.

Overall, this paper is for researchers focused on aligning vision-language models with human visual preferences through reinforcement learning. It shows clear thinking on a practical failure mode and has enough empirical backing to merit a serious referee. I would send it out for peer review with the expectation that reviewers will probe the independence of the listener signal and the robustness of the OOD improvements.

Referee Report

2 major / 2 minor

Summary. The paper proposes a listener-augmented Group Relative Policy Optimization (GRPO) framework for aligning vision-language models with human image preferences. A frozen independent VLM serves as a 'listener' that re-evaluates the reasoner's chain-of-thought and supplies a dense calibrated confidence score incorporated into the RL reward. The approach is reported to achieve 67.4% accuracy on the ImageReward benchmark, up to +6% improvement on out-of-distribution performance using a 1.2M-vote human preference dataset, and fewer reasoning contradictions relative to standard GRPO and SFT baselines. The trained reasoning model is released publicly.

Significance. If the central results hold after addressing independence concerns, the listener-reward mechanism offers a scalable and annotation-light route to improving generalization in preference modeling for text-to-image and video generation. The public model release is a clear strength that enables direct reproducibility and external validation. The method could meaningfully advance RL-based alignment techniques if gains are shown to reflect human preference distributions rather than listener-specific artifacts.

major comments (2)

[Abstract] Abstract and results sections: the reported reduction in reasoning contradictions is evaluated against the same frozen listener model that supplies the reward signal; this renders the metric non-independent and risks circular validation of the reward design.
[Results] Results on OOD performance: the +6% gain on the 1.2M-vote dataset and 67.4% ImageReward accuracy lack reported error bars, explicit dataset splits, and analysis of potential distributional overlap between the OOD test sets and the listener VLM's pretraining corpus, which could explain gains via listener-specific alignment rather than robust preference learning.

minor comments (2)

[Abstract] The abstract states concrete performance numbers but does not reference the corresponding tables or figures that would allow readers to inspect baseline comparisons and ablations.
[Methods] Notation for the listener confidence score and its integration into the GRPO objective could be clarified with an explicit equation in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with a focus on clarifying the methodology and strengthening the empirical presentation where possible.

Referee: [Abstract] Abstract and results sections: the reported reduction in reasoning contradictions is evaluated against the same frozen listener model that supplies the reward signal; this renders the metric non-independent and risks circular validation of the reward design.

Authors: We agree that the contradiction metric is computed with respect to the same frozen listener VLM used to generate the reward signal. This is by design: the listener acts as a fixed, independent evaluator whose judgments define the target behavior we wish the reasoner to internalize. The metric therefore directly quantifies how successfully the training objective has aligned the reasoner with that evaluator. We nevertheless recognize that reporting improvement solely against the training listener can appear circular for external validation. In the revised manuscript we will (i) explicitly state in the abstract and results that the contradiction reduction is measured against the training listener, and (ii) add a supplementary analysis that measures contradictions against a second, architecturally distinct VLM not used during training. These clarifications and the additional check will be incorporated.

revision: yes

Referee: [Results] Results on OOD performance: the +6% gain on the 1.2M-vote dataset and 67.4% ImageReward accuracy lack reported error bars, explicit dataset splits, and analysis of potential distributional overlap between the OOD test sets and the listener VLM's pretraining corpus, which could explain gains via listener-specific alignment rather than robust preference learning.

Authors: We accept that the current results presentation would benefit from greater statistical and distributional detail. In the revised version we will: (a) report error bars for both the ImageReward accuracy and the OOD gains, obtained via multiple random seeds or bootstrap resampling; (b) provide explicit descriptions of the train/validation/test splits used for the 1.2M-vote human preference dataset; and (c) include a short analysis of possible overlap between the OOD test images and the listener VLM's pretraining distribution, using domain labels and semantic similarity checks. These additions will be made in the results section and supplementary material.

revision: yes
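
For the bootstrap option mentioned in (a), a percentile bootstrap over per-example correctness is one common way to attach an interval to an accuracy figure. The sketch below assumes the evaluation yields one 0/1 outcome per test pair; the function name, the number of resamples, and the 95% level are illustrative choices, not details from the paper or the rebuttal.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    correct: list of 0/1 (or bool) outcomes, one per evaluated test pair
    Returns (point_estimate, lower, upper).
    """
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # seeded for reproducibility
    # Resample the test set with replacement and recompute accuracy each time.
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Example: 70 correct out of 100 hypothetical test pairs.
point, lower, upper = bootstrap_accuracy_ci([1] * 70 + [0] * 30)
```

This captures sampling noise in the test set only; variation across training seeds, the other option named in (a), would need separate training runs.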

Circularity Check

1 step flagged

Contradiction reduction is by construction via listener reward design

specific steps

self-definitional [Abstract]
"we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model (listener) evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. [...] and reduces reasoning contradictions"

The failure mode is defined as contradiction with the listener. The reward is then built to increase the listener's confidence on the CoT (i.e., reduce contradictions with the listener). The reported reduction in contradictions is therefore achieved by construction through the listener-shaped reward rather than emerging as a separate validation.

full rationale

The paper defines a key failure mode explicitly in terms of the reasoner's output contradicting an independent frozen listener VLM. It then constructs the RL reward to optimize for higher listener confidence scores on the CoT, explicitly to make explanations 'persuasive to an independent model.' Reporting reduced contradictions is therefore a direct consequence of the reward objective rather than an independent empirical finding. However, the primary claims—67.4% on the ImageReward benchmark and up to +6% OOD gains on the separate 1.2M-vote human preference dataset—remain external to the listener and supply independent content, so the overall derivation does not collapse entirely to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; full training details, hyperparameters, and any fitted parameters in the GRPO objective or listener calibration are unavailable. The core addition is the listener-derived reward term.

axioms (1)

domain assumption: The independent frozen listener VLM supplies a reliable, calibrated confidence score on reasoning traces that improves the reasoner when used in the RL objective.
This premise is required for the reward shaping to produce the reported gains.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
listener re-evaluates the reasoner’s chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
GRPO objective with group-normalised advantage $A_i = (r_i - \mu)/\sigma$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 15 internal anchors

[1] Deepseek-r1: Scaling reasoning with reinforced learning. Technical report, DeepSeek-AI, 2024.
[2] Hunyuan-video: Scaling text-to-video generation with large-scale rlhf. Technical report, Tencent AI Lab, 2024.
[3] Ltx-video: Large transformer for controllable video generation. Technical report, LTX Lab, 2024.
[4] Sora: Multi-modal video generation at scale. Technical report, OpenAI, 2024.
[5] Want2v: High-fidelity text-to-video synthesis via direct preference optimization. Technical report, WanAI, 2024.
[6] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, and Ethan Perez. Reasoning models don’t always say what they think. arXiv preprint arXiv:2505.05410, 2025.
[8] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161, January 2025. Accessed: March 24, 2025.
[9] Google DeepMind. Gemini 2.5 pro: Advanced multimodal reasoning model. https://deepmind.google/technologies/gemini/pro/, 2025. Product page and capability demo. Accessed 2025-05-16.
[10] DeepSeek. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. Technical report, DeepSeek, 2023.
[11] Guoqing Ma et al. Step-Video-T2V technical report: The practice, challenges, and future of video foundation model. Technical report, StepFun, 2025. arXiv:2502.10248.
[12] Chun et al. The poison of alignment. arXiv preprint arXiv:2308.13449, August 2023. Accessed: March 24, 2025.
[13] Jordan Hoffmann et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, March 2022.
[14] Alexander Gambashidze, Pavel Kulikov, Maxim Sosnin, and Ivan Makarov. Aligning diffusion models with noise-conditioned perception. arXiv preprint arXiv:2406.17636, 2025.
[15] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760, October 2022. Accessed: March 24, 2025.
[16] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.
[17] Yuval Kirstain, Adam Polyak, Uriel Singer, et al. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In NeurIPS, 2023.
[18] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.
[19] Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. 2025.
[20] OpenAI. OpenAI o1: Learning to reason with reinforcement learning. https://openai.com/index/learning-to-reason-with-llms, 2024. System card released Dec 5 2024. Accessed 2025-05-16.
[21] OpenAI. OpenAI o3: A multimodal model for math, science, coding, and visual reasoning. https://platform.openai.com/docs/models/o3, 2025. Model announcement Apr 2025. Accessed 2025-05-16.
[22] Rafael Rafailov, Archit Sharma, Eric Mitchell, et al. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
[23] Rapidata. Rapidata human style preferences for images. https://huggingface.co/datasets/Rapidata/human-style-preferences-images, 2025.
[24] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, July 2017.
[25] Zhihong Shao, Peiyi Wang, Qihao Zhu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[26] Qwen Team. Qwen2.5: A party of foundation models, September 2024.
[27] Bram Wallace, Meihua Dang, Rafael Rafailov, et al. Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908, 2023.
[28] Xiaoshi Wu et al. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, June 2023.
[29] xAI. Grok-3: The age of reasoning agents. https://x.ai/blog/grok-3, 2025. System card and model overview. Accessed 2025-05-16.
[30] Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision-language models reason step-by-step. arXiv preprint arXiv:2411.10440, November 2024. Accessed: March 24, 2025.
[31] Jiazheng Xu, Yu Huang, Jiale Cheng, et al. VisionReward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024.
[32] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, April 2023.
[33] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837, 2025.
[34] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, June 2023.
[35] Huaisheng Zhu, Teng Xiao, and Vasant G. Honavar. Dspo: Direct score preference optimization for diffusion model alignment. In International Conference on Learning Representations (ICLR),
[36] OpenReview xyfb9HHvMe.