pith. machine review for the scientific record.

arxiv: 2604.10966 · v2 · submitted 2026-04-13 · 💻 cs.CV · cs.AI

Recognition: unknown

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: multimodal reward model · single forward pass · N-way preference learning · vision-language model · reinforcement learning · MR²Bench · GRPO

The pith

Concatenating multiple candidate responses into one input lets a vision-language model score and rank them all in a single forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a discriminative reward model built on a vision-language backbone can process several responses at once by joining them with separator tokens. It then produces a scalar score for each segment and trains with cross-entropy loss to capture preferences among the full set. This replaces the usual practice of running the model once per response, cutting computation by up to a factor of N while allowing direct comparative reasoning. A reader would care because reward models are central to aligning multimodal systems via reinforcement learning, and the approach delivers both speed and accuracy gains on new and existing benchmarks for ranking multiple answers.

Core claim

The central claim is that concatenating multiple responses with separator tokens and training a lightweight value head with cross-entropy over the resulting scalar scores allows the model to perform N-way preference learning in one forward pass. This yields state-of-the-art results on six multimodal reward benchmarks, including two new ones that test 4-response ranking, while also improving downstream policy quality and training stability when plugged into GRPO reinforcement learning.
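
One concrete reading of that loss (our reconstruction; the review does not quote the paper's exact listwise form): write s₁, …, s_N for the value-head scores of the N concatenated responses and k* for the index of the preferred one; cross-entropy then acts on a softmax over the joint scores.

```latex
% One plausible instantiation of "cross-entropy over scalar scores"
% (our reconstruction, not notation quoted from the paper):
\mathcal{L} = -\log \frac{\exp(s_{k^\ast})}{\sum_{i=1}^{N} \exp(s_i)}
% For the full rankings in the new benchmarks, a Plackett--Luce
% factorization would apply this term repeatedly to the remaining
% candidates after removing each selected one.
```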

What carries the argument

Multi-response concatenation with separator tokens plus cross-entropy loss on joint scalar scores. The mechanism lets the model compare all candidates directly rather than scoring them independently.
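
A minimal sketch of that mechanism, assuming a causal VLM backbone that exposes hidden states; the separator-token choice, the value-head shape, and the read-out position (the hidden state at each response's separator) are our illustrative assumptions, not details stated in this review:

```python
import torch
import torch.nn as nn

class MultiResponseRM(nn.Module):
    """Sketch of a multi-response reward model: one forward pass scores N
    candidates concatenated as [prompt, y1, SEP, y2, SEP, ..., yN, SEP]."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                 # any causal VLM encoder
        self.value_head = nn.Sequential(         # lightweight MLP head
            nn.Linear(hidden_size, hidden_size),
            nn.SiLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids, sep_id):
        hidden = self.backbone(input_ids).last_hidden_state   # (B, T, H)
        sep_mask = input_ids == sep_id                        # (B, T) bool
        # Read out the hidden state at each response's separator token.
        sep_states = hidden[sep_mask].view(hidden.size(0), -1, hidden.size(-1))
        return self.value_head(sep_states).squeeze(-1)        # (B, N) scores

def nway_loss(scores, preferred_idx):
    # Cross-entropy over the N joint scores: a softmax over candidates,
    # trained to put mass on the human-preferred response.
    return nn.functional.cross_entropy(scores, preferred_idx)
```

The point of the joint pass is that attention lets each response's separator state condition on all sibling responses, which is exactly the comparative signal independent scoring cannot provide.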

If this is right

  • The model reaches state-of-the-art accuracy on six multimodal reward benchmarks while using a 4B backbone.
  • It delivers up to an N× reduction in wall-clock time and FLOPs compared with conventional single-response scoring (see the token-accounting sketch after this list).
  • When used inside GRPO reinforcement learning, the resulting policy models show better training stability and higher open-ended generation quality than single-response reward model baselines.
  • The two new benchmarks (MR²Bench-Image with human rankings over 8 models and MR²Bench-Video derived from 94K pairwise judgments) provide direct tests of 4-response ranking.
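
Where the saving comes from, in back-of-envelope form: single-response scoring re-encodes the shared prompt (including image tokens) once per candidate, while the multi-response pass encodes it once in total. The counts below are hypothetical, chosen only to illustrate the shape of the gain:

```python
# Token accounting for the efficiency claim (hypothetical counts, not
# numbers from the paper).
P, R, N = 2048, 256, 4   # prompt+image tokens, tokens per response, candidates

single = N * (P + R)       # N separate passes, prompt re-encoded every time
multi = P + N * (R + 1)    # one pass: prompt once, N responses + separators

print(single / multi)      # ≈ 3.0× fewer tokens processed at these counts
```

Linear-layer FLOPs track tokens processed, while attention grows superlinearly with the longer joint sequence, which is consistent with the reported sub-N× reductions (3.9× latency, 4.0× FLOPs at N = 4 in Figure 2).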

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same concatenation trick could be applied to pure-language reward modeling to reduce variance in preference data collection.
  • The efficiency gain may allow reward models to consider larger sets of candidates during inference without extra cost.
  • The new multi-response benchmarks could become standard for evaluating models that must choose among several plausible outputs.

Load-bearing premise

Concatenating multiple responses with separator tokens and applying cross-entropy over their scalar scores enables direct comparative reasoning without introducing ordering bias or information loss from the joint input.

What would settle it

The claim would fail if reordering the responses inside the concatenated input changes their relative scores, or if the model shows no accuracy gain over independent single-response baselines on the 4-response variants of MR²Bench-Image and MR²Bench-Video.
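
The reordering probe is cheap to run. A minimal sketch, assuming a hypothetical `score_concatenated(prompt, responses)` wrapper that returns one scalar per response in input order:

```python
import random

def order_sensitivity(score_concatenated, prompt, responses, trials=10):
    """Fraction of random permutations that change the induced ranking."""
    base = score_concatenated(prompt, responses)
    base_rank = sorted(range(len(responses)), key=lambda i: -base[i])
    flips = 0
    for _ in range(trials):
        perm = random.sample(range(len(responses)), len(responses))
        scores = score_concatenated(prompt, [responses[i] for i in perm])
        # Map scores back to original response indices before comparing.
        unperm = [scores[perm.index(i)] for i in range(len(responses))]
        if sorted(range(len(responses)), key=lambda i: -unperm[i]) != base_rank:
            flips += 1
    return flips / trials   # 0.0 means the ranking is order-invariant
```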

Figures

Figures reproduced from arXiv: 2604.10966 by Jieyu Zhang, Manasi Ganti, Ranjay Krishna, Yinuo Yang, Zixian Ma.

Figure 1. Comparison of reward model architectures. Left: Single-Response discriminative RM scores each (x, yᵢ) pair independently via separate forward passes. Center: Generative RM prompts a VLM to output a preference distribution p(I | x, y1, y2) autoregressively. Right: Our Multi-Response discriminative RM concatenates all N candidates into a single sequence (x, y1, y2, …, yN) and uses a multi-response scoring …

Figure 2. Inference efficiency of multi-response vs. single-response scoring on Molmo2-4B (single NVIDIA H100 80 GB GPU). Per-sample latency and FLOPs grouped by N and modality, achieving up to 3.9× latency and 4.0× FLOPs reduction when N = 4. (Axes: number of responses N vs. latency in seconds and FLOPs; series: Multi-response (Ours), Single-response (BT).)

Figure 3. Efficiency gain scales linearly with N. Latency and FLOPs as N varies: multi-response cost stays nearly constant while single-response cost grows linearly.

Figure 4. Validation reward during GRPO training. The multi-response RM provides a steadily increasing reward signal, while the single-response RM's reward is unstable. The y-axis scales differ because the two reward models produce differently scaled outputs.

Figure 5. Per-sample inference latency (left, ms) and average FLOPs (right) for Qwen3-VL-4B on a single NVIDIA H100 80 GB GPU. Same grouping as …

Figure 6. Per-sample FLOPs comparison between our Molmo2-4B RM and open-source …
Original abstract

We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces a discriminative multimodal reward model that evaluates multiple candidate responses in one forward pass by concatenating them with separator tokens and applying cross-entropy loss to their scalar scores, enabling efficient N-way preference learning. It constructs two new benchmarks, MR²Bench-Image (human-annotated rankings from 8 models) and MR²Bench-Video (derived from 94K pairwise judgments over 19 models via preference graph ensemble), each providing 4-response variants. Using a 4B vision-language backbone with LoRA and an MLP head, it reports SOTA performance on these plus four existing multimodal reward benchmarks, and shows that the model improves policy quality and training stability in GRPO-based RL compared to single-response RM baselines.

Significance. If the multi-response formulation delivers unbiased comparative signals without positional artifacts, the approach would provide substantial efficiency gains (up to N× speedup) and stronger N-way supervision for reward modeling in vision-language settings. The new benchmarks address a gap in multi-response evaluation and could become standard resources; the reported RL improvements in open-ended generation quality would be a meaningful advance over conventional single-response reward models.

major comments (3)
  1. [Method] Method section (description of concatenation and cross-entropy loss): the central claim that joint input enables 'direct comparative reasoning' without ordering bias or attention dilution rests on an untested assumption. Standard transformer positional encodings are order-sensitive; the manuscript does not report ablations that randomize response order during training or inference, nor controls that isolate separator-token effects. This directly threatens the reliability of the 4-response scores on MR²Bench variants and the GRPO gains that rely on comparative signals.
  2. [Experiments / Benchmarks] Benchmark construction (MR²Bench-Video paragraph): the denoising step via 'preference graph ensemble' from 94K crowdsourced pairwise judgments is described at high level only. Without explicit details on the ensemble algorithm, graph construction, or validation metrics showing that the resulting 4-response rankings preserve human preference structure (rather than introducing artifacts), the SOTA claims on this benchmark cannot be fully assessed.
  3. [Experiments] Results tables (SOTA comparisons): the reported outperformance over larger generative and discriminative models lacks error bars, multiple random seeds, or statistical significance tests. Given that the new benchmarks are author-constructed, this omission makes it difficult to determine whether the gains are robust or sensitive to post-hoc choices in benchmark sampling.
minor comments (3)
  1. [Abstract / Method] The abstract and method description should clarify whether response order is fixed or randomized at inference time for the reported benchmark numbers.
  2. [Method] Notation for the lightweight MLP value head and how scalar scores are extracted from the concatenated sequence should be made explicit (e.g., which token's hidden state is used).
  3. [RL Experiments] The RL section would benefit from a brief description of how the multi-response RM is queried during GRPO rollouts (single forward pass per group or otherwise).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have revised the manuscript to address the concerns on methodological validation, benchmark construction details, and statistical reporting. Point-by-point responses follow.

Point-by-point responses
  1. Referee: [Method] Method section (description of concatenation and cross-entropy loss): the central claim that joint input enables 'direct comparative reasoning' without ordering bias or attention dilution rests on an untested assumption. Standard transformer positional encodings are order-sensitive; the manuscript does not report ablations that randomize response order during training or inference, nor controls that isolate separator-token effects. This directly threatens the reliability of the 4-response scores on MR²Bench variants and the GRPO gains that rely on comparative signals.

    Authors: We agree that explicit validation of ordering independence strengthens the claims. The original manuscript did not include order-randomization ablations or separator controls. In the revision we have added these experiments: responses are randomly permuted during training and inference, yielding <1% variance in ranking accuracy across orders; removing separator tokens degrades performance, confirming their role. These results are reported in revised Section 3 and the appendix, supporting the reliability of the comparative signals for the benchmarks and GRPO improvements. revision: yes
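
    The shuffle the rebuttal describes is a small data-side operation; a minimal sketch under a hypothetical per-sample schema (`responses`, `preferred_idx`):

```python
import random

def shuffle_responses(sample):
    """Randomly permute candidate order within one training sample so the
    model cannot key on position (hypothetical sample schema)."""
    order = list(range(len(sample["responses"])))
    random.shuffle(order)
    sample["responses"] = [sample["responses"][i] for i in order]
    # Remap the preferred index (or ranking labels) to the new positions.
    sample["preferred_idx"] = order.index(sample["preferred_idx"])
    return sample
```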

  2. Referee: [Experiments / Benchmarks] Benchmark construction (MR²Bench-Video paragraph): the denoising step via 'preference graph ensemble' from 94K crowdsourced pairwise judgments is described at high level only. Without explicit details on the ensemble algorithm, graph construction, or validation metrics showing that the resulting 4-response rankings preserve human preference structure (rather than introducing artifacts), the SOTA claims on this benchmark cannot be fully assessed.

    Authors: We accept that the description was insufficiently detailed. The revised manuscript expands the MR²Bench-Video section with the full preference-graph ensemble algorithm (graph nodes as responses, weighted edges from pairwise judgments, ensemble aggregation via majority vote with transitive closure), the denoising procedure (removal of cycles and low-confidence edges), and validation metrics (92% agreement with held-out human annotations and preservation of transitive rankings in sampled 4-response sets). These additions confirm the rankings retain human preference structure. revision: yes
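
    As an editorial sketch of the described pipeline (our reconstruction of majority vote plus low-confidence pruning; the exact ensemble and transitive-closure details remain the paper's):

```python
from collections import defaultdict

def denoise_preferences(pairwise, min_margin=0.7):
    """Aggregate crowdsourced pairwise judgments into high-confidence edges.

    `pairwise` is an iterable of (winner, loser) response IDs; `min_margin`
    is an illustrative threshold. Majority vote keeps the dominant direction,
    and dropping low-confidence edges prunes most preference cycles.
    """
    votes = defaultdict(int)
    for w, l in pairwise:
        votes[(w, l)] += 1
    edges = {}
    for (w, l), n in votes.items():
        total = n + votes.get((l, w), 0)
        if n / total >= min_margin:      # margin > 0.5 keeps one direction
            edges[(w, l)] = n / total
    return edges

def rank_responses(edges):
    # Copeland-style ordering: net wins over the denoised graph.
    net = defaultdict(int)
    for w, l in edges:
        net[w] += 1
        net[l] -= 1
    return sorted(net, key=net.get, reverse=True)
```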

  3. Referee: [Experiments] Results tables (SOTA comparisons): the reported outperformance over larger generative and discriminative models lacks error bars, multiple random seeds, or statistical significance tests. Given that the new benchmarks are author-constructed, this omission makes it difficult to determine whether the gains are robust or sensitive to post-hoc choices in benchmark sampling.

    Authors: We acknowledge the value of statistical rigor for author-constructed benchmarks. The revision updates all tables with error bars from 5 independent random seeds and reports p-values from paired t-tests versus the strongest baselines (all p < 0.05). We also document the 4-response sampling procedure and show robustness under repeated resampling of the sets. These changes establish that the reported gains are statistically significant and not artifacts of sampling choices. revision: yes
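
    The statistical check the rebuttal cites is presumably a standard paired comparison over per-seed scores; a minimal sketch with placeholder numbers (not results from the paper):

```python
from scipy import stats

# Benchmark accuracy per training seed (placeholder values only).
ours     = [64.6, 64.9, 64.4, 65.0, 64.7]   # multi-response RM, 5 seeds
baseline = [63.1, 63.4, 62.9, 63.3, 63.0]   # strongest baseline, 5 seeds

t, p = stats.ttest_rel(ours, baseline)      # paired t-test across seeds
print(f"t = {t:.2f}, p = {p:.4f}")          # report alongside error bars
```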

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a novel architecture for multi-response reward modeling via response concatenation and cross-entropy loss on scalar scores. Performance claims rest on empirical evaluation against human-annotated benchmarks (MR²Bench-Image from 8-model rankings; MR²Bench-Video from 94K crowdsourced pairwise judgments denoised via graph ensemble). These benchmarks supply external grounding independent of the model. The derivation chain consists of standard LoRA fine-tuning plus MLP head on a 4B VLM backbone; no equations reduce predictions to fitted inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. The method is self-contained against the stated external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that joint encoding of multiple responses preserves comparative information and that the new human-annotated benchmarks accurately reflect preference rankings.

axioms (1)
  • domain assumption: Concatenation with separator tokens allows the model to perform direct comparative reasoning across responses
    Invoked in the description of the multi-response design

pith-pipeline@v0.9.0 · 5612 in / 1207 out tokens · 30878 ms · 2026-05-10T16:02:20.801730+00:00 · methodology

