ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

Boer Zhang; Jingpei Wu; Volker Tresp; Weixiang Shen; Xiao Han; Zifeng Ding

arxiv: 2606.11209 · v1 · pith:VD3RWGRInew · submitted 2026-04-23 · 💻 cs.CL · cs.AI · cs.LG

Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp

Pith reviewed 2026-07-04 20:03 UTC · model glm-5.2

classification 💻 cs.CL · cs.AI · cs.LG

keywords reasoning · reward · process · processthinker · rewards · step · across · format

The pith

Score each reasoning step by whether it leads to a right answer

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ProcessThinker introduces a method for training multimodal language models to reason more effectively over video inputs without requiring a separately trained process reward model. The core mechanism is rollout-based process reward: for each intermediate step in a chain-of-thought reasoning trace, the system samples multiple continuations from that point onward and checks how often each continuation reaches the correct final answer. The empirical success rate becomes the step's reward. This converts a single sparse outcome signal (right or wrong final answer) into a dense, step-level credit assignment that can distinguish a reasoning trace that goes wrong only at the end from one that was unproductive from the start. The pipeline first fine-tunes the model to produce explicitly step-tagged reasoning traces, then applies Group Relative Policy Optimization (GRPO) with the rollout-based process reward. On four video reasoning benchmarks, the process-only reward configuration consistently outperforms both outcome-only rewards and a mixture of the two, raising the average accuracy of an 8-billion-parameter multimodal model from 56.30 to 59.72, with the largest gain on VideoMathQA (+6.47).

Core claim

The paper's central claim is that rollout-based process rewards — scoring each reasoning step by the empirical success rate of multiple continuations sampled from that step's prefix — provide denser and more effective credit assignment than outcome-only rewards, and that this holds without training a separate process reward model. The key empirical finding is that process-only rewards outperform both outcome-only and mixed reward configurations across all four video reasoning benchmarks tested, with the process-only variant achieving the best average score (59.72) compared to outcome-only (57.55) and mixed (57.91).

What carries the argument

Rollout-based process reward (continuation solvability): For each intermediate step s_i in a reasoning trace, sample M=4 continuations from the policy model conditioned on the prefix (s_1, ..., s_i). The step score c_i is the fraction of those continuations that produce the correct final answer. The trajectory-level process reward averages these step scores, capped at K_max=6 steps. This reward is combined with a format reward, a bounded step-count bonus, and a penalty gate (which zeros out rewards for responses whose process score falls below threshold τ=0.5 when the final answer is also wrong) within a GRPO training loop.
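The continuation-solvability score described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `sample_continuation` and `is_correct` are hypothetical stand-ins for the policy model's sampler and the final-answer verifier, scoring the first K_max steps is one reading of the cap, and `gated_reward` is an assumed form of the penalty gate.

```python
from typing import Callable, List, Sequence


def step_scores(
    steps: Sequence[str],
    sample_continuation: Callable[[Sequence[str]], str],  # hypothetical policy sampler
    is_correct: Callable[[str], bool],                    # hypothetical answer verifier
    m: int = 4,
    k_max: int = 6,
) -> List[float]:
    """Return c_i, the fraction of m continuations from prefix (s_1..s_i) that succeed."""
    scores = []
    # Assumption: only the first k_max steps are scored.
    for i in range(1, min(len(steps), k_max) + 1):
        prefix = steps[:i]
        hits = sum(is_correct(sample_continuation(prefix)) for _ in range(m))
        scores.append(hits / m)
    return scores


def process_reward(scores: Sequence[float]) -> float:
    """Trajectory-level process reward: the mean of the step scores."""
    return sum(scores) / len(scores) if scores else 0.0


def gated_reward(r_proc: float, answer_correct: bool, tau: float = 0.5) -> float:
    """Assumed gate: zero the reward when the answer is wrong and r_proc < tau."""
    if not answer_correct and r_proc < tau:
        return 0.0
    return r_proc
```

With a toy sampler that only succeeds once the prefix holds at least two steps, a three-step trace gets step scores of 0, 1 and 1, so the trajectory reward is 2/3. The first step is singled out as the unproductive one, which an outcome-only signal could not do.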

If this is right

Step-level credit assignment via continuation rollouts could be applied to any domain with verifiable final answers (math, code, logic puzzles), potentially improving reasoning quality beyond video QA.
The finding that process-only rewards outperform mixed rewards suggests that outcome rewards may introduce noise or conflicting gradients when combined with denser process signals, which has implications for reward design in RL-based post-training generally.
The method's reliance on the same final-answer verifier used in standard RLVR means it can be deployed wherever outcome-based RLVR is already in use, lowering the barrier to adoption compared to PRM-based approaches that require annotated training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The SFT warm-up degrades performance relative to the baseline (53.83 vs. 56.30 average), meaning the GRPO stage must recover this degradation and add further gains. Without a no-SFT, no-process-reward GRPO baseline, the paper cannot fully isolate whether the process reward mechanism or the combined SFT+GRPO pipeline drives the improvement. A clean ablation would strengthen the causal claim.
The rollout cost scales with the number of steps times M continuations per step times G GRPO group size, which could make this approach prohibitively expensive for longer reasoning traces or larger models. The paper acknowledges this limitation but does not quantify the compute overhead relative to outcome-only GRPO.
The sensitivity to step segmentation — how the teacher model decomposes reasoning into steps — is acknowledged but not systematically studied. If step boundaries are poorly chosen, the continuation-solvability signal could be noisy or misleading, which would limit the method's reliability on tasks where good step decomposition is non-trivial.
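The rollout-cost concern above can be made concrete with back-of-the-envelope arithmetic. The sketch below counts extra continuations only (not tokens or wall-clock time) and assumes every sampled response reaches the full K_max=6 scored steps and that each of the 1,250 prompts is visited once. The number of training passes is not given here, so the total is illustrative rather than a figure from the paper.

```python
def extra_rollouts_per_prompt(group_size: int, steps_scored: int, m: int) -> int:
    """Upper bound on continuation rollouts added on top of the G sampled responses."""
    return group_size * steps_scored * m


# Values reported in the review: G=4 responses, K_max=6 steps, M=4 continuations.
per_prompt = extra_rollouts_per_prompt(group_size=4, steps_scored=6, m=4)
print(per_prompt)          # 96 extra continuations per prompt
print(per_prompt * 1250)   # 120000 for one pass over 1,250 prompts (assumed single pass)
```

Under these assumptions each prompt needs up to 96 continuations in addition to the 4 responses that outcome-only GRPO already samples. Later continuations start from longer prefixes and are therefore shorter, so the token overhead is smaller than the raw count suggests.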

Load-bearing premise

The claim that process rewards drive the improvement rests on all GRPO variants sharing the same SFT warm-up, but since that SFT stage itself degrades performance below the baseline, the GRPO stage must both recover and exceed the original model. Without a GRPO-only variant that uses no process reward and no SFT, one cannot fully separate the contribution of the process reward from the combined effect of format-structured SFT followed by GRPO training.

What would settle it

If a GRPO variant trained with the same format-structured SFT warm-up but using only outcome rewards (no process reward, no continuation rollouts) matched or exceeded the process-only configuration's scores, the central claim that rollout-based process rewards are the active ingredient would be undermined.

Figures

Figures reproduced from arXiv: 2606.11209 by Boer Zhang, Jingpei Wu, Volker Tresp, Weixiang Shen, Xiao Han, Zifeng Ding.

Original abstract

Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Rollout-based process rewards in GRPO for multimodal reasoning — sound idea, modest gains, missing error bars on the key comparison

The main thing to know: this paper applies Monte Carlo rollout-based step-level credit assignment within a GRPO framework for multimodal/video reasoning, without training a separate PRM. The idea is straightforward and the execution is reasonable, but the gains are modest and the key ablation comparison lacks error bars, which is a real concern given the small effect sizes. It's a workshop paper and the scope is appropriately limited.

The core contribution is the specific combination: VinePPO used MC rollouts for credit assignment in PPO for text-only LLMs, and Step-GRPO used rule-based step rewards in GRPO. ProcessThinker uses empirical continuation success rates as step rewards within GRPO for multimodal models. That combination is new. The method is cleanly described: Eq. 2-3 define the process reward as the empirical success rate of M=4 continuations from each step prefix, using the same final-answer verifier as RLVR. This is not circular: the step reward is computed from independent rollouts under the current policy, not from a fitted parameter. The reward design with format gating, bounded step bonus, and penalty for low-solvability steps is sensible engineering. The paper is honest about limitations, including rollout cost and noise from step segmentation.

Now, the soft spots. The reader's report flags a missing no-SFT GRPO baseline as a critical gap. I disagree with that concern. Table 1 already contains the relevant control: the outcome-only variant uses the same SFT warm-up, same GRPO machinery, same format reward, and differs only in setting the process reward weight to zero. It scores 57.55 average; process-only scores 59.72. That +2.17 delta is the marginal contribution of the process reward, already isolated from SFT and from GRPO+format structure. The reader's concern is substantially addressed by the existing ablation.

The more load-bearing problem is the absence of error bars or any significance testing. The key comparative differences are small: 2.17 points on average between outcome-only and process-only, and on individual benchmarks the gaps are sometimes barely over a point (LongVideoBench: 74.20 vs 75.40). With 1,250 RL prompts, G=4 samples, and M=4 noisy rollout estimates per step, run-to-run variance could easily shift scores by 1-2 points. Without a single replication, it's hard to know whether the ordering outcome-only < mixed < process-only reflects a real effect or noise. Several hyperparameters (α, β, Kmin, Lmin, Lmax) are also not stated in the paper.

This is a workshop paper for people working on RLVR for multimodal reasoning. The method is practical and the idea is sound, but the evidence doesn't fully substantiate the central comparative claim that process rewards outperform outcome-only rewards. It deserves a serious referee if submitted to a full venue: the missing error bars and incomplete hyperparameter reporting are fixable and should be addressed before the claims can be fully accepted.

Referee Report

2 major / 6 minor

Summary. The paper proposes ProcessThinker, a post-training pipeline for multimodal LLMs that provides step-level process rewards without training a separate PRM. The method rewrites reasoning traces into a step-tagged format for SFT warm-up, then applies GRPO with a rollout-based process reward: for each intermediate step, M=4 continuations are sampled from the step prefix, and the empirical success rate (under the same final-answer verifier used in RLVR) serves as the step reward. Experiments on four video reasoning benchmarks show that ProcessThinker (process-only) improves over the Qwen3-VL-8B-Instruct baseline by +3.42 average, and that process-only rewards outperform both outcome-only and mixed reward configurations.

Significance. The core idea—estimating step utility via continuation solvability within a GRPO framework, without training a separate PRM—is a clean and practical contribution. The method is well-motivated by the credit-assignment problem in sparse-reward RLVR, and the ablation structure in Table 1 (outcome-only vs. mixed vs. process-only, all sharing the same SFT warm-up and GRPO machinery) is the right experimental design to isolate the contribution of the process reward. The approach is modality-agnostic and could transfer beyond video reasoning. The paper is honest about limitations, including rollout cost and noise.

major comments (2)

§3, Table 1: The central comparative claim, that process-only rewards outperform outcome-only rewards, rests on a 2.17-point average difference (59.72 vs. 57.55). On individual benchmarks, some gaps are very small (e.g., VideoMathQA: outcome-only and mixed both 27.86; LongVideoBench: 74.20 vs. 74.60 vs. 75.40). No error bars, confidence intervals, significance tests, or even a single replication is reported anywhere in the paper. With only 1,250 RL training prompts, G=4 samples per prompt, and M=4 continuation rollouts per step, run-to-run variance from random seed and sampling stochasticity could plausibly shift scores by 1–2 points. Without any variance estimate, it is impossible to determine whether the ordering outcome-only < mixed < process-only reflects a real effect or noise. This directly undermines the load-bearing comparative claim. At minimum, the authors should report results from
§2.2, Eq. (2): With M=4, the step score c_i can only take values {0, 0.25, 0.5, 0.75, 1.0}, which is a very coarse estimator. The paper acknowledges rollout noise in the conclusion but does not analyze its impact on training stability or final performance. A brief sensitivity analysis (e.g., M=2 vs. M=4 vs. M=8) would strengthen the claim that M=4 is sufficient, especially since the axiom ledger identifies 'empirical success rate of M=4 continuations is a sufficient estimator of step utility' as an ad-hoc assumption.
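The coarseness point in the second major comment can be illustrated with the standard binomial standard error. The sketch assumes the M continuations are independent Bernoulli draws with a fixed true success probability p, which is a simplification of sampling from a changing policy. The function names are illustrative.

```python
import math


def score_levels(m: int) -> list:
    """All values the step score c_i can take when it is a fraction of m rollouts."""
    return [k / m for k in range(m + 1)]


def std_error(p: float, m: int) -> float:
    """Standard error of a success-rate estimate from m independent rollouts."""
    return math.sqrt(p * (1 - p) / m)


for m in (2, 4, 8):
    # Worst case is p = 0.5, where the estimate is noisiest.
    print(m, score_levels(m), round(std_error(0.5, m), 3))
# 2 [0.0, 0.5, 1.0] 0.354
# 4 [0.0, 0.25, 0.5, 0.75, 1.0] 0.25
# 8 [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0] 0.177
```

Under that assumption, doubling M from 4 to 8 shrinks the worst-case standard error only from 0.25 to about 0.18 while doubling the rollout count, which is the trade-off a sensitivity analysis over M would need to weigh.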

minor comments (6)

§2.2, Eq. (6): The reward weights λ_acc and λ_proc are stated to satisfy λ_acc + λ_proc = 1, but the specific values used for each variant (outcome-only, mixed, process-only) are not reported. These should be specified, as they are load-bearing for the ablation.
§2.2, Eq. (4): The step bonus B(K) uses a parameter α_r that is not defined or given a value. Please clarify.
§2.2, Eq. (5): The penalty gate uses τ=0.5, but the rationale for this threshold is not discussed. A brief justification or sensitivity note would help.
§3: The paper mentions VinePPO (Kazemnejad et al., 2025) as a closely related method using Monte Carlo rollouts for step-level credit assignment in PPO, but no experimental comparison is provided. Even a brief discussion of why a direct comparison is not feasible (different base model, modality, or framework) would contextualize the contribution.
§3: Training compute and wall-clock time for the rollout-based process reward (which requires M×K_max additional forward passes per response) are not reported. Given that the paper identifies efficiency as the main limitation, quantitative cost figures would strengthen this discussion.
Table 1: VIDEO-R1-7B uses a different backbone (Qwen2.5-VL) and is acknowledged as not directly comparable. Consider moving it to a footnote or separate reference table to avoid confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. Both major comments are well-taken: (1) the absence of variance estimates weakens the comparative claim between reward configurations, and (2) the coarseness of M=4 as a step-score estimator deserves direct analysis. We will address both in the revision.

Referee: §3, Table 1: The central comparative claim—that process-only rewards outperform outcome-only rewards—rests on a 2.17-point average difference (59.72 vs. 57.55). No error bars, confidence intervals, significance tests, or replications are reported. With only 1,250 RL training prompts, run-to-run variance could plausibly shift scores by 1–2 points. Without variance estimates, it is impossible to determine whether the ordering outcome-only < mixed < process-only reflects a real effect or noise.

Authors: The referee is correct that the absence of any variance estimate is a genuine gap, and we will address it in the revision. We will report results from at least two additional independent training runs per reward configuration (different random seeds), and will include standard deviations across runs in Table 1. We will also add bootstrap confidence intervals on the evaluation accuracy for each benchmark. We agree that the 2.17-point average gap between process-only and outcome-only is modest, and that some per-benchmark gaps (e.g., VideoMathQA: outcome-only and mixed both 27.86) are too small to support strong claims on their own. We will accordingly soften the language from 'process-only outperforms' to 'process-only achieves the best average across our runs, though the advantage over outcome-only is modest and we cannot rule out noise on individual benchmarks.' If the replicated runs do not reproduce the ordering, we will report that honestly. We note that the monotonic ordering outcome-only < mixed < process-only holds on the average across all four benchmarks, and the largest single-benchmark gap (VideoMathQA: 27.86 vs. 31.67, +3.81 for process-only) is the one most relevant to multi-step reasoning, which is the setting the method targets. But we agree this does not substitute for proper variance estimation. revision: yes
Referee: §2.2, Eq. (2): With M=4, the step score c_i can only take values {0, 0.25, 0.5, 0.75, 1.0}, which is a very coarse estimator. The paper acknowledges rollout noise in the conclusion but does not analyze its impact on training stability or final performance. A sensitivity analysis (e.g., M=2 vs. M=4 vs. M=8) would strengthen the claim that M=4 is sufficient, especially since the axiom ledger identifies 'empirical success rate of M=4 continuations is a sufficient estimator of step utility' as an ad-hoc assumption.

Authors: This is a fair point. The coarseness of the M=4 estimator is a real limitation that we have acknowledged only in passing. We will add a sensitivity analysis varying M ∈ {2, 4, 8} and report both final performance and training stability (reward curve smoothness, gradient variance proxies). We expect M=2 to be noticeably noisier and M=8 to offer diminishing returns at roughly double the rollout cost, but we will report whatever the data shows. We will also add a brief discussion of why M=4 was chosen as the default: it offers five discrete levels (including the extremes 0 and 1), which is the minimum needed to distinguish 'never solvable,' 'sometimes solvable,' and 'always solvable' prefixes while keeping rollout cost within our compute budget. We agree that the sufficiency of M=4 is currently an ad-hoc assumption and will label it as such in the revised text, presenting the sensitivity analysis as the empirical justification (or partial justification, depending on results). revision: yes
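The bootstrap confidence intervals mentioned in the first response could be computed along the following lines. This is a generic percentile-bootstrap sketch, not a procedure taken from the paper. The input is a hypothetical list of per-question 0/1 correctness flags for one model on one benchmark, and the resample count and seed are arbitrary choices.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy from per-item 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample the evaluation set with replacement and record each accuracy.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical benchmark with 100 questions, 60 answered correctly.
low, high = bootstrap_ci([1] * 60 + [0] * 40)
print(low, high)
```

An interval of this kind captures only the noise from the finite evaluation set. It says nothing about variation across training seeds, which is why the response also commits to additional independent runs.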

Circularity Check

0 steps flagged

No circularity: rollout-based process reward is defined by independent MC sampling and external ground-truth verification, not by self-citation or fitted parameters

The paper's central construction — the rollout-based process reward (Eqs. 2–3) — is defined as the empirical success rate of M continuations sampled from the current policy conditioned on each step prefix, verified against an external ground-truth answer. This is a model-free, parameter-free construction: no parameter is fitted to a target and then renamed as a prediction, and no self-citation chain is load-bearing for the reward definition. The ablation in Table 1 compares outcome-only (λ_acc=1, λ_proc=0) vs. process-only (λ_acc=0, λ_proc=1) under the same SFT warm-up and GRPO machinery, so the comparison is between genuinely different reward signals, not between a fit and its own target. The process reward uses the same final-answer verifier as RLVR, but this is shared infrastructure (external ground-truth checking), not a circular definition. VinePPO (Kazemnejad et al., 2025) is cited as related work with a similar Monte-Carlo rollout intuition for PPO, but the paper's GRPO-based construction is presented as its own contribution, not as a consequence of an unverified self-cited theorem. The absence of error bars and the SFT degradation concern are correctness/statistical risks, not circularity. No step in the derivation chain reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

9 free parameters · 4 axioms · 0 invented entities

The paper introduces no new entities (particles, forces, dimensions, etc.). It introduces a training pipeline and reward construction, which are methods rather than postulated objects. The free parameters are hyperparameters of the reward function and training setup, several of which are not explicitly stated in the paper.

free parameters (9)

M (continuation rollouts per step) = 4
Number of continuations sampled per step prefix to estimate success rate; chosen by the authors, not derived.
Kmax (max steps scored) = 6
Caps the number of steps that receive process rewards; chosen by the authors.
τ (process reward threshold) = 0.5
Threshold below which process reward is replaced by penalty; chosen by the authors.
α (step bonus scale) = not stated
Scales the bounded step bonus B(K) in Eq. 4; value not specified in the paper.
β (format reward bonus) = not stated
Added to format reward in Eq. 6; value not specified.
λ_acc, λ_proc (reward weights) = varies by variant
Control the mixture of outcome and process rewards; λ_acc+λ_proc=1. Process-only uses λ_proc=1.
Kmin, Lmin, Lmax (format constraints) = not stated
Minimum/maximum step count and length bounds for format validity; values not specified.
19k (SFT dataset size) = 19000
Number of filtered samples kept for SFT from the 165k pool; chosen by the authors.
1,250 (RL prompt count) = 1250
Number of prompts sampled for GRPO training; chosen by the authors.

axioms (4)

domain assumption Step-level credit assignment improves learning over outcome-only rewards for long reasoning traces
Core assumption motivating the entire approach; stated in §1 and §2.2. Not independently proven — the experimental results are consistent with it but confounded by the SFT+GRPO pipeline.
ad hoc to paper Empirical success rate of M=4 continuations is a sufficient estimator of step utility
M=4 is a small sample size for estimating a probability; the paper does not justify this choice or report sensitivity to M. Used in Eq. 2.
domain assumption Step segmentation by the teacher model produces meaningful, non-redundant steps
The SFT data construction (§2.1) relies on the teacher model segmenting reasoning into meaningful steps. The paper acknowledges this is noisy and applies filtering, but does not validate step quality independently.
standard math GRPO with KL regularization to a reference policy is a stable training method for this setting
Standard GRPO recipe from Shao et al. (2024); used in §2.2.
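For context on the GRPO recipe named in the last axiom, the sketch below shows the group-relative advantage computation as it is commonly described: each response's reward is centered and scaled by the statistics of its own group of G samples. It omits the clipped policy-ratio objective and the KL term. The use of the population standard deviation and the `eps` constant are assumptions, and the reward values are made up for illustration.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std, not sample std
    return [(r - mean) / (std + eps) for r in rewards]


# A group of G=4 responses that all end in a wrong answer.
print(group_relative_advantages([0, 0, 0, 0]))           # outcome-only: all zeros
print(group_relative_advantages([0.0, 0.25, 0.5, 0.25]))  # hypothetical process rewards
```

When every response in a group is wrong, outcome-only rewards are identical and the advantages vanish, so the group contributes no learning signal. Graded process rewards can still rank the responses, which is the denser credit assignment the first axiom relies on.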

pith-pipeline@v1.1.0-glm · 10875 in / 3096 out tokens · 142585 ms · 2026-07-04T20:03:56.111290+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

[1] VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment. International Conference on Machine Learning. arXiv:2410.01679.
[2] Qwen3-VL Technical Report. 2025.
[3] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.
[4] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.
[5] Video-R1: Reinforcing Video Reasoning in Multimodal Large Language Models. 2025.
[6] R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization. 2025.
[7] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. 2025.
[8] DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO. 2025.
[9] Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding. 2025.
[10] OneThinker: All-in-One Reasoning Model for Image and Video. 2025.
[11] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
[12] TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning. 2025.
[13] Let's Verify Step by Step. 2023.
[14] Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. 2024.
[15] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search. 2024.
[16] ReST-RL: Process Reward Guided Reinforcement Learning for Large Language Model Reasoning. 2025.
[17] Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
[18] MM-Eureka: Reinforcement Learning for Multimodal Reasoning with Verifiable Rewards. 2025.
[19] PQM: Process Quality Models for Step-level Verification in Reasoning. 2024.
[20] Improving Multimodal Chain-of-Thought Reasoning in Vision-Language Models. 2024.
[21] Process Reward Models That Think. 2025.
[22] The Lessons of Developing Process Reward Models in Mathematical Reasoning. 2025.
[23] VisualPRM: An Effective Process Reward Model for Multimodal Reasoning. 2025.
[24] GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. 2025.
[25] MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision. 2025.
[26] AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification. 2025.
[27] SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning. 2025.
[28] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. 2024.
[29] V-STaR: Training Verifiers for Self-Taught Reasoners. 2024.
[30] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling. 2025.
[31] e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs. International Conference on Machine Learning (ICML).
[32] Unlocking multimodal mathematical reasoning via process reward model. 2025.
[33] OpenThoughts: Data Recipes for Reasoning Models. 2025.
[34] OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models. 2024.
[35] Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. 2025.
[36] MMVU: Measuring Expert-Level Multi-Discipline Video Understanding. 2025.
[37] VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos. 2025.
[38] LongVideoBench: A Benchmark for Long-context Interleaved Video-language Understanding. 2024.
[39] R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO. 2025.
[40] Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense. 2025.
[41] Lessons from Training Grounded LLMs with Verifiable Rewards. 2025.
[42] Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains. 2025.
[43] Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning. 2025.
[44] XRPO: Pushing the Limits of GRPO with Targeted Exploration and Exploitation. 2025.
[45] GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models. 2025.
[46] AMIR-GRPO: Inducing Implicit Preference Signals into GRPO. 2026.
[47] From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning. 2026.
[48] Proximal Policy Optimization Algorithms. 2017.
