
arxiv: 2606.11209 · v1 · pith:VD3RWGRInew · submitted 2026-04-23 · cs.CL · cs.AI · cs.LG

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

Pith reviewed 2026-07-04 20:03 UTC · model glm-5.2

keywords: reasoning, reward, process, processthinker, rewards, step, across, format

The pith

Score each reasoning step by whether it leads to a right answer

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ProcessThinker introduces a method for training multimodal language models to reason more effectively over video inputs without requiring a separately trained process reward model. The core mechanism is rollout-based process reward: for each intermediate step in a chain-of-thought reasoning trace, the system samples multiple continuations from that point onward and checks what fraction of them reach the correct final answer. That empirical success rate becomes the step's reward. This converts a single sparse outcome signal (right or wrong final answer) into a dense, step-level credit assignment that can distinguish a reasoning trace that goes wrong only at the end from one that was unproductive from the start. The pipeline first fine-tunes the model to produce explicitly step-tagged reasoning traces, then applies Group Relative Policy Optimization (GRPO) with the rollout-based process reward. On four video reasoning benchmarks, the process-only reward configuration consistently outperforms both outcome-only rewards and a mixture of the two, raising the average accuracy of an 8-billion-parameter multimodal model from 56.30 to 59.72, with the largest gain on VideoMathQA (+6.47).
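To make the GRPO step concrete, here is a minimal sketch of the group-relative advantage that GRPO computes from a group of sampled responses to the same prompt. The function name, the use of the population standard deviation, and the `eps` stabilizer are illustrative assumptions; the paper's exact normalization is not given in this summary.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each response's reward against its own sampling group.

    Assumption: population standard deviation plus a small eps in the
    denominator; implementations differ on these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses better than the group average get a positive advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for a group of G=4 responses to one prompt.
    print(group_relative_advantages([0.25, 0.75, 0.0, 1.0]))
```

Because the advantage depends only on differences within the group, a denser reward (such as a fractional process score) can separate responses that an all-or-nothing outcome reward would score identically.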

Core claim

The paper's central claim is that rollout-based process rewards — scoring each reasoning step by the empirical success rate of multiple continuations sampled from that step's prefix — provide denser and more effective credit assignment than outcome-only rewards, and that this holds without training a separate process reward model. The key empirical finding is that process-only rewards outperform both outcome-only and mixed reward configurations across all four video reasoning benchmarks tested, with the process-only variant achieving the best average score (59.72) compared to outcome-only (57.55) and mixed (57.91).

What carries the argument

Rollout-based process reward (continuation solvability): For each intermediate step s_i in a reasoning trace, sample M=4 continuations from the policy model conditioned on the prefix (s_1, ..., s_i). The step score c_i is the fraction of those continuations that produce the correct final answer. The trajectory-level process reward averages these step scores, capped at K_max=6 steps. This reward is combined with a format reward, a bounded step-count bonus, and a penalty gate (which zeros out rewards for responses whose process score falls below threshold τ=0.5 when the final answer is also wrong) within a GRPO training loop.
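The reward construction described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the cap is assumed to keep the first K_max steps, the gate is assumed to return zero, and the format reward and step-count bonus are omitted because their exact forms are not given in this summary.

```python
from typing import Sequence


def step_score(continuation_correct: Sequence[bool]) -> float:
    """c_i: fraction of the M sampled continuations that reach the correct answer."""
    return sum(continuation_correct) / len(continuation_correct)


def process_reward(step_scores: Sequence[float], k_max: int = 6) -> float:
    """Average the step scores over at most k_max steps (assumed: the first k_max)."""
    scored = list(step_scores)[:k_max]
    return sum(scored) / len(scored) if scored else 0.0


def gated_reward(r_proc: float, answer_correct: bool,
                 lam_acc: float, lam_proc: float, tau: float = 0.5) -> float:
    """Mix outcome and process rewards, with the penalty gate applied first.

    Assumption: the gate zeros the reward when the final answer is wrong
    AND the process score is below tau.
    """
    if not answer_correct and r_proc < tau:
        return 0.0
    return lam_acc * float(answer_correct) + lam_proc * r_proc


if __name__ == "__main__":
    # Hypothetical verifier outcomes for M=4 continuations at each of 3 steps.
    scores = [step_score(s) for s in ([True, True, False, True],
                                      [True, False, False, True],
                                      [False, False, False, True])]
    r_proc = process_reward(scores)
    print(scores, r_proc, gated_reward(r_proc, False, lam_acc=0.0, lam_proc=1.0))
```

With `lam_acc=0.0` and `lam_proc=1.0` the mixture reduces to the process-only variant; `lam_acc=1.0` and `lam_proc=0.0` gives the outcome-only variant.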

If this is right

  • Step-level credit assignment via continuation rollouts could be applied to any domain with verifiable final answers (math, code, logic puzzles), potentially improving reasoning quality beyond video QA.
  • The finding that process-only rewards outperform mixed rewards suggests that outcome rewards may introduce noise or conflicting gradients when combined with denser process signals, which has implications for reward design in RL-based post-training generally.
  • The method's reliance on the same final-answer verifier used in standard RLVR means it can be deployed wherever outcome-based RLVR is already in use, lowering the barrier to adoption compared to PRM-based approaches that require annotated training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The SFT warm-up degrades performance relative to the baseline (53.83 vs. 56.30 average), meaning the GRPO stage must recover this degradation and add further gains. Without a no-SFT, no-process-reward GRPO baseline, the paper cannot fully isolate whether the process reward mechanism or the combined SFT+GRPO pipeline drives the improvement. A clean ablation would strengthen the causal claim.
  • The rollout cost scales with the number of steps times M continuations per step times G GRPO group size, which could make this approach prohibitively expensive for longer reasoning traces or larger models. The paper acknowledges this limitation but does not quantify the compute overhead relative to outcome-only GRPO.
  • The sensitivity to step segmentation — how the teacher model decomposes reasoning into steps — is acknowledged but not systematically studied. If step boundaries are poorly chosen, the continuation-solvability signal could be noisy or misleading, which would limit the method's reliability on tasks where good step decomposition is non-trivial.
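The rollout-cost concern in the second bullet can be made concrete with a back-of-the-envelope count. The sketch below assumes every response in a group has the same number of steps and that every scored step receives M continuations; the function name is hypothetical, and the count is an upper bound on generations rather than a measurement of compute.

```python
def continuation_rollouts_per_prompt(num_steps: int, m: int = 4,
                                     group_size: int = 4, k_max: int = 6) -> int:
    """Upper bound on extra continuation generations for one training prompt.

    Assumptions: each of the group_size responses has num_steps steps,
    and at most k_max of those steps are scored with m continuations each.
    """
    scored_steps = min(num_steps, k_max)
    return group_size * scored_steps * m


if __name__ == "__main__":
    for steps in (3, 6, 10):
        print(steps, continuation_rollouts_per_prompt(steps))
```

Under these assumptions, with G=4, M=4, and K_max=6, a prompt whose responses all have six or more steps needs up to 96 continuation generations on top of the 4 responses that outcome-only GRPO samples.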

Load-bearing premise

The claim that process rewards drive the improvement rests on all GRPO variants sharing the same SFT warm-up, but since that SFT stage itself degrades performance below the baseline, the GRPO stage must both recover and exceed the original model. Without a GRPO-only variant that uses no process reward and no SFT, one cannot fully separate the contribution of the process reward from the combined effect of format-structured SFT followed by GRPO training.

What would settle it

If a GRPO variant trained with the same format-structured SFT warm-up but using only outcome rewards (no process reward, no continuation rollouts) matched or exceeded the process-only configuration's scores, the central claim that rollout-based process rewards are the active ingredient would be undermined.

Figures

Figures reproduced from arXiv: 2606.11209 by Boer Zhang, Jingpei Wu, Volker Tresp, Weixiang Shen, Xiao Han, Zifeng Ding.

Figure 1: Rollout-based process reward inside one GRPO update. For a question …

Original abstract

Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 6 minor

Summary. The paper proposes ProcessThinker, a post-training pipeline for multimodal LLMs that provides step-level process rewards without training a separate PRM. The method rewrites reasoning traces into a step-tagged format for SFT warm-up, then applies GRPO with a rollout-based process reward: for each intermediate step, M=4 continuations are sampled from the step prefix, and the empirical success rate (under the same final-answer verifier used in RLVR) serves as the step reward. Experiments on four video reasoning benchmarks show that ProcessThinker (process-only) improves over the Qwen3-VL-8B-Instruct baseline by +3.42 average, and that process-only rewards outperform both outcome-only and mixed reward configurations.

Significance. The core idea—estimating step utility via continuation solvability within a GRPO framework, without training a separate PRM—is a clean and practical contribution. The method is well-motivated by the credit-assignment problem in sparse-reward RLVR, and the ablation structure in Table 1 (outcome-only vs. mixed vs. process-only, all sharing the same SFT warm-up and GRPO machinery) is the right experimental design to isolate the contribution of the process reward. The approach is modality-agnostic and could transfer beyond video reasoning. The paper is honest about limitations, including rollout cost and noise.

major comments (2)
  1. §3, Table 1: The central comparative claim, that process-only rewards outperform outcome-only rewards, rests on a 2.17-point average difference (59.72 vs. 57.55). On individual benchmarks, some gaps are very small (e.g., VideoMathQA: outcome-only and mixed both 27.86; LongVideoBench: 74.20 vs. 74.60 vs. 75.40). No error bars, confidence intervals, significance tests, or even a single replication are reported anywhere in the paper. With only 1,250 RL training prompts, G=4 samples per prompt, and M=4 continuation rollouts per step, run-to-run variance from random seed and sampling stochasticity could plausibly shift scores by 1–2 points. Without any variance estimate, it is impossible to determine whether the ordering outcome-only < mixed < process-only reflects a real effect or noise. This directly undermines the load-bearing comparative claim. At minimum, the authors should report results from multiple independent training runs with different random seeds.
  2. §2.2, Eq. (2): With M=4, the step score c_i can only take values {0, 0.25, 0.5, 0.75, 1.0}, which is a very coarse estimator. The paper acknowledges rollout noise in the conclusion but does not analyze its impact on training stability or final performance. A brief sensitivity analysis (e.g., M=2 vs. M=4 vs. M=8) would strengthen the claim that M=4 is sufficient, especially since the axiom ledger identifies 'empirical success rate of M=4 continuations is a sufficient estimator of step utility' as an ad-hoc assumption.
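The coarseness raised in the second major comment can be quantified with a short sketch. It treats the M continuations as independent Bernoulli draws with a fixed success probability p, which is a simplifying assumption; the function names are illustrative.

```python
from math import sqrt
from typing import List


def possible_step_scores(m: int) -> List[float]:
    """All values c_i can take when it is a success fraction over m rollouts."""
    return [k / m for k in range(m + 1)]


def standard_error(p: float, m: int) -> float:
    """Standard error of a success-rate estimate from m independent Bernoulli(p) draws."""
    return sqrt(p * (1 - p) / m)


if __name__ == "__main__":
    print(possible_step_scores(4))
    for m in (2, 4, 8):
        print(m, round(standard_error(0.5, m), 3))
```

Under that assumption, a step whose true success probability is 0.5 has a standard error of 0.25 at M=4, which is as wide as the gap between adjacent score levels.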
minor comments (6)
  1. §2.2, Eq. (6): The reward weights λ_acc and λ_proc are stated to satisfy λ_acc + λ_proc = 1, but the specific values used for each variant (outcome-only, mixed, process-only) are not reported. These should be specified, as they are load-bearing for the ablation.
  2. §2.2, Eq. (4): The step bonus B(K) uses a parameter α_r that is not defined or given a value. Please clarify.
  3. §2.2, Eq. (5): The penalty gate uses τ=0.5, but the rationale for this threshold is not discussed. A brief justification or sensitivity note would help.
  4. §3: The paper mentions VinePPO (Kazemnejad et al., 2025) as a closely related method using Monte Carlo rollouts for step-level credit assignment in PPO, but no experimental comparison is provided. Even a brief discussion of why a direct comparison is not feasible (different base model, modality, or framework) would contextualize the contribution.
  5. §3: Training compute and wall-clock time for the rollout-based process reward (which requires up to M×K_max additional continuation rollouts per response) are not reported. Given that the paper identifies efficiency as the main limitation, quantitative cost figures would strengthen this discussion.
  6. Table 1: VIDEO-R1-7B uses a different backbone (Qwen2.5-VL) and is acknowledged as not directly comparable. Consider moving it to a footnote or separate reference table to avoid confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. Both major comments are well-taken: (1) the absence of variance estimates weakens the comparative claim between reward configurations, and (2) the coarseness of M=4 as a step-score estimator deserves direct analysis. We will address both in the revision.

point-by-point responses
  1. Referee: §3, Table 1: The central comparative claim—that process-only rewards outperform outcome-only rewards—rests on a 2.17-point average difference (59.72 vs. 57.55). No error bars, confidence intervals, significance tests, or replications are reported. With only 1,250 RL training prompts, run-to-run variance could plausibly shift scores by 1–2 points. Without variance estimates, it is impossible to determine whether the ordering outcome-only < mixed < process-only reflects a real effect or noise.

    Authors: The referee is correct that the absence of any variance estimate is a genuine gap, and we will address it in the revision. We will report results from at least two additional independent training runs per reward configuration (different random seeds), and will include standard deviations across runs in Table 1. We will also add bootstrap confidence intervals on the evaluation accuracy for each benchmark. We agree that the 2.17-point average gap between process-only and outcome-only is modest, and that some per-benchmark gaps (e.g., VideoMathQA: outcome-only and mixed both 27.86) are too small to support strong claims on their own. We will accordingly soften the language from 'process-only outperforms' to 'process-only achieves the best average across our runs, though the advantage over outcome-only is modest and we cannot rule out noise on individual benchmarks.' If the replicated runs do not reproduce the ordering, we will report that honestly. We note that the monotonic ordering outcome-only < mixed < process-only holds on the average across all four benchmarks, and the largest single-benchmark gap (VideoMathQA: 27.86 vs. 31.67, +3.81 for process-only) is the one most relevant to multi-step reasoning, which is the setting the method targets. But we agree this does not substitute for proper variance estimation.

    Revision: yes

  2. Referee: §2.2, Eq. (2): With M=4, the step score c_i can only take values {0, 0.25, 0.5, 0.75, 1.0}, which is a very coarse estimator. The paper acknowledges rollout noise in the conclusion but does not analyze its impact on training stability or final performance. A sensitivity analysis (e.g., M=2 vs. M=4 vs. M=8) would strengthen the claim that M=4 is sufficient, especially since the axiom ledger identifies 'empirical success rate of M=4 continuations is a sufficient estimator of step utility' as an ad-hoc assumption.

    Authors: This is a fair point. The coarseness of the M=4 estimator is a real limitation that we have acknowledged only in passing. We will add a sensitivity analysis varying M ∈ {2, 4, 8} and report both final performance and training stability (reward curve smoothness, gradient variance proxies). We expect M=2 to be noticeably noisier and M=8 to offer diminishing returns at roughly double the rollout cost, but we will report whatever the data shows. We will also add a brief discussion of why M=4 was chosen as the default: it offers five discrete levels (including the extremes 0 and 1), which is the minimum needed to distinguish 'never solvable,' 'sometimes solvable,' and 'always solvable' prefixes while keeping rollout cost within our compute budget. We agree that the sufficiency of M=4 is currently an ad-hoc assumption and will label it as such in the revised text, presenting the sensitivity analysis as the empirical justification (or partial justification, depending on results).

    Revision: yes
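The bootstrap confidence intervals promised in the first response could be computed along these lines. This is a minimal percentile-bootstrap sketch over per-question correctness flags; the function name, resample count, and seed are illustrative assumptions, and it captures evaluation-set sampling noise only, not run-to-run training variance.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_accuracy_ci(correct: Sequence[int], n_resamples: int = 2000,
                          alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy over 0/1 per-question results."""
    rng = random.Random(seed)
    n = len(correct)
    accs: List[float] = []
    for _ in range(n_resamples):
        # Resample the evaluation questions with replacement.
        sample = rng.choices(correct, k=n)
        accs.append(sum(sample) / n)
    accs.sort()
    lo = accs[int((alpha / 2) * n_resamples)]
    hi = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical benchmark with 100 questions, 60 answered correctly.
    flags = [1] * 60 + [0] * 40
    print(bootstrap_accuracy_ci(flags))
```

Seed-to-seed variance would still need the separate replicated training runs that the authors describe.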

Circularity Check

0 steps flagged

No circularity: rollout-based process reward is defined by independent MC sampling and external ground-truth verification, not by self-citation or fitted parameters

full rationale

The paper's central construction — the rollout-based process reward (Eqs. 2–3) — is defined as the empirical success rate of M continuations sampled from the current policy conditioned on each step prefix, verified against an external ground-truth answer. This is a model-free, parameter-free construction: no parameter is fitted to a target and then renamed as a prediction, and no self-citation chain is load-bearing for the reward definition. The ablation in Table 1 compares outcome-only (λ_acc=1, λ_proc=0) vs. process-only (λ_acc=0, λ_proc=1) under the same SFT warm-up and GRPO machinery, so the comparison is between genuinely different reward signals, not between a fit and its own target. The process reward uses the same final-answer verifier as RLVR, but this is shared infrastructure (external ground-truth checking), not a circular definition. VinePPO (Kazemnejad et al., 2025) is cited as related work with a similar Monte-Carlo rollout intuition for PPO, but the paper's GRPO-based construction is presented as its own contribution, not as a consequence of an unverified self-cited theorem. The absence of error bars and the SFT degradation concern are correctness/statistical risks, not circularity. No step in the derivation chain reduces to its inputs by construction.

Axiom & Free-Parameter Ledger

9 free parameters · 4 axioms · 0 invented entities

The paper introduces no new entities (particles, forces, dimensions, etc.). It introduces a training pipeline and reward construction, which are methods rather than postulated objects. The free parameters are hyperparameters of the reward function and training setup, several of which are not explicitly stated in the paper.

free parameters (9)
  • M (continuation rollouts per step) = 4
    Number of continuations sampled per step prefix to estimate success rate; chosen by the authors, not derived.
  • K_max (max steps scored) = 6
    Caps the number of steps that receive process rewards; chosen by the authors.
  • τ (process reward threshold) = 0.5
    Threshold below which process reward is replaced by penalty; chosen by the authors.
  • α (step bonus scale) = not stated
    Scales the bounded step bonus B(K) in Eq. 4; value not specified in the paper.
  • β (format reward bonus) = not stated
    Added to format reward in Eq. 6; value not specified.
  • λ_acc, λ_proc (reward weights) = varies by variant
    Control the mixture of outcome and process rewards; λ_acc+λ_proc=1. Process-only uses λ_proc=1.
  • K_min, L_min, L_max (format constraints) = not stated
    Minimum/maximum step count and length bounds for format validity; values not specified.
  • 19k (SFT dataset size) = 19000
    Number of filtered samples kept for SFT from the 165k pool; chosen by the authors.
  • 1,250 (RL prompt count) = 1250
    Number of prompts sampled for GRPO training; chosen by the authors.
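For readers who want the ledger in one place, the free parameters above can be collected into a small configuration object. The class and field names are hypothetical; values marked "not stated" are left as `None` rather than guessed, and the default weights correspond to the process-only variant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RewardConfig:
    """Hypothetical container for the hyperparameters listed in the ledger."""
    m_rollouts: int = 4              # M: continuations per step prefix
    k_max: int = 6                   # K_max: maximum number of scored steps
    tau: float = 0.5                 # penalty-gate threshold
    lambda_acc: float = 0.0          # outcome weight (process-only default)
    lambda_proc: float = 1.0         # process weight (process-only default)
    alpha: Optional[float] = None    # step bonus scale: not stated
    beta: Optional[float] = None     # format reward bonus: not stated
    k_min: Optional[int] = None      # format constraint: not stated
    l_min: Optional[int] = None      # format constraint: not stated
    l_max: Optional[int] = None      # format constraint: not stated
    sft_samples: int = 19_000        # filtered SFT samples
    rl_prompts: int = 1_250          # prompts used for GRPO

    def __post_init__(self) -> None:
        # The ledger states that the two reward weights sum to one.
        if abs(self.lambda_acc + self.lambda_proc - 1.0) > 1e-9:
            raise ValueError("lambda_acc + lambda_proc must equal 1")


if __name__ == "__main__":
    print(RewardConfig())
    print(RewardConfig(lambda_acc=1.0, lambda_proc=0.0))  # outcome-only variant
```

The weights for the mixed variant are not reported, so no preset is given for it.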
axioms (4)
  • domain assumption: Step-level credit assignment improves learning over outcome-only rewards for long reasoning traces
    Core assumption motivating the entire approach; stated in §1 and §2.2. Not independently proven — the experimental results are consistent with it but confounded by the SFT+GRPO pipeline.
  • ad hoc to paper: Empirical success rate of M=4 continuations is a sufficient estimator of step utility
    M=4 is a small sample size for estimating a probability; the paper does not justify this choice or report sensitivity to M. Used in Eq. 2.
  • domain assumption: Step segmentation by the teacher model produces meaningful, non-redundant steps
    The SFT data construction (§2.1) relies on the teacher model segmenting reasoning into meaningful steps. The paper acknowledges this is noisy and applies filtering, but does not validate step quality independently.
  • standard math: GRPO with KL regularization to a reference policy is a stable training method for this setting
    Standard GRPO recipe from Shao et al. (2024); used in §2.2.

pith-pipeline@v1.1.0-glm · 10875 in / 3096 out tokens · 142585 ms · 2026-07-04T20:03:56.111290+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

  1. VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment. International Conference on Machine Learning. arXiv:2410.01679.
  2. Qwen3-VL Technical Report. 2025.
  3. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.
  4. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.
  5. Video-R1: Reinforcing Video Reasoning in Multimodal Large Language Models. 2025.
  6. R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization. 2025.
  7. Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. 2025.
  8. DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO. 2025.
  9. Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding. 2025.
  10. OneThinker: All-in-One Reasoning Model for Image and Video. 2025.
  11. R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  12. TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning. 2025.
  13. Let's Verify Step by Step. 2023.
  14. Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. 2024.
  15. ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search. 2024.
  16. ReST-RL: Process Reward Guided Reinforcement Learning for Large Language Model Reasoning. 2025.
  17. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
  18. MM-Eureka: Reinforcement Learning for Multimodal Reasoning with Verifiable Rewards. 2025.
  19. PQM: Process Quality Models for Step-level Verification in Reasoning. 2024.
  20. Improving Multimodal Chain-of-Thought Reasoning in Vision-Language Models. 2024.
  21. Process Reward Models That Think. 2025.
  22. The Lessons of Developing Process Reward Models in Mathematical Reasoning. 2025.
  23. VisualPRM: An Effective Process Reward Model for Multimodal Reasoning. 2025.
  24. GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning. 2025.
  25. MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision. 2025.
  26. AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification. 2025.
  27. SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning. 2025.
  28. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. 2024.
  29. V-STaR: Training Verifiers for Self-Taught Reasoners. 2024.
  30. Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling. 2025.
  31. e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs. International Conference on Machine Learning (ICML).
  32. Unlocking multimodal mathematical reasoning via process reward model. 2025.
  33. OpenThoughts: Data Recipes for Reasoning Models. 2025.
  34. OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models. 2024.
  35. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. 2025.
  36. MMVU: Measuring Expert-Level Multi-Discipline Video Understanding. 2025.
  37. VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos. 2025.
  38. LongVideoBench: A Benchmark for Long-context Interleaved Video-language Understanding. 2024.
  39. R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO. 2025.
  40. Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense. 2025.
  41. Lessons from Training Grounded LLMs with Verifiable Rewards. 2025.
  42. Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains. 2025.
  43. Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning. 2025.
  44. XRPO: Pushing the Limits of GRPO with Targeted Exploration and Exploitation. 2025.
  45. GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models. 2025.
  46. AMIR-GRPO: Inducing Implicit Preference Signals into GRPO. 2026.
  47. From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning. 2026.
  48. Proximal Policy Optimization Algorithms. 2017.