pith. machine review for the scientific record.

arxiv: 2604.11554 · v2 · submitted 2026-04-13 · 💻 cs.CL

Recognition: unknown

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords: reinforcement learning · RL post-training · omni-modal models · asynchronous training · TransferQueue · MoE models · multimodal RL · large language models

The pith

Decoupling RL roles into independent services and a tunable-staleness TransferQueue delivers up to 2x faster omni-modal post-training while matching on-policy rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Relax as an open-source engine for reinforcement learning post-training on large models that accept text, images, audio, and video. It claims that an omni-native stack, fault-isolated independent services for each RL component, and asynchronous data movement through a single TransferQueue with one adjustable staleness parameter together solve the problems of heterogeneous data, scale fragility, and the usual speed-versus-freshness tradeoff. If these changes keep learning dynamics intact, training runs could finish in less wall-clock time without losing the final reward that on-policy methods achieve. Readers who want to scale RL to agentic, multimodal models would care because existing engines often force a choice between throughput and correctness or require heavy re-engineering when new modalities appear.

Core claim

Relax implements an omni-native architecture that embeds multimodal support across data preprocessing, parallelism, and generation; runs every RL role as an independent, recoverable service; and routes data through a TransferQueue that lets a single staleness value interpolate between fully on-policy and fully asynchronous execution. On Qwen3-4B this yields a 1.20× end-to-end speedup over veRL in on-policy mode and a 1.76× speedup over colocate in fully async mode; on the 30B omni-modal model the async mode reaches 2.00× over colocate while all three modes converge to identical reward curves. The same system supports R3 replay for MoE models at 1.9% overhead and sustains stable training for more than 2,000 steps.

What carries the argument

The TransferQueue data bus, which decouples independent RL services so that a single staleness parameter controls the degree of asynchrony while preserving end-to-end learning dynamics.
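
To make the claimed mechanism concrete, the sketch below shows one way a bounded-staleness queue between rollout and training services could behave. It is an illustration of the general idea only: the class name, method names, and drop-stale policy are hypothetical and are not taken from Relax's actual TransferQueue implementation.

    import queue
    import threading

    class BoundedStalenessQueue:
        """Illustrative sketch only, not the Relax TransferQueue API.
        Rollout services push samples tagged with the policy version that
        generated them; the trainer pops batches and discards anything older
        than `max_staleness` versions. max_staleness = 0 approximates
        on-policy training; a large value approaches fully async execution."""

        def __init__(self, max_staleness: int):
            self.max_staleness = max_staleness
            self._q = queue.Queue()
            self._policy_version = 0
            self._lock = threading.Lock()

        def put(self, sample, version: int):
            # Producer side: a rollout service tags each sample with the
            # policy version it used to generate it.
            self._q.put((version, sample))

        def advance_policy(self):
            # Trainer side: called after each optimizer step / weight sync.
            with self._lock:
                self._policy_version += 1

        def get_batch(self, batch_size: int):
            # Trainer side: keep only samples whose age is within the bound.
            # With max_staleness == 0 the trainer effectively waits for fresh
            # rollouts, recovering (near) on-policy behaviour.
            batch = []
            while len(batch) < batch_size:
                version, sample = self._q.get()  # blocks until a sample arrives
                if self._policy_version - version <= self.max_staleness:
                    batch.append(sample)
            return batch

Whether over-age samples are dropped, down-weighted, or reordered is an implementation choice the paper would need to specify; the point of the sketch is only that a single integer bound is enough to interpolate between synchronous and fully asynchronous data flow.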

If this is right

  • On-policy, near-on-policy, and fully async modes all reach the same final reward on the tested models.
  • Fully async execution yields 1.76× speedup on 4B and 2.00× on 30B omni-modal models compared with colocated baselines.
  • R3 routing for MoE models incurs only 1.9% overhead instead of the 32% degradation seen in prior engines.
  • Stable convergence holds across image, text, audio, and video inputs for at least 2,000 steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same service decoupling could simplify adding new input modalities or swapping individual components without restarting the entire training run.
  • Tuning the single staleness parameter might offer a practical knob for trading compute efficiency against sample quality on tasks not yet tested (a sweep along these lines is sketched after this list).
  • The approach may reduce the hardware specialization needed for large-scale RL by improving utilization on existing clusters.
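
If the knob behaves as described, a minimal sweep over it might look like the sketch below. Everything here is hypothetical: train_run stands in for whatever entry point launches a full post-training job and is assumed to return the final reward; Relax's real configuration surface may differ.

    import time

    def sweep_staleness(train_run, staleness_values=(0, 1, 2, 4, 8)):
        # Hypothetical sweep: train once per staleness setting and record the
        # trade-off between final reward (sample quality) and wall-clock cost.
        results = []
        for s in staleness_values:
            start = time.time()
            final_reward = train_run(max_staleness=s)  # assumed interface
            results.append({
                "max_staleness": s,
                "final_reward": final_reward,
                "wall_clock_s": time.time() - start,
            })
        return results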

Load-bearing premise

The architectural separation of services and the TransferQueue preserve the same reward convergence that synchronous on-policy training would achieve.

What would settle it

Training the identical Qwen3-Omni-30B workload in fully async mode and observing a lower final reward than the on-policy baseline after the same number of steps would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.11554 by Benzhe Ning, Jia Liu, Jiaxing Li, Lei Zhang, Liujie Zhang, Lumeng Wu, Minghao Li, Rui Yang, Weihang Chen, Weiqi Hu, Xiaoyan Yu.

Figure 1. Relax system architecture. The Controller (control plane) orchestrates RL roles that execute on independent …
Figure 2. Service-oriented infrastructure components. (a) DCS coordinates topology discovery and weight metadata; …
Figure 3. Timeline comparison of three training modes. (a) Colocate: rollout and training alternate on shared GPUs. …
Figure 4. Omni-modal reward convergence (Qwen3-Omni-30B). (a) Image+text+audio training on EchoInk. (b) Video …
Figure 5. End-to-end performance: Relax vs. veRL on Qwen3-4B. (a) Per-step time comparison. (b) Gantt-chart view …
Figure 6. Training-mode comparison on Qwen3-4B / DAPO-MATH-17k. (a) Reward by step. (b) Reward by wall-clock …
Figure 7. Colocate vs. fully async on Qwen3-Omni-30B / EchoInk. (a) Reward by step. (b) Reward by wall-clock time.
Figure 8. R3 ablation on Qwen3-30B-A3B: Relax vs. veRL. (a) Normalized routing mismatch. (b) Reward convergence.
Figure 9. Text-only RL on DAPO-MATH (Qwen3-30B-A3B). (a) Training reward, pass@k, and response length. …
Figure 10. Agentic RL reward convergence on multi-turn tool-calling (Qwen3-VL-MoE-30B, Deepeyes).
Figure 11. FP16 vs. BF16 precision ablation (Qwen3-4B). (a) Reward convergence. (b) Train–rollout log-probability …
Original abstract

Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness–throughput tradeoff. We present Relax (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an omni-native architecture builds multimodal support into the full stack, from data preprocessing and modality-aware parallelism to inference generation, rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20× end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76× speedup over colocate on Qwen3-4B and a 2.00× speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay) [1] for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Relax, an open-source asynchronous RL training engine for omni-modal post-training of LLMs. It describes an omni-native architecture with modality-aware parallelism, independent fault-isolated services for RL roles, and a TransferQueue data bus controlled by a single staleness parameter that interpolates between on-policy, near-on-policy, and fully asynchronous modes. Central empirical claims include a 1.20× end-to-end speedup over veRL on Qwen3-4B on-policy training, 1.76× and 2.00× speedups in fully async mode on Qwen3-4B and Qwen3-Omni-30B, equivalent final reward levels across modes, 1.9% overhead for R3 on MoE models (vs. 32% degradation in veRL), and stable convergence over 2,000 steps on video data across image/text/audio modalities.

Significance. If the results are robust, the work offers a practical engineering contribution for scaling RL post-training to large omni-modal and MoE models by decoupling throughput from staleness without apparent performance loss. The open-source release, explicit support for R3, and demonstration of long-horizon video stability are concrete strengths that could aid reproducibility and adoption. The design addresses real operational challenges in heterogeneous data flows and fault tolerance at scale.

major comments (3)
  1. [Experimental Evaluation] Abstract and Experimental Evaluation: The headline claim that on-policy, near-on-policy, and fully-async modes all converge to identical reward levels is load-bearing for the architectural contribution, yet no training curves, per-seed variance, or statistical tests are shown. Without these, it is impossible to verify that the single staleness parameter in TransferQueue produces updates whose effective data distribution and gradient bias remain sufficiently close to on-policy.
  2. [Abstract] Performance claims in Abstract: The specific speedups (1.20× over veRL, 1.76× and 2.00× for async) and overhead numbers (1.9% for R3) are presented without error bars, number of independent runs, hardware configuration details, or baseline implementation notes. These quantitative results are central to the paper's value proposition but cannot be assessed for robustness or reproducibility from the given information.
  3. [Omni-Modal and MoE Experiments] Omni-modal and MoE sections: The assertion of stable omni-modal RL over 2,000 steps on video and low-overhead R3 support lacks ablations on how modality-aware parallelism or MoE routing interacts with the async TransferQueue, or any observed failure modes. This is required to substantiate that the decoupling preserves dynamics across the tested configurations.
minor comments (2)
  1. [Related Work] The citation for R3 (ma2025r3) is referenced but the integration details and differences from the original R3 implementation could be clarified in the methods or related work to aid readers.
  2. [Figures] Ensure all figures (if present) for reward curves or speedup breakdowns include clear legends distinguishing the three execution modes and any variance indicators.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that strengthening the experimental evaluation with additional visualizations, statistics, and ablations will improve the robustness and reproducibility of our claims. We address each major comment below and commit to the corresponding revisions.

Point-by-point responses
  1. Referee: [Experimental Evaluation] Abstract and Experimental Evaluation: The headline claim that on-policy, near-on-policy, and fully-async modes all converge to identical reward levels is load-bearing for the architectural contribution, yet no training curves, per-seed variance, or statistical tests are shown. Without these, it is impossible to verify that the single staleness parameter in TransferQueue produces updates whose effective data distribution and gradient bias remain sufficiently close to on-policy.

    Authors: We agree that the submitted manuscript does not include training curves, per-seed variance, or statistical tests for the convergence claim. In the revised version we will add reward curves for on-policy, near-on-policy, and fully-async modes across multiple seeds (N=3) with standard-deviation bands. We will also report statistical tests (paired t-tests on final rewards) confirming no significant difference. On the data-distribution concern, the TransferQueue staleness parameter explicitly bounds maximum sample age; we will add a short analysis of observed sample-age histograms to show that gradient bias remains negligible within the tested range, supporting the architectural claim while acknowledging the need for the requested evidence (a rough sketch of such an analysis appears after this list). revision: yes

  2. Referee: [Abstract] Performance claims in Abstract: The specific speedups (1.20× over veRL, 1.76× and 2.00× for async) and overhead numbers (1.9% for R3) are presented without error bars, number of independent runs, hardware configuration details, or baseline implementation notes. These quantitative results are central to the paper's value proposition but cannot be assessed for robustness or reproducibility from the given information.

    Authors: We acknowledge that the abstract and experimental sections lack error bars, run counts, hardware details, and baseline notes. We will revise both the abstract and main text to report mean speedups and overheads with standard deviations from N=3 independent runs, specify the exact hardware configuration (e.g., 8×H100 cluster with interconnect details), and document the precise veRL version and configuration flags used. These additions will directly address reproducibility concerns. revision: yes

  3. Referee: [Omni-Modal and MoE Experiments] Omni-modal and MoE sections: The assertion of stable omni-modal RL over 2,000 steps on video and low-overhead R3 support lacks ablations on how modality-aware parallelism or MoE routing interacts with the async TransferQueue, or any observed failure modes. This is required to substantiate that the decoupling preserves dynamics across the tested configurations.

    Authors: We agree that the current sections would benefit from targeted ablations. In revision we will add experiments that vary the staleness parameter while measuring per-modality throughput and routing statistics under modality-aware parallelism. For MoE models we will include routing-frequency histograms under async versus on-policy settings. Our 2,000-step video runs exhibited no degradation or failure modes; we will explicitly report this observation and any edge cases considered (e.g., extreme modality imbalance). These additions will substantiate that the TransferQueue decoupling preserves training dynamics. revision: yes
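
The analyses promised in responses 1 and 3 could be carried out roughly as in the sketch below. The reward values and logged sample ages are hypothetical placeholders, and the paired t-test and histogram are standard NumPy/SciPy calls rather than anything prescribed by the paper.

    import numpy as np
    from scipy import stats

    # Final reward per seed for the same three seeds in each mode
    # (hypothetical numbers standing in for the promised N=3 runs).
    on_policy_final = np.array([0.712, 0.705, 0.718])
    fully_async_final = np.array([0.709, 0.707, 0.715])

    t_stat, p_value = stats.ttest_rel(on_policy_final, fully_async_final)
    print(f"paired t-test on final reward: t={t_stat:.3f}, p={p_value:.3f}")
    # A non-significant difference is consistent with, but not proof of,
    # equivalent convergence; an explicit equivalence test (e.g. TOST)
    # would be stronger evidence than merely failing to reject the null.

    # Sample-age histogram for an async run: age = trainer policy version at
    # consumption minus the version that generated the sample (placeholder log).
    sample_ages = np.random.randint(0, 5, size=10_000)
    counts = np.bincount(sample_ages)
    for age, count in enumerate(counts):
        print(f"age {age}: {count / len(sample_ages):.1%} of consumed samples")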

Circularity Check

0 steps flagged

No circularity: results are empirical benchmarks against external baselines

full rationale

The paper presents an engineering system (Relax) with architectural choices for async RL and omni-modal support, then reports measured speedups (1.20× over veRL, 1.76×/2.00× for async modes) and convergence to equivalent rewards across modes. These outcomes are obtained by running the implemented system on Qwen3 models and comparing against an independent external system (veRL) and a colocate baseline. No equations, fitted parameters, or first-principles derivations are claimed; the single staleness parameter is a configurable design knob whose effects are validated empirically rather than derived by construction. Self-citations (e.g., to R3) are peripheral and not load-bearing for the core speedup or convergence claims. The evidential chain therefore rests on empirical comparison against external baselines rather than on self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems engineering paper whose claims rest on design choices and empirical measurements rather than mathematical axioms or derivations. No free parameters are fitted to data in the traditional sense; the staleness parameter is a user-configurable hyperparameter. No new entities are postulated.

pith-pipeline@v0.9.0 · 5693 in / 1200 out tokens · 61041 ms · 2026-05-10T16:01:34.641908+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MinT: Managed Infrastructure for Training and Serving Millions of LLMs

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.

Reference graph

Works this paper leans on

22 extracted references · 20 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers

    Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing MoE reinforcement learning by aligning training and inference routers. arXiv preprint arXiv:2510.11370, 2025. Proposed Rollout Routing Replay (R3): records inference-time routing distributions and replays them during training to stabilize MoE RL.

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. Demonstrated that pure RL can incentivize emergent reasoning (self-reflection, verification) in LLMs with...

  3. [3]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. Introduced PPO, a simple and effective policy gradient method with clipped surrogate objectives for stable RL training.

  4. [4]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. Introduced Group Relative Policy Optimization (GRPO), a memory-efficient variant of PPO for mathemati...

  5. [5]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. Proposed Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), achieving 50 points on AIME 2024 with Qwen2.5-32B. ...

  6. [6]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024. Proposed veRL/HybridFlow: hybrid single-/multi-controller paradigm for RLHF with 3D-HybridEngine for actor resharding, achieving 1.53x–20.57x throughpu...

  7. [7]

    OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

    Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, et al. OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143, 2024. Open-source RLHF/RLVR framework built on Ray, vLLM, DeepSpeed, and HuggingFace Transformers, achieving 1.22x–1.68x speedup...

  8. [8]

    AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

    Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. AReaL: A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298, 2025. Fully asynchronous RL system decoupling generation from training with staleness-enha...

  9. [9]

    AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training

    Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, Huazhong Ji, Wenjie Liu, Yu Huang, Yixiang Zhang, Chenyi Pan, Jing Wang, Xin Huang, Chunsheng Li, and Jianping Wu. AsyncFlow: An asynchronous streaming RL framework for efficient LLM post-training. arXiv preprint arXiv:2507.01663, 20...

  10. [10]

    Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library

    Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, Zichen Liu, Haizhou Zhao, et al. Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library. arXiv preprint arXiv:2506.06122, 2025.

  11. [11]

    ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

    Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, Zhiding Yu, Jan Kautz, and Yi Dong. ProRL Agent: Rollout-as-a-service for RL training of multi-turn LLM agents. arXiv preprint arXiv:2603.18815, 2026. NVIDIA NeMo Gym: rollout-as-a-service infrastructure for multi-turn agentic RL tra...

  12. [12]

    slime: An LLM Post-Training Framework for RL Scaling

    Zilin Zhu, Chengxing Xie, Xin Lv, and slime Contributors. slime: An LLM post-training framework for RL scaling. https://github.com/THUDM/slime, 2025. GitHub repository.

  13. [13]

    Ray: A Distributed Framework for Emerging AI Applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. arXiv preprint arXiv:1712.05889, 2018. Distributed computing framework for AI with task-parallel and actor-based abstractions, wi...

  14. [14]

    Deep Reinforcement Learning from Human Preferences

    Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. arXiv preprint arXiv:1706.03741, 2017. Pioneering work on learning reward functions from human preferences to guide RL agents.

  15. [15]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

  16. [16]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Zhangyue Chen et al. A survey of reasoning with foundation models: Towards complex and long chain-of-thought reasoning. arXiv preprint arXiv:2503.09567, 2025. Comprehensive survey on long and complex chain-of-thought reasoning with foundation models.

  17. [17]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. Introduced efficient intra-layer tensor model parallelism for training multi-billion parameter transformers, sustaining 15.1 PetaFLO...

  18. [18]

    SGLang: Efficient Execution of Structured Language Model Programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2023. LLM inference system with RadixAttention for KV cache reuse and compressed F...

  19. [19]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023. Industr...

  20. [20]

    Inference-Time Scaling for Generalist Reward Modeling

    Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling. arXiv preprint arXiv:2504.02495, 2025. DeepSeek-GRM: Self-Principled Critique Tuning (SPCT) for scalable generalist reward modeling via online RL with inference-time scaling.

  21. [21]

    EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

    Zhenghao Xing, Xiaowei Hu, Chi-Wing Fu, Wenhai Wang, Jifeng Dai, and Pheng-Ann Heng. EchoInk-R1: Exploring audio-visual reasoning in multimodal LLMs via reinforcement learning. arXiv preprint arXiv:2505.04623, 2025.

  22. [22]

    Audio-visual reasoning via RL with the AVQA-R1-6K dataset; trained Qwen2.5-Omni-7B using GRPO.