{"total":15,"items":[{"citing_arxiv_id":"2606.27580","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Retroactive Advantage Correction: Closed-Form V-Trace Bias Correction for Delay-Aware RLHF","primary_cat":"cs.LG","submitted_at":"2026-06-25T22:12:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAC is a closed-form bias correction for delayed rewards in RLHF that is unbiased under full mass reinjection of the delay kernel and reduces to V-trace with no delay.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24143","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AsyncOPD: How Stale Can On-Policy Distillation Be?","primary_cat":"cs.LG","submitted_at":"2026-06-23T04:50:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AsyncOPD shows asynchronous OPD training reaches 1.6-3.8x higher throughput than synchronous baselines with comparable accuracy by using forward-KL estimators and multi-sample Monte Carlo correction for finite teacher caches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19004","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training","primary_cat":"cs.DC","submitted_at":"2026-06-17T12:31:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spotlight achieves 4x faster DiT RL post-training on spot GPUs via stale-weight exploration and elastic sequence parallelism, cutting costs 1.4-6.4x with better image 
quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11867","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training","primary_cat":"cs.DC","submitted_at":"2026-06-10T09:42:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ForeMoE uses routing foresight from the rollout stage to enable micro-step load balancing in MoE RL post-training via a hierarchical planner and transfer engine, claiming up to 1.45x speedup on 64 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05597","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents","primary_cat":"cs.LG","submitted_at":"2026-06-04T02:18:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AsyncWebRL reports up to 2.9x training speedup and new SOTA on WebGym OOD split via async overlap plus constant normalizer in GRPO, with largest gains on harder tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03077","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Libra: Efficient Resource Management for Agentic RL Post-Training","primary_cat":"cs.LG","submitted_at":"2026-06-02T03:09:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward 
convergence on 48 A800 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14220","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diagnosing Training Inference Mismatch in LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-14T00:27:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12070","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction","primary_cat":"cs.LG","submitted_at":"2026-05-12T12:57:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. [19] Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, and Xing Yu. Vespo: Variational sequence-level soft policy optimization for stable off-policy llm training. arXiv preprint arXiv:2602.10693, 2026. [20] Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, et al. 
Laminar: A scalable asynchronous rl post-training framework. arXiv preprint arXiv:2510.12633, 2025. [21] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework."},{"citing_arxiv_id":"2605.08520","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration","primary_cat":"cs.LG","submitted_at":"2026-05-08T22:04:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Jiang, Z. Wang, R. Han, L. T. Le, S. Daruki, X. Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140, 2025. [23] V. Pyatkin, S. Malik, V. Graf, H. Ivison, S. Huang, P. Dasigi, N. Lambert, and H. Hajishirzi. Generalizing verifiable instruction following. arXiv preprint arXiv:2507.02833, 2025. [24] G. Sheng, Y. Tong, B. Wan, W. Zhang, C. Jia, X. Wu, Y. Wu, X. Li, C. Zhang, Y. Peng, et al. Laminar: A scalable asynchronous rl post-training framework. arXiv preprint arXiv:2510.12633, 2025. [25] G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. 
In Proceedings of the Twentieth European Conference"},{"citing_arxiv_id":"2605.06534","ref_index":59,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL","primary_cat":"cs.DC","submitted_at":"2026-05-07T16:33:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"without expanding the training cluster. 3.2 Serving Capacity Availability We next quantify how much serving capacity is empirically harvestable for the cooperative elasticity of rollouts. Fluctuating Serving Traffic Leads to GPU Underutilization. Production LLM serving workloads exhibit fluctuating request rates [47, 63, 66, 74, 82]. Figure 3a plots a 24-hour Microsoft trace [60] at minute granularity alongside three zoomed-in 5-minute windows at per-second granularity. At the minute level, the peak rate reaches 1.7× the 24-hour average. 
At the second level, burstiness is far more pronounced: per-second peaks reach 4.22×, 1.58×, and 1.73× their respective window averages, consistent with second-level spikes reported by BurstGPT [66]."},{"citing_arxiv_id":"2604.26256","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training","primary_cat":"cs.LG","submitted_at":"2026-04-29T03:25:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DORA's multi-version streaming rollout enables 2-3x higher throughput in asynchronous RL for LLMs while preserving convergence by maintaining policy consistency, data integrity, and bounded staleness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23838","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training","primary_cat":"cs.LG","submitted_at":"2026-04-26T18:45:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"which leads to underutilized GPU resources. 
Off-policy algorithms [29, 39] break the strict dependency between stages, enabling asynchronous rollout and training so that idle GPUs can proceed with subsequent-stage computation instead of blocking. AReaL [13] improves pipeline efficiency by discarding overlong samples and recomputing them later to mitigate the long-tail effect. Laminar [44] proposes fully asynchronous rollout and trainer instances to break barriers between stages, leveraging relay buffers to support fine-grained weight updates and isolate long-tail samples. RLinf [62] enables more flexible data and stage partitioning at a finer granularity, achieving dynamic spatiotemporal scheduling within a single RL pipeline."},{"citing_arxiv_id":"2604.09107","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training","primary_cat":"cs.DC","submitted_at":"2026-04-10T08:40:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"actions among multiple modules, aiming to either enhance model performance or achieve acceleration. TensorHub has identified the challenges posed by the rapid escalation of such complexity to the scalability and stability of distributed systems. Drawing inspiration from the design principles of classical distributed filesystems [2, 9] and peer-to-peer storage [35], it introduces an abstraction layer with a storage interface. 
Through streamlined APIs, it can support complex scenarios such as asynchronous training, elastic rollout, and even heterogeneous GPU types across datacenters, while achieving excellent performance. Storage Systems for Machine Learning. Prior studies have explored techniques for building storage systems for"},{"citing_arxiv_id":"2602.05765","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training","primary_cat":"cs.AI","submitted_at":"2026-02-05T15:30:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.14617","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning","primary_cat":"cs.DC","submitted_at":"2025-11-18T16:12:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt similarity observations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}