pith. machine review for the scientific record.

arxiv: 2605.07330 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.DC

Recognition: no theorem link

SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication

Hscos Zhang, Hugh Yin, Isaac Zhu, Jason Zhao, Lucas Hu, Ranchi Zhao, Zach Zhang

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC
keywords sparse synchronization · weight updates · reinforcement learning · communication efficiency · distributed training · policy staleness · lossless updates · asynchronous RL

The pith

Sparse synchronization sends only changed weight indices and values to cut RL communication volume by about 100 times while reconstructing full weights exactly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reinforcement learning systems must regularly copy policy weights from the trainer to rollout workers to prevent acting on stale policies. When models are large and networks are slow or variable, shipping the entire dense weight vector becomes a major bottleneck for throughput. The paper observes that between updates the actual changes touch only a small fraction of the elements; element-level sparsity is typically 99 percent or more. By packing just the positions and new values of those changes into a sparse payload that the rollout side can expand back to the exact original weights, the volume of data moved shrinks dramatically. This keeps the learning dynamics identical to full synchronization while making the process viable in cross-cluster or bandwidth-constrained deployments.
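
To make the mechanism concrete, here is a minimal sketch of the encode/reconstruct cycle described above, assuming flat weight tensors on both sides; the function names and the PyTorch implementation are illustrative, not the authors' released code.

```python
import torch

def encode_sparse_update(old: torch.Tensor, new: torch.Tensor):
    """Pack only the changed elements of a flat weight tensor.

    Returns (indices, values): the positions where `new` differs from
    `old`, plus the new values at those positions. Because exact values
    are shipped, reconstruction on the receiver is lossless.
    """
    changed = old != new                          # element-wise change mask
    indices = changed.nonzero(as_tuple=True)[0]   # positions of changes
    return indices, new[indices]

def apply_sparse_update(weights: torch.Tensor, indices, values) -> None:
    """Rollout side: scatter the received values back in place."""
    weights[indices] = values

# Toy demonstration: one million BF16 weights, ~1% of them change.
old = torch.randn(1_000_000).bfloat16()
new = old.clone()
new[torch.randperm(old.numel())[:10_000]] += 0.1

idx, vals = encode_sparse_update(old, new)
recon = old.clone()
apply_sparse_update(recon, idx, vals)
assert torch.equal(recon, new)                    # bit-exact reconstruction
print(f"changed elements: {idx.numel() / old.numel():.2%}")
```

In a real system the index list would itself be encoded compactly (the editorial analysis below returns to exactly this point), but the lossless property requires nothing beyond exact positions and exact values.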

Core claim

In mainstream large-model RL training the locations where parameters actually change are highly sparse at the element level, with sparsity often 99 percent or more. SparseRL-Sync replaces full-weight transfers with a lossless sparse update payload of indices and values that can be exactly reconstructed on the inference side, preserving 100 percent fidelity. Under a simplified cost model this reduces the per-update communication volume from S to approximately S/X, where X is the reciprocal of the changed-element density; at 99 percent sparsity (X ≈ 100) this is about a 100x reduction in transmitted data, and bucketing further cuts launch and control-plane overhead.
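
Written out (a reconstruction of the abstract's simplified cost model; S and X follow the abstract's notation, the changed-element density d = 1/X is introduced here for clarity, and index-encoding overhead is ignored, as in the abstract):

```latex
% S: dense payload size, d: changed-element density, X = 1/d.
% Index-encoding cost is treated as zero, per the simplified model.
\[
  V_{\mathrm{sparse}} \approx d \cdot S = \frac{S}{X},
  \qquad
  d = 0.01 \;\Rightarrow\; X = 100 \;\Rightarrow\; V_{\mathrm{sparse}} \approx \frac{S}{100}.
\]
```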

What carries the argument

The lossless sparse update payload of parameter indices paired with their new values, sent in place of dense weights and grouped via bucketing to reduce overhead.

If this is right

  • Per-update communication volume falls from full size S to roughly S/X, i.e. about S/100 when sparsity reaches 99 percent (X ≈ 100).
  • Launch and control-plane overhead shrinks because payloads are smaller and can be bucketed.
  • Scalability improves in bandwidth-limited, cross-datacenter, or highly asynchronous RL settings.
  • Policy fidelity stays identical to full-weight synchronization, so training quality is unaffected.
  • End-to-end throughput rises when weight synchronization previously dominated the timeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparse-payload approach could reduce communication in other distributed training workloads if their update patterns exhibit comparable element-wise sparsity.
  • Variable-bandwidth environments such as online RL or heterogeneous clusters would see the largest relative gains in tail latency.
  • Combining the index-value format with further encoding of the index list itself might yield additional savings beyond the basic 100x factor.

Load-bearing premise

The element-level locations of actual parameter changes remain highly sparse, around 99 percent or more, consistently across training steps and model scales.

What would settle it

Measure the fraction of parameters whose values differ by more than a small numerical threshold between successive policy updates in a production-scale RL run; if the average sparsity falls below roughly 90 percent, the claimed reduction factor would not hold.
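
A minimal version of that measurement, assuming two successive synchronized checkpoints are available as state dicts; the threshold argument and the comparison loop are illustrative choices, not the paper's protocol.

```python
import torch

def update_sparsity(prev_sd: dict, curr_sd: dict, atol: float = 0.0) -> float:
    """Fraction of elements that did NOT change between two checkpoints.

    With atol=0.0 this counts any representable change (the lossless
    setting); a small positive atol instead probes near-threshold noise.
    """
    changed, total = 0, 0
    for name, prev in prev_sd.items():
        diff = (prev.float() - curr_sd[name].float()).abs()
        changed += (diff > atol).sum().item()
        total += prev.numel()
    return 1.0 - changed / total

# Usage sketch (hypothetical checkpoint files from steps t and t+1):
# s = update_sparsity(torch.load("policy_t.pt"), torch.load("policy_t1.pt"))
# By the criterion above, s persistently below ~0.90 would undercut the
# claimed ~100x reduction factor.
```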

Figures

Figures reproduced from arXiv: 2605.07330 by Hscos Zhang, Hugh Yin, Isaac Zhu, Jason Zhao, Lucas Hu, Ranchi Zhao, Zach Zhang.

Figure 1
Figure 1. Core result at a glance. SparseRL-Sync reduces the Trainer-to-Rollout weight-synchronization payload by 32×–54× raw and up to ≈100× after lossless compression across model scales (left) while preserving training dynamics bit-exactly (right). Code will be released at https://github.com/scitix/helix.
Figure 2
Figure 2. Trainer–Rollout weight-synchronization workflow. Left column: the full-update baseline pipeline used by open-source RL frameworks such as slime. Right column: the SparseRL-Sync pipeline, with newly inserted steps highlighted in vermillion. The center column shows the physical topology shared by both: M Trainer stages (PP size) → Ray + process group → N Rollout ranks. Each Trainer stage contributes K buckets …
Figure 3
Figure 3. Estimated wall-clock cost of a single full-weight (BF16) parameter update for representative open models under different per-node aggregated NIC bandwidths, including Qwen3-30B-A3B (30B), Nemotron-3-Super-120B-A12B (120B), MiniMax-M2.5 (230B), Qwen3.5-397B-A17B (397B), DeepSeek-V3.1 (671B), and Kimi K2.5 (1TB). As model size increases and available bandwidth decreases, the synchronization cost rises sharply…
Figure 4
Figure 4. Precision-gated sparsity gap over synchronization steps. (a) BF16 model parameters synchronized to Rollout have sub-1% changed-element density; (b) FP32 master weights on the Trainer side remain near-dense throughout. Both panels share the same algorithm legend, shown below each subfigure. This precision-gated gap is the foundation of the lossless sparse-sync…
Figure 5
Figure 5. Element-level changed-element density under different synchronization precisions, on a log scale so that the FP16 / BF16 / FP8 differences remain visible. Values shown are measured on Qwen3-30B-A3B over a GRPO run. (Panel title: precision controls visible sparsity.)
Figure 6
Figure 6. Tensor-level inactive ratio over synchronization steps. Only about 5%–6% of parameter tensors have no changed elements, so more than 94% of tensors still contain at least one changed element at each synchronization point; the observed sparsity is structural within tensors, not between them. …
Figure 7
Figure 7. BF16 element-level sparsity across the four model scales (8B, 30B, 106B, 671B) over synchronization steps. All models exhibit high sparsity (≥ 98%) from the first step, and sparsity tends to increase over training. The 671B model reaches the highest observed sparsity, consistent with larger pretrained weights having more mass in the sub-threshold regime that gets absorbed by the BF16 cast. Sparsity across…
Figure 8
Figure 8. Temporal locality of update indices on 30B (GRPO). For each sync step t and each parameter tensor, we compute the locality ratio $|I_t \cap \bigcup_{s<t} I_s| \,/\, |I_t|$, the fraction of the current changed indices that have appeared in any prior step. The three curves show the 25th, 50th (median), and 90th percentiles of this ratio across all parameter tensors. All three rise monotonically from ~45%–52% at step 1 to ~72%–7…
Figure 9
Figure 9. Three-Gate Theory of Zhu et al. (2025), reproduced here as the explanatory framework for our sparsity observations. The pretrained base model and the RL optimizer (top) jointly set a small-step, KL-bounded regime; three successive gates then act: Gate I (KL anchor) bounds step magnitude, Gate II (model geometry) routes the bounded update onto off-principal, low-curvature coordinates, and Gate III (BF16 precision) su…
Figure 10
Figure 10. Per-synchronization broadcast time under the two bandwidth regimes of Section 4.1. 106B is measured on 128 H100 GPUs in separated mode (64 Trainer + 64 Rollout); 671B is projected from the 106B effective bandwidth (hatched bars). Numbers on top of each SparseRL-Sync bar are speedups over the corresponding full-update baseline. Note the log-scale y-axis. Findings: three observations stand out. (i) IB-off b…
Original abstract

In large-scale reinforcement learning (RL) systems with decoupled Trainer-Rollout execution, the Trainer must regularly synchronize policy weights to the Rollout side to limit policy staleness. When inter-node bandwidth is abundant, such synchronization is usually only a small fraction of end-to-end cost. As model size grows, however, the communication demand rises rapidly. In bandwidth-constrained or network-variable deployments -- for example, cross-datacenter or cross-cluster settings, heterogeneous resource pools, and online RL -- weight synchronization can become a dominant bottleneck for throughput and tail latency. We observe that, in mainstream large-model RL training, the locations where parameters actually change are highly sparse at the element level (often 99%+ sparsity). Building on this observation, we propose and implement SparseRL-Sync, which replaces full-weight transfers with a lossless sparse update payload (indices and values) that can be exactly reconstructed on the inference side, thereby preserving 100% fidelity. Under a simplified cost model, sparse synchronization reduces the per-update communication volume from S to approximately S/X; with 99% sparsity (X ~ 100), this yields about a 100x reduction in transmitted data. Combined with appropriate bucketing, SparseRL-Sync also reduces launch and control-plane overhead, significantly improving scalability and end-to-end efficiency in bandwidth-limited and highly asynchronous RL settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in large-scale RL with decoupled Trainer-Rollout execution, parameter updates exhibit high element-level sparsity (often 99%+). SparseRL-Sync replaces full weight transfers with a lossless sparse payload of indices and values that can be exactly reconstructed, preserving 100% fidelity. Under a simplified cost model, this reduces per-update communication volume from S to approximately S/X, yielding ~100x savings at 99% sparsity (X~100); bucketing further reduces launch and control-plane overhead in bandwidth-constrained or asynchronous settings.

Significance. If the sparsity observation proves consistent across steps and scales and sparse handling overhead remains low, the approach could meaningfully improve throughput and scalability for RL training in cross-datacenter, heterogeneous, or online settings by addressing communication bottlenecks without fidelity loss. The lossless reconstruction property is a clear strength.

major comments (2)
  1. [Abstract] The central quantitative claim that sparse synchronization reduces volume from S to S/X (~100x at 99% sparsity) relies on a simplified cost model that sets index transmission cost to zero. With standard 32-bit indices and 32-bit float values, each changed element costs 8 bytes; at 1% density the transmitted volume is 0.01 × (8/4) = 0.02 of the dense size, for only a 50x reduction. The S/X approximation is therefore overstated unless a specific low-overhead indexing scheme (e.g., bitmap or delta-encoded) is defined and analyzed (see the worked arithmetic after this list).
  2. [Abstract] No experimental measurements, sparsity statistics across training steps or model scales, overhead benchmarks, or end-to-end throughput results are supplied to support the 99%+ sparsity observation or to validate that reconstruction preserves fidelity at scale. This leaves the empirical premise of the ~100x claim unverified.
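
Checking the referee's arithmetic and extending it to a low-overhead index encoding (a back-of-envelope sketch; the entropy estimate is a standard information-theoretic bound, not a figure from the paper):

```python
import math

def reduction(density: float, value_bytes: float,
              index_bytes_per_changed: float, dense_bytes: float) -> float:
    """Dense bytes divided by sparse bytes, per element of the full tensor."""
    return dense_bytes / (density * (value_bytes + index_bytes_per_changed))

d = 0.01  # 1% changed elements, i.e. 99% sparsity (X = 100)

# Abstract's simplified model: index cost treated as zero -> exactly 100x.
print(reduction(d, 4, 0, 4))   # 100.0

# Referee's case: FP32 values + 32-bit indices -> 50x.
print(reduction(d, 4, 4, 4))   # 50.0

# Entropy-coded index set (e.g., delta + varint or range coding): the
# index positions cost at least ~H(d) bits per tensor element.
H = -(d * math.log2(d) + (1 - d) * math.log2(1 - d))  # ~0.081 bits/element
print(reduction(d, 4, H / 8 / d, 4))                  # ~79.8x
```

So with 32-bit indices the realizable factor at 99% sparsity is about 50x, and even an entropy-optimal index encoding tops out near 80x against an FP32 baseline; the ~100x figure is best read as a value-only upper bound.
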
minor comments (2)
  1. The description of bucketing and its interaction with sparse payloads to reduce control-plane overhead would benefit from a concrete example or pseudocode (a sketch follows this list).
  2. Notation for the cost model (S, X) should be defined explicitly when first introduced.
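
On the first minor comment, one plausible shape for the bucketing step is sketched below. This is our illustration under assumed details, not the authors' implementation: the 64 MiB bucket size is an invented tuning knob and send_one_transfer is a hypothetical transport call.

```python
from typing import Iterator, List, Tuple
import torch

BUCKET_BYTES = 64 << 20  # 64 MiB per bucket -- assumed, not from the paper

def bucketize(payload: List[Tuple[str, torch.Tensor, torch.Tensor]]
              ) -> Iterator[List[Tuple[str, torch.Tensor, torch.Tensor]]]:
    """Group per-tensor (name, indices, values) triples into size-capped buckets.

    Each bucket goes out as a single transfer, so launch count and
    control-plane metadata scale with the number of buckets rather than
    the number of parameter tensors.
    """
    bucket, size = [], 0
    for name, idx, vals in payload:
        nbytes = idx.numel() * idx.element_size() + vals.numel() * vals.element_size()
        if bucket and size + nbytes > BUCKET_BYTES:
            yield bucket
            bucket, size = [], 0
        bucket.append((name, idx, vals))
        size += nbytes
    if bucket:
        yield bucket

# Usage sketch:
# for bucket in bucketize(sparse_payload):  # sparse_payload: per-tensor triples
#     send_one_transfer(bucket)             # hypothetical transport call
```

Amortizing many small per-tensor payloads into a few size-capped transfers is the standard remedy for launch overhead; the same idea underlies gradient bucketing in data-parallel training.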

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive feedback. The two major comments highlight important aspects of our presentation that require clarification and qualification. We address each point below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central quantitative claim that sparse synchronization reduces volume from S to S/X (~100x at 99% sparsity) relies on a simplified cost model that sets index transmission cost to zero. With standard 32-bit indices and 32-bit float values, each changed element costs 8 bytes; at 1% density the transmitted volume is 0.01 × (8/4) = 0.02 of dense size, for only a 50x reduction. The S/X approximation is therefore overstated unless a specific low-overhead indexing scheme (e.g., bitmap or delta-encoded) is defined and analyzed.

    Authors: We agree that the abstract employs a simplified cost model that focuses on the value payload and treats index overhead as secondary. The manuscript describes the payload as indices plus values but does not specify or analyze a particular index encoding. We will revise the abstract and add a short paragraph in the main text to explicitly state the assumptions of the model, provide the more accurate 32-bit index + 32-bit value calculation the referee notes, and discuss practical low-overhead schemes (compressed bitmaps, run-length encoding, or delta indexing) that can substantially reduce index cost at high sparsity. This will qualify the ~100x figure as an upper-bound under the simplified model while showing how closer-to-ideal savings remain achievable. revision: partial

  2. Referee: [Abstract] No experimental measurements, sparsity statistics across training steps or model scales, overhead benchmarks, or end-to-end throughput results are supplied to support the 99%+ sparsity observation or to validate that reconstruction preserves fidelity at scale. This leaves the empirical premise of the ~100x claim unverified.

    Authors: The sparsity observation is drawn from our internal large-scale RL training runs, and the lossless reconstruction follows directly from transmitting exact indices and values. However, the current manuscript presents these as motivating observations without accompanying statistics, overhead measurements, or end-to-end results. We will revise the text to (1) qualify the 99%+ figure as an observed range rather than a universal claim, (2) add a brief discussion of the source of the observation with illustrative (non-proprietary) examples, and (3) explicitly note that comprehensive benchmarks are left for future work. We cannot introduce new large-scale experiments in this revision cycle. revision: partial

standing simulated objections not resolved
  • Absence of quantitative sparsity statistics, overhead benchmarks, and end-to-end throughput measurements to support the empirical claims.

Circularity Check

0 steps flagged

No circularity: central claim follows from external empirical sparsity observation

full rationale

The paper states an empirical observation of 99%+ element-level sparsity in parameter updates during large-model RL training as an external fact, then applies a simplified cost model to conclude that sparse synchronization reduces volume from S to approximately S/X (with X~100 yielding ~100x reduction). This scaling is a direct arithmetic consequence of the input sparsity level rather than any self-referential derivation, fitted parameter, or self-citation chain. No equations, uniqueness theorems, or ansatzes are introduced that reduce the result to the paper's own outputs by construction. The derivation remains self-contained against the stated observation and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on an empirical domain observation of high sparsity and a simplified cost model; no new mathematical axioms or invented physical entities are introduced.

free parameters (1)
  • sparsity factor X
    Taken directly from the stated 99%+ element-level sparsity observation in large-model RL training; used to compute the 100x reduction.
axioms (1)
  • domain assumption: Parameter updates exhibit high element-level sparsity (99%+) in mainstream large-model RL training
    Invoked as the enabling observation for replacing full transfers with sparse payloads

pith-pipeline@v0.9.0 · 5559 in / 1398 out tokens · 46883 ms · 2026-05-11T01:20:58.719349+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

  1. [3]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 2022.

  2. [4]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, 2017. URL https://arxiv.org/abs/1707.06347.

  3. [5]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, and others. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 2024. arXiv:2402.03300.

  4. [6]

    DAPO: An open-source LLM reinforcement learning system at scale

    Yu et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025.

  5. [7]

    Reasoning-burden of long chain-of-thought RL fine-tuning

    Reasoning-burden of long chain-of-thought RL fine-tuning, 2025.

  6. [8]

    VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment

    Amirhossein Kazemnejad and others. VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment, 2024.

  7. [9]

    SAPO: Soft asymmetric policy optimization

    Gao et al. SAPO: Soft asymmetric policy optimization, 2025.

  8. [10]

    GSPO: Sequence-level group sequence policy optimization

    Zheng et al. GSPO: Sequence-level group sequence policy optimization, 2025.

  9. [11]

    TOPR: Tapered off-policy REINFORCE for stable off-policy learning

    Nicolas Le Roux et al. TOPR: Tapered off-policy REINFORCE for stable off-policy learning, 2025.

  10. [12]

    Tapered importance weights for off-policy …

    Arnal et al. Tapered importance weights for off-policy …

  11. [13]

    ASPO: Asymmetric importance-ratio correction for policy optimization

    Wang et al. ASPO: Asymmetric importance-ratio correction for policy optimization, 2025.

  12. [14]

    A3PO: Adaptive advantage shaping for policy optimization

    Tang et al. A3PO: Adaptive advantage shaping for policy optimization, 2025.

  13. [15]

    NGRPO: Negative-aware group relative policy optimization

    Nan et al. NGRPO: Negative-aware group relative policy optimization, 2025.

  14. [16]

    2024.

  15. [17]

    Composer2: Multi-cluster RL training at Cursor

    Cursor. Composer2: Multi-cluster RL training at Cursor. Technical report, 2024.

  16. [18]

    1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

    Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.

  17. [19]

    TernGrad: Ternary gradients to reduce communication in distributed deep learning

    Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, 2017.

  18. [20]

    Sparse Communication for Distributed Gradient Descent

    Alham Fikri Aji and Kenneth Heafield. Sparse Communication for Distributed Gradient Descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.

  19. [21]

    Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

    Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In International Conference on Learning Representations (ICLR), 2018.

  20. [22]

    PowerSGD: Practical low-rank gradient compression for distributed optimization

    Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, 2019.

  21. [23]

    Zstandard Compression and the application/zstd Media Type

    Yann Collet and Murray Kucherawy. Zstandard Compression and the application/zstd Media Type. RFC 8878, 2021.

  22. [24]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019. arXiv:1909.08053.

  23. [25]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel, 2023.

  24. [26]

    Helix: An RL training framework

    Scitix. Helix: An RL training framework. GitHub repository, 2026. URL https://github.com/scitix/helix.

  25. [27]

    Sparse communication for distributed gradient descent

    Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.

  26. [28]

    AReaL: Towards fully asynchronous reinforcement learning for large language models

    Anonymous. AReaL: Towards fully asynchronous reinforcement learning for large language models, 2024a. TODO: confirm citation key and arXiv id.

  27. [29]

    ROLL: Heterogeneous reinforcement learning for large models

    Anonymous. ROLL: Heterogeneous reinforcement learning for large models, 2024b. TODO: confirm citation key and arXiv id.

  28. [30]

    AWex: Asynchronous weight exchange for large-model RL training

    Ant Group / inclusionAI. AWex: Asynchronous weight exchange for large-model RL training. GitHub repository, 2024. URL https://github.com/inclusionAI/asystem-awex.

  29. [31]

    Composer2: Multi-cluster RL training at Cursor

    Cursor. Composer2: Multi-cluster RL training at Cursor. Technical report, 2024. TODO: replace with the canonical URL once published.

  30. [32]

    SAPO: Soft asymmetric policy optimization

    Gao et al. SAPO: Soft asymmetric policy optimization, 2025. TODO: confirm citation; sigmoid-based soft gating.

  31. [33]

    VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment

    Amirhossein Kazemnejad et al. VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment, 2024. TODO: confirm citation; cited in info.md as Kazemnejad et al., 2024.

  32. [34]

    TOPR: Tapered off-policy REINFORCE for stable off-policy learning

    Nicolas Le Roux et al. TOPR: Tapered off-policy REINFORCE for stable off-policy learning, 2025. TODO: confirm citation.

  33. [35]

    Deep gradient compression: Reducing the communication bandwidth for distributed training

    Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations (ICLR), 2018.

  34. [36]

    Understanding and exploiting weight update sparsity for communication-efficient distributed RL

    Erfan Miahi and Eugene Belilovsky. Understanding and exploiting weight update sparsity for communication-efficient distributed RL, 2026. URL https://arxiv.org/abs/2602.03839.

  35. [37]

    Kimi checkpoint engine

    Moonshot AI. Kimi checkpoint engine. GitHub repository, 2024. URL https://github.com/MoonshotAI/checkpoint-engine.

  36. [38]

    NGRPO: Negative-aware group relative policy optimization

    Nan et al. NGRPO: Negative-aware group relative policy optimization, 2025. TODO: confirm citation.

  37. [39]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 2022.

  38. [40]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, 2017. URL https://arxiv.org/abs/1707.06347.

  39. [41]

    Helix: An RL training framework

    Scitix. Helix: An RL training framework. GitHub repository, 2026. URL https://github.com/scitix/helix. Repository to be released; placeholder URL.

  40. [42]

    1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

    Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.

  41. [43]

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. Introduces Group Relative Policy Optimization (GRPO).

  42. [44]

    A3PO: Adaptive advantage shaping for policy optimization

    Tang et al. A3PO: Adaptive advantage shaping for policy optimization, 2025. TODO: confirm citation.

  43. [45]

    slime: an open-source framework for large-model reinforcement learning

    THU-DCST. slime: an open-source framework for large-model reinforcement learning. GitHub repository, 2024. URL https://github.com/THU-DCST/slime. TODO: confirm canonical citation and version commit.

  44. [46]

    PowerSGD: Practical low-rank gradient compression for distributed optimization

    Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, 2019.

  45. [47]

    ASPO: Asymmetric importance-ratio correction for policy optimization

    Wang et al. ASPO: Asymmetric importance-ratio correction for policy optimization, 2025. TODO: confirm citation.

  46. [48]

    TernGrad: Ternary gradients to reduce communication in distributed deep learning

    Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, 2017.

  47. [49]

    DAPO: An open-source LLM reinforcement learning system at scale

    Yu et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025. TODO: confirm full author list and arXiv id.

  48. [50]

    GSPO: Sequence-level group sequence policy optimization

    Zheng et al. GSPO: Sequence-level group sequence policy optimization, 2025. TODO: confirm citation.

  49. [51]

    The path not taken: RLVR provably learns off the principals

    Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai. The path not taken: RLVR provably learns off the principals, 2025. URL https://arxiv.org/abs/2511.08567.