pith. machine review for the scientific record.

arxiv: 2605.07330 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.DC

Recognition: no theorem link

SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication

Hscos Zhang, Hugh Yin, Isaac Zhu, Jason Zhao, Lucas Hu, Ranchi Zhao, Zach Zhang

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC
keywords sparse synchronization · weight updates · reinforcement learning · communication efficiency · distributed training · policy staleness · lossless updates · asynchronous RL

The pith

Sparse synchronization sends only changed weight indices and values to cut RL communication volume by about 100 times while reconstructing full weights exactly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reinforcement learning systems must regularly copy policy weights from the trainer to rollout workers to prevent acting on stale policies. When models are large and networks are slow or variable, shipping the entire dense weight vector becomes a major bottleneck for throughput. The paper observes that between updates the actual changes touch only a small fraction of the elements; element-level sparsity is typically 99 percent or more. By packing just the positions and new values of those changes into a sparse payload that the rollout side can expand back to the exact original weights, the volume of data moved shrinks dramatically. This keeps the learning dynamics identical to full synchronization while making the process viable in cross-cluster or bandwidth-constrained deployments.
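
To make the mechanism concrete, here is a minimal sketch of the encode/reconstruct cycle described above, assuming flat weight tensors on both sides; the function names and the PyTorch implementation are illustrative, not the authors' released code.

```python
import torch

def encode_sparse_update(old: torch.Tensor, new: torch.Tensor):
    """Pack only the changed elements of a flat weight tensor.

    Returns (indices, values): the positions where `new` differs from
    `old`, plus the new values at those positions. Because exact values
    are shipped, reconstruction on the receiver is lossless.
    """
    changed = old != new                          # element-wise change mask
    indices = changed.nonzero(as_tuple=True)[0]   # positions of changes
    return indices, new[indices]

def apply_sparse_update(weights: torch.Tensor, indices, values) -> None:
    """Rollout side: scatter the received values back in place."""
    weights[indices] = values

# Toy demonstration: one million BF16 weights, ~1% of them change.
old = torch.randn(1_000_000).bfloat16()
new = old.clone()
new[torch.randperm(old.numel())[:10_000]] += 0.1

idx, vals = encode_sparse_update(old, new)
recon = old.clone()
apply_sparse_update(recon, idx, vals)
assert torch.equal(recon, new)                    # bit-exact reconstruction
print(f"changed elements: {idx.numel() / old.numel():.2%}")
```

In a real system the index list would itself be encoded compactly (the editorial analysis below returns to exactly this point), but the lossless property requires nothing beyond exact positions and exact values.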

Core claim

In mainstream large-model RL training the locations where parameters actually change are highly sparse at the element level, with sparsity often 99 percent or more. SparseRL-Sync replaces full-weight transfers with a lossless sparse update payload of indices and values that can be exactly reconstructed on the inference side, preserving 100 percent fidelity. Under a simplified cost model this reduces the per-update communication volume from S to approximately S/X, where X is the reciprocal of the changed-element density; at 99 percent sparsity (X ≈ 100) this is about a 100x reduction in transmitted data, and bucketing further cuts launch and control-plane overhead.
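
Written out (a reconstruction of the abstract's simplified cost model; S and X follow the abstract's notation, the changed-element density d = 1/X is introduced here for clarity, and index-encoding overhead is ignored, as in the abstract):

```latex
% S: dense payload size, d: changed-element density, X = 1/d.
% Index-encoding cost is treated as zero, per the simplified model.
\[
  V_{\mathrm{sparse}} \approx d \cdot S = \frac{S}{X},
  \qquad
  d = 0.01 \;\Rightarrow\; X = 100 \;\Rightarrow\; V_{\mathrm{sparse}} \approx \frac{S}{100}.
\]
```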

What carries the argument

The lossless sparse update payload of parameter indices paired with their new values, sent in place of dense weights and grouped via bucketing to reduce overhead.

If this is right

  • Per-update communication volume falls from full size S to roughly S/X, i.e. about S/100 when sparsity reaches 99 percent (X ≈ 100).
  • Launch and control-plane overhead shrinks because payloads are smaller and can be bucketed.
  • Scalability improves in bandwidth-limited, cross-datacenter, or highly asynchronous RL settings.
  • Policy fidelity stays identical to full-weight synchronization, so training quality is unaffected.
  • End-to-end throughput rises when weight synchronization previously dominated the timeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparse-payload approach could reduce communication in other distributed training workloads if their update patterns exhibit comparable element-wise sparsity.
  • Variable-bandwidth environments such as online RL or heterogeneous clusters would see the largest relative gains in tail latency.
  • Combining the index-value format with further encoding of the index list itself might yield additional savings beyond the basic 100x factor.

Load-bearing premise

The element-level locations of actual parameter changes remain highly sparse, around 99 percent or more, consistently across training steps and model scales.

What would settle it

Measure the fraction of parameters whose values differ by more than a small numerical threshold between successive policy updates in a production-scale RL run; if the average sparsity falls below roughly 90 percent, the claimed reduction factor would not hold.
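
A minimal version of that measurement, assuming two successive synchronized checkpoints are available as state dicts; the threshold argument and the comparison loop are illustrative choices, not the paper's protocol.

```python
import torch

def update_sparsity(prev_sd: dict, curr_sd: dict, atol: float = 0.0) -> float:
    """Fraction of elements that did NOT change between two checkpoints.

    With atol=0.0 this counts any representable change (the lossless
    setting); a small positive atol instead probes near-threshold noise.
    """
    changed, total = 0, 0
    for name, prev in prev_sd.items():
        diff = (prev.float() - curr_sd[name].float()).abs()
        changed += (diff > atol).sum().item()
        total += prev.numel()
    return 1.0 - changed / total

# Usage sketch (hypothetical checkpoint files from steps t and t+1):
# s = update_sparsity(torch.load("policy_t.pt"), torch.load("policy_t1.pt"))
# By the criterion above, s persistently below ~0.90 would undercut the
# claimed ~100x reduction factor.
```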

Figures

Figures reproduced from arXiv: 2605.07330 by Hscos Zhang, Hugh Yin, Isaac Zhu, Jason Zhao, Lucas Hu, Ranchi Zhao, Zach Zhang.

Figure 1
Figure 1. Core result at a glance. SparseRL-Sync reduces the Trainer-to-Rollout weight-synchronization payload by 32×–54× raw and up to ≈100× after lossless compression across model scales (left) while preserving training dynamics bit-exactly (right). Code will be released at https://github.com/scitix/helix.
Figure 2
Figure 2. Trainer–Rollout weight-synchronization workflow. Left column: the full-update baseline pipeline used by open-source RL frameworks such as slime. Right column: the SparseRL-Sync pipeline, with newly inserted steps highlighted in vermillion. The center column shows the physical topology shared by both: M Trainer stages (PP size) → Ray + process group → N Rollout ranks. Each Trainer stage contributes K buckets …
Figure 3
Figure 3. Estimated wall-clock cost of a single full-weight (BF16) parameter update for representative open models under different per-node aggregated NIC bandwidths, including Qwen3-30B-A3B (30B), Nemotron-3-Super-120B-A12B (120B), MiniMax-M2.5 (230B), Qwen3.5-397B-A17B (397B), DeepSeek-V3.1 (671B), and Kimi K2.5 (1TB). As model size increases and available bandwidth decreases, the synchronization cost rises sharply…
Figure 4
Figure 4. Precision-gated sparsity gap over synchronization steps. (a) BF16 model parameters synchronized to Rollout have sub-1% changed-element density; (b) FP32 master weights on the Trainer side remain near-dense throughout. Both panels share the same algorithm legend, shown below each subfigure. This precision-gated gap is the foundation of the lossless sparse-sync…
Figure 5
Figure 5. Element-level changed-element density under different synchronization precisions, on a log scale so that the FP16 / BF16 / FP8 differences remain visible. Values shown are measured on Qwen3-30B-A3B over a GRPO run. (Panel title: precision controls visible sparsity.)
Figure 6
Figure 6. Tensor-level inactive ratio over synchronization steps. Only about 5%–6% of parameter tensors have no changed elements, so more than 94% of tensors still contain at least one changed element at each synchronization point; the observed sparsity is structural within tensors, not between them. …
Figure 7
Figure 7. BF16 element-level sparsity across the four model scales (8B, 30B, 106B, 671B) over synchronization steps. All models exhibit high sparsity (≥ 98%) from the first step, and sparsity tends to increase over training. The 671B model reaches the highest observed sparsity, consistent with larger pretrained weights having more mass in the sub-threshold regime that gets absorbed by the BF16 cast. Sparsity across…
Figure 8
Figure 8. Temporal locality of update indices on 30B (GRPO). For each sync step t and each parameter tensor, we compute the locality ratio $|I_t \cap \bigcup_{s<t} I_s| \,/\, |I_t|$, the fraction of the current changed indices that have appeared in any prior step. The three curves show the 25th, 50th (median), and 90th percentiles of this ratio across all parameter tensors. All three rise monotonically from ~45%–52% at step 1 to ~72%–7…
Figure 9
Figure 9. Three-Gate Theory of Zhu et al. (2025), reproduced here as the explanatory framework for our sparsity observations. The pretrained base model and the RL optimizer (top) jointly set a small-step, KL-bounded regime; three successive gates then act: Gate I (KL anchor) bounds step magnitude, Gate II (model geometry) routes the bounded update onto off-principal, low-curvature coordinates, and Gate III (BF16 precision) su…
Figure 10
Figure 10. Per-synchronization broadcast time under the two bandwidth regimes of Section 4.1. 106B is measured on 128 H100 GPUs in separated mode (64 Trainer + 64 Rollout); 671B is projected from the 106B effective bandwidth (hatched bars). Numbers on top of each SparseRL-Sync bar are speedups over the corresponding full-update baseline. Note the log-scale y-axis. Findings: three observations stand out. (i) IB-off b…
Original abstract

In large-scale reinforcement learning (RL) systems with decoupled Trainer-Rollout execution, the Trainer must regularly synchronize policy weights to the Rollout side to limit policy staleness. When inter-node bandwidth is abundant, such synchronization is usually only a small fraction of end-to-end cost. As model size grows, however, the communication demand rises rapidly. In bandwidth-constrained or network-variable deployments -- for example, cross-datacenter or cross-cluster settings, heterogeneous resource pools, and online RL -- weight synchronization can become a dominant bottleneck for throughput and tail latency. We observe that, in mainstream large-model RL training, the locations where parameters actually change are highly sparse at the element level (often 99%+ sparsity). Building on this observation, we propose and implement SparseRL-Sync, which replaces full-weight transfers with a lossless sparse update payload (indices and values) that can be exactly reconstructed on the inference side, thereby preserving 100% fidelity. Under a simplified cost model, sparse synchronization reduces the per-update communication volume from S to approximately S/X; with 99% sparsity (X ~ 100), this yields about a 100x reduction in transmitted data. Combined with appropriate bucketing, SparseRL-Sync also reduces launch and control-plane overhead, significantly improving scalability and end-to-end efficiency in bandwidth-limited and highly asynchronous RL settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in large-scale RL with decoupled Trainer-Rollout execution, parameter updates exhibit high element-level sparsity (often 99%+). SparseRL-Sync replaces full weight transfers with a lossless sparse payload of indices and values that can be exactly reconstructed, preserving 100% fidelity. Under a simplified cost model, this reduces per-update communication volume from S to approximately S/X, yielding ~100x savings at 99% sparsity (X~100); bucketing further reduces launch and control-plane overhead in bandwidth-constrained or asynchronous settings.

Significance. If the sparsity observation proves consistent across steps and scales and sparse handling overhead remains low, the approach could meaningfully improve throughput and scalability for RL training in cross-datacenter, heterogeneous, or online settings by addressing communication bottlenecks without fidelity loss. The lossless reconstruction property is a clear strength.

major comments (2)
  1. [Abstract] The central quantitative claim that sparse synchronization reduces volume from S to S/X (~100x at 99% sparsity) relies on a simplified cost model that sets index transmission cost to zero. With standard 32-bit indices and 32-bit float values, each changed element costs 8 bytes; at 1% density the transmitted volume is 0.01 × (8/4) = 0.02 of the dense size, for only a 50x reduction. The S/X approximation is therefore overstated unless a specific low-overhead indexing scheme (e.g., bitmap or delta-encoded) is defined and analyzed (see the worked arithmetic after this list).
  2. [Abstract] No experimental measurements, sparsity statistics across training steps or model scales, overhead benchmarks, or end-to-end throughput results are supplied to support the 99%+ sparsity observation or to validate that reconstruction preserves fidelity at scale. This leaves the empirical premise of the ~100x claim unverified.
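
Checking the referee's arithmetic and extending it to a low-overhead index encoding (a back-of-envelope sketch; the entropy estimate is a standard information-theoretic bound, not a figure from the paper):

```python
import math

def reduction(density: float, value_bytes: float,
              index_bytes_per_changed: float, dense_bytes: float) -> float:
    """Dense bytes divided by sparse bytes, per element of the full tensor."""
    return dense_bytes / (density * (value_bytes + index_bytes_per_changed))

d = 0.01  # 1% changed elements, i.e. 99% sparsity (X = 100)

# Abstract's simplified model: index cost treated as zero -> exactly 100x.
print(reduction(d, 4, 0, 4))   # 100.0

# Referee's case: FP32 values + 32-bit indices -> 50x.
print(reduction(d, 4, 4, 4))   # 50.0

# Entropy-coded index set (e.g., delta + varint or range coding): the
# index positions cost at least ~H(d) bits per tensor element.
H = -(d * math.log2(d) + (1 - d) * math.log2(1 - d))  # ~0.081 bits/element
print(reduction(d, 4, H / 8 / d, 4))                  # ~79.8x
```

So with 32-bit indices the realizable factor at 99% sparsity is about 50x, and even an entropy-optimal index encoding tops out near 80x against an FP32 baseline; the ~100x figure is best read as a value-only upper bound.
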
minor comments (2)
  1. The description of bucketing and its interaction with sparse payloads to reduce control-plane overhead would benefit from a concrete example or pseudocode (a sketch follows this list).
  2. Notation for the cost model (S, X) should be defined explicitly when first introduced.
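
On the first minor comment, one plausible shape for the bucketing step is sketched below. This is our illustration under assumed details, not the authors' implementation: the 64 MiB bucket size is an invented tuning knob and send_one_transfer is a hypothetical transport call.

```python
from typing import Iterator, List, Tuple
import torch

BUCKET_BYTES = 64 << 20  # 64 MiB per bucket -- assumed, not from the paper

def bucketize(payload: List[Tuple[str, torch.Tensor, torch.Tensor]]
              ) -> Iterator[List[Tuple[str, torch.Tensor, torch.Tensor]]]:
    """Group per-tensor (name, indices, values) triples into size-capped buckets.

    Each bucket goes out as a single transfer, so launch count and
    control-plane metadata scale with the number of buckets rather than
    the number of parameter tensors.
    """
    bucket, size = [], 0
    for name, idx, vals in payload:
        nbytes = idx.numel() * idx.element_size() + vals.numel() * vals.element_size()
        if bucket and size + nbytes > BUCKET_BYTES:
            yield bucket
            bucket, size = [], 0
        bucket.append((name, idx, vals))
        size += nbytes
    if bucket:
        yield bucket

# Usage sketch:
# for bucket in bucketize(sparse_payload):  # sparse_payload: per-tensor triples
#     send_one_transfer(bucket)             # hypothetical transport call
```

Amortizing many small per-tensor payloads into a few size-capped transfers is the standard remedy for launch overhead; the same idea underlies gradient bucketing in data-parallel training.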

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive feedback. The two major comments highlight important aspects of our presentation that require clarification and qualification. We address each point below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central quantitative claim that sparse synchronization reduces volume from S to S/X (~100x at 99% sparsity) relies on a simplified cost model that sets index transmission cost to zero. With standard 32-bit indices and 32-bit float values, each changed element costs 8 bytes; at 1% density the transmitted volume is 0.01 × (8/4) = 0.02 of dense size, for only a 50x reduction. The S/X approximation is therefore overstated unless a specific low-overhead indexing scheme (e.g., bitmap or delta-encoded) is defined and analyzed.

    Authors: We agree that the abstract employs a simplified cost model that focuses on the value payload and treats index overhead as secondary. The manuscript describes the payload as indices plus values but does not specify or analyze a particular index encoding. We will revise the abstract and add a short paragraph in the main text to explicitly state the assumptions of the model, provide the more accurate 32-bit index + 32-bit value calculation the referee notes, and discuss practical low-overhead schemes (compressed bitmaps, run-length encoding, or delta indexing) that can substantially reduce index cost at high sparsity. This will qualify the ~100x figure as an upper-bound under the simplified model while showing how closer-to-ideal savings remain achievable. revision: partial

  2. Referee: [Abstract] No experimental measurements, sparsity statistics across training steps or model scales, overhead benchmarks, or end-to-end throughput results are supplied to support the 99%+ sparsity observation or to validate that reconstruction preserves fidelity at scale. This leaves the empirical premise of the ~100x claim unverified.

    Authors: The sparsity observation is drawn from our internal large-scale RL training runs, and the lossless reconstruction follows directly from transmitting exact indices and values. However, the current manuscript presents these as motivating observations without accompanying statistics, overhead measurements, or end-to-end results. We will revise the text to (1) qualify the 99%+ figure as an observed range rather than a universal claim, (2) add a brief discussion of the source of the observation with illustrative (non-proprietary) examples, and (3) explicitly note that comprehensive benchmarks are left for future work. We cannot introduce new large-scale experiments in this revision cycle. revision: partial

standing simulated objections not resolved
  • Absence of quantitative sparsity statistics, overhead benchmarks, and end-to-end throughput measurements to support the empirical claims.

Circularity Check

0 steps flagged

No circularity: central claim follows from external empirical sparsity observation

full rationale

The paper states an empirical observation of 99%+ element-level sparsity in parameter updates during large-model RL training as an external fact, then applies a simplified cost model to conclude that sparse synchronization reduces volume from S to approximately S/X (with X~100 yielding ~100x reduction). This scaling is a direct arithmetic consequence of the input sparsity level rather than any self-referential derivation, fitted parameter, or self-citation chain. No equations, uniqueness theorems, or ansatzes are introduced that reduce the result to the paper's own outputs by construction. The derivation remains self-contained against the stated observation and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on an empirical domain observation of high sparsity and a simplified cost model; no new mathematical axioms or invented physical entities are introduced.

free parameters (1)
  • sparsity factor X
    Taken directly from the stated 99%+ element-level sparsity observation in large-model RL training; used to compute the 100x reduction.
axioms (1)
  • domain assumption: Parameter updates exhibit high element-level sparsity (99%+) in mainstream large-model RL training
    Invoked as the enabling observation for replacing full transfers with sparse payloads

pith-pipeline@v0.9.0 · 5559 in / 1398 out tokens · 46883 ms · 2026-05-11T01:20:58.719349+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

  1. [3]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 2022.

  2. [4]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, 2017. URL https://arxiv.org/abs/1707.06347.

  3. [5]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, and others. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 2024. arXiv:2402.03300.

  4. [6]

    DAPO: An open-source LLM reinforcement learning system at scale

    Yu et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025.

  5. [7]

    Reasoning-burden of long chain-of-thought RL fine-tuning

    Reasoning-burden of long chain-of-thought RL fine-tuning, 2025.

  6. [8]

    VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment

    Amirhossein Kazemnejad and others. VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment, 2024.

  7. [9]

    SAPO: Soft asymmetric policy optimization

    Gao et al. SAPO: Soft asymmetric policy optimization, 2025.

  8. [10]

    GSPO: Sequence-level group sequence policy optimization

    Zheng et al. GSPO: Sequence-level group sequence policy optimization, 2025.

  9. [11]

    TOPR: Tapered off-policy REINFORCE for stable off-policy learning

    Nicolas Le Roux et al. TOPR: Tapered off-policy REINFORCE for stable off-policy learning, 2025.

  10. [12]

    Tapered importance weights for off-policy …

    Arnal et al. Tapered importance weights for off-policy …

  11. [13]

    ASPO: Asymmetric importance-ratio correction for policy optimization

    Wang et al. ASPO: Asymmetric importance-ratio correction for policy optimization, 2025.

  12. [14]

    A3PO: Adaptive advantage shaping for policy optimization

    Tang et al. A3PO: Adaptive advantage shaping for policy optimization, 2025.

  13. [15]

    NGRPO: Negative-aware group relative policy optimization

    Nan et al. NGRPO: Negative-aware group relative policy optimization, 2025.

  14. [16]

    2024.

  15. [17]

    Composer2: Multi-cluster RL training at Cursor

    Cursor. Composer2: Multi-cluster RL training at Cursor. Technical report, 2024.

  16. [18]

    1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

    Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.

  17. [19]

    TernGrad: Ternary gradients to reduce communication in distributed deep learning

    Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, 2017.

  18. [20]

    Sparse Communication for Distributed Gradient Descent

    Alham Fikri Aji and Kenneth Heafield. Sparse Communication for Distributed Gradient Descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.

  19. [21]

    Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

    Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. In International Conference on Learning Representations (ICLR), 2018.

  20. [22]

    PowerSGD: Practical low-rank gradient compression for distributed optimization

    Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, 2019.

  21. [23]

    Zstandard Compression and the application/zstd Media Type

    Yann Collet and Murray Kucherawy. Zstandard Compression and the application/zstd Media Type. RFC 8878, 2021.

  22. [24]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019. arXiv:1909.08053.

  23. [25]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel, 2023.

  24. [26]

    Helix: An RL training framework

    Scitix. Helix: An RL training framework. GitHub repository, 2026. URL https://github.com/scitix/helix.

  25. [27]

    Sparse communication for distributed gradient descent

    Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.

  26. [28]

    AReaL: Towards fully asynchronous reinforcement learning for large language models

    Anonymous. AReaL: Towards fully asynchronous reinforcement learning for large language models, 2024a. TODO: confirm citation key and arXiv id.

  27. [29]

    ROLL: Heterogeneous reinforcement learning for large models

    Anonymous. ROLL: Heterogeneous reinforcement learning for large models, 2024b. TODO: confirm citation key and arXiv id.

  28. [30]

    AWex: Asynchronous weight exchange for large-model RL training

    Ant Group / inclusionAI. AWex: Asynchronous weight exchange for large-model RL training. GitHub repository, 2024. URL https://github.com/inclusionAI/asystem-awex.

  29. [31]

    Composer2: Multi-cluster RL training at Cursor

    Cursor. Composer2: Multi-cluster RL training at Cursor. Technical report, 2024. TODO: replace with the canonical URL once published.

  30. [32]

    SAPO: Soft asymmetric policy optimization

    Gao et al. SAPO: Soft asymmetric policy optimization, 2025. TODO: confirm citation; sigmoid-based soft gating.

  31. [33]

    VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment

    Amirhossein Kazemnejad et al. VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment, 2024. TODO: confirm citation; cited in info.md as Kazemnejad et al., 2024.

  32. [34]

    TOPR: Tapered off-policy REINFORCE for stable off-policy learning

    Nicolas Le Roux et al. TOPR: Tapered off-policy REINFORCE for stable off-policy learning, 2025. TODO: confirm citation.

  33. [35]

    Deep gradient compression: Reducing the communication bandwidth for distributed training

    Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations (ICLR), 2018.

  34. [36]

    Understanding and exploiting weight update sparsity for communication-efficient distributed RL

    Erfan Miahi and Eugene Belilovsky. Understanding and exploiting weight update sparsity for communication-efficient distributed RL, 2026. URL https://arxiv.org/abs/2602.03839.

  35. [37]

    Kimi checkpoint engine

    Moonshot AI. Kimi checkpoint engine. GitHub repository, 2024. URL https://github.com/MoonshotAI/checkpoint-engine.

  36. [38]

    NGRPO: Negative-aware group relative policy optimization

    Nan et al. NGRPO: Negative-aware group relative policy optimization, 2025. TODO: confirm citation.

  37. [39]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 2022.

  38. [40]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, 2017. URL https://arxiv.org/abs/1707.06347.

  39. [41]

    Helix: An RL training framework

    Scitix. Helix: An RL training framework. GitHub repository, 2026. URL https://github.com/scitix/helix. Repository to be released; placeholder URL.

  40. [42]

    1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

    Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.

  41. [43]

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. Introduces Group Relative Policy Optimization (GRPO).

  42. [44]

    A3PO: Adaptive advantage shaping for policy optimization

    Tang et al. A3PO: Adaptive advantage shaping for policy optimization, 2025. TODO: confirm citation.

  43. [45]

    slime: an open-source framework for large-model reinforcement learning

    THU-DCST. slime: an open-source framework for large-model reinforcement learning. GitHub repository, 2024. URL https://github.com/THU-DCST/slime. TODO: confirm canonical citation and version commit.

  44. [46]

    PowerSGD: Practical low-rank gradient compression for distributed optimization

    Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems, 2019.

  45. [47]

    ASPO: Asymmetric importance-ratio correction for policy optimization

    Wang et al. ASPO: Asymmetric importance-ratio correction for policy optimization, 2025. TODO: confirm citation.

  46. [48]

    TernGrad: Ternary gradients to reduce communication in distributed deep learning

    Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, 2017.

  47. [49]

    DAPO: An open-source LLM reinforcement learning system at scale

    Yu et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025. TODO: confirm full author list and arXiv id.

  48. [50]

    GSPO: Sequence-level group sequence policy optimization

    Zheng et al. GSPO: Sequence-level group sequence policy optimization, 2025. TODO: confirm citation.

  49. [51]

    The path not taken: RLVR provably learns off the principals

    Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai. The path not taken: RLVR provably learns off the principals, 2025. URL https://arxiv.org/abs/2511.08567.