A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

Deyi Xiong; Lei Yang; Siyu Ding

arxiv: 2606.02398 · v1 · pith:ALY6AUVOnew · submitted 2026-06-01 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-28 15:35 UTC · model grok-4.3

keywords: multi-domain reinforcement learning · cross-domain interference · local perturbation model · sparse parameter updates · conflict subspace · domain refresh · LLM post-training · second-order damage

The pith

Later-domain RL training harms earlier domains mainly through a second-order damage term concentrated in a low-dimensional shared conflict subspace.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that single-domain RL produces sparse parameter edits with limited neuron overlap, yet domains share active computation routes where update directions create synergy or conflict. Under a local perturbation model, it proves that interference arises primarily from a second-order damage term that concentrates in this low-dimensional subspace rather than from global gradient conflicts or full forgetting. A short refresh on the original domain contracts the harmful component on that subspace, allowing selective recovery with limited effects on other domains. This account explains observed interference patterns in multi-domain LLM post-training and points to targeted mitigation strategies.

Core claim

Under the local perturbation model of multi-domain RL, later-domain training harms an earlier domain mainly through a second-order damage term that, given the observed sparse route structure, concentrates in a low-dimensional shared conflict subspace. A brief domain refresh contracts this harmful component, enabling selective recovery, while a training-free rollback on a sparse proxy conflict coordinate set provides direct evidence of localized damage.
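
As a sketch of what such a term looks like (the notation here is assumed for illustration and is not taken from the paper): let $\mathcal{L}_A$ be the earlier domain's loss, $\theta_A$ the parameters after training on it, and $\Delta$ the later-domain update. A second-order Taylor expansion around $\theta_A$ gives:

```latex
\mathcal{L}_A(\theta_A + \Delta) - \mathcal{L}_A(\theta_A)
  \approx \underbrace{g_A^\top \Delta}_{\text{first order}}
  + \underbrace{\tfrac{1}{2}\,\Delta^\top H_A\,\Delta}_{\text{second-order damage}},
\qquad
g_A = \nabla \mathcal{L}_A(\theta_A), \quad
H_A = \nabla^2 \mathcal{L}_A(\theta_A).
```

When $g_A^\top \Delta \approx 0$, either because $\theta_A$ is near-stationary for $\mathcal{L}_A$ or because the update is nearly orthogonal to $g_A$, the quadratic term is what remains. If the curvature $H_A$ is appreciable only on a low-dimensional subspace with orthogonal projector $P$ (again an assumed symbol), then $\tfrac{1}{2}\Delta^\top H_A \Delta \approx \tfrac{1}{2}(P\Delta)^\top H_A (P\Delta)$, which is one way the damage could localize.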

What carries the argument

The second-order damage term within the local perturbation model, which concentrates interference in the low-dimensional shared conflict subspace under sparse active computation routes.

If this is right

A short refresh on an earlier domain recovers performance by contracting the second-order damage term with limited collateral effects on other domains.
Training-free rollback on a sparse proxy set of conflict coordinates can partially restore earlier-domain performance without further training.
Interference persists even when full-model gradients are nearly orthogonal because shared routes determine synergy or conflict.
The sparse, small-magnitude edits from single-domain RL create weak top-neuron overlap yet still produce route-level conflicts.
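
The consequences listed above can be illustrated on a toy quadratic. The sketch below is not the paper's procedure: the diagonal Hessian, the choice of conflict coordinates, and every variable name are assumptions made for illustration. It shows a later-domain update that is orthogonal to the earlier-domain gradient yet still incurs second-order damage, and a training-free rollback of a few coordinates that removes most of it.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 200, 5                      # toy parameter count; size of the assumed conflict set
conflict = np.arange(k)              # hypothetical proxy conflict coordinates

# Toy earlier-domain curvature (diagonal Hessian): sharp on the conflict
# coordinates, nearly flat everywhere else.
h_diag = np.full(dim, 1e-3)
h_diag[conflict] = 10.0

theta_early = rng.normal(size=dim)   # checkpoint right after the earlier domain
grad_early = rng.normal(size=dim)    # stand-in for the earlier-domain gradient

# Later-domain update, projected to be orthogonal to grad_early
# (up to floating-point error).
delta = 0.01 * rng.normal(size=dim)
delta -= (delta @ grad_early) / (grad_early @ grad_early) * grad_early
theta_late = theta_early + delta

def damage(theta):
    """Second-order damage 0.5 * d^T H d, with d measured from theta_early."""
    d = theta - theta_early
    return 0.5 * float(np.sum(h_diag * d**2))

first_order = float(grad_early @ delta)   # close to zero: no first-order conflict

# Training-free rollback: restore only the conflict coordinates.
theta_rolled = theta_late.copy()
theta_rolled[conflict] = theta_early[conflict]

damage_before = damage(theta_late)
damage_after = damage(theta_rolled)
print(f"first-order term: {first_order:.2e}")
print(f"damage before rollback: {damage_before:.3e}, after: {damage_after:.3e}")
```

In this toy the rollback touches 5 of 200 coordinates. How a real proxy conflict set would be chosen is not specified in the text above, so the selection here is purely illustrative.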

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Monitoring activity along the identified conflict subspace during training could enable early detection and targeted interventions before full degradation occurs.
The localized mechanism may extend to continual learning settings beyond RL where parameter updates remain sparse and route-overlapping.
Proxy-level rollback on conflict coordinates suggests that low-rank approximations of the subspace could support efficient recovery methods at scale.

Load-bearing premise

The local perturbation model remains a valid approximation of the dominant parameter-update dynamics inside the observed sparse route structure.

What would settle it

An experiment measuring the second-order damage term after sequential domain training and finding that it does not concentrate in the predicted low-dimensional shared conflict subspace, or that a short refresh fails to contract the harmful component while preserving other domains.
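
A minimal sketch of what such a measurement could look like on a toy problem follows. It assumes a synthetic Hessian with strong curvature on a rank-4 subspace, and it models the refresh as a few plain gradient steps on the earlier-domain quadratic. Neither assumption comes from the paper, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, rank = 100, 4

# Assumed toy Hessian: strong curvature on a rank-4 subspace, weak elsewhere.
basis, _ = np.linalg.qr(rng.normal(size=(dim, rank)))   # orthonormal columns
hessian = 5.0 * basis @ basis.T + 1e-3 * np.eye(dim)

delta = 0.05 * rng.normal(size=dim)   # stand-in for the later-domain update

def damage(d):
    """Second-order damage 0.5 * d^T H d."""
    return 0.5 * float(d @ hessian @ d)

# Concentration: share of the damage carried by the top-`rank` eigenvectors.
eigvals, eigvecs = np.linalg.eigh(hessian)      # eigenvalues in ascending order
top = eigvecs[:, -rank:]
coords = top.T @ delta
concentration = 0.5 * float(np.sum(eigvals[-rank:] * coords**2)) / damage(delta)

# Short refresh: a few gradient steps on the earlier-domain quadratic,
# i.e. d <- d - lr * H d.
lr, steps = 0.1, 5
refreshed = delta.copy()
for _ in range(steps):
    refreshed -= lr * hessian @ refreshed

def split_norms(d):
    """Norm of d inside and outside the high-curvature subspace."""
    inside = top @ (top.T @ d)
    return float(np.linalg.norm(inside)), float(np.linalg.norm(d - inside))

inside_before, outside_before = split_norms(delta)
inside_after, outside_after = split_norms(refreshed)
print(f"damage share in subspace: {concentration:.3f}")
print(f"inside norm:  {inside_before:.4f} -> {inside_after:.4f}")
print(f"outside norm: {outside_before:.4f} -> {outside_after:.4f}")
```

In this toy the component inside the subspace shrinks by a factor of roughly $(1 - 0.1 \times 5)^5$, while the component outside it is almost unchanged. A real test would replace the synthetic Hessian with curvature estimates from the trained model.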

Figures

Figures reproduced from arXiv: 2606.02398 by Deyi Xiong, Lei Yang, Siyu Ding.

**Figure 2.** Parameter-change distributions of the four single-domain experts relative to the base model.

**Figure 3.** Neuron-overlap rates under different settings. Since domain RL updates are sparse, a natural explanation for strong interference is direct co-editing: different domains may concentrate their large updates on the same set of functional units. We test this explanation by lifting the analysis from parameters to MLP neurons and measuring the overlap among the most strongly changed neurons across domain experts…

**Figure 4.** Layer-wise average directional cosine on shared top-changed neurons across domain pairs.

**Figure 5.** Layer-wise heatmaps of pairwise gradient cosine in attention and MLP modules.

**Figure 6.** Module-level conflict and synergy trends across the most prominent attention and MLP…

**Figure 7.** Cumulative parameter-change distributions along the sequential domain RL chain relative…

**Figure 8.** Incremental parameter-change distributions at each stage of the sequential domain RL…

**Figure 9.** Validation dynamics during Re-Math refresh from…

**Figure 10.** Math and Code performance changes during Re-Code on Matho…

**Figure 11.** Validation dynamics for the reverse QA → Math ordering. This provides further evidence that cross-domain interference is directional rather than symmetric. This asymmetry is also consistent with our local perturbation analysis and Proposition 1: the damage to an earlier domain depends on whether the later update enters its curvature-sensitive shared directions. Here, the Math update appears to perturb Q…

**Figure 12.** Layer-wise neuron selection analysis. Top: normalized layer scores; middle: budget…

**Figure 13.** Math recovery under different intervention budgets. To keep the comparison with the MLP-only experiments fair, we define the total budget in the same units as before: $B = \beta \times N_{\mathrm{mlp}}$ (94), where $N_{\mathrm{mlp}} = 36 \times 9728 = 350{,}208$. We first compute a joint layer score by combining the top-$\rho$ scores from MLP and attention units: $s_\ell = \dfrac{\bar{S}^{\rho}_{\mathrm{mlp}}(\ell)\, d_{\mathrm{int}} + \bar{S}^{\rho}_{\mathrm{attn}}(\ell)\, n_{\mathrm{attn}}}{d_{\mathrm{int}} + n_{\mathrm{attn}}}$ (95), with $\rho = 0.1$, $d_{\mathrm{int}} = 9728$, a…
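
The budget arithmetic in the Figure 13 caption can be checked with a short sketch. The layer count and MLP width come from the caption. The value of `beta`, the rounding rule, the value of `n_attn`, and the reading of Eq. (95) as a size-weighted average are all assumptions made here for illustration.

```python
N_LAYERS = 36    # number of layers, from the caption
D_INT = 9728     # MLP intermediate width d_int, from the caption
N_MLP = N_LAYERS * D_INT


def total_budget(beta: float) -> int:
    """B = beta * N_mlp, rounded down to whole units (rounding rule assumed)."""
    return int(beta * N_MLP)


def joint_layer_score(s_mlp: float, s_attn: float, n_attn: int,
                      d_int: int = D_INT) -> float:
    """Size-weighted average of MLP and attention scores (assumed reading of Eq. 95)."""
    return (s_mlp * d_int + s_attn * n_attn) / (d_int + n_attn)


print(N_MLP)                                       # 350208
print(total_budget(0.01))                          # 3502, for a hypothetical beta of 1%
print(joint_layer_score(1.0, 0.0, n_attn=9728))    # 0.5 when both unit counts are equal
```

Under this reading, a layer's score is dominated by whichever unit type is more numerous, so the weighting matters whenever `n_attn` is much smaller than `d_int`.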

Original abstract

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code $\rightarrow$ Math $\rightarrow$ QA $\rightarrow$ CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
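
The neuron-level observations in the abstract can be made concrete with a small sketch. It assumes that a neuron is one row of an MLP weight matrix, that "top-changed" means largest update norm, and that overlap is measured as a Jaccard ratio. The paper's exact definitions are not given here, so these are illustrative choices, and the toy weights are hypothetical.

```python
import numpy as np


def top_changed_neurons(w_base, w_expert, frac):
    """Indices of the neurons (rows) with the largest update norm."""
    change = np.linalg.norm(w_expert - w_base, axis=1)
    k = max(1, int(frac * change.size))
    return set(np.argsort(change)[-k:].tolist())


def overlap_rate(a, b):
    """Jaccard overlap between two neuron index sets."""
    return len(a & b) / len(a | b)


def directional_cosine(w_base, w_a, w_b, neuron):
    """Cosine between two experts' update directions on one neuron."""
    da = w_a[neuron] - w_base[neuron]
    db = w_b[neuron] - w_base[neuron]
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))


# Toy weights: 8 neurons x 4 inputs, with two hypothetical domain experts.
w_base = np.zeros((8, 4))
w_math = w_base.copy()
w_code = w_base.copy()
w_math[0] += 0.3
w_math[1] += 0.2
w_code[1] -= 0.2    # same neuron as the Math expert, opposite direction
w_code[2] += 0.4

top_math = top_changed_neurons(w_base, w_math, frac=0.25)
top_code = top_changed_neurons(w_base, w_code, frac=0.25)
shared = top_math & top_code

print("overlap rate:", overlap_rate(top_math, top_code))
for n in sorted(shared):
    print(f"neuron {n}: directional cosine = {directional_cosine(w_base, w_math, w_code, n):+.2f}")
```

Here the two experts share one of three top-changed neurons, and on that neuron their updates point in opposite directions. That is the kind of route-level conflict the abstract distinguishes from simple co-editing.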

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives a second-order damage term under a local perturbation model that concentrates interference in a low-dimensional shared subspace for multi-domain RL, and shows recovery via domain refresh plus proxy rollback.

The core contribution is the local perturbation derivation that pins most cross-domain harm on a second-order term, which the sparse route structure funnels into a low-dimensional conflict subspace. They start from the observation that single-domain RL edits are sparse and small with limited top-neuron overlap, then use that to explain why interference persists even with nearly orthogonal full gradients.

What works is the link from that observation to the model choice and then to the recovery experiments. A brief Math refresh after the Code-Math-QA-CW sequence lifts Math from 57.66 to 66.04 while keeping the other domains mostly intact and posting the highest average of 66.39. The training-free proxy rollback on the Math-QA conflict coordinates gives direct evidence that the damage is localized rather than global.

The soft spot is that the local perturbation model is introduced as an analytical device guided by the sparse-edit finding, so the subspace concentration is partly built into the assumptions rather than emerging as a fully independent prediction. The abstract claims a proof, but without the full steps, variance numbers, or stronger baseline comparisons it is difficult to gauge how tightly the math supports the practical claims.

This is aimed at researchers who train LLMs across multiple RL domains and want a mechanistic handle on interference beyond standard forgetting stories. Readers who care about subspace-level accounts or lightweight recovery methods will find usable ideas. The combination of a stated model, derivation, and matching recovery results is solid enough to warrant referee time even if the model applicability needs more scrutiny.

I would send it for review.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a local perturbation theory for cross-domain interference in multi-domain RL post-training of LLMs. Guided by empirical observations of sparse, small-magnitude parameter edits with weak neuron overlap but shared active computation routes, it proves under an explicit local perturbation model that later-domain training primarily harms earlier domains via a second-order damage term. This term concentrates in a low-dimensional shared conflict subspace due to the sparse route structure. The theory further predicts that a brief domain refresh contracts the harmful component for selective recovery with limited collateral effects. Experiments on Math, Code, QA, and CW domains show a Re-Math refresh recovering Math from 57.66 to 66.04 while preserving other domains (best average 66.39), with additional support from a training-free rollback on a sparse proxy conflict coordinate set.

Significance. If the local perturbation model is a valid approximation, the work supplies a mechanistic, localized account of interference that addresses gaps in catastrophic forgetting and global gradient conflict explanations. The derivation of the second-order term and its concentration in a low-dimensional subspace, combined with matching experimental recovery via refresh and rollback, offers both theoretical insight and practical recovery techniques for multi-domain LLM post-training. The explicit modeling choice tested against observed sparse edits and direct proxy-level evidence are strengths that could inform more targeted training strategies.

major comments (2)

[Theory section] Theory section (local perturbation model and second-order derivation): The model is introduced as guided by the sparse-route observation to produce the claimed second-order term; the manuscript should add an explicit statement of the perturbation-size regime (e.g., relative to gradient norms or update magnitudes) under which the second-order approximation dominates and the subspace concentration holds, together with a boundary-case check against full-gradient computations.
[Experimental results] Experimental results (recovery scores and validation): The reported improvements (Math 57.66 → 66.04, average 66.39) and rollback results are presented without error bars, number of runs, or comparisons to standard baselines such as experience replay or gradient-projection methods; this weakens the claim that the refresh achieves selective recovery consistent with the theory.
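
The boundary-case check requested in the first major comment could take roughly the following shape. This is a sketch on a toy loss, `sum(log(cosh(x)))`, chosen here only because its gradient at zero vanishes and its Hessian there is the identity. It is not the paper's objective, and the sweep values are arbitrary.

```python
import numpy as np


def exact_loss_change(delta):
    """Exact change of the toy loss sum(log cosh(x)) when moving from x = 0."""
    return float(np.sum(np.log(np.cosh(delta))))


def quadratic_estimate(delta):
    """Second-order estimate 0.5 * delta^T H delta; H is the identity at x = 0."""
    return 0.5 * float(delta @ delta)


direction = np.ones(10) / np.sqrt(10.0)    # fixed unit direction
scales = [0.01, 0.1, 1.0, 10.0]            # update norms to sweep
rel_errors = []
for s in scales:
    delta = s * direction
    exact = exact_loss_change(delta)
    rel_errors.append(abs(quadratic_estimate(delta) - exact) / exact)

for s, err in zip(scales, rel_errors):
    print(f"update norm {s:>5}: relative error of quadratic estimate = {err:.2e}")
```

The relative error grows with the update norm, which marks where a second-order account stops being trustworthy in this toy. The analogous check on a real model would compare the quadratic estimate against measured loss or reward changes.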

minor comments (2)

[Theory section] Notation for the shared conflict subspace and active routes should be formally defined with symbols and dimensions when first introduced in the theory section to improve readability.
[Abstract and results] The abstract and results section would benefit from a brief statement of how the sparse proxy coordinate set for rollback is constructed, including its dimensionality relative to the full parameter space.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments help clarify the conditions for the local perturbation model and strengthen the experimental validation. We respond to each major comment below.

Referee: [Theory section] Theory section (local perturbation model and second-order derivation): The model is introduced as guided by the sparse-route observation to produce the claimed second-order term; the manuscript should add an explicit statement of the perturbation-size regime (e.g., relative to gradient norms or update magnitudes) under which the second-order approximation dominates and the subspace concentration holds, together with a boundary-case check against full-gradient computations.

Authors: We agree that an explicit statement of the perturbation-size regime is necessary. In the revised manuscript, we will add a paragraph in the Theory section defining the regime: the second-order approximation holds when update magnitudes are small relative to gradient norms (consistent with the observed sparse, small-magnitude edits), ensuring the second-order damage term dominates and concentrates in the low-dimensional shared conflict subspace. We will also include a boundary-case check comparing the local approximation against full-gradient computations on a subset of domains to delineate the validity range.

revision: yes

Referee: [Experimental results] Experimental results (recovery scores and validation): The reported improvements (Math 57.66 → 66.04, average 66.39) and rollback results are presented without error bars, number of runs, or comparisons to standard baselines such as experience replay or gradient-projection methods; this weakens the claim that the refresh achieves selective recovery consistent with the theory.

Authors: We acknowledge that error bars and the number of runs were omitted. The reported scores were obtained over 3 random seeds; we will revise the Experimental Results section to report means and standard deviations. However, direct comparisons to experience replay or gradient-projection methods fall outside the paper's primary aim of testing consistency with the local perturbation predictions (subspace contraction and selective recovery). Such baselines would require substantial additional experiments, which we will flag as future work while retaining the rollback as direct proxy-level support for the theory.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under explicit model

The paper states an empirical observation of sparse edits, then explicitly introduces a local perturbation model as an analytical device to derive the second-order damage term and subspace concentration. The derivation proceeds from the model's assumptions to the claimed results, with direct experimental tests of recovery via refresh and rollback. No load-bearing step reduces by construction to a self-definition, a fitted input renamed as prediction, or a self-citation chain; the model applicability is presented as a testable approximation rather than hidden. This is the normal case of a modeling paper whose central claim remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the local perturbation model and the empirical observation of sparse parameter edits; both are domain assumptions introduced to support the derivation and are not shown to be independently verified outside the paper's own experiments.

axioms (2)

Domain assumption: The local perturbation model accurately captures the dynamics of multi-domain RL parameter updates.
The proof of the second-order damage term is conducted under this model as stated in the abstract.

Domain assumption: Single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons while domains share active computation routes.
This observation is invoked to justify that damage concentrates in a low-dimensional shared conflict subspace.

invented entities (1)

Low-dimensional shared conflict subspace (no independent evidence)
Purpose: To localize the second-order damage term responsible for cross-domain interference.
The subspace is postulated within the local perturbation model to explain why refresh can selectively recover performance.

pith-pipeline@v0.9.1-grok · 5801 in / 1610 out tokens · 43924 ms · 2026-06-28T15:35:39.937761+00:00 · methodology


[1] [1]

Aime problems and solutions

Art of Problem Solving. Aime problems and solutions. https://artofproblemsolving.com/ wiki/index.php/AIME_Problems_and_Solutions, 2024a. Accessed: 2025-12-18

2025

[2] [2]

A. Bau, Y . Belinkov, H. Sajjad, N. Durrani, F. Dalvi, and J. R. Glass. Identifying and controlling important neurons in neural machine translation. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019

2019

[3] [3]

H. Chen, N. Razin, K. Narasimhan, and D. Chen. Retaining by doing: The role of on-policy data in mitigating forgetting.CoRR, abs/2510.18874, 2025

arXiv 2025

[4] [4]

Z. Chen, V . Badrinarayanan, C. Lee, and A. Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In J. G. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stock- holmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 ofProceedings of Machine Lea...

2018

[5] [5]

P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V . N. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 20...

2017

[6] [6]

T. Chu, Y . Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V . Le, S. Levine, and Y . Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training.CoRR, abs/2501.17161, 2025

Pith/arXiv arXiv 2025

[7] [7]

D. Dai, L. Dong, Y . Hao, Z. Sui, B. Chang, and F. Wei. Knowledge neurons in pretrained transformers. In S. Muresan, P. Nakov, and A. Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 8493–8502. Association for Computatio...

2022

[8] [8]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025. 10

[9]

J. Dekoninck, N. Jovanović, T. Gehrunger, K. Rögnvalddson, I. Petrov, C. Sun, and M. Vechev. Beyond benchmarks: Matharena as an evaluation platform for mathematics with llms. CoRR, abs/2605.00674, 2026.

[10]

C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In L. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Comp...

[11]

M. Huan, Y. Li, T. Zheng, X. Xu, S. Kim, M. Du, R. Poovendran, G. Neubig, and X. Yue. Does math reasoning improve general LLM capabilities? Understanding transferability of LLM reasoning. CoRR, abs/2507.00432, 2025.

[12]

Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025.

[13]

N. Jain, K. Han, A. Gu, W. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.

[14]

J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. CoRR, abs/1612.00796, 2016.

[15]

S. Lai, H. Zhao, R. Feng, C. Ma, W. Liu, H. Zhao, X. Lin, D. Yi, M. Xie, Q. Zhang, H. Liu, G. Meng, and F. Zhu. Reinforcement fine-tuning naturally mitigates forgetting in continual post-training. CoRR, abs/2507.05386, 2025.

[16]

Y. Leng and D. Xiong. Towards understanding multi-task learning (generalization) of LLMs via detecting and exploring task-specific neurons. In O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, and S. Schockaert, editors, Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-...

[17]

D. Li, J. Zhou, A. Kazemi, Q. Sun, A. Ghaddar, M. A. Alomrani, L. Ma, Y. Luo, D. Li, F. Wen, J. Hao, M. Coates, and Y. Zhang. Omni-thinker: Scaling cross-domain generalization in llms via multi-task RL with hybrid rewards. CoRR, abs/2507.14783, 2025.

[18]

X. Liang, L. Yang, J. Wang, R. Liu, Y. Lu, J. Zeng, H. Chen, D. Li, and J. Hao. Boosting multi-domain reasoning of LLMs via curvature-guided policy optimization. In International Conference on Learning Representations, 2026.

[19]

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.

[20]

B. Liu, X. Liu, X. Jin, P. Stone, and Q. Liu. Conflict-averse gradient descent for multi-task learning. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages ...

[21]

M. Matena and C. Raffel. Merging models with fisher-weighted averaging. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.

[22]

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

[23]

O. Sener and V. Koltun. Multi-task learning as multi-objective optimization. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 525–536, 2018.

[24]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300, 2024.

[25]

I. Shenfeld, J. Pari, and P. Agrawal. RL’s razor: Why online reinforcement learning forgets less. CoRR, abs/2509.04259, 2025.

[26]

G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, pages 1279–1297. ACM, 2025.

[27]

D. Shi, Z. Han, S. Ostermann, R. Jin, J. van Genabith, and D. Xiong. Why does reinforcement learning generalize? A feature-level mechanistic study of post-training in large language models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026.

[28]

H. Shi, Z. Xu, H. Wang, W. Qin, W. Wang, Y. Wang, and H. Wang. Continual learning of large language models: A comprehensive survey. CoRR, abs/2404.16789, 2024.

[29]

N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 20...

[30]

Z. Su, L. Pan, M. Lv, Y. Li, W. Hu, F. Zhang, K. Gai, and G. Zhou. Ce-gppo: Controlling entropy via gradient-preserving clipping policy optimization in reinforcement learning, 2025.

[31]

K. Team. Kimi k1.5: Scaling reinforcement learning with llms. CoRR, abs/2501.12599, 2025.

[32]

M. Team. Supergpqa: Scaling LLM evaluation across 285 graduate disciplines. CoRR, abs/2502.14739, 2025.

[33]

Q. Team. Qwen3 technical report. CoRR, abs/2505.09388, 2025.

[34]

X. Wang, K. Wen, Z. Zhang, L. Hou, Z. Liu, and J. Li. Finding skill neurons in pre-trained transformer-based language models. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 11132–11152. Asso...

[35]

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors, Advances in Neural ...

[36]

M. Wortsman, G. Ilharco, S. Y. Gadre, R. Roelofs, R. Gontijo-Lopes, A. S. Morcos, H. Namkoong, A. Farhadi, Y. Carmon, S. Kornblith, and L. Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Ma...

[37]

Y. Wu, J. Mei, M. Yan, C. Li, S. Lai, Y. Ren, Z. Wang, J. Zhang, M. Wu, Q. Jin, and F. Huang. Writingbench: A comprehensive benchmark for generative writing. CoRR, abs/2503.05244, 2025.

[38]

P. Yadav, D. Tam, L. Choshen, C. A. Raffel, and M. Bansal. Ties-merging: Resolving interference when merging models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, D...

[39]

T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn. Gradient surgery for multi-task learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[40]

J. Zheng, X. Cai, S. Qiu, and Q. Ma. Spurious forgetting in continual learning of language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.

Table 4: Training hyperparameters. train batch size, ppo mini batch size, max prompt length, max response length, adv estimat...