Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

Chenyu Zhou; Qiliang Jiang; Shuning Wu; Xu Zhou

arXiv: 2606.19222 · v1 · pith:RDBBLWFYnew · submitted 2026-06-17 · cs.LG · cs.AI

Pith reviewed 2026-06-26 21:12 UTC · model grok-4.3

classification: cs.LG · cs.AI

keywords: unlearning · selective targeting · RLVR · attention projections · math reasoning · language models · gradient updates · parameter ranking

The pith

MAST ranks attention-projection tensors to forget RLVR-induced math reasoning while leaving retain sets intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MAST to unlearn specific reasoning capabilities induced by RLVR training in language models. Full-parameter gradient ascent on the forget objective damages performance on both target and related retain tasks such as GSM8K and other MATH items. MAST instead ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On Qwen2.5-Math-1.5B this produces statistically significant forgetting on the MATH forget set while GSM8K rises slightly and MATH retain drops only 0.5 points. The pattern holds across seeds, alternative objectives, and a second model family.

Core claim

In matched SFT and RLVR checkpoints, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability. Full-parameter gradient ascent forgets only by damaging the retain sets (MATH retain and GSM8K). MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, updates only the top-ranked subset, and thereby induces statistically significant target forgetting (MATH forget 45/150 to 37/150, McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3.
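The two quantities behind this comparison, the mean token-level delta-log-probability and the fraction of tokens whose delta is positive, can be sketched as follows. This is a minimal illustration rather than the paper's code: the function name `delta_logprob_summary` and all toy log-probabilities are assumptions made up for the example, and it presumes both checkpoints were scored on the same reference tokens.

```python
from statistics import fmean
from typing import Sequence, Tuple


def delta_logprob_summary(
    before: Sequence[float], after: Sequence[float]
) -> Tuple[float, float]:
    """Return (mean delta, positive-delta fraction) for per-token log-probs.

    `before` and `after` hold log-probabilities of the same reference tokens
    under the earlier and the later checkpoint (hypothetical inputs).
    """
    if len(before) != len(after) or len(before) == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    deltas = [a - b for b, a in zip(before, after)]
    mean_delta = fmean(deltas)
    positive_fraction = sum(d > 0 for d in deltas) / len(deltas)
    return mean_delta, positive_fraction


# Toy numbers only: a one-sided update followed by a balanced increment.
base = [-2.0, -1.5, -3.0, -0.75]
sft = [-1.0, -1.0, -2.0, -0.5]
rlvr = [-1.25, -0.75, -2.25, -0.25]

sft_update = delta_logprob_summary(base, sft)      # base -> SFT
rlvr_increment = delta_logprob_summary(sft, rlvr)  # SFT -> RLVR
```

In this toy case the first update has a large positive mean with every delta positive, while the second has a mean of zero and a positive fraction of 0.5, which is the qualitative contrast described above.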

What carries the argument

MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude to select a subset for update that isolates RLVR-induced changes.
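A ranked selection of this kind might look like the sketch below. The text names the three criteria but does not say how they are combined, so the summed-rank rule used here is an assumption, as are the function name `rank_tensors`, the tensor names, and the score values; the per-tensor scores are taken as precomputed inputs.

```python
from typing import Dict, List, Tuple

# Per-tensor scores in the order
# (off-principal energy, update magnitude, forget-gradient coupling magnitude).
Scores = Dict[str, Tuple[float, float, float]]


def rank_tensors(scores: Scores, top_k: int) -> List[str]:
    """Select top_k tensors by summed per-metric rank (assumed combination rule)."""
    names = list(scores)
    total_rank = {name: 0 for name in names}
    for metric in range(3):
        # Position 0 is the tensor with the largest value on this metric.
        ordered = sorted(names, key=lambda n: scores[n][metric], reverse=True)
        for position, name in enumerate(ordered):
            total_rank[name] += position
    # Lower summed rank means higher priority; names break ties deterministically.
    return sorted(names, key=lambda n: (total_rank[n], n))[:top_k]


# Hypothetical scores for four attention projections of one layer.
scores = {
    "layers.0.self_attn.q_proj": (0.9, 0.5, 0.8),
    "layers.0.self_attn.k_proj": (0.2, 0.1, 0.3),
    "layers.0.self_attn.o_proj": (0.7, 0.6, 0.4),
    "layers.0.self_attn.v_proj": (0.1, 0.2, 0.1),
}
selected = rank_tensors(scores, top_k=2)
```

Only the tensors in `selected` would then receive the forget update; every other parameter stays frozen.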

If this is right

Target forgetting reaches statistical significance on the MATH forget set (45/150 to 37/150, McNemar p=0.0078).
GSM8K accuracy is preserved or slightly improved (+0.8 pp).
MATH retain set changes by only -0.5 pp.
The selective advantage reproduces across random seeds, NPO/SimNPO objectives, and the Qwen3 model family.
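The reported p-value can be checked against an exact McNemar test, which depends only on the discordant pairs. The sketch below assumes one split consistent with the headline numbers: 8 forget-set items answered correctly before and incorrectly after, and none in the other direction. That split is an assumption, since the contingency table is not given here, and `mcnemar_exact` is a name chosen for the example.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: items correct before and wrong after; c: wrong before and correct after.
    Under the null hypothesis each discordant pair falls either way with
    probability 1/2, so the test is a two-sided binomial test on b + c trials.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed split matching a net change of 45/150 -> 37/150.
p_value = mcnemar_exact(8, 0)
print(round(p_value, 4))  # 0.0078
```

Other splits with the same net change give different values (for example, 9 lost and 1 gained yields about 0.0215), so the figure of 0.0078 matches the 8-versus-0 case under the exact test.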

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-metric ranking could be tested on non-math unlearning tasks such as removing specific factual associations or unsafe response patterns.
If the off-principal energy and coupling metrics locate training-phase effects in other architectures, the approach might reduce the cost of capability editing more broadly.
The observed difference between SFT and RLVR increments in token-level delta-log-probability suggests that distinct training stages may affect largely separate parameter subsets.

Load-bearing premise

The ranking of attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude identifies parameters whose selective update produces the desired forgetting without collateral damage to retain sets.

What would settle it

Updating the top-ranked tensors under MAST yet observing either no statistically significant drop on the MATH forget set or a large drop on the GSM8K or MATH retain sets would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.19222 by Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou.

**Figure 1.** Mechanism separation on both model families. (a) The SFT update raises token log-probability (large positive mean Δlog-probability), while the SFT→RLVR increment is near zero. (b) The SFT update is directionally one-sided (positive-Δ fraction ≫ 0.5), while the RLVR increment is balanced (near 0.5). The pattern reproduces on Qwen3-1.7B-Base at smaller magnitude. Forget-gradient versus checkpoint-delta dire…

**Figure 2.** Forget–collateral Pareto frontier on both model families (steps-matched GA strength sweep, 50 steps). Each point is one unlearning strength; the x-axis is MATH-forget reduction and the y-axis is collateral (GSM8K, top row; MATH-retain, bottom row; lower is better). Negative x-values mark strengths that increase target accuracy rather than forget. Wherever full-parameter GA forgets meaningfully, its collate…

**Figure 3.** Ranking specificity. (a) R1: the top-96 selection is structurally coherent by construction, keeping all query and output projections and dropping only key/value tensors. (b) R2: the bottom-ranked 64 tensors produce negligible forgetting at the selected strength, while the ranked top-96 forgets. (c) R3: at matched forgetting, the ranked selector retains significantly more GSM8K than a same-size random subse…

**Figure 4.** Why forget-accuracy alone is misleading. Both the restricted and the full update lower the solution-trajectory log-probability while raising the short final-answer log-probability, relative to the RLVR source. Generated-correctness and answer-string-likelihood views therefore disagree.

8 Cross-Model Validation (Qwen3-1.7B-Base). To test whether the findings are specific to one model family, we rerun the ent…

Original abstract

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging retain MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MAST gives a workable selective unlearning trick on math reasoning models that preserves retain performance better than full updates, but the three-metric ranking lacks controls to show it actually isolates the target mechanism.

MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling, then updates only the top slice to forget RLVR-induced MATH reasoning. On Qwen2.5-Math-1.5B it drops forget-set accuracy from 45/150 to 37/150 (McNemar p=0.0078) while GSM8K rises 0.8 pp and MATH retain drops only 0.5 pp; the pattern repeats on Qwen3 and with NPO/SimNPO.

The concrete numbers and cross-seed, cross-objective, cross-model checks are the real contribution. They show that a small, ranked subset can produce the desired forgetting without the collateral damage seen in full-parameter ascent. That is a usable engineering result for anyone who needs to edit specific behaviors in already-trained reasoning models.

The soft spot is the missing controls. The abstract reports no random-subset or single-metric ablations, so we cannot tell whether the three-criterion ranking is doing the work or whether any comparably small update set would have produced the same preservation. If the latter, the “mechanism-aligned” framing rests on an untested assumption. The paper also stays inside two math benchmarks and two model families; broader tasks would be needed before claiming general utility.

This is for labs doing unlearning or safety edits on LLMs. It has enough reproducible experimental detail to go to referees, even though the mechanism claim needs tighter tests.

Referee Report

1 major / 1 minor

Summary. The paper proposes MAST (Mechanism-Aligned Selective Targeting), a selective unlearning method for RLVR-induced reasoning. It identifies attention-projection tensors via a three-metric ranking (off-principal energy, update magnitude, forget-gradient coupling magnitude) and performs targeted updates on only the top-ranked subset. On Qwen2.5-Math-1.5B and Qwen3 models, MAST produces statistically significant forgetting on a MATH forget set (45/150 to 37/150, McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp), outperforming full-parameter gradient ascent which damages retain sets; the advantage holds across seeds, NPO/SimNPO objectives, and model variants.

Significance. If the central result holds, the work demonstrates a practical route to mechanism-guided unlearning that reduces collateral damage relative to full-parameter methods, addressing a key limitation in safety applications of RL-trained models. The reproduction across seeds, objectives, and two model families provides some robustness evidence. No parameter-free derivations or machine-checked proofs are present, but the quantitative evaluation with explicit p-values and cross-setting consistency is a strength.

major comments (1)

[Experiments] Experiments section (and associated tables/figures): the manuscript reports no ablation comparing the three-metric ranking against random subset selection of the same size or against single-metric rankings. This omission leaves open whether the observed preservation of GSM8K and MATH retain is attributable to mechanism-specific alignment or merely to the reduced number of updated parameters; the claim that the ranking isolates RLVR-induced reasoning mechanisms is therefore untested and load-bearing for the central contribution.

minor comments (1)

[Abstract] The abstract states the McNemar test result but does not specify the exact contingency table construction or correction for multiple comparisons; adding these details in the main text would improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for ablations on the ranking procedure. We agree this strengthens the central claim and will incorporate the requested comparisons.

Referee: [Experiments] Experiments section (and associated tables/figures): the manuscript reports no ablation comparing the three-metric ranking against random subset selection of the same size or against single-metric rankings. This omission leaves open whether the observed preservation of GSM8K and MATH retain is attributable to mechanism-specific alignment or merely to the reduced number of updated parameters; the claim that the ranking isolates RLVR-induced reasoning mechanisms is therefore untested and load-bearing for the central contribution.

Authors: We agree that the absence of these ablations leaves the mechanism-specific contribution of the three-metric ranking untested relative to parameter count alone. The manuscript motivates the metrics from observed differences in token-level delta-log-probability between SFT and RLVR updates and from gradient-coupling analysis, but does not empirically isolate their joint effect. In revision we will add (i) random subset selection of identical cardinality and (ii) single-metric rankings, each evaluated on the same forget/retain splits, seeds, and objectives. These results will be reported in an expanded Experiments section with the same McNemar and accuracy metrics.

Revision: yes
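The two controls named in the response, a random subset of identical cardinality and single-metric rankings, could be generated along these lines. This is a sketch under stated assumptions, not the authors' protocol: the helper names `random_subset` and `single_metric_top_k`, the toy tensor names, and the score values are all hypothetical, and the metric order (off-principal energy, update magnitude, forget-gradient coupling magnitude) is a convention chosen for the example.

```python
import random
from typing import Dict, List, Sequence, Tuple


def random_subset(names: Sequence[str], k: int, seed: int) -> List[str]:
    """Control (i): a uniformly random subset with the same cardinality k."""
    if not 0 <= k <= len(names):
        raise ValueError("k must be between 0 and the number of tensors")
    rng = random.Random(seed)  # local generator so each seed is reproducible
    return sorted(rng.sample(list(names), k))


def single_metric_top_k(
    scores: Dict[str, Tuple[float, float, float]], metric: int, k: int
) -> List[str]:
    """Control (ii): rank by one metric only (index 0, 1, or 2) and keep the top k."""
    ordered = sorted(scores, key=lambda n: (-scores[n][metric], n))
    return ordered[:k]


# Hypothetical tensor names and scores for a toy two-layer model.
toy_scores = {
    "layers.0.self_attn.q_proj": (0.9, 0.5, 0.8),
    "layers.0.self_attn.k_proj": (0.2, 0.1, 0.3),
    "layers.1.self_attn.o_proj": (0.7, 0.6, 0.4),
    "layers.1.self_attn.v_proj": (0.1, 0.2, 0.1),
}
controls = {seed: random_subset(list(toy_scores), 2, seed) for seed in (0, 1, 2)}
by_update_magnitude = single_metric_top_k(toy_scores, metric=1, k=2)
```

Each control set would be passed through the same update and evaluation as the ranked subset, so that any difference in retain-set accuracy can be attributed to the selection rule rather than to the number of updated tensors.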

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines MAST explicitly as a ranking procedure over attention-projection tensors using three stated metrics (off-principal energy, update magnitude, forget-gradient coupling magnitude), then applies selective updates and reports empirical outcomes on held-out benchmarks (MATH forget set, GSM8K, MATH retain). No equations, fitted parameters, or self-citations are shown to reduce the reported forgetting/preservation statistics to the ranking inputs by construction; the advantage over full-parameter baselines is presented as an observed experimental result rather than a definitional identity. The derivation chain therefore remains self-contained against external evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract, the paper introduces no explicit free parameters, axioms, or invented entities; the method relies on an empirical ranking, and no detailed mathematical foundation is provided.

pith-pipeline@v0.9.1-grok · 5731 in / 1263 out tokens · 34113 ms · 2026-06-26T21:12:36.738344+00:00 · methodology

