arxiv: 2605.16787 · v1 · pith:SGBBLAWZnew · submitted 2026-05-16 · 💻 cs.LG · cs.CL

The Unlearnability Phenomenon in RLVR for Language Models

Pith reviewed 2026-05-19 21:34 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords unlearnability · RLVR · language models · reasoning · gradient similarity · representation issues · data augmentation · reinforcement learning

The pith

A substantial subset of hard examples remains unlearnable in RLVR even when correct rollouts are available.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines learning dynamics in reinforcement learning with verifiable reward for improving language model reasoning. It identifies that many examples the model initially gets wrong stay unlearnable despite access to correct solutions during training. Cross-example gradient analysis reveals these cases share low similarity with other examples and rely on reasoning patterns that fail to generalize. Standard fixes like better optimization, sampling, or data augmentation do not raise gradient similarity or enable learning. The work characterizes this unlearnable data and points to built-in limits of RL methods on reasoning tasks.

Core claim

Among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. These unlearnable examples have fundamental representation issues, shown by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. Existing optimization and sampling techniques fail to resolve unlearnability, and data augmentation does not improve gradient similarity, indicating inherent limitations of current RL approaches for reasoning tasks.
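To make the grouping in the core claim concrete, here is a minimal sketch of how examples might be partitioned by per-example success rate before and after training. The class name, the function name, and both thresholds are hypothetical choices for illustration; the paper's actual criteria may differ.

```python
from dataclasses import dataclass


@dataclass
class ExampleStats:
    example_id: str
    initial_success: float  # fraction of correct rollouts before RLVR training
    final_success: float    # fraction of correct rollouts after training


def partition_examples(stats, hard_threshold=0.2, learned_threshold=0.5):
    """Split examples into easy / learnable / unlearnable groups.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    groups = {"easy": [], "learnable": [], "unlearnable": []}
    for s in stats:
        if s.initial_success >= hard_threshold:
            groups["easy"].append(s.example_id)         # not a hard example
        elif s.final_success >= learned_threshold:
            groups["learnable"].append(s.example_id)    # hard, but improved
        else:
            groups["unlearnable"].append(s.example_id)  # hard, stayed hard
    return groups
```

An example with a nonzero but small initial success rate that stays low after training lands in the unlearnable group, which is the case the claim is about: correct rollouts exist, yet the success rate does not rise.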

What carries the argument

Cross-example gradient analysis, which measures similarity between gradients of different training examples to detect representation issues in unlearnable cases.
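A minimal sketch of the idea, assuming each example has already been reduced to one flattened gradient vector (for instance, averaged over its correct rollouts). The function names are hypothetical, and plain cosine similarity is used as the similarity measure; this is an illustration of the technique, not the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for a zero gradient: no alignment
    return dot / (norm_u * norm_v)


def mean_similarity_to_rest(grads):
    """For each example, average its cosine similarity to all other examples."""
    n = len(grads)
    if n < 2:
        raise ValueError("need at least two examples")
    scores = []
    for i in range(n):
        total = sum(cosine(grads[i], grads[j]) for j in range(n) if j != i)
        scores.append(total / (n - 1))
    return scores
```

Under this sketch, an example whose gradient points away from the bulk of the data receives a low score, which is the signature the page attributes to unlearnable examples. Real model gradients have millions of dimensions or more, so a practical version would work on tensors or on a random projection.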

If this is right

  • Existing optimization and sampling techniques fail to make unlearnable examples learnable.
  • Data augmentation leaves gradient similarity unchanged for unlearnable examples.
  • Unlearnable examples are distinguished by ungeneralizable reasoning patterns.
  • Current RL approaches have fundamental limitations when applied to reasoning in language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures could first identify examples with low gradient similarity and handle them separately or filter them out.
  • Improving base model representations through pre-training might reduce the fraction of examples that later prove unlearnable under RL.
  • Hybrid methods combining RL with targeted representation learning could address cases that pure RL leaves behind.
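The first extension above, flagging low-similarity examples before training, could be sketched as follows. This assumes per-example similarity scores are already available; the function name and the threshold value are hypothetical, and the paper does not propose this procedure.

```python
def split_by_similarity(scores, threshold):
    """Split example ids by mean gradient similarity.

    scores: dict mapping example id -> mean similarity to the rest of the data.
    threshold: illustrative cut-off (an assumption, not a value from the paper).
    Returns (keep, set_aside), each sorted by example id.
    """
    keep, set_aside = [], []
    for example_id, score in sorted(scores.items()):
        if score >= threshold:
            keep.append(example_id)
        else:
            set_aside.append(example_id)
    return keep, set_aside
```

For instance, `split_by_similarity({"a": 0.5, "b": 0.05, "c": 0.3}, threshold=0.2)` keeps `a` and `c` and sets `b` aside for separate handling.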

Load-bearing premise

That low gradient similarity reliably signals a fundamental representation problem that reinforcement learning cannot overcome.

What would settle it

A demonstration that additional pre-training or architectural changes raise gradient similarity for the previously unlearnable examples and allow them to become learnable under RLVR.

Figures

Figures reproduced from arXiv: 2605.16787 by Chen Zhao, He He, Yulin Chen.

Figure 1
Figure 1: Language models exhibit distinct learning behaviors on difficult examples (i.e., those the model initially struggles with) when trained using reinforcement learning with verifiable reward (RLVR). Measuring improvement via example-level success rates after convergence under GRPO training, we observe that a substantial subset of examples remains unlearned throughout training (Figure 1b). These unlearnable e…
Figure 3
Figure 3: Distribution of reference log-likelihood for different data examples' correct rollouts. All rollouts are sampled from the initial policy model. Unlearnable examples do not necessarily have low-probability rollouts.
… learnability, and the gap between learnable and unlearnable groups remains. We further verify in Appendix A.2 that the gap persists under two stronger interventions: supervised fine-tuning on dis…
Figure 2
Figure 2: Comparison of training reward dynamics on the learnable (D_l) and unlearnable (D_u) groups for naive GRPO and GRPO with the oversampling and replay method. The figure shows that the gap between learnable and unlearnable examples cannot be fully attributed to a difference in the number of correct rollouts.
Results. The training reward curve after applying the oversampling is shown in …
Figure 4
Figure 4: Comparison of clipping ratio for different groups of examples. The clipping effect on all three groups is similar.
Ablating clipping and KL regularization. If clipping or KL constraints were responsible for the lack of learning on unlearnable examples, relaxing them should benefit unlearnable examples. We train with clip-higher (Yu et al., 2025) and with the KL term removed …
Figure 5
Figure 5: Comparison of training reward dynamics on the learnable (D_l) and unlearnable (D_u) groups for naive GRPO, GRPO with higher clipping, and GRPO without the KL loss term. Neither training technique improves learnability.
This finding indicates that unlearnable examples are not edge cases affected by clipping mechanisms or KL divergence constraints, and that their resistance to learning stems from factors beyond low …
Figure 6
Figure 6: Gradient similarity of correct rollouts across unlearnable, learnable, and easy training examples. The rollouts are sampled with the initial policy model and the gradients are calculated against the same model. Easy examples have highly concentrated gradients while unlearnable examples are distinct gradient outliers.
… highly aligned for easy examples …
Figure 7
Figure 7: Distribution of model response reasoning quality for examples in different groups during training. Scores are generated by GPT-5-mini on sampled responses with correct final answers. The reasoning quality on unlearnable examples fails to improve effectively during training.
5.3. Data Augmentation Does Not Improve Gradient Similarity. Data Augmentation. We then explore whether data with high gradient similar…
Figure 8
Figure 8: Comparison of training rewards on the original unlearnable group across different training data compositions. The y-axis on the right shows the pass@k performance on the validation set for the combined augmented setting. Data augmentation does not help the model learn the unlearnable examples.
[Plot axes: Training Steps (x), Average Training Reward (y).]
Figure 9
Figure 9 further reveals the learnability of the augmented data compared with the reward dynamics of original easy …
[Plot axes: Training Steps (x), Average Training Reward (left y), Eval Pass@k (%) (right y); series include Pass@1 and Pass@16.]
Figure 11
Figure 11: Gradient similarity of difficult examples with the broader training distribution for Llama-3.2-3B-base, OctoThinker-3B-Hybrid-Base, and OctoThinker-3B-Long-Base (Wang et al., 2025). Mid-training is shown to improve gradient similarity effectively.
6. Discussion. Implications on Understanding of RLVR. Our study engages with the existing works on uncovering limitations of RLVR for LLMs. We offer a unique p…
Figure 10
Figure 10: The correlation of gradient similarity with augmented similar data and with broader training data, for unlearnable and learnable examples respectively. For unlearnable examples the two similarities are highly correlated, and both are lower than for learnable examples.
5.4. Mid-training Can Increase Gradient Similarity. Mid-training has been shown to be effective in improving the base model to make it more suitable for R…
Figure 12
Figure 12: Analysis results for controlling positive rollout number and gradient clipping effects for Llama-3.2-3B-Instruct.
[Panels: (a) Distribution of gradient similarity with the rest of the data for different examples (groups: Unlearnable, Learnable, Easy; axis: Gradient Similarity). (b) Gradient similarity with the augmented data against wi… (R² = 0.450).]
Figure 13
Figure 13: Cross-example gradient analysis results for Llama-3.2-3B-Instruct. Unlearnable examples have lower gradient similarity with the rest of the examples and with the augmented data.
A.2. Further Results on Positive Rollout Scarcity Hypothesis. To strengthen the case against positive rollout scarcity, we conduct two additional experiments that go beyond the k_pos = 1 setting in Section 4.1. SFT on filtered correc…
Figure 14
Figure 14: Data augmentation results for Llama-3.2-3B-Instruct. Although models learn the augmented subproblems successfully, this does not transfer to improvements on the original unlearnable examples.
… under direct supervision on correct responses, which is a substantially stronger signal than RLVR, unlearnable examples remain resistant to learning, while learnable and easy examples improve as expected. Larger rollout gr…
Figure 15
Figure 15: Additional results for positive rollout scarcity hypothesis. Neither stronger supervision via SFT (left) nor a substantially larger rollout group under RL (right) closes the gap between unlearnable examples and the rest of the data.
A.3. Hypothesis 3: Gradient Interference. The success of RL depends on exploiting both correct and incorrect rollouts. In LLM RLVR, the gradient is calculated at token level wh…
Figure 16
Figure 16: Gradient similarity between correct and incorrect rollouts during RL training. The similarity score is measured by cosine similarity; a score < 0 indicates potential gradient interference effects. Panel (a) shows within-prompt similarity and panel (b) shows dataset-level similarity. Across both views, gradients on unlearnable examples remain aligned rather than interfering.
Figure 17
Figure 17 shows the gradient similarity distribution midway through training for the different groups. The distribution for easy data changes drastically as an effect of model optimization, whereas the gradient similarity for unlearnable examples remains lower than for their learnable counterparts.
[Plot: Gradient Similarity distributions for the Unlearnable, Learnable, and Easy groups.]
Figure 18
Figure 18: Analysis of curriculum training. (a) Comparison of the training reward dynamics of unlearnable group for baseline full-data training and data curriculum. Due to inconsistency in data schedules, the x-axis represents the number of epochs the unlearnable examples are trained with. (b) Distribution of reasoning quality for examples in different groups after first stage of curriculum training. Scores are gene…
read the original abstract

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have fundamental representation issues, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to identify an 'unlearnability phenomenon' in RLVR for LLMs: among hard examples the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. Existing optimization and sampling techniques are shown to fail. Cross-example gradient analysis attributes this to fundamental representation issues, characterized by low gradient similarity with other examples and ungeneralizable reasoning patterns. Data augmentation is reported not to improve gradient similarity, supporting the conclusion of inherent limitations in current RL approaches for reasoning. Code and data are released.

Significance. If the central claims are substantiated with tighter controls and alternative-objective tests, the work would be significant for providing the first systematic characterization of unlearnable examples in RLVR and for highlighting potential limits of policy-gradient methods on reasoning tasks. The open-sourcing of code supports reproducibility and follow-up experiments.

major comments (3)
  1. [§4] §4 (cross-example gradient analysis): the interpretation that low gradient similarity demonstrates a 'fundamental representation issue' is not secured by the reported evidence. Gradient similarity under policy gradients is sensitive to advantage estimation variance, entropy regularization, and the non-stationary loss landscape; the examples could lie in high-curvature regions of the current policy without implying that the base model lacks representational capacity.
  2. [§5] §5 (data augmentation results): showing that augmentation does not raise gradient similarity only rules out one class of fixes. The central claim requires testing whether the same examples become learnable under a different objective (e.g., SFT or auxiliary representation losses) or after continued pre-training; without such tests the conclusion that the limitation is inherent to RLVR remains under-supported.
  3. [Experimental methods] Experimental methods section: insufficient detail is provided on how 'hard examples' are selected, what constitutes a 'correct rollout', the precise definition of gradient similarity, and error bars or statistical controls for the gradient comparisons. These omissions make it difficult to evaluate whether the unlearnability claim is robust or sensitive to implementation choices.
minor comments (2)
  1. [Abstract] Abstract: the term 'unlearnable' and the precise criterion for 'correct rollouts' should be defined more explicitly to avoid ambiguity for readers unfamiliar with the experimental protocol.
  2. [§4] Notation: the gradient similarity metric would benefit from an explicit equation (e.g., cosine similarity of per-example gradients) rather than a purely verbal description.
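
Major comment 1 points to advantage estimation as one source of noise in policy gradients. As background, here is a minimal sketch of the group-relative advantage commonly used in GRPO-style training, where each rollout's reward is normalized against the other rollouts sampled for the same prompt. The function name, the `eps` value, and the use of the population standard deviation are assumptions; implementations differ on these details, and this is not taken from the paper's code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group.

    rewards: verifiable rewards for the rollouts of a single prompt
             (e.g. 1.0 for a correct final answer, 0.0 otherwise).
    eps: small constant to avoid division by zero (illustrative value).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]
```

With a single correct rollout in a group of four, the correct rollout gets a positive advantage and the rest get negative ones. When every rollout receives the same reward, all advantages are zero and the prompt contributes no gradient, which is why the presence of at least one correct rollout matters for the unlearnability claim.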

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and positive assessment of the significance of our work. We address each of the major comments in detail below and have made revisions to the manuscript to improve clarity and robustness.

read point-by-point responses
  1. Referee: [§4] §4 (cross-example gradient analysis): the interpretation that low gradient similarity demonstrates a 'fundamental representation issue' is not secured by the reported evidence. Gradient similarity under policy gradients is sensitive to advantage estimation variance, entropy regularization, and the non-stationary loss landscape; the examples could lie in high-curvature regions of the current policy without implying that the base model lacks representational capacity.

    Authors: We appreciate the referee's caution regarding the interpretation of gradient similarity. We agree that policy gradient estimates can be noisy due to variance in advantage estimation and the non-stationary nature of the optimization. In our original analysis, we mitigated some of this by averaging over multiple rollouts and using a baseline. In the revised version, we have included additional experiments using variance-reduced gradient estimates and different entropy coefficients to demonstrate that the low similarity persists. We have revised the text in §4 to state that the low gradient similarity 'indicates potential representation issues within the RLVR framework' rather than claiming a 'fundamental representation issue' in the base model. This addresses the possibility of high-curvature regions specific to the current policy. revision: yes

  2. Referee: [§5] §5 (data augmentation results): showing that augmentation does not raise gradient similarity only rules out one class of fixes. The central claim requires testing whether the same examples become learnable under a different objective (e.g., SFT or auxiliary representation losses) or after continued pre-training; without such tests the conclusion that the limitation is inherent to RLVR remains under-supported.

    Authors: We concur that ruling out data augmentation alone does not fully establish the inherent nature of the limitation. Our experiments show that even with augmented data, the gradient similarity remains low, and unlearnable examples exhibit ungeneralizable patterns. To further support our claims, we have added a new subsection discussing why alternative objectives like SFT may not easily resolve this, citing the lack of generalization in reasoning patterns. We have also included preliminary results on an auxiliary representation loss that slightly improves similarity but does not fully resolve unlearnability. However, comprehensive experiments with continued pre-training are beyond the scope of this work due to resource limitations; we have noted this as a direction for future research in the revised discussion. revision: partial

  3. Referee: [Experimental methods] Experimental methods section: insufficient detail is provided on how 'hard examples' are selected, what constitutes a 'correct rollout', the precise definition of gradient similarity, and error bars or statistical controls for the gradient comparisons. These omissions make it difficult to evaluate whether the unlearnability claim is robust or sensitive to implementation choices.

    Authors: We thank the referee for highlighting these omissions, which we agree make reproducibility challenging. In the revised manuscript, we have expanded the Experimental Methods section significantly. We now provide: detailed criteria for identifying hard examples (initial model accuracy below 20% on the task), the definition of a correct rollout (one that receives the full verifiable reward), the exact formula for gradient similarity (cosine similarity between averaged policy gradient vectors from correct rollouts), and error bars representing standard deviation over 5 random seeds with p-values for comparisons. These changes should allow readers to better assess the robustness of our findings. revision: yes
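
The verbal definition in this simulated response ("cosine similarity between averaged policy gradient vectors from correct rollouts") could be written out as below. The symbols are assumptions introduced here for illustration: $\mathcal{R}^{+}_i$ is the set of correct rollouts for example $x_i$, $\ell_\theta(x_i, y)$ is the per-rollout training objective, and $g_i$ is the averaged gradient. The paper's own notation may differ.

```latex
g_i = \frac{1}{|\mathcal{R}^{+}_i|} \sum_{y \in \mathcal{R}^{+}_i} \nabla_\theta \, \ell_\theta(x_i, y),
\qquad
\operatorname{sim}(i, j) = \frac{g_i^{\top} g_j}{\lVert g_i \rVert \, \lVert g_j \rVert}
```

A per-example score then follows by averaging $\operatorname{sim}(i, j)$ over all $j \neq i$.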

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents an empirical characterization of unlearnable examples in RLVR via cross-example gradient similarity measurements and data augmentation experiments. These are observational diagnostics drawn from standard policy gradient analysis rather than any self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs by construction. The reported findings rest on direct computation of gradients and rollout outcomes, which are independently verifiable against the training dynamics and do not collapse into tautological equivalence with the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that gradient similarity is a valid proxy for representation quality and that the selected hard examples are representative of broader RLVR behavior.

axioms (1)
  • domain assumption Existing optimization and sampling techniques are representative of standard RLVR practice.
    The paper states that these techniques fail to resolve unlearnability.

pith-pipeline@v0.9.0 · 5708 in / 1136 out tokens · 47705 ms · 2026-05-19T21:34:46.572541+00:00 · methodology
