Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

Jiangnan Xia; Kishan Panaganti; Ninghao Liu; Yucheng Shi; Yu Yang; Zhenwen Liang

arxiv: 2606.10346 · v1 · pith:L7V3WJD2new · submitted 2026-06-09 · 💻 cs.AI


Pith reviewed 2026-06-27 13:37 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM reinforcement learning · exploration methods · reasoning vs memorization · direction-aware · GRPO · policy optimization · gradient features · reward shaping


The pith

DiRL extracts a reasoning-memorization direction from LLM representations to steer reinforcement learning exploration toward genuine reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing exploration methods in LLM reinforcement learning reward diversity equally whether it comes from new reasoning or from memorized shortcuts. DiRL identifies an internal direction within the model's representations that separates these two sources of variation. It then builds direction-weighted gradient features from rollouts and uses them to shape rewards inside the GRPO algorithm, boosting updates aligned with reasoning while damping those aligned with memorization. Experiments on mathematical and general reasoning benchmarks show consistent gains over prior diversity-based methods. A reader would care because the approach offers a concrete way to make RL training favor actual problem solving over pattern recall.

Core claim

DiRL extracts an internal reasoning-memorization direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations, integrating seamlessly into Group Relative Policy Optimization (GRPO) and producing significant improvements over existing exploration methods on mathematical and general reasoning benchmarks.

What carries the argument

The reasoning-memorization direction extracted from model representations, used to weight gradient features and shape rewards during policy updates.
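The summary does not say how the direction is obtained. As a minimal sketch, assuming a difference-of-means construction over hidden states that have been labeled as reasoning-like or memorization-like, the direction and a per-rollout projection score could look like the following. The function names, the toy two-dimensional states, and the labels are all hypothetical and are not taken from the paper.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(reasoning_states, memorization_states):
    """Difference of class means, normalized to unit length.

    Assumption: this is one plausible extraction method, not necessarily
    the one used by DiRL.
    """
    mu_r = mean_vector(reasoning_states)
    mu_m = mean_vector(memorization_states)
    diff = [a - b for a, b in zip(mu_r, mu_m)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]


def projection(state, direction):
    """Signed score: positive leans reasoning, negative leans memorization."""
    return sum(s * d for s, d in zip(state, direction))


# Hypothetical toy hidden states (two dimensions for readability).
reasoning = [[1.0, 0.0], [3.0, 0.0]]
memorization = [[-1.0, 0.0], [-3.0, 0.0]]
direction = unit_direction(reasoning, memorization)
score = projection([0.5, 2.0], direction)
```

In this toy setup only the first coordinate separates the two classes, so the second coordinate of a new state has no effect on its score.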

If this is right

Exploration in GRPO can be made selective rather than uniformly diverse by anchoring to the internal direction.
Direction-weighted gradient features provide a practical signal for distinguishing reasoning trajectories from memorization ones during training.
The method yields measurable performance lifts on both mathematical and general reasoning benchmarks relative to prior exploration baselines.
Reward shaping that suppresses memorization-aligned updates integrates directly into standard policy optimization without requiring new architectures.
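To make the first and last points concrete, here is a hedged sketch of how a direction-aware term might be folded into GRPO's group-relative advantage. The within-group standardization is the usual GRPO form (population standard deviation is used here; implementations differ). The shaping rule, the `alpha` and `beta` weights, and the alignment scores are assumptions for illustration, since the paper's actual shaping function is not given in the material above.

```python
import math


def shape_rewards(rewards, alignments, alpha=0.1, beta=0.1):
    """Hypothetical shaping: bonus for reasoning-aligned rollouts
    (alignment > 0), penalty for memorization-aligned ones (alignment < 0)."""
    shaped = []
    for r, a in zip(rewards, alignments):
        bonus = alpha * max(a, 0.0)
        penalty = beta * max(-a, 0.0)
        shaped.append(r + bonus - penalty)
    return shaped


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of rollouts."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one prompt: two correct, two incorrect (toy values).
rewards = [1.0, 1.0, 0.0, 0.0]
alignments = [0.8, -0.8, 0.5, -0.5]  # assumed per-rollout direction scores

shaped = shape_rewards(rewards, alignments, alpha=0.2, beta=0.2)
advantages = group_relative_advantages(shaped)
```

Without shaping, the two correct rollouts receive identical advantages. With it, the reasoning-aligned correct rollout is ranked above the memorization-aligned one, which is the selectivity the list above describes.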

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the direction remains stable across training checkpoints, it could be reused to monitor whether later stages of RL drift back toward memorization.
The same extraction technique might be tested on non-reasoning tasks such as instruction following to check whether analogous directions separate helpful variation from shortcut learning.
Multi-direction extensions could be explored if different reasoning domains produce distinct axes in representation space.

Load-bearing premise

A single stable direction exists in the model's representations that genuinely separates reasoning processes from memorization patterns.

What would settle it

An ablation in which reward shaping along the extracted direction produces no larger gains in benchmark accuracy than unweighted diversity rewards, or in which the direction correlates more strongly with memorized-answer variation than with novel reasoning steps.
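The second half of this test can be approximated with a simple correlation comparison. The sketch below assumes that each rollout has a projection score on the extracted direction, a token length, and a binary reasoning label from some external annotator. All three inputs and the function names are hypothetical; the paper's own diagnostics are not described in the material above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / (sx * sy)


def confound_report(projections, lengths, reasoning_labels):
    """Compare how strongly the projection tracks length vs. the label."""
    r_len = pearson(projections, lengths)
    r_lab = pearson(projections, reasoning_labels)
    return {
        "length": r_len,
        "label": r_lab,
        "length_dominates": abs(r_len) > abs(r_lab),
    }


# Hypothetical toy data for four rollouts.
projections = [0.9, 0.7, -0.6, -0.8]
lengths = [120, 480, 130, 470]
reasoning_labels = [1, 1, 0, 0]
report = confound_report(projections, lengths, reasoning_labels)
```

If `length_dominates` came out true on real rollouts, that would support the concern that the direction tracks a surface statistic rather than reasoning.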

Figures

Figures reproduced from arXiv: 2606.10346 by Jiangnan Xia, Kishan Panaganti, Ninghao Liu, Yucheng Shi, Yu Yang, Zhenwen Liang.

**Figure 1.** Comparison of different exploration strategies. Unlike existing diversity-based methods, DiRL selectively …

**Figure 2.** Ablation studies on four datasets …

**Figure 3.** Reasoning- and Memorization-Aligned Response Ratios on AIME25.

[…]arating reasoning and memorization is the foundation of DiRL. Without this partition, the exploration objective cannot distinguish beneficial reasoning trajectories from memorization-driven ones. We also observe performance declines for NoMR, which highlights the importance of evaluating memorization-aligned responses relative to reasonin…

**Figure 5.** Effect of reward shaping hyperparameters on …

**Figure 6.** Prompt used for GPT-4o-based MATH reasoning–memory labeling.

Original abstract

Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial for discovering effective solution trajectories. Existing exploration methods typically encourage diversity in semantic or gradient spaces, without distinguishing what drives this diversity. A trajectory may appear novel because it follows a new reasoning process, or because it varies memorized patterns and shortcuts. Rewarding both cases equally may steer exploration toward memorization rather than genuine reasoning improvement. In this paper, we propose DiRL, a Direction-Aware Reinforcement Learning framework that anchors exploration to an internal reasoning-memorization direction of the policy. Specifically, DiRL extracts this direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. DiRL integrates seamlessly into standard Group Relative Policy Optimization (GRPO). Extensive experiments on mathematical and general reasoning benchmarks demonstrate the effectiveness of DiRL, showing significant improvements over various existing exploration methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiRL adds direction-weighted features to GRPO to bias exploration toward reasoning, but the extraction method and validation against confounds are not shown.


The paper's main move is to pull a single direction out of model representations that is meant to separate reasoning processes from memorization, then use that direction to reweight gradients and shape the reward inside a standard GRPO loop.

This is a straightforward extension of existing diversity-based exploration work. The authors correctly identify that rewarding raw diversity can amplify shortcuts as easily as useful reasoning, and they report gains on math and general reasoning benchmarks over several baselines.

The soft spots are in the missing pieces. The abstract gives no equations, no description of how the direction is extracted (difference of means, probe, PCA?), and no checks that the direction stays stable or that it is not simply tracking length or lexical variation. Without those, the reported improvements could be re-labeling of existing variation rather than a reasoning-specific effect.

The stress-test note is on target here: a load-bearing assumption is left untested in the supplied information. If the full manuscript supplies the extraction procedure, stability tests, and ablations that rule out surface statistics, the central claim would be easier to evaluate.

This is for people already working on RL fine-tuning of LLMs for reasoning tasks. A reader who follows GRPO-style methods would get a concrete idea to try, even if the current write-up needs more grounding.

I would send it to peer review so the authors can add the implementation details and controls.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes DiRL, a Direction-Aware Reinforcement Learning framework for LLMs. It extracts an internal reasoning-memorization direction from model representations, constructs direction-weighted gradient features to characterize rollout updates, and shapes rewards within standard GRPO to amplify reasoning-aligned exploration while suppressing memorization-aligned variations. Experiments on mathematical and general reasoning benchmarks are reported to show significant improvements over existing exploration methods.

Significance. If the extracted direction genuinely isolates reasoning processes from memorization (rather than surface proxies), DiRL could provide a targeted mechanism for improving exploration in LLM RL without reinforcing shortcuts. The seamless integration with GRPO and benchmark gains would then represent a practical advance in reasoning elicitation. The absence of methodological specifics, however, leaves the significance conditional on unverified assumptions about the direction's stability and specificity.

major comments (2)

[Abstract] Abstract: the central claim that DiRL extracts a direction that 'genuinely separates reasoning processes from memorization patterns' rests on an extraction procedure (contrastive examples, method such as difference-of-means/PCA/probe, stability checks) that is not described; without this, it is impossible to evaluate whether the direction is load-bearing or merely re-labels length/entropy variation.
[Method] Method description (as summarized): no validation is supplied that the direction remains stable across prompt distributions or that weighting updates along it changes reasoning depth rather than surface statistics; this directly undermines the claim that reward shaping preferentially improves reasoning over memorization.
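The second comment asks for evidence that the direction is stable across prompt distributions. One simple check, sketched here under the assumption that a direction has already been extracted separately on several disjoint prompt subsets, is the minimum pairwise cosine similarity among those directions. The function names and toy vectors are hypothetical, and the paper may use a different stability metric.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("cosine undefined for a zero vector")
    return dot / (nu * nv)


def min_pairwise_cosine(directions):
    """Worst-case agreement among directions from different prompt subsets."""
    if len(directions) < 2:
        raise ValueError("need at least two directions to compare")
    return min(cosine(u, v) for u, v in combinations(directions, 2))


# Hypothetical directions extracted on three prompt subsets.
stable = [[1.0, 0.0], [2.0, 0.0]]
unstable = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

stable_score = min_pairwise_cosine(stable)
unstable_score = min_pairwise_cosine(unstable)
```

A value near 1 means every subset yields nearly the same direction; a value near 0 means at least two subsets disagree almost completely.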

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional methodological transparency is needed. We address the major comments point by point below and will revise the manuscript to incorporate the requested clarifications.


Referee: [Abstract] Abstract: the central claim that DiRL extracts a direction that 'genuinely separates reasoning processes from memorization patterns' rests on an extraction procedure (contrastive examples, method such as difference-of-means/PCA/probe, stability checks) that is not described; without this, it is impossible to evaluate whether the direction is load-bearing or merely re-labels length/entropy variation.

Authors: We agree that the abstract does not describe the extraction procedure in sufficient detail. In the revised manuscript we will expand the abstract to briefly specify the construction of contrastive reasoning versus memorization examples and the difference-of-means method used to obtain the direction, together with a short note on the stability checks that were performed.

revision: yes

Referee: [Method] Method description (as summarized): no validation is supplied that the direction remains stable across prompt distributions or that weighting updates along it changes reasoning depth rather than surface statistics; this directly undermines the claim that reward shaping preferentially improves reasoning over memorization.

Authors: The referee correctly notes the absence of explicit validation. While the reported benchmark gains are consistent with the intended effect, the manuscript does not contain dedicated experiments measuring direction stability across prompt distributions or isolating reasoning depth from surface statistics such as length or entropy. We will add a new subsection with these analyses, including cross-prompt consistency metrics and appropriate controls.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; direction extraction treated as independent modeling step


The paper's core proposal extracts a reasoning-memorization direction from model representations, builds direction-weighted gradient features, and shapes rewards within GRPO. This extraction and weighting are presented as an independent modeling choice rather than a quantity fitted to the target benchmark outcomes or reduced by construction to prior self-citations. No equations or procedures are described that would make the reported gains equivalent to re-labeling of inputs already present in the data or in the authors' prior work. The method therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that model representations contain a linearly separable direction distinguishing reasoning from memorization; no free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Model hidden states contain a stable, extractable direction that separates reasoning trajectories from memorization trajectories.
Invoked when the paper states that DiRL 'extracts this direction from model representations' to construct weighted features.



