Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

Ziyue Wang, Aomufei Yuan, Yongfu Zhu, Shuai Dong, Wenpu Liu, Yiran Yao, Weichu Xie, Yuqi Xu, Caoyuan Ma, Wenqi Shao, Xiaoying Zhang, Nan Duan, Jiaqi Wang

arxiv: 2606.03234 · v1 · pith:IGPBWSCYnew · submitted 2026-06-02 · 💻 cs.LG

Pith reviewed 2026-06-28 11:11 UTC · model grok-4.3

classification 💻 cs.LG

keywords: Hidden-Align, RLVR, mathematical reasoning, hidden state alignment, reinforcement learning, LLM reasoning, auxiliary loss, anchor token

The pith

Aligning hidden states of correct rollouts at the anchor token extracts a unified correct decision representation that improves RL reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper observes that correct reasoning paths converge naturally in hidden states at the anchor token immediately before the answer marker, reaching cosine similarity around 0.84 while still carrying residual variance from each path. It argues that pushing these states into fuller alignment during reinforcement learning training will cause the model to distill a path-independent correct decision signal rather than memorize specific sequences. This idea is realized by adding Hidden-Align, an auxiliary loss on the last-layer hidden states of only verified correct rollouts at that anchor position. The loss adds no overhead at training or inference time. Experiments across eight math benchmarks show consistent accuracy lifts for models ranging from 1.7B to 14B parameters.
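To make the anchor position concrete, here is a minimal sketch of locating it in one rollout. It assumes the rollout is available as a list of token strings and that the answer marker is a single known token. The marker string "\boxed{", the function name anchor_index, and the choice to use the last marker when several appear are illustrative assumptions, not details taken from the paper.

```python
from typing import Optional, Sequence


def anchor_index(tokens: Sequence[str], answer_marker: str) -> Optional[int]:
    """Return the index of the token immediately before the last answer marker.

    Returns None when the marker is absent or is the very first token,
    since no anchor position exists in either case.
    """
    # Scan from the end so that the final answer marker wins if several appear.
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] == answer_marker:
            return i - 1 if i > 0 else None
    return None


# Hypothetical tokenized rollout; "\\boxed{" stands in for the answer marker.
rollout = ["So", "the", "sum", "is", "\\boxed{", "42", "}"]
print(anchor_index(rollout, "\\boxed{"))  # 3, the position of "is"
```

The hidden state read at this index is the one the alignment loss would act on.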

Core claim

Correct rollouts naturally converge at the anchor token because they produce the same answer, yet retain residual variance from unique reasoning paths. Encouraging full alignment at this point pushes the model to extract a unified "correct decision" representation, reducing sensitivity to which reasoning path was taken. The Hidden-Align loss achieves this by aligning the last-layer hidden states of correct rollouts at the anchor token during RL training.
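One way to write such an objective is sketched below. The abstract does not give the loss formula, so the pairwise-cosine form and every symbol here are assumptions: h_i is the last-layer hidden state of rollout i at the anchor token, C is the set of verified-correct rollouts for one prompt (with at least two members), and lambda is the loss weight.

```latex
% Assumed notation, not taken from the paper.
\[
\mathcal{L}_{\text{align}}
  = 1 - \frac{2}{|C|\,(|C|-1)} \sum_{\substack{i,j \in C \\ i<j}}
      \frac{h_i^{\top} h_j}{\lVert h_i \rVert \, \lVert h_j \rVert},
\qquad
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{RL}} + \lambda \, \mathcal{L}_{\text{align}} .
\]
```

Under this form, a mean pairwise cosine similarity of 0.84 corresponds to an alignment loss of 0.16, and full alignment drives the term to 0.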

What carries the argument

Hidden-Align auxiliary loss that aligns last-layer hidden states of correct rollouts at the anchor token.
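The mechanism can be illustrated with a small, framework-free sketch. It assumes the loss is one minus the mean pairwise cosine similarity over verified-correct rollouts, which is only one plausible reading; the paper ablates the loss type and the abstract does not state the exact form. The names cosine and alignment_loss are hypothetical, and a real training setup would compute this on tensors with automatic differentiation rather than Python lists.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(anchor_states: List[Sequence[float]], is_correct: List[bool]) -> float:
    """One minus the mean pairwise cosine similarity over verified-correct rollouts.

    anchor_states[i] is the (assumed) last-layer hidden state of rollout i at the
    anchor token; is_correct[i] is the verifier's verdict for that rollout.
    """
    # Only verified-correct rollouts take part in the alignment term.
    correct = [h for h, ok in zip(anchor_states, is_correct) if ok]
    if len(correct) < 2:
        return 0.0  # nothing to align with fewer than two correct rollouts
    pairs = list(combinations(correct, 2))
    return 1.0 - sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy vectors for illustration only: two parallel correct states, one incorrect state.
states = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
print(alignment_loss(states, [True, True, False]))  # 0.0, the correct pair is already aligned
```

Incorrect rollouts are filtered out before any pair is formed, which mirrors the restriction to verified-correct rollouts described above.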

If this is right

Raises average pass@1 over the DAPO baseline by 3.8, 6.2, and 5.4 percentage points on Qwen3-1.7B, 4B, and 14B, respectively.
Produces consistent gains in pass@k across all three model scales.
Adds zero overhead to both training and inference.
Yields improvements confirmed by ablations varying loss type, anchor position, layer depth, and loss weight.
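For readers unfamiliar with the pass@k metric in the list above, the following sketch shows the widely used unbiased estimator introduced with the HumanEval code benchmark: given n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper uses exactly this estimator is an assumption; the summary does not say how pass@k was computed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 8 samples for one problem, 2 of them correct.
print(pass_at_k(8, 2, 1))  # 0.25, which equals the plain accuracy c / n
```

For k = 1 the estimator reduces to the fraction of correct samples, which is why pass@1 is often read as average accuracy.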

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach suggests RLVR can be strengthened by adding geometric constraints on internal states in addition to scalar rewards.
Similar alignment at the decision token might transfer to other verifiable-outcome tasks such as code synthesis.
Reducing path sensitivity at one critical position could make models more robust when allowed to explore diverse reasoning styles.
Testing the same loss at earlier tokens or additional layers could reveal whether the anchor is uniquely effective.

Load-bearing premise

The residual variance among hidden states of correct rollouts at the anchor token is noise rather than useful signal that should be removed to improve generalization.

What would settle it

Applying the alignment loss produces no improvement or a decrease in pass@1 on the eight benchmarks, or the cosine similarity among correct states at the anchor is found to be no higher than among incorrect states.
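The second check could be run as sketched below: compute the mean pairwise cosine similarity separately for correct and incorrect anchor states and compare the two. The vectors are toy values invented for illustration, and mean_pairwise_cosine is a hypothetical helper, not code from the paper.

```python
import math
from itertools import combinations
from typing import List, Sequence


def mean_pairwise_cosine(states: List[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of non-zero vectors."""
    def cosine(u: Sequence[float], v: Sequence[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    pairs = list(combinations(states, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy anchor states, invented for illustration only.
correct = [[1.0, 0.1], [1.0, -0.1], [0.9, 0.0]]
incorrect = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
gap = mean_pairwise_cosine(correct) - mean_pairwise_cosine(incorrect)
print(gap > 0)  # True here: the toy correct group is the tighter cluster
```

A gap at or below zero on real rollouts would undercut the observation that motivates the method.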

Original abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has become the dominant approach for improving mathematical reasoning in large language models, yet current methods reduce each correct rollout to a single reward bit, ignoring the geometric structure shared among their hidden states. Investigating this structure, we find that at the anchor token (the position immediately before the answer marker), correct rollouts converge naturally because they must produce the same answer (cosine similarity ~0.84), yet each retains residual variance from its unique reasoning path. Encouraging full alignment at this point pushes the model to extract a unified "correct decision" representation, reducing sensitivity to which reasoning path was taken. Based on this observation, we propose Hidden-Align, an auxiliary loss function that aligns the last-layer hidden states of correct rollouts at the anchor token during RL training, with zero overhead in both training and inference. On eight mathematical reasoning benchmarks, Hidden-Align improves average pass@1 over the DAPO baseline by 3.8, 6.2, and 5.4 percentage points on Qwen3-1.7B, 4B, and 14B respectively, with consistent pass@k gains across all three scales, supported by ablations on loss type, anchor position, layer depth, and loss weight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a Hidden-Align auxiliary loss that forces greater alignment among correct-rollout hidden states at the anchor token, reporting 3.8-6.2 point gains over DAPO on math benchmarks, but the gains rest on the untested premise that residual variance is noise rather than useful signal.

The main point is that they observe correct rollouts already reach cosine similarity around 0.84 at the token right before the answer marker, then introduce an auxiliary loss to push those states even closer together during RLVR. This produces reported average pass@1 lifts of 3.8, 6.2, and 5.4 points on Qwen3 models at 1.7B, 4B, and 14B scales across eight math benchmarks, plus pass@k improvements and a set of ablations on loss type, position, layer, and weight.

The contribution is the specific choice of aligning verified-correct hidden states at that anchor point rather than using only the scalar reward. The zero-overhead claim is straightforward and the empirical pattern holds across scales, which is worth noting.

The soft spot is the assumption that the remaining variance among correct paths is detrimental noise whose removal yields a more generalizable representation. If that variance instead encodes distinct valid strategies, the alignment objective could reduce robustness on variants the training distribution does not cover. The abstract invokes the benefit without direct tests such as OOD generalization checks or measurements of whether path diversity drops.

The work is aimed at groups already running RLVR on math reasoning. It deserves a serious referee because the method is simple to reproduce, the observation is concrete, and the reported gains are large enough to check, even though the causal story about why alignment helps needs tighter evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Hidden-Align, an auxiliary loss that aligns last-layer hidden states of verified-correct rollouts at the anchor token (immediately before the answer marker) during RLVR training. Motivated by the observation that these states already exhibit cosine similarity ~0.84 yet retain residual path-specific variance, the method is claimed to extract a unified 'correct decision' representation. It reports average pass@1 gains of 3.8, 6.2, and 5.4 percentage points over the DAPO baseline on eight mathematical reasoning benchmarks for Qwen3-1.7B, 4B, and 14B models, respectively, together with consistent pass@k gains and ablations on loss type, anchor position, layer depth, and loss weight; the loss incurs zero training or inference overhead.

Significance. If the reported gains prove robust, the work supplies a lightweight, geometry-aware auxiliary objective that can be added to existing RLVR pipelines without cost. The explicit ablations on multiple design choices and the zero-overhead property are concrete strengths that would make the technique easy to adopt and test in other verifiable-reward settings.

major comments (2)

[Abstract / §2] Abstract and §2 (Observation): the central premise that residual variance among correct rollouts at the anchor token constitutes detrimental path-specific noise (rather than potentially useful diversity) is invoked to motivate the alignment loss, yet no direct test—such as measuring correlation between per-rollout variance and generalization error on problem variants—is reported. This assumption is load-bearing for the claim that 'encouraging full alignment pushes the model to extract a unified representation.'
[Experiments / results tables] Experiments section, results tables: the reported average pass@1 improvements (3.8/6.2/5.4 pp) and pass@k gains are presented without accompanying statistical significance tests, standard errors across seeds, or explicit handling of multiple-testing correction across eight benchmarks and three model scales. The reader's note also flags that data-exclusion rules for rollouts cannot be verified from the provided material.
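As an illustration of the multiple-testing point, eight benchmarks at three model scales give 24 comparisons, so a plain Bonferroni correction at a family-wise level of 0.05 would require each p-value to fall below 0.05 / 24, roughly 0.0021. The sketch below implements the less conservative Holm step-down procedure. The p-values in the example are placeholders, not results from the paper, and holm_reject is a hypothetical helper name.

```python
from typing import List


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure: for each input p-value, report whether it is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails as well
    return reject


# Placeholder p-values for four hypothetical comparisons.
print(holm_reject([0.01, 0.04, 0.03, 0.005]))  # [True, False, False, True]
```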

minor comments (2)

[§3] Notation for the anchor token and the precise definition of the alignment loss (e.g., whether it is a simple cosine or a more structured objective) should be stated once in a dedicated equation in §3 rather than only in prose.
[Experiments] The abstract states 'Qwen3' models; confirm the exact base-model identifiers (e.g., Qwen2.5 or Qwen3) in the experimental setup for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive comments. We address each major point below with clarifications based on the manuscript's observations and results.

Referee: [Abstract / §2] Abstract and §2 (Observation): the central premise that residual variance among correct rollouts at the anchor token constitutes detrimental path-specific noise (rather than potentially useful diversity) is invoked to motivate the alignment loss, yet no direct test—such as measuring correlation between per-rollout variance and generalization error on problem variants—is reported. This assumption is load-bearing for the claim that 'encouraging full alignment pushes the model to extract a unified representation.'

Authors: The premise is grounded in the direct geometric observation reported in §2: correct rollouts already converge to cosine similarity ~0.84 at the anchor token because they must output the identical answer, yet retain measurable residual variance attributable to distinct reasoning paths. The consistent empirical gains from Hidden-Align (3.8–6.2 pp pass@1, plus pass@k improvements) across three model scales, together with the ablations on loss type, anchor position, layer depth, and weight, indicate that suppressing this residual variance improves decision robustness rather than harming diversity. A correlation study with generalization error on problem variants would be a useful extension but is not required to support the reported method or results; the geometry-plus-performance evidence suffices for the claim.

Revision: no

Referee: [Experiments / results tables] Experiments section, results tables: the reported average pass@1 improvements (3.8/6.2/5.4 pp) and pass@k gains are presented without accompanying statistical significance tests, standard errors across seeds, or explicit handling of multiple-testing correction across eight benchmarks and three model scales. The reader's note also flags that data-exclusion rules for rollouts cannot be verified from the provided material.

Authors: We agree that reporting standard errors and addressing multiple comparisons would strengthen the presentation. In the revised manuscript we will add per-benchmark standard errors computed over the three random seeds used for each scale and include a brief note on multiple-testing considerations. The data-exclusion rule is stated in §3.2: only rollouts whose final answer passes the verifier are retained for the alignment loss; no additional filtering is applied. We will make this criterion more explicit with a short clarifying sentence in the experimental setup.

Revision: yes
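The per-benchmark standard errors promised in this reply could be computed as in the sketch below, which takes one pass@1 score per seed and returns the mean and the standard error of the mean. The three scores are placeholders, not numbers from the paper, and mean_and_standard_error is a hypothetical helper name.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_standard_error(scores: Sequence[float]) -> Tuple[float, float]:
    """Return the sample mean and the standard error of the mean (sample std / sqrt(n))."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


# Placeholder pass@1 values for three seeds; these are not numbers from the paper.
m, se = mean_and_standard_error([50.0, 52.0, 54.0])
print(f"{m:.1f} +/- {se:.2f}")  # 52.0 +/- 1.15
```

With only three seeds the standard error is itself a rough estimate, so gaps between methods should be read against it with some caution.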

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmarks

The paper observes cosine similarity among correct rollouts at the anchor token, introduces an auxiliary Hidden-Align loss to increase that alignment, and reports pass@1 gains versus the external DAPO baseline on eight standard mathematical reasoning benchmarks. No derivation, equation, or central claim reduces the reported improvements to a quantity defined by the loss itself or to a self-citation chain. The assumption that residual variance is detrimental noise is presented as a hypothesis supported by ablations and results rather than a definitional equivalence. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.1-grok · 5803 in / 1119 out tokens · 26095 ms · 2026-06-28T11:11:05.045203+00:00 · methodology

[1] [1]

Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1): 207–219, 2022

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Computational Linguistics, 48(1): 207–219, 2022

2022

[2] [2]

Evaluating large language models trained on code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

Pith/arXiv arXiv 2021

[3] [3]

Fapo: flawed-aware policy optimization for efficient and reliable reasoning.arXiv preprint arXiv:2510.22543, 2025

Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, and Min Zhang. Fapo: flawed-aware policy optimization for efficient and reliable reasoning.arXiv preprint arXiv:2510.22543, 2025

arXiv 2025

[4] [4]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[5] [5]

Reward inside the model: A lightweight hidden-state reward model for llm’s best-of-n sampling

Jizhou Guo, Zhaomin Wu, and S Yu Philip. Reward inside the model: A lightweight hidden-state reward model for llm’s best-of-n sampling. In2nd AI for Math Workshop@ICML 2025, 2025

2025

[6] [6]

Foundation models for semantic novelty in reinforcement learning.arXiv preprint arXiv:2211.04878, 2022

Tarun Gupta, Peter Karkus, Tong Che, Danfei Xu, and Marco Pavone. Foundation models for semantic novelty in reinforcement learning.arXiv preprint arXiv:2211.04878, 2022

arXiv 2022

[7] [7]

HMMT february competition.https://www.hmmt.org/, 2025

Harvard-MIT Mathematics Tournament. HMMT february competition.https://www.hmmt.org/, 2025

2025

[8] [8]

Rewarding the unlikely: Lifting grpo beyond distribution sharpening

Andre Wang He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting grpo beyond distribution sharpening. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25559–25571, 2025

2025

[9] [9]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd AnnualMeeting of the Association for Computational Linguistics (Volume1: Long Papers)...

2024

[10] [10]

Apo: Enhancing reasoning ability of mllms via asymmetric policy optimization.arXiv preprint arXiv:2506.21655, 2025

Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, and Zhou Zhao. Apo: Enhancing reasoning ability of mllms via asymmetric policy optimization.arXiv preprint arXiv:2506.21655, 2025

arXiv 2025

[11] [11]

Solving quantitative reasoning problems with language models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Am- brose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advancesin neural information processing systems, 35:3843–3857, 2022

2022

[12] [12]

Leveraging error diversity in group rollouts for reinforcement learning.arXiv preprint arXiv:2605.17333, 2026

Wenpu Liu, Yuqi Xu, Weichu Xie, Yongfu Zhu, Shuai Dong, Ziyue Wang, Wenqi Shao, Xiaoying Zhang, Tong Yang, Nan Duan, et al. Leveraging error diversity in group rollouts for reinforcement learning.arXiv preprint arXiv:2605.17333, 2026

Pith/arXiv arXiv 2026

[13] [13]

Under- standing r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Under- standing r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

Pith/arXiv arXiv 2025

[14] [14]

Contrastive reasoning alignment: Rein- forcement learning from hidden representations.arXiv preprint arXiv:2603.17305, 2026

Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, and Yan Chen. Contrastive reasoning alignment: Rein- forcement learning from hidden representations.arXiv preprint arXiv:2603.17305, 2026

Pith/arXiv arXiv 2026

[15] [15]

AIME problems and solutions.https://maa.org/, 2024

Mathematical Association of America. AIME problems and solutions.https://maa.org/, 2024

2024

[16] [16]

AMC 10/12 problems and solutions.https://maa.org/, 2024

Mathematical Association of America. AMC 10/12 problems and solutions.https://maa.org/, 2024

2024

[17] [17]

Ngrpo: Negative-enhanced group relative policy optimization.arXiv preprint arXiv:2509.18851, 2025

Gongrui Nan, Siye Chen, Jing Huang, Mengyu Lu, Dexun Wang, Chunmei Xie, Weiqi Xiong, Xianzhou Zeng, Qixuan Zhou, Yadong Li, et al. Ngrpo: Negative-enhanced group relative policy optimization.arXiv preprint arXiv:2509.18851, 2025

arXiv 2025

[18] [18]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sand- hini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advancesin neural information processing systems, 35:27730–27744, 2022

2022

[19] [19]

Relational knowledge distillation

Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3967–3976, 2019

2019

[20] [20]

Ride: Rewarding impact-driven exploration for procedurally-generated environments

Roberta Raileanu and Tim Rocktäschel. Ride: Rewarding impact-driven exploration for procedurally-generated environments. arXiv preprint arXiv:2002.12292, 2020. 12

arXiv 2002

[21] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[22] Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, and Dimitris Papailiopoulos. Sample more to think less: Group filtered policy optimization for concise reasoning. arXiv preprint arXiv:2508.09726, 2025

[23] Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, and Paolo Mori. GTPO: Stabilizing group relative policy optimization via gradient and entropy control. arXiv preprint arXiv:2508.03772, 2025

[24] Lihao Sun, Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Saravan Rajmohan. LLM reasoning as trajectories: Step-specific representation geometry and correctness signals. arXiv preprint arXiv:2604.05655, 2026

[25] Yan Sun, Jia Guo, Stanley Kok, Zihao Wang, Zujie Wen, and Zhiqiang Zhang. Efficient reinforcement learning for large language models with intrinsic exploration. arXiv preprint arXiv:2511.00794, 2025

[26] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019

[27] Boyuan Wang, Yun Qu, Yuhang Jiang, Jianzhun Shao, Chang Liu, Wenming Yang, and Xiangyang Ji. LLM-empowered state representation for reinforcement learning. arXiv preprint arXiv:2407.13237, 2024

[28] Chaoren Wang, Heng Lu, Xueyao Zhang, Shujie Liu, Yan Lu, Jinyu Li, and Zhizheng Wu. Closing the modality reasoning gap for speech large language models. arXiv preprint arXiv:2601.05543, 2026

[29] Weichu Xie, Haozhe Zhao, Wenpu Liu, Yongfu Zhu, Liang Chen, Minghao Ye, Zirong Chen, Yuqi Xu, Shuai Dong, Ziyue Wang, et al. Step-wise rubric rewards for LLM reasoning. arXiv preprint arXiv:2605.17291, 2026

[30] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[31] Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing hidden states enables learning generalizable reward model for LLMs. Advances in Neural Information Processing Systems, 37:62279–62309, 2024

[32] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. Advances in Neural Information Processing Systems, 38:113222–113244, 2026

[33] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024

[34] Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025

[35] Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025

[36] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

[37] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023

APPENDIX

A Cosine Similarity Distributions

To verify that correct rollouts cluster more tightly th...