pith. machine review for the scientific record.

arxiv: 2602.15620 · v4 · submitted 2026-02-17 · 💻 cs.CL · cs.AI

Recognition: no theorem link

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Authors on Pith no claims yet

Pith reviewed 2026-05-15 21:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords reinforcement learning · large language models · spurious tokens · policy optimization · mathematical reasoning · entropy stability · gradient suppression · fine-tuning stability

The pith

Silencing gradients from a tiny fraction of spurious tokens stabilizes RL fine-tuning of LLMs and raises math reasoning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that roughly 0.01 percent of tokens in LLM reasoning traces receive the full sequence-level reward despite contributing almost nothing to the final answer, which inflates their gradients and drives entropy spikes followed by performance collapse late in training. It introduces the Silencing Spurious Tokens (S2T) mechanism to zero out those gradients selectively and folds the change into a group-based policy objective called STAPO. Across Qwen 1.7B, 8B, and 14B models on six math benchmarks, the method keeps entropy flat and lifts average accuracy by 11.49 percent in one sampling setting (ρ_T=1.0, top-p=1.0) and 3.73 percent in another (ρ_T=0.7, top-p=0.9) relative to GRPO, 20-Entropy, and JustRL. A sympathetic reader cares because current RL recipes for reasoning models still rely on ad-hoc fixes that fail at scale, and a targeted token-level intervention could remove the need for them.

Core claim

The central claim is that a small set of spurious tokens inherits the full outcome reward, producing outsized gradient updates that destabilize the policy and degrade reasoning quality. The authors define a unified evaluation of token-level effects across spurious risk, gradient norm, and entropy change, then propose the Silencing Spurious Tokens (S2T) mechanism to suppress gradients from these tokens inside a group-relative objective. The resulting STAPO algorithm produces stable entropy trajectories and consistent accuracy gains on mathematical reasoning tasks for Qwen models of three sizes.

What carries the argument

The Silencing Spurious Tokens (S2T) mechanism, which identifies low-contribution tokens and suppresses their gradient contributions within the group-based policy update.
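As a rough illustration, the shape of this update can be sketched with a toy group-relative step. The spurious-token test below (a bare low-probability threshold) is a placeholder assumption for illustration only; the paper's actual identification rule combines spurious risk, gradient norms, and entropy changes.

```python
import numpy as np

def group_relative_advantages(rewards):
    # GRPO-style advantage: normalize sequence-level outcome rewards within
    # a sampled group; every token in a sequence inherits its advantage.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def s2t_keep_mask(token_probs, advantages, prob_threshold=1e-3):
    # Placeholder spurious-token criterion: a very-low-probability token
    # that still inherits a positive sequence advantage gets its gradient
    # zeroed. This stands in for the paper's richer S2T rule.
    spurious = (token_probs < prob_threshold) & (advantages[:, None] > 0)
    return ~spurious  # True = keep this token's gradient

rewards = np.array([1.0, 0.0, 1.0, 0.0])       # outcome reward per sequence
token_probs = np.array([[0.9, 0.0005, 0.7],    # per-token policy probabilities
                        [0.8, 0.6, 0.5],
                        [0.0002, 0.9, 0.8],
                        [0.7, 0.7, 0.6]])
adv = group_relative_advantages(rewards)
keep = s2t_keep_mask(token_probs, adv)
token_grad_weights = adv[:, None] * keep       # silenced tokens get weight 0
```

The point of the sketch is that silencing edits only the per-token gradient weights; the sampling and reward structure of the group-based objective are untouched.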

If this is right

  • Late-stage performance collapse in RL fine-tuning of reasoning models can be prevented by token-level gradient editing rather than global entropy regularization.
  • The same S2T logic can be added to other group-relative objectives without changing their sampling or reward structure.
  • Entropy remains controlled across training without extra regularization terms once spurious gradient contributions are removed.
  • Accuracy gains appear consistently across 1.7B to 14B model scales on math benchmarks under both full and top-p sampling.
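The "entropy" tracked in these claims is typically the mean per-token policy entropy over training; a minimal stdlib version of that statistic, with illustrative distributions, looks like:

```python
import math

def mean_token_entropy(prob_dists):
    # Shannon entropy of the next-token distribution at each position,
    # averaged over positions. A flat trajectory of this number across
    # training steps is what "entropy stability" refers to.
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in prob_dists]
    return sum(entropies) / len(entropies)

# A uniform distribution over 4 tokens has entropy ln(4); a peaked policy
# has much lower entropy (the collapse regime the paper warns about).
uniform = [[0.25, 0.25, 0.25, 0.25]]
peaked = [[0.97, 0.01, 0.01, 0.01]]
```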

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could transfer to non-math RL tasks such as code generation where similar low-value tokens might receive oversized credit.
  • Detecting spurious tokens automatically rather than by fixed frequency thresholds would make the method easier to apply to new domains.
  • If spurious tokens also appear in preference data, the same silencing step might reduce reward-model exploitation in standard RLHF.

Load-bearing premise

That the identified spurious tokens are the dominant source of instability and that zeroing their gradients removes noise without discarding useful reasoning information or creating new biases.

What would settle it

Run identical STAPO training on the same Qwen models but disable S2T gradient suppression; if entropy still stays flat and accuracy matches the reported gains, the causal role of spurious tokens would be falsified.
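A hypothetical toggle for that ablation (the config keys below are invented for illustration, not the authors' actual training harness):

```python
# Run the same STAPO training twice, differing only in whether the S2T
# gradient suppression is active; everything else is held fixed.
base_config = {
    "objective": "group_relative",   # sampling and reward structure unchanged
    "model_sizes": ["1.7B", "8B", "14B"],
    "s2t_silencing": True,           # full STAPO
}
ablation_config = {**base_config, "s2t_silencing": False}  # S2T disabled
```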

read the original abstract

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that a small fraction (~0.01%) of spurious tokens causes instability in RL fine-tuning of LLMs by receiving amplified gradients from sequence-level rewards. The authors introduce a unified framework to identify these tokens based on spurious risk, gradient norms, and entropy, and propose the S2T mechanism to silence their gradients. This is incorporated into STAPO, a group-based policy optimization method, which shows superior entropy stability and performance gains of 11.49% (ρ_T=1.0, top-p=1.0) and 3.73% (ρ_T=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL on six math reasoning benchmarks with Qwen 1.7B, 8B, and 14B models.

Significance. If the results hold and the improvements are specifically due to silencing the identified spurious tokens rather than generic regularization, the work could provide a targeted approach to stabilizing RL training for LLMs, reducing reliance on heuristic entropy methods and improving reliability for scaling reasoning in large models. The cross-model-size empirical results would be a strength if the attribution is validated.

major comments (3)
  1. Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.
  2. Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.
  3. S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.
minor comments (2)
  1. Abstract: The phrase 'consistent gains' should be qualified with whether improvements hold on every benchmark or are driven by averages.
  2. Notation: The parameters ρ_T and top-p appear in the results tables but their precise definitions and selection process could be stated more explicitly in the main text for reproducibility.
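The control requested in major comment 2 is straightforward to specify; a minimal sketch of the random mask, assuming the ~0.01% fraction from the paper:

```python
import numpy as np

def random_silence_mask(shape, fraction=1e-4, seed=0):
    # Silence a uniformly random token subset of the same size (~0.01%)
    # as the S2T-identified set, with all other hyperparameters fixed.
    # If this control matches STAPO's stability and gains, the benefit is
    # generic low-frequency suppression rather than targeted silencing.
    rng = np.random.default_rng(seed)
    keep = np.ones(shape, dtype=bool)
    n_silenced = max(1, round(fraction * keep.size))
    idx = rng.choice(keep.size, size=n_silenced, replace=False)
    keep.flat[idx] = False
    return keep

mask = random_silence_mask((100, 100))   # 10,000 tokens, 0.01% silenced
```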

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas to strengthen the paper. We address each major comment below and will incorporate revisions to provide more rigorous empirical support.

read point-by-point responses
  1. Referee: Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.

    Authors: We fully agree that error bars, multiple runs, and statistical tests are essential to substantiate the performance claims. In the revised manuscript, we will rerun the experiments with at least 3 different random seeds, report mean and standard deviation for all metrics, and include p-values from statistical tests (such as Wilcoxon signed-rank test) to demonstrate the significance of the improvements over baselines. revision: yes

  2. Referee: Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.

    Authors: This is a valid concern for attributing the benefits specifically to our framework. We will add a new ablation experiment in the revised paper where we randomly select and silence an equivalent fraction (0.01%) of tokens without using our identification criteria, and compare the results to STAPO on both stability and benchmark performance. This control will help confirm that the targeted silencing of spurious tokens is key. revision: yes

  3. Referee: S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.

    Authors: We appreciate this point and will enhance the manuscript with additional verification. Specifically, we will include experiments showing the effect of silencing on individual reasoning steps, such as by comparing the correctness of generated solutions with and without the S2T mechanism in controlled settings, and analyze potential biases by examining the distribution of generated tokens or reward signals post-silencing. This will support that reasoning quality is preserved. revision: yes
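The significance testing promised in response 1 could take the following shape. The deltas are illustrative placeholders, and the exact sign test is a dependency-free stand-in for the scipy Wilcoxon signed-rank test the authors name.

```python
import math
from statistics import mean, stdev

def exact_sign_test_p(diffs):
    # Two-sided exact sign test on paired per-benchmark differences: a
    # stdlib stand-in for scipy.stats.wilcoxon, which the rebuttal
    # proposes for the revised experiments.
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    wins = sum(d > 0 for d in nonzero)
    extreme = max(wins, n - wins)
    tail = sum(math.comb(n, k) for k in range(extreme, n + 1))
    return min(1.0, 2 * tail / 2 ** n)

# Illustrative STAPO-minus-baseline accuracy deltas on six benchmarks:
diffs = [2.1, 0.8, 3.4, 1.2, 0.5, 1.9]
summary = (mean(diffs), stdev(diffs))    # report mean and spread per metric
p_value = exact_sign_test_p(diffs)       # all six positive -> p = 2/64
```

With only six benchmarks, even a unanimous win yields p = 0.03125, which is why multiple seeds per benchmark (as the authors propose) matter more than the choice of test.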

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper motivates STAPO via an empirical analysis of token-level statistics (spurious risk, gradient norms, entropy changes) to flag ~0.01% spurious tokens, then defines a silencing mechanism inside a group-based policy objective. Performance gains are reported as experimental outcomes on held-out benchmarks rather than as quantities derived from fitted parameters that reduce to the identification rule by construction. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the provided text; the token-selection rule is not shown to be a direct function of the same reward signal used for the final policy update. The derivation therefore remains self-contained against external benchmarks and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that sequence-level rewards are the source of spurious-token amplification and on the empirical observation that 0.01% of tokens dominate gradient disruption.

free parameters (2)
  • ρ_T
    Token silencing threshold used in the reported runs (values 0.7 and 1.0)
  • top-p
    Sampling parameter varied in the two reported settings
axioms (1)
  • domain assumption A small fraction of tokens inherit the full sequence reward yet contribute negligibly to the final reasoning outcome
    Stated as the key factor behind instability
invented entities (2)
  • Spurious tokens no independent evidence
    purpose: Explain source of gradient instability
    Defined as ~0.01% of tokens with low contribution but high gradient impact
  • S2T mechanism no independent evidence
    purpose: Suppress gradient perturbations from spurious tokens
    New component introduced to implement silencing

pith-pipeline@v0.9.0 · 5613 in / 1391 out tokens · 43067 ms · 2026-05-15T21:38:10.962118+00:00 · methodology

discussion (0)

