pith. machine review for the scientific record.

arxiv: 2602.15620 · v4 · submitted 2026-02-17 · 💻 cs.CL · cs.AI

Recognition: no theorem link

STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Authors on Pith no claims yet

Pith reviewed 2026-05-15 21:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords reinforcement learning · large language models · spurious tokens · policy optimization · mathematical reasoning · entropy stability · gradient suppression · fine-tuning stability

The pith

Silencing gradients from a tiny fraction of spurious tokens stabilizes RL fine-tuning of LLMs and raises math reasoning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that roughly 0.01 percent of tokens in LLM reasoning traces receive the full sequence-level reward despite contributing almost nothing to the final answer, which inflates their gradients and drives entropy spikes followed by performance collapse late in training. It introduces the Silencing Spurious Tokens (S2T) mechanism to zero out those gradients selectively and folds the change into a group-based policy objective called STAPO. Across Qwen 1.7B, 8B, and 14B models on six math benchmarks, the method keeps entropy flat and lifts average accuracy by 11.49 percent in one sampling setting (ρ_T=1.0, top-p=1.0) and 3.73 percent in another (ρ_T=0.7, top-p=0.9) relative to GRPO, 20-Entropy, and JustRL. A sympathetic reader cares because current RL recipes for reasoning models still rely on ad-hoc fixes that fail at scale, and a targeted token-level intervention could remove the need for them.

Core claim

The central claim is that a small set of spurious tokens inherits the full outcome reward, producing outsized gradient updates that destabilize the policy and degrade reasoning quality. The authors define a unified evaluation of token-level effects across spurious risk, gradient norm, and entropy change, then propose the Silencing Spurious Tokens (S2T) mechanism to suppress gradients from these tokens inside a group-relative objective. The resulting STAPO algorithm produces stable entropy trajectories and consistent accuracy gains on mathematical reasoning tasks for Qwen models of three sizes.

What carries the argument

The Silencing Spurious Tokens (S2T) mechanism, which identifies low-contribution tokens and suppresses their gradient contributions within the group-based policy update.
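As a rough illustration, the shape of this update can be sketched with a toy group-relative step. The spurious-token test below (a bare low-probability threshold) is a placeholder assumption for illustration only; the paper's actual identification rule combines spurious risk, gradient norms, and entropy changes.

```python
import numpy as np

def group_relative_advantages(rewards):
    # GRPO-style advantage: normalize sequence-level outcome rewards within
    # a sampled group; every token in a sequence inherits its advantage.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def s2t_keep_mask(token_probs, advantages, prob_threshold=1e-3):
    # Placeholder spurious-token criterion: a very-low-probability token
    # that still inherits a positive sequence advantage gets its gradient
    # zeroed. This stands in for the paper's richer S2T rule.
    spurious = (token_probs < prob_threshold) & (advantages[:, None] > 0)
    return ~spurious  # True = keep this token's gradient

rewards = np.array([1.0, 0.0, 1.0, 0.0])       # outcome reward per sequence
token_probs = np.array([[0.9, 0.0005, 0.7],    # per-token policy probabilities
                        [0.8, 0.6, 0.5],
                        [0.0002, 0.9, 0.8],
                        [0.7, 0.7, 0.6]])
adv = group_relative_advantages(rewards)
keep = s2t_keep_mask(token_probs, adv)
token_grad_weights = adv[:, None] * keep       # silenced tokens get weight 0
```

The point of the sketch is that silencing edits only the per-token gradient weights; the sampling and reward structure of the group-based objective are untouched.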

If this is right

  • Late-stage performance collapse in RL fine-tuning of reasoning models can be prevented by token-level gradient editing rather than global entropy regularization.
  • The same S2T logic can be added to other group-relative objectives without changing their sampling or reward structure.
  • Entropy remains controlled across training without extra regularization terms once spurious gradient contributions are removed.
  • Accuracy gains appear consistently across 1.7B to 14B model scales on math benchmarks under both full and top-p sampling.
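The "entropy" tracked in these claims is typically the mean per-token policy entropy over training; a minimal stdlib version of that statistic, with illustrative distributions, looks like:

```python
import math

def mean_token_entropy(prob_dists):
    # Shannon entropy of the next-token distribution at each position,
    # averaged over positions. A flat trajectory of this number across
    # training steps is what "entropy stability" refers to.
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in prob_dists]
    return sum(entropies) / len(entropies)

# A uniform distribution over 4 tokens has entropy ln(4); a peaked policy
# has much lower entropy (the collapse regime the paper warns about).
uniform = [[0.25, 0.25, 0.25, 0.25]]
peaked = [[0.97, 0.01, 0.01, 0.01]]
```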

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could transfer to non-math RL tasks such as code generation where similar low-value tokens might receive oversized credit.
  • Detecting spurious tokens automatically rather than by fixed frequency thresholds would make the method easier to apply to new domains.
  • If spurious tokens also appear in preference data, the same silencing step might reduce reward-model exploitation in standard RLHF.

Load-bearing premise

That the identified spurious tokens are the dominant source of instability and that zeroing their gradients removes noise without discarding useful reasoning information or creating new biases.

What would settle it

Run identical STAPO training on the same Qwen models but disable S2T gradient suppression; if entropy still stays flat and accuracy matches the reported gains, the causal role of spurious tokens would be falsified.
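A hypothetical toggle for that ablation (the config keys below are invented for illustration, not the authors' actual training harness):

```python
# Run the same STAPO training twice, differing only in whether the S2T
# gradient suppression is active; everything else is held fixed.
base_config = {
    "objective": "group_relative",   # sampling and reward structure unchanged
    "model_sizes": ["1.7B", "8B", "14B"],
    "s2t_silencing": True,           # full STAPO
}
ablation_config = {**base_config, "s2t_silencing": False}  # S2T disabled
```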

read the original abstract

Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that a small fraction (~0.01%) of spurious tokens causes instability in RL fine-tuning of LLMs by receiving amplified gradients from sequence-level rewards. The authors introduce a unified framework to identify these tokens based on spurious risk, gradient norms, and entropy, and propose the S2T mechanism to silence their gradients. This is incorporated into STAPO, a group-based policy optimization method, which shows superior entropy stability and performance gains of 11.49% (ρ_T=1.0, top-p=1.0) and 3.73% (ρ_T=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL on six math reasoning benchmarks with Qwen 1.7B, 8B, and 14B models.

Significance. If the results hold and the improvements are specifically due to silencing the identified spurious tokens rather than generic regularization, the work could provide a targeted approach to stabilizing RL training for LLMs, reducing reliance on heuristic entropy methods and improving reliability for scaling reasoning in large models. The cross-model-size empirical results would be a strength if the attribution is validated.

major comments (3)
  1. Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.
  2. Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.
  3. S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.
minor comments (2)
  1. Abstract: The phrase 'consistent gains' should be qualified with whether improvements hold on every benchmark or are driven by averages.
  2. Notation: The parameters ρ_T and top-p appear in the results tables but their precise definitions and selection process could be stated more explicitly in the main text for reproducibility.
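The control requested in major comment 2 is straightforward to specify; a minimal sketch of the random mask, assuming the ~0.01% fraction from the paper:

```python
import numpy as np

def random_silence_mask(shape, fraction=1e-4, seed=0):
    # Silence a uniformly random token subset of the same size (~0.01%)
    # as the S2T-identified set, with all other hyperparameters fixed.
    # If this control matches STAPO's stability and gains, the benefit is
    # generic low-frequency suppression rather than targeted silencing.
    rng = np.random.default_rng(seed)
    keep = np.ones(shape, dtype=bool)
    n_silenced = max(1, round(fraction * keep.size))
    idx = rng.choice(keep.size, size=n_silenced, replace=False)
    keep.flat[idx] = False
    return keep

mask = random_silence_mask((100, 100))   # 10,000 tokens, 0.01% silenced
```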

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas to strengthen the paper. We address each major comment below and will incorporate revisions to provide more rigorous empirical support.

read point-by-point responses
  1. Referee: Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.

    Authors: We fully agree that error bars, multiple runs, and statistical tests are essential to substantiate the performance claims. In the revised manuscript, we will rerun the experiments with at least 3 different random seeds, report mean and standard deviation for all metrics, and include p-values from statistical tests (such as Wilcoxon signed-rank test) to demonstrate the significance of the improvements over baselines. revision: yes

  2. Referee: Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.

    Authors: This is a valid concern for attributing the benefits specifically to our framework. We will add a new ablation experiment in the revised paper where we randomly select and silence an equivalent fraction (0.01%) of tokens without using our identification criteria, and compare the results to STAPO on both stability and benchmark performance. This control will help confirm that the targeted silencing of spurious tokens is key. revision: yes

  3. Referee: S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.

    Authors: We appreciate this point and will enhance the manuscript with additional verification. Specifically, we will include experiments showing the effect of silencing on individual reasoning steps, such as by comparing the correctness of generated solutions with and without the S2T mechanism in controlled settings, and analyze potential biases by examining the distribution of generated tokens or reward signals post-silencing. This will support that reasoning quality is preserved. revision: yes
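The significance testing promised in response 1 could take the following shape. The deltas are illustrative placeholders, and the exact sign test is a dependency-free stand-in for the scipy Wilcoxon signed-rank test the authors name.

```python
import math
from statistics import mean, stdev

def exact_sign_test_p(diffs):
    # Two-sided exact sign test on paired per-benchmark differences: a
    # stdlib stand-in for scipy.stats.wilcoxon, which the rebuttal
    # proposes for the revised experiments.
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    wins = sum(d > 0 for d in nonzero)
    extreme = max(wins, n - wins)
    tail = sum(math.comb(n, k) for k in range(extreme, n + 1))
    return min(1.0, 2 * tail / 2 ** n)

# Illustrative STAPO-minus-baseline accuracy deltas on six benchmarks:
diffs = [2.1, 0.8, 3.4, 1.2, 0.5, 1.9]
summary = (mean(diffs), stdev(diffs))    # report mean and spread per metric
p_value = exact_sign_test_p(diffs)       # all six positive -> p = 2/64
```

With only six benchmarks, even a unanimous win yields p = 0.03125, which is why multiple seeds per benchmark (as the authors propose) matter more than the choice of test.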

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper motivates STAPO via an empirical analysis of token-level statistics (spurious risk, gradient norms, entropy changes) to flag ~0.01% spurious tokens, then defines a silencing mechanism inside a group-based policy objective. Performance gains are reported as experimental outcomes on held-out benchmarks rather than as quantities derived from fitted parameters that reduce to the identification rule by construction. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the provided text; the token-selection rule is not shown to be a direct function of the same reward signal used for the final policy update. The derivation therefore remains self-contained against external benchmarks and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that sequence-level rewards are the source of spurious-token amplification and on the empirical observation that 0.01% of tokens dominate gradient disruption.

free parameters (2)
  • ρ_T
    Token silencing threshold used in the reported runs (values 0.7 and 1.0)
  • top-p
    Sampling parameter varied in the two reported settings
axioms (1)
  • domain assumption A small fraction of tokens inherit the full sequence reward yet contribute negligibly to the final reasoning outcome
    Stated as the key factor behind instability
invented entities (2)
  • Spurious tokens no independent evidence
    purpose: Explain source of gradient instability
    Defined as ~0.01% of tokens with low contribution but high gradient impact
  • S2T mechanism no independent evidence
    purpose: Suppress gradient perturbations from spurious tokens
    New component introduced to implement silencing

pith-pipeline@v0.9.0 · 5613 in / 1391 out tokens · 43067 ms · 2026-05-15T21:38:10.962118+00:00 · methodology

discussion (0)

