STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Pith reviewed 2026-05-15 21:38 UTC · model grok-4.3
The pith
Silencing gradients from a tiny fraction of spurious tokens stabilizes RL fine-tuning of LLMs and raises math reasoning performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a small set of spurious tokens (roughly 0.01% of all tokens) inherits the full sequence-level reward, producing outsized gradient updates that destabilize the policy and degrade reasoning quality. The authors define a unified evaluation of token-level effects across spurious risk, gradient norm, and entropy change, then propose the S2T mechanism to suppress gradients from these tokens inside a group-relative objective. The resulting STAPO algorithm yields stable entropy trajectories and consistent accuracy gains on mathematical reasoning tasks for Qwen models at three scales (1.7B, 8B, and 14B).
What carries the argument
The Silencing Spurious Tokens (S2T) mechanism, which identifies low-contribution tokens and suppresses their gradient contributions within the group-based policy update.
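A minimal PyTorch-style sketch of this kind of token-level silencing inside a group-relative, clipped policy-gradient loss. The clipping form follows standard GRPO/PPO practice rather than the paper's exact objective, and `spurious_mask` stands in for the S2T identification rule (which the paper builds from spurious risk, gradient norm, and entropy change). Setting the mask to all-False recovers the unmodified baseline, which is exactly the control proposed under "What would settle it" below.
```python
import torch

def grpo_loss_with_s2t(logprobs, old_logprobs, advantages, spurious_mask,
                       clip_eps=0.2):
    """Clipped, group-relative token loss with flagged tokens silenced.

    logprobs, old_logprobs : (batch, seq) token log-probabilities under the
                             current and behaviour policies
    advantages             : (batch, seq) group-normalized sequence advantage,
                             broadcast to every token of its sequence
    spurious_mask          : (batch, seq) bool, True where a token is flagged
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.min(unclipped, clipped)

    # S2T-style silencing: zeroing the per-token loss removes that token's
    # contribution to the gradient without touching sampling or rewards.
    keep = (~spurious_mask).float()
    return (per_token_loss * keep).sum() / keep.sum().clamp(min=1.0)
```
Because the mask only edits the loss, the same silencing step could in principle be dropped into other group-relative objectives without changing their sampling or reward structure, which is the portability claim listed under "If this is right".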
If this is right
- Late-stage performance collapse in RL fine-tuning of reasoning models can be prevented by token-level gradient editing rather than global entropy regularization.
- The same S2T logic can be added to other group-relative objectives without changing their sampling or reward structure.
- Entropy remains controlled across training without extra regularization terms once spurious gradient contributions are removed.
- Accuracy gains appear consistently across 1.7B to 14B model scales on math benchmarks under both full and top-p sampling.
Where Pith is reading between the lines
- The approach could transfer to non-math RL tasks such as code generation where similar low-value tokens might receive oversized credit.
- Detecting spurious tokens automatically rather than by fixed frequency thresholds would make the method easier to apply to new domains.
- If spurious tokens also appear in preference data, the same silencing step might reduce reward-model exploitation in standard RLHF.
Load-bearing premise
That the identified spurious tokens are the dominant source of instability and that zeroing their gradients removes noise without discarding useful reasoning information or creating new biases.
What would settle it
Run identical STAPO training on the same Qwen models but disable S2T gradient suppression; if entropy still stays flat and accuracy matches the reported gains, the causal role of spurious tokens would be falsified.
read the original abstract
Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. We identify a key factor behind this instability: a small fraction of tokens, termed spurious tokens (around 0.01%), which contribute little to the reasoning outcome but receive disproportionately amplified gradient updates due to inheriting the full sequence-level reward. We present a unified framework for evaluating token-level optimization impacts across spurious risk, gradient norms, and entropy changes. Building on the analysis of token characteristics that severely disrupt optimization, we propose the Silencing Spurious Tokens (S2T) mechanism to efficiently suppress their gradient perturbations. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 11.49% ($\rho_{\mathrm{T}}$=1.0, top-p=1.0) and 3.73% ($\rho_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a small fraction (~0.01%) of spurious tokens cause instability in RL fine-tuning of LLMs by receiving amplified gradients from sequence-level rewards. They introduce a unified framework to identify these tokens based on spurious risk, gradient norms, and entropy, and propose the S2T mechanism to silence their gradients. This is incorporated into STAPO, a group-based policy optimization method, which shows superior entropy stability and performance gains of 11.49% (ρ_T=1.0, top-p=1.0) and 3.73% (ρ_T=0.7, top-p=0.9) over GRPO, 20-Entropy, and JustRL on six math reasoning benchmarks with Qwen 1.7B, 8B, and 14B models.
Significance. If the results hold and the improvements are specifically due to silencing the identified spurious tokens rather than generic regularization, the work could provide a targeted approach to stabilizing RL training for LLMs, reducing reliance on heuristic entropy methods and improving reliability for scaling reasoning in large models. The cross-model-size empirical results would be a strength if the attribution is validated.
major comments (3)
- Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.
- Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.
- S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.
minor comments (2)
- Abstract: The phrase 'consistent gains' should be qualified with whether improvements hold on every benchmark or are driven by averages.
- Notation: The parameters ρ_T and top-p appear in the results tables but their precise definitions and selection process could be stated more explicitly in the main text for reproducibility.
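On the notation point: top-p conventionally denotes the nucleus-sampling threshold, sketched minimally below; ρ_T is paper-specific and not defined in the text provided, so only the standard top-p filter is shown here as an assumed reading.
```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
    """Return a renormalized distribution over the smallest token set whose
    cumulative probability reaches top_p (standard nucleus sampling)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop a token once the mass accumulated *before* it already exceeds top_p.
    drop = (cumulative - sorted_probs) > top_p
    sorted_probs = sorted_probs.masked_fill(drop, 0.0)
    filtered = torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum(dim=-1, keepdim=True)
```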
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us identify areas to strengthen the paper. We address each major comment below and will incorporate revisions to provide more rigorous empirical support.
read point-by-point responses
- Referee: Experiments section: The reported average performance improvements of 11.49% and 3.73% are given without error bars, number of runs, or statistical significance tests, which is load-bearing for the central claim of consistent superiority over baselines.
Authors: We fully agree that error bars, multiple runs, and statistical tests are essential to substantiate the performance claims. In the revised manuscript, we will rerun the experiments with at least 3 different random seeds, report mean and standard deviation for all metrics, and include p-values from statistical tests (such as Wilcoxon signed-rank test) to demonstrate the significance of the improvements over baselines. revision: yes
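A minimal sketch of the promised significance check, assuming paired per-benchmark (or per-seed) scores for STAPO and one baseline; the arrays are placeholders, not the paper's numbers.
```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder paired scores (one entry per benchmark x seed); real values
# would come from the rerun experiments promised above.
stapo    = np.array([71.2, 68.9, 54.3, 72.0, 69.5, 55.1])
baseline = np.array([69.8, 67.5, 53.0, 70.4, 68.1, 54.2])

diff = stapo - baseline
stat, p = wilcoxon(stapo, baseline, alternative="greater")
print(f"mean gain {diff.mean():.2f} +/- {diff.std(ddof=1):.2f}, one-sided p = {p:.4f}")
```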
- Referee: Token identification and S2T mechanism: No ablation is presented that replaces the identified spurious tokens (0.01% fraction) with a random mask of equal size while keeping all other hyperparameters fixed; without this, the entropy stability and benchmark gains cannot be attributed specifically to the spurious-token framework rather than any low-frequency gradient suppression.
Authors: This is a valid concern for attributing the benefits specifically to our framework. We will add a new ablation experiment in the revised paper where we randomly select and silence an equivalent fraction (0.01%) of tokens without using our identification criteria, and compare the results to STAPO on both stability and benchmark performance. This control will help confirm that the targeted silencing of spurious tokens is key. revision: yes
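A minimal sketch of that control, assuming the same silencing rate is applied to uniformly random token positions; `grpo_loss_with_s2t` refers to the illustrative loss sketched earlier, and 1e-4 (0.01%) is the rate the paper reports for spurious tokens.
```python
import torch

def random_silence_mask(valid_mask: torch.Tensor,
                        silence_rate: float = 1e-4) -> torch.Tensor:
    """Flag a uniformly random subset of valid (non-padding) tokens at the
    same rate as the S2T selection, for the attribution control."""
    noise = torch.rand(valid_mask.shape, device=valid_mask.device)
    return (noise < silence_rate) & valid_mask

# Control run: swap the S2T mask for the random one, keep everything else fixed.
# loss = grpo_loss_with_s2t(logprobs, old_logprobs, advantages,
#                           spurious_mask=random_silence_mask(valid_mask))
```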
- Referee: S2T mechanism description: The claim that the identified tokens 'contribute little to the reasoning outcome' is not supported by any verification that silencing them preserves reasoning quality or avoids introducing new biases in the policy update.
Authors: We appreciate this point and will enhance the manuscript with additional verification. Specifically, we will include experiments showing the effect of silencing on individual reasoning steps, such as by comparing the correctness of generated solutions with and without the S2T mechanism in controlled settings, and analyze potential biases by examining the distribution of generated tokens or reward signals post-silencing. This will support that reasoning quality is preserved. revision: yes
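One simple way to operationalize those checks, assuming paired generations from policies trained with and without S2T: compare verifier-judged accuracy and the shift between unigram token distributions. Both measures are illustrative stand-ins for the authors' planned analyses, not the paper's protocol.
```python
from collections import Counter

def accuracy(predicted_answers, gold_answers):
    """Fraction of final answers that match the gold answers exactly."""
    assert len(predicted_answers) == len(gold_answers)
    return sum(p == g for p, g in zip(predicted_answers, gold_answers)) / len(gold_answers)

def unigram_tv_distance(tokens_a, tokens_b):
    """Total-variation distance between the unigram token distributions of
    two generation sets; large values hint at a distributional bias."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[t] / na - cb[t] / nb) for t in vocab)
```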
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper motivates STAPO via an empirical analysis of token-level statistics (spurious risk, gradient norms, entropy changes) to flag ~0.01% spurious tokens, then defines a silencing mechanism inside a group-based policy objective. Performance gains are reported as experimental outcomes on held-out benchmarks rather than as quantities derived from fitted parameters that reduce to the identification rule by construction. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the provided text; the token-selection rule is not shown to be a direct function of the same reward signal used for the final policy update. The derivation therefore remains self-contained against external benchmarks and does not collapse to its inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- ρ_T
- top-p
axioms (1)
- domain assumption: A small fraction of tokens inherits the full sequence reward yet contributes negligibly to the final reasoning outcome
invented entities (2)
- Spurious tokens: no independent evidence
- S2T mechanism: no independent evidence
Reference graph
Works this paper leans on
- [1] Shengbo Eben Li. Reinforcement Learning for Sequential Decision and Optimal Control. Springer, Singapore, 2023.
- [2] Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Yue Chen, Guolong Liu, Gaoqi Liang, Junhua Zhao, Jinyue Yan, and Yun Li. Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods. IEEE Transactions on Neural Networks and Learning Systems, 2024.
- [3] Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025.
- [4] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.01939, 2025.
- [5] Longtian Qiu, Shan Ning, Jiaxuan Sun, and Xuming He. NoisyGRPO: Incentivizing multimodal CoT reasoning via noise injection and Bayesian estimation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [6] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [7] Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347, 2025.
- [8] Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu, Zichuan Lin, Deheng Ye, and Saiyong Yang. Entropic: Towards stable long-term training of LLMs via entropy stabilization with proportional-integral control. arXiv preprint arXiv:2511.15248, 2025.
- [9] Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, and Yunjian Xu. Do not let low-probability tokens over-dominate in RL for LLMs. arXiv preprint arXiv:2505.12929, 2025.
- [10] Tue Le, Nghi DQ Bui, Linh Ngo Van, and Trung Le. Token-regulated group relative policy optimization for stable reinforcement learning in large language models. arXiv preprint arXiv:2511.00066, 2025.
- [11] Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, and Bo Zhou. Low-probability tokens sustain exploration in reinforcement learning with verifiable reward. arXiv preprint arXiv:2510.03222, 2025.
- [12] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [13] Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. JustRL: Scaling a 1.5B LLM with a simple RL recipe. arXiv preprint arXiv:2512.16649, 2025.
- [14] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.
- [15] Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, et al. BAPO: Stabilizing off-policy reinforcement learning for LLMs via balanced policy optimization with adaptive clipping. arXiv preprint arXiv:2510.18927, 2025.
- [16] Shumin Wang, Yuexiang Xie, Wenhao Zhang, Yuchang Sun, Yanxi Chen, Yaliang Li, and Yanyong Zhang. On the entropy dynamics in reinforcement fine-tuning of large language models. arXiv preprint arXiv:2602.03392, 2026.
- [17] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [18] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596. PMLR, 2018.
- [19] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
- [20] Jingliang Duan, Wenxuan Wang, Liming Xiao, Jiaxin Gao, Shengbo Eben Li, Chang Liu, Ya-Qin Zhang, Bo Cheng, and Keqiang Li. Distributional soft actor-critic with three refinements. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3935–3946, 2025.
- [21] Guojian Zhan, Likun Wang, Xiangteng Zhang, Jiaxin Gao, Masayoshi Tomizuka, and Shengbo Eben Li. Bootstrap off-policy with world model. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [22] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
- [23] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025.
- [24] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [25] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, pages 53728–53741, 2023.
- [26] Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025.
- [27] Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025.
- [28] Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, and Kun Gai. ASPO: Asymmetric importance sampling policy optimization. arXiv preprint arXiv:2510.06062, 2025.
- [29] Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Qi Sun, and Bo Cheng. Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors. IEEE Transactions on Neural Networks and Learning Systems, 33(11):6584–6598, 2021.
- [30] Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang Wu, Jingliang Duan, et al. Diffusion actor-critic with entropy regulator. Advances in Neural Information Processing Systems, 37:54183–54204, 2024.
- [31] Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025.
- [32] Guojian Zhan, Xiangteng Zhang, Feihong Zhang, Letian Tao, and Shengbo Eben Li. Bicriteria policy optimization for high-accuracy reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 2025.
- [33] Guojian Zhan, Yuxuan Jiang, Jingliang Duan, Shengbo Eben Li, Bo Cheng, and Keqiang Li. Continuous-time policy optimization. In ACC, pages 3382–3388, 2023.
- [34] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.
- [35] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024.
- [36] OpenCompass. AIME2025 dataset. https://huggingface.co/datasets/opencompass/AIME2025, 2025. Accessed: 2025-01-23.
- [37] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.
- [38] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems, volume 35, pages 3843–3857, 2022.
- [39] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
- [40] Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F Wong, Songyang Zhang, et al. CompassVerifier: A unified and robust verifier for LLMs evaluation and outcome reward. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33454–33482, 2025.
- [41] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.