arxiv: 2605.07153 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

Du Su, Fei Sun, Hongyu Zang, Jingang Wang, Junwei Zhang, Wanli Yang, Wenjie Shi, Xueqi Cheng

Pith reviewed 2026-05-11 01:23 UTC · model grok-4.3

classification 💻 cs.CL

keywords reinforcement learninglarge language modelsparametric knowledgefactual recallquestion answeringprobability redistributionclosed-book QA

0 comments

The pith

Reinforcement learning improves factual recall in LLMs by redistributing probability mass over existing parametric knowledge rather than acquiring new facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether reinforcement learning, already known for boosting reasoning, can also enhance direct recall of facts stored in an LLM's parameters. It runs controlled experiments in zero-shot closed-book question answering using only binary correctness rewards and strict fact-level deduplication to isolate recall improvements from reasoning or memorization. Results show consistent gains of about 27 percent relative across model families and benchmarks, beating both training and inference baselines. The mechanism is redistribution: correct answers that sat in the low-probability tail get boosted into greedy outputs without the model learning genuinely new information. The hardest training examples, those whose answers rarely appeared in initial samples, drive most of the improvement because rare correct rollouts still occur and get reinforced.

Core claim

In a zero-shot closed-book factual QA setting with binary rewards and train-test deduplication, RL training produces roughly 27 percent average relative accuracy gains across three model families. These gains exceed those from standard supervised fine-tuning or inference-time methods. Analysis shows the improvement arises from shifting probability mass toward already-present correct answers, elevating them from low-likelihood regions to reliable greedy generations. Attribution further reveals that examples where correct answers never surfaced in 128 pre-RL samples, comprising only about 18 percent of the data, account for approximately 83 percent of the total gain because occasional correct

What carries the argument

Binary-reward reinforcement learning applied to factual QA rollouts, which selectively reinforces rare correct answer sequences to redistribute probability mass within the model's existing parametric knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same RL redistribution mechanism could be tested on tasks beyond one-hop QA, such as multi-hop retrieval or domain-specific knowledge surfacing, to see if low-probability facts become accessible without new data.
If the hardest examples drive gains because they still produce occasional correct rollouts, training regimes might be optimized by upweighting examples with the lowest initial success rates.
This view of RL as a probability reshaper rather than a knowledge injector suggests it could complement rather than replace continued pre-training when the goal is reliable factual generation.

Load-bearing premise

That the zero-shot closed-book setup with fact-level deduplication and binary rewards fully separates improved parametric recall from side effects such as format learning or reduced refusal.

What would settle it

A post-training measurement showing that the model now correctly answers questions whose answers never appeared in any pre-RL generation samples, or that accuracy gains vanish when the model is forced to sample only from pre-RL probability distributions.

Figures

Figures reproduced from arXiv: 2605.07153 by Du Su, Fei Sun, Hongyu Zang, Jingang Wang, Junwei Zhang, Wanli Yang, Wenjie Shi, Xueqi Cheng.

**Figure 1.** Figure 1: Training dynamics of the four methods on NQ-Qwen. To further understand why RL outperforms the baselines, Figure 1 compares their training dynamics on NQ across normalized training progress, using Qwen as a representative example, with complete results across all three models presented in Appendix F. The baselines exhibit distinct failure modes. SFT rapidly overfits the training data without generalizin… view at source ↗

**Figure 2.** Figure 2: Comparison between test-time scaling strategies and RL across various datasets and LLMs. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Cross-dataset acc. gain. Cross-dataset transfer. Beyond the in-domain setting, we further examine whether the improvement in factual recall transfers across datasets by training on one QA dataset and evaluating on another. We apply the same fact-level deduplication procedure to remove overlapping facts between the source training set and the target test set. This setting poses a significant challenge, as … view at source ↗

**Figure 4.** Figure 4: Post-RL repair rates for initially failed test queries, stratified by the pre-RL accessibility of [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Pass@k scaling curves for pre-RL and post-RL models on the NQ dataset. RL repairs even facts that are invisible under finite pre-RL sampling. A particularly striking result emerges in the zero-accessibility bin: even for queries where the correct answer never appears across all 128 pre-RL samples, RL successfully elevates the fact to the greedy decoding output at a rate of 6%–13%. Since these queries are h… view at source ↗

**Figure 6.** Figure 6: Recovery of full-data RL gains [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Reward dynamics of inaccessible@128 training. Repeated rollouts capture and reinforce suppressed knowledge. To investigate how examples with a 0/128 pre-RL success rate drive such gains, [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt for direct factual QA. Models are instructed to produce a single concise entity [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Prompt for factual QA with CoT. Models are instructed to reason step by step before [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Training dynamics of the four methods on the NQ benchmark across three LLMs. Solid [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Accuracy of majority voting at different sampling budgets ( [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Cross-dataset accuracy gains achieved by RL over original models. [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Post-RL repair rates for initially failed test queries on TriviaQA. [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

**Figure 14.** Figure 14: Pass@k scaling curves for pre-RL and post-RL models on TriviaQA and PopQA. 0 50 100 150 Step 0.0 0.1 0.2 0.3 Llama 0 50 100 Step 0.0 0.1 0.2 0.3 OLMo 0 25 50 75 100 Step 0.0 0.1 0.2 Qwen Average Reward NQ TriviaQA [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 15.** Figure 15: Training reward dynamics on the inaccessible@128 subset. [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗

**Figure 16.** Figure 16: Prompt for LLM-based deduplication between training and test splits. The model [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗

**Figure 17.** Figure 17: The complete prompt used for LLM-as-a-Judge evaluation. Given a question, a gold [PITH_FULL_IMAGE:figures/full_fig_p023_17.png] view at source ↗

read the original abstract

Reinforcement learning (RL) has achieved remarkable success in LLM reasoning, but whether it can also improve direct recall of parametric knowledge remains an open question. We study this question in a controlled zero-shot, one-hop, closed-book QA setting with no chain-of-thought, training only on binary correctness rewards and applying fact-level train-test deduplication to ensure gains reflect improved recall rather than reasoning or memorization. Across three model families and multiple factual QA benchmarks, RL yields ~27% average relative gains, surpassing both training- and inference-time baselines alike. Mechanistically, RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations. Our data-attribution study reveals that the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples (only ~18% of training data) drive ~83% of the gain, since rare correct rollouts still emerge during training and get reinforced. Together, these findings broaden the role of RL beyond reasoning, repositioning it as a tool for unlocking rather than acquiring latent parametric knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RL with binary rewards improves factual QA by reinforcing rare correct outputs on hard examples, but the setup leaves room for format or refusal effects to contribute to the measured gains.

read the letter

The main point is that this work shows RL can boost direct recall of parametric knowledge in LLMs rather than just teaching reasoning. In a zero-shot closed-book QA setup with fact-level deduplication, they report roughly 27% relative gains across three model families and several benchmarks, beating both training and inference baselines. The mechanistic part is that the model mostly shifts probability mass toward answers it already knew but kept in the low-probability tail, and the data attribution shows that examples with zero correct pre-RL samples (about 18% of the data) account for 83% of the total improvement because occasional correct rollouts get reinforced.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that RL with binary correctness rewards in a zero-shot closed-book one-hop QA setting (with fact-level deduplication) improves LLM factual recall by ~27% relative on average across three model families and multiple benchmarks. Gains are attributed to redistribution of probability mass over pre-existing parametric knowledge (moving correct answers from low-probability tails to greedy outputs) rather than acquisition of new facts, with a data-attribution analysis showing that the hardest examples (zero correct pre-RL samples, ~18% of data) drive ~83% of the improvement.

Significance. If the central interpretation holds, the work would meaningfully broaden RL's role from reasoning to efficient elicitation of latent parametric knowledge, with practical value for knowledge-intensive applications and the attribution results providing a useful lens on which examples benefit most. The multi-family, multi-benchmark design and focus on mechanistic redistribution add to its potential impact beyond raw performance numbers.

major comments (2)

[data-attribution study (results section)] The data-attribution result (hard examples with zero correct pre-RL samples driving ~83% of the gain) does not isolate improved parametric recall from format compliance or refusal suppression. Because the binary reward is only assigned to correctly formatted answers, any concurrent improvement in output conventions would inflate measured gains without changing underlying knowledge; this directly affects the claim that RL 'primarily redistributes probability mass over existing knowledge rather than acquiring new facts.'
[experimental setup and baselines (methods and results sections)] The experimental controls (zero-shot closed-book framing and fact-level deduplication) prevent overt memorization and leakage but do not include ablations that would rule out format learning, such as a format-only reward baseline or comparison to supervised fine-tuning on answer formatting. Without these, the ~27% relative gains cannot be attributed solely to recall of existing parametric knowledge.

minor comments (2)

[Abstract] The abstract reports an average relative gain of ~27% but does not specify the exact aggregation method (e.g., macro-average across benchmarks/models) or include error bars/variance; adding these would strengthen the empirical claims.
[experimental setup] Clarify the precise definition of 'one-hop' factual QA and confirm that all evaluated benchmarks strictly adhere to this criterion without implicit multi-hop elements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. These points help clarify the scope of our claims regarding RL as a mechanism for eliciting latent parametric knowledge. We respond to each major comment below, providing additional context from our experiments and outlining targeted revisions to strengthen the manuscript.

read point-by-point responses

Referee: [data-attribution study (results section)] The data-attribution result (hard examples with zero correct pre-RL samples driving ~83% of the gain) does not isolate improved parametric recall from format compliance or refusal suppression. Because the binary reward is only assigned to correctly formatted answers, any concurrent improvement in output conventions would inflate measured gains without changing underlying knowledge; this directly affects the claim that RL 'primarily redistributes probability mass over existing knowledge rather than acquiring new facts.'

Authors: We appreciate this concern about potential confounds in the reward signal. Our prompts explicitly instruct models to output direct answers in a fixed format (e.g., 'The answer is X'), and pre-RL sampling across 128 rollouts per example shows format compliance rates exceeding 92% even on incorrect responses. Post-RL, format compliance remains stable at ~94-96% while accuracy rises, indicating that gains are not driven by format acquisition. For refusal suppression, refusals (e.g., 'I don't know') occur in <3% of pre-RL samples and are not systematically reduced by RL; the binary reward penalizes both incorrect and refused outputs equally. The data-attribution result—that examples with zero correct pre-RL samples (~18% of data) account for ~83% of gains—further supports redistribution: these examples only improve when a correct rollout emerges during training and receives reinforcement, consistent with surfacing low-probability parametric knowledge rather than learning new conventions. We will add a dedicated paragraph with these compliance statistics and refusal rates in the revised results section to make this isolation explicit. revision: partial
Referee: [experimental setup and baselines (methods and results sections)] The experimental controls (zero-shot closed-book framing and fact-level deduplication) prevent overt memorization and leakage but do not include ablations that would rule out format learning, such as a format-only reward baseline or comparison to supervised fine-tuning on answer formatting. Without these, the ~27% relative gains cannot be attributed solely to recall of existing parametric knowledge.

Authors: We agree that dedicated format-learning ablations would provide stronger causal evidence and rule out the alternative interpretation more definitively. Our existing training baselines include full supervised fine-tuning on the QA pairs (which embeds both content and format) as well as inference-time methods like best-of-N sampling; these are outperformed by RL, but they do not isolate format alone. In the revision, we will add two new controls: (1) a format-only reward baseline that assigns reward for any correctly formatted output irrespective of factual correctness, and (2) an SFT baseline trained solely on formatting instructions without content labels. We expect both to yield near-zero accuracy gains, reinforcing that the observed improvements stem from content recall. These will be reported in an expanded 'Ablations' subsection of the results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on controlled experiments and sampling statistics, not self-referential definitions or fitted inputs

full rationale

The paper reports RL training outcomes on factual QA with binary rewards, fact-level deduplication, and pre/post-RL sampling to measure probability redistribution. The ~27% gains and 83% attribution to hard examples are computed directly from observed rollouts and accuracy deltas across independent benchmarks and model families. No equations define a quantity in terms of itself, no parameters are fitted then relabeled as predictions, and no self-citations supply uniqueness theorems or ansatzes. The derivation chain consists of standard RL policy optimization followed by empirical measurement; the controls (zero-shot closed-book, train-test split) are external to the measured quantities and do not reduce the reported gains to tautologies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Paper is empirical; no explicit free parameters, axioms, or invented entities are introduced beyond standard RL and LLM assumptions. Full details unavailable from abstract alone.

pith-pipeline@v0.9.0 · 5516 in / 1049 out tokens · 41668 ms · 2026-05-11T01:23:03.225221+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes
RL primarily redistributes probability mass over existing knowledge rather than acquiring new facts, moving correct answers from the low-probability tail into reliable greedy generations.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear
the hardest examples are the most informative: those whose answers never appear in 128 pre-RL samples ... drive ~83% of the gain

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 11 internal anchors

[1]

The internal state of an LLM knows when it ' s lying

Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it ' s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967--976, Singapore, December 2023. Association for Computational Linguistics

work page 2023
[2]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R \'e , and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review arXiv 2024
[3]

Ask the right questions: Active question reformulation with reinforcement learning

Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Wojciech Gajewski, Andrea Gesmundo, Neil Houlsby, and Wei Wang. Ask the right questions: Active question reformulation with reinforcement learning. In International Conference on Learning Representations, 2018

work page 2018
[4]

Discovering latent knowledge in language models without supervision

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2023

work page 2023
[5]

Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas O ˘guz, Rulin Shao, Gargi Ghosh, Jason Weston, and Wen tau Yih

Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas O g uz, Rulin Shao, Gargi Ghosh, Jason Weston, and Wen-tau Yih. Learning to reason for factuality. arXiv preprint arXiv:2508.05618, 2025

work page arXiv 2025
[6]

SFT memorizes, RL generalizes: A comparative study of foundation model post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. In Forty-second International Conference on Machine Learning, 2025

work page 2025
[7]

Measuring and improving consistency in pretrained language models

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Sch \"u tze, and Yoav Goldberg. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9: 0 1012--1031, 2021

work page 2021
[8]

Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning LLM s on new knowledge encourage hallucinations? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7765--7784, Miami, Florida, USA, November 2024. Association for Computational Linguistics

work page 2024
[9]

Inside-out: Hidden factual knowledge in LLM s

Zorik Gekhman, Eyal Ben-David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. Inside-out: Hidden factual knowledge in LLM s. In Second Conference on Language Modeling, 2025

work page 2025
[10]

arXiv preprint arXiv:2603.09906 , year=

Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart, and Jonathan Herzig. Thinking to recall: How reasoning unlocks parametric knowledge in llms. arXiv preprint arXiv:2603.09906, 2026

work page arXiv 2026
[11]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645 0 (8081): 0 633--638, September 2025. ISSN 1476-4687

work page 2025
[13]

Xu, Jun Araki, and Graham Neubig

Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8: 0 423--438, 2020

work page 2020
[14]

T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. T rivia QA : A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601--1611, Vancouver, Canada, July 2017. Association for Computational Linguistics

work page 2017
[15]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

work page 2019
[16]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611--626, New York, NY, USA, 2023. Association for Computing Machinery

work page 2023
[17]

Coderl: Mastering code generation through pretrained models and deep reinforcement learning

Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35: 0 21314--21328, 2022

work page 2022
[18]

Reasoning models hallucinate more: Factuality-aware reinforcement learning for large reasoning models

Junyi Li and Hwee Tou Ng. Reasoning models hallucinate more: Factuality-aware reinforcement learning for large reasoning models. In Advances in Neural Information Processing Systems, 2025

work page 2025
[19]

Inference-time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Vi\' e gas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems, volume 36, pages 41451--41530, 2023

work page 2023
[20]

Generated knowledge prompting for commonsense reasoning

Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, and Hannaneh Hajishirzi. Generated knowledge prompting for commonsense reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 3154--3169, Dublin, Ireland, May 2022. Association for Computational Linguistics

work page 2022
[21]

RLTF : Reinforcement learning from unit test feedback

Jiate Liu, Yiqin Zhu, Kaiwen Xiao, Qiang Fu, Xiao Han, Yang Wei, and Deheng Ye. RLTF : Reinforcement learning from unit test feedback. Transactions on Machine Learning Research, 2023. ISSN 2835-8856

work page 2023
[22]

ProRL : Prolonged reinforcement learning expands reasoning boundaries in large language models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL : Prolonged reinforcement learning expands reasoning boundaries in large language models. In Advances in Neural Information Processing Systems, volume 38, pages 17998--18031. Curran Associates, Inc., 2025

work page 2025
[23]

Improving parametric knowledge access in reasoning language models

Melody Ma and John Hewitt. Improving parametric knowledge access in reasoning language models. arXiv preprint arXiv:2602.22193, 2026

work page arXiv 2026
[24]

When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 9802--9822, Toronto, Canada, July 2023. Association for Co...

work page 2023
[25]

Teaching language models to support answers with veriﬁed quotes.Preprint arXiv:2203.11147,

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022

work page arXiv 2022
[26]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Cand \`e s, and Tatsunori Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20275--20321, Suzhou, China, November 2025. Association for ...

work page 2025
[27]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT : Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[28]

2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024

work page internal anchor Pith review arXiv 2024
[29]

LLM s know more than they show: On the intrinsic representation of LLM hallucinations

Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. LLM s know more than they show: On the intrinsic representation of LLM hallucinations. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[30]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022
[31]

Qwen2.5 Technical Report

Qwen:, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, pages 53728--53741. Curran Associates, Inc., 2023

work page 2023
[33]

KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality

Baochang Ren, Shuofei Qiao, Da Zheng, Huajun Chen, and Ningyu Zhang. Knowrl: Exploring knowledgeable reinforcement learning for factuality. arXiv preprint arXiv:2506.19807, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv: 1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[35]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279--1297, New York, NY, USA, 2025. Association for Computing Machinery

work page 2025
[37]

Logan IV, Eric Wallace, and Sameer Singh

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. A uto P rompt: E liciting K nowledge from L anguage M odels with A utomatically G enerated P rompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222--4235, Online, November 2020. Association for Computational Linguistics

work page 2020
[38]

Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[39]

Recitation-augmented language models

Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. In The Eleventh International Conference on Learning Representations, 2023

work page 2023
[40]

R e FT : Reasoning with reinforced fine-tuning

Luong Trung, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. R e FT : Reasoning with reinforced fine-tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 7601--7614, Bangkok, Thailand, August 2024. Association for Computational Linguistics

work page 2024
[41]

Retrieve what you need: A mutual learning framework for open-domain question answering

Dingmin Wang, Qiuyuan Huang, Matthew Jackson, and Jianfeng Gao. Retrieve what you need: A mutual learning framework for open-domain question answering. Transactions of the Association for Computational Linguistics, 12: 0 247--263, 2024

work page 2024
[42]

Enhancing code LLM s with reinforcement learning in code generation: A survey

Junqiao Wang, Zeng Zhang, Yangfan He, Zihao Zhang, Xinyuan Song, Yuyang Song, Tianyu Shi, Yuchen Li, Hengyuan Xu, Kunyu Wu, Yi Xin, Zhongwei Wan, Xinhang Yuan, Zijun Wang, Kuan Lu, Menghao Huo, Jingqun Tang, Guangwu Qian, Keqin Li, Qiuwu Chen, and Lewei He. Enhancing code LLM s with reinforcement learning in code generation: A survey. In ICLR 2026 Worksho...

work page 2026
[43]

R3: Reinforced ranker-reader for open-domain question answering

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, pages 5981--5988, 2018

work page 2018
[44]

Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023

work page 2023
[45]

Reinforcement learning for reasoning in large language models with one training example

Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Du, and Yelong Shen. Reinforcement learning for reasoning in large language models with one training example. In Advances in Neural Information Processing Systems, volume 38, pages 122721--122764. Cur...

work page 2025
[46]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824--24837. Curran Associates, Inc., 2022

work page 2022
[47]

Measuring short-form factuality in large language models

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

work page arXiv 2024
[48]

Truthrl: Incentivizing truthful llms via reinforcement learning

Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Sean Chen, Mohammad Kachuee, Teja Gollapudi, Tony Liao, Nicolas Scheffer, et al. Truthrl: Incentivizing truthful llms via reinforcement learning. arXiv preprint arXiv:2509.25760, 2025

work page arXiv 2025
[49]

Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLM s

Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLM s. In The Fourteenth International Conference on Learning Representations, 2026

work page 2026
[50]

C-pack: Packed resources for general chinese embeddings

Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 641--649, New York, NY, USA, 2024. Association for Computing Machinery

work page 2024
[51]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report. arXiv preprint arXiv: 2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO : An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023

work page arXiv 2023
[54]

Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In Advances in Neural Information Processing Systems, volume 38, pages 57654--57689

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In Advances in Neural Information Processing Systems, volume 38, pages 57654--57689. Curran Associates, Inc., 2025

work page 2025
[55]

On the interplay of pre-training, mid-training, and rl on reasoning language models.arXiv preprint arXiv:2512.07783,

Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and rl on reasoning language models. arXiv preprint arXiv:2512.07783, 2025

work page arXiv 2025
[56]

From system 1 to system 2: A survey of reasoning large language models

Duzhen Zhang, Zhong-Zhi Li, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhijiang Guo, Le Song, and Cheng-Lin Liu. From system 1 to system 2: A survey of reasoning large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 48 0 (3): 0 3335--...

work page 2026
[57]

L lama F actory: Unified efficient fine-tuning of 100+ language models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. L lama F actory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 400--410, Bangkok, Thailand, August 2024. Association for Computational Linguistics

work page 2024
[58]

Factual probing is [ MASK ]: Learning vs

Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual probing is [ MASK ]: Learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5017--5033, Online, June 2021. Association for Computational Linguistics

work page 2021
[59]

Fine-Tuning Language Models from Human Preferences

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909