pith. sign in

arxiv: 2605.24286 · v1 · pith:ZRCPO4RDnew · submitted 2026-05-22 · 💻 cs.LG · cs.CL

Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning

Pith reviewed 2026-06-30 15:25 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords chain-of-thought reasoningfaithfulnessinformation flowreinforcement learninglanguage modelsshortcut learningmodel monitoringattention masking
0
0 comments X

The pith

Training interventions that control information flow produce more faithful chain-of-thought reasoning in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that chain-of-thought reasoning can be made faithful by ensuring answer-relevant information flows through the visible reasoning steps rather than bypassing them via direct prompt-to-answer shortcuts. It defines three complementary information-flow properties—sufficiency, completeness, and necessity—and instantiates them with entropy, masked-KL, and gradient diagnostics that recover human judgments of faithfulness on hinted tasks. The authors then introduce update-time interventions for verifier-based reinforcement learning, such as attention masking and backward-only gradient masking, that encourage models to route information through the chain-of-thought. If these controls succeed, the resulting reasoning traces become more transparent for monitoring and less prone to hidden shortcuts or reward hacking.

Core claim

Faithful CoT reasoning routes answer-relevant information through the mediated prompt-to-CoT-to-answer path instead of direct prompt-to-answer shortcuts. This view is captured by a task-agnostic framework of sufficiency, completeness, and necessity properties measured via entropy-based, masked-KL, and gradient-based diagnostics. Update-time interventions including attention masking, backward-only gradient masking, CoT gradients, and adversarial prompt perturbations shift models toward stronger CoT mediation, making shortcut behavior more visible in the trace while improving the structural metrics.

What carries the argument

The structural information-flow perspective instantiated as the three properties of sufficiency, completeness, and necessity that diagnose whether answer-relevant information must pass through the chain-of-thought.

If this is right

  • The entropy, masked-KL, and gradient diagnostics recover externally judged faithfulness differences on hinted reasoning tasks.
  • The interventions increase behavioral and structural indicators of CoT mediation across hinted arithmetic, reward-hackable code repair, and DAPO-Math models.
  • Shortcut and reward-hacking behavior becomes more transparent inside the generated chain-of-thought.
  • Task-agnostic faithfulness metrics improve and wrong-hint susceptibility decreases in some evaluated settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same information-flow controls could be tested on models trained entirely without hints to check whether monitorability improves on open-ended tasks.
  • Gradient-based diagnostics may remain usable even when entropy-based or KL-based measures become unstable due to low-entropy outputs.
  • Combining these training interventions with process-level supervision might further reduce hidden reliance on prompt shortcuts.

Load-bearing premise

The three information-flow properties accurately capture what counts as faithful reasoning when judged externally, an assumption checked only on hinted-reasoning tasks.

What would settle it

Apply the training interventions to a new model, then measure whether the resulting chain-of-thought traces receive higher human faithfulness ratings than the baseline on tasks that include incorrect hints or no hints at all.

Figures

Figures reproduced from arXiv: 2605.24286 by Eric Easley, Jinghan Jia, Joe Benton.

Figure 1
Figure 1. Figure 1: Information flow in reasoning: faithful CoT vs. short￾cut solutions. Left: A misleading hint changes the model’s answer, while the CoT does not reveal the hint’s influence, indicating a prompt￾to-answer shortcut. Right: In faithful reasoning, answer-relevant information should flow through the mediated path P → C → A. In unfaithful reasoning, the model can additionally rely on a direct shortcut P → A that … view at source ↗
Figure 2
Figure 2. Figure 2: Task-agnostic faithfulness metrics. Left: attention masks isolate full, CoT-mediated, and prompt-only answer distributions. Right: gradient-based metrics compare answer dependence on prompt versus CoT tokens; greater CoT-gradient concentration indicates stronger CoT-mediated reasoning. 4 Measures: Operationalizing Faithfulness We instantiate the properties of Section 3 with three families of task-agnostic … view at source ↗
Figure 3
Figure 3. Figure 3: Validation of the proposed task-agnostic faithfulness metrics. DeepSeek-R1-Distill-14B vs. Qwen3-8B on answer-changing hinted GPQA examples, where an external verbalization criterion indicates Qwen3-8B is more faithful. Bars show mean metric values across validation examples; error bars denote standard error. All metrics significantly distinguish the two models under a two-sided Mann–Whitney U test, but on… view at source ↗
Figure 4
Figure 4. Figure 4: Training dynamics of vanilla RL and faithfulness-oriented interventions (CoT Gradient, Gradient Mask, Update Mask, and FACT) in transparency, robustness to wrong hints, and task performance on hinted arithmetic. The four panels report implicit hint mention rate, wrong-hint following rate, accuracy on wrong-hint-conditioned examples, and overall accuracy vs. training steps. Higher implicit hint mention, low… view at source ↗
Figure 5
Figure 5. Figure 5: Causal-effect and necessity analysis on hinted arithmetic. We report the normalized KL difference between CoT-hint and prompt-hint sensi￾tivity across training steps under sign-flip, scale-2×, and random hint-value changes. Positive values indicate CoT-dominant behavior, while negative val￾ues indicate prompt-dominant shortcut reliance; higher is better. Causal information flow ( [PITH_FULL_IMAGE:figures/… view at source ↗
Figure 6
Figure 6. Figure 6: Faithfulness metrics on hinted arithmetic. We report sufficiency, completeness, and necessity using H(A | C), Grad-DE, and Grad-Nec vs. training steps. Lower H(A | C), lower Grad-DE, and higher Grad-Nec indicate better faithfulness. Under sign-flip, 2× scaling, and random hint changes, vanilla RL trends increas￾ingly negative as training proceeds, con￾sistent with the prompt-dominant short￾cut suggested by… view at source ↗
Figure 7
Figure 7. Figure 7: Behavioral dynamics of vanilla RL and faithfulness-oriented interventions on buggy-code fixing. The four panels report visible/hidden test case pass rate, lookup-table hack rate, and CoT hack verbalization rate vs. training steps. Visible and hidden test pass rates measure reward optimization and generalization, respectively; lookup-table hack rate measures shortcut exploitation of visible tests; and CoT h… view at source ↗
Figure 8
Figure 8. Figure 8: Faithfulness metrics on buggy-code fixing. The three panels report sufficiency, completeness, and necessity using H(A | C), Grad-DE, and Grad-Nec across training steps. Better faithfulness corresponds to more sufficient CoT information, weaker direct prompt-to-answer reliance, and stronger answer dependence on the CoT. Vanilla RL almost never verbalizes the lookup-table strategy despite pro￾ducing such sol… view at source ↗
Figure 9
Figure 9. Figure 9: Behavioral evaluation of vanilla RL and CoT Gradient on DAPO-Math. Models are trained on the DAPO-Math training set and evaluated on 1,000 validation examples. The four panels report no-hint accuracy, wrong-hint follow rate, explicit verbalization rate among hint-following responses, and implicit verbalization rate among hint-following responses vs. training steps. Higher no-hint accuracy indicates stronge… view at source ↗
Figure 10
Figure 10. Figure 10: Faithfulness metrics of vanilla RL and CoT Gradient on DAPO-Math. The three panels report sufficiency, completeness, and ne￾cessity using H(A | C), Grad-DE, and Grad-Nec across training steps, evaluated on 1,000 validation examples. Lower H(A | C), lower Grad-DE, and higher Grad-Nec indicate better faithfulness. Task-agnostic metrics ( [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
read the original abstract

Chain-of-thought (CoT) reasoning is useful for monitoring language models only when the reasoning trace faithfully reflects the computation that produces the final answer. However, models can rely on prompt-to-answer shortcuts that bypass the CoT, making the visible reasoning trace misleading even when it appears plausible. We study CoT faithfulness through a structural information-flow perspective: faithful reasoning should route answer-relevant information through the mediated path from prompt to CoT to answer, rather than through a direct prompt-to-answer shortcut. This perspective yields a task-agnostic framework based on three complementary properties, sufficiency, completeness, and necessity, which we instantiate with entropy-based, masked-KL, and gradient-based diagnostics. We show that these metrics recover externally judged faithfulness differences in hinted reasoning, and identify a low-entropy failure mode of KL-based diagnostics where gradient-based measures remain more stable. Building on this analysis, we introduce update-time interventions for verifier-based on-policy RL, including attention masking, backward-only gradient masking, CoT gradients, and adversarial perturbations of prompt representations. Across hinted arithmetic, reward-hackable code repair, and DAPO-Math models trained without hints but evaluated under wrong-hint injection, our interventions shift behavioral and structural indicators toward stronger CoT mediation. In particular, they make shortcut and reward-hacking behavior more transparent in the CoT and improve task-agnostic faithfulness metrics, while in some settings also reducing wrong-hint susceptibility. Our results suggest that controlling information flow during training is a practical route toward more faithful and monitorable CoT reasoning. Code is available at https://github.com/safety-research/faithful-cot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that CoT faithfulness can be understood and improved via an information-flow lens requiring answer-relevant information to route through the prompt-to-CoT-to-answer path. It defines three properties (sufficiency, completeness, necessity) instantiated as entropy-based, masked-KL, and gradient diagnostics; shows these recover human faithfulness judgments on hinted-reasoning tasks; and introduces on-policy RL interventions (attention masking, backward-only gradient masking, CoT gradients, adversarial prompt perturbations) that shift behavioral and structural indicators toward greater CoT mediation on hinted arithmetic, reward-hackable code repair, and DAPO-Math under wrong-hint injection.

Significance. If the results hold, the work supplies a task-agnostic, information-theoretic framework plus concrete training interventions for producing more monitorable CoT, with open code at the cited GitHub repository strengthening reproducibility. The low-entropy failure mode identified for KL diagnostics and the gradient-based alternative are useful technical contributions.

major comments (2)
  1. [Abstract] Abstract and experiments on hinted reasoning: the three diagnostics are shown to recover externally judged faithfulness differences only on hinted-reasoning tasks, yet the central claim that the RL interventions produce more faithful CoT on code repair and DAPO-Math rests on metric shifts in those settings without reported human faithfulness judgments, so it is unclear whether the observed changes track faithfulness or merely the mechanics of the interventions (e.g., attention masking directly affecting the masked-KL term).
  2. [Abstract] Abstract: the reported shifts in behavioral and structural indicators after interventions are presented without error bars, dataset sizes, or statistical tests, so the strength of evidence that the interventions reliably improve CoT mediation cannot be assessed from the given information.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experiments on hinted reasoning: the three diagnostics are shown to recover externally judged faithfulness differences only on hinted-reasoning tasks, yet the central claim that the RL interventions produce more faithful CoT on code repair and DAPO-Math rests on metric shifts in those settings without reported human faithfulness judgments, so it is unclear whether the observed changes track faithfulness or merely the mechanics of the interventions (e.g., attention masking directly affecting the masked-KL term).

    Authors: Human judgments are reported only on hinted-reasoning tasks, where controlled external labels are feasible. The three diagnostics are task-agnostic by design (sufficiency/completeness/necessity via information flow) and were validated to recover human distinctions in that setting. On code repair and DAPO-Math, the interventions target information-flow mechanisms (attention masking blocks direct shortcuts; gradient masking and CoT gradients enforce mediation) and produce consistent shifts across multiple diagnostics, including gradient-based ones that are orthogonal to attention masking. This multi-metric pattern indicates the changes reflect the intended structural property rather than artifacts of any single diagnostic. We will add an explicit discussion of the validation scope and metric rationale in the revised manuscript. revision: partial

  2. Referee: [Abstract] Abstract: the reported shifts in behavioral and structural indicators after interventions are presented without error bars, dataset sizes, or statistical tests, so the strength of evidence that the interventions reliably improve CoT mediation cannot be assessed from the given information.

    Authors: We agree that the abstract omits these details. Dataset sizes appear in the experimental sections of the full manuscript. In revision we will add error bars and statistical tests to all reported shifts in figures/tables and update the abstract to reference the strengthened statistical presentation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; metrics defined independently and validated externally

full rationale

The paper defines faithfulness via three information-flow properties (sufficiency, completeness, necessity) instantiated with entropy, masked-KL, and gradient diagnostics drawn from standard information theory. These are validated against external human judgments on hinted-reasoning tasks, and interventions are assessed via on-policy RL with reported behavioral and metric shifts. No load-bearing step reduces by the paper's equations or self-citation to a fitted input or self-defined quantity; the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Framework rests on standard transformer gradient attribution and information-theoretic quantities; no new physical entities or ad-hoc fitted constants are introduced in the abstract. Limited visibility into full methods prevents exhaustive enumeration.

axioms (1)
  • domain assumption Entropy, masked KL divergence and gradient attributions can be used to quantify information flow through specific paths in transformer models
    Invoked to define the three diagnostics.

pith-pipeline@v0.9.1-grok · 5830 in / 1390 out tokens · 58957 ms · 2026-06-30T15:25:59.840470+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 25 canonical work pages · 11 internal anchors

  1. [1]

    Show your work: Scratchpads for intermediate computation with language models

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. 2021

  2. [2]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  3. [3]

    Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022

  4. [4]

    Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting.Advances in Neural Information Processing Systems, 36:74952–74965, 2023

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting.Advances in Neural Information Processing Systems, 36:74952–74965, 2023

  5. [5]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2307.13702, 2023

  6. [6]

    Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning

    Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024

  7. [7]

    Are deepseek r1 and other reasoning models more faithful? arXiv preprint arXiv:2501.08156, 2025

    James Chua and Owain Evans. Are deepseek r1 and other reasoning models more faithful? arXiv preprint arXiv:2501.08156, 2025

  8. [8]

    Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

    Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful.arXiv preprint arXiv:2503.08679, 2025

  9. [9]

    Measuring chain of thought faithfulness by unlearning reasoning steps

    Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, and Yonatan Belinkov. Measuring chain of thought faithfulness by unlearning reasoning steps. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Confer- ence on Empirical Methods in Natural Language Processing, pages 9935–9960, Suzhou, C...

  10. [10]

    Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

    Donald Ye, Max Loffgren, Om Kotadia, and Linus Wong. Mechanistic evidence for faithfulness decay in chain-of-thought reasoning.arXiv preprint arXiv:2602.11201, 2026

  11. [11]

    Counterfactual simulation training for chain-of-thought faithfulness, 2026

    Peter Hase and Christopher Potts. Counterfactual simulation training for chain-of-thought faithfulness, 2026. URLhttps://arxiv.org/abs/2602.20710

  12. [12]

    Analyzing and improving chain-of-thought monitorability through information theory.arXiv preprint arXiv:2602.18297, 2026

    Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, and Christos Louizos. Analyzing and improving chain-of-thought monitorability through information theory.arXiv preprint arXiv:2602.18297, 2026

  13. [13]

    Monitorbench: A comprehensive benchmark for chain-of- thought monitorability in large language models.arXiv preprint arXiv:2603.28590, 2026

    Han Wang, Yifan Sun, Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen, Vedant Jolly, et al. Monitorbench: A comprehensive benchmark for chain-of- thought monitorability in large language models.arXiv preprint arXiv:2603.28590, 2026

  14. [14]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 10

  16. [16]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  17. [17]

    Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning

    Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, and Tianlong Chen. Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning. arXiv preprint arXiv:2510.04040, 2025

  18. [18]

    C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2603.05167, 2026

    Avni Mittal and Rauno Arike. C2-faith: Benchmarking llm judges for causal and coverage faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2603.05167, 2026

  19. [19]

    Monitorability as a free gift: How rlvr spontaneously aligns reasoning.arXiv preprint arXiv:2602.03978, 2026

    Zidi Xiong, Shan Chen, and Himabindu Lakkaraju. Monitorability as a free gift: How rlvr spontaneously aligns reasoning.arXiv preprint arXiv:2602.03978, 2026

  20. [20]

    Reasoning models struggle to control their chains of thought.arXiv preprint arXiv:2603.05706, 2026

    Chen Yueh-Han, Robert McCarthy, Bruce W Lee, He He, Ian Kivlichan, Bowen Baker, Micah Carroll, and Tomek Korbak. Reasoning models struggle to control their chains of thought.arXiv preprint arXiv:2603.05706, 2026

  21. [21]

    Aligned, orthogonal or in- conflict: When can we safely optimize chain-of-thought?arXiv preprint arXiv:2603.30036, 2026

    Max Kaufmann, David Lindner, Roland S Zimmermann, et al. Aligned, orthogonal or in- conflict: When can we safely optimize chain-of-thought?arXiv preprint arXiv:2603.30036, 2026

  22. [22]

    Teaching models to verbalize reward hacking in chain-of-thought reasoning.arXiv preprint arXiv:2506.22777, 2025

    Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, and Julian Michael. Teaching models to verbalize reward hacking in chain-of-thought reasoning.arXiv preprint arXiv:2506.22777, 2025

  23. [23]

    Concrete Problems in AI Safety

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety.arXiv preprint arXiv:1606.06565, 2016

  24. [24]

    Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  25. [25]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  26. [26]

    Defining and characterizing reward gaming.Advances in Neural Information Processing Systems, 35:9460– 9471, 2022

    Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming.Advances in Neural Information Processing Systems, 35:9460– 9471, 2022

  27. [27]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. InInternational Conference on Machine Learning, pages 10835–10866. PMLR, 2023

  28. [28]

    Ai control: Improving safety despite intentional subversion.arXiv preprint arXiv:2312.06942, 2023

    Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. Ai control: Improving safety despite intentional subversion.arXiv preprint arXiv:2312.06942, 2023

  29. [29]

    Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation.arXiv preprint arXiv:2503.11926, 2025

  30. [30]

    Monitoring monitorability

    Melody Y Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, et al. Monitoring monitorability. arXiv preprint arXiv:2512.18311, 2025

  31. [31]

    Analysing Mathematical Reasoning Abilities of Neural Models

    David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models.arXiv preprint arXiv:1904.01557, 2019

  32. [32]

    URLhttps://www.science.org/doi/full/10.1126/science

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

  33. [33]

    Gemma Team Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram’e, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gael Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey...

  34. [34]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https:// qwenlm.github.io/blog/qwen2.5/

  35. [35]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  36. [36]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

  37. [37]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 12 Appendix A Information flow based intervention methods Pipeline Pipeline details.Figure A1 expands the intervention locations within the GRPO training loop. The top row shows the shared pipeline: the policy first generates K completions for ea...