The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Wenxiang Geng; Yiming Zhang; Yining Sun; Zenghui Ding; Zihan Chen

arxiv: 2606.00674 · v1 · pith:S3XJGU7Ynew · submitted 2026-05-30 · 💻 cs.LG · cs.AI

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

Zihan Chen , Yiming Zhang , Wenxiang Geng , Zenghui Ding , Yining Sun This is my paper

Pith reviewed 2026-06-28 19:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords LLM reasoning shortcutsoutcome optimizationreward-induced manifold collapsesemantic coverage measureprocess reward modelscausal information bottleneckgeneralization boundsMarkovian screening

0 comments

The pith

Outcome optimization biases LLMs toward reasoning shortcuts whenever training distributions permit Markovian screening of causal mechanisms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that LLMs trained via outcome-based reinforcement learning achieve strong results on in-distribution benchmarks yet show brittle performance on out-of-distribution tasks, a pattern termed Reward-Induced Manifold Collapse. It connects this failure to an interaction between structural causal models and the information bottleneck principle, treating genuine reasoning as a high-complexity causal process while shortcuts amount to low-complexity spurious correlations. Under the bias of stochastic gradient descent, models preferentially adopt shortcuts when the data distribution permits Markovian screening of the underlying causal structure. The authors derive a generalization bound that depends on a Semantic Coverage Measure rather than sample size, which directly accounts for the limited benefit of scaling data drawn from homogeneous distributions.

Core claim

We establish a theoretical framework bridging Structural Causal Models (SCM) and the Information Bottleneck (IB) principle to explain Reward-Induced Manifold Collapse. Reasoning is defined as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. Under the implicit inductive bias of SGD, models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution allows for a Markovian Screening of the true causal mechanism. We derive a new generalization bound based on Semantic Coverage Measure (η) rather than sample size, showing why data scaling on homogeneous distributions may fail to correct rea

What carries the argument

Markovian Screening of the true causal mechanism within the SCM-IB framework, which permits SGD to favor low-complexity spurious correlations over high-complexity causal reasoning.

If this is right

Data scaling on homogeneous distributions cannot eliminate reasoning flaws because the bound is controlled by Semantic Coverage Measure η rather than sample size.
Process Reward Models act as Topological Filters that impose step-wise mutual information constraints and thereby exclude the shortcut manifold.
Outcome optimization creates a systematic preference for low-complexity solutions over high-complexity causal processes when screening is possible.
The framework supplies a mathematical basis for preferring process supervision over pure outcome supervision in model alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment pipelines that rely solely on final-answer rewards may systematically underperform on tasks requiring novel causal reasoning.
The Semantic Coverage Measure could be used to diagnose datasets likely to induce shortcut learning before training begins.
Similar screening effects may appear in other domains where optimization operates on partially observed causal structures.

Load-bearing premise

The training distribution allows a Markovian Screening of the true causal mechanism, enabling SGD to bias models toward shortcut solutions.

What would settle it

Construct a training distribution that blocks Markovian screening of the causal mechanism and measure whether outcome-optimized models still exhibit the same degree of OOD reasoning collapse as on standard distributions.

Figures

Figures reproduced from arXiv: 2606.00674 by Wenxiang Geng, Yiming Zhang, Yining Sun, Zenghui Ding, Zihan Chen.

**Figure 1.** Figure 1: Conceptual overview of Reward-Induced Manifold Collapse and Topological Filtering. (A) Under outcome optimization (ORM), the learned representation is biased toward the low-complexity shortcut path (S) under the simplicity/compression bias of training, while the higher-complexity causal reasoning path (C) is screened off. (B) Under process supervision (PRM), step-wise verification acts as a topological fil… view at source ↗

**Figure 2.** Figure 2: The Anatomy of Manifold Collapse. Linear probe accuracy across model layers. Blue (dashed): ORM captures surface shortcuts (S) from early layers. Red: ORM fails to encode the true causal logic (C) even in deep layers, consistent with the “screening off” effect (I(Z; C | S) → 0). Green: PRM promotes the recovery of causal features in deeper layers under stepwise topological constraints. Results. The probin… view at source ↗

read the original abstract

Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve high performance on in-distribution benchmarks while demonstrating brittle reasoning capabilities on out-of-distribution (OOD) tasks. We term this phenomenon Reward-Induced Manifold Collapse. We establish a theoretical framework bridging Structural Causal Models (SCM) and the Information Bottleneck (IB) principle to explain this paradox. We define reasoning as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. Under the implicit inductive bias of Stochastic Gradient Descent (SGD), models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution allows for a ``Markovian Screening'' of the true causal mechanism. We derive a new generalization bound based on Semantic Coverage Measure ($\eta$) rather than sample size, showing why data scaling on homogeneous distributions may fail to correct reasoning flaws. We also show that Process Reward Models (PRMs) function as Topological Filters, enforcing step-wise mutual information constraints that render the low-complexity shortcut manifold inadmissible. These results provide a mathematical grounding for the role of process supervision beyond simple credit assignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper derives a generalization bound on LLM shortcut learning that replaces sample size with Semantic Coverage Measure η under a Markovian Screening assumption, and frames PRMs as topological filters.

read the letter

The core contribution is a bound showing why outcome RL on homogeneous data produces reward-induced manifold collapse: models latch onto low-complexity spurious correlations instead of high-complexity causal paths when the distribution permits Markovian screening of the true mechanism. It links this to the information bottleneck and SGD's simplicity bias, then argues that process reward models block the shortcut manifold by imposing step-wise mutual information constraints.

The framework is cleanly assembled. The SCM-IB bridge is explicit, the distinction between reasoning and shortcuts is operationalized, and the bound itself is presented as a direct consequence of the coverage measure η. Treating PRMs as filters rather than just credit-assignment devices gives a coherent story for why process supervision helps beyond empirical fixes. The derivations track internally once the screening assumption is granted.

The main soft spot is that the bound's usefulness hinges on estimating η in practice, and the paper does not yet show how to do that reliably or how sensitive the bound is when the Markovian condition is only approximately true. A second minor gap is limited direct comparison to prior information-bottleneck bounds on causal models; the combination is new for this LLM setting but the distance from existing work could be quantified more sharply. These are not load-bearing flaws.

The paper is for theorists working on causal explanations of generalization in deep learning and for alignment researchers who want a formal account of process versus outcome rewards. A reader already comfortable with SCMs and IB will extract the most value. It deserves peer review because the central derivation is self-contained and the assumptions are stated plainly enough for referees to evaluate.

Referee Report

2 major / 2 minor

Summary. The paper claims that outcome-based RL alignment in LLMs produces Reward-Induced Manifold Collapse, where models achieve high in-distribution performance but brittle OOD reasoning. It bridges SCMs with the Information Bottleneck, treats reasoning as high-complexity causal paths and shortcuts as low-complexity correlations, invokes SGD's simplicity bias, and derives a generalization bound replacing sample size with a Semantic Coverage Measure η under the assumption that the training distribution permits Markovian Screening of the true causal mechanism. It further claims that Process Reward Models act as topological filters enforcing step-wise mutual information constraints that exclude shortcut manifolds.

Significance. If the derivation of the η-based bound is valid and the Markovian Screening assumption holds without circularity, the framework would supply a causal-information-theoretic account for why homogeneous data scaling fails to eliminate reasoning shortcuts and would mathematically motivate process supervision over pure outcome rewards. The explicit linkage of SCMs, IB, and SGD bias is a potentially useful synthesis, though its load-bearing steps require verification against the actual equations.

major comments (2)

[Abstract] The abstract asserts a derived generalization bound based on Semantic Coverage Measure η rather than sample size, yet the visible text supplies neither the bound statement, its proof, nor the definition of η. Without these, it is impossible to assess whether the bound is non-vacuous or whether η is independently grounded versus fitted to the same data it purports to explain.
[Theoretical Framework] The central claim rests on the training distribution permitting 'Markovian Screening' of the true causal mechanism. The manuscript must explicitly state the formal condition under which this screening occurs and demonstrate that it is not tautological with the shortcut-learning phenomenon being explained.

minor comments (2)

Notation for η, the Information Bottleneck quantities, and the topological-filter property of PRMs should be introduced with explicit definitions before being used in claims.
The paper should include at least one concrete example (synthetic or real) showing how the η-bound predicts observed collapse where a standard sample-size bound does not.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below by clarifying the locations and content of the theoretical results and by strengthening the formal presentation of the key assumption. Both points will be incorporated via targeted revisions to improve accessibility and rigor without altering the core claims.

read point-by-point responses

Referee: [Abstract] The abstract asserts a derived generalization bound based on Semantic Coverage Measure η rather than sample size, yet the visible text supplies neither the bound statement, its proof, nor the definition of η. Without these, it is impossible to assess whether the bound is non-vacuous or whether η is independently grounded versus fitted to the same data it purports to explain.

Authors: The bound appears as Theorem 1 in Section 4: for a model trained under outcome optimization, the generalization gap satisfies R_gen(P_test) ≤ C · (1 - η) + ε, where η ∈ [0,1] is the Semantic Coverage Measure defined as η = 1 - sup_{f ∈ F_shortcut} I(f(X); Y) / I(C; Y) and C denotes the true causal mechanism. The proof (Appendix B) proceeds by combining the Information Bottleneck decomposition with the SGD simplicity bias, replacing the usual n-dependent term with an η-dependent term that remains large when the training distribution screens off high-complexity paths. η is derived directly from the SCM factorization and is not fitted to data; it can be estimated from interventional queries on the causal graph. We will revise the abstract to include a one-sentence statement of the bound and move the definition of η into the main text (Section 3) to address visibility. revision: yes
Referee: [Theoretical Framework] The central claim rests on the training distribution permitting 'Markovian Screening' of the true causal mechanism. The manuscript must explicitly state the formal condition under which this screening occurs and demonstrate that it is not tautological with the shortcut-learning phenomenon being explained.

Authors: We will add Definition 2 in Section 3.1: the training distribution P permits Markovian Screening of the true causal mechanism if there exists a low-complexity variable Z (spurious) such that Y ⊥ C | Z holds in the observational distribution while the interventional distribution P(Y | do(C)) retains dependence on C. This is a purely distributional property of the data-generating SCM and is independent of the learner; it is verified by checking d-separation in the graph and is illustrated by the concrete example in Figure 2 (homogeneous vs. diverse data). The phenomenon of shortcut learning is then a consequence of SGD bias acting on a distribution that already satisfies the screening condition, so the assumption is not tautological. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents a derived generalization bound replacing sample size with Semantic Coverage Measure η, framed via SCM-IB bridging and Markovian Screening. No quoted equations or steps reduce the bound to a self-definition of η, a fitted parameter renamed as prediction, or a self-citation chain. The framework invokes standard external principles (SCM, IB, SGD bias) without load-bearing self-citations or ansatz smuggling. The derivation remains self-contained with independent content from the causal setup; η is introduced as a new measure grounded in the topology of the training distribution rather than fitted to the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be extracted or verified.

pith-pipeline@v0.9.1-grok · 5749 in / 1062 out tokens · 25044 ms · 2026-06-28T19:33:48.863107+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 10 canonical work pages · 8 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[2]

Publications Manual , year = "1983", publisher =

1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[5]

Dan Gusfield , title =. 1997

1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[8]

Advances in neural information processing systems , volume=

Deep reinforcement learning from human preferences , author=. Advances in neural information processing systems , volume=
[9]

Training language models to follow instructions with human feedback , url =

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,...
[11]

Chain-of-

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and ichter, brian and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny , editor =. Chain-of-. Advances in. 2022 , pages =

2022
[12]

Advances in neural information processing systems , volume=

Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=
[14]

Defining and Characterizing Reward Gaming , url =

Skalse, Joar and Howe, Nikolaus and Krasheninnikov, Dmitrii and Krueger, David , booktitle =. Defining and Characterizing Reward Gaming , url =
[15]

Nature Machine Intelligence , volume=

Shortcut learning in deep neural networks , author=. Nature Machine Intelligence , volume=. 2020 , publisher=

2020
[16]

Transactions on machine learning research , year=

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models , author=. Transactions on machine learning research , year=
[19]

2015 ieee information theory workshop (itw) , pages=

Deep learning and the information bottleneck principle , author=. 2015 ieee information theory workshop (itw) , pages=. 2015 , organization=

2015
[20]

2009 , publisher=

Causality , author=. 2009 , publisher=

2009
[21]

A closer look at memorization in deep networks , year =

Arpit, Devansh and Jastrzundefinedbski, Stanis. A closer look at memorization in deep networks , year =. Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =
[24]

The Twelfth International Conference on Learning Representations , year=

Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=
[26]

Advances in neural information processing systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=
[28]

and Sifre, Laurent , title =

Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...

2022
[29]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=. 2024 , url=

2024
[30]

EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework , author =
[31]

Proceedings of the International Conference on Learning Representations (ICLR) , year=

Measuring Massive Multitask Language Understanding , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=
[32]

Measuring

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , editor =. Measuring. Proceedings of the
[33]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Frame: Feedback-refined agent methodology for enhancing medical research insights , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

2025
[34]

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man \'e . 2016. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565

work page internal anchor Pith review Pith/arXiv arXiv 2016
[35]

Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien

Devansh Arpit, Stanis aw Jastrzundefinedbski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. 2017. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 233...

2017
[36]

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Micha Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. https://doi.org/10.1609/aaai.v38i16.29720 Graph of thoughts: solving elaborate problems with large language models . In Proceedings of the Thirty-Eighth AAAI Conferenc...

work page doi:10.1609/aaai.v38i16.29720 2024
[37]

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30

2017
[38]

Robert Geirhos, J \"o rn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665--673

2020
[39]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/be83ab3ecd0db773eb2dc1b0a17836a1-Paper-round2.pdf Measuring Mathematical Problem Solving With the MATH Dataset . In Proceedings of the Neural Information P...

2021
[40]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others. 2022. Training compute-optimal ...

2022
[41]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020
[42]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199--22213

2022
[43]

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. In The Twelfth International Conference on Learning Representations

2023
[44]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/b...

2022
[45]

Judea Pearl. 2009. Causality. Cambridge university press

2009
[46]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[47]

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/3d719fee332caa23d5038b8a90e81796-Paper-Conference.pdf Defining and characterizing reward gaming . In Advances in Neural Information Processing Systems, volume 35, pages 9460--9471. Curran Associates, Inc

2022
[48]

Naftali Tishby, Fernando C Pereira, and William Bialek. 2000. The information bottleneck method. arXiv preprint physics/0004057

work page internal anchor Pith review Pith/arXiv arXiv 2000
[49]

Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 ieee information theory workshop (itw), pages 1--5. Ieee

2015
[50]

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275

work page internal anchor Pith review Pith/arXiv arXiv 2022
[51]

Guillermo Valle-Perez, Chico Q Camargo, and Ard A Louis. 2018. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522

work page internal anchor Pith review Pith/arXiv arXiv 2018
[52]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

work page internal anchor Pith review Pith/arXiv arXiv 2022
[53]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf Chain-of- Thought Prompting Elicits Reasoning in Large Language Models . In Advances in Neural Information Processing System...

2022
[54]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809--11822

2023
[55]

Chengzhang Yu, Yiming Zhang, Zhixin Liu, Zenghui Ding, Yining Sun, and Zhanpeng Jin. 2025. Frame: Feedback-refined agent methodology for enhancing medical research insights. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7690--7704

2025
[56]

Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. 2025. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1

2025
[57]

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. http://arxiv.org/abs/2403.13372 Llamafactory: Unified efficient fine-tuning of 100+ language models . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Assoc...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972

[2] [2]

Publications Manual , year = "1983", publisher =

1983

[3] [3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

[5] [5]

Dan Gusfield , title =. 1997

1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

[8] [8]

Advances in neural information processing systems , volume=

Deep reinforcement learning from human preferences , author=. Advances in neural information processing systems , volume=

[9] [9]

Training language models to follow instructions with human feedback , url =

Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and Schulman, John and Hilton, Jacob and Kelton, Fraser and Miller, Luke and Simens, Maddie and Askell, Amanda and Welinder, Peter and Christiano, Paul F and Leike, Jan and Lowe,...

[10] [11]

Chain-of-

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and ichter, brian and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny , editor =. Chain-of-. Advances in. 2022 , pages =

2022

[11] [12]

Advances in neural information processing systems , volume=

Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=

[12] [14]

Defining and Characterizing Reward Gaming , url =

Skalse, Joar and Howe, Nikolaus and Krasheninnikov, Dmitrii and Krueger, David , booktitle =. Defining and Characterizing Reward Gaming , url =

[13] [15]

Nature Machine Intelligence , volume=

Shortcut learning in deep neural networks , author=. Nature Machine Intelligence , volume=. 2020 , publisher=

2020

[14] [16]

Transactions on machine learning research , year=

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models , author=. Transactions on machine learning research , year=

[15] [19]

2015 ieee information theory workshop (itw) , pages=

Deep learning and the information bottleneck principle , author=. 2015 ieee information theory workshop (itw) , pages=. 2015 , organization=

2015

[16] [20]

2009 , publisher=

Causality , author=. 2009 , publisher=

2009

[17] [21]

A closer look at memorization in deep networks , year =

Arpit, Devansh and Jastrzundefinedbski, Stanis. A closer look at memorization in deep networks , year =. Proceedings of the 34th International Conference on Machine Learning - Volume 70 , pages =

[18] [24]

The Twelfth International Conference on Learning Representations , year=

Let's verify step by step , author=. The Twelfth International Conference on Learning Representations , year=

[19] [26]

Advances in neural information processing systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=

[20] [28]

and Sifre, Laurent , title =

Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and de Las Casas, Diego and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and van den Driessche, George and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...

2022

[21] [29]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) , address=. 2024 , url=

2024

[22] [30]

EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework , author =

[23] [31]

Proceedings of the International Conference on Learning Representations (ICLR) , year=

Measuring Massive Multitask Language Understanding , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=

[24] [32]

Measuring

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , editor =. Measuring. Proceedings of the

[25] [33]

Findings of the Association for Computational Linguistics: ACL 2025 , pages=

Frame: Feedback-refined agent methodology for enhancing medical research insights , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=

2025

[26] [34]

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man \'e . 2016. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565

work page internal anchor Pith review Pith/arXiv arXiv 2016

[27] [35]

Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien

Devansh Arpit, Stanis aw Jastrzundefinedbski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. 2017. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, page 233...

2017

[28] [36]

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Micha Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. https://doi.org/10.1609/aaai.v38i16.29720 Graph of thoughts: solving elaborate problems with large language models . In Proceedings of the Thirty-Eighth AAAI Conferenc...

work page doi:10.1609/aaai.v38i16.29720 2024

[29] [37]

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30

2017

[30] [38]

Robert Geirhos, J \"o rn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665--673

2020

[31] [39]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/be83ab3ecd0db773eb2dc1b0a17836a1-Paper-round2.pdf Measuring Mathematical Problem Solving With the MATH Dataset . In Proceedings of the Neural Information P...

2021

[32] [40]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, and 3 others. 2022. Training compute-optimal ...

2022

[33] [41]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020

[34] [42]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199--22213

2022

[35] [43]

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. In The Twelfth International Conference on Learning Representations

2023

[36] [44]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/b...

2022

[37] [45]

Judea Pearl. 2009. Causality. Cambridge university press

2009

[38] [46]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017

[39] [47]

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/3d719fee332caa23d5038b8a90e81796-Paper-Conference.pdf Defining and characterizing reward gaming . In Advances in Neural Information Processing Systems, volume 35, pages 9460--9471. Curran Associates, Inc

2022

[40] [48]

Naftali Tishby, Fernando C Pereira, and William Bialek. 2000. The information bottleneck method. arXiv preprint physics/0004057

work page internal anchor Pith review Pith/arXiv arXiv 2000

[41] [49]

Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. In 2015 ieee information theory workshop (itw), pages 1--5. Ieee

2015

[42] [50]

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275

work page internal anchor Pith review Pith/arXiv arXiv 2022

[43] [51]

Guillermo Valle-Perez, Chico Q Camargo, and Ard A Louis. 2018. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522

work page internal anchor Pith review Pith/arXiv arXiv 2018

[44] [52]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

work page internal anchor Pith review Pith/arXiv arXiv 2022

[45] [53]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf Chain-of- Thought Prompting Elicits Reasoning in Large Language Models . In Advances in Neural Information Processing System...

2022

[46] [54]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809--11822

2023

[47] [55]

Chengzhang Yu, Yiming Zhang, Zhixin Liu, Zenghui Ding, Yining Sun, and Zhanpeng Jin. 2025. Frame: Feedback-refined agent methodology for enhancing medical research insights. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7690--7704

2025

[48] [56]

Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. 2025. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1

2025

[49] [57]

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024. http://arxiv.org/abs/2403.13372 Llamafactory: Unified efficient fine-tuning of 100+ language models . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. Assoc...

work page internal anchor Pith review Pith/arXiv arXiv 2024