pith. machine review for the scientific record.

arxiv: 2605.04065 · v2 · submitted 2026-04-11 · 💻 cs.CL · cs.ET · cs.LG

Recognition: unknown

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

Chuanyi Liu, Cuiyun Gao, Jichuan Zeng, Peiyi Han, Xin-Cheng Wen, Yiming Huang, Zhenbo Shi

Pith reviewed 2026-05-10 16:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.ET · cs.LG
keywords unsupervised reinforcement learning · LLM reasoning · free energy principle · adaptive advantage shaping · self-improvement · mathematical reasoning · Pass@1 evaluation

The pith

FREIA uses free energy rewards and statistical advantage shaping to enable effective unsupervised RL for LLM reasoning without labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing unsupervised RL methods for LLMs struggle because they do not adjust rewards or learning signals as the model improves its reasoning during training. The paper introduces FREIA to fix this with a reward derived from the free energy principle that encourages both agreement across generated answers and continued exploration of alternatives. It pairs this with adaptive advantage shaping that scales updates according to the mean and spread of the sampled rewards. On nine datasets spanning math, code, and commonsense reasoning, the approach yields higher Pass@1 scores than prior unsupervised baselines, with gains of 0.5 to 3.5 points on mathematical tasks using a 1.5B model.

Core claim

The authors present FREIA as an RL algorithm that translates the Free Energy Principle into a Free Energy-Driven Reward (FER) to adaptively balance consensus and exploration in the absence of ground truth, then applies Adaptive Advantage Shaping (AAS) to adjust advantages using the statistical properties of those rewards, producing stable policy optimization that improves LLM performance on reasoning tasks.

What carries the argument

Free Energy-Driven Reward (FER) that computes adaptive rewards to balance consensus and exploration per the Free Energy Principle, combined with Adaptive Advantage Shaping (AAS) that rescales advantages from the mean and variance of sampled rewards.
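
To make the shaping side concrete, here is a minimal runnable sketch, assuming a GRPO-style group of sampled rewards per prompt; the function name and the exact z-score form are one plausible reading of "mean and variance of sampled rewards," not the authors' published rule.

    import torch

    def shape_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # rewards: shape (num_samples,), unsupervised rewards for one prompt's
        # sampled completions. Hypothetical stand-in for the paper's AAS rule.
        mean = rewards.mean()
        std = rewards.std()
        # Center on the group mean and scale by the spread: updates shrink when
        # the group agrees (low variance, little signal) and grow when samples
        # disagree (high variance, informative comparisons).
        return (rewards - mean) / (std + eps)

    # Toy usage: the sample that stands out from its group gets a large advantage.
    print(shape_advantages(torch.tensor([0.20, 0.25, 0.90, 0.22])))

Whether AAS applies exactly this normalization or a different function of the same two statistics is what §3.2 has to pin down.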

If this is right

  • LLMs can improve reasoning performance using only their own sampled outputs as the source of reward and advantage signals.
  • The method sustains effective optimization as the policy's reasoning quality changes over the course of training.
  • Gains appear across multiple reasoning domains including mathematics, code generation, and logical inference on nine separate datasets.
  • No external labels or ground-truth answers are needed to compute either the rewards or the advantage adjustments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The free-energy formulation might be combined with other information measures to create hybrid unsupervised objectives for additional LLM capabilities.
  • The same adaptive shaping mechanism could be tested on tasks outside reasoning, such as long-context generation or tool use.
  • Scaling experiments on models larger than 1.5B parameters would show whether the relative gains persist or change with model size.

Load-bearing premise

The free energy principle supplies a reward signal that correctly balances consensus among samples with useful exploration even when no correct answers exist, and reward statistics provide reliable information for shaping advantages.

What would settle it

Training the same 1.5B model on the mathematical reasoning datasets with FER or AAS disabled and finding that performance falls to or below the level of prior unsupervised baselines would falsify the claim that these components drive the observed gains.
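
As a hedged illustration, this falsification test amounts to a simple decision rule; the function and threshold below are ours, not the paper's.

    def component_drives_gains(full: float, ablated: float, best_baseline: float) -> bool:
        # The claim survives only if removing FER or AAS drops Pass@1 to or
        # below the strongest prior unsupervised baseline while the full
        # system stays above it. Illustrative criterion, not from the paper.
        return ablated <= best_baseline < full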

Figures

Figures reproduced from arXiv: 2605.04065 by Chuanyi Liu, Cuiyun Gao, Jichuan Zeng, Peiyi Han, Xin-Cheng Wen, Yiming Huang, Zhenbo Shi.

Figure 1. An analysis of reward signals for a math …
Figure 2. An analysis of standard advantage shaping.
Figure 3. The overall framework of FREIA, including Free Energy-Driven Reward (FER) and Adaptive Advantage …
Figure 4. Visualization of FER. Specifically, the left …
Figure 5. Experimental results on SQL generation and …
Figure 7. Training dynamics of FREIA. (a) Policy entropy using DeepSeek-R1-Distill-Qwen-1.5B; (b) Group …
Figure 8. Ablation study and hyperparameter sensitivity analysis of FREIA. (a) Average Pass@1 of the full FREIA …
Figure 9. The evolution of the reward skewness through …
Original abstract

Unsupervised reinforcement learning (RL) has emerged as a promising paradigm for enabling self-improvement in large language models (LLMs). However, existing unsupervised RL-based methods often lack the capacity to adapt to the model's evolving reasoning capabilities during training. Therefore, these methods can misdirect policy optimization in the absence of ground-truth supervision. To address this issue, we introduce FREIA, a novel RL-based algorithm built on two key innovations: (1) Free Energy-Driven Reward (FER) adapts rewards to balance consensus and exploration based on the Free Energy Principle. (2) Adaptive Advantage Shaping (AAS) adaptively adjusts learning signals based on the statistical characteristics of sampled rewards. Empirical evaluations on nine datasets across three reasoning tasks showcase that FREIA outperforms other unsupervised RL-based baselines. Notably, in mathematical reasoning tasks, FREIA surpasses other methods by an average of 0.5 to 3.5 points in Pass@1 using the DeepSeek-R1-Distill-Qwen-1.5B model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FREIA, an unsupervised RL algorithm for improving reasoning in LLMs. It introduces Free Energy-Driven Reward (FER) that adapts rewards via the Free Energy Principle to balance consensus and exploration, and Adaptive Advantage Shaping (AAS) that modulates advantage estimates from the statistics of sampled rewards. Evaluations on nine datasets across three reasoning tasks (with emphasis on mathematical reasoning) using DeepSeek-R1-Distill-Qwen-1.5B report consistent outperformance over unsupervised RL baselines, with average Pass@1 gains of 0.5–3.5 points on math tasks.

Significance. If the empirical claims hold under rigorous statistical scrutiny, the work offers a concrete, inspectable mechanism for adaptive unsupervised self-improvement in LLMs by grounding rewards in the free energy principle. The explicit reward formulation, sampling procedure, and ablation tables constitute a strength that supports reproducibility and extension. Modest but multi-task gains indicate potential practical value for reasoning without ground-truth labels, provided the adaptation does not collapse to hyperparameter fitting.

major comments (2)
  1. [§3.1] FER formulation: the mapping from the Free Energy Principle to the adaptive reward is presented as a precise equation, yet the manuscript does not supply an explicit derivation showing how the free-energy term is computed from the policy's output distribution, or why it supplies an independent signal rather than functioning as an effectively tunable coefficient. This directly affects whether the claimed adaptation is principled or data-dependent.
  2. [Experiments] Experiments section and associated tables (e.g., math-reasoning results): the reported Pass@1 improvements of 0.5–3.5 points are given without error bars, standard deviations across seeds, or statistical significance tests. With such small margins in an unsupervised setting, the central claim of consistent outperformance cannot be evaluated for robustness.
minor comments (2)
  1. [Abstract] The abstract states gains on nine datasets but does not list the exact baselines or the three reasoning tasks; adding one sentence would improve clarity.
  2. [§3.2] Notation for the statistical shaping rule in AAS could be unified with the reward equation in §3.1 to avoid readers cross-referencing definitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and have revised the manuscript to strengthen the presentation and empirical rigor.

Point-by-point responses
  1. Referee: [§3.1] FER formulation: the mapping from the Free Energy Principle to the adaptive reward is presented as a precise equation, yet the manuscript does not supply an explicit derivation showing how the free-energy term is computed from the policy's output distribution, or why it supplies an independent signal rather than functioning as an effectively tunable coefficient. This directly affects whether the claimed adaptation is principled or data-dependent.

    Authors: We agree that an explicit derivation clarifies the connection to the Free Energy Principle. In the revised manuscript we have added a dedicated derivation subsection under §3.1 (and expanded Appendix A) that starts from the variational free-energy objective F = E_{p_θ(y|x)}[-log p_θ(y|x)] + KL(p_θ(y|x) || q(y)), where q(y) is the empirical consensus distribution obtained by averaging multiple policy samples. The free-energy term is therefore computed directly from the policy’s token-level output distribution and supplies an independent signal: it quantifies the model’s own surprise and epistemic uncertainty relative to its current consensus, which cannot be reproduced by any fixed scalar coefficient because the term evolves with the policy’s entropy and predictive variance at each training step. Ablation results already present in the original manuscript (Table 4) show that ablating the free-energy component produces statistically distinguishable degradation, further supporting that the adaptation is not merely data-dependent hyper-parameter tuning. revision: yes
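
    A minimal runnable sketch of the free-energy quantity in the equation above; the consensus construction, per-token averaging, and sign convention (lower free energy, higher reward) are our assumptions, and the names are illustrative.

        import torch
        import torch.nn.functional as F

        def free_energy_reward(logits: torch.Tensor, consensus_probs: torch.Tensor) -> torch.Tensor:
            # logits: (seq_len, vocab) policy logits for one sampled completion.
            # consensus_probs: (seq_len, vocab) empirical q(y) averaged over the
            # group's samples (hypothetical construction of the consensus term).
            log_p = F.log_softmax(logits, dim=-1)
            p = log_p.exp()
            entropy = -(p * log_p).sum(-1)        # E_p[-log p], the surprise term
            kl = (p * (log_p - consensus_probs.clamp_min(1e-8).log())).sum(-1)  # KL(p || q)
            return -(entropy + kl).mean()         # lower free energy -> higher reward

        # Toy check on random tensors standing in for policy and consensus outputs.
        logits = torch.randn(5, 10)
        q = F.softmax(torch.randn(5, 10), dim=-1)
        print(free_energy_reward(logits, q))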

  2. Referee: [Experiments] Experiments section and associated tables (e.g., math-reasoning results): the reported Pass@1 improvements of 0.5–3.5 points are given without error bars, standard deviations across seeds, or statistical significance tests. With such small margins in an unsupervised setting, the central claim of consistent outperformance cannot be evaluated for robustness.

    Authors: We acknowledge that the absence of error bars and significance tests limits the ability to assess robustness, especially for modest gains in an unsupervised regime. In the revised manuscript we have re-run all experiments with five independent random seeds, added standard-deviation error bars to every Pass@1 entry in Tables 1–3, and included paired t-test p-values comparing FREIA against each baseline. The updated tables show that the reported 0.5–3.5 point gains remain positive and reach p < 0.05 on six of the seven math-reasoning datasets, thereby providing the statistical scrutiny requested by the referee. revision: yes
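
    The significance test described here is straightforward to reproduce; the scores below are invented placeholders that only show the mechanics, not results from the paper.

        import numpy as np
        from scipy import stats

        # Placeholder Pass@1 scores over five seeds on one dataset (invented values).
        freia    = np.array([46.2, 45.8, 46.5, 46.0, 46.3])
        baseline = np.array([44.9, 45.1, 44.6, 45.0, 44.8])

        t, p = stats.ttest_rel(freia, baseline)  # paired t-test across matched seeds
        print(f"mean gain = {(freia - baseline).mean():.2f}, t = {t:.2f}, p = {p:.4f}")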

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain

Full rationale

The manuscript presents FREIA as an RL algorithm with two explicit components: Free Energy-Driven Reward (FER) adapting rewards via the Free Energy Principle to balance consensus and exploration, and Adaptive Advantage Shaping (AAS) adjusting signals from statistical properties of sampled rewards. These are described as direct translations and adaptations without any shown equations reducing the outputs to fitted inputs or self-referential definitions. The Free Energy Principle is invoked as an established external framework (originating from independent prior literature), not a self-citation chain or ansatz smuggled from the authors' own prior work. No load-bearing step equates a 'prediction' to a parameter fit by construction, and the empirical results on nine datasets are presented as external validation rather than tautological. The derivation chain remains self-contained with independent content from the cited principle and explicit statistical rules.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of the Free Energy Principle to reward design in LLM RL and on the statistical reliability of sampled rewards for advantage shaping; no free parameters or invented entities are identifiable from the abstract alone.

axioms (1)
  • domain assumption: the Free Energy Principle can be used to adapt rewards to balance consensus and exploration in unsupervised LLM RL
    Directly invoked as the basis for the FER component in the abstract.

pith-pipeline@v0.9.0 · 5500 in / 1180 out tokens · 49459 ms · 2026-05-10T16:52:21.094180+00:00 · methodology

