On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective
Pith reviewed 2026-05-12 00:47 UTC · model grok-4.3
The pith
Post-training reweights behaviors within a pretrained model's accessible support to elicit capabilities, or expands that support to create new ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Post-training that reweights behaviors within the accessible support is capability elicitation; changing the support itself is capability creation. Both SFT and RL can be seen as reweighting the pretrained reference distribution, only with different external signals, and when the update remains close to the base model, the main effect is local reweighting, not capability creation.
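The free-energy view named here has a standard closed form in KL-regularized control. A minimal sketch, assuming the usual notation (energy $E$, temperature $\beta$, reference model $\pi_{\mathrm{ref}}$) rather than anything quoted from the paper:

```latex
% Free-energy objective over behaviors y given prompt x:
%   F(\pi) = \mathbb{E}_{y \sim \pi}[E(x,y)] + \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})
% Minimizing over normalized \pi (e.g., via a Lagrange multiplier) gives
% Boltzmann reweighting of the reference model:
\[
\pi^{\ast}(y \mid x)
  = \frac{\pi_{\mathrm{ref}}(y \mid x)\, e^{-E(x,y)/\beta}}{Z(x)},
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, e^{-E(x,y)/\beta}.
\]
% RL: E(x,y) = -r(x,y); SFT: demonstrated behaviors receive low E(x,y).
% Since \pi_{\mathrm{ref}}(y \mid x) = 0 forces \pi^{\ast}(y \mid x) = 0,
% updates of this form preserve exact support: they reweight within it.
```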
What carries the argument
Accessible support: the set of behaviors that a model can practically produce under finite budgets. Whether post-training reweights within this set or changes it determines whether it elicits or creates capabilities.
If this is right
- The central question for post-training is no longer whether it is framed as SFT or RL, but whether it reweights behaviors already within reach or expands the model's reachable behavioral space through search, interaction, tool use, or new information.
- When the update remains close to the base model, the main effect is local reweighting, not capability creation.
- SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals: demonstrations define low-energy behavior for SFT and rewards define low-energy behavior for RL.
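A minimal numerical sketch of this shared-reweighting claim on a toy discrete behavior space. The five behaviors, the energies, and the temperature are invented here for illustration; nothing below comes from the paper:

```python
import numpy as np

# Toy behavior space and a pretrained reference distribution over it.
behaviors = ["refuse", "small_talk", "proof_sketch", "full_proof", "gibberish"]
pi_ref = np.array([0.30, 0.50, 0.15, 0.04, 0.01])

def reweight(pi_ref, energy, beta=1.0):
    """Boltzmann reweighting: minimizer of E_pi[energy] + beta * KL(pi || pi_ref)."""
    w = pi_ref * np.exp(-np.asarray(energy) / beta)
    return w / w.sum()

# "SFT" signal: demonstrations of full proofs assign them low energy.
sft_energy = np.array([2.0, 2.0, 1.0, 0.0, 3.0])

# "RL" signal: a reward model scores behaviors; energy is negative reward.
rl_energy = -np.array([-1.0, 0.0, 1.0, 2.0, -3.0])

for name, pi in [("ref", pi_ref),
                 ("sft", reweight(pi_ref, sft_energy)),
                 ("rl", reweight(pi_ref, rl_energy))]:
    print(f"{name:>4}", np.round(pi, 3))

# Both updates raise the probability of "full_proof", and neither can assign
# positive mass to a behavior with pi_ref = 0: reweighting, not creation.
```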
Where Pith is reading between the lines
- This view suggests measuring post-training by testing whether new behaviors could have been reached with finite effort before the update.
- Training procedures that incorporate explicit search or external interaction are positioned as more likely to expand accessible support.
- The framework could be used to reinterpret scaling curves as mixtures of elicitation and creation effects at different training stages.
Load-bearing premise
The notion of accessible support can be made precise and measurable enough to distinguish elicitation from creation in practice.
What would settle it
An experiment showing a concrete behavior that a model produces after post-training but could not produce before under a matched finite budget of compute, data, or interaction, separate from mere reweighting of already-reachable outputs.
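One hedged way to cash out such an experiment: probe membership in the accessible support by best-of-N sampling under a fixed budget, before and after the update. The `sample` and `is_success` interfaces, the placeholder models, and the thresholds below are hypothetical scaffolding, not anything the paper specifies:

```python
import math

def reachable(sample, is_success, budget_n, alpha=0.05):
    """Budgeted reachability probe: draw up to budget_n samples and check
    whether any satisfies is_success. If none does, return a (1 - alpha)
    upper confidence bound on the per-sample success probability
    (rule-of-three style: zero hits in n trials implies p <= -ln(alpha)/n)."""
    for _ in range(budget_n):
        if is_success(sample()):
            return True, None
    return False, -math.log(alpha) / budget_n

# Hypothetical usage (base_model, tuned_model, prompt, solves_task are placeholders):
# hit_before, p_ub = reachable(lambda: base_model.sample(prompt, temperature=1.0),
#                              solves_task, budget_n=100_000)
# hit_after, _ = reachable(lambda: tuned_model.sample(prompt, temperature=1.0),
#                          solves_task, budget_n=100_000)
# Evidence for creation rather than reweighting: hit_after is True while
# hit_before is False, with p_ub small at the matched budget.
```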
Original abstract
Debates about large language model post-training often treat supervised fine-tuning (SFT) as imitation and reinforcement learning (RL) as discovery. But this distinction is too coarse. What matters is whether a training procedure increases the probability of behaviors the pretrained model could already produce, or whether it changes what the model can practically reach. We argue that post-training research should distinguish between capability elicitation and capability creation. We make this distinction operational by introducing the notion of accessible support: the set of behaviors that a model can practically produce under finite budgets. Post-training that reweights behaviors within this support is capability elicitation; whereas changing the support itself corresponds to capability creation. We develop this argument through a free-energy view of post-training. SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals. Demonstration signals define low-energy behavior for SFT, and reward signals define low-energy behavior for RL. When the update remains close to the base model, the main effect is local reweighting, not capability creation. Within this framework, the central question is no longer whether post-training is framed as SFT or RL, but whether it reweights behaviors already within reach, or instead expands the model's reachable behavioral space through search, interaction, tool use, or the incorporation of new information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that debates on LLM post-training oversimplify SFT as imitation and RL as discovery. It introduces 'accessible support' (behaviors a model can practically produce under finite budgets) to distinguish capability elicitation (reweighting probabilities within this support) from capability creation (expanding the support itself). Both SFT and RL are reframed as reweighting a pretrained reference distribution under a free-energy perspective, where demonstration or reward signals define low-energy behaviors; the key question is whether updates remain local to the base model or expand reachable behaviors via search, interaction, or new information.
Significance. If operationalized, the distinction could usefully reorient post-training research toward explicit analysis of whether updates elicit existing behaviors or create new ones, moving beyond coarse SFT/RL labels. The free-energy analogy correctly highlights that both methods optimize energy-like objectives and that proximity to the base model favors reweighting. However, the manuscript is entirely conceptual with no derivations, data, benchmarks, or examples, so its significance remains prospective rather than demonstrated.
Major comments (1)
- [Abstract] The central claim requires that 'accessible support' (behaviors reachable under finite budgets) can be identified precisely enough to classify any post-training update as reweighting inside the support or expansion of the support. The paper defines it only descriptively ('the set of behaviors that a model can practically produce under finite budgets') and states that SFT/RL are reweightings of a reference distribution, but supplies no mathematical characterization (e.g., no measure on behavior space, no budget parameterization, no decision procedure), no algorithm, and no worked example.
Minor comments (2)
- The manuscript would benefit from a brief toy-model illustration showing how one would determine whether a specific update changes the accessible support.
- Clarify whether the free-energy perspective is intended as a strict analogy or as a formal mapping that could yield testable predictions.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful review. The comments accurately highlight the conceptual focus of the manuscript and the need for greater precision around the definition of accessible support. We respond to the major comment below, indicating the revisions we will make.
Point-by-point responses
- Referee: [Abstract] The central claim requires that 'accessible support' (behaviors reachable under finite budgets) can be identified precisely enough to classify any post-training update as reweighting inside the support or expansion of the support. The paper defines it only descriptively ('the set of behaviors that a model can practically produce under finite budgets') and states that SFT/RL are reweightings of a reference distribution, but supplies no mathematical characterization (e.g., no measure on behavior space, no budget parameterization, no decision procedure), no algorithm, and no worked example.
Authors: We agree that the current treatment of accessible support is descriptive rather than equipped with a formal measure on behavior space, explicit budget parameterization, or a decision procedure. The manuscript is a perspective paper whose primary aim is to reframe post-training debates; it does not purport to deliver a complete operational framework. In the revised version we will (i) add a more explicit parameterization of the finite budget (in terms of sampling temperature, sequence length, and computational resources) and (ii) include a short worked example illustrating how one might assess whether a given behavior lies inside or outside the accessible support in a simplified setting. We will also state clearly that a full algorithm or classification procedure lies beyond the scope of this work and remains an open question for future research. These changes will make the central claim more precise while preserving the paper's conceptual character.
Revision: partial
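For illustration, the budget parameterization promised in (i) might be sketched as below; the triple $(\tau, L, N)$ and the threshold $\varepsilon$ are assumptions made here, not notation committed to by the authors:

```latex
% Budget as a triple: sampling temperature \tau, max length L, sample count N.
%   B = (\tau, L, N)
% Accessible support at budget B: behaviors of length at most L that the
% temperature-\tau model emits with non-negligible probability within N draws.
\[
\mathrm{supp}_B(\pi) =
  \bigl\{\, y : |y| \le L,\ \pi_\tau(y \mid x) \ge \varepsilon(N) \,\bigr\},
\qquad \varepsilon(N) \approx 1/N .
\]
% Elicitation: \mathrm{supp}_B(\pi_{\mathrm{post}}) = \mathrm{supp}_B(\pi_{\mathrm{ref}}),
%              with probabilities reweighted inside it.
% Creation:    \mathrm{supp}_B(\pi_{\mathrm{post}}) \setminus \mathrm{supp}_B(\pi_{\mathrm{ref}}) \neq \varnothing.
```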
Circularity Check
Central distinction introduced by definition of 'accessible support' without formalization or external derivation
Specific steps
- Self-definitional [Abstract]:
"We make this distinction operational by introducing the notion of accessible support: the set of behaviors that a model can practically produce under finite budgets. Post-training that reweights behaviors within this support is capability elicitation; whereas changing the support itself corresponds to capability creation."
The elicitation-vs-creation distinction is defined exactly as reweighting inside versus expansion of the newly introduced 'accessible support' term. The term is presented as making the distinction operational, but the definition supplies no independent measure, budget parameterization, or decision procedure; the classification therefore holds by construction of the definition rather than by derivation from prior results or data.
Full rationale
The paper's load-bearing claim—that post-training is elicitation when it reweights inside accessible support and creation when it expands the support—is made by introducing the term 'accessible support' and then defining the distinction directly in terms of it. This reduces the claimed operationalization to a definitional move rather than a derivation from independent equations, benchmarks, or measurable procedures. The free-energy framing is asserted as a perspective under which SFT and RL are both reweightings, but supplies no checked equations or external validation that would make the support concept falsifiable outside the definition itself. No self-citations or fitted parameters appear in the provided text.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Post-training procedures can be viewed as reweighting a pretrained reference distribution using external signals that define low-energy behaviors.
Invented entities (1)
- Accessible support: no independent evidence.