pith. machine review for the scientific record.

arxiv: 2512.15605 · v3 · submitted 2025-12-17 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links · Lean Theorem

Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:25 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords autoregressive models · energy-based models · chain rule · bijection · soft Bellman equation · maximum entropy RL · next-token prediction · lookahead
0 comments

The pith

Autoregressive language models are equivalent to energy-based models in function space via a bijection induced by the chain rule of probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that autoregressive models, trained only on next-token prediction, match energy-based models that score entire sequences by energy. This match follows directly from factoring joint probabilities into conditionals via the chain rule. A reader would care because the equivalence explains how next-token training can still produce planning behavior that aligns with optimal policies in maximum-entropy reinforcement learning. The same link also equates supervised training of the two model classes and supplies error bounds when distilling an energy-based model into an autoregressive one.

Core claim

Taking the chain rule of probability as a starting point yields an explicit bijection between autoregressive models and energy-based models in function space; this bijection is a special case of the soft Bellman equation from maximum-entropy reinforcement learning. Supervised learning on next-token prediction is therefore equivalent to learning the corresponding energy-based model, and theoretical error bounds exist for distilling energy-based models into autoregressive ones. The result supplies a concrete account of why next-token predictors can exhibit lookahead despite their local training objective.
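
The bijection is concrete enough to exercise on a toy instance. The sketch below is ours, not the paper's code: it builds a random autoregressive model over a three-token vocabulary, maps it to a sequence-level energy via the chain rule, and recovers the original conditionals by marginalizing suffixes of the energy.

```python
# Toy instance of the ARM <-> EBM bijection (our sketch, not the paper's code).
import itertools
import numpy as np

rng = np.random.default_rng(0)
V, T = 3, 2  # vocabulary size, sequence length

# One conditional distribution per prefix defines the ARM.
prefixes = [p for t in range(T) for p in itertools.product(range(V), repeat=t)]
cond = {p: rng.dirichlet(np.ones(V)) for p in prefixes}

def arm_logprob(x):
    # Chain rule: log p(x) = sum_t log p(x_t | x_<t).
    return sum(np.log(cond[x[:t]][x[t]]) for t in range(T))

# ARM -> EBM: E(x) = -log p(x) scores whole sequences.
seqs = list(itertools.product(range(V), repeat=T))
E = {x: -arm_logprob(x) for x in seqs}

# EBM -> ARM: conditionals come back by summing exp(-E) over suffixes.
def ebm_conditional(prefix):
    k = T - len(prefix) - 1
    mass = np.array([
        sum(np.exp(-E[prefix + (a,) + s])
            for s in itertools.product(range(V), repeat=k))
        for a in range(V)
    ])
    return mass / mass.sum()

for p in prefixes:
    assert np.allclose(ebm_conditional(p), cond[p])
print("round-trip exact on every prefix")
```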

What carries the argument

The explicit bijection in function space that maps the autoregressive chain-rule factorization to an energy function, shown to be identical to a special case of the soft Bellman equation.
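
The mapping is compact enough to state. The display below is our reconstruction from the abstract's description, not the paper's own notation:

```latex
% Sketch of the correspondence; notation ours, not necessarily the paper's.
\begin{align*}
  \text{ARM} \to \text{EBM}: \quad
    & E(x_{1:T}) = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t}),
      \qquad p(x_{1:T}) \propto e^{-E(x_{1:T})}, \\
  \text{EBM} \to \text{ARM}: \quad
    & V(x_{\le t}) = \log \sum_{y \in \mathcal{V}^{T-t}} e^{-E(x_{\le t}\,y)}
      \;\Longrightarrow\;
      V(x_{<t}) = \log \sum_{x_t} e^{V(x_{\le t})}
      \quad \text{(soft Bellman)}, \\
    & \log p(x_t \mid x_{<t}) = V(x_{\le t}) - V(x_{<t}).
\end{align*}
```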

If this is right

  • Supervised learning of autoregressive models is formally identical to supervised learning of the corresponding energy-based models (a numeric check follows this list).
  • Distillation of an energy-based model into an autoregressive model admits explicit theoretical error bounds.
  • Next-token prediction can recover the same lookahead behavior as optimal policies in maximum-entropy reinforcement learning.
  • The soft-Bellman correspondence lets any analysis of one model class transfer directly to the other.
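
The first bullet can be checked mechanically. A minimal sketch (ours, not the paper's experiment): under the induced energy E = −log p, the partition function is exactly 1, so the EBM objective E(x) + log Z coincides with the ARM cross-entropy term by term.

```python
# Our numeric check of the loss equivalence, not the paper's experiment:
# with E = -log p from the chain rule, the induced EBM is exactly normalized
# (log Z = 0), so the EBM loss E(x) + log Z equals the ARM cross-entropy.
import itertools
import numpy as np

rng = np.random.default_rng(1)
V, T = 3, 2
prefixes = [p for t in range(T) for p in itertools.product(range(V), repeat=t)]
cond = {p: rng.dirichlet(np.ones(V)) for p in prefixes}

def nll_arm(x):  # ARM next-token cross-entropy, summed over positions
    return -sum(np.log(cond[x[:t]][x[t]]) for t in range(T))

seqs = list(itertools.product(range(V), repeat=T))
E = {x: nll_arm(x) for x in seqs}               # induced sequence energy
log_Z = np.log(sum(np.exp(-E[x]) for x in seqs))

assert abs(log_Z) < 1e-12                       # exactly normalized
assert all(np.isclose(E[x] + log_Z, nll_arm(x)) for x in seqs)
print(f"log Z = {log_Z:.1e}; the two supervised losses agree sequence-wise")
```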

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Techniques developed for sampling from energy-based models, such as MCMC, could be repurposed to improve autoregressive decoding on tasks that require global consistency (sketched after this list).
  • Alignment methods that optimize energy-based objectives may be applied to autoregressive models without architectural change.
  • The unification suggests that chain-of-thought prompting in language models works by implicitly minimizing an energy over future tokens.
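
To make the first extension concrete, a minimal Metropolis-Hastings decoder over whole sequences might look as follows. This is an editorial sketch, not a method from the paper; `logprob_fn` is a hypothetical callable returning a sequence log-probability, and the uniform single-token proposal is ours, chosen only for simplicity.

```python
# Editorial-extension sketch: Metropolis-Hastings over whole sequences under
# the ARM-induced energy E(x) = -log p(x). `logprob_fn` is a hypothetical
# sequence scorer (any causal LM scoring function could stand in).
import numpy as np

def mh_decode(logprob_fn, x0, vocab_size, steps, rng):
    """Resample one random position per step; accept by log-prob difference."""
    x, lp = list(x0), logprob_fn(x0)
    for _ in range(steps):
        t = int(rng.integers(len(x)))
        prop = x.copy()
        prop[t] = int(rng.integers(vocab_size))  # symmetric uniform proposal
        lp_prop = logprob_fn(prop)
        if np.log(rng.random()) < lp_prop - lp:  # accept w.p. min(1, ratio)
            x, lp = prop, lp_prop
    return x, lp

# Usage with a hypothetical scorer:
# mh_decode(score_fn, [0] * 16, 50257, 1000, np.random.default_rng(0))
```

Unlike left-to-right decoding, every position can be revised after later tokens are seen, which is exactly what global-consistency tasks ask for.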

Load-bearing premise

The chain rule of probability directly produces the claimed bijection in function space with no further restrictions on model capacity, training dynamics, or the form of the energy function.

What would settle it

A worked example with a small finite vocabulary and short sequence length in which the next-token probabilities of a trained autoregressive model cannot be rearranged into a sequence-level energy function satisfying the soft Bellman optimality condition; the bijection predicts that no such example exists.
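
Such a search is mechanical on small instances. A hypothetical harness (ours; `scipy` assumed available): enumerate every prefix of a toy model, compute soft values by log-sum-exp, and measure the worst violation of the recursion. The chain-rule identity predicts a residual of zero up to floating-point error, so any materially nonzero residual would be the settling counterexample.

```python
# Hypothetical falsification harness (ours): measure the worst violation of
# the soft Bellman recursion V(x_<t) = logsumexp_{x_t} V(x_<=t) on a toy ARM.
import itertools
import numpy as np
from scipy.special import logsumexp

def soft_bellman_residual(cond, vocab, T):
    def logp(x):  # chain rule
        return sum(np.log(cond[x[:t]][x[t]]) for t in range(T))
    def value(prefix):  # log-mass of all completions of the prefix
        k = T - len(prefix)
        return logsumexp([logp(prefix + s)
                          for s in itertools.product(range(vocab), repeat=k)])
    return max(
        abs(value(p) - logsumexp([value(p + (a,)) for a in range(vocab)]))
        for t in range(T) for p in itertools.product(range(vocab), repeat=t))

rng = np.random.default_rng(2)
vocab, T = 3, 3
cond = {p: rng.dirichlet(np.ones(vocab))
        for t in range(T) for p in itertools.product(range(vocab), repeat=t)}
print(soft_bellman_residual(cond, vocab, T))  # ~1e-16, never materially > 0
```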

Figures

Figures reproduced from arXiv: 2512.15605 by Germain Vivier-Ardisson, Mathieu Blondel, Michael E. Sander, Tianlin Liu, Vincent Roulet.

Figure 1
Figure 1. Summary of mappings discussed in this paper. view at source ↗
Figure 2
Figure 2. Empirical validation of Proposition 2. Left: Minimizing the expected risk of an ARM and an EBM parameterized by causal and non-causal Transformers, respectively. Right: L∞ distance between the logits of the trained ARM and the logits of the optimal EBM, before and after applying the mapping M. Our results confirm that the EBM and ARM converge to the same minima, as predicted by Proposition 2. Perhaps more … view at source ↗
Figure 3
Figure 3. Loss convergence and logits distances for different Transformer sizes. view at source ↗
Figure 4
Figure 4. Loss convergence and logits distances for different Transformer sizes in the case T > V. view at source ↗
Figure 5
Figure 5. Comparing KL divergence and logits distance in infinity norm. view at source ↗
read the original abstract

Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims to establish an explicit bijection between autoregressive language models (ARMs) and energy-based models (EBMs) in function space by starting from the chain rule of probability, showing that this bijection corresponds to a special case of the soft Bellman equation from maximum-entropy reinforcement learning. It derives the equivalence of supervised learning objectives for ARMs and EBMs, provides theoretical error bounds on distilling EBMs into ARMs, and uses the framework to explain the lookahead capabilities of next-token prediction.

Significance. If the bijection and derivations are correct, the work supplies a clean theoretical unification of the dominant LLM paradigm with EBMs that are already known to characterize optimal policies under alignment objectives. The explicit RL link offers a principled explanation for why next-token ARMs can exhibit planning behavior, and the distillation error bounds are directly usable for practical model compression. The derivation is parameter-free and rests only on the chain rule plus the standard soft Bellman equation, which are strengths.

major comments (2)
  1. [§3] §3 (bijection derivation): the manuscript must explicitly verify that the mapping E(x) = −∑_t log p(x_t | x_<t) is bijective in function space for arbitrary joint distributions and that the resulting energy satisfies the soft Bellman equation without hidden restrictions on model capacity or the form of the energy function; this step is load-bearing for all subsequent claims.
  2. [§4] §4 (error bounds): the stated theoretical error bounds for EBM-to-ARM distillation are central to the practical contribution; the proof should be expanded to show the precise dependence on the number of distillation steps and any assumptions on the proposal distribution.
minor comments (3)
  1. [Notation] Notation for the energy function and the soft Bellman operator should be introduced once in §2 and used consistently thereafter to avoid reader confusion.
  2. [Abstract] The abstract asserts 'explicit bijection' and 'theoretical error bounds' but does not preview the key equations; adding one or two displayed equations in the abstract or introduction would improve accessibility.
  3. [Discussion] A short discussion of how the bijection behaves under finite-capacity neural-network parameterizations (as opposed to the infinite-capacity function-space case) would strengthen the bridge to practical LLMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the recommendation of minor revision. The comments help clarify the presentation of the core bijection and strengthen the error bounds. We address each major comment below.

read point-by-point responses
  1. Referee: [§3] §3 (bijection derivation): the manuscript must explicitly verify that the mapping E(x) = −∑_t log p(x_t | x_<t) is bijective in function space for arbitrary joint distributions and that the resulting energy satisfies the soft Bellman equation without hidden restrictions on model capacity or the form of the energy function; this step is load-bearing for all subsequent claims.

    Authors: We agree that an explicit verification strengthens the load-bearing step. In the revised manuscript we will insert a short lemma in §3 proving bijectivity in function space: given any joint distribution p over finite-length sequences, the chain rule yields a unique E(x) = −log p(x) = −∑_t log p(x_t | x_<t); conversely, any real-valued energy E induces a unique normalized p(x) ∝ exp(−E(x)) whose autoregressive factorization recovers the original conditionals. The same E satisfies the soft Bellman equation through the value function, V(x_<t) = log ∑_{x_t} exp(V(x_≤t)) with terminal condition V(x) = −E(x) on complete sequences and conditionals log p(x_t | x_<t) = V(x_≤t) − V(x_<t) (up to additive constants independent of x), by direct substitution of the normalization, with no restrictions on model capacity or energy functional form required; the identity holds pointwise for arbitrary positive measures. revision: yes
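
The inverse direction of the proposed lemma can likewise be checked numerically. The sketch below is our construction, not the authors' revision: start from an arbitrary energy on a finite sequence space, normalize, factorize by the chain rule, and confirm the reconstructed energy matches the original up to the additive constant log Z.

```python
# Our numeric check of the lemma's inverse direction, not the authors' code:
# an arbitrary energy on a finite sequence space, once normalized and
# re-factorized by the chain rule, is recovered up to the constant log Z.
import itertools
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(3)
vocab, T = 4, 2
seqs = list(itertools.product(range(vocab), repeat=T))
E = {x: float(rng.normal()) for x in seqs}      # arbitrary real-valued energy

log_Z = logsumexp([-E[x] for x in seqs])
logp = {x: -E[x] - log_Z for x in seqs}         # p(x) proportional to exp(-E)

def log_conditional(prefix, a):                 # chain-rule factor of p
    num = logsumexp([logp[x] for x in seqs
                     if x[:len(prefix) + 1] == prefix + (a,)])
    den = logsumexp([logp[x] for x in seqs if x[:len(prefix)] == prefix])
    return num - den

E_rec = {x: -sum(log_conditional(x[:t], x[t]) for t in range(T)) for x in seqs}
gaps = np.array([E_rec[x] - E[x] for x in seqs])
assert np.allclose(gaps, log_Z)                 # recovered up to log Z only
print("energy recovered up to the additive constant log Z")
```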

  2. Referee: [§4] §4 (error bounds): the stated theoretical error bounds for EBM-to-ARM distillation are central to the practical contribution; the proof should be expanded to show the precise dependence on the number of distillation steps and any assumptions on the proposal distribution.

    Authors: We thank the referee for this request. In the revision we will expand the proof of Theorem 4 to derive the explicit dependence of the total variation (or KL) error on the number of distillation steps K, obtaining a contraction of the form O(ρ^K) where ρ < 1 depends on the temperature and the minimal probability mass of the proposal. We will also state the standing assumptions on the proposal distribution q (full support over the sequence space and finite second moments) that are used to control the variance of the importance weights. revision: yes
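
For concreteness, the contraction the rebuttal describes would take roughly the following shape. This is our rendering; C, ρ, and τ are placeholders, since the actual constants of Theorem 4 are not reproduced on this page.

```latex
% Hypothetical shape of the expanded distillation bound; the constants C and
% rho(tau, min_x q(x)) are placeholders, not the paper's actual Theorem 4.
\[
  \mathrm{KL}\!\left(p_{\mathrm{EBM}} \,\middle\|\, p_{\mathrm{ARM}}^{(K)}\right)
  \;\le\; C\,\rho^{K},
  \qquad \rho = \rho\!\left(\tau,\ \min_{x} q(x)\right) < 1,
\]
where $q$ has full support over the sequence space and the importance
weights have finite second moments.
```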

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from chain rule

full rationale

The paper starts from the standard chain rule of probability to define an explicit bijection in function space between the autoregressive factorization p(x) = ∏_t p(x_t | x_<t) and an energy function E(x) = −∑_t log p(x_t | x_<t). This is shown to satisfy a special case of the soft Bellman equation by sequential decomposition of the log-probability, which is a direct algebraic identity holding for any joint distribution. No step renames a fitted parameter as a prediction, imports uniqueness via self-citation, or defines the target result in terms of itself. The RL link follows from established maximum-entropy RL without requiring the present paper's result as an assumption. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two standard background results with no free parameters or newly invented entities.

axioms (2)
  • standard math Chain rule of probability
    Used as the explicit starting point to construct the bijection between ARMs and EBMs.
  • domain assumption Soft Bellman equation in maximum-entropy RL
    The bijection is asserted to be a special case of this established equation.

pith-pipeline@v0.9.0 · 5478 in / 1308 out tokens · 38188 ms · 2026-05-16T21:25:35.218166+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  2. Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

    cs.AI 2026-04 unverdicted novelty 4.0

    Ontology grounding improves accuracy and role consistency of enterprise LLM agents, with larger gains in domains poorly covered by training data.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 2 Pith papers · 5 internal anchors

  1. [2]

     A learning algorithm for Boltzmann machines

     Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985

  2. [3]

     Better estimation of the KL divergence between language models

     Amini, A., Vieira, T., and Cotterell, R. Better estimation of the KL divergence between language models. In Advances in Neural Information Processing Systems, 2025

  3. [4]

     Planning by probabilistic inference

     Attias, H. Planning by probabilistic inference. In International Workshop on Artificial Intelligence and Statistics, pp. 9–16. PMLR, 2003

  4. [5]

     Dynamic policy programming

     Azar, M. G., Gómez, V., and Kappen, H. J. Dynamic policy programming. The Journal of Machine Learning Research, 13(1):3207–3245, 2012

  5. [6]

     The pitfalls of next-token prediction

     Bachmann, G. and Nagarajan, V. The pitfalls of next-token prediction. In Proceedings of the International Conference on Machine Learning. PMLR, 2024

  6. [7]

     Flow network based generative models for non-iterative diverse candidate generation

     Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. Flow network based generative models for non-iterative diverse candidate generation. Advances in Neural Information Processing Systems, 34:27381–27394, 2021

  7. [8]

     GFlowNet foundations

     Bengio, Y., Lahlou, S., Deleu, T., Hu, E. J., Tiwari, M., and Bengio, E. GFlowNet foundations. Journal of Machine Learning Research, 24(210):1–55, 2023

  8. [9]

     The Elements of Differentiable Programming

     Blondel, M. and Roulet, V. The Elements of Differentiable Programming. arXiv preprint arXiv:2403.14606, 2024

  9. [10]

     Approximate inference in discrete distributions with Monte Carlo tree search and value functions

     Buesing, L., Heess, N., and Weber, T. Approximate inference in discrete distributions with Monte Carlo tree search and value functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 624–634. PMLR, 2020

  10. [11]

     ShiQ: Bringing back Bellman to LLMs

     Clavier, P., Grinsztajn, N., Avalos, R., Flet-Berliac, Y., Ergun, I., Domingues, O. D., Tarassov, E., Pietquin, O., Richemond, P. H., Strub, F., et al. ShiQ: Bringing back Bellman to LLMs. arXiv preprint arXiv:2505.11081, 2025

  11. [12]

     Training verifiers to solve math word problems

     Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  12. [13]

     Formal aspects of language modeling

     Cotterell, R., Svete, A., Meister, C., Liu, T., and Du, L. Formal aspects of language modeling. arXiv preprint arXiv:2311.04329, 2023

  13. [14]

     BERT: Pre-training of deep bidirectional transformers for language understanding

     Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019

  14. [15]

     Introduction to Natural Language Processing

     Eisenstein, J. Introduction to Natural Language Processing. The MIT Press, 2019

  15. [16]

     The mystery of the pathological path-star task for language models

     Frydenlund, A. The mystery of the pathological path-star task for language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024

  16. [17]

     Transformers are universal in-context learners

     Furuya, T., de Hoop, M. V., and Peyré, G. Transformers are universal in-context learners. In Proceedings of the International Conference on Learning Representations, 2025

  17. [18]

     A theory of regularized Markov Decision Processes

     Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov Decision Processes. In Proceedings of the International Conference on Machine Learning, 2019

  18. [19]

     Aligning language models with preferences through f-divergence minimization

     Go, D., Korbak, T., Kruszewski, G., Rozen, J., Ryu, N., and Dymetman, M. Aligning language models with preferences through f-divergence minimization. In Proceedings of the International Conference on Machine Learning, 2023

  19. [20]

     Reinforcement learning with deep energy-based policies

     Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning, 2017

  20. [21]

     Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor

     Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, pp. 1861–1870, 2018

  21. [22]

     Flax: A neural network library and ecosystem for JAX

     Heek, J., Levskaya, A., Oliver, A., Ritter, M., Rondepierre, B., Steiner, A., and van Zee, M. Flax: A neural network library and ecosystem for JAX, 2024. URL http://github.com/google/flax

  22. [23]

     Generative adversarial imitation learning

     Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 2016

  23. [24]

     Adam: a method for stochastic optimization

     Kingma, D. P. and Ba, J. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015

  24. [25]

     RL with KL penalties is better viewed as Bayesian inference

     Korbak, T., Perez, E., and Buckley, C. RL with KL penalties is better viewed as Bayesian inference. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 1083–1091, 2022

  25. [26]

     A tutorial on energy-based models

     LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. J. A tutorial on energy-based models. In Predicting Structured Data. The MIT Press, 2007

  26. [27]

     Reinforcement learning and control as probabilistic inference: tutorial and review

     Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018

  27. [28]

     Trajectory balance: Improved credit assignment in GFlowNets

     Malkin, N., Jain, M., Bengio, E., Sun, C., and Bengio, Y. Trajectory balance: Improved credit assignment in GFlowNets. In Advances in Neural Information Processing Systems, 2022

  28. [29]

     Differentiable dynamic programming for structured prediction and attention

     Mensch, A. and Blondel, M. Differentiable dynamic programming for structured prediction and attention. In Proceedings of the International Conference on Machine Learning, pp. 3462–3471. PMLR, 2018

  29. [30]

     Bridging the gap between value and policy based reinforcement learning

     Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, 2017

  30. [31]

     A unified view of entropy-regularized Markov Decision Processes

     Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized Markov Decision Processes. In Advances in Neural Information Processing Systems, 2017

  31. [32]

     Combining policy gradient and Q-learning

     O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. Combining policy gradient and Q-learning. In Proceedings of the International Conference on Learning Representations, 2017

  32. [33]

     Training language models to follow instructions with human feedback

     Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022

  33. [34]

     A short variational proof of equivalence between policy gradients and soft Q learning

     Richemond, P. H. and Maginnis, B. A short variational proof of equivalence between policy gradients and soft Q learning. arXiv preprint arXiv:1712.08650, 2017

  34. [35]

     Offline regularised reinforcement learning for large language models alignment

     Richemond, P. H., Tang, Y., Guo, D., Calandriello, D., Azar, M. G., Rafailov, R., Pires, B. A., Tarassov, E., Spangher, L., Ellsworth, W., et al. Offline regularised reinforcement learning for large language models alignment. arXiv preprint arXiv:2405.19107, 2024

  35. [36]

     Loss functions and operators generated by f-divergences

     Roulet, V., Liu, T., Vieillard, N., Sander, M. E., and Blondel, M. Loss functions and operators generated by f-divergences. In Proceedings of the International Conference on Machine Learning, 2025

  36. [37]

     Joint learning of energy-based models and their partition function

     Sander, M. E., Roulet, V., Liu, T., and Blondel, M. Joint learning of energy-based models and their partition function. In Proceedings of the International Conference on Machine Learning, 2025

  37. [38]

     Equivalence between policy gradients and soft Q-learning

     Schulman, J., Chen, X., and Abbeel, P. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017

  38. [39]

     How to train your energy-based models

     Song, Y. and Kingma, D. P. How to train your energy-based models. arXiv preprint arXiv:2101.03288, 2021

  39. [40]

     Learning to summarize with human feedback

     Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, 2020

  40. [41]

     Generative flow networks as entropy-regularized RL

     Tiapkin, D., Morozov, N., Naumov, A., and Vetrov, D. P. Generative flow networks as entropy-regularized RL. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 4213–4221. PMLR, 2024

  41. [42]

     General duality between optimal control and estimation

     Todorov, E. General duality between optimal control and estimation. In Proceedings of the IEEE Conference on Decision and Control, 2008

  42. [43]

     Probabilistic inference as a model of planned behavior

     Toussaint, M. et al. Probabilistic inference as a model of planned behavior. Künstliche Intelligenz, 23(3):23–29, 2009

  43. [44]

     Attention is all you need

     Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

  44. [45]

     Graphical models, exponential families, and variational inference

     Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008

  45. [46]

     Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints

     Wang, C., Jiang, Y., Yang, C., Liu, H., and Chen, Y. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. In Proceedings of the International Conference on Learning Representations, 2024

  46. [47]

     Probabilistic inference in language models via twisted sequential Monte Carlo

     Zhao, S., Brekelmans, R., Makhzani, A., and Grosse, R. Probabilistic inference in language models via twisted sequential Monte Carlo. In Proceedings of the International Conference on Machine Learning, 2024

  47. [48]

     Modeling purposeful adaptive behavior with the principle of maximum causal entropy

     Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010

  48. [49]

     Maximum entropy inverse reinforcement learning

     Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K., et al. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2008

  49. [50]

     Fine-tuning language models from human preferences

     Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019