
arxiv: 2412.02125 · v2 · submitted 2024-12-03 · 💻 cs.AI · cs.LG

Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

Pith reviewed 2026-05-23 08:17 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords preference goal tuning · latent control · frozen policy · goal-conditioned policies · post-training adaptation · Minecraft SkillForge · trajectory preference objective · out-of-distribution generalization

The pith

Optimizing only a latent goal embedding lets a frozen policy match task preferences without any parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes post-training of goal-conditioned policies as a latent control problem in which the goal embedding acts as the sole adjustable variable. Preference Goal Tuning optimizes this embedding with a trajectory-level preference objective so that the frozen policy produces more of the desired behaviors and fewer of the undesired ones. On the Minecraft SkillForge benchmark the method improves over expert prompts and, by keeping the policy weights untouched, delivers stronger out-of-distribution performance than full fine-tuning. The separation of task alignment from physical dynamics is presented as the source of the observed robustness.

Core claim

Preference Goal Tuning keeps the policy frozen and updates only the latent goal embedding using a trajectory-level preference objective, achieving average relative improvements of 72.0% and 81.6% on two foundation policies across 17 Minecraft tasks while surpassing full fine-tuning by 13.4% in out-of-distribution settings.
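
The headline percentages are relative gains, so the arithmetic behind them is worth spelling out. The sketch below assumes the average is taken over per-task relative improvements, (tuned - baseline) / baseline. The abstract does not state the exact averaging convention, and the success rates used here are made-up illustration values, not results from the paper.

```python
from statistics import mean


def relative_improvement(baseline: float, tuned: float) -> float:
    """Relative gain of `tuned` over `baseline`; 0.5 means +50%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative gain")
    return (tuned - baseline) / baseline


def average_relative_improvement(baselines, tuned_scores) -> float:
    """Mean of per-task relative gains (an assumed convention)."""
    return mean(relative_improvement(b, t) for b, t in zip(baselines, tuned_scores))


# Hypothetical per-task success rates, for illustration only.
prompt_scores = [0.50, 0.20]
tuned_scores = [0.75, 0.40]
avg_gain = average_relative_improvement(prompt_scores, tuned_scores)
```

Under this convention a task with a low baseline can dominate the average, which is one reason per-task numbers and variance estimates matter when reading a single aggregate figure.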

What carries the argument

The latent goal embedding, used as a continuous control variable that is optimized by a trajectory-level preference objective while the policy parameters stay frozen.
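
A minimal sketch of this mechanism, under loudly stated assumptions: the frozen policy is a toy softmax over three actions, the preference objective is a DPO-style pairwise loss that uses the initial embedding as its reference, and the update is plain gradient descent on the embedding. None of these details are confirmed by the abstract; `W`, `B`, `tune_goal`, and the trajectories are hypothetical names and data chosen for illustration.

```python
import math

# Frozen toy policy: logits[a] = dot(W[a], g) + B[s][a].
# W and B stand in for frozen policy weights and are never modified.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 actions, 2-dim goal embedding
B = [[0.0, 0.1, 0.0], [0.2, 0.0, 0.0]]      # 2 states, 3 actions


def action_probs(s, g):
    """Softmax action distribution of the frozen policy in state s under goal g."""
    logits = [sum(w * x for w, x in zip(W[a], g)) + B[s][a] for a in range(len(W))]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def traj_logp(traj, g):
    """Log-likelihood of a trajectory [(state, action), ...] under goal g."""
    return sum(math.log(action_probs(s, g)[a]) for s, a in traj)


def traj_logp_grad(traj, g):
    """Gradient of traj_logp with respect to g (policy weights stay fixed)."""
    grad = [0.0] * len(g)
    for s, a in traj:
        p = action_probs(s, g)
        for j in range(len(g)):
            expected = sum(p[k] * W[k][j] for k in range(len(W)))
            grad[j] += W[a][j] - expected
    return grad


def tune_goal(g0, preferred, rejected, beta=0.5, lr=0.5, steps=200):
    """Gradient descent on -log(sigmoid(margin)), updating only the embedding."""
    g = list(g0)
    ref = traj_logp(preferred, g0) - traj_logp(rejected, g0)
    for _ in range(steps):
        margin = beta * (traj_logp(preferred, g) - traj_logp(rejected, g) - ref)
        weight = 1.0 / (1.0 + math.exp(margin))  # sigmoid(-margin)
        gp = traj_logp_grad(preferred, g)
        gn = traj_logp_grad(rejected, g)
        g = [gj + lr * beta * weight * (p - n) for gj, p, n in zip(g, gp, gn)]
    return g


# Hypothetical preference pair: action 0 is preferred, action 1 is not.
preferred = [(0, 0), (1, 0)]
rejected = [(0, 1), (1, 1)]
g_init = [0.0, 0.0]
g_tuned = tune_goal(g_init, preferred, rejected)
```

In this toy setting the tuned embedding raises the likelihood of the preferred trajectory relative to the rejected one while `W` and `B` stay untouched, which is the separation between conditioning input and policy weights that the argument relies on.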

If this is right

  • PGT needs only minimal data to adapt a policy to new task preferences.
  • The same frozen policy can be reused across many tasks by storing different learned goal embeddings.
  • Out-of-distribution robustness exceeds that of standard fine-tuning on the Minecraft benchmark.
  • Expert-crafted text prompts are consistently outperformed by the optimized latent goals across the reported tasks.
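
The reuse point in the list above can be sketched as a small registry that pairs one frozen policy with many stored embeddings. `GoalBank`, `toy_policy`, and the task names are hypothetical and exist only for illustration; a real goal-conditioned policy would take richer observations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

# A frozen policy maps (observation, goal embedding) to an action index.
Policy = Callable[[Sequence[float], Sequence[float]], int]


@dataclass
class GoalBank:
    """One shared frozen policy plus one learned goal embedding per task."""
    policy: Policy
    goals: Dict[str, List[float]] = field(default_factory=dict)

    def register(self, task: str, goal: Sequence[float]) -> None:
        self.goals[task] = list(goal)

    def act(self, task: str, observation: Sequence[float]) -> int:
        return self.policy(observation, self.goals[task])


def toy_policy(observation, goal):
    """Hypothetical frozen policy: pick the index maximizing obs[i] + goal[i]."""
    scores = [o + g for o, g in zip(observation, goal)]
    return scores.index(max(scores))


bank = GoalBank(policy=toy_policy)
bank.register("chop_tree", [1.0, 0.0])   # hypothetical task names
bank.register("mine_stone", [0.0, 1.0])
```

Switching tasks then costs a dictionary lookup rather than a new set of policy weights, so storage grows with the embedding size and not with the policy size.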

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Storing multiple goal embeddings could let one policy serve many distinct preference alignments without retraining.
  • If the preference signal comes from human feedback or demonstrations, the method could lower the cost of adapting large agents in robotics or games.
  • The approach might be tested on other goal-conditioned models whose embeddings can be treated as continuous controls.
  • Whether performance holds when preferences become more complex or when the frozen policy is much larger is not addressed.

Load-bearing premise

That changing only the goal embedding can sufficiently alter the trajectory distribution induced by the frozen policy to satisfy arbitrary task preferences.

What would settle it

Run the identical out-of-distribution tasks with full fine-tuning given exactly the same preference data and training budget as PGT, then check whether the 13.4% performance gap remains or reverses.
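
One way to check whether such a gap "remains or reverses" is a paired bootstrap over matched runs. The sketch below is an assumption about how the comparison could be scored, not the paper's protocol; the scores are invented, and `bootstrap_gap_ci` is a hypothetical helper.

```python
import random
from statistics import mean


def bootstrap_gap_ci(pgt_scores, ft_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (PGT - fine-tuning)."""
    if len(pgt_scores) != len(ft_scores):
        raise ValueError("scores must be paired per task or seed")
    diffs = [p - f for p, f in zip(pgt_scores, ft_scores)]
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Hypothetical OOD success rates under identical data and training budget.
pgt = [0.62, 0.55, 0.71, 0.48]
full_ft = [0.50, 0.51, 0.60, 0.45]
gap, lower, upper = bootstrap_gap_ci(pgt, full_ft)
```

If the interval excludes zero on the positive side the gap survives the matched-budget rerun; an interval that straddles or falls below zero would indicate that the advantage came from the comparison setup rather than from freezing the policy.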

Original abstract

Goal-conditioned policies enable decision-making models to execute diverse behaviors based on specified goals, yet their downstream performance is often highly sensitive to the choice of instructions or prompts. To bypass the limitations of discrete text prompts, we formulate post-training adaptation as a latent control problem, where the goal embedding serves as a continuous control variable to modulate the behavior of a frozen policy. We propose Preference Goal Tuning (PGT), a framework that optimizes this latent control variable to align the induced trajectory distribution with task preferences. Unlike standard fine-tuning that updates policy parameters, PGT keeps the policy frozen and updates only the latent goal using a trajectory-level preference objective. This approach essentially searches for the optimal conditioning input that maximizes the likelihood of preferred behaviors while suppressing undesirable ones. We evaluate PGT on the Minecraft SkillForge benchmark across 17 tasks. With minimal data, PGT achieves average relative improvements of 72.0% and 81.6% on two foundation policies, consistently outperforming expert-crafted prompts. Crucially, by decoupling task alignment (latent goal) from physical dynamics (frozen policy), PGT surpasses full fine-tuning by 13.4% in out-of-distribution settings, demonstrating superior robustness and generalization.
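
The abstract describes the objective only in words. One way to write it down, assuming a DPO-style pairwise loss that uses the initial goal embedding $g_0$ as its reference (an assumption made here for illustration; the abstract does not give the exact form), is:

```latex
\mathcal{L}(g) = -\,\mathbb{E}_{(\tau^{+},\,\tau^{-})}
\left[ \log \sigma\!\left( \beta \left[
\log \frac{\pi_\theta(\tau^{+} \mid g)}{\pi_\theta(\tau^{+} \mid g_0)}
- \log \frac{\pi_\theta(\tau^{-} \mid g)}{\pi_\theta(\tau^{-} \mid g_0)}
\right] \right) \right],
\qquad
\log \pi_\theta(\tau \mid g) = \sum_{t} \log \pi_\theta(a_t \mid s_t, g)
```

Here $\theta$ denotes the frozen policy parameters, $g$ is the only optimized variable, $\tau^{+}$ and $\tau^{-}$ are preferred and rejected trajectories, $\sigma$ is the logistic function, and $\beta$ is a hypothetical temperature. Minimizing $\mathcal{L}$ over $g$ raises the likelihood of preferred trajectories relative to rejected ones, which matches the verbal description of maximizing preferred behaviors while suppressing undesirable ones.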

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Preference Goal Tuning (PGT), a post-training method that formulates adaptation of goal-conditioned policies as optimization of a continuous latent goal embedding (the control variable) while keeping the policy parameters frozen. A trajectory-level preference objective is used to align the induced distribution with task preferences. On the Minecraft SkillForge benchmark across 17 tasks, PGT reports average relative improvements of 72.0% and 81.6% over two foundation policies, outperforming expert prompts, and a 13.4% gain over full fine-tuning in out-of-distribution settings.

Significance. If the central empirical claims hold with proper verification, PGT would demonstrate that latent control via preference optimization over goal embeddings can yield more robust generalization than parameter updates, offering an efficient alternative for adapting frozen policies without modifying their weights.

major comments (3)
  1. [Abstract] Abstract: the concrete relative improvement figures (72.0%, 81.6%, and the 13.4% OOD gain over full fine-tuning) are stated without any accompanying experimental protocol, number of runs, variance estimates, statistical tests, or ablation results, leaving the load-bearing empirical claim only partially supported.
  2. [Method] Method (latent goal optimization): the claim that a trajectory-level preference objective applied solely to the goal embedding is sufficient to modulate the frozen policy's induced distribution rests on the unverified assumption that the embedding space contains points producing preferred behaviors and that the preference model can distinguish nearby embeddings; no analysis, controllability test, or coverage argument is provided to establish this property.
  3. [Experiments] Experiments: the OOD superiority claim (13.4% over full fine-tuning) is measured on held-out tasks, but the manuscript provides no comparison of how the preference objective is evaluated or optimized across in-distribution vs. OOD regimes, nor any diagnostic showing that the selected goal embeddings actually alter behavior as intended rather than selecting from the original training distribution.
minor comments (2)
  1. [Method] Notation for the preference objective and goal embedding update rule should be introduced with explicit equations rather than prose descriptions.
  2. [Experiments] The Minecraft SkillForge benchmark tasks and the two foundation policies should be referenced with citations or a table of task definitions.
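
The controllability test requested in major comment 2 could take a form like the probe below: perturb the goal embedding slightly and measure how far the action distribution moves. This is a hedged sketch, not the authors' diagnostic; `controllability_probe`, both toy policies, and the perturbation radius are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def controllability_probe(action_probs, states, goal, radius=0.5, n_probes=50, seed=0):
    """Mean KL between action distributions at `goal` and at random nearby goals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        nearby = [g + rng.uniform(-radius, radius) for g in goal]
        for s in states:
            total += kl_divergence(action_probs(s, goal), action_probs(s, nearby))
    return total / (n_probes * len(states))


def responsive_policy(state, goal):
    """Toy frozen policy whose action distribution depends on the goal."""
    return softmax([state + goal[0], goal[1], 0.0])


def deaf_policy(state, goal):
    """Toy frozen policy that ignores the goal entirely."""
    return softmax([state, 0.0, 0.0])


states = [0.0, 1.0]
goal = [0.0, 0.0]
responsive_score = controllability_probe(responsive_policy, states, goal)
deaf_score = controllability_probe(deaf_policy, states, goal)
```

A score near zero means the embedding has little leverage over behavior near that point, in which case no preference objective applied to the embedding alone could be expected to help.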

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We provide point-by-point responses below and will make revisions to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the concrete relative improvement figures (72.0%, 81.6%, and the 13.4% OOD gain over full fine-tuning) are stated without any accompanying experimental protocol, number of runs, variance estimates, statistical tests, or ablation results, leaving the load-bearing empirical claim only partially supported.

    Authors: We agree that the abstract would be strengthened by including a reference to the experimental protocol. In the revised manuscript, we will update the abstract to briefly describe the evaluation setup, number of runs, and note the presence of variance estimates and statistical tests in the main body. revision: yes

  2. Referee: [Method] Method (latent goal optimization): the claim that a trajectory-level preference objective applied solely to the goal embedding is sufficient to modulate the frozen policy's induced distribution rests on the unverified assumption that the embedding space contains points producing preferred behaviors and that the preference model can distinguish nearby embeddings; no analysis, controllability test, or coverage argument is provided to establish this property.

    Authors: The empirical results on 17 tasks demonstrate the effectiveness, but we acknowledge the lack of explicit controllability analysis. We will add a controllability test and coverage argument in the method section of the revision. revision: yes

  3. Referee: [Experiments] Experiments: the OOD superiority claim (13.4% over full fine-tuning) is measured on held-out tasks, but the manuscript provides no comparison of how the preference objective is evaluated or optimized across in-distribution vs. OOD regimes, nor any diagnostic showing that the selected goal embeddings actually alter behavior as intended rather than selecting from the original training distribution.

    Authors: We will revise the experiments section to include a direct comparison of the preference objective evaluation between ID and OOD regimes, as well as additional diagnostics such as behavior alteration visualizations to confirm the embeddings induce intended changes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmark evaluations rather than internal reductions.

Full rationale

The paper formulates PGT as optimizing a latent goal embedding via a trajectory-level preference objective while keeping the policy frozen, then reports empirical gains (e.g., a 13.4% out-of-distribution improvement over full fine-tuning) on the independent Minecraft SkillForge benchmark across 17 tasks. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains; the controllability assumption is tested via external performance metrics rather than being presupposed in the derivation. This is the standard case of a self-contained empirical method whose validity is assessed outside its own fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the latent goal space is expressive enough to steer behavior without policy updates; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • Domain assumption: the behavior of a goal-conditioned policy can be modulated to arbitrary preferred trajectory distributions solely by optimizing its conditioning input while parameters remain fixed.
    This premise is required for the latent-control formulation to replace parameter updates.

