Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

Guangyu Zhao, Kewei Lian, Haoxuan Ru, Borong Zhang, Haowei Lin, Zhancun Mu, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang

arXiv: 2412.02125 · v2 · submitted 2024-12-03 · cs.AI · cs.LG

Pith reviewed 2026-05-23 08:17 UTC · model grok-4.3

keywords: preference goal tuning · latent control · frozen policy · goal-conditioned policies · post-training adaptation · Minecraft SkillForge · trajectory preference objective · out-of-distribution generalization

The pith

Optimizing only a latent goal embedding lets a frozen policy match task preferences without any parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes post-training of goal-conditioned policies as a latent control problem in which the goal embedding acts as the sole adjustable variable. Preference Goal Tuning optimizes this embedding with a trajectory-level preference objective so that the frozen policy produces more of the desired behaviors and fewer of the undesired ones. On the Minecraft SkillForge benchmark the method improves over expert prompts and, by keeping the policy weights untouched, delivers stronger out-of-distribution performance than full fine-tuning. The separation of task alignment from physical dynamics is presented as the source of the observed robustness.

Core claim

Preference Goal Tuning keeps the policy frozen and updates only the latent goal embedding using a trajectory-level preference objective, achieving average relative improvements of 72.0% and 81.6% on two foundation policies across 17 Minecraft tasks while surpassing full fine-tuning by 13.4% in out-of-distribution settings.
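
As a reading aid for the percentages above: a relative improvement is conventionally the gain divided by the baseline score. The sketch below assumes that definition and uses made-up success rates. The source does not state how per-task gains are aggregated, so two common choices are shown side by side; the function name and the numbers are illustrative only.

```python
from statistics import mean


def relative_improvement(baseline: float, tuned: float) -> float:
    """Gain of `tuned` over `baseline`, expressed as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (tuned - baseline) / baseline


# Hypothetical (baseline, tuned) success rates for two tasks; not from the paper.
scores = [(0.2, 0.4), (0.8, 0.88)]

# Aggregation 1: average the per-task relative improvements.
mean_of_ratios = mean(relative_improvement(b, t) for b, t in scores)

# Aggregation 2: relative improvement of the average scores.
ratio_of_means = relative_improvement(
    mean(b for b, _ in scores), mean(t for _, t in scores)
)

print(f"mean of ratios: {mean_of_ratios:.1%}")
print(f"ratio of means: {ratio_of_means:.1%}")
```

With these made-up numbers the two aggregations differ substantially (55.0% versus 28.0%), which is why the aggregation rule matters when reading an "average relative improvement".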

What carries the argument

The latent goal embedding, used as a continuous control variable that is optimized by a trajectory-level preference objective while the policy parameters stay frozen.
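
A minimal sketch of the idea, under loudly stated assumptions: the frozen policy is a toy linear-softmax model, the preference objective is a DPO-style loss with the initial goal embedding as reference, and the preference pairs are synthetic. None of these choices is taken from the paper; the point is only that the gradient step touches the goal vector and leaves the policy weights untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, GOAL_DIM, N_ACTIONS = 4, 3, 5

# Frozen toy policy: a hypothetical stand-in for a pretrained goal-conditioned
# policy. These weights are never updated below.
W_s = rng.normal(size=(N_ACTIONS, STATE_DIM))
W_g = rng.normal(size=(N_ACTIONS, GOAL_DIM))
W_g_before = W_g.copy()


def action_probs(state, goal):
    """Softmax action distribution pi(. | state, goal) of the frozen policy."""
    logits = W_s @ state + W_g @ goal
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def traj_logp_and_grad(traj, goal):
    """Log-likelihood of a trajectory's actions and its gradient w.r.t. the goal."""
    logp, grad = 0.0, np.zeros(GOAL_DIM)
    for state, action in traj:
        probs = action_probs(state, goal)
        logp += np.log(probs[action])
        onehot = np.eye(N_ACTIONS)[action]
        grad += W_g.T @ (onehot - probs)  # d log softmax / d goal
    return logp, grad


def preference_loss_and_grad(goal, goal_ref, preferred, rejected, beta=0.5):
    """DPO-style loss on one (preferred, rejected) pair; gradient is w.r.t. goal only."""
    lp_pos, g_pos = traj_logp_and_grad(preferred, goal)
    lp_neg, g_neg = traj_logp_and_grad(rejected, goal)
    ref_pos, _ = traj_logp_and_grad(preferred, goal_ref)
    ref_neg, _ = traj_logp_and_grad(rejected, goal_ref)
    margin = beta * ((lp_pos - ref_pos) - (lp_neg - ref_neg))
    loss = np.logaddexp(0.0, -margin)            # -log sigmoid(margin)
    weight = np.exp(-np.logaddexp(0.0, margin))  # sigmoid(-margin)
    return loss, -weight * beta * (g_pos - g_neg)


def mean_loss(goal, goal_ref, pairs):
    losses = [preference_loss_and_grad(goal, goal_ref, p, r)[0] for p, r in pairs]
    return float(np.mean(losses))


def tune_goal(goal_init, pairs, steps=100, lr=0.1):
    """Gradient descent on the goal embedding; the policy weights stay fixed."""
    goal = goal_init.copy()
    for _ in range(steps):
        grads = [preference_loss_and_grad(goal, goal_init, p, r)[1] for p, r in pairs]
        goal -= lr * np.mean(grads, axis=0)
    return goal


# Synthetic preference data: at the same states, action 0 is preferred over action 1.
pairs = []
for _ in range(8):
    states = rng.normal(size=(5, STATE_DIM))
    pairs.append(([(s, 0) for s in states], [(s, 1) for s in states]))

goal_init = rng.normal(size=GOAL_DIM)  # stands in for an initial prompt embedding
initial_loss = mean_loss(goal_init, goal_init, pairs)
tuned_goal = tune_goal(goal_init, pairs)
final_loss = mean_loss(tuned_goal, goal_init, pairs)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

In this toy setting the loss starts at log 2 (the tuned and reference goals coincide) and falls as the goal moves toward embeddings that favor the preferred actions. Whether a real policy's embedding space is this steerable is exactly the premise the review flags as load-bearing.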

If this is right

PGT needs only minimal data to adapt a policy to new task preferences.
The same frozen policy can be reused across many tasks by storing different learned goal embeddings.
Out-of-distribution robustness exceeds that of standard fine-tuning on the Minecraft benchmark.
Optimized latent goals consistently outperform expert-crafted text prompts across the reported tasks.
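
The reuse pattern in the second point can be pictured as a small lookup table from task name to goal vector. The sketch below is illustrative only: `GoalLibrary`, `frozen_policy`, the task names, and the embedding values are all hypothetical placeholders, not artifacts from the paper.

```python
import numpy as np


class GoalLibrary:
    """Stores one learned goal embedding per task for a single frozen policy."""

    def __init__(self, embedding_dim: int):
        self.embedding_dim = embedding_dim
        self._goals: dict[str, np.ndarray] = {}

    def store(self, task: str, goal) -> None:
        goal = np.asarray(goal, dtype=np.float64)
        if goal.shape != (self.embedding_dim,):
            raise ValueError(f"expected shape ({self.embedding_dim},), got {goal.shape}")
        self._goals[task] = goal.copy()

    def get(self, task: str) -> np.ndarray:
        return self._goals[task].copy()


# Hypothetical frozen policy: fixed weights, behavior changes only via `goal`.
WEIGHTS = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])


def frozen_policy(goal: np.ndarray) -> int:
    """Return the index of the highest-scoring action for this goal."""
    return int(np.argmax(WEIGHTS @ goal))


library = GoalLibrary(embedding_dim=2)
library.store("task_a", [2.0, 0.5])  # placeholder embeddings, not learned values
library.store("task_b", [0.1, 3.0])

actions = {task: frozen_policy(library.get(task)) for task in ("task_a", "task_b")}
print(actions)
```

Switching tasks then costs one dictionary lookup rather than a weight swap, which is the storage argument the bullet is making.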

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Storing multiple goal embeddings could let one policy serve many distinct preference alignments without retraining.
If the preference signal comes from human feedback or demonstrations, the method could lower the cost of adapting large agents in robotics or games.
The approach might be tested on other goal-conditioned models whose embeddings can be treated as continuous controls.
Whether performance holds when preferences become more complex or when the frozen policy is much larger is not addressed.

Load-bearing premise

That changing only the goal embedding can sufficiently alter the trajectory distribution induced by the frozen policy to satisfy arbitrary task preferences.

What would settle it

Run the identical out-of-distribution tasks with full fine-tuning given exactly the same preference data and training budget as PGT, then check whether the 13.4% performance gap remains or reverses.
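
One way such a check could be scored, sketched under assumptions: per-task success rates for both methods are collected under a matched budget, the gap is the relative difference of mean scores, and uncertainty comes from a paired percentile bootstrap over tasks. The scores below are invented, and the paper's own evaluation protocol may differ.

```python
import numpy as np


def relative_gap(method_scores, baseline_scores):
    """Relative gap of the mean method score over the mean baseline score."""
    return float(np.mean(method_scores) / np.mean(baseline_scores) - 1.0)


def paired_bootstrap_ci(method_scores, baseline_scores,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the relative gap, resampling tasks in pairs."""
    method = np.asarray(method_scores, dtype=float)
    baseline = np.asarray(baseline_scores, dtype=float)
    rng = np.random.default_rng(seed)
    gaps = np.empty(n_resamples)
    for i in range(n_resamples):
        # Draw task indices with replacement; both methods share the same draw.
        idx = rng.integers(0, len(method), size=len(method))
        gaps[i] = relative_gap(method[idx], baseline[idx])
    low, high = np.quantile(gaps, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Hypothetical per-task OOD success rates under a matched data and compute budget.
pgt_scores = [0.6, 0.5, 0.7, 0.4]
finetune_scores = [0.5, 0.5, 0.6, 0.4]

point = relative_gap(pgt_scores, finetune_scores)
low, high = paired_bootstrap_ci(pgt_scores, finetune_scores)
print(f"gap: {point:.1%}, 95% interval: [{low:.1%}, {high:.1%}]")
```

If the interval for the matched-budget rerun straddles zero, the reported gap would not survive the test described above.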

Original abstract

Goal-conditioned policies enable decision-making models to execute diverse behaviors based on specified goals, yet their downstream performance is often highly sensitive to the choice of instructions or prompts. To bypass the limitations of discrete text prompts, we formulate post-training adaptation as a latent control problem, where the goal embedding serves as a continuous control variable to modulate the behavior of a frozen policy. We propose Preference Goal Tuning (PGT), a framework that optimizes this latent control variable to align the induced trajectory distribution with task preferences. Unlike standard fine-tuning that updates policy parameters, PGT keeps the policy frozen and updates only the latent goal using a trajectory-level preference objective. This approach essentially searches for the optimal conditioning input that maximizes the likelihood of preferred behaviors while suppressing undesirable ones. We evaluate PGT on the Minecraft SkillForge benchmark across 17 tasks. With minimal data, PGT achieves average relative improvements of 72.0% and 81.6% on two foundation policies, consistently outperforming expert-crafted prompts. Crucially, by decoupling task alignment (latent goal) from physical dynamics (frozen policy), PGT surpasses full fine-tuning by 13.4% in out-of-distribution settings, demonstrating superior robustness and generalization.
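
The abstract describes the objective in words only. One plausible way to write it, assuming a DPO-style Bradley-Terry form with the initial prompt embedding $g_0$ as reference, is given here. The symbols $g$, $g_0$, $\beta$, $\tau^{+}$, $\tau^{-}$ and the exact functional form are illustrative assumptions, not taken from the paper. Environment transition terms cancel in the likelihood ratios, so only action log-probabilities remain.

```latex
\log \pi_\theta(\tau \mid g) = \sum_{t} \log \pi_\theta(a_t \mid s_t, g),
\qquad \theta \text{ fixed}

g^{*} = \arg\min_{g} \; -\,\mathbb{E}_{(\tau^{+},\,\tau^{-})}
  \log \sigma\!\Big( \beta \Big[
    \log \frac{\pi_\theta(\tau^{+} \mid g)}{\pi_\theta(\tau^{+} \mid g_0)}
    - \log \frac{\pi_\theta(\tau^{-} \mid g)}{\pi_\theta(\tau^{-} \mid g_0)}
  \Big] \Big)
```

Here $\tau^{+}$ and $\tau^{-}$ denote preferred and dispreferred trajectories; the minimization runs over $g$ alone, which is what separates this setup from fine-tuning $\theta$.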

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PGT tunes only the goal embedding with a trajectory preference objective on a frozen policy and claims 13.4% OOD gains over fine-tuning on Minecraft tasks, but the controllability assumption is the load-bearing part.

The main thing to know is that this paper frames post-training adaptation as latent control: the authors freeze the policy weights entirely and optimize only the continuous goal embedding, using a trajectory-level preference objective to steer behavior toward preferred outcomes. On the Minecraft SkillForge benchmark with 17 tasks and two foundation policies, they report average relative improvements of 72.0% and 81.6% over expert prompts, plus a 13.4% edge over full fine-tuning in out-of-distribution settings. The decoupling of task alignment from dynamics is a clean move, and the OOD result is the part worth paying attention to if it holds.

Referee Report

3 major / 2 minor

Summary. The paper proposes Preference Goal Tuning (PGT), a post-training method that formulates adaptation of goal-conditioned policies as optimization of a continuous latent goal embedding (the control variable) while keeping the policy parameters frozen. A trajectory-level preference objective is used to align the induced trajectory distribution with task preferences. On the Minecraft SkillForge benchmark across 17 tasks, PGT reports average relative improvements of 72.0% and 81.6% on two foundation policies, outperforming expert prompts, and a 13.4% gain over full fine-tuning in out-of-distribution settings.

Significance. If the central empirical claims hold with proper verification, PGT would demonstrate that latent control via preference optimization over goal embeddings can yield more robust generalization than parameter updates, offering an efficient alternative for adapting frozen policies without modifying their weights.

major comments (3)

[Abstract] The concrete relative improvement figures (72.0%, 81.6%, and the 13.4% OOD gain over full fine-tuning) are stated without any accompanying experimental protocol, number of runs, variance estimates, statistical tests, or ablation results, leaving the load-bearing empirical claim only partially supported.
[Method] Latent goal optimization: the claim that a trajectory-level preference objective applied solely to the goal embedding is sufficient to modulate the frozen policy's induced distribution rests on the unverified assumption that the embedding space contains points producing preferred behaviors and that the preference model can distinguish nearby embeddings; no analysis, controllability test, or coverage argument is provided to establish this property.
[Experiments] The OOD superiority claim (13.4% over full fine-tuning) is reported for out-of-distribution settings, but the manuscript provides no comparison of how the preference objective is evaluated or optimized across in-distribution vs. OOD regimes, nor any diagnostic showing that the selected goal embeddings actually alter behavior as intended rather than selecting from the original training distribution.
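
A controllability diagnostic of the kind requested in the [Method] comment could look roughly like the following. This is a sketch under assumptions: a toy linear-softmax model stands in for the frozen policy, and sensitivity is measured as the mean KL divergence between action distributions as the goal is perturbed along a random direction. The probe design and all names are hypothetical, not something the paper reports.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, GOAL_DIM, N_ACTIONS = 4, 3, 5

# Hypothetical frozen policy weights; a real probe would query the actual policy.
W_s = rng.normal(size=(N_ACTIONS, STATE_DIM))
W_g = rng.normal(size=(N_ACTIONS, GOAL_DIM))


def log_action_probs(state, goal):
    """Log-softmax action distribution of the frozen policy."""
    logits = W_s @ state + W_g @ goal
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def mean_kl(goal_a, goal_b, states):
    """Average KL(pi(.|s, goal_a) || pi(.|s, goal_b)) over the given states."""
    total = 0.0
    for state in states:
        log_p = log_action_probs(state, goal_a)
        log_q = log_action_probs(state, goal_b)
        total += float(np.sum(np.exp(log_p) * (log_p - log_q)))
    return total / len(states)


states = rng.normal(size=(32, STATE_DIM))
goal = rng.normal(size=GOAL_DIM)
direction = rng.normal(size=GOAL_DIM)
direction /= np.linalg.norm(direction)

# Sweep the perturbation size: a flat curve would suggest the policy ignores the goal.
radii = [0.0, 0.1, 0.5, 1.0]
sensitivity = [mean_kl(goal, goal + r * direction, states) for r in radii]
for r, kl in zip(radii, sensitivity):
    print(f"radius {r:.1f}: mean KL {kl:.4f}")
```

A curve that rises with the radius indicates that nearby embeddings produce distinguishable behavior; it does not by itself show that a preferred behavior is reachable, which would need a separate coverage argument.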

minor comments (2)

[Method] Notation for the preference objective and goal embedding update rule should be introduced with explicit equations rather than prose descriptions.
[Experiments] The Minecraft SkillForge benchmark tasks and the two foundation policies should be referenced with citations or a table of task definitions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We provide point-by-point responses below and will make revisions to address the concerns raised.

Referee: [Abstract] The concrete relative improvement figures (72.0%, 81.6%, and the 13.4% OOD gain over full fine-tuning) are stated without any accompanying experimental protocol, number of runs, variance estimates, statistical tests, or ablation results, leaving the load-bearing empirical claim only partially supported.

Authors: We agree that the abstract would be strengthened by including a reference to the experimental protocol. In the revised manuscript, we will update the abstract to briefly describe the evaluation setup and number of runs, and to note the presence of variance estimates and statistical tests in the main body.
Revision: yes

Referee: [Method] Latent goal optimization: the claim that a trajectory-level preference objective applied solely to the goal embedding is sufficient to modulate the frozen policy's induced distribution rests on the unverified assumption that the embedding space contains points producing preferred behaviors and that the preference model can distinguish nearby embeddings; no analysis, controllability test, or coverage argument is provided to establish this property.

Authors: The empirical results on 17 tasks demonstrate the method's effectiveness, but we acknowledge the lack of explicit controllability analysis. We will add a controllability test and coverage argument in the method section of the revision.
Revision: yes

Referee: [Experiments] The OOD superiority claim (13.4% over full fine-tuning) is reported for out-of-distribution settings, but the manuscript provides no comparison of how the preference objective is evaluated or optimized across in-distribution vs. OOD regimes, nor any diagnostic showing that the selected goal embeddings actually alter behavior as intended rather than selecting from the original training distribution.

Authors: We will revise the experiments section to include a direct comparison of the preference objective evaluation between ID and OOD regimes, as well as additional diagnostics such as behavior alteration visualizations to confirm the embeddings induce intended changes.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmark evaluations rather than internal reductions.

The paper formulates PGT as optimizing a latent goal embedding via a trajectory-level preference objective while keeping the policy frozen, then reports empirical gains (e.g., a 13.4% OOD improvement over full fine-tuning) on the independent Minecraft SkillForge benchmark across 17 tasks. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains; the controllability assumption is tested via external performance metrics rather than being presupposed in the derivation. This is the standard case of a self-contained empirical method whose validity is assessed outside its own fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the latent goal space is expressive enough to steer behavior without policy updates; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: The behavior of a goal-conditioned policy can be modulated to arbitrary preferred trajectory distributions solely by optimizing its conditioning input while parameters remain fixed.
This premise is required for the latent-control formulation to replace parameter updates.

pith-pipeline@v0.9.0 · 5778 in / 1225 out tokens · 25774 ms · 2026-05-23T08:17:51.447013+00:00 · methodology

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 17 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

Programming by feedback

Riad Akrour, Marc Schoenauer, Mich \`e le Sebag, and Jean-Christophe Souplet. Programming by feedback. In International Conference on Machine Learning, volume 32, pp.\ 1503--1511. JMLR. org, 2014

work page 2014
[3]

Hindsight experience replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017

work page 2017
[4]

A general theoretical paradigm to understand learning from human preferences

Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp.\ 4447--4455. PMLR, 2024

work page 2024
[5]

Video pretraining (vpt): Learning to act by watching unlabeled online videos, 2022

Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos, 2022. URL https://arxiv.org/abs/2206.11795

work page arXiv 2022
[6]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[9]

Open-world multi-task control through goal-aware representation learning and adaptive horizon prediction

Shaofei Cai, Zihao Wang, Xiaojian Ma, Anji Liu, and Yitao Liang. Open-world multi-task control through goal-aware representation learning and adaptive horizon prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 13734--13744, 2023 a

work page 2023
[10]

Groot: Learning to follow instructions by watching gameplay videos, 2023 b

Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, and Yitao Liang. Groot: Learning to follow instructions by watching gameplay videos, 2023 b

work page 2023
[11]

GROOT -1.5: Learning to follow multi-modal instructions from weak supervision

Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, and Yitao Liang. GROOT -1.5: Learning to follow multi-modal instructions from weak supervision. In Multi-modal Foundation Model meets Embodied AI Workshop @ ICML2024, 2024. URL https://openreview.net/forum?id=zxdi4Kdfjq

work page 2024
[12]

Goal-conditioned reinforcement learning with imagined subgoals

Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. Goal-conditioned reinforcement learning with imagined subgoals. In International conference on machine learning, pp.\ 1430--1440. PMLR, 2021

work page 2021
[13]

Exploring large language model based intelligent agents: Definitions, methods, and prospects

Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua Zhao, et al. Exploring large language model based intelligent agents: Definitions, methods, and prospects. arXiv preprint arXiv:2401.03428, 2024

work page arXiv 2024
[14]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

work page 2017
[15]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Embodiment Collaboration et al. Open x-embodiment: Robotic learning datasets and rt-x models, 2024. URL https://arxiv.org/abs/2310.08864

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Goal-conditioned imitation learning

Yiming Ding, Carlos Florensa, Pieter Abbeel, and Mariano Phielipp. Goal-conditioned imitation learning. Advances in neural information processing systems, 32, 2019

work page 2019
[17]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Minedojo: Building open-ended embodied agents with internet-scale knowledge

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=r...

work page 2022
[19]

u rnkranz, Eyke H \

Johannes F \"u rnkranz, Eyke H \"u llermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine learning, 89: 0 123--156, 2012

work page 2012
[20]

Reinforced Self-Training (ReST) for Language Modeling

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

arXiv preprint arXiv:1907.13440 , year=

William H. Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov. Minerl: A large-scale dataset of minecraft demonstrations, 2019. URL https://arxiv.org/abs/1907.13440

work page arXiv 2019
[22]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

work page 2016
[23]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015. URL https://arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2015
[24]

ORPO: Monolithic Preference Optimization without Reference Model

Jiwoo Hong, Noah Lee, and James Thorne. Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

Bc-z: Zero-shot task generalization with robotic imitation learning, 2022

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning, 2022. URL https://arxiv.org/abs/2202.02005

work page arXiv 2022
[27]

The malmo platform for artificial intelligence experimentation

Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In Ijcai, volume 16, pp.\ 4246--4247, 2016

work page 2016
[28]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023. URL https://arxiv.org/abs/2304.02643

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

doi: 10.1073/pnas.1611835114

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114 0 (13): ...

work page doi:10.1073/pnas.1611835114 2017
[31]

Interactively shaping agents via human reinforcement: The tamer framework

W Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The tamer framework. In Proceedings of the fifth international conference on Knowledge capture, pp.\ 9--16, 2009

work page 2009
[32]

Vera: Vector-based random matrix adaptation,

Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. Vera: Vector-based random matrix adaptation, 2024. URL https://arxiv.org/abs/2310.11454

work page arXiv 2024
[33]

Behavior generation with latent actions

Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024

work page arXiv 2024
[34]

Steve-1: A generative model for text-to-behavior in minecraft

Shalev Lifshitz, Keiran Paster, Harris Chan, Jimmy Ba, and Sheila McIlraith. Steve-1: A generative model for text-to-behavior in minecraft. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[35]

Mcu: A task-centric framework for open-ended agent evaluation in minecraft

Haowei Lin, Zihao Wang, Jianzhu Ma, and Yitao Liang. Mcu: A task-centric framework for open-ended agent evaluation in minecraft. arXiv preprint arXiv:2310.08367, 2023

work page arXiv 2023
[36]

Selecting large language model to fine-tune via rectified scaling law

Haowei Lin, Baizhou Huang, Haotian Ye, Qinyu Chen, Zihao Wang, Sujian Li, Jianzhu Ma, Xiaojun Wan, James Zou, and Yitao Liang. Selecting large language model to fine-tune via rectified scaling law. arXiv preprint arXiv:2402.02314, 2024

work page arXiv 2024
[37]

Gradient episodic memory for continual learning, 2022

David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning, 2022. URL https://arxiv.org/abs/1706.08840

work page arXiv 2022
[38]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024

work page arXiv 2024
[39]

Self-imitation learning

Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. In International conference on machine learning, pp.\ 3878--3887. PMLR, 2018

work page 2018
[40]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022

work page 2022
[42]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024. URL https://arxiv.org/abs/2305.18290

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

A Generalist Agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A generalist agent, 2022. URL https:...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

Mastering the game of go with deep neural networks and tree search

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529 0 (7587): 0 484--489, 2016

work page 2016
[45]

Learning structured output representation using deep conditional generative models

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/8d5...

work page 2015
[46]

Preference fine-tuning of llms should leverage suboptimal, on-policy data

Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data. arXiv preprint arXiv:2404.14367, 2024

work page arXiv 2024
[47]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth \'e e Lacroix, Baptiste Rozi \`e re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Advances in prospect theory: Cumulative representation of uncertainty

Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and uncertainty, 5: 0 297--323, 1992

work page 1992
[49]

Will we run out of data? limits of llm scaling based on human-generated data, 2024

Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Will we run out of data? limits of llm scaling based on human-generated data, 2024. URL https://arxiv.org/abs/2211.04325

work page arXiv 2024
[50]

Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents

Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, Yitao Liang, and Team CraftJarvis. Describe, explain, plan and select: interactive planning with large language models enables open-world multi-task agents. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp.\ 34153--34189, 2023 a

work page 2023
[51]

Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models

Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, and Yitao Liang. Jarvis-1: Open-world multi-task agents with memory-augmented multimodal language models. arXiv preprint arXiv: 2311.05997, 2023 b

work page arXiv 2023
[52]

Foundation models for decision making: Problems, methods, and opportunities, 2023

Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, and Dale Schuurmans. Foundation models for decision making: Problems, methods, and opportunities, 2023. URL https://arxiv.org/abs/2303.04129

work page arXiv 2023
[53]

Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2022

Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2022. URL https://arxiv.org/abs/2106.10199

work page arXiv 2022
[54]

Proagent: Building proactive cooperative ai with large language models

Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, et al. Proagent: Building proactive cooperative ai with large language models. CoRR, 2023

work page 2023
[55]

Learning to drive by watching youtube videos: Action-conditioned contrastive policy pretraining, 2022

Qihang Zhang, Zhenghao Peng, and Bolei Zhou. Learning to drive by watching youtube videos: Action-conditioned contrastive policy pretraining, 2022. URL https://arxiv.org/abs/2204.02393

work page arXiv 2022
[56]

A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

Calibrating sequence likelihood improves conditional language generation

Yao Zhao, Misha Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J Liu. Calibrating sequence likelihood improves conditional language generation. arXiv preprint arXiv:2210.00045, 2022

work page arXiv 2022
[58]

Slic-hf: Sequence likelihood calibration with human feedback

Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023 b

work page arXiv 2023
[59]

Fine-Tuning Language Models from Human Preferences

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2020. URL https://arxiv.org/abs/1909.08593

