pith. machine review for the scientific record.

arxiv: 2605.00347 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Chengshuai Shi, Chi Jin, Danqi Chen, Gabriel Sarch, Karthik Narasimhan, Ruirong Feng, Seth Karten, Wenjia Yang, Wenzhe Li, Xinran Liang, Yizhou Lu, Zihan Ding, Ziran Yang

Pith reviewed 2026-05-09 20:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords vision-language models · reinforcement learning · long-horizon decision-making · video games · embodied agents · PPO algorithm · generalization

The pith

Adapted reinforcement learning lets vision-language models handle 100+ turn decisions in games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to apply reinforcement learning to vision-language models for interactive tasks that span over 100 turns, using the game Super Mario Land as the testbed. Existing methods either depend on massive supervised fine-tuning or work only for short sequences, so the authors systematically test algorithmic pieces and settle on a modified PPO algorithm that adds a lightweight turn-level critic. This change stabilizes training and raises sample efficiency, while the models' pretraining supplies useful action priors that cut down on hand-crafted designs. From these pieces they assemble the Odysseus framework, which delivers at least three times the average progress of current frontier models across game levels and improves generalization both inside and outside the training game without eroding general capabilities.
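To make the setting concrete, here is a minimal sketch of the kind of turn-based loop such an agent runs, in Python. The environment interface, the prompt, and the action vocabulary are illustrative assumptions, not the paper's actual API.

    # Hypothetical turn-based rollout loop for a VLM game agent.
    # The env/vlm interfaces and the action names are assumptions
    # made for illustration, not the paper's implementation.
    ACTIONS = ["left", "right", "jump", "run", "noop"]

    def run_episode(env, vlm, max_turns=150):
        obs = env.reset()                    # current game frame (an RGB image)
        trajectory = []
        for _ in range(max_turns):
            prompt = f"You control Mario. Pick one action from {ACTIONS}."
            action = vlm.act(image=obs, text=prompt)   # hypothetical VLM call
            if action not in ACTIONS:
                action = "noop"              # guard against malformed outputs
            obs, reward, done = env.step(action)
            trajectory.append((action, reward))
            if done:
                break
        return trajectory

Each of the 100+ turns is one full perception-reasoning-action cycle, which is what makes the horizon long in model calls rather than in raw frames.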

Core claim

We introduce Odysseus, an open training framework for VLM agents. An adapted PPO variant with a lightweight turn-level critic substantially improves training stability and sample efficiency over critic-free baselines. Pretrained VLMs supply strong action priors that further boost efficiency and reduce manual action engineering. The resulting agents achieve substantial gains across multiple levels of Super Mario Land, at least three times the average game progress of frontier models, consistent improvements under in-game and cross-game generalization, and retention of general-domain capabilities.

What carries the argument

Adapted PPO with a lightweight turn-level critic, which stabilizes long-horizon RL training for VLMs and improves sample efficiency, compounded by the action priors the pretrained VLM already carries.
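A compressed sketch of what a clipped-PPO update with a turn-level value baseline can look like, in PyTorch. Tensor shapes, the critic's architecture, and the policy interface are placeholders; the paper's exact formulation is not reproduced here.

    import torch
    import torch.nn.functional as F

    def ppo_step(policy, critic, batch, optimizer,
                 clip_eps=0.2, gamma=0.99, lam=0.95, vf_coef=0.5):
        # batch holds one trajectory at turn granularity; all interfaces
        # here are illustrative assumptions.
        obs, actions, old_logp, rewards = batch
        values = critic(obs).squeeze(-1)     # V(s_t), one scalar per turn
        v = values.detach()

        # Generalized advantage estimation over turns, not tokens.
        adv = torch.zeros_like(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            next_v = v[t + 1] if t + 1 < len(rewards) else 0.0
            delta = rewards[t] + gamma * next_v - v[t]
            running = delta + gamma * lam * running
            adv[t] = running
        returns = adv + v
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)

        # Clipped surrogate for the policy, regression loss for the critic.
        new_logp = policy.log_prob(obs, actions)       # hypothetical interface
        ratio = torch.exp(new_logp - old_logp)
        pg_loss = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
        loss = pg_loss + vf_coef * F.mse_loss(values, returns)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The per-turn baseline is the point of contrast: critic-free methods give every turn of a trajectory the same advantage, which is exactly what the paper's GRPO and Reinforce++ comparisons probe.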

If this is right

  • Substantial performance gains across multiple levels of the game.
  • At least three times the average game progress compared with frontier models.
  • Consistent improvements in both in-game and cross-game generalization.
  • Retention of general-domain capabilities after specialized training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same stability techniques may transfer to other long-horizon embodied tasks that combine vision and language.
  • Open release of the framework could accelerate community experiments on scaling RL for VLMs beyond games.
  • Further work could test whether the turn-level critic remains effective when horizons exceed several hundred steps.

Load-bearing premise

The reported performance gains and generalization come mainly from the adapted RL components and VLM priors rather than from unstated implementation choices, game-specific tuning, or evaluation details.

What would settle it

A direct head-to-head comparison on the same Super Mario Land levels that isolates the turn-level critic from critic-free methods and measures both final progress and training stability.
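In code terms, such a comparison could be as plain as the harness below; the training functions, evaluation metric, and seeds are placeholders, not the paper's protocol.

    import random, statistics

    def head_to_head(variants, make_env, levels, seeds=(0, 1, 2)):
        # variants: {"turn_level_critic": train_fn, "critic_free": train_fn}
        # Same seeds, levels, and budget for both; placeholder interfaces.
        results = {}
        for name, train in variants.items():
            progress = []
            for seed in seeds:
                random.seed(seed)
                agent = train(make_env, seed=seed)
                progress += [agent.evaluate(make_env(level)) for level in levels]
            # Mean = final progress; spread across seeds = crude stability proxy.
            results[name] = (statistics.mean(progress), statistics.stdev(progress))
        return results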

Figures

Figures reproduced from arXiv: 2605.00347.

Figure 1: An overview of Odysseus for scaling VLMs to 100+ turn decision-making in games.
Figure 2: The interaction protocol between the VLM agent and the game environment.
Figure 3: The adapted PPO algorithm used in Odysseus with a lightweight turn-level CNN critic.
Figure 4: (a) Comparison of VLM-based RL training methods with training samples limited …
Figure 5: Auto-Curriculum. Building on the environment knowledge and perception capabilities acquired during SFT, we further apply RL to optimize action selection in the environment, thereby improving final performance. Based on the algorithmic findings in Section 4, we adopt the adapted PPO together with positive-advantage filtering. To enable multi-task training, each training batch contains trajectories collect…
Figure 6: Evaluation of Odysseus under three generalization settings: in-game off-policy …
Figure 7: Comparison of VLM-based RL training methods. Individual runs are plotted …
Figure 8: Comparison between VLM-based RL and classical RL. Individual runs are plotted …
Figure 9: Example trajectories of base model (top) and Odysseus (bottom). Base model fails …
Figure 10: Example trajectories of base model (top) and Odysseus (bottom). Base model falls …
Original abstract

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20--30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, achieving substantial gains across multiple levels of the game and at least 3 times average game progresses than frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
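The algorithmic contrast in the abstract comes down to how per-turn advantages are formed. A schematic comparison, with invented tensor names:

    import torch

    def grpo_style_advantages(group_returns):
        # Critic-free: normalize each trajectory's return within its
        # sampling group, so all turns of a trajectory share one advantage.
        return (group_returns - group_returns.mean()) / (group_returns.std() + 1e-8)

    def critic_advantages(rewards, values, gamma=0.99):
        # Critic-based: a turn-level value baseline assigns each turn its
        # own one-step TD advantage (the adapted PPO layers GAE on top).
        next_values = torch.cat([values[1:], values.new_zeros(1)])
        return rewards + gamma * next_values - values

Over 100+ turns the shared-scalar scheme leaves within-trajectory credit assignment to chance, which is one plausible reading of why the turn-level critic aids stability.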

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Odysseus, an open training framework for vision-language model (VLM) agents performing long-horizon (100+ turn) decision-making in Super Mario Land. It systematically investigates RL components, proposes an adapted PPO variant with a lightweight turn-level critic for improved stability and sample efficiency over GRPO and Reinforce++, and leverages pretrained VLM action priors to reduce manual design. The framework is reported to deliver substantial gains across game levels, at least 3x average game progress relative to frontier models, plus in-game and cross-game generalization while preserving general-domain capabilities.

Significance. If the empirical claims hold under controlled conditions, the work would advance scaling of VLMs to embodied, long-horizon tasks by supplying an open framework, concrete guidance on stable multi-modal RL, and evidence that VLM priors aid efficiency. The open-source release and retention of general capabilities are explicit strengths that support reproducibility and broader applicability.

major comments (1)
  1. [Experimental results and ablations] The central attribution of the reported 3x game-progress gains, stability, and generalization to the adapted PPO (with turn-level critic) plus VLM priors requires explicit controlled ablations. Comparisons to GRPO, Reinforce++, and prompted frontier VLMs must hold the VLM backbone, action space, observation encoding, and reward formulation fixed; without such controls (detailed in the experimental section), it remains unclear whether the gains arise from the proposed algorithmic ingredients or from unablated implementation or evaluation choices. (A sketch of such a controlled design follows the minor comments below.)
minor comments (2)
  1. [Abstract] The abstract states 'substantial gains' and 'at least 3 times average game progresses' without any numerical values, baseline scores, or statistical details; including key quantitative results would make the summary self-contained.
  2. [Results] Clarify the precise definition and measurement of 'average game progresses' (e.g., levels completed, distance traveled, or normalized score) and how it is aggregated across levels and episodes.
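Concretely, the control the major comment asks for is a factorial design in which only the algorithm (and seed) varies. A sketch, with all names hypothetical:

    from itertools import product

    # Everything the referee wants pinned across the RL comparisons.
    FIXED = dict(backbone="shared-vlm-checkpoint",
                 action_space="shared-button-set",
                 obs_encoding="raw-frames",
                 reward="game-progress")

    ALGORITHMS = ["adapted_ppo_turn_critic", "grpo", "reinforce_pp"]
    SEEDS = [0, 1, 2]

    # Any performance gap between rows is then attributable to the algorithm.
    runs = [dict(FIXED, algorithm=a, seed=s) for a, s in product(ALGORITHMS, SEEDS)]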

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the manuscript. We address the major comment below.

read point-by-point responses
  1. Referee: [Experimental results and ablations] The central attribution of the reported 3x game-progress gains, stability, and generalization to the adapted PPO (with turn-level critic) plus VLM priors requires explicit controlled ablations. Comparisons to GRPO, Reinforce++, and prompted frontier VLMs must hold the VLM backbone, action space, observation encoding, and reward formulation fixed; without such controls (detailed in the experimental section), it remains unclear whether the gains arise from the proposed algorithmic ingredients or from unablated implementation or evaluation choices.

    Authors: We agree that rigorous controls are necessary to attribute performance differences to specific algorithmic choices. The manuscript reports a systematic investigation of RL components with comparisons to GRPO and Reinforce++ as well as prompted frontier VLMs. To address the concern directly, we will revise the experimental section to include explicit ablation studies that hold the VLM backbone, action space, observation encoding, and reward formulation fixed for the RL-based methods (adapted PPO, GRPO, and Reinforce++). We will also add a detailed description of all implementation and evaluation choices. For prompted frontier VLMs, which use different closed-source backbones by definition, we will clarify that these serve as off-the-shelf baselines without RL training rather than controlled variants. These changes will make the source of the reported gains, stability improvements, and generalization clearer. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical RL results grounded in external game interactions

full rationale

The paper describes an empirical training framework for VLM agents in Super Mario Land using an adapted PPO variant with a turn-level critic, pretrained VLM action priors, and comparisons to baselines such as GRPO, Reinforce++, and frontier models. Performance claims (e.g., 3x game progress, generalization) are evaluated via direct interaction with the external game environment rather than any internal derivation, fitted parameter renamed as prediction, or self-referential definition. No mathematical equations, uniqueness theorems, or ansatzes are presented that reduce to the paper's own inputs by construction. No load-bearing self-citations or renamings of known results appear in the abstract or described content. This is a standard empirical RL study whose central claims rest on observable training outcomes and external benchmarks, warranting a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are introduced or quantified in the abstract; the work relies on standard RL assumptions and pretrained model capabilities from prior literature.

pith-pipeline@v0.9.0 · 5624 in / 1017 out tokens · 43814 ms · 2026-05-09T20:19:28.020941+00:00 · methodology

