LaGO: Latent Action Guidance for Online Reinforcement Learning

Kuan-Yen Liu; Ren-Jyun Huang; Ti-Rong Wu

arxiv: 2606.24669 · v1 · pith:B5N4IXSYnew · submitted 2026-06-23 · 💻 cs.AI

LaGO: Latent Action Guidance for Online Reinforcement Learning

Kuan-Yen Liu , Ren-Jyun Huang , Ti-Rong Wu This is my paper

Pith reviewed 2026-06-25 23:25 UTC · model grok-4.3

classification 💻 cs.AI

keywords LaGOlatent action guidancereinforcement learninglarge language modelsonline policy optimizationCLEVR-RobotMeta-WorldPPO

0 comments

The pith

LaGO uses a pretrained LLM as a latent action prior to softly guide online RL policy optimization instead of acting as a direct controller.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LaGO to make use of large language models in sequential decision tasks without relying on them to output exact actions, which prior approaches found unreliable. It instead extracts a latent action prior from the LLM to provide soft guidance during online policy learning with PPO. Experiments on CLEVR-Robot and Meta-World show consistent gains in both reward and success rate over vanilla PPO. A reader would care because the method offers a practical route to transfer LLM knowledge into real robot control loops. The results also indicate that larger or stronger LLMs yield stronger guidance effects.

Core claim

LaGO is a framework that uses a pretrained LLM as a latent action prior to softly guide online policy optimization, rather than treating the LLM as an explicit planner or controller. Experiments on both a discrete-control benchmark, CLEVR-Robot, and a continuous-control benchmark, Meta-World, demonstrate that LaGO consistently improves both reward and success rate over Vanilla PPO. In particular, LaGO increases the average success rate from 15.1% to 27.2% on CLEVR-Robot and from 2.7% to 15.2% on Meta-World. Our analysis further shows that stronger pretrained LLMs provide more effective guidance, suggesting that LLM knowledge can improve planning and online decision-making.

What carries the argument

Latent action prior extracted from a pretrained LLM, which supplies soft guidance to the online policy optimizer without requiring precise action outputs from the model.

If this is right

LaGO improves both reward and success rate over vanilla PPO on the two robot benchmarks tested.
The method applies to both discrete-control and continuous-control settings.
Guidance quality scales with the capability of the underlying pretrained LLM.
LLM knowledge can be injected into online RL without converting the LLM into an explicit action generator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-prior approach might transfer to other online RL algorithms besides PPO.
As LLMs continue to improve, the performance gap between LaGO and vanilla methods could widen further on harder tasks.
The framework could be tested on additional robot suites to check whether the success-rate gains generalize beyond the two benchmarks reported.

Load-bearing premise

A pretrained LLM can supply useful latent action guidance that meaningfully aids online policy optimization without the need for it to generate precise actions.

What would settle it

Running LaGO against vanilla PPO on CLEVR-Robot or Meta-World and finding no improvement or a drop in success rate or reward would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2606.24669 by Kuan-Yen Liu, Ren-Jyun Huang, Ti-Rong Wu.

**Figure 1.** Figure 1: Overview of the LaGO framework. Numeric environment states are injected into the frozen LLM latent space via learned projection layers, and the resulting action distribution serves as a KL-regularized prior for online RL policy optimization. ioral priors for reinforcement learning. In principle, there are several ways to extract such a latent policy prior from a pretrained LLM. Since the main focus of this… view at source ↗

**Figure 2.** Figure 2: Impact of LLM prior quality on Meta-World. For each task category, we report reward and success rate for the direct prior policy, Vanilla PPO, and LaGO. language model leads to a more useful latent policy prior for online RL training [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Large language models (LLMs) have shown strong potential for planning and sequential decision-making, but prior work often relies on using them as direct controllers, which requires precise action generation and can be unreliable in practice. This paper proposes Latent Action Guidance for Online Reinforcement Learning (LaGO), a framework that uses a pretrained LLM as a latent action prior to softly guide online policy optimization, rather than treating the LLM as an explicit planner or controller. Experiments on both a discrete-control benchmark, CLEVR-Robot, and a continuous-control benchmark, Meta-World, demonstrate that LaGO consistently improves both reward and success rate over Vanilla PPO. In particular, LaGO increases the average success rate from 15.1% to 27.2% on CLEVR-Robot and from 2.7% to 15.2% on Meta-World. Our analysis further shows that stronger pretrained LLMs provide more effective guidance, suggesting that LLM knowledge can improve planning and online decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LaGO shows modest empirical gains from using an LLM as a latent prior to guide PPO on two control benchmarks, but the abstract gives almost no experimental detail so it's impossible to judge if the gains are reliable.

read the letter

The core idea is to let a pretrained LLM supply soft latent action guidance during online policy optimization instead of forcing it to produce exact actions. That avoids some of the brittleness seen when LLMs are used as direct controllers. The paper reports that this raises average success rate from 15.1% to 27.2% on CLEVR-Robot and from 2.7% to 15.2% on Meta-World, and that larger LLMs produce better guidance. Those numbers are the main concrete result.

The approach is a modest extension of existing LLM-RL hybrids rather than a sharp departure. It does at least try to keep the LLM out of the low-level control loop, which is a sensible design choice. The observation that stronger models help more is also plausible and worth checking.

The soft spots are mostly around the experiments. The abstract mentions only vanilla PPO as a baseline, gives no information on statistical significance, number of seeds, hyperparameter sweeps, or how the latent guidance is actually extracted and injected into the policy update. Without those pieces it is hard to tell whether the reported lifts are robust or sensitive to implementation details. The circularity burden looks low because there are no derivations to inspect, but that also means the work rests entirely on the empirical claims.

This paper is aimed at people already working on LLM-augmented RL for robotics or sequential decision tasks. A reader who wants to see one more data point on latent guidance might find it useful, but anyone looking for a well-documented method or strong theoretical grounding will come away wanting more. I would send it to peer review so the experimental protocol can be examined, but the authors would need to add proper controls and ablations before it could be considered solid.

Referee Report

2 major / 0 minor

Summary. The paper proposes LaGO, a framework that uses a pretrained LLM as a latent action prior to softly guide online policy optimization in RL rather than as a direct controller requiring precise actions. On CLEVR-Robot and Meta-World, LaGO is reported to improve both reward and success rate over vanilla PPO, raising average success rates from 15.1% to 27.2% and from 2.7% to 15.2%, respectively, with stronger LLMs yielding better guidance.

Significance. If the empirical results hold under rigorous evaluation, the work offers a practical route to injecting LLM priors into online RL without demanding exact action outputs from the model. The concrete success-rate deltas on both discrete and continuous benchmarks, together with the scaling observation that stronger LLMs help, would be a useful data point for the community exploring LLM-assisted decision making.

major comments (2)

[Abstract] Abstract: the central empirical claim rests on reported success-rate gains, yet the abstract supplies no information on number of random seeds, statistical tests, variance, or the precise definition of the latent-action guidance loss; without these the numerical improvements cannot be assessed for robustness.
[Abstract] Abstract: only vanilla PPO is mentioned as baseline; the claim that LaGO 'consistently improves' over standard practice requires at least one additional modern baseline (e.g., with action priors or LLM planners) to be load-bearing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness of the empirical claims.

read point-by-point responses

Referee: [Abstract] Abstract: the central empirical claim rests on reported success-rate gains, yet the abstract supplies no information on number of random seeds, statistical tests, variance, or the precise definition of the latent-action guidance loss; without these the numerical improvements cannot be assessed for robustness.

Authors: We agree that the abstract should convey basic information on experimental robustness to allow readers to assess the reported gains. In the revised manuscript we will add a concise clause noting that results are averaged over 5 random seeds with standard deviation reported in the main text, and we will include a one-sentence definition of the latent-action guidance loss (the KL-regularized term that softly aligns the policy with the LLM-derived latent prior). revision: yes
Referee: [Abstract] Abstract: only vanilla PPO is mentioned as baseline; the claim that LaGO 'consistently improves' over standard practice requires at least one additional modern baseline (e.g., with action priors or LLM planners) to be load-bearing.

Authors: We acknowledge that the abstract currently references only vanilla PPO. The full experimental section already contains comparisons against additional baselines that incorporate action priors and LLM-based planners; however, to make the abstract claim self-contained we will revise it to mention at least one such modern baseline (or qualify the statement as improvement over standard PPO while directing readers to the full set of comparisons). revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical RL method paper. It introduces LaGO as a framework using a pretrained LLM as a latent action prior to guide PPO, then reports benchmark results (success rate gains on CLEVR-Robot and Meta-World). No equations, derivations, or claimed first-principles results appear in the provided text. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations are present. The central claim reduces to experimental comparison, which is externally falsifiable and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no information is available on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5698 in / 1044 out tokens · 16065 ms · 2026-06-25T23:25:12.798486+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 15 internal anchors

[1]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

URL https://arxiv.org/abs/2204.01691. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y ., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M. G., Gopalakr- ishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y ., Leal, I., Lee, L., L...

work page internal anchor Pith review Pith/arXiv arXiv
[2]

URLhttps://arxiv.org/abs/2307.15818. Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., 6 LaGO: Latent Action Guidance for Online Reinforcement Learning Shi, Y ., Hughes, E., Lai, M., Mavalankar, A., Steiger- wald, R., Apps, C., Aytar, Y ., Bechtle, S., Behbahani, F., Chan, S., Heess, N., Gonzalez, L., Osindero, S., Ozair, S., Reed, S., Zhang, J., Zol...

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Bubeck, S., Chandrasekaran, V ., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y

URL https://arxiv.org/abs/2402.15391. Bubeck, S., Chandrasekaran, V ., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y . T., Li, Y ., Lund- berg, S., Nori, H., Palangi, H., Ribeiro, M. T., and Zhang, Y . Sparks of artificial general intelligence: Early experi- ments with gpt-4,

work page arXiv
[4]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

URL https://arxiv.org/ abs/2303.12712. Carta, T., Romac, C., Wolf, T., Lamprier, S., Sigaud, O., and Oudeyer, P.-Y . Grounding large language models in interactive environments with online reinforcement learning,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Driess, D., Xia, F., Sajjadi, M

URL https://arxiv.org/abs/ 2302.02662. Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y ., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V ., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., and Florence, P. Palm-e: An embodied multimodal lan...

work page arXiv
[6]

PaLM-E: An Embodied Multimodal Language Model

URLhttps://arxiv.org/abs/2303.03378. Fang, T., Zhang, H., Zhang, Z., Ma, K., Yu, W., Mi, H., and Yu, D. Webevolver: Enhancing web agent self- improvement with coevolving world model,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Google Research

URL https://arxiv.org/abs/2504.21024. Google Research. Clevr-robot environment. https://github.com/google-research/ clevr_robot_env,

work page arXiv
[8]

doi: 10.1038/s41586-025-09422-z

ISSN 1476-4687. doi: 10.1038/ s41586-025-09422-z. URL http://dx.doi.org/ 10.1038/s41586-025-09422-z. Gurnee, W. and Tegmark, M. Language models represent space and time,

work page doi:10.1038/s41586-025-09422-z
[9]

Hao, S., Gu, Y ., Ma, H., Hong, J

URL https://arxiv.org/ abs/2310.02207. Hao, S., Gu, Y ., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. Reasoning with language model is planning with world model,

work page arXiv
[10]

Reasoning with Language Model is Planning with World Model

URL https: //arxiv.org/abs/2305.14992. Hu, X., Zhang, Y ., Huang, F., Tu, J., Su, Y ., Deng, L., Liu, Y ., Liu, Y ., Liu, D., and Ho, T.-Y . Occubench: Evaluating ai agents on real-world professional tasks via language environment simulation,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

URL https: //arxiv.org/abs/2604.10866. Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., Zeng, A., Tompson, J., Mordatch, I., Chebotar, Y ., Sermanet, P., Brown, N., Jackson, T., Luu, L., Levine, S., Hausman, K., and Ichter, B. Inner monologue: Embodied reasoning through planning with language models,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

URLhttps://arxiv.org/abs/2207.05608. Jin, C. and Rinard, M. Emergent representations of program semantics in language models trained on programs,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

URLhttps://arxiv.org/abs/2305.11169. Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakr- ishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., and Finn, C. Open- vla: An open-source vision-language-action model,

work page arXiv
[14]

OpenVLA: An Open-Source Vision-Language-Action Model

URLhttps://arxiv.org/abs/2406.09246. Li, K., Hopkins, A. K., Bau, D., Vi´egas, F., Pfister, H., and Wattenberg, M. Emergent world representations: Explor- ing a sequence model trained on a synthetic task,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

7 LaGO: Latent Action Guidance for Online Reinforcement Learning Lin, J., Du, Y ., Watkins, O., Hafner, D., Abbeel, P., Klein, D., and Dragan, A

URLhttps://arxiv.org/abs/2210.13382. 7 LaGO: Latent Action Guidance for Online Reinforcement Learning Lin, J., Du, Y ., Watkins, O., Hafner, D., Abbeel, P., Klein, D., and Dragan, A. Learning to model the world with language,

work page arXiv
[16]

OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F

URL https://arxiv.org/abs/ 2308.01399. OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Bal- aji, S., Balcom, V ., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boi...

work page arXiv
[17]

GPT-4 Technical Report

URLhttps://arxiv.org/abs/2303.08774. Pang, J.-C., Yang, S.-H., Li, K., Zhang, J., Chen, X.-H., Tang, N., and Yu, Y . Knowledgeable agents by offline reinforcement learning from large language model roll- outs,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

URL https://arxiv.org/abs/ 2205.06175. Team, D., Zeng, B., Hua, D., Zhu, K., Dai, Y ., Li, B., Wang, Y ., Tong, C., Yang, Y ., Chang, M., Zhao, J., Liu, Z., Liang, H., Ma, X., An, R., Niu, J., Meng, Z., Bai, T., Qiang, M., Zhang, H., Xiao, Z., Guo, T., Yu, Q., Zhao, R., Li, Z., Huang, X., Pan, Y ., Tang, Y ., Shi, Y ., Ding, Y ., Chen, X., Gao, H., Shi, M...

work page internal anchor Pith review Pith/arXiv arXiv
[19]

URLhttps: //arxiv.org/abs/2604.04707. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi `ere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation lan- guage models,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

LLaMA: Open and Efficient Foundation Language Models

URL https://arxiv.org/ abs/2302.13971. Wang, G., Xie, Y ., Jiang, Y ., Mandlekar, A., Xiao, C., Zhu, Y ., Fan, L., and Anandkumar, A. V oyager: An open- ended embodied agent with large language models,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Voyager: An Open-Ended Embodied Agent with Large Language Models

URLhttps://arxiv.org/abs/2305.16291. Xiang, J., Tao, T., Gu, Y ., Shu, T., Wang, Z., Yang, Z., and Hu, Z. Language models meet world models: Embod- ied experiences enhance language models,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Yan, X., Song, Y ., Feng, X., Yang, M., Zhang, H., Ammar, H

URL https://arxiv.org/abs/2305.10626. Yan, X., Song, Y ., Feng, X., Yang, M., Zhang, H., Ammar, H. B., and Wang, J. Efficient reinforcement learning 8 LaGO: Latent Action Guidance for Online Reinforcement Learning with large language model priors,

work page arXiv
[23]

URL https: //arxiv.org/abs/2410.07927. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, ...

work page arXiv
[24]

Qwen3 Technical Report

URL https: //arxiv.org/abs/2505.09388. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y . React: Synergizing reasoning and act- ing in language models,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

ReAct: Synergizing Reasoning and Acting in Language Models

URL https://arxiv. org/abs/2210.03629. Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-world: A benchmark and evalua- tion for multi-task and meta reinforcement learning. In Conference on robot learning, pp. 1094–1100. PMLR,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

URL https://arxiv.org/abs/2204.01691. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y ., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M. G., Gopalakr- ishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y ., Leal, I., Lee, L., L...

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

URLhttps://arxiv.org/abs/2307.15818. Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., 6 LaGO: Latent Action Guidance for Online Reinforcement Learning Shi, Y ., Hughes, E., Lai, M., Mavalankar, A., Steiger- wald, R., Apps, C., Aytar, Y ., Bechtle, S., Behbahani, F., Chan, S., Heess, N., Gonzalez, L., Osindero, S., Ozair, S., Reed, S., Zhang, J., Zol...

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Bubeck, S., Chandrasekaran, V ., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y

URL https://arxiv.org/abs/2402.15391. Bubeck, S., Chandrasekaran, V ., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y . T., Li, Y ., Lund- berg, S., Nori, H., Palangi, H., Ribeiro, M. T., and Zhang, Y . Sparks of artificial general intelligence: Early experi- ments with gpt-4,

work page arXiv

[4] [4]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

URL https://arxiv.org/ abs/2303.12712. Carta, T., Romac, C., Wolf, T., Lamprier, S., Sigaud, O., and Oudeyer, P.-Y . Grounding large language models in interactive environments with online reinforcement learning,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Driess, D., Xia, F., Sajjadi, M

URL https://arxiv.org/abs/ 2302.02662. Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y ., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V ., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., and Florence, P. Palm-e: An embodied multimodal lan...

work page arXiv

[6] [6]

PaLM-E: An Embodied Multimodal Language Model

URLhttps://arxiv.org/abs/2303.03378. Fang, T., Zhang, H., Zhang, Z., Ma, K., Yu, W., Mi, H., and Yu, D. Webevolver: Enhancing web agent self- improvement with coevolving world model,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Google Research

URL https://arxiv.org/abs/2504.21024. Google Research. Clevr-robot environment. https://github.com/google-research/ clevr_robot_env,

work page arXiv

[8] [8]

doi: 10.1038/s41586-025-09422-z

ISSN 1476-4687. doi: 10.1038/ s41586-025-09422-z. URL http://dx.doi.org/ 10.1038/s41586-025-09422-z. Gurnee, W. and Tegmark, M. Language models represent space and time,

work page doi:10.1038/s41586-025-09422-z

[9] [9]

Hao, S., Gu, Y ., Ma, H., Hong, J

URL https://arxiv.org/ abs/2310.02207. Hao, S., Gu, Y ., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. Reasoning with language model is planning with world model,

work page arXiv

[10] [10]

Reasoning with Language Model is Planning with World Model

URL https: //arxiv.org/abs/2305.14992. Hu, X., Zhang, Y ., Huang, F., Tu, J., Su, Y ., Deng, L., Liu, Y ., Liu, Y ., Liu, D., and Ho, T.-Y . Occubench: Evaluating ai agents on real-world professional tasks via language environment simulation,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

URL https: //arxiv.org/abs/2604.10866. Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., Zeng, A., Tompson, J., Mordatch, I., Chebotar, Y ., Sermanet, P., Brown, N., Jackson, T., Luu, L., Levine, S., Hausman, K., and Ichter, B. Inner monologue: Embodied reasoning through planning with language models,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

URLhttps://arxiv.org/abs/2207.05608. Jin, C. and Rinard, M. Emergent representations of program semantics in language models trained on programs,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

URLhttps://arxiv.org/abs/2305.11169. Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakr- ishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., and Finn, C. Open- vla: An open-source vision-language-action model,

work page arXiv

[14] [14]

OpenVLA: An Open-Source Vision-Language-Action Model

URLhttps://arxiv.org/abs/2406.09246. Li, K., Hopkins, A. K., Bau, D., Vi´egas, F., Pfister, H., and Wattenberg, M. Emergent world representations: Explor- ing a sequence model trained on a synthetic task,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

7 LaGO: Latent Action Guidance for Online Reinforcement Learning Lin, J., Du, Y ., Watkins, O., Hafner, D., Abbeel, P., Klein, D., and Dragan, A

URLhttps://arxiv.org/abs/2210.13382. 7 LaGO: Latent Action Guidance for Online Reinforcement Learning Lin, J., Du, Y ., Watkins, O., Hafner, D., Abbeel, P., Klein, D., and Dragan, A. Learning to model the world with language,

work page arXiv

[16] [16]

OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F

URL https://arxiv.org/abs/ 2308.01399. OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Bal- aji, S., Balcom, V ., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boi...

work page arXiv

[17] [17]

GPT-4 Technical Report

URLhttps://arxiv.org/abs/2303.08774. Pang, J.-C., Yang, S.-H., Li, K., Zhang, J., Chen, X.-H., Tang, N., and Yu, Y . Knowledgeable agents by offline reinforcement learning from large language model roll- outs,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

URL https://arxiv.org/abs/ 2205.06175. Team, D., Zeng, B., Hua, D., Zhu, K., Dai, Y ., Li, B., Wang, Y ., Tong, C., Yang, Y ., Chang, M., Zhao, J., Liu, Z., Liang, H., Ma, X., An, R., Niu, J., Meng, Z., Bai, T., Qiang, M., Zhang, H., Xiao, Z., Guo, T., Yu, Q., Zhao, R., Li, Z., Huang, X., Pan, Y ., Tang, Y ., Shi, Y ., Ding, Y ., Chen, X., Gao, H., Shi, M...

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

URLhttps: //arxiv.org/abs/2604.04707. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozi `ere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation lan- guage models,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

LLaMA: Open and Efficient Foundation Language Models

URL https://arxiv.org/ abs/2302.13971. Wang, G., Xie, Y ., Jiang, Y ., Mandlekar, A., Xiao, C., Zhu, Y ., Fan, L., and Anandkumar, A. V oyager: An open- ended embodied agent with large language models,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Voyager: An Open-Ended Embodied Agent with Large Language Models

URLhttps://arxiv.org/abs/2305.16291. Xiang, J., Tao, T., Gu, Y ., Shu, T., Wang, Z., Yang, Z., and Hu, Z. Language models meet world models: Embod- ied experiences enhance language models,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Yan, X., Song, Y ., Feng, X., Yang, M., Zhang, H., Ammar, H

URL https://arxiv.org/abs/2305.10626. Yan, X., Song, Y ., Feng, X., Yang, M., Zhang, H., Ammar, H. B., and Wang, J. Efficient reinforcement learning 8 LaGO: Latent Action Guidance for Online Reinforcement Learning with large language model priors,

work page arXiv

[23] [23]

URL https: //arxiv.org/abs/2410.07927. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, ...

work page arXiv

[24] [24]

Qwen3 Technical Report

URL https: //arxiv.org/abs/2505.09388. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y . React: Synergizing reasoning and act- ing in language models,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

ReAct: Synergizing Reasoning and Acting in Language Models

URL https://arxiv. org/abs/2210.03629. Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-world: A benchmark and evalua- tion for multi-task and meta reinforcement learning. In Conference on robot learning, pp. 1094–1100. PMLR,

work page internal anchor Pith review Pith/arXiv arXiv