pith. machine review for the scientific record.

arxiv: 2311.12871 · v3 · pith:LJHUW6AYnew · submitted 2023-11-18 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

An Embodied Generalist Agent in 3D World

Pith reviewed 2026-05-17 14:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords: embodied AI, 3D vision-language, generalist agent, vision-language-action, 3D grounding, navigation, manipulation, robotics

The pith

LEO is trained as a 3D embodied generalist agent in two stages, first on large 3D vision-language datasets and then on vision-language-action datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LEO as an embodied multi-modal generalist agent that perceives, grounds, reasons, plans and acts in 3D environments. Prior models mostly process 2D images and rarely tackle inherently 3D tasks such as grounding, embodied reasoning and acting. LEO uses a single architecture and objective trained in two stages: first aligning 3D vision with language, then tuning on action-oriented instructions. Large-scale object-level and scene-level datasets are assembled with an LLM-assisted pipeline to support this training. Experiments demonstrate proficiency on 3D captioning, question answering, reasoning, navigation and manipulation.

Core claim

LEO is trained with a unified task interface, model architecture and objective in two stages of 3D vision-language alignment followed by 3D vision-language-action instruction tuning on diverse collected datasets, allowing it to excel at perceiving, grounding, reasoning, planning and acting in the 3D world.

What carries the argument

The two-stage training process of 3D vision-language alignment followed by vision-language-action instruction tuning under a single unified interface and objective.
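
One way to picture a single objective shared across both stages is the sketch below. It assumes the unified objective is a next-token negative log-likelihood computed only on response tokens, which this summary does not state. The stage labels, dataset names, `Example`, and `response_nll` are hypothetical and are not taken from the paper's code.

```python
import math
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass
class Example:
    prompt_len: int        # number of prefix tokens (scene + instruction); no loss here
    token_ids: List[int]   # full sequence: prefix followed by the response


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def response_nll(logits: Sequence[Sequence[float]], ex: Example) -> float:
    """Mean negative log-likelihood over response tokens only.

    Simplification: logits[i] is taken to be the model's prediction for token i.
    Assumes the example has at least one response token.
    """
    positions = range(ex.prompt_len, len(ex.token_ids))
    losses = [-log_softmax(logits[i])[ex.token_ids[i]] for i in positions]
    return sum(losses) / len(losses)


# Hypothetical stage and dataset names, for illustration only.
STAGES = [
    ("stage 1: 3D VL alignment", ["object_captions", "scene_captions"]),
    ("stage 2: 3D VLA instruction tuning", ["scene_qa", "navigation", "manipulation"]),
]


def training_plan() -> Iterator[Tuple[str, str, str]]:
    """Yield (stage, dataset, objective); the objective is the same everywhere."""
    for stage, datasets in STAGES:
        for dataset in datasets:
            yield stage, dataset, response_nll.__name__


if __name__ == "__main__":
    for step in training_plan():
        print(step)
```

The point of the sketch is structural: only the data changes between stages, while the loss function and the sequence interface stay fixed.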

If this is right

  • The single agent can handle both language-only 3D tasks such as captioning and question answering and action-oriented tasks such as navigation and manipulation.
  • An LLM-assisted data pipeline produces the high-quality object-level and scene-level examples needed for 3D alignment and tuning.
  • Ablative studies isolate the contribution of each training stage to overall performance.
  • Scaling analyses indicate that larger models and datasets continue to improve results on the tested 3D tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage recipe could be applied to other sensor-rich environments once comparable datasets become available.
  • If the generalist pattern holds, separate specialist models for individual robotics skills may become unnecessary.
  • Closing the sim-to-real gap remains an open requirement before physical deployment of the trained agent.

Load-bearing premise

The collected large-scale 3D vision-language and vision-language-action datasets plus the two-stage training procedure suffice to produce generalist performance that transfers to tasks beyond the reported benchmarks.

What would settle it

Evaluating LEO on a new 3D navigation or manipulation task drawn from an environment absent from the training datasets and finding performance no better than task-specific baselines would falsify the generalist claim.
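
The test described above can be sketched as a held-out comparison. This is a minimal illustration, assuming each evaluation episode records an environment identifier and a binary success flag; `Episode`, `held_out`, and `generalist_beats_baseline` are hypothetical names, not part of any released evaluation code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Episode:
    env_id: str      # hypothetical environment identifier
    success: bool    # whether the episode reached its goal


def held_out(episodes: Iterable[Episode], train_envs: Set[str]) -> List[Episode]:
    """Keep only episodes whose environment never appears in training."""
    return [ep for ep in episodes if ep.env_id not in train_envs]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of successful episodes; refuses to score an empty set."""
    if not episodes:
        raise ValueError("no held-out episodes to evaluate")
    return sum(ep.success for ep in episodes) / len(episodes)


def generalist_beats_baseline(
    agent: Iterable[Episode], baseline: Iterable[Episode], train_envs: Set[str]
) -> bool:
    """True if the agent's held-out success rate exceeds the baseline's."""
    agent_rate = success_rate(held_out(agent, train_envs))
    baseline_rate = success_rate(held_out(baseline, train_envs))
    return agent_rate > baseline_rate
```

A raw comparison like this ignores sampling noise, so a real test would also need an uncertainty estimate on both rates.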

Original abstract

Leveraging massive knowledge from large language models (LLMs), recent machine learning models show notable successes in general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in 3D world, e.g., 3D grounding, embodied reasoning and acting. We argue these limitations significantly hinder current models from performing real-world tasks and approaching general intelligence. To this end, we introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. LEO is trained with a unified task interface, model architecture, and objective in two stages: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning. We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world. Moreover, we meticulously design an LLM-assisted pipeline to produce high-quality 3D VL data. Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation. Our ablative studies and scaling analyses further provide valuable insights for developing future embodied generalist agents. Code and data are available on project page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LEO, an embodied multi-modal generalist agent for 3D worlds. It employs a unified task interface, model architecture, and objective trained in two stages: (i) 3D vision-language alignment and (ii) 3D vision-language-action instruction tuning. Large-scale object- and scene-level datasets are collected via an LLM-assisted pipeline. The authors claim that extensive experiments demonstrate LEO's remarkable proficiency across 3D captioning, question answering, embodied reasoning, navigation, and manipulation, with additional insights from ablative studies and scaling analyses. Code and data are released.

Significance. If the central claims hold, the work would be significant for embodied AI and 3D vision-language modeling by addressing the limitations of 2D-centric approaches and providing a unified framework for perception, grounding, reasoning, planning, and acting. The two-stage training procedure and LLM-assisted data pipeline offer a concrete methodology that could inform future generalist agents. Releasing code and data supports reproducibility and is a clear strength.

major comments (2)
  1. [Abstract and Experiments] The generalist claim (abstract and §1) rests on transfer from the collected 3D VL/VLA datasets to unseen environments. The manuscript does not report explicit controls for distributional overlap between training scenes and evaluation scenes or zero-shot results on novel 3D layouts and action spaces; without these, it is difficult to rule out in-distribution memorization as the source of the reported proficiency.
  2. [Experiments] §4 (or equivalent results section): the claim of 'remarkable proficiency across a wide spectrum of tasks' requires quantitative support with baselines, ablations, and statistical significance. The abstract provides none, and the absence of these details in the evaluation of navigation and manipulation tasks weakens the cross-task generalization argument.
minor comments (2)
  1. [Abstract] The abstract would benefit from a single sentence summarizing the key quantitative improvements or the most challenging task where LEO excels.
  2. [Method] Notation for the unified objective and the two-stage losses should be introduced consistently in the method section to aid readability.
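
The overlap control requested in major comment 1 could start from something as simple as the sketch below. It assumes each scene is summarized by its set of object categories and uses Jaccard similarity as the overlap measure. Both choices, along with the function names, are illustrative assumptions rather than anything the manuscript or the referee specifies.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_train_scene(
    eval_objects: Set[str], train_scenes: Dict[str, Set[str]]
) -> Tuple[str, float]:
    """Return the training scene most similar to one evaluation scene."""
    best = max(train_scenes, key=lambda name: jaccard(eval_objects, train_scenes[name]))
    return best, jaccard(eval_objects, train_scenes[best])
```

Reporting the distribution of these nearest-neighbor scores over all evaluation scenes would show how many of them have a near-duplicate in training. Category sets ignore geometry, so this is only a coarse first check.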

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The generalist claim (abstract and §1) rests on transfer from the collected 3D VL/VLA datasets to unseen environments. The manuscript does not report explicit controls for distributional overlap between training scenes and evaluation scenes or zero-shot results on novel 3D layouts and action spaces; without these, it is difficult to rule out in-distribution memorization as the source of the reported proficiency.

    Authors: We agree that explicit controls for distributional overlap would provide stronger evidence for generalization. Our evaluation scenes are drawn from held-out portions of the collected datasets, which were curated to include diverse and previously unseen 3D layouts. To directly address the concern, we will add a quantitative analysis of scene overlap (using metrics such as object layout similarity) and include performance results on a subset of fully novel environments not present in any training split. Regarding zero-shot transfer to novel action spaces, the unified task interface supports some degree of generalization through instruction tuning, but we will report additional experiments on unseen action combinations to the extent supported by our data. revision: partial

  2. Referee: [Experiments] §4 (or equivalent results section): the claim of 'remarkable proficiency across a wide spectrum of tasks' requires quantitative support with baselines, ablations, and statistical significance. The abstract provides none, and the absence of these details in the evaluation of navigation and manipulation tasks weakens the cross-task generalization argument.

    Authors: The manuscript already presents quantitative results, baselines, and ablations for all tasks including navigation and manipulation, along with scaling analyses. We acknowledge, however, that the abstract does not sufficiently summarize these metrics and that statistical significance testing is not explicitly reported. We will revise the abstract to include key quantitative findings and add statistical significance measures (e.g., confidence intervals or p-values) for the navigation and manipulation results in the experiments section. revision: yes
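
The confidence intervals mentioned in the second response could be computed along the lines of the sketch below. It assumes navigation and manipulation results are available as per-episode 0/1 outcomes and uses a percentile bootstrap; the function name and defaults are hypothetical, and the rebuttal does not say which method the authors would use.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    outcomes: List[int], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 episode outcomes."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample episodes with replacement and record each resample's success rate.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

For example, `bootstrap_ci([1] * 30 + [0] * 20)` gives an interval around the observed 60% success rate. Intervals from two systems that overlap heavily would suggest the reported gap is within noise.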

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper presents an empirical training pipeline for LEO consisting of dataset collection via an LLM-assisted process followed by two-stage optimization (3D VL alignment then VLA instruction tuning) on object- and scene-level tasks. All load-bearing elements—model architecture, unified task interface, and reported proficiency—are defined externally to the claimed outcomes and evaluated on separate benchmarks rather than reducing to self-referential fits, renamings, or self-citation chains. No equations or first-principles derivations are invoked that collapse to the inputs by construction; the central generalist claim rests on experimental transfer from collected data, which is a standard non-circular empirical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate hyperparameters or new axioms; the approach rests on the standard assumption that large-scale supervised fine-tuning of multimodal transformers yields generalist capabilities, plus the unstated premise that the generated 3D datasets are representative of real-world 3D tasks.

pith-pipeline@v0.9.0 · 5616 in / 1160 out tokens · 27217 ms · 2026-05-17T14:15:53.121198+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/DimensionForcing dimension_forced: unclear

    Relation between the paper passage and the cited Recognition theorem.

    Leveraging massive knowledge from large language models (LLMs), recent machine learning models show notable successes in general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in 3D world, e.g., 3D grounding, embodied reasoning and acting.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

  2. PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.

  3. POMA-3D: The Point Map Way to 3D Scene Understanding

    cs.CV 2025-11 unverdicted novelty 7.0

    POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.

  4. 3D-VLA: A 3D Vision-Language-Action Generative World Model

    cs.CV 2024-03 unverdicted novelty 7.0

    3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.

  5. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  6. Lifting Unlabeled Internet-level Data for 3D Scene Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.

  7. Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

    cs.CV 2026-03 unverdicted novelty 6.0

    Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...

  8. SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

    cs.CV 2026-03 unverdicted novelty 6.0

    SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

  9. Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

    cs.CV 2026-03 unverdicted novelty 6.0

    Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...

  10. TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

    cs.CV 2026-03 unverdicted novelty 6.0

    TrianguLang achieves state-of-the-art feed-forward text-guided 3D localization and segmentation by using predicted geometry to gate cross-view semantic correspondences without ground-truth poses.

  11. CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

    cs.RO 2026-01 unverdicted novelty 6.0

    CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.

  12. Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

    cs.RO 2025-11 conditional novelty 6.0

    EyeVLA transfers open-world VLM understanding to a PTZ camera control policy via hierarchical action tokens and GRPO reinforcement learning, reaching 96% task completion on 50 real scenes with only 500 training samples.

  13. C-NAV: Towards Self-Evolving Continual Object Navigation in Open World

    cs.RO 2025-10 unverdicted novelty 6.0

    C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark whil...

  14. Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    cs.CV 2025-05 unverdicted novelty 6.0

    Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.

  15. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  16. Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    ValuePlanner is a hierarchical architecture that uses LLMs to generate value-based subgoals and PDDL planners to produce executable actions, enabling self-directed behavior in embodied agents.

  17. A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    cs.RO 2025-07 unverdicted novelty 5.0

    The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.

  18. WorldVLA: Towards Autoregressive Action World Model

    cs.RO 2025-06 unverdicted novelty 5.0

    WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.

  19. AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

    cs.CV 2026-02 unverdicted novelty 3.0

    AugVLA-3D augments existing VLA models with depth-derived 3D features and action priors to improve generalization and action accuracy in 3D robotic tasks.

  20. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
