arxiv: 2311.12871 · v3 · pith:LJHUW6AYnew · submitted 2023-11-18 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

An Embodied Generalist Agent in 3D World

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

Pith reviewed 2026-05-17 14:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG

keywords: embodied AI, 3D vision-language, generalist agent, vision-language-action, 3D grounding, navigation, manipulation, robotics

The pith

LEO trains as a 3D embodied generalist agent through two-stage alignment on large vision-language and vision-language-action datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LEO as an embodied multi-modal generalist agent that perceives, grounds, reasons, plans and acts in 3D environments. Prior models mostly process 2D images and rarely tackle inherently 3D tasks such as grounding, embodied reasoning and acting. LEO uses a single architecture and objective trained in two stages: first aligning 3D vision with language, then tuning on action-oriented instructions. Large-scale object-level and scene-level datasets are assembled with an LLM-assisted pipeline to support this training. Experiments demonstrate proficiency on 3D captioning, question answering, reasoning, navigation and manipulation.

Core claim

LEO is trained with a unified task interface, model architecture and objective in two stages of 3D vision-language alignment followed by 3D vision-language-action instruction tuning on diverse collected datasets, allowing it to excel at perceiving, grounding, reasoning, planning and acting in the 3D world.

What carries the argument

The two-stage training process of 3D vision-language alignment followed by vision-language-action instruction tuning under a single unified interface and objective.
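A minimal sketch of what a unified interface with a two-stage schedule could look like, assuming every task is serialized into one prompt/response text sequence and the same training step is reused in both stages. The `Sample` class, the `<move_forward>` action token, the example prompts, and the response-only loss mask are all illustrative assumptions, not details confirmed by this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    """One training example in a shared prompt/response format (hypothetical)."""
    task: str
    prompt: str
    response: str


def to_sequence(sample: Sample) -> str:
    # Every task, language-only or action-oriented, becomes one text sequence.
    return f"USER: {sample.prompt} ASSISTANT: {sample.response}"


def loss_mask(sample: Sample) -> List[int]:
    # Assumption: the loss covers response tokens only (mask = 1).
    # Whitespace tokenization is used purely for illustration.
    prefix = f"USER: {sample.prompt} ASSISTANT:".split()
    target = sample.response.split()
    return [0] * len(prefix) + [1] * len(target)


# Stage 1: 3D vision-language alignment; stage 2: vision-language-action tuning.
# The samples are made up; "<move_forward>" is a hypothetical action token.
STAGES: Dict[str, List[Sample]] = {
    "stage1_vl_alignment": [
        Sample("captioning", "Describe the object.", "A wooden chair."),
    ],
    "stage2_vla_tuning": [
        Sample("qa", "How many chairs are in the room?", "Three."),
        Sample("navigation", "Find the counter.", "<move_forward>"),
    ],
}


def run_schedule(train_step: Callable[[str, List[int]], None]) -> List[str]:
    """Visit the stages in order, calling the same train_step for every sample."""
    visited = []
    for stage, samples in STAGES.items():
        for s in samples:
            train_step(to_sequence(s), loss_mask(s))
            visited.append(f"{stage}:{s.task}")
    return visited
```

The point of the sketch is that nothing task-specific appears inside `run_schedule`: a navigation step and a caption pass through the same serialization and the same training call, and only the data fed to each stage differs.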

If this is right

The single agent can handle both language-only 3D tasks such as captioning and question answering and action-oriented tasks such as navigation and manipulation.
An LLM-assisted data pipeline produces the high-quality object-level and scene-level examples needed for 3D alignment and tuning.
Ablative studies isolate the contribution of each training stage to overall performance.
Scaling analyses indicate that larger models and datasets continue to improve results on the tested 3D tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage recipe could be applied to other sensor-rich environments once comparable datasets become available.
If the generalist pattern holds, separate specialist models for individual robotics skills may become unnecessary.
Closing the sim-to-real gap remains an open requirement before physical deployment of the trained agent.

Load-bearing premise

The collected large-scale 3D vision-language and vision-language-action datasets plus the two-stage training procedure suffice to produce generalist performance that transfers to tasks beyond the reported benchmarks.

What would settle it

Evaluating LEO on a new 3D navigation or manipulation task drawn from an environment absent from the training datasets and finding performance no better than task-specific baselines would falsify the generalist claim.

Original abstract

Leveraging massive knowledge from large language models (LLMs), recent machine learning models show notable successes in general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in 3D world, e.g., 3D grounding, embodied reasoning and acting. We argue these limitations significantly hinder current models from performing real-world tasks and approaching general intelligence. To this end, we introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. LEO is trained with a unified task interface, model architecture, and objective in two stages: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning. We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world. Moreover, we meticulously design an LLM-assisted pipeline to produce high-quality 3D VL data. Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation. Our ablative studies and scaling analyses further provide valuable insights for developing future embodied generalist agents. Code and data are available on project page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LEO offers a practical two-stage recipe for 3D embodied generalists along with a data generation pipeline, but the lack of quantitative results in the abstract makes the performance claims hard to assess right now.

The paper introduces LEO, which uses a two-stage process to train an embodied agent for 3D environments. First it aligns 3D vision with language, then it tunes for vision-language-action tasks using instructions. The authors also built an LLM-assisted pipeline to generate large-scale 3D VL datasets covering object- and scene-level tasks.

This approach stands out because it directly tackles the shift from 2D images to native 3D inputs and tasks like grounding, reasoning, navigation, and manipulation. The unified task interface and model architecture make sense as a way to leverage existing LLM knowledge without starting from scratch. The data collection effort and the pipeline for high-quality 3D data are practical contributions that could help others working in this area. If the scaling analyses hold up, they might offer some guidance on how to build future agents.

Where it gets softer is in the evaluation. The abstract highlights remarkable proficiency across many tasks but skips over any actual numbers, baselines, or detailed ablations. That leaves the strength of the claims open to question until the full results are examined. The potential issue with data overlap between training and test scenes is real: without explicit checks for zero-shot performance on novel layouts, the transfer to truly generalist behavior remains unproven in the summary.

Readers who focus on embodied AI, 3D vision, or robotics applications would find this relevant. It could be useful for groups trying to extend VL models into physical interaction spaces. The paper shows clear thinking on the problem setup and engages with the literature on LLMs and robotics. It deserves a serious referee to dig into the experiments and see if the data supports the generalist positioning.

I would recommend sending this to peer review. The core idea has enough substance to warrant detailed feedback, even if revisions are needed on the empirical side.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LEO, an embodied multi-modal generalist agent for 3D worlds. It employs a unified task interface, model architecture, and objective trained in two stages: (i) 3D vision-language alignment and (ii) 3D vision-language-action instruction tuning. Large-scale object- and scene-level datasets are collected via an LLM-assisted pipeline. The authors claim that extensive experiments demonstrate LEO's remarkable proficiency across 3D captioning, question answering, embodied reasoning, navigation, and manipulation, with additional insights from ablative studies and scaling analyses. Code and data are released.

Significance. If the central claims hold, the work would be significant for embodied AI and 3D vision-language modeling by addressing the limitations of 2D-centric approaches and providing a unified framework for perception, grounding, reasoning, planning, and acting. The two-stage training procedure and LLM-assisted data pipeline offer a concrete methodology that could inform future generalist agents. Releasing code and data supports reproducibility and is a clear strength.

major comments (2)

[Abstract and Experiments] The generalist claim (abstract and §1) rests on transfer from the collected 3D VL/VLA datasets to unseen environments. The manuscript does not report explicit controls for distributional overlap between training scenes and evaluation scenes or zero-shot results on novel 3D layouts and action spaces; without these, it is difficult to rule out in-distribution memorization as the source of the reported proficiency.
[Experiments] §4 (or equivalent results section): the claim of 'remarkable proficiency across a wide spectrum of tasks' requires quantitative support with baselines, ablations, and statistical significance. The abstract provides none, and the absence of these details in the evaluation of navigation and manipulation tasks weakens the cross-task generalization argument.
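The overlap control requested in the first major comment can be illustrated with a small check. This is a sketch under stated assumptions: scenes are identified by string IDs, each scene is reduced to a set of object categories, and Jaccard similarity over those sets stands in as a crude proxy for layout similarity. The scene names and objects are invented, and none of this is taken from the paper's evaluation code.

```python
from typing import Dict, Set


def scene_id_overlap(train_ids: Set[str], eval_ids: Set[str]) -> float:
    """Fraction of evaluation scenes whose ID also appears in training."""
    if not eval_ids:
        return 0.0
    return len(train_ids & eval_ids) / len(eval_ids)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_layout_similarity(train: Dict[str, Set[str]], eval_objects: Set[str]) -> float:
    """Highest Jaccard similarity between one eval scene and any training scene."""
    return max((jaccard(objs, eval_objects) for objs in train.values()), default=0.0)


# Invented toy data: scene ID -> set of object categories present in the scene.
train_scenes = {
    "scene_a": {"bed", "lamp", "desk"},
    "scene_b": {"sofa", "table", "tv"},
}
eval_scenes = {
    "scene_b": {"sofa", "table", "tv"},   # leaked: same ID as a training scene
    "scene_c": {"bed", "lamp", "sink"},   # new ID, but resembles scene_a
}

leak = scene_id_overlap(set(train_scenes), set(eval_scenes))
sim_c = max_layout_similarity(train_scenes, eval_scenes["scene_c"])
```

Reporting both numbers separates exact leakage (shared scene IDs) from near-duplicates (new IDs with very similar contents), which is the distinction the comment draws between memorization and transfer.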

minor comments (2)

[Abstract] The abstract would benefit from a single sentence summarizing the key quantitative improvements or the most challenging task where LEO excels.
[Method] Notation for the unified objective and the two-stage losses should be introduced consistently in the method section to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results.

Referee: [Abstract and Experiments] The generalist claim (abstract and §1) rests on transfer from the collected 3D VL/VLA datasets to unseen environments. The manuscript does not report explicit controls for distributional overlap between training scenes and evaluation scenes or zero-shot results on novel 3D layouts and action spaces; without these, it is difficult to rule out in-distribution memorization as the source of the reported proficiency.

Authors: We agree that explicit controls for distributional overlap would provide stronger evidence for generalization. Our evaluation scenes are drawn from held-out portions of the collected datasets, which were curated to include diverse and previously unseen 3D layouts. To directly address the concern, we will add a quantitative analysis of scene overlap (using metrics such as object layout similarity) and include performance results on a subset of fully novel environments not present in any training split. Regarding zero-shot transfer to novel action spaces, the unified task interface supports some degree of generalization through instruction tuning, but we will report additional experiments on unseen action combinations to the extent supported by our data.
revision: partial

Referee: [Experiments] §4 (or equivalent results section): the claim of 'remarkable proficiency across a wide spectrum of tasks' requires quantitative support with baselines, ablations, and statistical significance. The abstract provides none, and the absence of these details in the evaluation of navigation and manipulation tasks weakens the cross-task generalization argument.

Authors: The manuscript already presents quantitative results, baselines, and ablations for all tasks including navigation and manipulation, along with scaling analyses. We acknowledge, however, that the abstract does not sufficiently summarize these metrics and that statistical significance testing is not explicitly reported. We will revise the abstract to include key quantitative findings and add statistical significance measures (e.g., confidence intervals or p-values) for the navigation and manipulation results in the experiments section.
revision: yes
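For success-rate metrics of the kind used in navigation and manipulation, one common way to attach a confidence interval is the Wilson score interval. The sketch below assumes each episode is an independent success/failure trial; the function name and the choice of the Wilson method are illustrative and are not stated anywhere in the rebuttal.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

When two methods are evaluated on the same episodes, heavily overlapping intervals suggest that the measured gap may not be reliable at that sample size. A paired test would be the sharper tool in that setting.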

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper presents an empirical training pipeline for LEO consisting of dataset collection via an LLM-assisted process followed by two-stage optimization (3D VL alignment then VLA instruction tuning) on object- and scene-level tasks. All load-bearing elements—model architecture, unified task interface, and reported proficiency—are defined externally to the claimed outcomes and evaluated on separate benchmarks rather than reducing to self-referential fits, renamings, or self-citation chains. No equations or first-principles derivations are invoked that collapse to the inputs by construction; the central generalist claim rests on experimental transfer from collected data, which is a standard non-circular empirical structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate hyperparameters or new axioms; the approach rests on the standard assumption that large-scale supervised fine-tuning of multimodal transformers yields generalist capabilities, plus the unstated premise that the generated 3D datasets are representative of real-world 3D tasks.

pith-pipeline@v0.9.0 · 5616 in / 1160 out tokens · 27217 ms · 2026-05-17T14:15:53.121198+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/DimensionForcing · dimension_forced · relation between the paper passage and the cited Recognition theorem: unclear

Leveraging massive knowledge from large language models (LLMs), recent machine learning models show notable successes in general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in 3D world, e.g., 3D grounding, embodied reasoning and acting.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
POMA-3D: The Point Map Way to 3D Scene Understanding
cs.CV 2025-11 unverdicted novelty 7.0

POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
3D-VLA: A 3D Vision-Language-Action Generative World Model
cs.CV 2024-03 unverdicted novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
cs.CV 2026-03 unverdicted novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
cs.CV 2026-03 unverdicted novelty 6.0

SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
cs.CV 2026-03 unverdicted novelty 6.0

Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...
TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization
cs.CV 2026-03 unverdicted novelty 6.0

TrianguLang achieves state-of-the-art feed-forward text-guided 3D localization and segmentation by using predicted geometry to gate cross-view semantic correspondences without ground-truth poses.
CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
cs.RO 2026-01 unverdicted novelty 6.0

CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.
Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception
cs.RO 2025-11 conditional novelty 6.0

EyeVLA transfers open-world VLM understanding to a PTZ camera control policy via hierarchical action tokens and GRPO reinforcement learning, reaching 96% task completion on 50 real scenes with only 500 training samples.
C-NAV: Towards Self-Evolving Continual Object Navigation in Open World
cs.RO 2025-10 unverdicted novelty 6.0

C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark whil...
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
cs.CV 2025-05 unverdicted novelty 6.0

Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.
ToolRL: Reward is All Tool Learning Needs
cs.LG 2025-04 conditional novelty 6.0

A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents
cs.AI 2026-04 unverdicted novelty 5.0

ValuePlanner is a hierarchical architecture that uses LLMs to generate value-based subgoals and PDDL planners to produce executable actions, enabling self-directed behavior in embodied agents.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
cs.RO 2025-07 unverdicted novelty 5.0

The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
WorldVLA: Towards Autoregressive Action World Model
cs.RO 2025-06 unverdicted novelty 5.0

WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models
cs.CV 2026-02 unverdicted novelty 3.0

AugVLA-3D augments existing VLA models with depth-derived 3D features and action priors to improve generalization and action accuracy in 3D robotic tasks.
A Survey on Multimodal Large Language Models
cs.CV 2023-06 accept novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
